2010-08-31

the MathLink ABI

adda/ABI"mathlink

8.12: news.adda/mathematica`mathlink:
. MathLink plays the same role as the
binary version of adda:
it's the communication protocol between
the kernel that provides the app,
and the notebook that is the user's agent;
. any program that adopts this protocol,
can communicate with any other app,
and of course, the user's agent .
. adopters can communicate as
client, server, or peer-to-peer .
. it makes accessible any hardware resource
that has a C interface .
. because it's the connection between
the user's shell and the engine,
you're assured that its API is complete:
anything you can see as a user,
can also be seen by an
app that has embedded Mathematica .
. conversely, a custom user's shell
can have integrated access to both
Mathematica and addx .
. Wolfram already has a customized
user's shell, .NET/Link,
that integrates Microsoft's .NET .
. this is a complete integration,
extending the Mathematica language
with all existing and future .NET types
(which include all library calls)
allowing the same immediate run mode,
for RAD programming of .NET
and Mathematica extensions .
-- and it's openware!

. its protocol works over the internet,
and includes data-integrity checks .
. if the binary protocol is not usable,
it can also fall back to html or xml .
. as the binary version of a
full-featured language,
it can express out-of-band data,
such as exceptions .
. having this network access
means it can support a
Parallel Computing Toolkit
and act like the X protocol,
running on a server, while
viewed from a laptop or tablet;

. in addition to being an ipc protocol
for extending and embedding Mathematica,
MathLink also refers to a set of
pre-built integrations with other
popular app's like Excel,
where it can either extend or replace
that spreadsheet's macro language .

. the C/C++ MathLink Software Developer Kit (SDK)
ships with every version of Mathematica .
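. for example, a minimal MathLink client in C
(a sketch from memory of the classic SDK api, untested;
the kernel launch string is platform-dependent)
could ask the kernel to evaluate 2+3 like this:

#include <stdio.h>
#include "mathlink.h"

int main(void)
{
    MLENV env;
    MLINK link;
    int err, pkt, result;

    env = MLInitialize(NULL);
    if (env == NULL) return 1;

    /* launch a local kernel; the command string is platform-dependent */
    link = MLOpenString(env,
        "-linkmode launch -linkname 'math -mathlink'", &err);
    if (link == NULL) return 1;

    /* send EvaluatePacket[Plus[2, 3]] */
    MLPutFunction(link, "EvaluatePacket", 1);
    MLPutFunction(link, "Plus", 2);
    MLPutInteger(link, 2);
    MLPutInteger(link, 3);
    MLEndPacket(link);

    /* discard packets until the answer (RETURNPKT) arrives */
    while ((pkt = MLNextPacket(link)) && pkt != RETURNPKT)
        MLNewPacket(link);
    MLGetInteger(link, &result);
    printf("2 + 3 = %d\n", result);

    MLClose(link);
    MLDeinitialize(env);
    return 0;
}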

massively parallel computing for the masses!
. the Mathematica cloud computing service
is a collaborative effort of
Wolfram Research's parallel programming API,
Nimbis Services's job routing,
and R Systems NA's supercomputer time .
. it assists parallel programming
by providing an integrated
technical computing platform,
enabling computation, visualization,
and data access.
(2008.12: Mathematica 7 features concurrency primitives:
Wolfram's new Parallelize, ParallelTry
and ParallelEvaluate functions
provide automatic and concurrent
expression evaluation.
Parallel performance can be tweaked and queued using
the ParallelMap, ParallelCombine, ParallelSubmit,
WaitAll and WaitNext functions.
These and many other parallel computing functions
ensure that developers have tremendously granular control
over what will be sent through the parallel pipeline
and exactly how that data will be processed.
. this concurrency can also be fully utilized
with Wolfram's gridMathematica
and upcoming CloudMathematica add-ons .)
. Nimbis Services, Inc., is a clearing-house,
providing business users an easy-to-use
menu of hpc services,
including TOP500 supercomputers
and the Amazon Elastic Compute Cloud,
all in one "instant" storefront.
. Nimbis Services will enable access to
R Systems NA, Inc.,
whose R Smarr cluster was the
44th fastest supercomputing system
on the TOP500 list in 2008 .
. R Systems has exceptionally large memory
in multi-core HPC resources
with a double-data-rate and quad-data-rate
InfiniBand network .
R Systems is not yet accessible:
. only Amazon EC2 configurations
are currently operational.

8.14: adda/ABI/versioning:
. an abi (app'binary interface)
may need to be revised;
so, as an extension to the addm abi,
(type, subtype, function, args),
there needs to be a version-identity number
that is established by handshake;
the handshake is detected by the msg being
sent from the zero identity .
. identities are retained for the duration of
a connection session;
. the handshake is coded in the original ABI version,
and the parties can then haggle about
what the remaining session will be coded in .
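. a sketch of what such a versioned msg header
might look like in C
(the field names and sizes are my own guesses,
not part of any existing addm spec):

#include <stdint.h>

/* hypothetical wire header for the addm abi:
   (type, subtype, function, args) plus a session identity .
   identity 0 is reserved for the version handshake,
   so a receiver can always parse a handshake msg
   with the original ABI version . */
typedef struct {
    uint32_t identity;   /* 0 = handshake; else the id agreed during handshake */
    uint16_t type;
    uint16_t subtype;
    uint32_t function;
    uint32_t arg_count;  /* number of argument records that follow */
} addm_msg_header;

/* handshake payload: each side offers the ABI versions it speaks;
   the highest version both sides offer
   governs the rest of the session . */
typedef struct {
    uint32_t min_version;
    uint32_t max_version;
} addm_handshake;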

2010-08-30

lang's for supercomputer concurrency

adda/concurrency/lang's for supercomputer concurrency
7.27:
. the DARPA`HPCS program
(High Productivity Computing Systems)
is meant to tame the costs of HPC
(High Performance Computing)
-- HPC is the use of supercomputers for
simulations to solve problems in
chemistry, physics, ecology, sociology,
and esp'ly warfare .
. programmer productivity
means making it easier to develop
code that can make full use of
a supercomputer's concurrency .
. the main source of the cost
is a lack of smart software tools
that can turn science experts
into computer coders .
. toward this end, they are funding
the design of new concurrency lang's .
7.28:
. the DoD and DARPA already made
quite an investment in the
Ada concurrency lang',
a language designed for expressing
the sort of concurrency needed by
embedded system engineers; 7.31:
but, the software developer community
spurned Ada as soon as there was
something better ...
. pascal was popular before '85,
and when Ada arrived in '83,
it was popular too;
then came the Visual version
of Basic (familiar and slick).
. the most-demanded langs by year were:
'85: C, lisp, Ada, basic;
'95: v.basic, c++, C, lisp, Ada;
'05: Java, C, c++, perl, php .

. the only currently popular lang's
meant for exploiting multi-core cpu's,
if not other forms of concurrency, are:
(3.0% rating) obj'c 2.0 (with blocks)
(0.6% rating) Go
(0.4% rating) Ada
7.29:
. whether or not Ada's concurrency model
is well-suited for supercomputers
as well as embedded systems,
it is not increasing coder'productivity .
. while Ada boosted productivity beyond
that offered by C,
it was nevertheless shown to do
less for productivity than Haskell .

HPC Productivity 2004/Kepner{ 2003(pdf), 2004(pdf) }:
. lang's were compared for
expressiveness vs performance:
. the goal of a high-performance lang'
is to have the expressiveness
of Matlab and Python,
with the performance of VHDL
(VHDL is a hardware-description lang'
with Ada-based syntax, used for ASICs and FPGAs).
. UPC (Unified Parallel C) and Co-array Fortran
are halfway to high productivity
merely by using PGAS
(Partitioned global address space)
rather than MPI
(Message Passing Interface).
. the older tool set: C, MPI, and openMP
is both slow and difficult to use .
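. to see why PGAS raises productivity:
in UPC a distributed array is declared shared
and indexed directly,
where MPI would need explicit sends and receives;
a rough sketch (needs a UPC compiler,
e.g. Berkeley UPC, with a fixed thread count):

#include <upc_relaxed.h>
#include <stdio.h>

#define N 1000

/* one logical array, partitioned across all threads by the runtime;
   compile with a fixed thread count, e.g. upcc -T 4 */
shared double a[N], b[N], c[N];

int main(void)
{
    int i;

    /* each thread updates only the elements it has affinity to;
       no explicit message passing is written by the programmer */
    upc_forall (i = 0; i < N; i++; &a[i])
        c[i] = a[i] + b[i];

    upc_barrier;
    if (MYTHREAD == 0)
        printf("done on %d threads\n", THREADS);
    return 0;
}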

. the 2 lang's that DARPA is banking on now
are Cray`Chapel, and IBM`X10 (a Java-based lang') .
. they were also funding Sun`Fortress
until 2006,
which features a syntax like advanced math
-- the sort of greek that physics experts
are expected to appreciate .

LLVM concurrency representations

Mac OS X 10.6 (and later):
. The OpenCL GPGPU implementation is built on
Clang and LLVM compiler technology.
This requires parsing an extended dialect of C at runtime
and JIT compiling it to run on the
CPU, GPU, or both at the same time.
OpenCL (Open Computing Language)
is a framework for writing programs
that execute across heterogeneous platforms
consisting of CPUs, GPUs, and other processors.
OpenCL includes a language (based on C99)
for writing kernels (functions that execute on OpenCL devices),
plus APIs that are used to define and then control the platforms.
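. for example, an OpenCL kernel is just a
C99 function with a few qualifiers;
the host side (not shown) compiles this source at runtime
via Clang/LLVM and queues it on a CPU or GPU device
(a sketch, not a complete program):

/* OpenCL C (a C99 dialect): add two vectors element-wise;
   each work-item handles one index of the global range */
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *result)
{
    size_t i = get_global_id(0);   /* this work-item's global index */
    result[i] = a[i] + b[i];
}
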
Open for Improvement:
. With features like OpenCL and Grand Central Dispatch,
Snow Leopard will be better equipped
to manage parallelism across processors
and push optimized code to the GPU's cores,
as described in WWDC 2008: New in Mac OS X Snow Leopard.
However, in order for the OS to
efficiently schedule parallel tasks,
the code needs to be explicitly optimized
for parallelism by the compiler.
. LLVM will be a key tool in prepping code for
high performance scheduling.
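. Grand Central Dispatch itself is a plain C API
plus the blocks extension; e.g., a data-parallel loop
can be handed to the system-managed thread pool like this
(a sketch; needs clang -fblocks and Apple's libdispatch):

#include <dispatch/dispatch.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    enum { N = 8 };
    double *out = malloc(N * sizeof *out);

    /* hand N independent iterations to the system-wide thread pool;
       dispatch_apply returns only after every iteration has run */
    dispatch_apply(N,
                   dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0),
                   ^(size_t i) { out[i] = (double)(i * i); });

    for (int i = 0; i < N; i++)
        printf("%.0f ", out[i]);
    printf("\n");
    free(out);
    return 0;
}
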
LLVM-CHiMPS (pdf)
LLVM for CHiMPS
(Compiling High-level languages to Massively Pipelined Systems)
National Center for Supercomputing Applications/
Reconfigurable Systems Summer Institute July 8, 2008/
Compilation Environment for FPGAs:
. uses the LLVM compiler infrastructure and the
CHiMPS computational model
(a computational model and architecture for
FPGA computing by Xilinx, Inc.):
- standard software development model (ANSI C);
trades performance for convenience;
- virtualized hardware architecture:
CHiMPS Target Language (CTL) instructions;
- cycle-accurate simulator;
- runs on BEE2 .
implementation of high-level representations:

limitations in optimization:
- CTL code is generated at compile time;
so LLVM provides no optimization for source code in which
no expressions can be optimized at compile time;
- LLVM does not get a chance to dynamically optimize
the source code at run time;
- LLVM is not almighty:
floating-point math is still difficult for LLVM .
Cray Opteron Compiler: Brief History of Time (pdf)
- Cray has a long tradition of high-performance compilers:
vectorization, parallelization, code transformation, ...
- began an internal investigation leveraging LLVM;
decided to move forward with a Cray X86 compiler;
first release December 2008 .

- fully optimized and integrated into the compiler;
no preprocessor involved .
- targets the network appropriately:
GASNet with Portals; DMAPP with Gemini & Aries .
Why a Cray X86 Compiler?
- standard-conforming languages and programming models:
Fortran 2003, UPC & Co-Array Fortran;
- ability and motivation to provide
high-quality support for
custom Cray network hardware;
- Cray technology focused on scientific applications:
takes advantage of Cray's extensive knowledge of
automatic vectorization and
automatic shared-memory parallelization;
- supplements, rather than replaces, the available compiler choices .

. cray has added parallelization and fortran support .
. ported to cray x2 .
. generating code for upc and caf (pgas langs) .
. supports openmp 2.0 std and nesting .

. Cray compiler supports a full and growing set of
directives and pragmas:
!dir$ concurrent
!dir$ ivdep
!dir$ interchange
!dir$ unroll
!dir$ loop_info [max_trips] [cache_na] ... Many more
!dir$ blockable
man directives
man loop_info
weaknesses:
- tuned performance: vectorization,
non-temporal caching, blocking, many end-cases;
- scheduling, spilling;
- no C++;
- very young X86 compiler .
future:
- optimized PGAS (requires the Gemini network for speed);
- improved vectorization;
- automatic parallelization:
. modernized version of the Cray X1 streaming capability;
. interacts with OMP directives
[OpenMP: Open Multi-Processing] .

DCT (discrete control theory) for avoiding deadlock

[8.30:
. exciting claims I haven't researched yet ...]
8.5: news.adda/concurrency/dct/Gadara`Discrete Control Theory:

Eliminating Concurrency Bugs with Control Engineering (pdf)
Concurrent programming is notoriously difficult
and is becoming increasingly prevalent
as multicore hardware compels performance-conscious developers
to parallelize software.
If we cannot enable the average programmer
to write correct and efficient parallel software
at reasonable cost,
the computer industry's rate of value creation
may decline substantially.
Our research addresses the
challenges of concurrent programming
by leveraging control engineering,
a body of technique that can
constrain the behavior of complex systems,
prevent runtime failures,
and relieve human designers and operators
of onerous responsibilities.
In past decades,
control theory made industrial processes
-- complex and potentially dangerous --
safe and manageable
and relieved human operators
of tedious and error-prone chores.
Today, Discrete Control Theory promises
similar benefits for concurrent software.
This talk describes an application of the
control engineering paradigm to concurrent software:
Gadara, which uses Discrete Control Theory
to eliminate deadlocks in
shared-memory multithreaded software.
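. to make the target concrete,
the class of bug Gadara is after
is the classic lock-ordering deadlock, e.g.
(a pthreads sketch of the bug class,
not of Gadara's fix; it may hang when run):

#include <pthread.h>

pthread_mutex_t a = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t b = PTHREAD_MUTEX_INITIALIZER;

/* thread 1 takes a then b; thread 2 takes b then a;
   if each grabs its first lock before the other's second,
   both wait forever: a deadlock that only shows up
   under unlucky timing . */
void *t1(void *arg)
{
    pthread_mutex_lock(&a);
    pthread_mutex_lock(&b);
    /* ... work ... */
    pthread_mutex_unlock(&b);
    pthread_mutex_unlock(&a);
    return arg;
}

void *t2(void *arg)
{
    pthread_mutex_lock(&b);
    pthread_mutex_lock(&a);   /* opposite order: potential deadlock */
    /* ... work ... */
    pthread_mutex_unlock(&a);
    pthread_mutex_unlock(&b);
    return arg;
}

int main(void)
{
    pthread_t x, y;
    pthread_create(&x, 0, t1, 0);
    pthread_create(&y, 0, t2, 0);
    pthread_join(x, 0);
    pthread_join(y, 0);
    return 0;
}

. as I understand it, Gadara models such lock interactions
offline and inserts control logic that delays one of the
acquisitions before a circular wait can form .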

promise pipelining

8.21: news.adda/co/promises/wiki brings understanding:
. yahoo!, this wiki page finally made sense of promises,
which are exemplified by e-lang's tutorial;
but that tutorial's diagram showed things incorrectly,
so unless you ignored the diagram,
you couldn't possibly make sense of the tutorial .
[8.30: ### the following is just my
version of that page, not a working tutorial ###]

t1 := x`a();
t2 := y`b();
t3 := t1`c(t2);
. "( x`a() ) means to send the message a()
asynchronously to x.
If, x, y, t1, and t2
are all located on the same remote machine,
a pipelined implementation can compute t3 with
one round-trip instead of three.
[. the original diagram showed all involved objects
existing on the client's (caller's) node,
not the remote server's;
so, you'd have to be left wondering
how is the claimed pipelining
possible for t1`c(t2)
if the temp's t1, and t2
are back at the caller's?! ]
Because all three messages are destined for
objects which are on the same remote machine,
only one request need be sent
and only one response
need be received containing the result.
. the actual message looks like:
do (remote`x`a) and save as t1;
do (remote`y`b) and save as t2;
do (t1`c(t2)) using previous saves;
and send it back .
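. in other words, the client batches the three calls
into one request whose later calls refer to
earlier results by slot number;
a rough sketch of such a wire format in C
(the names are mine, not from E or the wiki page):

#include <stdint.h>
#include <stdio.h>

/* one call within a pipelined request:
   a reference can name either a real remote object
   or the result slot of an earlier call in the same batch . */
enum ref_kind { REMOTE_OBJ, RESULT_SLOT };

typedef struct {
    enum ref_kind kind;
    uint32_t id;              /* object id, or index of an earlier call */
} obj_ref;

typedef struct {
    obj_ref  target;          /* receiver of the message */
    uint32_t selector;        /* which method: a(), b(), c(), ... */
    uint32_t argc;
    obj_ref  args[2];         /* small fixed arity, enough for the sketch */
} pipelined_call;

/* the example batch, t1 := x`a(); t2 := y`b(); t3 := t1`c(t2);
   only the final slot's result needs to be sent back . */
static const pipelined_call batch[3] = {
    { .target = { REMOTE_OBJ, 1 }, .selector = 10, .argc = 0 },   /* t1 */
    { .target = { REMOTE_OBJ, 2 }, .selector = 11, .argc = 0 },   /* t2 */
    { .target = { RESULT_SLOT, 0 }, .selector = 12, .argc = 1,
      .args = { { RESULT_SLOT, 1 } } },                           /* t3 */
};

int main(void)
{
    printf("batch of %d calls in one round-trip\n",
           (int)(sizeof batch / sizeof batch[0]));
    return 0;
}
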
Promise pipelining should be distinguished from
parallel asynchronous message passing.
In a system supporting parallel message passing
but not pipelining,
the messages x`a() and y`b()
in the above example could proceed in parallel,
but the send of t1`c(t2) would have to wait until
both t1 and t2 had been received,
even when x, y, t1, and t2 are on the same remote machine.
. Promise pipelining vs
pipelined message processing:
. in Actor systems,
it is possible for an actor to begin
processing a message before having completed
processing of the previous message.
[. this is the usual behavior for Ada tasks;
tasks are very general, and the designer of one
can make a task that does nothing more than
collect and sort all the messages that get queued;
and then even when it accepts a job,
it can subsequently requeue it .]

the co operator

8.20: adda/co/the co operator:
. as shown in chapel,
instead of declaring a block to be a coprogram,
any stmt can be run as a coprogram with:
co stmt;
-- the stmt can be a procedure literal;
[8.28: I'm wondering why this couldn't be
obvious to the compiler
or to an llvm-type runtime .]
. co { set of stmt's}
-- runs each element of the set in parallel;
var's can be declared as sets of stmts .
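. the same two shapes exist in today's C toolchain;
e.g., OpenMP tasks can spawn a single stmt
or a set of stmts
(a sketch of the analogy, not of adda or chapel syntax;
needs a compiler with -fopenmp):

#include <stdio.h>
#include <omp.h>

static void stmt_a(void) { printf("a\n"); }
static void stmt_b(void) { printf("b\n"); }

int main(void)
{
    #pragma omp parallel
    #pragma omp single
    {
        /* like `co stmt;` : run one stmt concurrently */
        #pragma omp task
        stmt_a();

        /* like `co { set of stmt's }` : run a set in parallel */
        #pragma omp task
        stmt_b();
        #pragma omp task
        printf("c\n");

        #pragma omp taskwait   /* join before leaving the block */
    }
    return 0;
}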

the Concurrency and Coordination Runtime (CCR)

8.19: adda/co/the ms`ccr model:
. the new view I got from reading about
ms`robotics`lang and the CCR
was that task entries are separate from tasks,
and can be shared among tasks .
. entries can be dynamically assigned to tasks;
agents can dynamically change .
. integrating this with OOP,
any msg can be sent to any task
or any shared queue;
and it can look at both the
name and the arg list`types,
to decide how to handle it .
thoughts on Ada's Tasking:
. an entry corresponds to an obj`msg;
each entry of an Ada task has its own queue
on which incoming calls are posted .
. Ada protected types save time by
skipping the queue posting,
and going straight to atomic access,
or else waiting for another to finish .
. the queueing then must be occurring
in the scheduler:
tasks are suspended waiting for a resource,
and if each resource had its own list,
then every time that resource is finished,
the scheduler checks that resource's queue
for the next task to wake .
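. that per-resource wait list is essentially
a condition variable;
a sketch of the idea in C with pthreads
(the resource type is my own illustration):

#include <pthread.h>
#include <stdbool.h>

/* one wait queue per resource: a waiting task sleeps on the
   resource's condition variable, and whoever releases the
   resource wakes exactly one waiter . */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  queue;   /* the resource's own wait list */
    bool busy;
} resource;

void resource_acquire(resource *r)
{
    pthread_mutex_lock(&r->lock);
    while (r->busy)                      /* suspend on this resource's queue */
        pthread_cond_wait(&r->queue, &r->lock);
    r->busy = true;
    pthread_mutex_unlock(&r->lock);
}

void resource_release(resource *r)
{
    pthread_mutex_lock(&r->lock);
    r->busy = false;
    pthread_cond_signal(&r->queue);      /* wake the next waiting task */
    pthread_mutex_unlock(&r->lock);
}

int main(void)
{
    resource r = { PTHREAD_MUTEX_INITIALIZER,
                   PTHREAD_COND_INITIALIZER, false };
    resource_acquire(&r);    /* would block here if another task held r */
    resource_release(&r);
    return 0;
}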

why isn't oop's modularity thread-safe?

8.6: adda/concurrency/oop`modular not threadsafe:

. assuming the oop lang's are very modular,
why wouldn't they be easy to distribute?
some have said it's the central heap model .
. in fact,
java was not really that modular:
whether it was from obj' vs value semantics
or letting other objects share in the
modification of locals,
or simply not being able to
finish one self change
before being read by another thread,
java was not always threadsafe .
. to be {concurrently modular, threadsafe},
class methods need to be atomic
as are Ada's protected types .
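. e.g., the pthreads equivalent of a protected type
is an object whose every method takes the
object's own lock
(a sketch; Ada generates this discipline for you,
while C makes you write it by hand):

#include <pthread.h>

/* a counter whose methods are atomic, in the spirit of an
   Ada protected type: every method brackets its body with
   the object's own lock, so partial updates are never visible . */
typedef struct {
    pthread_mutex_t lock;
    long value;
} counter;

void counter_init(counter *c)
{
    pthread_mutex_init(&c->lock, NULL);
    c->value = 0;
}

void counter_add(counter *c, long n)
{
    pthread_mutex_lock(&c->lock);
    c->value += n;                  /* the whole update is one critical section */
    pthread_mutex_unlock(&c->lock);
}

long counter_read(counter *c)
{
    pthread_mutex_lock(&c->lock);
    long v = c->value;
    pthread_mutex_unlock(&c->lock);
    return v;
}

int main(void)
{
    counter c;
    counter_init(&c);
    counter_add(&c, 5);
    return counter_read(&c) == 5 ? 0 : 1;
}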

. would it help if they didn't share?
[8.17: ie,
why is the default assignment a pointer share;
or, why have sharing pointers without
distinguishing sharers from owners ? ]

[8.17: can't we find concurrency elsewhere?]
. the basic structure hasn't changed
(programs = dstrs + algor's):
obj`methods are serving up only brief procedures;
the bulk of processing comes from the
algorithm that employs the obj's .
[8.17: no:
. even if the methods are brief,
and even if there are many subroutine calls;
all the leaf calls are to obj`methods
-- and most time is actually spent in leaf calls;
however,
at the algorithm level,
we can identify many concurrable calls
for which the compiler can verify
that what we are calling concurrent
in fact involves no object var'sharing .]