2016-09-17

Milk for openMP

9.15: news.adda/lang/co/Milk for openMP:
9.17: summary:
. the Milk language extends openMP with memory optimizations,
so that existing openMP code can be sped up without a rewrite.
. if you are starting from scratch,
it is better to avoid openMP altogether.

Larry Hardesty September 2016:
New programming language delivers fourfold speedups on
problems common in the age of big data.
In today’s computer chips, memory management is based on
what computer scientists call the principle of locality:
If a program needs a chunk of data stored at some memory location,
it probably needs the neighboring chunks as well.
But that assumption breaks down in the age of big data,
now that computer programs more frequently act on
just a few data items
scattered arbitrarily across huge data sets (sparse data).
Since fetching data from their main memory banks
is the major performance bottleneck in today’s chips,
having to fetch it more frequently
can dramatically slow program execution.
. a programming language called Milk
lets application developers manage memory more efficiently
in programs that deal with scattered data points in large data sets.

Thinking locally:
Today’s computer chips are not optimized for sparse data
— in fact, the reverse is true.
Because fetching data from the chip’s main memory bank is slow,
every core, or processor, in a modern chip has its own “cache,”
a relatively small, local, high-speed memory bank.
Rather than fetching a single data item at a time from main memory,
a core will fetch an entire block of data.
And that block is selected according to the principle of locality.
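
To make the cost of block fetching concrete, here is a minimal sketch that counts how often an access stream moves to a different cache line. The 64-byte line size, the 8-byte item size, the one-line cache model, and all function names are assumptions for illustration; none of them come from the article.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 64-byte cache lines holding 8-byte items (not from the article). */
enum { LINE_BYTES = 64, ITEM_BYTES = 8, ITEMS_PER_LINE = LINE_BYTES / ITEM_BYTES };

/* Count how many times the access stream lands on a different cache line
 * than the previous access. This models a cache that holds a single line,
 * which is the worst case for scattered data. */
size_t count_line_fetches(const size_t *idx, size_t n)
{
    size_t fetches = 0;
    size_t current_line = (size_t)-1;   /* sentinel: no line loaded yet */
    for (size_t i = 0; i < n; i++) {
        size_t line = idx[i] / ITEMS_PER_LINE;
        if (line != current_line) {
            fetches++;
            current_line = line;
        }
    }
    return fetches;
}

/* Fill idx with 0, 1, 2, ... (neighboring items). */
void fill_sequential(size_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        idx[i] = i;
}

/* Fill idx with items spaced `stride` apart (scattered items). */
void fill_strided(size_t *idx, size_t n, size_t stride)
{
    assert(stride > 0);
    for (size_t i = 0; i < n; i++)
        idx[i] = i * stride;
}
```

Under these assumptions, 64 neighboring items cost 8 line fetches, while 64 items spaced 1000 apart cost 64 fetches, one per access.
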

Batch processing:
Milk simply adds a few commands to OpenMP,
an extension of languages such as C and Fortran
that makes it easier to write code for multicore processors.
With Milk, a programmer inserts a couple additional lines of code
around any instruction that iterates through
a large data collection
looking for a comparatively small number of items.
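
The kind of loop described here might look like the following in plain OpenMP. This is only a sketch of the starting point: it contains no Milk directives (their syntax is not given in the article), and the function name sum_indirect is made up for illustration.

```c
#include <stddef.h>

/* Sum data[idx[i]] over all i. The reads of data[] land wherever idx[]
 * points, so neighboring iterations rarely share a cache line. */
double sum_indirect(const double *data, const size_t *idx, size_t n)
{
    double sum = 0.0;
    /* Plain OpenMP: split the iterations across cores and combine the
     * per-thread partial sums. If the compiler is run without OpenMP
     * support, the pragma is ignored and the loop runs serially. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)n; i++) {
        sum += data[idx[i]];
    }
    return sum;
}
```

The indirect read data[idx[i]] is the part that defeats the cache when idx[] is scattered across a large data[] array.
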
. when a core discovers that it needs a piece of data,
it doesn’t request it from main memory.
Instead, it adds the data item’s address
to a list of locally stored addresses.
When the list is long enough,
all the chip’s cores pool their lists,
group together those addresses that are near each other,
and redistribute them to the cores.
That way, each core requests only data items that it knows it needs
and that can be retrieved efficiently.
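
The deferral idea can be imitated by hand. The sketch below is a single-threaded simplification, not Milk's implementation: it performs count[idx[i]]++ by first appending each target address to a bucket chosen by address range, then replaying the buckets one after another. The function name histogram_batched and the bucket_span parameter are hypothetical, and the pooling of lists between cores is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper, not Milk's API: performs count[idx[i]]++ for all i,
 * but defers the updates and replays them grouped by address range.
 * bucket_span = number of consecutive count[] slots covered by one bucket.
 * Returns 0 on success, -1 if memory for the deferred lists is unavailable. */
int histogram_batched(unsigned *count, size_t count_len,
                      const size_t *idx, size_t n, size_t bucket_span)
{
    assert(bucket_span > 0);
    size_t nbuckets = (count_len + bucket_span - 1) / bucket_span;

    /* start[b] becomes the offset of bucket b inside deferred[];
     * fill[b] counts how many addresses bucket b has received so far. */
    size_t *start = calloc(nbuckets + 1, sizeof *start);
    size_t *fill = calloc(nbuckets + 1, sizeof *fill);
    size_t *deferred = malloc((n ? n : 1) * sizeof *deferred);
    if (!start || !fill || !deferred) {
        free(start); free(fill); free(deferred);
        return -1;
    }

    /* Pass 1: size each bucket, then turn the sizes into offsets. */
    for (size_t i = 0; i < n; i++) {
        assert(idx[i] < count_len);
        start[idx[i] / bucket_span + 1]++;
    }
    for (size_t b = 0; b < nbuckets; b++)
        start[b + 1] += start[b];

    /* Pass 2: instead of touching count[], append each address to its bucket. */
    for (size_t i = 0; i < n; i++) {
        size_t b = idx[i] / bucket_span;
        deferred[start[b] + fill[b]++] = idx[i];
    }

    /* Pass 3: replay bucket by bucket; every update within one bucket
     * hits the same bucket_span-sized region of count[]. */
    for (size_t k = 0; k < n; k++)
        count[deferred[k]]++;

    free(start); free(fill); free(deferred);
    return 0;
}
```

The result is the same as the direct loop; only the order of the memory accesses changes, so that nearby addresses are visited together.
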
That’s the high-level description,
but the details get more complicated.
In fact, most modern computer chips
have several different levels of caches,
each one larger but also slightly less efficient than the last.
The Milk compiler has to keep track of
not only a list of memory addresses
but also the data stored at those addresses,
and it regularly shuffles both around between cache levels.
It also has to decide which addresses should be retained
because they might be accessed again, and which to discard.
The work combines detailed knowledge about
the design of memory controllers and compilers
to implement good optimizations for current hardware.

details at your state college:
Optimizing Indirect Memory References with milk
Authors: Vladimir Kiriansky, Yunming Zhang, Saman Amarasinghe
Massachusetts Institute of Technology, Cambridge, MA, USA
PACT '16: Proceedings of the 2016 International Conference
on Parallel Architectures and Compilation Techniques, pages 299-312
ACM New York, NY, USA ©2016
abstract:
Modern applications such as graph and data analytics,
when operating on real world data,
have working sets much larger than cache capacity
and are bottlenecked by DRAM.
To make matters worse,
DRAM bandwidth is increasing much slower than
per CPU core count, while DRAM latency has been virtually stagnant.
Parallel applications that are bound by memory bandwidth fail to scale,
while applications bound by memory latency
draw a small fraction of much-needed bandwidth.
While expert programmers may be able to tune important applications
by hand through heroic effort,
traditional compiler cache optimizations have not been sufficiently
aggressive to overcome the growing DRAM gap.
In this paper, we introduce milk - a C/C++ language extension
that allows programmers to annotate memory-bound loops concisely.
Using optimized intermediate data structures,
random indirect memory references are transformed into
batches of efficient sequential DRAM accesses.
A simple semantic model enhances programmer productivity
for efficient parallelization with OpenMP.
We evaluate the MILK compiler on parallel implementations of
traditional graph applications,
demonstrating performance gains of up to 3x.

openMP is
low-level & difficult (vs. tools that are high-level & inefficient).
the older tool set of C, MPI, and openMP
is both slow and difficult to use.

. in the openMP model,
all the threads work on shared memory;
it doesn't allow exploiting locality;
modifying shared data may require synchronization
(locks, semaphores).
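
As a small illustration of the synchronization point, here is a sketch of a shared histogram updated from several threads. It uses the standard OpenMP atomic directive, a lighter-weight alternative to the locks mentioned above; the function name histogram_atomic is made up for this example.

```c
#include <stddef.h>

/* count[idx[i]]++ from many threads: two iterations may target the same
 * slot of the shared array, so each update is made atomic. */
void histogram_atomic(unsigned *count, const size_t *idx, size_t n)
{
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++) {
        /* Without this directive, concurrent increments of the same
         * slot could be lost when OpenMP is enabled. */
        #pragma omp atomic
        count[idx[i]]++;
    }
}
```

Every iteration pays for the atomic update even when no other thread touches the same slot, which is part of the cost of sharing data between threads.
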

Serdar Yegulalp:
Apps that run in parallel [in a shared memory model]
contend with each other for memory access,
so any gains from parallel processing
are offset by the time spent waiting for memory.
Milk performs "DRAM-conscious clustering."
Since data shuttled from memory is cached locally on the CPU,
batching together data requests from multiple processes
allows the on-CPU cache to be shared more evenly between them.
. Milk is extending an existing library, OpenMP,
used by annotating sections of the code
with directives ("pragmas") to the compiler to use OpenMP extensions,
and Milk works the same way.
The directives are syntactically similar,
so existing OpenMP apps don't have to be heavily reworked
to be sped up; however, the most advanced use of Milk
also requires writing some function calls.

co.nextbigfuture:
Their ACM paper "Optimizing Indirect Memory References with milk"
has doi of 10.1145/2967938.2967948
If you put the DOI into sci-hub you can read the whole thing of course.
me:
I searched at http://scihub.org/ for 10.1145/2967938.2967948
and found only links to dl.acm.org.
[see a state college for possible free access].
... co:
Use sci-hub.cc. scihub.org doesn't look or work anything like sci-hub.
EDIT: actually it is an open access article and you can just get it from the ACM page by clicking the PDF link.
It's just my habit to always go to sci-hub first.
http://sci-hub.cc/10.1145/2967938.2967948