2010-08-30

LLVM concurrency representations

Mac OS X 10.6 (and later):
. The OpenCL GPGPU implementation is built on
Clang and LLVM compiler technology.
This requires parsing an extended dialect of C at runtime
and JIT compiling it to run on the
CPU, GPU, or both at the same time.
OpenCL (Open Computing Language)
is a framework for writing programs
that execute across heterogeneous platforms
consisting of CPUs, GPUs, and other processors.
OpenCL includes a language (based on C99)
for writing kernels (functions that execute on OpenCL devices),
plus APIs that are used to define and then control the platforms.
Open for Improvement:
. With features like OpenCL and Grand Central Dispatch,
Snow Leopard will be better equipped
to manage parallelism across processors
and push optimized code to the GPU's cores,
as described in WWDC 2008: New in Mac OS X Snow Leopard.
However, in order for the OS to
efficiently schedule parallel tasks,
the code needs to be explicitly optimized
for for parallelism by the compiler.
. LLVM will be a key tool in prepping code for
high performance scheduling.
LLVM-CHiMPS (pdf)
LLVM for the CHiMPS 
(Compiling hll to Massively Pipelined System)
National Center for Supercomputing Applications/
Reconfigurable Systems Summer Institute July 8, 2008/
Compilation Environment for FPGAs:
. Using LLVM Compiler Infrastructure and
CHiMPS Computational Model
. A computational model and architecture for
FPGA computing by Xilinx, Inc.
- Standard software development model (ANSI C)
Trade performance for convenience
- Virtualized hardware architecture
CHiMPS Target Language (CTL) instructions
- Cycle accurate simulator
- Runs on BEE2
Implementation of high level representations:

# Limitations in optimization
- CTL code is generated at compile time
No optimization by LLVM for a source code in which no
such expressions can be optimized at compile time
- LLVM does not have a chance to dynamically optimize
the source code at run time
- LLVM is not almighty
Floating point math is still difficult to LLVM
Cray Opteron Compiler: Brief History of Time (pdf)
Cray has a long tradition of high performance compilers
Vectorization
Parallelization
Code transformation
...
Began internal investigation leveraging LLVM
Decided to move forward with Cray X86 compiler
First release December 2008

Fully optimized and integrated into the compiler
No preprocessor involved
Target the network appropriately:
.  GASNet with Portals . DMAPP with Gemini & Aries .
Why a Cray X86 Compiler?
Standard conforming languages and programming models
Fortran 2003
UPC & CoArray Fortran
. Ability and motivation to provide
high-quality support for
custom Cray network hardware
. Cray technology focused on scientific applications
Takes advantage of Cray’s extensive knowledge of
automatic vectorization and
automatic shared memory parallelization
Supplements, rather than replaces, the available compiler choices

. cray has added parallelization and fortran support .
. ported to cray x2 .
. generating code for upc and caf (pgas langs) .
. supports openmp 2.0 std and nesting .

. Cray compiler supports a full and growing set of
directives and pragmas:
!dir$ concurrent
!dir$ ivdep
!dir$ interchange
!dir$ unroll
!dir$ loop_info [max_trips] [cache_na] ... Many more
!dir$ blockable
man directives
man loop_info
weaknesses:
Tuned Performance
Vectorization
Non-temporal caching
Blocking
Many end-cases
Scheduling, Spilling
No C++, Very young X86 compiler
future:
optimized PGAS -- requires Gemini network for speed
Improved Vectorization
Automatic Parallelization:
. Modernized version of Cray X1 streaming capability
. Interacts with OMP directives
[OpenMP -- Multi-Processing]