ICS 2013 WORKSHOPS & TUTORIALS

The 27th International Conference on Supercomputing (ICS2013) program will include workshops and tutorials scheduled on Monday, June 10th and on Tuesday, June 11th.

Workshops

Tutorials

Schedule

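Monday 10th June

Morning 8:30-12:00

W1: International Workshop on Runtime and Operating Systems for Supercomputers (ROSS)

T3: It's Elemental

Afternoon 13:30-17:00

W1: International Workshop on Runtime and Operating Systems for Supercomputers (ROSS), continued

T2: Fault-tolerance Techniques for HPC

T4: DLA on Multicore with Accelerators

Tuesday 11th June

Morning 8:30-12:00

T1: Compiler Optimization

T5: PerfExpert and MACPO

T6: SnuCL

Afternoon 13:30-17:00

Half Day workshop (Bloch)

T5: PerfExpert and MACPO, continued

T7: Advanced MPI

T8: Charm++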




Workshops

W1: International Workshop on Runtime and Operating Systems for Supercomputers (ROSS)



Duration and schedule:

Full Day (Monday, June 10th 8:30 - 17:00)

Location:

Bloch

Workshop Chairs

Torsten Hoefler (ETH Zurich) and Kamil Iskra (Argonne National Laboratory)

Description

The complexity of node architectures in supercomputers increases as we cross petaflop milestones on the way towards Exascale. Increasing levels of parallelism in multi- and many-core chips and emerging heterogeneity of computational resources coupled with energy and memory constraints force a reevaluation of our approaches towards operating systems and runtime environments. The International Workshop on Runtime and Operating Systems for Supercomputers provides a forum for researchers to exchange ideas and discuss research questions that are relevant to upcoming supercomputers.

Workshop's web page




Duration and schedule:

Half Day workshop (Tuesday, June 11th 13:30 - 17:00)

Location:

Bloch

Organizers

Joseph Sloan (University of Illinois at Urbana-Champaign)

Description

Circuit and logic variability from process scaling is leading to significant reliability problems in future systems. The increasingly stringent power constraints on system designs are making prior hardware and software-based fault tolerance approaches impractical due to their heavy reliance on redundant, worst-case, and conservative designs. Instead, algorithm-based approaches provide applications the flexibility to adapt to inherent application error tolerances and leverage the patterns of higher level abstractions.

Workshop's web page




Tutorials



T1: Compiler Optimization

Duration and schedule:

Half Day tutorial (Tuesday, June 11th 8:30 - 12:00)

Location:

Joplin

Organizers

J. (Ram) Ramanujam (Louisiana State University) and P. (Saday) Sadayappan (The Ohio State University)

Description

On-chip parallelism with multiple cores is now ubiquitous. Because of power and cooling constraints, recent performance improvements in both general-purpose and special-purpose processors have come primarily from increased on-chip parallelism from multiple cores rather than increased clock rates. Parallelism is therefore of considerable interest to a much broader group than developers of parallel applications for high-end supercomputers. Several programming environments have recently emerged in response to the need to develop applications for graphics processing units (GPUs) and multicore processors. This tutorial will address the following topics:

Tutorial's web page




T2: Fault-tolerance Techniques for HPC

Duration and schedule:

Half Day tutorial (Monday, June 10th 13:30 - 17:00)

Location:

Sousa

Organizers

Thomas Hérault and Yves Robert (University of Tennessee)

Description

Resilience is a critical issue for large-scale platforms. This tutorial provides a comprehensive survey of fault-tolerance techniques for high-performance computing. It is organized along four main topics: (i) an overview of failure types (software/hardware, transient/fail-stop) and typical probability distributions (Exponential, Weibull, Log-Normal); (ii) application-specific techniques, such as ABFT for grid-based algorithms or fixed-point convergence for iterative applications; (iii) general-purpose techniques, which include several checkpoint and rollback recovery protocols, possibly combined with replication; and (iv) an evaluation and comparison of relevant execution scenarios through quantitative models (from Young's approximation to Daly's formulas and recent work). The half-day tutorial is open to all ICS 2013 attendees who are interested in the current status and expected promise of fault-tolerant approaches for scientific applications. There are no audience prerequisites: background will be provided for all protocols and probabilistic models. Only the last part of the tutorial, devoted to assessing the future of the methods, will involve more advanced analysis tools.
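
As a small illustration of the quantitative models named in topic (iv), the sketch below computes Young's first-order approximation of the checkpoint interval, T = sqrt(2 C M), where C is the time to write one checkpoint and M is the platform's mean time between failures. The function names, the simple waste model C/T + T/(2M), and the example numbers are illustrative assumptions, not material taken from the tutorial.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order approximation of the checkpoint interval.

    Both arguments must use the same time unit (for example, seconds).
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def first_order_waste(interval: float, checkpoint_cost: float, mtbf: float) -> float:
    """Approximate fraction of time lost (assumes checkpoint_cost << interval << mtbf).

    checkpoint_cost / interval : overhead of writing checkpoints
    interval / (2 * mtbf)      : expected re-execution after a failure
    """
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


if __name__ == "__main__":
    C = 600.0      # assumed: 10 minutes to write one checkpoint
    M = 86_400.0   # assumed: one failure per day on average
    T = young_interval(C, M)
    print(f"interval ~ {T:.0f} s, waste ~ {first_order_waste(T, C, M):.1%}")
```

Under this simple model, checkpointing more often than T wastes time on checkpoints, and checkpointing less often wastes time on re-execution. Refinements such as Daly's formulas add terms this sketch ignores.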

Tutorial's web page




T3: It's Elemental

Duration and schedule:

Half Day tutorial (Monday, June 10th 8:30 - 12:00)

Location:

Sousa

Organizers

Bryan Marker (University of Texas), Jack Poulson (Stanford University) and Robert van de Geijn (University of Texas)

Description

The parallelization of dense matrix computations for distributed-memory architectures is covered at least briefly in most introductory books and courses that include topics on numerical algorithms. The problem is that the algorithms typically covered are not those used in practice. The main objectives of this tutorial are to correct the basic misconceptions that have been perpetuated for at least two decades and to show how looking at the subject in just the right way exposes a systematic framework that allows novices to understand how we as experts develop and implement practical high-performance libraries. This then allows us to bring participants to the forefront of the field, where new mechanical approaches automatically perform the tasks of the expert library developer in this domain.

Tutorial's web page




T4: DLA on Multicore with Accelerators

Duration and schedule:

Half Day tutorial (Monday, June 10th 13:30 - 17:00)

Location:

Joplin / Seeger

Organizers

Piotr Luszczek (University of Tennessee Knoxville) and Aurelien Bouteiller (University of Tennessee Knoxville)

Description

Today, a desktop computer with a multicore processor and a GPU accelerator or a many-core accelerator can already provide a Tera-FLOP of performance. This tremendous computational power can only be fully utilized with the appropriate software infrastructure. Most often a major part of the computational effort in scientific and engineering computing goes towards solving linear algebra sub-problems. This tutorial shows design and optimization techniques of the state-of-the-art numerical libraries for solving problems in dense linear algebra.

The main objective of this tutorial is to show specific methods and their implementations that deal with portability and scalability of high performance codes. The use case of numerical linear algebra serves as a convenient example of how these techniques achieve their main objective — maximizing the efficiency with respect to the metric of choice: peak floating-point performance of the machine.

The tutorial consists of three parts. The first part focuses on the challenges of multicore programming. We show some of the ways of dealing with the prevalent need for parallelism, pitfalls of concurrency, aspects of affinity and locality, varying task granularity, load imbalance, and separation of concerns. We compare our scheduling approach based on DAGs (Directed Acyclic Graphs) against commonly known standards, libraries, and languages such as OpenMP and its tasks, Cilk’s extensions to C, Intel’s Threading Building Blocks for C++, and Apple’s Grand Central Dispatch. The concepts are illustrated by the actual techniques applied within the PLASMA (Parallel Linear Algebra Software for Multicore Architectures) and QUARK (QUeuing And Runtime for Kernels) projects. The second part discusses GPU and/or coprocessor acceleration issues, including software heterogeneity, the system bus bottleneck, and the overlapping techniques available in the various ports of the MAGMA (Matrix Algebra on GPU and Multicore Architectures) project. Finally, the third part treats the ongoing efforts in linear algebra software for distributed memory machines with heterogeneous nodes: the PARSEC (Parallel Runtime Scheduling and Execution Controller) and DPLASMA projects. The key concepts covered in this part are communication-computation overlap, modern techniques for flow control, data distribution, and dependence discovery and tracking through both compiler-oriented methods and runtime discovery.

The target audience consists mainly of users of parallel machines interested in advanced optimization techniques on distributed memory heterogeneous architectures as well as users of dense linear algebra libraries. The prerequisite knowledge includes basic understanding of modern hardware and familiarity with parallel software for multi-core and accelerator units.
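
To make the DAG-based scheduling idea from the first part concrete, here is a minimal sketch that orders the kernel tasks of a tile Cholesky factorization on a 3x3 tile matrix by their dependencies. It is not the QUARK or PLASMA API: the `schedule` function, the task names, and the single ready queue are assumptions made for illustration, and a real runtime would dispatch ready tasks to idle cores in parallel.

```python
from collections import deque


def schedule(tasks: dict[str, list[str]]) -> list[str]:
    """Return one valid execution order for a task DAG.

    `tasks` maps each task name to the names of the tasks it depends on.
    A task becomes ready once all of its dependencies have completed.
    """
    remaining = {name: len(deps) for name, deps in tasks.items()}
    dependents: dict[str, list[str]] = {name: [] for name in tasks}
    for name, deps in tasks.items():
        for dep in deps:
            dependents[dep].append(name)

    ready = deque(name for name, count in remaining.items() if count == 0)
    order = []
    while ready:
        task = ready.popleft()  # a real runtime would hand this to an idle core
        order.append(task)
        for successor in dependents[task]:
            remaining[successor] -= 1
            if remaining[successor] == 0:
                ready.append(successor)

    if len(order) != len(tasks):
        raise ValueError("dependency cycle detected")
    return order


# Right-looking tile Cholesky on a 3x3 tile matrix (hypothetical task names;
# the suffix after "_" is the step that produced the update).
cholesky_3x3 = {
    "POTRF0": [],
    "TRSM10": ["POTRF0"],
    "TRSM20": ["POTRF0"],
    "SYRK11_0": ["TRSM10"],
    "GEMM21_0": ["TRSM10", "TRSM20"],
    "SYRK22_0": ["TRSM20"],
    "POTRF1": ["SYRK11_0"],
    "TRSM21": ["POTRF1", "GEMM21_0"],
    "SYRK22_1": ["SYRK22_0", "TRSM21"],
    "POTRF2": ["SYRK22_1"],
}

if __name__ == "__main__":
    print(schedule(cholesky_3x3))
```

Tasks that are ready at the same time, such as the two TRSM kernels after the first POTRF, have no dependence on each other. That independence is the parallelism a DAG scheduler can exploit.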

Tutorial's web page




T5: PerfExpert and MACPO

Duration and schedule:

Full Day tutorial (Tuesday, June 11th 8:30 - 17:00)

Location:

Sousa

Organizers

James Browne, Leonardo Fialho, Ashay Rane (University of Texas)

Description

The goal of this tutorial is to enable application developers and users to optimize the performance of their applications on the multicore chips and multichip nodes (homogeneous or heterogeneous) of modern cluster systems with minimal effort, in particular without having to modify or annotate their programs for measurement or learn the details of performance measurement, such as which performance counters to use.
The compute nodes of modern cluster computers almost universally contain multiple multicore processors and increasingly also incorporate accelerators such as Nvidia GPGPUs or Intel MICs. Optimization of application codes for these environments has, in the past, required detailed knowledge of computer architecture, compilers, and performance optimization. There are three aspects to performance optimization for these environments: optimization for the multicore chips, identification of code segments to be mapped to the accelerators for execution, and optimization of the code on the accelerators. This tutorial will approach the first two of these tasks in detail and sketch approaches to the third task.
The tutorial will use the PerfExpert and MACPO tools to aid in these tasks. PerfExpert is an expert system that captures knowledge of multicore chip architecture and compilers. It automatically detects probable performance bottlenecks in each important procedure and loop and identifies the likely cause of each bottleneck. For each bottleneck type, PerfExpert suggests optimization strategies, code examples, and compiler switches that the application developer can use to improve performance. MACPO is a tool that generates metrics such as reuse distances, strides, cache conflicts, and cache latencies for the data structures in code segments that are performance bottlenecks. Combining the code segment measurements and analyses from PerfExpert with the knowledge of data structure access behavior from MACPO enables effective diagnosis of performance bottlenecks and selection of code segments for accelerator execution.
The tutorial will be “hands on” with minimal lecturing. Each participant will have a guest account on Stampede and/or Lonestar at TACC. Each participant should bring a laptop with which she/he can access these systems.
Example and demonstration applications are provided as a part of the tutorial, but participants are encouraged to come prepared to apply PerfExpert to one of their own applications. (The application must successfully compile and execute on either Longhorn, Lonestar or another cluster upon which PerfExpert has been installed.)
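
Two of the metrics mentioned above, reuse distance and stride, can be illustrated on a toy address trace. The sketch below only shows the usual definitions and is not how MACPO computes them. The function names and the example trace are assumptions, and the quadratic-time reuse computation is chosen for readability rather than speed.

```python
def reuse_distances(trace):
    """For each access, count the distinct addresses touched since the previous
    access to the same address; None marks a first-time access."""
    last_seen = {}
    distances = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            between = trace[last_seen[addr] + 1:i]
            distances.append(len(set(between)))
        else:
            distances.append(None)
        last_seen[addr] = i
    return distances


def strides(trace):
    """Difference between each pair of consecutive addresses in the trace."""
    return [b - a for a, b in zip(trace, trace[1:])]


if __name__ == "__main__":
    trace = [0, 8, 16, 0, 8, 24, 0]  # hypothetical byte addresses
    print(reuse_distances(trace))
    print(strides(trace))
```

A small reuse distance suggests the data is likely still in cache when it is touched again. A constant stride suggests an access pattern that hardware prefetchers and vector units tend to handle well.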

Tutorial's web page




T6: SnuCL

Duration and schedule:

Half Day tutorial (Tuesday, June 11th 8:30 - 12:00)

Location:

Seeger

Organizers

Jaejin Lee (Seoul National University)

Description

OpenCL is a programming model for heterogeneous parallel computing systems. OpenCL provides a common abstraction layer across different multicore architectures, such as CPUs, GPUs, DSPs, and Xeon Phi processors. However, current OpenCL is restricted to a single heterogeneous system. To target heterogeneous clusters, programmers must combine the OpenCL framework with a communication library, such as MPI. The same is true for CUDA. This tutorial will cover accelerator architectures, such as GPUs and Xeon Phi, and give an introduction to OpenCL programming. In addition, it introduces an OpenCL framework called SnuCL. SnuCL naturally extends the original OpenCL semantics to the heterogeneous cluster environment. It is freely available, open-source software developed at Seoul National University. SnuCL gives the programmer the illusion of a single heterogeneous system, and it achieves both high performance and ease of programming. Finally, we characterize the performance of an OpenCL implementation (SNU NPB suite) of the NAS Parallel Benchmark suite.

Tutorial's web page




T7: Advanced MPI

Duration and schedule:

Half Day tutorial (Tuesday, June 11th 13:30 - 17:00)

Location:

Joplin

Organizers

Pavan Balaji (Argonne National Laboratory) and Torsten Hoefler (ETH Zurich)

Description

The Message Passing Interface (MPI) has been the de facto standard for parallel programming for nearly two decades. However, the vast majority of applications rely only on basic MPI-1 features without taking advantage of the rich functionality the rest of the standard provides. Further, MPI-3 (released in September 2012) introduces a large number of new features, including efficient one-sided communication, support for external tools, non-blocking collective operations, and improved support for topology-aware data movement. This advanced-level tutorial will provide an overview of various powerful features in MPI, especially those in MPI-2 and MPI-3.

Tutorial's web page




T8: Charm++

Duration and schedule:

Half Day tutorial (Tuesday, June 11th 13:30 - 17:00)

Location:

Seeger

Organizers

Laxmikant "Sanjay" Kale and Jonathan Lifflander (University of Illinois at Urbana-Champaign)

Description

The tutorial will present Charm++, a portable, C++-based parallel programming system, designed with programmer productivity as a major goal. Attendees will become familiar with the asynchronous, object-based programming model of Charm++ and the capabilities its adaptive runtime system offers.
Charm++ is a portable, mature environment that provides the foundation for several highly scalable and widely used applications in science and engineering. Charm++ runs on multicore desktops with shared memory, clusters of all sizes, and IBM and Cray supercomputers, and efficiently supports accelerators where available. The Parallel Programming Laboratory has developed and supported Charm++ and its predecessor systems for over 20 years. Its most widely used application, the biomolecular simulation program NAMD, accounts for a large fraction of NSF supercomputer usage and won the Gordon Bell prize at SC 2002. Charm++ also won a Performance award in the 2011 HPC Challenge and reached Finalist status in the 2012 competition. Its adaptive features will be necessary to effectively use increasingly heterogeneous processors for the next generation of applications supporting sophisticated techniques, such as multiple physics and adaptive refinement.
With Charm++, programmers decompose a computation into a large number of objects, without regard for the number of processors in a given machine. The runtime system assigns these objects to processors, naturally overlapping communication with computation and automating resource management. This flexibility enables optimization of characteristics like load balance and network topology-aware mapping independent of the application's core logic. Further, the runtime system supports multiple fault tolerance schemes, so applications can continue to run through component failures.
The tutorial will start by introducing attendees to message-driven parallel programming with examples presented in Charm++. After that, we will cover the basics of creating a parallel program in Charm++. Attendees will then learn how to enable load balancing through migratable objects and how to detect and treat load imbalance. The tutorial will conclude with a hands-on session in which attendees will construct a simple application, and an overview of the tools and advanced capabilities of the Charm++ ecosystem.
The target audience for this tutorial is programmers and researchers with some parallel programming experience and basic knowledge of C++.
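
The idea of decomposing a computation into many more objects than processors, and letting a runtime place them, can be sketched without any Charm++ code. The Python below is not the Charm++ API: `greedy_assign`, `processor_loads`, and the example loads are assumptions that illustrate one simple greedy placement strategy, in which the heaviest objects are placed first, each on the currently least-loaded processor.

```python
import heapq


def greedy_assign(loads, num_procs):
    """Map object index -> processor id, placing the heaviest objects first,
    each on the processor with the smallest total load so far."""
    if num_procs <= 0:
        raise ValueError("num_procs must be positive")
    heap = [(0.0, proc) for proc in range(num_procs)]  # (total load, processor id)
    heapq.heapify(heap)
    assignment = {}
    for obj in sorted(range(len(loads)), key=lambda i: loads[i], reverse=True):
        total, proc = heapq.heappop(heap)
        assignment[obj] = proc
        heapq.heappush(heap, (total + loads[obj], proc))
    return assignment


def processor_loads(loads, assignment, num_procs):
    """Total load placed on each processor under the given assignment."""
    totals = [0.0] * num_procs
    for obj, proc in assignment.items():
        totals[proc] += loads[obj]
    return totals


if __name__ == "__main__":
    object_loads = [5, 3, 3, 2, 2, 1]  # hypothetical measured loads per object
    mapping = greedy_assign(object_loads, num_procs=2)
    print(mapping, processor_loads(object_loads, mapping, 2))
```

Because the application only defines objects and their work, a placement like this can be recomputed whenever measured loads change, without touching the application's core logic.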

Tutorial's web page
