# Guide:TAUChapel

### From TAU Wiki

## Revision as of 19:01, 5 October 2013


# Chapel

## MonteCarlo example

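To test out some of Chapel's language features, let us program a Monte Carlo simulation to calculate PI. We can estimate PI by counting how many randomly chosen points with coordinates x,y fall inside the unit circle, i.e. x^2+y^2<=1. For points drawn uniformly from the square that encloses the circle, the fraction that falls inside approaches PI/4, so PI is approximately 4 times that fraction.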

### Basic

Here is the basic routine that computes PI:

    proc compute_pi(p_x: [] real(64), p_y: [] real(64)) : real {
      var c : sync int;
      c = 0;
      // iterations run in parallel; c is a sync variable so updates do not race
      forall i in 1..n {
        if (p_x[i] ** 2 + p_y[i] ** 2 <= 1) then
          c += 1;
      }
      return c * 4.0 / n;
    }

Notice that the **forall** here will compute each iteration in parallel, hence the need to declare **c** as a **sync** variable. Performance here is limited by the need to synchronize access to **c**. Take a look at this profile:

70% of the time is spent in synchronization. Let's see if we can do better.
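The `c += 1` above relies on implicit reads and writes of a sync variable, which recent Chapel releases may warn about or reject. As a minimal sketch of the same counting loop with the synchronization spelled out, the helper below uses the explicit `readFE`/`writeEF` methods. The name `count_in_circle` and the four sample points are assumptions for illustration, not part of the original example.

```chapel
// Hypothetical helper (not from the original guide): counts the points
// that fall inside the unit circle using an explicitly managed sync counter.
proc count_in_circle(p_x: [] real(64), p_y: [] real(64)): int {
  var c: sync int = 0;             // starts in the "full" state holding 0
  forall i in p_x.domain {
    if (p_x[i] ** 2 + p_y[i] ** 2 <= 1) then
      // readFE waits for full and leaves c empty, so other tasks block
      // until writeEF stores the new value and marks c full again.
      c.writeEF(c.readFE() + 1);
  }
  return c.readFF();               // read the final value, leaving c full
}

// Illustrative data: three of these four points satisfy x^2+y^2<=1.
var xs = [0.0, 0.5, 1.0, 0.9];
var ys = [0.0, 0.5, 0.0, 0.9];
writeln(count_in_circle(xs, ys));
```

Every increment still passes through the single variable `c`, so this form makes the bottleneck seen in the profile visible in the code itself.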

### Procedure promotion

One feature of Chapel is procedure promotion: calling a procedure that takes scalar arguments with arrays instead behaves as if each element of the arrays were passed to the procedure, in parallel:

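    proc compute_pi(p_x: [] real(64), p_y: [] real(64)) : real {
      var c : sync int;
      c = 0;  // a sync variable starts empty, so give it a value before updating it
      // in_circle is promoted over the arrays, yielding one bool per point
      forall i in in_circle(p_x, p_y) {
        c += i;
      }
      return c * 4.0 / n;
    }

    proc in_circle(x: real(64), y: real(64)): bool {
      return (x ** 2 + y ** 2) <= 1;
    }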
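To see promotion in isolation, here is a small sketch that calls the scalar `in_circle` procedure with two arrays and stores the result. The arrays `xs` and `ys` and their values are made up for illustration.

```chapel
// Scalar procedure, as in the example above.
proc in_circle(x: real(64), y: real(64)): bool {
  return (x ** 2 + y ** 2) <= 1;
}

// Illustrative sample points (assumed values, not from the guide).
var xs = [0.0, 0.5, 0.9];
var ys = [0.0, 0.5, 0.9];

// Passing arrays where scalars are expected promotes the call:
// in_circle is applied to each pair (xs[i], ys[i]), and the results
// are collected into an array of bool with the same shape.
var inside = in_circle(xs, ys);
writeln(inside);
```

The promoted call zips its array arguments, so both arrays need the same shape.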

### Reduction

Furthermore, a little reorganization allows us to take advantage of Chapel's built-in reduction:

    proc compute_pi(p_x: [] real(64), p_y: [] real(64)) : real {
      var c : int;
      c = + reduce in_circle(p_x, p_y);
      return c * 4.0 / n;
    }

This also improves performance.
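For anyone who wants to run the reduction version end to end, the following is a minimal, self-contained sketch. The `config const n` declaration, the use of `fillRandom` from the standard `Random` module, and the seed values are assumptions added here; the original guide does not show how the points are generated.

```chapel
use Random;

// Number of sample points; assumed to be a config const so it can be
// overridden on the command line with --n=<value>.
config const n = 1000000;

proc in_circle(x: real(64), y: real(64)): bool {
  return (x ** 2 + y ** 2) <= 1;
}

proc compute_pi(p_x: [] real(64), p_y: [] real(64)): real {
  // The promoted call yields one bool per point; + reduce counts the trues.
  const c = + reduce in_circle(p_x, p_y);
  return c * 4.0 / n;
}

var p_x: [1..n] real(64);
var p_y: [1..n] real(64);

// Fill both arrays with uniform values in [0, 1); the seeds are arbitrary.
fillRandom(p_x, 17);
fillRandom(p_y, 31);

writeln(compute_pi(p_x, p_y));
```

Since the points lie in the first quadrant only, the count covers a quarter circle inside the unit square, which is why the result is multiplied by 4.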

### Multiple Locales

Let's look at how the arrays of x and y values are allocated:

    var p_x: [1..n] real(64);
    var p_y: [1..n] real(64);

However, Chapel provides a way to distribute these arrays across multiple locales:

    use BlockDist;  // provides the Block distribution

    const space = {1..n};
    var Dom: domain(1) dmapped Block(boundingBox=space) = space;
    var p_x: [Dom] real(64);
    var p_y: [Dom] real(64);

This **Block** mapping will allocate the elements block-wise among the locales. Furthermore the reduction used earlier will continue to work.
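To check how the elements end up spread over the locales, a small sketch like the one below can help. It assumes a Chapel release that provides `blockDist.createDomain` (2.0 or later, to my understanding); older releases spell the distribution `dmapped Block(boundingBox=space)` as shown above. The array name `owner` and the small default `n` are made up for illustration.

```chapel
use BlockDist;

// Small problem size so the output is easy to read (assumed value).
config const n = 8;

// Assumption: this factory method exists in the installed Chapel version.
const Dom = blockDist.createDomain({1..n});

// One int per index, distributed block-wise like p_x and p_y would be.
var owner: [Dom] int;

// A forall over a distributed array runs each iteration on the locale
// that owns the element, so here.id records the owning locale.
forall a in owner do
  a = here.id;

writeln(owner);
```

When the program is built for multiple locales and launched with, for example, `-nl 4`, contiguous runs of indices should report the same locale id. On a single locale every entry is 0.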

### Performance Results

There are a couple of options for collecting Chapel performance data with TAU. To begin, configure TAU with PDT, pthreads and bfd (for sampling).

Compiling a Chapel program with **--savec c_code** will store the intermediate C source files in **c_code**. Compiling the C code with TAU is easy:

make -f c_code/Makefile CC=tau_cc.sh

But since each source file is included as a header, none of them will be instrumented. However, these source files can be modified to add TAU probes directly. Furthermore, sampling can be added to get more detail (time spent in the pthread library, for example).

Using PDT is also an option. Here is a profile from Titan (Cray XK7) using PDT for instrumentation: