From TAU Wiki

(Difference between revisions)
Jump to: navigation, search
Revision as of 19:01, 5 October 2013 (edit)
Scottb (Talk | contribs)
(Performance Results)
← Previous diff
Revision as of 19:10, 5 October 2013 (edit)
Scottb (Talk | contribs)
(Source Code)
Next diff →
Line 97: Line 97:
=== Source Code === === Source Code ===

Revision as of 19:10, 5 October 2013



MonteCarlo example

To test out some Chapel's language features let us program a MonteCarlo simulation to calculate PI. We can calculate PI by assessing how many points with coordinates x,y fit in the unit circle, ie x^2+y^2<=1.


Here is the basic routine that computes PI:

proc compute_pi(p_x: [] real(64), p_y: [] real(64)) : real {

 var c : sync int;
 c = 0;
 forall i in 1..n {
   if (x ** 2 + y ** 2 <= 1) then
       c += 1;
 return c * 4.0 / n;


Notice that the forall here will compute each iteration in parallel, hence the need to define variable c as a sync variable. Performance here is limited by the need to synchronize access to c. Take a look of this profile:


70% percent of the time is spent in synchronization. Let's see if we can do better.

Procedure promotion

One feature of Chapel is procedure promotion, this is where calling a procedure that takes scalar arguments with an array, will have be as if each element of the array is passed to the procedure in parallel:

proc compute_pi(p_x: [] real(64), p_y: [] real(64)) : real {

 var c : sync int;
 forall i in in_circle(p_x, p_y) {
   c += i;
 return c * 4.0 / n;

proc in_circle(x: real(64), y: real(64)): bool
  return (x ** 2 + y ** 2) <= 1;


Furthermore with reorganization will allow us to take advantage of Chapel's built in reduction:

proc compute_pi(p_x: [] real(64), p_y: [] real(64)) : real {

 var c : int;
 c= +reduce in_circle(p_x, p_y);
 return c * 4.0 / n;


This also improves performance:


Multiple Locales

Let's look at how the array of x and y values are allocated:

var p_x: [1..n] real(64);
var p_y: [1..n] real(64);

However Chapel provides a way to distribute these array across multiple locales:

const space = {1..n};
var Dom: domain(1) dmapped Block(boundingBox=space) = space;

var p_x: [Dom] real(64);
var p_y: [Dom] real(64);

This Block mapping will allocate the elements block-wise among the locales. Furthermore the reduction used earlier will continue to work.

Performance Results

There are a couple of options for collecting Chapel performance data with TAU. To begin configure TAU with PDT, pthreads and bfd (for sampling).

Compiling Chapel with --savec c_code will store the intermediate C sources files in c_code. Compiling the C code with TAU is easy:

make -f c_code/Makefile

But since each source file is included as a header, none of them will be instrumented. However these sources files can be modified to add TAU probes directly. Furthermore sampling can be added get more detail (time spent in the pthread library for example).

Using PDT is also an option, here is a profile from Titan (Cray XK7) using PDT for instrumentation:


Source Code

Image:Chapel pi.chpl

Personal tools