[Tau-announcements] TAU v2.20.1 released

Sameer Shende sameer at cs.uoregon.edu
Tue Mar 22 07:43:38 PDT 2011


	We are pleased to announce the release of TAU v2.20.1:


The following new features have been added since TAU 2.20 was released on
November 11, 2010.

1. Improved support for profiling GPGPU applications

In our last release, we introduced support for tracking events occurring on the host for applications that use GPGPUs. This release extends that work: timings on the GPGPU can now be observed in a separate thread of execution, and hardware performance counters on the GPGPU can be accessed using PAPI. Data transfers between the host and GPGPU tasks are shown as separate atomic events. This release supports OpenCL as well as CUDA v3 and v4.x. To enable these measurements, configure TAU with -cuda=<dir> and use:
% tau_exec -cuda -T serial ./a.out
% tau_exec -opencl -T serial ./a.out

when running an uninstrumented or instrumented application. TAU intercepts the interactions between the host and the device and generates performance information at the CUDA driver or OpenCL library level. This release also supports demangling of internal kernel names at runtime using the BFD library. We would like to thank NVIDIA Corporation for their support of the TAU project and the assistance provided.
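
The two tau_exec invocations above differ only in the backend option, so a small wrapper can build the command line for either one. The sketch below is an illustration only: tau_gpu_cmd is a hypothetical helper name, not something shipped with TAU. It prints the command instead of running it, so it can be tried on a machine without TAU installed.

```shell
#!/bin/sh
# Sketch only: tau_gpu_cmd is a hypothetical helper, not part of TAU.
# It prints the tau_exec command line for a given GPU backend so the
# command can be reviewed before it is run.
tau_gpu_cmd() {
    backend="$1"
    shift
    case "$backend" in
        cuda|opencl)
            # Same option order as the commands in this announcement:
            # backend option first, then -T serial, then the application.
            echo "tau_exec -$backend -T serial $*"
            ;;
        *)
            echo "unsupported backend: $backend" >&2
            return 1
            ;;
    esac
}

tau_gpu_cmd cuda ./a.out      # prints: tau_exec -cuda -T serial ./a.out
tau_gpu_cmd opencl ./a.out    # prints: tau_exec -opencl -T serial ./a.out
```

Once TAU's bin directory is on the PATH, the printed command can be pasted into the shell and run as is.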

2. Profiling accelerator primitives with the PGI compiler
TAU now supports tracking GPGPU executions with recent PGI compilers, showing not only the time taken in various runtime routines but also the data transfers associated with each variable. Variable names and attributes (array sizes, dimensions, element sizes, and stride), as well as the routines, file names, and line numbers associated with each PGI C "#pragma acc" and Fortran "!$acc" region directive, are now shown in the profile. In the example below, we can observe the upload and download times for variables "a", "b", and "c" in the matrix multiply operation (a=b*c):

%Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive Name
              msec   total msec                          usec/call
100.0        0.089        4,850           1           1    4850592 .TAU application
100.0           90        4,850           1           5    4850503 mymatrixmultiply [{mmdriv.f90} {1,0}]
 98.1        0.697        4,759           5          85     951959 multiply_matrices [{mm2.f90} {5,0}]
 56.1        2,721        2,721           5           0     544381 __pgi_cu_downloadx multiply_matrices var=a, dims=2, desc.devx=0, desc.devstride=1, desc.hoststride=1, desc.size=3000, desc.extent=3000, elementsize=4 [{/mnt/netapp/home1/sameer/mm/mm2.f90}{20}]
 38.5        1,869        1,869           5           0     373954 __pgi_cu_init multiply_matrices [{/mnt/netapp/home1/sameer/mm/mm2.f90}{9}]
  1.7           81           81           5           0      16206 __pgi_cu_uploadx multiply_matrices var=c, dims=2, desc.devx=0, desc.devstride=1, desc.hoststride=1, desc.size=3000, elementsize=4 [{/mnt/netapp/home1/sameer/mm/mm2.f90}{9}]
  1.6           79           79           5           0      15959 __pgi_cu_uploadx multiply_matrices var=b, dims=2, desc.devx=0, desc.devstride=1, desc.hoststride=1, desc.size=3000, elementsize=4 [{/mnt/netapp/home1/sameer/mm/mm2.f90}{9}]
  0.1            3            3          15           0        217 __pgi_cu_free multiply_matrices [{/mnt/netapp/home1/sameer/mm/mm2.f90}]
  0.1            2            2          15           0        193 __pgi_cu_alloc multiply_matrices [{/mnt/netapp/home1/sameer/mm/mm2.f90}{9}]
  0.0        0.194        0.194           5           0         39 __pgi_cu_module multiply_matrices [{/mnt/netapp/home1/sameer/mm/mm2.f90}{9}]

We would like to thank PGI for their support of the TAU project and the assistance provided.

3. ParaProf enhancements
ParaProf has a new library of example topologies and supports a language for specifying new topologies in the 3D window for visualizing hardware topologies. This allows a user to load a custom topology text file and specify how ranks in an MPI application are mapped to the X, Y, and Z coordinates. It allows a user to view the time spent in a given routine across the entire machine in 3D. Example topologies such as sphere, cylinder and cube are included. Currently, we support the IBM BlueGene/P's topology metadata collected in TAU and all platforms supported by the Cube data format. The 3D visualization window has also been updated with scroll bars.

4. Support for the SCORE-P measurement substrate
The TAU adapters for the SCORE-P measurement substrate (www.score-p.org) have been updated with support for shared objects in tau_exec and tau_run. An uninstrumented application can now be rewritten, or launched by tau_exec, to generate OTF2 trace files that may be loaded in the Vampir visualizer *without* rewriting binary traces or converting from one format to another. A user can therefore run a parallel program and analyze the traces immediately afterwards, without any additional merging, conversion, or unification steps.

5. Bug fixes and enhancements
a. Support for -optCompInst with Cray CCE compilers for OpenMP programs.
b. Updated support for the Sun Solaris CC compiler.
c. Fixed a bug in the order of initialization of PAPI for tracing.
d. tau_instrumentor now supports instrumentation of single-line Fortran DO loops.
e. Updated PostgreSQL jar file support in PerfDMF.
f. Fixed bugs in PGI -optCompInst for C++ and in tau_exec -memory with Intel -optCompInst.
g. Support for TAU_PROFILE_FORMAT="merged" with tau_exec -io.

	Please let us know if we may assist you further.
	- Sameer
	(for tau-team @ cs.uoregon.edu)
