Guide:TAUCrayOpenAcc

From TAU Wiki

Revision as of 21:20, 21 September 2013 by Scottb

Jacobi example

Let's look at a simple Jacobi example written in Cray OpenACC. We will start with a simple OpenACC parallel loop directive right before the Jacobi computation. Here is the TAU profile:

Image:step3_basic.jpg

We have profiles for the Jacobi kernel ("jacobi_$ck_L215_2"), memory copies, and CPU synchronization. Look at the time spent copying data to the GPU: it completely dominates the runtime. Let's look at some details:

Image:step3_bytes.jpg

Nearly 26,000 memory copies for a total of 99 GB. That is a lot of memory being moved. As an improvement, let's try to keep as much data on the GPU as possible.
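
The source of the example is not shown on this page, so the following is only a minimal sketch of what such a first version might look like, assuming a C code with a row-major n x n grid. The function name `jacobi_sweep_basic` and the array layout are hypothetical. Because the data clauses sit on the compute construct itself, both arrays are transferred every time the sweep is called, which is the kind of pattern that produces many memory copies in a profile.

```c
#include <assert.h>

/* Hypothetical first version: one Jacobi sweep on an n x n row-major grid.
 * The only OpenACC directive is a parallel loop right before the computation.
 * Its data clauses copy `in` to the GPU and `out` to and from the GPU on
 * every call, so repeated sweeps keep moving the same arrays. */
void jacobi_sweep_basic(const double *in, double *out, int n)
{
    assert(n >= 3);  /* need at least one interior point */

    #pragma acc parallel loop collapse(2) \
        copyin(in[0:n*n]) copy(out[0:n*n])
    for (int i = 1; i < n - 1; i++) {
        for (int j = 1; j < n - 1; j++) {
            /* New value is the average of the four neighbours. */
            out[i*n + j] = 0.25 * (in[(i-1)*n + j] + in[(i+1)*n + j]
                                 + in[i*n + j - 1] + in[i*n + j + 1]);
        }
    }
}
```

Calling this function in a loop from the host, alternating the two arrays, gives a working solver, but each call pays for its own transfers. A compiler without OpenACC support ignores the directive and runs the loop on the CPU.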

Next we initialize the matrices on the GPU, so the initialization itself runs on the device. This is the profile we see:

Image:step4_basic.jpg

Much better performance: memory copies to the GPU are now a quarter of what they were. The second kernel ("jacobi_$ck_L281_6") is the final reduction. Here is the number of bytes copied:

Image:step4_bytes.jpg

Only 25 GB in about 11,500 copies.
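
As a rough illustration of keeping data on the GPU, the sketch below wraps the whole solve in one OpenACC data region, initializes both arrays in a device kernel, and ends with a reduction kernel. This is an assumption about how such a version could be written, not the code that was profiled above. The function name `jacobi_resident`, the boundary condition (top row set to 1), and the choice of a max-difference reduction are all hypothetical.

```c
#include <assert.h>

/* Hypothetical second version: the arrays stay resident on the GPU for the
 * whole solve. `a` is copied back once at the end, `b` exists only on the
 * device (the host copy of `b` is not updated). Each iteration does two
 * sweeps (a -> b, then b -> a). Returns the largest |a - b| after the
 * last iteration. */
double jacobi_resident(double *a, double *b, int n, int iters)
{
    assert(n >= 3);
    double err = 0.0;

    #pragma acc data copyout(a[0:n*n]) create(b[0:n*n])
    {
        /* Initialization runs on the device: top row 1, everything else 0. */
        #pragma acc parallel loop collapse(2) present(a[0:n*n], b[0:n*n])
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                double v = (i == 0) ? 1.0 : 0.0;
                a[i*n + j] = v;
                b[i*n + j] = v;
            }
        }

        for (int it = 0; it < iters; it++) {
            #pragma acc parallel loop collapse(2) present(a[0:n*n], b[0:n*n])
            for (int i = 1; i < n - 1; i++)
                for (int j = 1; j < n - 1; j++)
                    b[i*n + j] = 0.25 * (a[(i-1)*n + j] + a[(i+1)*n + j]
                                       + a[i*n + j - 1] + a[i*n + j + 1]);

            #pragma acc parallel loop collapse(2) present(a[0:n*n], b[0:n*n])
            for (int i = 1; i < n - 1; i++)
                for (int j = 1; j < n - 1; j++)
                    a[i*n + j] = 0.25 * (b[(i-1)*n + j] + b[(i+1)*n + j]
                                       + b[i*n + j - 1] + b[i*n + j + 1]);
        }

        /* Final reduction: largest difference between the two arrays. */
        #pragma acc parallel loop reduction(max:err) present(a[0:n*n], b[0:n*n])
        for (int k = 0; k < n * n; k++) {
            double d = a[k] - b[k];
            if (d < 0.0) d = -d;
            err = (d > err) ? d : err;
        }
    }
    return err;
}
```

With this structure the only bulk transfer is the single copy of `a` back to the host when the data region ends, plus the scalar reduction result.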

Configuring

Here is how to configure and use TAU to collect Cray OpenACC performance data:

./configure -arch=craycnl -cuda=/opt/nvidia/cudatoolkit/4.1.28 -cudalibrary=-L/opt/nvidia/cudatoolkit/4.1.28/lib64\ -L/opt/nvidia/cudatoolkit/4.1.28/extras/CUPTI/lib64\ -lcupti\ -L/opt/cray/nvidia/default/lib64\ -lcuda -bfd=none -mpi -useropt=-DTAU_MPICH3

And run this way:

export TAU_CUPTI_API=driver
aprun -n 8 tau_exec -T mpi,cray,cupti -cupti ./himeno