Guide:BlueGene PAPI Counter Analysis

From Tau Wiki

Introduction

PAPI provides access to the limited set of hardware counters available on IBM Blue Gene machines. Here, we perform a simple analysis of two matrix multiply algorithms. The full source code of this example is provided in the TAU distribution in the examples/papi directory. For this guide, we have increased the problem size to 1024.

Matrix Multiply

We analyze two matrix multiply algorithms. First, the simplest:

for (i = 0; i < SIZE; i++) 
  for (j = 0; j < SIZE; j++)
    for (k = 0; k < SIZE; k++)
      C[i][j] += A[i][k] * B[k][j];

The second employs a strip mining optimization:

for (i=0; i < SIZE; i++)
  for (k=0; k < SIZE; k++)
    for (sz = 0; sz < SIZE; sz+=CACHE) {
      vl = (SIZE - sz < CACHE ? SIZE - sz : CACHE); 
      for(strip = sz; strip < sz+vl; strip++)
        C[i][strip] += A[i][k]*B[k][strip];
    }

PAPI Event Selection

We choose the following PAPI counters to track for our execution. Some counters are mutually exclusive, so you may need to run the program more than once.

PAPI_L2_DCA  : Level 2 data cache accesses
PAPI_FML_INS : Floating point multiply instructions
PAPI_FMA_INS : FMA instructions completed
PAPI_BGL_OED : BGL special event: Oedipus operations

Experiment

For our experiment, we compile our program with the -O0, -O2, -O3, -O4, and -O5 optimization flags and compare both the time to solution and the hardware counter data. We hope that the hardware counters will give us insight into how the optimizations affect our program.

Results

First, a graph showing the execution times:

Pbgl3.png

Not surprisingly, the overall execution time is ordered by optimization level: level 0 is the slowest, and level 5 is the fastest. Interestingly, though, the strip mining version is slower than the regular matrix multiply at levels 3, 4, and 5. Not only that, but it is progressively slower: level 4 is slower than level 3, and level 5 is slower than level 4.

Next we look at the exclusive times:

Pbgl2.png

The exclusive times for the two matrix multiply methods are the same as the inclusive times because they call no other routines. In this chart, however, the differences are easier to see.

Hardware Counter Results

Following is a table showing the results for all the PAPI counters for each optimization level:

Pbgl.png

The BGL_TIMERS column is the time, in seconds, reported by a low-overhead timer available on Blue Gene systems.

-O0 vs. -O2

Here we see that the compiler has combined the floating point multiply and add instructions into fused multiply-add (FMA) instructions.

-O2 vs. -O3

For the strip-mine method, the compiler has used intrinsic Double Hummer (Oedipus) SIMD instructions to convert 1,073,741,824 FMA instructions into 536,870,912 OED instructions (1,073,741,824 / 2 = 536,870,912).

-O3 vs. -O4

The compiler has converted 1,065,353,216 (1,073,741,824 - 8,388,608) FMA instructions into 532,676,608 OED instructions (1,065,353,216 / 2 = 532,676,608).

-O4 vs. -O5

No instruction change.


PAPI Events Available on Blue Gene

The following is the output of the papi_avail program on Blue Gene with PAPI 3.5:

Available events and hardware information.
-------------------------------------------------------------------------
Vendor string and code   :  (1312)
Model string and code    : PVR=0x5202:0x1891  Serial=R00-M0-N0-C:J16-U01 (1375869073)
CPU Revision             : 20994.062500
CPU Megahertz            : 700.000000
CPU's in this Node       : 1
Nodes in this System     : 16
Total CPU's              : 16
Number Hardware Counters : 52
Max Multiplex Counters   : 32
-------------------------------------------------------------------------
The following correspond to fields in the PAPI_event_info_t structure.

Name            Code            Avail   Deriv   Description (Note)
PAPI_L1_DCM     0x80000000      No      No      Level 1 data cache misses
PAPI_L1_ICM     0x80000001      No      No      Level 1 instruction cache misses
PAPI_L2_DCM     0x80000002      No      No      Level 2 data cache misses
PAPI_L2_ICM     0x80000003      No      No      Level 2 instruction cache misses
PAPI_L3_DCM     0x80000004      No      No      Level 3 data cache misses
PAPI_L3_ICM     0x80000005      No      No      Level 3 instruction cache misses
PAPI_L1_TCM     0x80000006      No      No      Level 1 cache misses
PAPI_L2_TCM     0x80000007      No      No      Level 2 cache misses
PAPI_L3_TCM     0x80000008      Yes     No      Level 3 cache misses
PAPI_CA_SNP     0x80000009      No      No      Requests for a snoop
PAPI_CA_SHR     0x8000000a      No      No      Requests for exclusive access to shared cache line
PAPI_CA_CLN     0x8000000b      No      No      Requests for exclusive access to clean cache line
PAPI_CA_INV     0x8000000c      No      No      Requests for cache line invalidation
PAPI_CA_ITV     0x8000000d      No      No      Requests for cache line intervention
PAPI_L3_LDM     0x8000000e      Yes     Yes     Level 3 load misses
PAPI_L3_STM     0x8000000f      Yes     No      Level 3 store misses
PAPI_BRU_IDL    0x80000010      No      No      Cycles branch units are idle
PAPI_FXU_IDL    0x80000011      No      No      Cycles integer units are idle
PAPI_FPU_IDL    0x80000012      No      No      Cycles floating point units are idle
PAPI_LSU_IDL    0x80000013      No      No      Cycles load/store units are idle
PAPI_TLB_DM     0x80000014      No      No      Data translation lookaside buffer misses
PAPI_TLB_IM     0x80000015      No      No      Instruction translation lookaside buffer misses
PAPI_TLB_TL     0x80000016      No      No      Total translation lookaside buffer misses
PAPI_L1_LDM     0x80000017      No      No      Level 1 load misses
PAPI_L1_STM     0x80000018      No      No      Level 1 store misses
PAPI_L2_LDM     0x80000019      No      No      Level 2 load misses
PAPI_L2_STM     0x8000001a      No      No      Level 2 store misses
PAPI_BTAC_M     0x8000001b      No      No      Branch target address cache misses
PAPI_PRF_DM     0x8000001c      No      No      Data prefetch cache misses
PAPI_L3_DCH     0x8000001d      No      No      Level 3 data cache hits
PAPI_TLB_SD     0x8000001e      No      No      Translation lookaside buffer shootdowns
PAPI_CSR_FAL    0x8000001f      No      No      Failed store conditional instructions
PAPI_CSR_SUC    0x80000020      No      No      Successful store conditional instructions
PAPI_CSR_TOT    0x80000021      No      No      Total store conditional instructions
PAPI_MEM_SCY    0x80000022      No      No      Cycles Stalled Waiting for memory accesses
PAPI_MEM_RCY    0x80000023      No      No      Cycles Stalled Waiting for memory Reads
PAPI_MEM_WCY    0x80000024      No      No      Cycles Stalled Waiting for memory writes
PAPI_STL_ICY    0x80000025      No      No      Cycles with no instruction issue
PAPI_FUL_ICY    0x80000026      No      No      Cycles with maximum instruction issue
PAPI_STL_CCY    0x80000027      No      No      Cycles with no instructions completed
PAPI_FUL_CCY    0x80000028      No      No      Cycles with maximum instructions completed
PAPI_HW_INT     0x80000029      No      No      Hardware interrupts
PAPI_BR_UCN     0x8000002a      No      No      Unconditional branch instructions
PAPI_BR_CN      0x8000002b      No      No      Conditional branch instructions
PAPI_BR_TKN     0x8000002c      No      No      Conditional branch instructions taken
PAPI_BR_NTK     0x8000002d      No      No      Conditional branch instructions not taken
PAPI_BR_MSP     0x8000002e      No      No      Conditional branch instructions mispredicted
PAPI_BR_PRC     0x8000002f      No      No      Conditional branch instructions correctly predicted
PAPI_FMA_INS    0x80000030      Yes     No      FMA instructions completed
PAPI_TOT_IIS    0x80000031      No      No      Instructions issued
PAPI_TOT_INS    0x80000032      No      No      Instructions completed
PAPI_INT_INS    0x80000033      No      No      Integer instructions
PAPI_FP_INS     0x80000034      No      No      Floating point instructions
PAPI_LD_INS     0x80000035      No      No      Load instructions
PAPI_SR_INS     0x80000036      No      No      Store instructions
PAPI_BR_INS     0x80000037      No      No      Branch instructions
PAPI_VEC_INS    0x80000038      No      No      Vector/SIMD instructions
PAPI_RES_STL    0x80000039      No      No      Cycles stalled on any resource
PAPI_FP_STAL    0x8000003a      No      No      Cycles the FP unit(s) are stalled
PAPI_TOT_CYC    0x8000003b      Yes     No      Total cycles
PAPI_LST_INS    0x8000003c      No      No      Load/store instructions completed
PAPI_SYC_INS    0x8000003d      No      No      Synchronization instructions completed
PAPI_L1_DCH     0x8000003e      No      No      Level 1 data cache hits
PAPI_L2_DCH     0x8000003f      Yes     Yes     Level 2 data cache hits
PAPI_L1_DCA     0x80000040      No      No      Level 1 data cache accesses
PAPI_L2_DCA     0x80000041      Yes     Yes     Level 2 data cache accesses
PAPI_L3_DCA     0x80000042      No      No      Level 3 data cache accesses
PAPI_L1_DCR     0x80000043      No      No      Level 1 data cache reads
PAPI_L2_DCR     0x80000044      No      No      Level 2 data cache reads
PAPI_L3_DCR     0x80000045      No      No      Level 3 data cache reads
PAPI_L1_DCW     0x80000046      No      No      Level 1 data cache writes
PAPI_L2_DCW     0x80000047      No      No      Level 2 data cache writes
PAPI_L3_DCW     0x80000048      No      No      Level 3 data cache writes
PAPI_L1_ICH     0x80000049      No      No      Level 1 instruction cache hits
PAPI_L2_ICH     0x8000004a      No      No      Level 2 instruction cache hits
PAPI_L3_ICH     0x8000004b      No      No      Level 3 instruction cache hits
PAPI_L1_ICA     0x8000004c      No      No      Level 1 instruction cache accesses
PAPI_L2_ICA     0x8000004d      No      No      Level 2 instruction cache accesses
PAPI_L3_ICA     0x8000004e      No      No      Level 3 instruction cache accesses
PAPI_L1_ICR     0x8000004f      No      No      Level 1 instruction cache reads
PAPI_L2_ICR     0x80000050      No      No      Level 2 instruction cache reads
PAPI_L3_ICR     0x80000051      No      No      Level 3 instruction cache reads
PAPI_L1_ICW     0x80000052      No      No      Level 1 instruction cache writes
PAPI_L2_ICW     0x80000053      No      No      Level 2 instruction cache writes
PAPI_L3_ICW     0x80000054      No      No      Level 3 instruction cache writes
PAPI_L1_TCH     0x80000055      No      No      Level 1 total cache hits
PAPI_L2_TCH     0x80000056      No      No      Level 2 total cache hits
PAPI_L3_TCH     0x80000057      Yes     No      Level 3 total cache hits
PAPI_L1_TCA     0x80000058      No      No      Level 1 total cache accesses
PAPI_L2_TCA     0x80000059      No      No      Level 2 total cache accesses
PAPI_L3_TCA     0x8000005a      No      No      Level 3 total cache accesses
PAPI_L1_TCR     0x8000005b      No      No      Level 1 total cache reads
PAPI_L2_TCR     0x8000005c      No      No      Level 2 total cache reads
PAPI_L3_TCR     0x8000005d      No      No      Level 3 total cache reads
PAPI_L1_TCW     0x8000005e      No      No      Level 1 total cache writes
PAPI_L2_TCW     0x8000005f      No      No      Level 2 total cache writes
PAPI_L3_TCW     0x80000060      No      No      Level 3 total cache writes
PAPI_FML_INS    0x80000061      Yes     No      Floating point multiply instructions
PAPI_FAD_INS    0x80000062      Yes     No      Floating point add instructions
PAPI_FDV_INS    0x80000063      No      No      Floating point divide instructions
PAPI_FSQ_INS    0x80000064      No      No      Floating point square root instructions
PAPI_FNV_INS    0x80000065      No      No      Floating point inverse instructions
PAPI_FP_OPS     0x80000066      No      No      Floating point operations
PAPI_BGL_OED    0x80000067      Yes     No      BGL special event: Oedipus operations
PAPI_BGL_TS_32B 0x80000068      Yes     Yes     BGL special event: Torus 32B chunks sent
PAPI_BGL_TS_FULL        0x80000069      Yes     Yes     BGL special event: Torus no token UPC cycles
PAPI_BGL_TR_DPKT        0x8000006a      Yes     Yes     BGL special event: Tree 256 byte packets
PAPI_BGL_TR_FULL        0x8000006b      Yes     Yes     BGL special event: UPC cycles (CLOCKx2) tree rcv is full
-------------------------------------------------------------------------
avail.c                                  PASSED