5.2. Data Access Pattern Problems

Data access patterns also affect cache- and memory-related performance. Changing the order in which data items are accessed can be as rewarding as optimizing the data layout. There are several variations of inefficient access patterns, each with a different set of remedies.

5.2.1. Inefficient Loop Nesting

Inefficient loop nesting is a problem specific to algorithms that work on multidimensional arrays. If the array is traversed along the wrong axis, only one element is used from each cache line before the program continues to the next cache line.

For example, consider this small snippet of C code iterating over a 2-dimensional matrix:

double array[SIZE][SIZE];

/* The inner loop steps down a column: consecutive accesses are a whole
   row apart in memory and each lands in a different cache line. */
for (int col = 0; col < SIZE; col++)
  for (int row = 0; row < SIZE; row++)
    array[row][col] = f(row, col);
      

In one iteration of the outer loop, the inner loop accesses one element in each row of a specific column. For example, for an 8-by-8 matrix:

Figure 5.10. Inefficient Loop Nesting


For each element touched by the loop, a new cache line is accessed. The first level cache of a processor may not hold more than a few hundred cache lines. If the array has more than a couple of hundred rows, the rows from the top of the matrix may already have been evicted from the cache by the time the bottom of the matrix is reached. The next iteration of the outer loop then has to reload all of the cache lines, causing a cache miss for each element that is touched. This is completely devastating to the performance of the loops.
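For example, assuming a 32 KB first level data cache with 64-byte cache lines, the cache holds at most 32768 / 64 = 512 cache lines. A matrix with more than 512 rows can then not keep one cache line per row resident across an iteration of the outer loop, and limited associativity will in practice cause evictions even earlier.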

Since only one element from each fetched cache line is used, incorrect loop nesting also wastes a lot of memory bandwidth on fetching data that isn't used.

The problem can be fixed by changing the nesting of the loops so that they traverse the matrix along the rows instead of along the columns.

double array[SIZE][SIZE];

/* The inner loop now walks along a row, so consecutive accesses fall
   in the same cache line until all of its elements have been used. */
for (int row = 0; row < SIZE; row++)
  for (int col = 0; col < SIZE; col++)
    array[row][col] = f(row, col);
      

Each iteration of the outer loop now uses all of the elements in each cache line it fetches, and later iterations do not have to fetch the same cache lines again, greatly reducing the number of cache misses and cache line fetches.

Figure 5.11. Efficient Loop Nesting


Accessing data sequentially also helps the hardware prefetcher to do its job well. This will maximize the use of available memory bandwidth.

If it is not possible to change the loop nesting without changing the result of the calculation, an alternative is to transpose the matrix. By changing the data layout the access patterns become more regular, making better use of fetched cache lines and helping the hardware prefetcher do a better job. The cost of the incorrect loop nesting may far outweigh the cost of transposing the matrix.

Incorrect loop nesting is not exclusive to 2-dimensional matrices. The same effect can occur when traversing any regular multi-dimensional array, for example, a 3-dimensional matrix.
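For example, a cache-friendly traversal of a 3-dimensional array in C keeps the last index in the innermost loop, since elements that differ only in the last index are adjacent in memory (the function g is a placeholder):

double cube[SIZE][SIZE][SIZE];

for (int i = 0; i < SIZE; i++)
  for (int j = 0; j < SIZE; j++)
    for (int k = 0; k < SIZE; k++)   /* last index varies fastest */
      cube[i][j][k] = g(i, j, k);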

Fortran organizes its multi-dimensional arrays in the opposite direction compared to C, so if you are programming in Fortran you would want to iterate over the matrix along the columns instead of along the rows.

5.2.2. Random Access Pattern

Random memory access patterns are generally harmful to cache and memory performance. By random access patterns we do not mean truly random accesses, but accesses that are not to sequential addresses, and not to recently used data or data close to it.

As described in Chapter 3, Introduction to Caches, caches work on the expectation that data that has recently been touched, and data close to it, is likely to be accessed again. Random access patterns are bad because they generally contain little such reuse, which means there is a low probability of finding the accessed data in the cache.

Random access patterns often also lead to a low cache line utilization. Individual data elements in cache lines may be accessed without touching the rest of the cache line. This increases the fetch ratio and memory bandwidth requirement of the application in relation to the amount of data it actually uses.

The hardware prefetcher present in modern processors also relies on finding regular access patterns to determine what data to prefetch, and thereby hide memory access latencies. Random access patterns therefore make the hardware prefetcher ineffective. Certain types of irregular access patterns may even trick the hardware prefetcher into prefetching data that is not useful, wasting memory bandwidth.

Even accessing main memory is faster when the access is to an address close to the last address that was accessed.

Random access patterns can originate from different sources:

  • Data structures
  • Dynamic memory allocation
  • Algorithms

Some data structures inherently cause random access patterns.

For example, hash tables are often designed to spread elements randomly throughout the table to avoid collisions. Element lookups in the table therefore cause a random access pattern.
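A minimal sketch of why, assuming a simple array-based hash table (entry_t, hash() and lookup() are hypothetical):

typedef struct { const char *key; int value; } entry_t;

#define TABLE_SIZE 65536

/* Toy string hash, for illustration only. */
static unsigned hash(const char *key) {
  unsigned h = 5381;
  while (*key)
    h = h * 33 + (unsigned char)*key++;
  return h;
}

/* Lookups of even logically related keys land on unrelated slots, so
   each lookup typically touches a new, cold cache line. */
entry_t *lookup(entry_t table[], const char *key) {
  return &table[hash(key) % TABLE_SIZE];
}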

Tree structures are another common source of random access patterns. Lookups of different keys traverse different paths through the tree, resulting in seemingly random memory access patterns.

Replacing such data structures with more cache-friendly ones may provide performance improvements. See Section 5.5, “Common Data Structures” for more information.

Dynamic memory allocation can contribute to placing related data objects far away from each other. Memory allocators usually attempt to allocate memory regions sequentially, but as the application runs and memory on the heap is allocated, deallocated and allocated again, the heap may become fragmented. Allocated memory regions may then be spread out over the heap.

Data structures that are allocated incrementally while the application runs are also likely to be spread out over the address space because of other allocations made by the application.

An example of a random access pattern caused by dynamic memory allocation is a linked list created from newly allocated memory regions. The memory regions may be spread out more or less randomly through the address space, so traversing the list causes memory accesses in an equally random pattern. If elements are incrementally added to the list during execution, heap fragmentation is likely to make the problem worse.

If random memory access patterns caused by dynamic memory allocation are a problem, it may be possible to implement a custom memory allocator that places related data close together. For example, if a data structure is allocated incrementally during execution, you could allocate a pool of memory for it when the application starts and then serve the incremental allocations from that pool.
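A minimal sketch of the pool approach, assuming a linked list with fixed-size nodes (node_t, POOL_NODES, pool_init() and node_alloc() are hypothetical names):

#include <stdlib.h>

typedef struct node { struct node *next; double payload; } node_t;

#define POOL_NODES 100000   /* sized for the expected number of nodes */

static node_t *pool;        /* one contiguous region for all list nodes */
static size_t pool_used;

/* Allocate the whole pool up front, so that nodes handed out later end
   up adjacent in memory instead of scattered over a fragmented heap. */
static int pool_init(void) {
  pool = malloc(POOL_NODES * sizeof *pool);
  pool_used = 0;
  return pool != NULL;
}

/* Hand out nodes sequentially from the pool, falling back to plain
   malloc() if the pool is exhausted. */
static node_t *node_alloc(void) {
  if (pool && pool_used < POOL_NODES)
    return &pool[pool_used++];
  return malloc(sizeof(node_t));
}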

Finally, some algorithms inherently cause irregular access patterns.

Many graph algorithms are examples of this. If you represent a graph as an adjacency matrix, the data structure itself is very regular. However, a depth-first or breadth-first search of the graph causes irregular accesses to the adjacency matrix as the algorithm follows the edges between the graph nodes.

Fixing random access patterns caused by algorithms is generally hard. There may be alternative algorithms with better cache behaviour, but they may be inferior in other ways. Sometimes it may be possible to adapt the data structure to the algorithm. For example, an adjacency matrix may be reordered so that connected nodes are close to each other, so that following the edges results in a more regular access pattern.

5.2.3. Unexploited Data Reuse Opportunities

To use a cache optimally an application should, once it has loaded a cache line into the cache, perform as much work as possible involving that cache line before it moves on to other data. This way the number of times the cache line has to be loaded into the cache is minimized, and thereby the number of cache line fetches and the probability of cache misses are reduced.

There are two basic techniques to achieve this. If the data set is large, work on a small part of it at a time so that that part is reused as much as possible. If several operations are to be performed on a data set, perform them all at once on each element instead of performing them one at a time on all elements.

Freja currently detects two types of such data reuse opportunities: blocking and loop fusion.

5.2.3.1. Blocking

Blocking refers to a class of techniques that aim to break down the original problem into small chunks, so that all data needed to process each chunk fits in the cache. This way that data can be optimally reused before the next chunk is processed.

Blocking can be performed to optimize reuse of adjacent data (spatial blocking), to optimize reuse of the same data (temporal blocking), or both.

An example of a problem where spatial blocking can be used is creating a transposed copy of a matrix:

void transpose(double dst[SIZE][SIZE], double src[SIZE][SIZE]) {
  int i, j;
  for (i = 0; i < SIZE; i++)
    for (j = 0; j < SIZE; j++)
      dst[j][i] = src[i][j];  /* src read by row, dst written by column */
}
        

We see that the src matrix is read one row at a time, while the dst matrix is written one column at a time. If the matrix is large enough there will not be enough cache lines to hold all the rows of the dst matrix touched in one iteration of the outer loop, and we get the very inefficient access pattern described in Section 5.2.1, “Inefficient Loop Nesting”.

We could change the nesting of the loops as suggested there:

void transpose(double dst[SIZE][SIZE], double src[SIZE][SIZE]) {
  int i, j;
  for (j = 0; j < SIZE; j++)
    for (i = 0; i < SIZE; i++)
      dst[j][i] = src[i][j];
}
        

However, while this fixes the problem with the dst matrix, the src matrix is now read one column at a time, so we have only moved the problem.

What we can do is to only copy a few columns from the src matrix at a time, so that we know there are enough cache lines to hold the corresponding rows in the dst matrix. Each such set of columns becomes a block.

void transpose(double dst[SIZE][SIZE], double src[SIZE][SIZE]) {
  int jb, i, j;
  /* Copy BLOCK columns of src, that is, BLOCK rows of dst, per block. */
  for (jb = 0; jb < SIZE; jb += BLOCK)
    for (i = 0; i < SIZE; i++)
      for (j = jb; j < jb + BLOCK && j < SIZE; j++)
        dst[j][i] = src[i][j];
}
        

The variable jb now keeps track of which columns are being copied from the source matrix in each block. In the first iteration of the outermost loop the columns 0 to BLOCK-1 are copied, in the next iteration the columns BLOCK to 2*BLOCK-1 are copied, and so on.

By choosing a small enough value for the block size BLOCK we can make sure that the corresponding rows written to the dst matrix in each iteration of the middle loop fit in the cache.
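For example, assuming 64-byte cache lines holding 8 doubles each, one iteration of the middle loop writes one element in each of BLOCK different rows of dst, each row on its own cache line, and each of those lines is reused for 8 consecutive values of i. A block size such as BLOCK = 64 therefore keeps the active working set at roughly 64 cache lines for dst plus another 8 for src, well within a typical first level cache.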

Temporal blocking is sometimes possible for iterative algorithms where data produced during one iteration is used in the next. By performing several iterations on a sub-problem that fits in the cache, the values from the previous iteration remain in the cache when the next iteration needs them.

The cookbook way of performing blocking is to focus on reuse of a target structure and work outwards through the loop levels until reaching a loop level that causes all elements of the target structure to be accessed in each iteration. Loops inside this outer loop level are candidates to be broken up. Consider the following trivial example:

for (int iter = 0; iter < ITER; iter++) // "outer" loop
  for (int j = 0; j < SIZE; j++)
    func(a[j]);
        

Each iteration of the iter loop causes each element of a to be accessed. Hence we declare the iter loop to be the outer loop. Loop levels interior to this loop, namely for j in this case, are candidates to be blocked. Upon blocking, one new loop level is placed outside the outer loop, and the original loop is kept in its original location, restricted to work on only a subset of the data for each iteration of the new outermost loop:

for (int jj = 0; jj < SIZE; jj += BLOCK)
  for (int iter = 0; iter < ITER; iter++) // "outer" loop
    for (int j = jj; j < jj + BLOCK && j < SIZE; j++)
      func(a[j]);
        

Blocking will cause calculations to occur in a different order than in the original code, and sometimes this is fine. Many numerical algorithms can be reformulated to maintain accumulators to hold partial results between processing of different blocks. Some algorithms, however, will not lend themselves to blocking that easily.

One famous example is blocking a matrix multiplication. The unoptimized code looks like:

for (int i = 0; i < SIZE; i++)
  for (int j = 0; j < SIZE; j++)
    c[i][j] = 0;

for (int i = 0; i < SIZE; i++) // "outer" loop
  for (int j = 0; j < SIZE; j++)
    for (int k = 0; k < SIZE; k++)
      c[i][j] += a[i][k] * b[k][j];
        

The array b is traversed in the wrong way, and we should try to block with respect to it. Note that the for i loop is the outer loop level that causes all elements of b to be revisited. The for j and for k loops are therefore candidates to block. This can be done in different ways, for instance by splitting only the k loop:

for (int i = 0; i < SIZE; i++)
  for (int j = 0; j < SIZE; j++)
    c[i][j] = 0;

for (int kk = 0; kk < SIZE; kk += BLOCK)
  for (int i = 0; i < SIZE; i++) // "outer" loop
    for (int j = 0; j < SIZE; j++)
      for (int k = kk; k < kk + BLOCK && k < SIZE; k++)
        c[i][j] += a[i][k] * b[k][j];
        

Or you could even split all the loops, possibly with different blocking factors:

for (int i = 0; i < SIZE; i++)
  for (int j = 0; j < SIZE; j++)
    c[i][j] = 0;

for (int ii = 0; ii < SIZE; ii += BLOCK_I)
  for (int kk = 0; kk < SIZE; kk += BLOCK_K)
    for (int jj = 0; jj < SIZE; jj += BLOCK_J)
      for (int i = ii; i < ii + BLOCK_I && i < SIZE; i++)
        for (int k = kk; k < kk + BLOCK_K && k < SIZE; k++)
          for (int j = jj; j < jj + BLOCK_J && j < SIZE; j++)
            c[i][j] += a[i][k] * b[k][j];
        

When blocking, special care needs to be taken to look for data dependencies. If one iteration depends on values produced in an earlier iteration, the boundary zones of the blocks must be given special treatment to preserve the application semantics.

The blocking factor should be chosen to make the active problem size fit inside the target cache size. Different loop levels can be blocked to fit different cache levels.
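For example, assuming a 32 KB first level data cache and equal blocking factors B in the matrix multiplication above, the active working set is roughly three B-by-B tiles of doubles, or 3 * B * B * 8 bytes. Requiring 3 * B * B * 8 <= 32768 gives B <= 36, so a choice such as BLOCK_I = BLOCK_J = BLOCK_K = 32 would target the first level cache.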

5.2.3.2. Loop Fusion

When two loops use the same data it is beneficial to reduce the amount of data accessed between the loops. Cache lines fetched into the cache by the first loop may then still be in the cache when the second loop is run, so that it doesn't have to fetch them. The less data touched between the two loops, the better. Ultimately, if it is possible to completely merge the loop bodies, then all potential misses from the second loop will disappear.

It is not always possible to move operations between loops. Special care must be taken not to move any data access past a write access to the same address, since that would change the meaning of the program.
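A minimal sketch of such a barrier (the arrays are hypothetical):

for (i = 0; i < SIZE; i++)
  a[i] = b[i];

for (i = 0; i < SIZE; i++)
  c[i] = a[SIZE - 1 - i];   /* reads a[] in reverse order */

Naively merging the loop bodies would make the second statement read elements of a that the first statement has not yet written, changing the result.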

Freja will identify such accesses that may preclude loop fusion. However, there are cases where it may fail, for example when a dependence is carried in a processor register and never stored to memory, or when Freja's sparse sampling misses a dependency.

You must therefore always verify that the loops really are fusible, even if Freja suggests fusion.

Depending on the data dependencies at hand, there are several variations of how loop fusion can be achieved:

  • Moving instructions down from the first loop into the second loop.
  • Moving instructions up from the second loop into the first loop.
  • Moving instructions down from the second loop into the next iteration, into the first loop.
  • Moving instructions up from the first loop into the previous iteration, into the second loop.
  • Performing a partial loop merge, moving only the instructions that have no data dependencies, in any of the directions listed above.

Note that since there are four different directions to move the instructions, the code motion barriers are evaluated independently for each potential motion direction.

Consider this example:

double vector[SIZE];
double vector2[SIZE];
int i;

for (i = 0; i < SIZE; i++)
  vector[i] = i;

f(vector2);

for (i = 0; i < SIZE; i++)
  vector[i] *= vector2[i];
        

These two loops both use all elements in vector. The first loop always misses in the cache, assuming that vector has not been accessed before.

If vector is smaller than the available cache, it might still be in the cache when the second loop starts, but that depends on how much data is accessed in the function f. One thing to consider is whether the call to f can be moved out of the way. We can't tell from this snippet whether that is possible, since the function body is not shown.

Assuming that f only reads the elements of vector2, it would be possible to move the call to f before the first loop, increasing the likelihood that data remains in the cache between the loops.

double vector[SIZE];
double vector2[SIZE];
int i;

f(vector2);

for (i = 0; i < SIZE; i++)
  vector[i] = i;

for (i = 0; i < SIZE; i++)
  vector[i] *= vector2[i];
        

If vector is larger than the available cache the second loop will still suffer from cache misses when accessing vector, even though f has been moved away from between the loops. We should therefore try to merge the two loops.

double vector[SIZE];
double vector2[SIZE];
int i;

f(vector2);

for (i = 0; i < SIZE; i++) {
  vector[i] = i;
  vector[i] *= vector2[i];
}
        

Now Freja may indicate another fusion possibility, between a potential loop inside f and the one remaining loop. In that case the programmer needs to decide whether it is worth duplicating the contents of f to merge these loops, sacrificing legibility for performance.

5.2.3.3. Expert Loop Fusion

A variation of loop fusion occurs when there are multiple possible fusions targeting the same loop. The loop body may need to be fused multiple times in order to maintain proper program semantics. Consider this example:

for (i = 0; i < SIZE; i++) {
  vector[i] = i;
}

if (do_this) {
  for (i = 0; i < SIZE; i++) {
    vector[i]++;
  }
} else {
  for (i = 0; i < SIZE; i++) {
    vector[i]--;
  }
}
	

In this case, the do_this test is independent of the loops and could be moved before the whole section. But then the first loop needs to be duplicated in each of the then and else clauses. This creates two loop fusion opportunities, which can be resolved like this:

if (do_this) {
  for (i = 0; i < SIZE; i++) {
    vector[i] = i; /* duplicated line */
    vector[i]++;
  }
} else {
  for (i = 0; i < SIZE; i++) {
    vector[i] = i; /* duplicated line */
    vector[i]--;
  }
}
	

The Expert Loop Fusion advice expresses this analysis as one step by identifying all the candidate fusion targets at once. In principle there can be multiple "first" loops and multiple "second" loops, and all of them are identified together.