5.5. Common Data Structures

Certain data structures, such as linked lists, trees and hash tables, typically have quite bad cache behavior. There are two reasons for this. Firstly, individual elements are often allocated dynamically, which means that housekeeping data for the memory allocator is placed between the elements. This causes poor cache line utilization.

Secondly, these data structures often cause random access patterns (see Section 5.2.2, “Random Access Pattern”), either because of dynamic memory allocation or because of the way the data structures are traversed.

When using data structures from a library, for example the STL library in C++, one thing to consider is that they have been designed to work reasonably well in all cases, especially with very large numbers of elements. This means that for structures with small numbers of elements they often add unnecessary overhead. Implementing your own data structure, adapted specifically to your application, may provide significant performance gains.

Generally, a structure that is never, or very rarely, changed can be more efficiently implemented than a structure that needs to support efficient updating. Supporting efficient changes often requires pointer indirections between elements and dynamic memory allocation of individual elements. This can often be avoided in a static representation.

Sometimes a structure is used in two phases, first being filled in with data and then only being used for lookups. It may then pay off to first use a representation that supports efficient updating, and then convert it to a more efficient read-only data structure once it has been filled in.
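
As a minimal sketch of this two-phase pattern, assuming the elements are plain integer keys (all names here are made up for the example): append keys to a growable array while the structure is being filled in, then sort it once and serve the lookup phase with binary searches.

#include <stdlib.h>

struct int_table {
    int *keys;
    size_t count;
    size_t capacity;
};

/* Update phase: cheap appends, no ordering is maintained.
 * (Error handling is omitted in this sketch.) */
void table_add(struct int_table *t, int key)
{
    if (t->count == t->capacity) {
        t->capacity = t->capacity ? 2 * t->capacity : 16;
        t->keys = realloc(t->keys, t->capacity * sizeof *t->keys);
    }
    t->keys[t->count++] = key;
}

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Conversion: one sort, after which the table is read-only */
void table_freeze(struct int_table *t)
{
    qsort(t->keys, t->count, sizeof *t->keys, cmp_int);
}

/* Lookup phase: binary search in the packed, sorted array */
int table_contains(const struct int_table *t, int key)
{
    return bsearch(&key, t->keys, t->count,
                   sizeof *t->keys, cmp_int) != NULL;
}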

5.5.1. Arrays

From a cache performance point of view, a linear search in an array is close to an ideal workload. The linear search has good spatial locality, and the regular access pattern means that the hardware prefetcher can effectively prefetch the accessed data.

Elements in an array are also efficiently packed. Only a single memory allocation is done for all the elements, and the elements are stored contiguously in memory. In comparison, a linked list often uses more memory for various kinds of overhead than for the data it actually stores.

The drawback of a linear search is of course that the number of accesses that need to be made grows linearly with the number of elements, so it does not work well for large numbers of elements. However, the low overhead usually makes the performance superior to more complex structures when there are only a few elements, and depending on cache pressure and other factors it can still be competitive up to a few tens of elements.
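
As a minimal illustration (the function name is made up for the example):

#include <stddef.h>

/* The array is scanned front to back, a regular access pattern
 * that the hardware prefetcher handles well. */
int linear_find(const int *arr, size_t n, int value)
{
    for (size_t i = 0; i < n; i++)
        if (arr[i] == value)
            return (int)i;    /* found, return the index */
    return -1;                /* not found */
}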

If all, or the vast majority, of the accesses to a structure with a large number of elements are lookups, using a binary search in a sorted array may be more efficient than using a search tree structure. For lookups the binary search offers the same logarithmic time complexity as a search tree. However, the elements in an array are much more efficiently packed than a tree with dynamically allocated nodes, and it will therefore achieve much better cache performance.
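
As an illustration, assuming records keyed by an integer (the record layout and names are assumptions of this sketch), the sorted-array lookup can be as simple as:

#include <stddef.h>

struct record {
    int key;
    int value;
};

/* Binary search in a sorted array of records. The records are
 * stored back to back, so each cache line fetched during the
 * search holds several of them, unlike individually allocated
 * tree nodes. */
struct record *find(struct record *recs, size_t n, int key)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (recs[mid].key < key)
            lo = mid + 1;
        else if (recs[mid].key > key)
            hi = mid;
        else
            return &recs[mid];
    }
    return NULL;
}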

5.5.2. Linked Lists

Linked lists are commonly plagued by two problems: they cause a high degree of memory overhead, and they cause random access patterns.

A linked list typically allocates its elements dynamically, which causes some memory management overhead for each element. Each element also has to contain a pointer to the next element, and possibly to the previous element. On a 64-bit processor this can easily add up to 16-32 bytes of overhead per element. This means that in, for example, a linked list of pointers, typically only a third or less of the memory is actually used for the stored data; the rest is overhead.
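
As a worked example, assuming a 64-bit platform with 8-byte pointers and roughly 16 bytes of allocator housekeeping per allocation (the allocator overhead is an assumption; it varies between allocators):

/* A doubly linked list node storing a single pointer of data */
struct list_node {
    struct list_node *next;   /*  8 bytes of overhead */
    struct list_node *prev;   /*  8 bytes of overhead */
    void *data;               /*  8 bytes of payload  */
};

/* Adding the assumed ~16 bytes of allocator housekeeping per
 * malloc, each element costs about 40 bytes of memory for
 * 8 bytes of stored data, so only about a fifth of the memory
 * holds actual data. */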

Dynamic memory allocation also means that the list elements may be spread out in memory. Even though the list elements can only be accessed sequentially, the memory access pattern when the list is traversed may therefore still be irregular.

The memory layout depends on the memory allocator, as well as on the algorithms that assemble and manipulate the list. If a linked list is causing performance problems, look over how its nodes are allocated.

If the list is never updated, it should be replaced with an array. This may also be true if the list is updated only rarely, or contains only a few elements. If the list is changed now and then, but a part of the application repeatedly reads it between the changes, it may even pay off to copy the list to a temporary array for use in that part of the application.
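
A minimal sketch of the copy-to-array idea, assuming a singly linked list of int payloads (the node layout and names are assumptions of this sketch):

#include <stdlib.h>

struct int_node {
    struct int_node *next;
    int data;
};

/* Copy the list payload into a temporary, contiguous array so
 * that the read-heavy part of the application can iterate with
 * good spatial locality. The caller frees the array. */
int *list_to_array(const struct int_node *head, size_t *n_out)
{
    size_t n = 0, i = 0;
    const struct int_node *p;
    int *arr;

    for (p = head; p != NULL; p = p->next)
        n++;
    arr = malloc(n * sizeof *arr);
    if (arr == NULL)
        return NULL;
    for (p = head; p != NULL; p = p->next)
        arr[i++] = p->data;
    *n_out = n;
    return arr;
}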

Another way to hide latencies in linked lists is to add to each element a pointer to a node several steps further ahead, and to use this pointer to prefetch ahead when traversing the list:

struct node {
    struct node *next;
    struct node *prefetch;
    int data;
};

/* Traverse a list and populate prefetch hints to point
 * distance steps ahead.
 */
void prepare_prefetch_hint(struct node *head, int distance)
{
    struct node *q, *p;

    /* Advance p until it is distance nodes ahead of head */
    for (p = head; p != NULL && distance > 0; p = p->next)
        distance--;

    /* Let each node point to the node distance steps ahead of it */
    for (q = head; p != NULL; p = p->next, q = q->next)
        q->prefetch = p;

    /* The last distance nodes have no node to prefetch */
    for (; q != NULL; q = q->next)
        q->prefetch = NULL;
}

void traverse(struct node *head)
{
    struct node *p;

    for (p = head; p != NULL; p = p->next) {
        /* prefetch() is assumed to expand to a prefetch instruction,
         * e.g. __builtin_prefetch() with GCC. Prefetching NULL at the
         * end of the list is harmless, as prefetches have no side
         * effects. */
        prefetch(p->prefetch);
        access(p->data);
    }
}

/* Call once after the list has been updated */
prepare_prefetch_hint(head, 8);

/* Assuming many traversals between each update */
traverse(head);

Since prefetch instructions have no side effects, the prefetch pointers do not need to be kept completely accurate at all times. If the list is frequently changed, it may be enough to update them only now and then.

The tricky part is to determine how many elements ahead to prefetch. The optimal distance is a function of memory latencies, cache size and how much processing is done for each node. Freja will help you determine the right distance.

5.5.3. Trees

Trees, like linked lists, cause a lot of memory overhead through dynamic memory allocation and pointers between nodes, and the dynamically allocated nodes may be spread out in memory.

Tree operations also have an inherently random access pattern, since element lookups and changes will follow different paths through the tree in an irregular fashion. Rebalancing operations also keep rearranging the tree layout as the tree is being modified.

Iterating through the elements in the sort order of the keys is very inefficient compared to iterating over a sorted array, so if that is done repeatedly between changes to the tree, it may pay off to copy the tree contents to a temporary array and iterate over that instead.
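
A sketch of that conversion, assuming a plain binary search tree with int keys (the node layout and names are assumptions of this sketch):

#include <stddef.h>

struct tree_node {
    struct tree_node *left;
    struct tree_node *right;
    int key;
};

/* In-order traversal that copies the keys into out[], which the
 * caller has sized to hold all elements. Returns the index one
 * past the last element written. */
size_t tree_to_array(const struct tree_node *t, int *out, size_t i)
{
    if (t == NULL)
        return i;
    i = tree_to_array(t->left, out, i);
    out[i++] = t->key;
    return tree_to_array(t->right, out, i);
}

Iterating over the resulting array then touches memory sequentially instead of chasing pointers through the tree.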

If the tree contents are static, it may be worthwhile to spend some time arranging the nodes so that adjacent nodes also reside close to each other in memory. Replacing the tree lookups with binary searches in a sorted array will provide the same logarithmic time complexity, but with less memory overhead and therefore better cache line utilization and performance.

It may also be possible to collapse several layers of a tree, or a cluster of graph nodes, into a more densely allocated sub-structure.
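
For example, three levels of a binary tree could be collapsed into a single node holding seven keys in one allocation, similar to a B-tree node (the exact layout below is only an assumption for illustration):

/* One fat node replacing three levels (seven nodes) of a binary
 * tree. The keys share one allocation and typically one or two
 * cache lines, and a lookup follows one pointer instead of three. */
struct fat_node {
    int keys[7];                   /* implicit 3-level binary tree */
    struct fat_node *children[8];  /* subtrees below the collapsed levels */
};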

5.5.4. Hash Tables

Hash tables often suffer from random access patterns and poor cache line utilization. One reason for the random access patterns can be the hash function itself. Hash functions are often designed to map keys to more or less random indexes in the hash table to avoid collisions. However, this means that lookups will also access randomly distributed indexes. If you know that some keys are likely to be looked up in sequence, try to map those keys to adjacent indexes.

For example, if the hash key is an integer and you know that lookups are likely to be done on sequential keys, just using the key modulo the hash table size will work well. The sequential keys will then map to sequential locations in the hash table.
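
A sketch of such a hash function, assuming non-negative integer keys and a table_size equal to the number of slots:

#include <stddef.h>

/* Sequential keys k, k+1, k+2, ... map to sequential slots, so
 * lookups on neighboring keys touch neighboring table entries. */
size_t hash(int key, size_t table_size)
{
    return (size_t)key % table_size;
}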

A random access pattern is often enough to cause poor cache line utilization by itself, but another factor in hash tables is the fill ratio. Hash tables are often sized to be larger than the number of elements to reduce the number of collisions. This leaves unused indexes in the table, lowering the cache line utilization. Decreasing the size of the table will increase the number of collisions, but may pay off in increased cache line utilization and reduced cache misses.

In hash tables using collision lists, these lists may cause the problems described in Section 5.5.2, “Linked Lists”. Replacing them with arrays may improve performance.
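
One way to do that, sketched below under assumed names: store each bucket's entries in a small fixed-capacity array embedded in the table, so that colliding entries sit next to each other in memory (overflow handling is omitted from the sketch):

#include <stddef.h>

#define BUCKET_CAPACITY 4

struct entry {
    int key;
    int value;
};

/* A bucket is a small, fixed-capacity array instead of a linked
 * list. Scanning it on a collision is a short linear search over
 * contiguous memory, typically within one or two cache lines. */
struct bucket {
    int count;
    struct entry entries[BUCKET_CAPACITY];
};

struct entry *bucket_find(struct bucket *b, int key)
{
    for (int i = 0; i < b->count; i++)
        if (b->entries[i].key == key)
            return &b->entries[i];
    return NULL;
}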