3.7. Multithreading and Cache Coherence

Computers with multiple threads of execution, either with multiple processors, multiple cores per processor, or both, introduce additional complexity to caches. Different threads accessing the same data can now have private copies of the data in their local caches, but writes to the data by one thread must still be seen by all other threads. Some mechanism to keep the caches synchronized is needed.

The mechanism that keeps the caches synchronized is called a cache coherence protocol. There are different possible coherence protocols, but most modern processors use the MESI protocol or some variation such as the MOESI protocol. Freja therefore currently models the MESI protocol.

The MESI protocol is named after the four states of a cache line: modified, exclusive, shared and invalid:

Cache line states in the MESI protocol

M - Modified

The data in the cache line is modified and is guaranteed to only reside in this cache. The copy in main memory is not up to date, so when the cache line leaves the modified state the data must be written back to main memory.

E - Exclusive

The data in the cache line is unmodified, but is guaranteed to only reside in this cache.

S - Shared

The data in the cache line is unmodified, and there may also be copies of it in other caches.

I - Invalid

The cache line does not contain valid data.

This section is not intended to be a complete description of cache coherence, but only a quick introduction that allows you to understand the types of multithreading problems Freja can identify. More detailed descriptions of cache coherence are readily available online.

We will now go through some examples of how the MESI protocol works. For simplicity we will assume that we are running on a two-processor system where each processor has its own private cache on a single cache level:

Figure 3.1. Example System


In these examples data transfers are drawn in red, while downgrade and invalidation traffic is drawn in blue.

If a thread reads data not present in any cache, it will fetch the line into its cache in exclusive state (E):

Figure 3.2. Cache Coherence, Example 1


If a thread reads from a cache line that is in shared state (S) in another thread's cache, it fetches the cache line into its cache in shared state (S):

Figure 3.3. Cache Coherence, Example 2


If a thread reads from a cache line that is in exclusive state (E) in another thread's cache, it fetches the cache line to its cache in shared state (S) and downgrades the cache line to shared state (S) in the other cache:

Figure 3.4. Cache Coherence, Example 3


If a thread reads from a cache line that is in modified state (M) in another thread's cache, the other cache must first write-back its modified version of the cache line and downgrade it to shared state (S). The thread doing the read can then add the cache line to its cache in shared state (S):

Figure 3.5. Cache Coherence, Example 4


When a thread has a cache line in exclusive (E) or modified state (M) it can write to it with very low overhead, since it knows that no other thread can have a copy of the line that needs to be invalidated. A write to an exclusive cache line makes it modified (M):

Figure 3.6. Cache Coherence, Example 5


When a thread writes to a cache line that it has in shared state (S) it must upgrade the line to modified state (M). In order to do this it must invalidate (I) any copies of the line in other caches, so that they do not retain an outdated copy:

Figure 3.7. Cache Coherence, Example 6


When a thread writes to a cache line that it does not have in its cache, it must fetch the line and invalidate (I) it in all other caches. If another thread has a modified (M) copy of the cache line it must first write it back before the thread doing the write can fetch it.

Figure 3.8. Cache Coherence, Example 7


The above is not a complete enumeration of the coherence interactions, and in reality there are other access types to consider, such as prefetches and cache line flushes, but it should give a basic understanding of how cache coherence affects the performance of multithreaded programs.

Based on these coherence-related events, Freja presents statistics about the following expensive thread interactions:

Upgrades

The upgrade count is the number of memory accesses that cause a cache line to be upgraded from shared state to either exclusive or modified state.

There are two scenarios where this can happen. Either a thread reads a cache line into its cache in shared state, because the line is already present in another thread's cache, and then writes to it. Or, a thread has the cache line in its cache in exclusive or modified state, the line is downgraded to shared state by a read from another thread, and the thread then writes to the line.

Coherence misses

The coherence miss count is the number of memory accesses that miss because a cache line that would otherwise be present in the thread's cache has been invalidated by a write from another thread.

Coherence write-backs

The coherence write-back count is the number of memory accesses that force a cache line that is in modified state in another thread's cache to be written back.

All of the above can also be reported as ratios. They are then reported as the percentage of memory accesses that suffer from one of these interactions. For example, a coherence miss ratio of 3% means that on average 3 out of every 100 memory accesses suffer from coherence misses.