2.3. Common Report Settings

Some of the performance problems that Freja looks for are more visible on some cache levels than on others, depending on the cache topology, the number of caches, and which processor features exist at each cache level.

For reporting purposes, Freja considers one cache level at a time. As a consequence, it focuses on the problems that are visible on the selected cache level and disregards suggestions that do not apply at that level.

Tip

To get the full picture, it may be necessary to prepare multiple reports, one for each cache level.
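
For example, on a processor with three cache levels, this could be done by running the report command once per level, using the --level option described below (a sketch; the three-level hierarchy is an assumption and the remaining arguments are elided):

$ report --level 1 ...
$ report --level 2 ...
$ report --level 3 ...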

This section outlines some common analysis scenarios and explains the corresponding settings.

2.3.1. Cache Performance

From a performance optimization perspective, it makes sense to start optimizing with respect to the highest-level cache. There are at least two reasons for this.

A cache miss in the last cache level is much more expensive than a cache miss in the first cache level, since it has to go all the way out to memory to satisfy the memory access. Consequently, if such misses can be avoided, the benefit is most noticeable when optimizing at the highest level.

The highest cache level is also the largest cache. It is easier to make the data set fit in the highest-level cache than in the smaller, lower-level caches.

This is the default analysis mode of Freja. No extra parameters are required.
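
For instance, a report that targets the highest cache level needs no cache-level arguments at all. As a sketch, reusing the Istanbul processor model from the inter-socket example below:

$ report --cpu amd/istanbul ...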

2.3.2. Analysis of software prefetch instructions

Depending on the processor, the software prefetch instructions may fetch data into the lowest cache level or into some outer level.

AMD processors have typically fetched data into the L1 cache, while Intel processors normally target L2. Generation of prefetch-related advice is only active when preparing a report for the corresponding cache level.

Select the target cache level using --level cache-level. If that cache level is not affected by prefetch instructions, the output from Freja will include a warning to that effect.
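
For example, to have prefetch-related advice generated for an Intel processor that prefetches into L2, a report for cache level 2 could be prepared as follows (a sketch, reusing the Yorkfield model from the threading example below):

$ report --cpu intel/yorkfield_4_12288 --level 2 ...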

2.3.3. Threading Advice - Inter-Cache Communication

ThreadSpotter™ offers advice and statistics on how well multithreaded programs communicate among their caches. The analysis requires that there are several caches at the analyzed cache level; if there were only one cache, there would be no other cache to exchange data with.

ThreadSpotter™ therefore only presents communication-related advice if:

  • The application has multiple threads.

  • The application threads use memory to share data with each other (as opposed to using operating system communication channels to exchange data). That is, one thread writes to, and another thread reads from, a particular place in memory.

  • ThreadSpotter™ deduces, or is told, that there is more than one cache at the specified cache level.

The last point depends on which processor model is used, which cache level is targeted, and whether the user overrides the number of caches to consider.

Note

ThreadSpotter™ assumes one processor unless you tell it otherwise. If this is not the case, use the setting --number-of-caches number to inform ThreadSpotter™ of the total number of caches to assume.

Example: If you are interested in finding problems related to the communication traffic between L1 caches on a dual-socket, quad-core Intel system (Yorkfield), where each socket has four private L1 caches, use the following parameters:

$ report --cpu intel/yorkfield_4_12288 --number-of-caches 8 --level 1 ...

2.3.4. Threading Advice - Inter-Socket Communication

Similarly, analyzing inter-socket communication requires providing ThreadSpotter™ with the total number of top-level caches in the system.

Example: The AMD Opteron 2427, codenamed Istanbul, has six cores and a shared L3 cache. To analyze the communication between sockets in a system consisting of two such processors, use the following command:

$ report --cpu amd/istanbul --number-of-caches 2 --level 3 ...

This would cause ThreadSpotter™ to distribute the threads of the application onto two L3 caches. The resulting traffic would correspond to the communication between the two processors.