Freja

Manual

Version @SVN_REVISION@

2016-11-28

All rights reserved.


ParaTools, Inc.
2836 Kincaid St.
Eugene, OR 97405


Table of Contents

1. Introduction
1.1. Overview
1.2. Technology
1.3. Limitations
2. Running Freja
2.1. Using the Graphical User Interface
2.1.1. Sampling an Application
2.1.2. Generating a Report from a Sample File
2.1.3. Sampling and Generating a Report
2.1.4. Viewing an Existing Report
2.1.5. Advanced Sampling Settings
2.1.6. Advanced Report Settings
2.1.7. Using a Different Browser
2.1.8. Using Firefox on Multiple Computers
2.2. Using the Command Line Tools
2.2.1. Sampling an Application
2.2.2. Creating a Report
2.2.3. Viewing a Report
2.3. Common Report Settings
2.3.1. Cache Performance
2.3.2. Analysis of software prefetch instructions
2.3.3. Threading Advice - Inter-Cache Communication
2.3.4. Threading Advice - Inter-Socket Communication
2.4. Advanced Use
2.4.1. Burst Sampling
2.4.2. Sampling Start Conditions
2.4.3. Sampling Stop Conditions
2.4.4. Sample Files
2.5. Running Freja in a Virtualized Environment
3. Introduction to Caches
3.1. Motivation for Caches
3.2. Cache Lines and Cache Size
3.3. Replacement Policies
3.4. Cache Misses
3.5. Data Locality
3.6. Prefetching
3.6.1. Software Prefetching
3.6.2. Hardware Prefetching
3.7. Multithreading and Cache Coherence
3.8. Fetch Ratio
3.9. Upgrade Ratio
3.10. Write-Back Ratio
3.11. Memory Bandwidth
4. Freja Concepts
4.1. Issues
4.2. Loops
4.3. Instruction Groups
4.4. Last Writer
4.5. Fetch Utilization
4.6. Write-Back Utilization
4.7. Communication Utilization
4.8. Utilization Corrected Fetch Ratio
4.9. Utilization Corrected Write-Back Ratio
4.10. Hardware Prefetch Probability
4.11. Access Randomness
4.12. Call Stack
4.13. Sample Period
5. Memory Performance Problems and Solutions
5.1. Data Layout Problems
5.1.1. Partially Used Structures
5.1.2. Too Large Data Types
5.1.3. Alignment Problems
5.1.4. Dynamic Memory Allocation
5.2. Data Access Pattern Problems
5.2.1. Inefficient Loop Nesting
5.2.2. Random Access Pattern
5.2.3. Unexploited Data Reuse Opportunities
5.3. Non-Temporal Data
5.3.1. Example of Non-Temporal Data Optimization
5.3.2. Singlethreaded Uses of Non-Temporal Hints
5.3.3. Multithreaded Uses of Non-Temporal Hints
5.3.4. Concurrent Uses of Non-Temporal Hints
5.3.5. Types of Non-Temporal Hint Instructions
5.3.6. Using Non-Temporal Hint Instructions
5.4. Multithreading Problems
5.4.1. False Sharing
5.4.2. Poor Communication Utilization
5.5. Common Data Structures
5.5.1. Arrays
5.5.2. Linked Lists
5.5.3. Trees
5.5.4. Hash Tables
5.6. Final Remedies
6. Optimization Workflow
6.1. Initial State: Correct, Measurable Program, Good Test Case
6.2. Avoid Unnecessary Memory Accesses
6.3. Optimize Data Layout
6.4. Optimize Access Patterns
6.5. Utilize Reuse Opportunities
6.6. Use Non-Temporal Hints for Data without Temporal Reuse
6.7. Avoid False Sharing
6.8. Avoid Communication between Caches (Coherence Traffic)
6.9. Hide Remaining Misses
7. Reading the Report
7.1. Statistics
7.1.1. Reading the Statistics
7.1.2. Reading the Diagrams
7.2. The Report Layout
7.3. The Summary Frame
7.3.1. The Summary Tab
7.3.2. The Loops Tab
7.3.3. The Bandwidth Issues Tab
7.3.4. The Latency Issues Tab
7.3.5. The Multi-Threading Issues Tab
7.3.6. The Pollution Issues Tab
7.3.7. The Files Tab
7.3.8. The Execution Tab
7.3.9. The About/Help Tab
7.4. The Issue Frame
7.4.1. Statistics
7.4.2. Instructions
7.4.3. Loop Details
7.4.4. Issue Details
7.5. The Source Code Frame
8. Issue Reference
8.1. Utilization Issues
8.1.1. Fetch Utilization
8.1.2. Write-Back Utilization
8.1.3. Communication Utilization
8.2. Inefficient Loop Nesting
8.3. Random Access Pattern
8.4. Loop Fusion
8.5. Blocking
8.6. Software Prefetch Issues
8.6.1. Prefetch Unnecessary
8.6.2. Prefetch too Distant
8.6.3. Prefetch too Close
8.7. Fetch Hot-Spot
8.8. Write-back Hot-Spot
8.9. Non-Temporal Store Possible
8.10. Non-Temporal Data
8.11. False Sharing
8.12. Communication Hot-Spot
9. Technical Support
A. Sampling MPI Applications
A.1. Introduction
A.2. Scope
A.3. Sampling of MPI Applications
A.4. Alternative method: wrapper scripts
A.5. Scratch directories
A.6. Cray, Torque PBS, and ALPS
A.7. Cray, SLURM, and ALPS
A.8. MPI related limitations
B. Cross-Architecture Analysis
B.1. Introduction
B.2. Supported Non-x86 Processors
B.3. Considerations for Accurate Cross-Architecture Analysis
B.4. Sampling the Required Cache Line Size
B.5. x86-centric Issues
B.5.1. Non-Temporal Data
B.5.2. Non-Temporal Store Possible
B.6. Considerations for Specific Processors
C. Supported CPU types
D. Credits
D.1. libelf
D.2. libdwarf
D.3. libgd-2.0.34
D.4. OpenSSL
D.5. klibc
I. Command Reference
internal — GUI for Freja sampling and report generation
sample — sample the memory access pattern of a process and generate a sample file
report — generate a report from a sample file
view — start a report viewer
license — install a license file

List of Figures

2.1. Overview of the GUI
2.2. Processor Model Selector
2.3. Advanced Sampling Settings
2.4. Advanced Report Settings
3.1. Example System
3.2. Cache Coherence, Example 1
3.3. Cache Coherence, Example 2
3.4. Cache Coherence, Example 3
3.5. Cache Coherence, Example 4
3.6. Cache Coherence, Example 5
3.7. Cache Coherence, Example 6
3.8. Cache Coherence, Example 7
5.1. Data Layout Example
5.2. Good Utilization
5.3. Poor Utilization
5.4. Unused Fields
5.5. No Unused Fields
5.6. Poor Internal Alignment
5.7. Good Internal Alignment
5.8. External Alignment
5.9. Dynamic Memory Allocation
5.10. Inefficient Loop Nesting
5.11. Efficient Loop Nesting
5.12. False Sharing Example, Step 1
5.13. False Sharing Example, Step 2
5.14. False Sharing Example, Step 3
5.15. False Sharing Example, Step 4
5.16. False Sharing Example, Step 5
5.17. False Sharing Example, Step 6
5.18. False Sharing Example, Fixed
5.19. Matrix Accesses with False Sharing
5.20. Matrix Accesses without False Sharing
7.1. Issue Statistics Section
7.2. Summary Statistics
7.3. Issue Statistics
7.4. Loop Statistics
7.5. Instruction Group Statistics
7.6. Fetch/Miss Ratio Diagram
7.7. Write-Back Ratio Diagram
7.8. Utilization Diagram
7.9. Report Outline
7.10. The Summary Tab
7.11. The Loops Tab
7.12. The Bandwidth Issues Tab
7.13. The Latency Issues Tab
7.14. The Multi-Threading Issues Tab
7.15. The Pollution Issues Tab
7.16. The Files Tab
7.17. The Execution Tab
7.18. The About/Help Tab
7.19. Issue Statistic Sections
7.20. Instructions with Collapsed Call Stack
7.21. Instructions with Expanded Call Stack
7.22. Loop
7.23. Source Code with Collapsed Lines
7.24. Source Code with Expanded Lines
8.1. Fetch Utilization Issue
8.2. Write-Back Utilization Issue
8.3. Communication Utilization Issue
8.4. Inefficient Loop Nesting Issue
8.5. Random Access Pattern Issue
8.6. Loop Fusion Issue
8.7. Blocking Issue
8.8. Prefetch Unnecessary Issue
8.9. Prefetch too Distant Issue
8.10. Prefetch too Close Issue
8.11. Fetch Hot-Spot Issue
8.12. Write-back Hot-Spot Issue
8.13. Non-Temporal Store Possible Issue
8.14. Non-Temporal Data Issue
8.15. False Sharing Issue
8.16. Communication Hot-Spot Issue
A.1. MPI Sampling Principles
A.2. Message Passing Toolkit, runtime system and shepherd process

List of Tables

9.1. Electronic Services
A.1. Fingerprint filename substitutions
A.2. %r substitutions
C.1. AMD
C.2. ARM
C.3. Freescale
C.4. IBM
C.5. Intel
9. Filename substitutions
10. %r substitutions

List of Examples

A.1. Sampling Open MPI ranks using a wrapper script
A.2. Sampling with a wrapper script
A.3. Selectively sampling ranks
4. Starting an application in the sampler
5. Attaching to a running process
6. Burst sampling a long running application
7. Using a template name for output file
8. Analyzing sample files using autodetected CPU models
9. Specifying a CPU model
10. Using custom thread to cache mappings
11. Installing a license file for the current user
12. Installing a reference to three license servers