Event-based profiling (EBP) helps identify the root cause of CPU- and memory-related performance issues. EBP uses the performance monitoring hardware in AMD processors to count the number of occurrences of hardware events. The kind and frequency of these events may indicate the presence of a pipeline bottleneck, a poor memory access pattern, poorly predicted conditional branches, or some other performance issue. Once hot-spots are found through time-based profiling, EBP is used to follow up and investigate those hot-spots in order to identify and exploit opportunities for optimization.
Figure: Retired instructions, DC accesses and misses per software module (EBP)
Figure: Retired instructions, DC accesses and misses for source-level hot-spot (EBP)
AMD processors are equipped with performance monitoring counters (PMC). Each counter may count exactly one hardware event at a time. A hardware event is a condition (or change in hardware condition) like CPU clocks, retired x86 instructions, data cache accesses, or data cache misses. The number of counters and the hardware events that can be measured are processor-dependent. The CodeAnalyst online help provides a quick guide to the events that are available for each AMD processor family. See Performance Monitoring Events for descriptions of the events supported by AMD processors. However, you should consult the BIOS and Kernel Developer's Guide for the AMD processor in your test platform for the latest information. The number of events and, in some cases, the event behavior may vary by revision within a processor family as well.
Like time-based profiling, event-based profiling relies upon statistical sampling to build a program profile. CodeAnalyst handles the details of PMC configuration and sampling. However, the following short description of how CodeAnalyst performs event-based profiling may help you use CodeAnalyst more effectively. Each counter must be configured with:
Since EBP is a statistical method, it also depends upon a statistically significant quantity of samples in order to support reasoning about program behavior. As discussed in time-based profiling, the number of samples collected depends upon the sampling period (the event count parameter) and the measurement period (the length of time during which samples are collected). The number of samples collected can be increased by using a smaller sampling period or by increasing the length of time that samples are taken.
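The relationship between sampling period, measurement period, and sample count can be sketched with a little arithmetic. The following is only an illustration and is not part of CodeAnalyst: the function name `expected_samples` and the workload figures are hypothetical, and the sketch assumes the event occurs at a steady rate for the whole run.

```python
def expected_samples(events_per_second, duration_seconds, sampling_period):
    """Estimate how many samples one counter produces during a run.

    Assumes a steady event rate, which real workloads rarely have.
    """
    if sampling_period <= 0:
        raise ValueError("sampling_period must be positive")
    total_events = events_per_second * duration_seconds
    # One sample is taken each time the counter has seen sampling_period events.
    return total_events // sampling_period


# Hypothetical workload: 2,000,000,000 events per second for 10 seconds.
baseline = expected_samples(2_000_000_000, 10, 1_000_000)
smaller_period = expected_samples(2_000_000_000, 10, 500_000)
longer_run = expected_samples(2_000_000_000, 20, 1_000_000)

print(baseline, smaller_period, longer_run)  # 20000 40000 40000
```

Halving the sampling period and doubling the measurement period both double the estimated sample count, which matches the two options described above.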
Use of a smaller sampling period increases data collection overhead. Since data collection must be performed on the same platform as the test workload, more frequent sampling increases the intrusiveness of event-based profiling. The sampling process also adversely affects shared hardware resources like instruction and data caches, translation lookaside buffers, and branch history tables. Extremely small sampling periods may also cause system instability. Start off conservatively with a large sampling period and slowly decrease it for an event until the appropriate volume of samples is generated.
An additional complicating factor when choosing the sampling period for an event is the behavior of the workload itself. Some workloads are CPU-intensive while other workloads are memory-intensive. Some workloads may be CPU-intensive and require high memory bandwidth to stream data into the CPU. For example, a CPU-intensive application that performs few memory access operations will cause relatively few data-cache miss events simply because it does not access memory very often. The characteristics of the workload may even vary by phase where the phase setting up a computation has a different behavior from the computation phase itself. Thus, the workload behavior determines the frequency of certain kinds of events and changes to the sample period may be necessary in practice.
As mentioned earlier, the number of available performance counters is processor-dependent. AMD Family 10h processors, for example, provide four performance counters. The number of available performance counters determines the number of events that can be measured together at one time. Ordinarily, this would limit the number of events that can be measured in a single experimental run. However, CodeAnalyst for Linux® allows users to configure more than four performance counters within a profiling session. Up to four performance counters can be grouped together, and CodeAnalyst will rerun the application to collect performance data for each group. See Event Counter Multiplexing for further details.
NOTE: The semantics of Event Multiplexing in CodeAnalyst for Linux differ from those in CodeAnalyst for Windows.
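As a rough illustration of grouping, the sketch below splits a list of events into groups of at most four, one group per run of the application. It is not CodeAnalyst code: the function `group_events` and the event labels are hypothetical placeholders, and the sequential split is an assumption, since users may arrange events into groups however they like.

```python
def group_events(events, counters_per_group=4):
    """Split events into consecutive groups that each fit the available counters."""
    if counters_per_group < 1:
        raise ValueError("counters_per_group must be at least 1")
    return [
        events[start:start + counters_per_group]
        for start in range(0, len(events), counters_per_group)
    ]


# Placeholder event labels, not official event names.
events = [
    "CPU clocks",
    "Retired instructions",
    "DC accesses",
    "DC misses",
    "Retired branches",
    "Mispredicted branches",
]

groups = group_events(events)
for run_number, group in enumerate(groups, start=1):
    # Each group corresponds to one run of the application.
    print(f"run {run_number}: {', '.join(group)}")
```

With six events and four counters, this yields two groups, so the application would be run twice.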
Each EBP sample has a weight. In the simplest case, this weight is equal to the sampling period for the event being measured since one sample is generated for every SP events, where SP is the sampling period. The weight is affected by event multiplexing and is determined by the number of event groups and the number of times an event is a member of a group. The effective sample weight for an event that is a member of every group is the base sampling period. If an event is a member of a single group and there are N groups, the effective sample weight is N times the sampling period. Sample weight must be taken into account when performing arithmetic on sample counts to compute rates like the number of data cache accesses per retired instruction. Sample counts must be normalized to have the same sample weighting before they can be arithmetically combined. CodeAnalyst handles sample weighting (normalization) transparently when it performs arithmetic on sample counts between events of different types.
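The normalization described above can be sketched as follows. CodeAnalyst performs this step itself, so the code is purely illustrative. The function names and sample figures are hypothetical, and the general formula (sampling period times number of groups, divided by the number of groups containing the event) is an assumption that interpolates between the two cases stated in the text.

```python
def effective_weight(sampling_period, total_groups, groups_with_event):
    """Effective weight of one sample under event multiplexing (assumed formula)."""
    if not 1 <= groups_with_event <= total_groups:
        raise ValueError("groups_with_event must be between 1 and total_groups")
    # Member of every group -> sampling_period.
    # Member of one group out of N -> N * sampling_period.
    return sampling_period * total_groups / groups_with_event


def estimated_events(sample_count, sampling_period, total_groups, groups_with_event):
    """Convert a raw sample count into an estimated event count."""
    weight = effective_weight(sampling_period, total_groups, groups_with_event)
    return sample_count * weight


# Hypothetical session with two event groups.
# Retired instructions: in both groups, sampling period 250,000, 4,000 samples.
instructions = estimated_events(4_000, 250_000, total_groups=2, groups_with_event=2)
# DC accesses: in one group only, sampling period 25,000, 8,000 samples.
dc_accesses = estimated_events(8_000, 25_000, total_groups=2, groups_with_event=1)

# Only after both counts carry the same weighting can they be divided.
accesses_per_instruction = dc_accesses / instructions
print(f"{accesses_per_instruction:.2f} DC accesses per retired instruction")
```

Dividing the raw sample counts directly (8,000 / 4,000) would give 2.0, whereas the weighted estimate is 0.40, which is why the weights must be applied before sample counts are combined.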
To make configuration easier, CodeAnalyst provides several predefined profile configurations in which the events and sampling periods have already been chosen. These predefined profile configurations are: