This section describes processor performance monitor events available for performance analysis and tuning for AMD Family 11h processors. The AMD Family 11h processors provide four 48-bit performance counters per available core, which allows four types of events to be monitored simultaneously. The performance counters are not guaranteed to be fully accurate and should be used as a relative measure of performance to assist in application tuning. Unlisted event numbers are reserved and their results are undefined.
The Event Select value is used to select the event to be monitored. The Unit Mask is used to further qualify the event selected by the Event Select value. The Mask Value given here is an index and corresponds to the actual 8-bit Unit Mask as specified in the following table.
Mask value | Unit Mask |
0 | 0x01 |
1 | 0x02 |
2 | 0x04 |
3 | 0x08 |
4 | 0x10 |
5 | 0x20 |
6 | 0x40 |
7 | 0x80 |
Unless otherwise stated, the Unit Mask values shown may be combined to select any desired combination of the sub-events for a given event. For events where no Unit Mask table is shown, the Unit Mask is not applicable and the results are undefined.
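As a rough illustration of the table above, each mask-value index selects one bit of the 8-bit Unit Mask, and combining sub-events amounts to ORing those bits together. The helper name `unit_mask` is hypothetical and not part of any AMD tool.

```python
def unit_mask(*mask_values: int) -> int:
    """Combine mask-value indices (0..7) into an 8-bit Unit Mask."""
    mask = 0
    for value in mask_values:
        if not 0 <= value <= 7:
            raise ValueError(f"mask value out of range: {value}")
        mask |= 1 << value  # index n selects bit n, e.g. 3 -> 0x08
    return mask


# Selecting sub-events 0, 1 and 2 of an event:
print(hex(unit_mask(0, 1, 2)))  # 0x7
```

Whether a given combination is meaningful still depends on the event, as stated above.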
Speculative vs. Retired events: Several events may include speculative activity, meaning the events may be associated with false-path instructions that are ultimately discarded due to a branch misprediction. Events associated with "Retire" reflect actual program execution. For events where the distinction may matter, these are explicitly labeled as one or the other.
Dual-core operation: In AMD64 dual-core processors, each core has its own set of event counters. However, each core shares the event-select logic for events in the shared Northbridge logic, allowing an overwrite of a Northbridge event select (including unit mask) that was previously set up by the other core, changing the event that the first core thinks it is counting.
Note: This conflict between cores occurs between corresponding event counters, e.g., PMC0 vs. PMC0. So both cores cannot simultaneously monitor different Northbridge events using the same counter. When using the performance counters simultaneously in both cores, care must be taken to avoid this conflict, such as by having one core monitor the desired Northbridge events and the other core either monitor events internal to itself, or not use the corresponding event counters.
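The rule in the note above can be sketched as a small check run before programming the counters. This is only an illustration: the `CounterSetup` type, the `northbridge` flag (supplied by the caller) and the function name are assumptions, not part of any AMD interface.

```python
from typing import Dict, List, NamedTuple


class CounterSetup(NamedTuple):
    event_select: int
    unit_mask: int
    northbridge: bool  # caller states whether this is a Northbridge event


def northbridge_conflicts(core0: Dict[int, CounterSetup],
                          core1: Dict[int, CounterSetup]) -> List[int]:
    """Return counter indices (0 for PMC0, and so on) where both cores
    program different Northbridge events on the corresponding counter."""
    conflicts = []
    for index in sorted(core0.keys() & core1.keys()):
        a, b = core0[index], core1[index]
        same_event = (a.event_select, a.unit_mask) == (b.event_select, b.unit_mask)
        if a.northbridge and b.northbridge and not same_event:
            conflicts.append(index)
    return conflicts
```

A counter used for a Northbridge event on one core and a core-internal event on the other is not reported, which matches the workaround described in the note.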
For detailed information, refer to the BIOS and Kernel Developer's Guide for AMD Athlon™ 64 and AMD Opteron™ Processors, order# 26094.
Abbreviation: FPU ops
The number of operations (uops) dispatched to the FPU execution pipelines. This event reflects how busy the FPU pipelines are. This includes all operations done by x87, MMX® and SSE instructions, including moves. Each increment represents a one-cycle dispatch event; packed 128-bit SSE operations count as two ops; scalar operations count as one. Speculative. (See also event CBh). Note: Since this event includes non-numeric operations it is not suitable for measuring MFLOPs.
Value | Unit mask description |
0 | Add pipe ops excluding junk ops |
1 | Multiply pipe ops excluding junk ops |
2 | Store pipe ops excluding junk ops |
3 | Add pipe load ops |
4 | Multiply pipe load ops |
5 | Store pipe load ops |
Abbreviation: No FPU op cycles
The number of cycles in which the FPU is empty.
Abbreviation: Fast flag FPU ops
The number of FPU operations that use the fast flag interface (e.g. FCOMI, COMISS, COMISD, UCOMISS, UCOMISD). This event is a speculative event.
Abbreviation: Seg reg loads
The number of segment register loads performed.
Value | Unit mask description |
0 | ES |
1 | CS |
2 | SS |
3 | DS |
4 | FS |
5 | GS |
6 | HS |
Abbreviation: Restart self-mod code
The number of pipeline restarts that were caused by self-modifying code (a store that hits any instruction that's been fetched for execution beyond the instruction doing the store).
Abbreviation: Restart probe hit
The number of pipeline restarts caused by an invalidating probe hitting on a speculative out-of-order load.
Abbreviation: LS2 buffer full
The number of cycles that the LS2 buffer is full. This buffer holds stores waiting to retire as well as requests that missed the data cache and are waiting on a refill. This condition will stall further data cache accesses, although such stalls may be overlapped by independent instruction execution.
Abbreviation: Locked ops
This event covers locked operations performed and their execution time. The execution time represented by the cycle counts is typically overlapped to a large extent with other instructions. The non-speculative cycles event is suitable for event-based profiling of lock operations that tend to miss in the cache.
Value | Unit mask description |
0 | Number of locked instructions executed |
1 | Number of cycles spent in speculative phase |
2 | Number of cycles spent in non-speculative phase |
Abbreviation: DC accesses
The number of accesses to the data cache for load and store references. This may include certain microcode scratchpad accesses, although these are generally rare. Each increment represents an eight-byte access, although the instruction may only be accessing a portion of that. This event is a speculative event.
Abbreviation: DC misses
The number of data cache references which missed in the data cache. This event is a speculative event.
Except in the case of streaming stores, only the first miss for a given line is included - access attempts by other instructions while the refill is still pending are not included in this event. So in the absence of streaming stores, each event reflects one 64-byte cache line refill, and counts of this event are the same as, or very close to, the combined count for event 42h.
Streaming stores however will cause this event for every such store, since the target memory is not refilled into the cache. Hence this event should not be used as an indication of data cache refill activity - event 42h should be used for such measurements. (See event 65h for an indication of streaming store activity.) A large difference between events 41h (with all UNIT_MASK bits set) and 42h would be due mainly to streaming store activity.
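The comparison described above can be expressed as a simple difference. This is a minimal sketch that assumes both counts were collected over the same interval; the function name is hypothetical, and the result is only a rough indicator because the counters are not guaranteed to be exact.

```python
def streaming_store_indicator(dc_misses_41h: int, dc_refills_42h: int) -> int:
    """Misses (event 41h) not matched by a refill (event 42h).

    Clamped at zero because the two counts are approximate and the
    refill count can slightly exceed the miss count.
    """
    return max(dc_misses_41h - dc_refills_42h, 0)
```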
Abbreviation: DC refills L2/sys
The number of data cache refills satisfied from the L2 cache (and/or the system), per the UNIT_MASK. UNIT_MASK bits 4:1 allow a breakdown of refills from the L2 by coherency state. UNIT_MASK bit 0 reflects refills which missed in the L2, and provides the same measure as the combined sub-events of event 43h. Each increment reflects a 64-byte transfer. This event is a speculative event.
Value | Unit mask description |
0 | Refill from system |
1 | Shared-state line from L2 |
2 | Exclusive-state line from L2 |
3 | Owned-state line from L2 |
4 | Modified-state line from L2 |
Abbreviation: DC refills sys
The number of L1 cache refills satisfied from the system (system memory or another cache), as opposed to the L2. The UNIT_MASK selects lines in one or more specific coherency states. Each increment reflects a 64-byte transfer. This event is a speculative event.
Value | Unit mask description |
0 | Invalid |
1 | Shared |
2 | Exclusive |
3 | Owned |
4 | Modified |
Abbreviation: DC evicted
The number of L1 data cache lines written to the L2 cache or system memory, having been displaced by L1 refills. The UNIT_MASK may be used to count only victims in specific coherency states. Each increment represents a 64-byte transfer. This event is a speculative event.
In most cases, L1 victims are moved to the L2 cache, displacing an older cache line there. Lines brought into the data cache by PrefetchNTA instructions, however, are evicted directly to system memory (if dirty) or invalidated (if clean). There is no provision for measuring this component by itself. The Invalid case (UNIT_MASK value 01h) reflects the replacement of lines that would have been invalidated by probes for write operations from another processor or DMA activity.
Value | Unit mask description |
0 | Invalid |
1 | Shared |
2 | Exclusive |
3 | Owned |
4 | Modified |
Abbreviation: DTLB L1M L2H
The number of data cache accesses that miss in the L1 DTLB and hit in the L2 DTLB. This event is a speculative event.
Abbreviation: DTLB L1M L2M
The number of data cache accesses that miss in both the L1 and L2 DTLBs. This event is a speculative event.
Abbreviation: Misalign access
The number of data cache accesses that are misaligned. These are accesses which cross an eight-byte boundary. They incur an extra cache access (reflected in event 40h), and an extra cycle of latency on reads. This event is a speculative event.
Abbreviation: Late cancel
Abbreviation: Early cancel
Abbreviation: 1-bit ECC errors
The number of single-bit errors corrected by either of the error detection/correction mechanisms in the data cache.
Value | Unit mask description |
0 | Scrubber error |
1 | Piggyback scrubber errors |
Abbreviation: Prefetch inst
The number of prefetch instructions dispatched by the decoder. Such instructions may or may not cause a cache line transfer. All Dcache and L2 accesses, hits and misses by prefetch instructions, except for prefetch instructions that collide with an outstanding hardware prefetch, are included in these events. This event is a speculative event.
Value | Unit mask description |
0 | Load (Prefetch, PrefetchT0/T1/T2) |
1 | Store (PrefetchW) |
2 | NTA (PrefetchNTA) |
Abbreviation: DC misses locked inst
The number of data cache misses incurred by locked instructions. (The total number of locked instructions may be obtained from event 24h.)
Such misses may be satisfied from the L2 or system memory, but there is no provision for distinguishing between the two. When used for event-based profiling, this event will tend to occur very close to the offending instructions. (See also event 24h.) This event is also included in the basic Dcache miss event (event 41h).
Value | Unit mask description |
1 | Data cache misses by locked instructions |
Abbreviation: Mem type req
These events reflect accesses to uncacheable (UC) or write-combining (WC) memory regions (as defined by MTRR or PAT settings) and Streaming Store activity to WB memory. Both the WC and Streaming Store events reflect Write Combining buffer flushes, not individual store instructions. WC buffer flushes typically consist of one 64-byte write to the system for each flush (assuming software typically fills a buffer before it gets flushed). A partially-filled buffer will require two or more smaller writes to the system. The WC event reflects flushes of WC buffers that were filled by stores to WC memory or streaming stores to WB memory. The Streaming Store event reflects only flushes due to streaming stores (which are typically only to WB memory). The difference between counts of these two events reflects the true amount of write events to WC memory.
Value | Unit mask description |
0 | Requests to non-cacheable (UC) memory |
1 | Requests to write-combining (WC) memory or WC buffer flushes to WB memory |
7 | Streaming store (SS) requests |
Abbreviation: Data prefetcher
These events reflect requests made by the data prefetcher. UNIT_MASK bit 1 counts total prefetch requests, while bit 0 counts requests where the target block is found in the L2 or data cache. The difference between the two represents actual data read (in units of 64-byte cache lines) from the system by the prefetcher. This is also included in the count of event 7Fh, UNIT_MASK bit 0 (combined with other L2 fill events).
Value | Unit mask description |
0 | Cancelled prefetches |
1 | Prefetch attempts |
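To make the arithmetic concrete, the data read from the system by the prefetcher can be estimated from the two sub-event counts. This is a sketch under the assumption that both counts cover the same interval; the function and constant names are hypothetical.

```python
CACHE_LINE_BYTES = 64  # each prefetch transfers one 64-byte cache line


def prefetcher_bytes_from_system(attempts: int, cancelled: int) -> int:
    """Estimate bytes read from the system by the data prefetcher.

    attempts:  count with UNIT_MASK bit 1 (prefetch attempts)
    cancelled: count with UNIT_MASK bit 0 (cancelled prefetches)
    """
    lines = max(attempts - cancelled, 0)  # counts are approximate, so clamp
    return lines * CACHE_LINE_BYTES
```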
Abbreviation: Sys read resp
The number of responses from the system for cache refill requests. The UNIT_MASK may be used to select specific cache coherency states. Each increment represents one 64-byte cache line transferred from the system (DRAM or another cache, including another core on the same node) to the data cache, instruction cache or L2 cache (for data prefetcher and TLB table walks). Modified-state responses may be for Dcache store miss refills, PrefetchW software prefetches, hardware prefetches for a store-miss stream, or Change-to-Dirty requests that get a dirty (Owned) probe hit in another cache. Exclusive responses may be for any Icache refill, Dcache load miss refill, other software prefetches, hardware prefetches for a load-miss stream, or TLB table walks that miss in the L2 cache; Shared responses may be for any of those that hit a clean line in another cache.
Value | Unit mask description |
0 | Exclusive |
1 | Modified |
2 | Shared |
4 | Data error |
Abbreviation: Quad written to sys
The number of quadword (8-byte) data transfers from the processor to the system. These may be part of a 64-byte cache line writeback or a 64-byte dirty probe hit response, each of which would cause eight increments; or a partial or complete Write Combining buffer flush (Sized Write), which could cause from one to eight increments.
Value | Unit mask description |
0 | Quadword write transfer |
Abbreviation: L2 requests
The number of requests to the L2 cache for Icache or Dcache fills, or page table lookups for the TLB. These events reflect only read requests to the L2; writes to the L2 are indicated by event 7Fh. These include some amount of retries associated with address or resource conflicts. Such retries tend to occur more as the L2 gets busier, and in certain extreme cases (such as large block moves that overflow the L2) these extra requests can dominate the event count.
These extra requests are not a direct indication of performance impact - they simply reflect opportunistic accesses that don't complete. But because of this, they are not a good indication of actual cache line movement. The Icache and Dcache miss and refill events (81h, 82h, 83h, 41h, 42h, 43h) provide a more accurate indication of this, and are the preferred way to measure such traffic.
Value | Unit mask description |
0 | IC fill |
1 | DC fill |
2 | TLB fill (page table walks) |
3 | Tag snoop request |
4 | Cancelled request |
Abbreviation: L2 misses
The number of requests that miss in the L2 cache. This may include some amount of speculative activity, as well as some amount of retried requests as described in event 7Dh. The IC-fill-miss and DC-fill-miss events tend to mirror the Icache and Dcache refill-from-system events (83h and 43h, respectively), and tend to include more speculative activity than those events.
Value | Unit mask description |
0 | IC fill |
1 | DC fill (includes possible replays) |
2 | TLB page table walk |
Abbreviation: L2 fill/write
The number of lines written into the L2 cache due to victim writebacks from the Icache or Dcache, TLB page table walks and the hardware data prefetcher (UNIT_MASK bit 0); or writebacks of dirty lines from the L2 to the system (UNIT_MASK bit 1). Each increment represents a 64-byte cache line transfer.
Note: Victim writebacks from the Dcache may be measured separately using event 44h. However this is not quite the same as the Dcache component of event 7Fh, the main difference being PrefetchNTA lines. When these are evicted from the Dcache due to replacement, they are written out to system memory (if dirty) or simply invalidated (if clean), rather than being moved to the L2 cache.
Value | Unit mask description |
0 | L2 fills (victims from L1 caches, TLB page table walks and data prefetches) |
1 | L2 writebacks to system |
Abbreviation: IC fetches
The number of instruction cache accesses by the instruction fetcher. Each access is an aligned 16 byte read, from which a varying number of instructions may be decoded.
Abbreviation: IC misses
The number of instruction fetches that miss in the instruction cache. This is typically equal to or very close to the sum of events 82h and 83h. Each miss results in a 64-byte cache line refill.
Abbreviation: IC refills from L2
The number of instruction cache refills satisfied from the L2 cache. Each increment represents one 64-byte cache line transfer.
Abbreviation: IC refills from sys
The number of instruction cache refills from system memory (or another cache). Each increment represents one 64-byte cache line transfer.
Abbreviation: ITLB L1M L2H
The number of instruction fetches that miss in the L1 ITLB but hit in the L2 ITLB.
Abbreviation: ITLB L1M L2M
The number of instruction fetches that miss in both the L1 and L2 ITLBs.
Abbreviation: Restart i-stream probe
The number of pipeline restarts caused by invalidating probes that hit on the instruction stream currently being executed. This would happen if the active instruction stream was being modified by another processor in an MP system - typically a highly unlikely event.
Abbreviation: Inst fetch stall
The number of cycles the instruction fetcher is stalled. This may be for a variety of reasons such as branch predictor updates, unconditional branch bubbles, far jumps and cache misses, among others. May be overlapped by instruction dispatch stalls or instruction execution, such that these stalls don't necessarily impact performance.
Abbreviation: RET stack hits
The number of near return instructions (RET or RET Iw) that get their return address from the return address stack (i.e. where the stack has not gone empty). This may include cases where the address is incorrect (return mispredicts). This may also include speculatively executed false-path returns. Return mispredicts are typically caused by the return address stack underflowing, however they may also be caused by an imbalance in calls vs. returns, such as doing a call but then popping the return address off the stack.
Note: This event cannot be reliably compared with events C9h and CAh (such as to calculate percentage of return mispredicts due to an empty return address stack), since it may include speculatively executed false-path returns that are not included in those retire-time events.
Abbreviation: RET stack overflows
The number of (near) call instructions that cause the return address stack to overflow. When this happens, the oldest entry is discarded. This count may include speculatively executed calls.
Abbreviation: Ret CLFLUSH inst
The number of CLFLUSH instructions retired.
Abbreviation: Ret CPUID inst
The number of CPUID instructions retired.
Abbreviation: CPU clocks
The number of clocks that the CPU is not in a halted state (due to STPCLK or a HALT instruction). Note: this event allows system idle time to be automatically factored out from IPC (or CPI) measurements, provided the OS halts the CPU when going idle. If the OS goes into an idle loop rather than halting, such calculations will be influenced by the IPC of the idle loop.
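The IPC and CPI figures mentioned above can be derived as follows. This sketch assumes a retired-instruction count and a count of this event taken over the same interval; the function names are hypothetical.

```python
def ipc(retired_instructions: int, cpu_clocks: int) -> float:
    """Instructions per clock, using non-halted CPU clocks as the denominator."""
    if cpu_clocks <= 0:
        raise ValueError("cpu_clocks must be positive")
    return retired_instructions / cpu_clocks


def cpi(retired_instructions: int, cpu_clocks: int) -> float:
    """Clocks per instruction, the reciprocal of IPC."""
    if retired_instructions <= 0:
        raise ValueError("retired_instructions must be positive")
    return cpu_clocks / retired_instructions
```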
Abbreviation: Ret inst
The number of instructions retired (execution completed and architectural state updated). This count includes exceptions and interrupts - each exception or interrupt is counted as one instruction.
Abbreviation: Ret uops
The number of micro-ops retired. This includes all processor activity (instructions, exceptions, interrupts, microcode assists, etc.).
Abbreviation: Ret branch
The number of branch instructions retired. This includes all types of architectural control flow changes, including exceptions and interrupts.
Abbreviation: Ret misp branch
The number of branch instructions retired, of any type, that were not correctly predicted. This includes those for which prediction is not attempted (far control transfers, exceptions and interrupts).
Abbreviation: Ret taken branch
The number of taken branches that were retired. This includes all types of architectural control flow changes, including exceptions and interrupts.
Abbreviation: Ret taken branch misp
The number of retired taken branch instructions that were mispredicted.
Abbreviation: Ret far xfers
The number of far control transfers retired including far call/jump/return, IRET, SYSCALL and SYSRET, plus exceptions and interrupts. Far control transfers are not subject to branch prediction.
Abbreviation: Ret branch resyncs
The number of resync branches. These reflect pipeline restarts due to certain microcode assists and events such as writes to the active instruction stream, among other things. Each occurrence reflects a restart penalty similar to a branch mispredict. Relatively rare.
Abbreviation: Ret near RET
The number of near return instructions (RET or RET Iw) retired.
Abbreviation: Ret near RET misp
The number of near returns retired that were not correctly predicted by the return address predictor. Each such mispredict incurs the same penalty as a mispredicted conditional branch instruction.
Abbreviation: Ret ind branch misp
The number of indirect branch instructions retired where the target address was not correctly predicted.
Abbreviation: Ret MMX/FP inst
The number of MMX®, SSE or x87 instructions retired. The UNIT_MASK allows the selection of the individual classes of instructions as given in the table. Each increment represents one complete instruction.
Note: Since this event includes non-numeric instructions it is not suitable for measuring MFLOPS.
Value | Unit mask description |
0 | x87 instructions |
1 | MMX and 3DNow instructions |
2 | Packed SSE and SSE2 instructions |
3 | Scalar SSE and SSE2 instructions |
Abbreviation: Ret fastpath double op
Value | Unit mask description |
0 | With low op in position 0 |
1 | With low op in position 1 |
2 | With low op in position 2 |
Abbreviation: Int-masked cycles
The number of processor cycles where interrupts are masked (EFLAGS.IF = 0). Using edge-counting with this event will give the number of times IF is cleared; dividing the cycle-count value by this value gives the average length of time that interrupts are disabled on each instance. Compare the edge count with event CFh to determine how often interrupts are disabled for interrupt handling vs. other reasons (e.g. critical sections).
Abbreviation: Int-masked pending
The number of processor cycles where interrupts are masked (EFLAGS.IF = 0) and an interrupt is pending. Using edge-counting with this event and comparing the resulting count with the edge count for event CDh gives the proportion of interrupts for which handling is delayed due to prior interrupts being serviced, critical sections, etc. The cycle count value gives the total amount of time for such delays. The cycle count divided by the edge count gives the average length of each such delay.
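The averages described for the two interrupt-masked events are both a cycle count divided by an edge count. The following is a minimal sketch, assuming the event was measured twice over comparable intervals (once in normal cycle-counting mode, once with edge-counting); the function name is hypothetical.

```python
def average_cycles_per_occurrence(cycle_count: int, edge_count: int) -> float:
    """Average length, in cycles, of each masked (or masked-and-pending) period.

    cycle_count: the event counted normally (cycles the condition holds)
    edge_count:  the same event with edge-counting (times the condition begins)
    """
    if edge_count <= 0:
        return 0.0  # the condition never began during the measurement
    return cycle_count / edge_count
```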
Abbreviation: Int taken
The number of hardware interrupts taken. This does not include software interrupts (INT n instruction).
Abbreviation: Decoder empty
The number of processor cycles where the decoder has nothing to dispatch (typically waiting on an instruction fetch that missed the Icache, or for the target fetch after a branch mispredict).
Abbreviation: Dispatch stalls
The number of processor cycles where the decoder is stalled for any reason (has one or more instructions ready but can't dispatch them due to resource limitations in execution). This is the combined effect of events D2h - DAh, some of which may overlap; this event reflects the net stall cycles. The more common stall conditions (events D5h, D6h, D7h, D8h, and to a lesser extent D2h) may overlap considerably. The occurrence of these stalls is highly dependent on the nature of the code being executed (instruction mix, memory reference patterns, etc.).
Abbreviation: Stall branch abort
The number of processor cycles the decoder is stalled waiting for the pipe to drain after a mispredicted branch. This stall occurs if the corrected target instruction reaches the dispatch stage before the pipe has emptied. See also event D1h.
Abbreviation: Stall serialization
The number of processor cycles the decoder is stalled due to a serializing operation, which waits for the execution pipeline to drain. Relatively rare; mainly associated with system instructions. See also event D1h.
Abbreviation: Stall seg load
The number of processor cycles the decoder is stalled due to a segment load instruction being encountered while execution of a previous segment load operation is still pending. Relatively rare except in 16-bit code. See also event D1h.
Abbreviation: Stall reorder full
The number of processor cycles the decoder is stalled because the reorder buffer is full. May occur simultaneously with certain other stall conditions; see event D1h.
Abbreviation: Stall res station full
The number of processor cycles the decoder is stalled because a required integer unit reservation station is full. May occur simultaneously with certain other stall conditions; see event D1h.
Abbreviation: Stall FPU full
The number of processor cycles the decoder is stalled because the scheduler for the Floating Point Unit is full. This condition can be caused by a lack of parallelism in FP-intensive code, or by cache misses on FP operand loads (which could also show up as event D8h instead, depending on the nature of the instruction sequences). May occur simultaneously with certain other stall conditions; see event D1h.
Abbreviation: Stall LS full
The number of processor cycles the decoder is stalled because the Load/Store Unit is full. This generally occurs due to heavy cache miss activity. May occur simultaneously with certain other stall conditions; see event D1h.
Abbreviation: Stall waiting quiet
The number of processor cycles the decoder is stalled waiting for all outstanding requests to the system to be resolved. Relatively rare; associated with certain system instructions and types of interrupts. May partially overlap certain other stall conditions; see event D1h.
Abbreviation: Stall far/resync
The number of processor cycles the decoder is stalled waiting for the execution pipeline to drain before dispatching the target instructions of a far control transfer or a Resync (an instruction stream restart associated with certain microcode assists). Relatively rare; does not overlap with other stall conditions. See also event D1h.
Abbreviation: FPU except
The number of floating point unit exceptions for microcode assists. The UNIT_MASK may be used to isolate specific types of exceptions.
Value | Unit mask description |
0 | x87 reclass microfaults |
1 | SSE retype microfaults |
2 | SSE reclass microfaults |
3 | SSE and x87 microtraps |
Abbreviation: DR0 matches
The number of matches on the address in breakpoint register DR0, per the breakpoint type specified in DR7. The breakpoint does not have to be enabled. Each instruction breakpoint match incurs an overhead of about 120 cycles; load/store breakpoint matches do not incur any overhead.
Abbreviation: DR1 matches
The number of matches on the address in breakpoint register DR1. See notes for event DCh.
Abbreviation: DR2 matches
The number of matches on the address in breakpoint register DR2. See notes for event DCh.
Abbreviation: DR3 matches
The number of matches on the address in breakpoint register DR3. See notes for event DCh.
Abbreviation: DRAM accesses
The number of memory accesses performed by the local DRAM controller. The UNIT_MASK may be used to isolate the different DRAM page access cases. Page miss cases incur an extra latency to open a page; page conflict cases incur both page-close and page-open penalties. These penalties may be overlapped by DRAM accesses for other requests and don't necessarily represent lost DRAM bandwidth. The associated penalties are as follows:
Page miss: Trcd (DRAM RAS-to-CAS delay)
Page conflict: Trp + Trcd (DRAM row-precharge time plus RAS-to-CAS delay)
Each DRAM access represents one 64-byte block of data transferred if the DRAM is configured for 64-byte granularity, or one 32-byte block if the DRAM is configured for 32-byte granularity. (The latter is only applicable to single-channel DRAM systems, which may be configured either way.)
Value | Unit mask description |
0 | DCT0 page hit |
1 | DCT0 page miss |
2 | DCT0 page conflict |
3 | DCT1 page hit |
4 | DCT1 page miss |
5 | DCT1 page conflict |
6 | Write request |
7 | Read request |
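The data volume implied by a DRAM access count follows directly from the configured granularity. This is a sketch only; the function name and the `granularity_bytes` parameter are hypothetical, and the caller is assumed to know how the DRAM is configured.

```python
def dram_bytes_transferred(access_count: int, granularity_bytes: int = 64) -> int:
    """Bytes moved for a given number of DRAM accesses.

    granularity_bytes is 64, or 32 when the DRAM is configured for
    32-byte granularity.
    """
    if granularity_bytes not in (32, 64):
        raise ValueError("granularity must be 32 or 64 bytes")
    return access_count * granularity_bytes
```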
Abbreviation: Page table overflows
The number of page table overflows in the local DRAM controller. This table maintains information about which DRAM pages are open. An overflow occurs when a request for a new page arrives when the maximum number of pages are already open. Each occurrence reflects an access latency penalty equivalent to a page conflict.
Value | Unit mask description |
0 | DCT page table overflow |
1 | Number of stale table entry hits (hit on a page closed too soon) |
2 | Page table idle cycle limit incremented |
3 | Page table idle cycle limit decremented |
Abbreviation: Turnarounds
The number of turnarounds on the local DRAM data bus. The UNIT_MASK may be used to isolate the different cases. These represent lost DRAM bandwidth, which may be calculated as follows (in bytes per occurrence):
DIMM turnaround: DRAM_width_in_bytes * 2 edges_per_memclk * 2
R/W turnaround: DRAM_width_in_bytes * 2 edges_per_memclk * 1
W/R turnaround: DRAM_width_in_bytes * 2 edges_per_memclk * (Tcl-1)
where DRAM_width_in_bytes is 8 or 16 (for single- or dual-channel systems), and Tcl is the CAS latency of the DRAM in memory system clock cycles (where the memory clock for DDR-400, or PC3200 DIMMS, for example, would be 200 MHz).
Value | Unit mask description |
0 | DIMM (chip select) turnaround |
1 | Read to write turnaround |
2 | Write to read turnaround |
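The per-occurrence bandwidth loss can be worked through numerically. This is a sketch that assumes the factor of 1 applies to read-to-write turnarounds and the (Tcl-1) factor to write-to-read turnarounds; the function name and the `kind` strings are hypothetical.

```python
EDGES_PER_MEMCLK = 2  # double data rate: two transfers per memory clock


def lost_bytes_per_turnaround(kind: str, dram_width_bytes: int, tcl: int) -> int:
    """Bytes of DRAM bandwidth lost per turnaround of the given kind.

    dram_width_bytes: 8 (single-channel) or 16 (dual-channel)
    tcl: CAS latency in memory clock cycles
    """
    memclks = {
        "dimm": 2,
        "read_to_write": 1,
        "write_to_read": tcl - 1,  # assumed mapping, see lead-in
    }
    return dram_width_bytes * EDGES_PER_MEMCLK * memclks[kind]


# Dual-channel system with Tcl = 3:
print(lost_bytes_per_turnaround("write_to_read", 16, 3))  # 64
```

Multiplying each result by the corresponding event count gives an estimate of the total bandwidth lost to that kind of turnaround.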
Abbreviation: XXXX
Value | Unit mask description |
2 | F2x[1,0]94[DcqBypassMax] counter reached |
Abbreviation: Thermal/ECC errors
Value | Unit mask description |
0 | Revision A: Reserved, Revision B: Number of clocks MEMHOT_L is asserted |
2 | Number of times the HTC transitions from inactive to active |
5 | Number of clocks HTC P-state is inactive |
6 | Number of clocks HTC P-state is active |
7 | PROCHOT_L asserted by an external source and P-state change occurred |
Abbreviation: CPU/IO req mem/IO
These events reflect request flow between units and nodes, as selected by the UNIT_MASK. The UNIT_MASK is divided into two fields: request type (CPU or I/O access to I/O or Memory) and source/target location (local vs. remote). One or more request types must be enabled via bits 3:0, and at least one source and one target location must be selected via bits 7:4. Each event reflects a request of the selected type(s) going from the selected source(s) to the selected target(s).
Not all possible paths are supported. The following table shows the UNIT_MASK values that are valid for each request type. Any of the mask values shown may be logically ORed to combine the events. For instance, local CPU requests to both local and remote nodes would be A8h | 98h = B8h. Any CPU to any I/O would be A4h | 94h | 64h = F4h (but remote CPU to remote I/O requests would not be included).
Request type | UNIT_MASK value (local source, local target) |
CPU to memory | A8h |
CPU to IO | A4h |
IO to memory | A2h |
IO to IO | A1h |
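The two worked combinations in the text above can be checked with a few lines of code. The helper name `combine_unit_masks` is hypothetical; the mask values are the ones quoted in the text.

```python
from functools import reduce
from operator import or_


def combine_unit_masks(*masks: int) -> int:
    """Logically OR several 8-bit UNIT_MASK values together."""
    for mask in masks:
        if not 0 <= mask <= 0xFF:
            raise ValueError(f"not an 8-bit mask: {mask:#x}")
    return reduce(or_, masks, 0)


# Local CPU requests to both local and remote nodes:
print(f"{combine_unit_masks(0xA8, 0x98):02X}h")        # B8h
# Any CPU to any I/O:
print(f"{combine_unit_masks(0xA4, 0x94, 0x64):02X}h")  # F4h
```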
Note: It is not possible to tell from these events how much data is going in which direction, as there is no distinction between reads and writes. Also, particularly for I/O, the requests may be for varying amounts of data, anywhere from one to sixty-four bytes. Event E5h provides an indication of 32- and 64-byte read and write transfers for such requests (although from the target point of view). For a direct measure of the amount and direction of data flowing between nodes, use events F6h, F7h and F8h.
Value | Unit mask description |
0 | I/O to I/O |
1 | I/O to memory |
2 | CPU to I/O |
3 | CPU to memory |
Abbreviation: Cache block cmd
The number of requests made to the system for cache line transfers or coherency state changes, by request type. Each increment represents one cache line transfer, except for Change-to-Dirty. If a Change-to-Dirty request hits on a line in another processor's cache that's in the Owned state, it will cause a cache line transfer, otherwise there is no data transfer associated with Change-to-Dirty requests.
Value | Unit mask description |
0 | Victim block (writeback) |
2 | Read block (Dcache load miss refill) |
3 | Read block shared (ICache refill) |
4 | Read block modified (DCache store miss refill) |
5 | Change to Dirty (first store to clean block in cache) |
Abbreviation: Sized cmd
The number of Sized Read/Write commands handled by the System Request Interface (local processor and hostbridge interface to the system). These commands may originate from the processor or hostbridge. Typical uses of the various Sized Read/Write commands are given in the UNIT_MASK table. See also event E5h, which covers commonly-used block sizes for these requests, and event ECh, which provides a separate measure of Hostbridge accesses.
Value | Unit mask description |
0 | NonPosted SzWr byte (1-32 bytes) |
1 | NonPosted SzWr DWORD (1-16 DWORDs) |
2 | Posted SzWr byte (1-32 bytes) |
3 | Posted SzWr DWORD (1-16 DWORDs) |
4 | SzRd byte (4 bytes) |
5 | SzRd DWORD (1-16 DWORDs) |
Abbreviation: Probe resp/up req
This covers two unrelated sets of events: cache probe results, and requests received by the Hostbridge from devices on non-coherent links.
Probe results: These events reflect the results of probes sent from a memory controller to local caches. They provide an indication of the degree to which data and code are shared between processors (or moved between processors due to process migration). The dirty-hit events indicate the transfer of a 64-byte cache line to the requestor (for a read or cache refill) or the target memory (for a write). The system bandwidth used by these, in terms of bytes per unit of time, may be calculated as 64 times the event count, divided by the elapsed time. Sized writes to memory that cover a full cache line do not incur this cache line transfer -- they simply invalidate the line and are reported as clean hits. Cache line transfers will occur for Change2Dirty requests that hit cache lines in the Owned state. (Such cache lines are counted as Modified-state refills for event 6Ch, System Read Responses.)
Upstream requests: The upstream read and write events reflect requests originating from a device on a local IO link. The two read events allow display refresh traffic in a UMA system to be measured separately from other DMA activity. Display refresh traffic is typically dominated by 64-byte transfers. Non-display-related DMA accesses may be anywhere from 1 to 64 bytes in size, but may be dominated by a particular size such as 32 or 64 bytes, depending on the nature of the devices.
Value | Unit mask description |
0 | Probe miss |
1 | Probe hit clean |
2 | Probe hit dirty without memory cancel |
3 | Probe hit dirty with memory cancel |
4 | Upstream display refresh/ISOC reads |
5 | Upstream non-display refresh reads |
6 | Upstream ISOC writes |
7 | Upstream non-ISOC writes |
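The bandwidth calculation given for the dirty-hit probe events (64 times the event count, divided by the elapsed time) can be sketched as follows. The function name is hypothetical, and the elapsed time is assumed to be measured by the caller in seconds.

```python
CACHE_LINE_BYTES = 64


def dirty_hit_bandwidth(dirty_hit_count: int, elapsed_seconds: float) -> float:
    """System bandwidth, in bytes per second, used by dirty probe hits."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return CACHE_LINE_BYTES * dirty_hit_count / elapsed_seconds
```

For example, 1,000,000 dirty hits over half a second correspond to 128,000,000 bytes per second.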
Abbreviation: DEV events
Value | Unit mask description |
4 | DEV hit |
5 | DEV miss |
6 | DEV error |
Abbreviation: MCT requests
Value | Unit mask description |
3 | 32 bytes sized writes |
4 | 64 bytes sized writes |
5 | 32 bytes sized reads |
6 | 64 bytes sized reads |
Abbreviation: Sideband signals
Value | Unit mask description |
0 | HALT |
1 | STOPGRANT |
2 | SHUTDOWN |
3 | WBINVD |
4 | INVD |
Abbreviation: Int events
Value | Unit mask description |
0 | Fixed |
1 | LPA |
2 | SMI |
3 | NMI |
4 | INIT |
5 | STARTUP |
6 | INT |
7 | EOI |
Abbreviation: HT0 bandwidth
Value | Unit mask description |
0 | Command DWORD sent |
1 | Address DWORD sent |
2 | Data DWORD sent |
3 | Buffer release DWORD sent |
4 | NOP DWORD sent (idle) |
5 | Per packet CRC sent |