122 lines
5.3 KiB
ReStructuredText
122 lines
5.3 KiB
ReStructuredText
![]() |
.. _imc:
|
||
|
|
||
|
OPAL/Skiboot In-Memory Collection (IMC) interface Documentation
|
||
|
===============================================================
|
||
|
|
||
|
Overview:
|
||
|
---------
|
||
|
|
||
|
In-Memory-Collection (IMC) is performance monitoring infrastrcuture
|
||
|
for counters that (once started) can be read from memory at any time by
|
||
|
an operating system. Such counters include those for the Nest and Core
|
||
|
units, enabling continuous monitoring of resource utilisation on the chip.
|
||
|
|
||
|
The API is agnostic as to how these counters are implemented. For the
|
||
|
Nest units, they're implemented by having microcode in an on-chip
|
||
|
microcontroller and for core units, they are implemented as part of core logic
|
||
|
to gather data and periodically write it to the memory locations.
|
||
|
|
||
|
Nest (On-Chip, Off-Core) unit:
|
||
|
------------------------------
|
||
|
|
||
|
Nest units have dedicated hardware counters which can be programmed
|
||
|
to monitor various chip resources such as memory bandwidth,
|
||
|
xlink bandwidth, alink bandwidth, PCI, NVlink and so on. These Nest
|
||
|
unit PMU counters can be programmed in-band via scom. But alternatively,
|
||
|
programming of these counters and periodically moving the counter data
|
||
|
to memory are offloaded to a hardware engine part of OCC (On-Chip Controller).
|
||
|
|
||
|
Microcode, starts to run at system boot in OCC complex, initialize these
|
||
|
Nest unit PMUs and periodically accumulate the nest pmu counter values
|
||
|
to memory. List of supported events by the microcode is packages as a DTS
|
||
|
and stored in IMA_CATALOG partition.
|
||
|
|
||
|
Core unit:
|
||
|
----------
|
||
|
|
||
|
Core IMC PMU counters are handled in the core-imc unit. Each core has
|
||
|
4 Core Performance Monitoring Counters (CPMCs) which are used by Core-IMC logic.
|
||
|
Two of these are dedicated to count core cycles and instructions.
|
||
|
The 2 remaining CPMCs have to multiplex 128 events each.
|
||
|
|
||
|
Core IMC hardware does not support interrupts and it peridocially (based on
|
||
|
sampling duration) fetches the counter data and accumulate to main memory.
|
||
|
Memory to accumulate counter data are refered from "PDBAR" (per-core scom)
|
||
|
and "LDBAR" per-thread spr.
|
||
|
|
||
|
Trace mode of IMC:
|
||
|
------------------
|
||
|
|
||
|
POWER9 support two modes for IMC which are the Accumulation mode and
|
||
|
Trace mode. In Accumulation mode event counts are accumulated in system
|
||
|
memory. Hypervisor/kernel then reads the posted counts periodically, or
|
||
|
when requested. In IMC Trace mode, the 64 bit trace scom value is initialized
|
||
|
with the event information. The CPMC*SEL and CPMC_LOAD in the trace scom, specifies
|
||
|
the event to be monitored and the sampling duration. On each overflow in the
|
||
|
CPMC*SEL, hardware snapshots the program counter along with event counts
|
||
|
and writes into memory pointed by LDBAR. LDBAR has bits to indicate whether
|
||
|
hardware is configured for accumulation or trace mode.
|
||
|
Currently the event monitored for trace-mode is fixed as cycle.
|
||
|
|
||
|
PMI interrupt handling is avoided, since IMC trace mode snapshots the
|
||
|
program counter and update to the memory. And this also provide a way for
|
||
|
the operating system to do instruction sampling in real time without
|
||
|
PMI(Performance Monitoring Interrupts) processing overhead.
|
||
|
|
||
|
**Example:**
|
||
|
|
||
|
Performance data using 'perf top' with and without trace-imc event:
|
||
|
|
||
|
|
||
|
*PMI interrupts count when `perf top` command is executed without trace-imc event.*
|
||
|
::
|
||
|
|
||
|
# cat /proc/interrupts (a snippet from the output)
|
||
|
9944 1072 804 804 1644 804 1306
|
||
|
804 804 804 804 804 804 804
|
||
|
804 804 1961 1602 804 804 1258
|
||
|
[-----------------------------------------------------------------]
|
||
|
803 803 803 803 803 803 803
|
||
|
803 803 803 803 804 804 804
|
||
|
804 804 804 804 804 804 803
|
||
|
803 803 803 803 803 1306 803
|
||
|
803 Performance monitoring interrupts
|
||
|
|
||
|
|
||
|
*PMI interrupts count when `perf top` command executed with trace-imc event
|
||
|
(executed right after 'perf top' without trace-imc event).*
|
||
|
::
|
||
|
|
||
|
# perf top -e trace_imc/trace_cycles/
|
||
|
12.50% [kernel] [k] arch_cpu_idle
|
||
|
11.81% [kernel] [k] __next_timer_interrupt
|
||
|
11.22% [kernel] [k] rcu_idle_enter
|
||
|
10.25% [kernel] [k] find_next_bit
|
||
|
7.91% [kernel] [k] do_idle
|
||
|
7.69% [kernel] [k] rcu_dynticks_eqs_exit
|
||
|
5.20% [kernel] [k] tick_nohz_idle_stop_tick
|
||
|
[-----------------------]
|
||
|
|
||
|
# cat /proc/interrupts (a snippet from the output)
|
||
|
|
||
|
9944 1072 804 804 1644 804 1306
|
||
|
804 804 804 804 804 804 804
|
||
|
804 804 1961 1602 804 804 1258
|
||
|
[-----------------------------------------------------------------]
|
||
|
803 803 803 803 803 803 803
|
||
|
803 803 803 804 804 804 804
|
||
|
804 804 804 804 804 804 803
|
||
|
803 803 803 803 803 1306 803
|
||
|
803 Performance monitoring interrupts
|
||
|
|
||
|
Here the PMI interrupts count remains the same.
|
||
|
|
||
|
OPAL APIs:
|
||
|
----------
|
||
|
|
||
|
The OPAL API is simple: a call to init a counter type, and calls to
|
||
|
start and stop collection. The memory locations are described in the
|
||
|
device tree.
|
||
|
|
||
|
See :ref:`opal-imc-counters` and :ref:`device-tree/imc`
|