82 lines
3.7 KiB
ReStructuredText
82 lines
3.7 KiB
ReStructuredText
.. _skiboot-5.10.3:
|
|
|
|
==============
|
|
skiboot-5.10.3
|
|
==============
|
|
|
|
skiboot 5.10.3 was released on Thursday March 28th, 2018. It replaces
|
|
:ref:`skiboot-5.10.2` as the current stable release in the 5.10.x series.
|
|
|
|
It is recommended that 5.10.3 be used instead of any previous 5.10.x version
|
|
due to the bug fixes and debugging enhancements in it.
|
|
|
|
Over :ref:`skiboot-5.10.2`, we have a few improvements and bug fixes:
|
|
|
|
- NPU2: dump NPU2 registers on npu2 HMI
|
|
|
|
Due to the nature of debugging npu2 issues, folk are wanting the
|
|
full list of NPU2 registers dumped when there's a problem.
|
|
|
|
This is different than the solution introduced in 5.10.1
|
|
as there we would dump the registers in a way that would trigger a FIR
|
|
bit that would confuse PRD.
|
|
- npu2: Add performance tuning SCOM inits
|
|
|
|
Peer-to-peer GPU bandwidth latency testing has produced some tunable
|
|
values that improve performance. Add them to our device initialization.
|
|
|
|
File these under things that need to be cleaned up with nice #defines
|
|
for the register names and bitfields when we get time.
|
|
|
|
A few of the settings are dependent on the system's particular NVLink
|
|
topology, so introduce a helper to determine how many links go to a
|
|
single GPU.
|
|
- hw/npu2: Assign a unique LPARSHORTID per GPU
|
|
|
|
This gets used elsewhere to index items in the XTS tables.
|
|
- occ: Set up OCC messaging even if we fail to setup pstates
|
|
|
|
This means that we no longer hit this bug if we fail to get valid pstates
|
|
from the OCC. ::
|
|
|
|
[console-pexpect]#echo 1 > //sys/firmware/opal/sensor_groups//occ-csm0/clear
|
|
echo 1 > //sys/firmware/opal/sensor_groups//occ-csm0/clear
|
|
[ 94.019971181,5] CPU ATTEMPT TO RE-ENTER FIRMWARE! PIR=083d cpu @0x33cf4000 -> pir=083d token=8
|
|
[ 94.020098392,5] CPU ATTEMPT TO RE-ENTER FIRMWARE! PIR=083d cpu @0x33cf4000 -> pir=083d token=8
|
|
[ 10.318805] Disabling lock debugging due to kernel taint
|
|
[ 10.318808] Severe Machine check interrupt [Not recovered]
|
|
[ 10.318812] NIP [000000003003e434]: 0x3003e434
|
|
[ 10.318813] Initiator: CPU
|
|
[ 10.318815] Error type: Real address [Load/Store (foreign)]
|
|
[ 10.318817] opal: Hardware platform error: Unrecoverable Machine Check exception
|
|
[ 10.318821] CPU: 117 PID: 2745 Comm: sh Tainted: G M 4.15.9-openpower1 #3
|
|
[ 10.318823] NIP: 000000003003e434 LR: 000000003003025c CTR: 0000000030030240
|
|
[ 10.318825] REGS: c00000003fa7bd80 TRAP: 0200 Tainted: G M (4.15.9-openpower1)
|
|
[ 10.318826] MSR: 9000000000201002 <SF,HV,ME,RI> CR: 48002888 XER: 20040000
|
|
[ 10.318831] CFAR: 0000000030030258 DAR: 394a00147d5a03a6 DSISR: 00000008 SOFTE: 1
|
|
- core/fast-reboot: disable fast reboot upon fundamental entry/exit/locking errors
|
|
|
|
This disables fast reboot in several more cases where serious errors
|
|
like lock corruption or call re-entrancy are detected.
|
|
- core/opal: allow some re-entrant calls
|
|
|
|
This allows a small number of OPAL calls to succeed despite re-entering
|
|
the firmware, and rejects others rather than aborting.
|
|
|
|
This allows a system reset interrupt that interrupts OPAL to do something
|
|
useful. Sreset other CPUs, use the console, which allows xmon to work or
|
|
stack traces to be printed, reboot the system.
|
|
|
|
Use OPAL_INTERNAL_ERROR when rejecting, rather than OPAL_BUSY, which is
|
|
used for many other things that does not mean a serious permanent error.
|
|
- core/opal: abort in case of re-entrant OPAL call
|
|
|
|
The stack is already destroyed by the time we get here, so there
|
|
is not much point continuing.
|
|
- npu2: Disable fast reboot
|
|
|
|
Fast reboot does not yet work right with the NPU. It's been disabled on
|
|
NVLink and OpenCAPI machines. Do the same for NVLink2.
|
|
|
|
This amounts to a port of 3e4577939bbf ("npu: Fix broken fast reset")
|
|
from the npu code to npu2.
|