.. _skiboot-6.3:

skiboot-6.3
===========

skiboot v6.3 was released on Friday May 3rd 2019. It is the first
release of skiboot 6.3, which becomes the new stable release
of skiboot following the 6.2 release, first released December 14th 2018.

Skiboot 6.3 will mark the basis for op-build v2.3.

skiboot v6.3 contains all bug fixes as of :ref:`skiboot-6.0.20`,
and :ref:`skiboot-6.2.3` (the currently maintained
stable releases).

For details on how the skiboot stable releases work, see :ref:`stable-rules`.

Over skiboot 6.2, we have the following changes:

.. _skiboot-6.3-new-features:

New Features
------------

- hw/imc: Enable opal calls to init/start/stop IMC Trace mode

  New OPAL APIs for In-Memory Collection Counter infrastructure (IMC),
  including a new device type called OPAL_IMC_COUNTERS_TRACE.
- xive: Add calls to save/restore the queues and VPs HW state

  To be able to support migration of guests using the XIVE native
  exploitation mode (where the queue is effectively owned by the
  guest), KVM needs to be able to save and restore the HW-modified
  fields of the queue, such as the current queue producer pointer and
  generation bit, and to retrieve the modified thread context registers
  of the VP from the NVT structure: the VP interrupt pending bits.

  However, there is no need to set back the NVT structure on P9. P10
  should be the same.
- witherspoon: Add nvlink2 interconnect information

  GPUs on Redbud and Sequoia platforms are interconnected in groups of
  2 or 3 GPUs. The problem with that is if the user decides to pass a single
  GPU from a group to the userspace, we need to ensure that links between
  GPUs do not get enabled.

  A V100 GPU provides a way to disable selected links. In order to only
  disable links to peer GPUs, we need a topology map.

  This adds an "ibm,nvlink-peers" property to a GPU DT node with phandles
  of peer GPUs and NVLink2 bridges. The index in the property is a GPU link
  number.
- platforms/romulus: Also support talos

  The two are similar enough and I'd like to have a slot table for our
  Talos.
- OpenCAPI support! (see :ref:`skiboot-6.3-OpenCAPI` section)
- opal/hmi: set a flag to inform OS that TOD/TB has failed.

  Set a flag to inform the OS about TOD/TB failure as part of the new
  opal_handle_hmi2 handler. The OS can then use this flag to make sure
  functions depending on the TB value (e.g. udelay()) are aware of the TB
  not ticking.
- astbmc: Enable IPMI HIOMAP for AMI platforms

  Required for Habanero, Palmetto and Romulus.
- power-mgmt : occ : Add 'freq-domain-mask' DT property

  Add a new device-tree property freq-domain-mask to define the group of
  CPUs which share the same frequency. This property has been added under
  the power-mgmt node. It is a bitmask.

  A bitwise AND is taken between this bitmask value and the PIR of the CPU.
  All the CPUs lying in the same frequency domain will have the same result
  for the AND.

  For example, for POWER9, 0xFFF0 indicates a quad-wide frequency domain.
  Taking the AND with the PIR of the CPUs yields a quad-wise frequency
  domain, as the last 4 bits, which represent the cores, have been masked.

  Similarly, 0xFFF8 represents a core-wide frequency domain for P8.

  Also, add a new device-tree property domain-runs-at which denotes the
  strategy the OCC uses to change the frequency of a frequency domain. There
  are two strategies: FREQ_MOST_RECENTLY_SET and FREQ_MAX_IN_DOMAIN.

  FREQ_MOST_RECENTLY_SET : the OCC sets the frequency of the quad to the most
  recent frequency value requested by the CPUs in the quad.

  FREQ_MAX_IN_DOMAIN : the OCC sets the frequency of the CPUs in
  the Quad to the maximum of the latest frequency requested by each of
  the component cores.
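
  As a rough illustration of the mask arithmetic described for the
  frequency-domain property, the sketch below derives a domain identifier
  from a PIR and compares two CPUs. The helper names are made up for this
  example and are not skiboot or Linux APIs; only the mask values come from
  the text.

  .. code-block:: c

     #include <stdbool.h>
     #include <stdint.h>

     /* Example mask values quoted in the release note. */
     #define FREQ_DOMAIN_MASK_P9_QUAD 0xFFF0u
     #define FREQ_DOMAIN_MASK_P8_CORE 0xFFF8u

     /* Hypothetical helper: the domain a CPU belongs to is PIR & mask. */
     static inline uint32_t freq_domain_id(uint32_t pir, uint32_t mask)
     {
         return pir & mask;
     }

     /* Two CPUs share a frequency domain when their masked PIRs match. */
     static inline bool same_freq_domain(uint32_t pir_a, uint32_t pir_b,
                                         uint32_t mask)
     {
         return freq_domain_id(pir_a, mask) == freq_domain_id(pir_b, mask);
     }

  With the POWER9 mask, PIRs 0x10 and 0x1F land in the same domain, while
  0x0F and 0x10 do not.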
- powercap: occ: Fix the powercapping range allowed for user

  The OCC provides two limits for the minimum powercap. One is the hard
  powercap minimum, which is guaranteed by the OCC, and the other is a soft
  powercap minimum, which is lower than hard-min and may or may not be
  asserted due to various power-thermal reasons. To allow users
  to access the entire powercap range, this patch exports the soft powercap
  minimum as the "powercap-min" DT property. It also adds a new
  DT property called "powercap-hard-min" to export the hard-min powercap
  limit.
- Add NVDIMM support

  NVDIMMs are memory modules that use a battery backup system to allow the
  contents of RAM to be saved to non-volatile storage if system power goes
  away unexpectedly. This allows them to be used as a high-performance
  storage device, suitable for serving as a cache for SSDs and the like.

  Configuration of NVDIMMs is handled by hostboot and communicated to OPAL
  via the HDAT. We need to parse out the NVDIMM memory ranges and create
  memory regions with the "pmem-region" compatible label to make them
  available to the host.
- core/exceptions: implement support for MCE interrupts in powersave

  The ISA specifies that MCE interrupts in power saving modes will enter
  at 0x200 with powersave bits in SRR1 set. This is not currently
  supported properly: the MCE will just happen like a normal interrupt,
  but GPRs could be lost, which would lead to crashes (e.g., r1, r2, r13
  etc).

  So check the power save bits similarly to the sreset vector, and
  handle this properly.
- core/exceptions: allow recoverable sreset exceptions

  This requires implementing the MSR[RI] bit. Then just allow all
  non-fatal sreset exceptions to recover.
- core/exceptions: implement an exception handler for non-powersave sresets

  Detect non-powersave sresets and send them to the normal exception
  handler which prints registers and stack.
- Add PVR_TYPE_P9P

  Enable a new PVR to get us running on another p9 variant.

Since v6.3-rc2:

- Expose PNOR Flash partitions to host MTD driver via devicetree

  This makes it possible for the host to directly address each
  partition without requiring each application to directly parse
  the FFS headers. This has been in use for some time already to
  allow BOOTKERNFW partition updates from the host.

  All partitions except BOOTKERNFW are marked readonly.

  The BOOTKERNFW partition is currently exclusively used by the Talos II platform.

- Write boot progress to LPC port 80h

  This is an adaptation of what we currently do for op_display() on FSP
  machines, inventing an encoding for what we can write into the single
  byte at LPC port 80h.

  Port 80h is often used on x86 systems to indicate boot progress/status
  and dates back a decent amount of time. Since a byte isn't exactly very
  expressive for everything that can go on (and wrong) during boot, it's
  all about compromise.

  Some systems (such as Zaius/Barreleye G2) have a physical dual 7 segment
  display that displays these codes. So far, this has only been driven by
  hostboot (see hostboot commit 90ec2e65314c).

- Write boot progress to LPC ports 81 and 82

  There's a thought to write more extensive boot progress codes to LPC
  ports 81 and 82 to supplement/replace any reliance on port 80.

  We want to still emit port 80 for platforms like Zaius and Barreleye
  that have the physical display. Ports 81 and 82 can be monitored by a
  BMC though.

- Add Talos II platform

  Talos II has some hardware differences from Romulus, therefore
  we cannot guarantee Talos II == Romulus in skiboot. Copy and
  slightly modify the Romulus files for Talos II.

Since v6.3-rc1:

- cpufeatures: Add tm-suspend-hypervisor-assist and tm-suspend-xer-so-bug node

  tm-suspend-hypervisor-assist for P9 >=DD2.2
  And a tm-suspend-xer-so-bug node for P9 DD2.2 only.

  I also treat P9P as P9 DD2.3 and add a unit test for the cpufeatures
  infrastructure.

  Fixes: https://github.com/open-power/skiboot/issues/233


Deprecated/Removed Features
---------------------------

- opal: Deprecate reading the PHB status

  The OPAL_PCI_EEH_FREEZE_STATUS call takes a bunch of parameters, one of
  them is @phb_status. It is defined as __be64* and always NULL in
  the current Linux upstream but if anyone ever decides to read that status,
  then the PHB3's handler will assume it is struct OpalIoPhb3ErrorData*
  (which is a lot bigger than 8 bytes) and zero it, causing stack
  corruption; p7ioc-phb has the same issue.

  This removes @phb_status from all eeh_freeze_status() hooks and moves
  the error message from PHB4 to the affected OPAL handlers.

  As far as we can tell, nobody has ever used this and thus it's safe to remove.
- Remove POWER9N DD1 support

  This is not a shipping product and is no longer supported by Linux
  or other firmware components.

Since v6.3-rc3:

- Disable fast-reset for POWER8

  There is a bug with fast-reset when CPU cores are busy, which can be
  reproduced by running `stress` and then trying `reboot -ff` (this is
  what the op-test test cases FastRebootHostStress and
  FastRebootHostStressTorture do). What happens is the cores lock up,
  which isn't the best thing in the world when you want them to start
  executing instructions again.

  A workaround is to use instruction ramming, which while greatly
  increasing the reliability of fast-reset on p8, doesn't make it perfect.

  Instruction ramming is what pdbg was modified to do in order to have the
  sreset functionality work reliably on p8.

  pdbg patches: https://patchwork.ozlabs.org/project/pdbg/list/?series=96593&state=*

  Fixes: https://github.com/open-power/skiboot/issues/185

General
-------

- core/i2c: Various bits of refactoring
- refactor backtrace generation infrastructure
- astbmc: Handle failure to initialise raw flash

  Initialising raw flash led to a dead assignment to rc. Check the return
  code and take the failure path as necessary. Both before and after the
  fix we see output along the lines of the following when flash_init()
  fails: ::

    [ 53.283182881,7] IRQ: Registering 0800..0ff7 ops @0x300d4b98 (data 0x3052b9d8)
    [ 53.283184335,7] IRQ: Registering 0ff8..0fff ops @0x300d4bc8 (data 0x3052b9d8)
    [ 53.283185513,7] PHB#0000: Initializing PHB...
    [ 53.288260827,4] FLASH: Can't load resource id:0. No system flash found
    [ 53.288354442,4] FLASH: Can't load resource id:1. No system flash found
    [ 53.342933439,3] CAPP: Error loading ucode lid. index=200ea
    [ 53.462749486,2] NVRAM: Failed to load
    [ 53.462819095,2] NVRAM: Failed to load
    [ 53.462894236,2] NVRAM: Failed to load
    [ 53.462967071,2] NVRAM: Failed to load
    [ 53.463033077,2] NVRAM: Failed to load
    [ 53.463144847,2] NVRAM: Failed to load

  Eventually followed by: ::

    [ 57.216942479,5] INIT: platform wait for kernel load failed
    [ 57.217051132,5] INIT: Assuming kernel at 0x20000000
    [ 57.217127508,3] INIT: ELF header not found. Assuming raw binary.
    [ 57.217249886,2] NVRAM: Failed to load
    [ 57.221294487,0] FATAL: Kernel is zeros, can't execute!
    [ 57.221397429,0] Assert fail: core/init.c:615:0
    [ 57.221471414,0] Aborting!
    CPU 0028 Backtrace:
     S: 0000000031d43c60 R: 000000003001b274 ._abort+0x4c
     S: 0000000031d43ce0 R: 000000003001b2f0 .assert_fail+0x34
     S: 0000000031d43d60 R: 0000000030014814 .load_and_boot_kernel+0xae4
     S: 0000000031d43e30 R: 0000000030015164 .main_cpu_entry+0x680
     S: 0000000031d43f00 R: 0000000030002718 boot_entry+0x1c0
     --- OPAL boot ---

  Analysis of the execution paths suggests we'll always "safely" end this
  way due to the setup sequence for the blocklevel callbacks in flash_init()
  and error handling in blocklevel_get_info(), and there's no current risk
  of executing from unexpected memory locations. As such the issue is
  reduced down to a fix for poor error hygiene in the original change
  and a resolution for a Coverity warning (famous last words etc).
- core/flash: Retry requests as necessary in flash_load_resource()

  We would like to successfully boot if we have a dependency on the BMC
  for flash even if the BMC is not currently ready to service flash
  requests. On the assumption that it will become ready, retry for several
  minutes to cover a BMC reboot cycle and *eventually* rather than
  *immediately* crash out with: ::

    [ 269.549748] reboot: Restarting system
    [ 390.297462587,5] OPAL: Reboot request...
    [ 390.297737995,5] RESET: Initiating fast reboot 1...
    [ 391.074707590,5] Clearing unused memory:
    [ 391.075198880,5] PCI: Clearing all devices...
    [ 391.075201618,7] Clearing region 201ffe000000-201fff800000
    [ 391.086235699,5] PCI: Resetting PHBs and training links...
    [ 391.254089525,3] FFS: Error 17 reading flash header
    [ 391.254159668,3] FLASH: Can't open ffs handle: 17
    [ 392.307245135,5] PCI: Probing slots...
    [ 392.363723191,5] PCI Summary:
    ...
    [ 393.423255262,5] OCC: All Chip Rdy after 0 ms
    [ 393.453092828,5] INIT: Starting kernel at 0x20000000, fdt at
    0x30800a88 390645 bytes
    [ 393.453202605,0] FATAL: Kernel is zeros, can't execute!
    [ 393.453247064,0] Assert fail: core/init.c:593:0
    [ 393.453289682,0] Aborting!
    CPU 0040 Backtrace:
     S: 0000000031e03ca0 R: 000000003001af60 ._abort+0x4c
     S: 0000000031e03d20 R: 000000003001afdc .assert_fail+0x34
     S: 0000000031e03da0 R: 00000000300146d8 .load_and_boot_kernel+0xb30
     S: 0000000031e03e70 R: 0000000030026cf0 .fast_reboot_entry+0x39c
     S: 0000000031e03f00 R: 0000000030002a4c fast_reset_entry+0x2c
     --- OPAL boot ---

  The OPAL flash API hooks directly into the blocklevel layer, so there's
  no delay for e.g. the host kernel, just for asynchronously loaded
  resources during boot.
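
  The retry idea for flash resource loading can be sketched roughly as
  below. This is not the flash_load_resource() code: the status codes, the
  callback types and the fake BMC are all invented for the example, and the
  only point is the shape of "retry while busy, give up after a budget".

  .. code-block:: c

     /* Hypothetical status codes, for illustration only. */
     enum load_status { LOAD_OK = 0, LOAD_BUSY = 1, LOAD_ERROR = 2 };

     /* Hypothetical callbacks: one load attempt, and a delay primitive. */
     typedef enum load_status (*try_load_fn)(void *ctx);
     typedef void (*delay_ms_fn)(unsigned int ms);

     /*
      * Retry while the backend reports "busy", giving up once timeout_ms
      * has been spent waiting. Any other result is returned immediately.
      */
     static enum load_status load_with_retry(try_load_fn try_load, void *ctx,
                                             delay_ms_fn delay_ms,
                                             unsigned int interval_ms,
                                             unsigned int timeout_ms)
     {
         unsigned int waited_ms = 0;
         enum load_status rc;

         for (;;) {
             rc = try_load(ctx);
             if (rc != LOAD_BUSY)
                 return rc;
             if (waited_ms >= timeout_ms)
                 return LOAD_BUSY;
             delay_ms(interval_ms);
             waited_ms += interval_ms;
         }
     }

     /* Fake backend for demonstration: busy for the first few attempts. */
     struct fake_bmc {
         int busy_attempts_left;
         int attempts;
     };

     static enum load_status fake_try_load(void *ctx)
     {
         struct fake_bmc *bmc = ctx;

         bmc->attempts++;
         if (bmc->busy_attempts_left > 0) {
             bmc->busy_attempts_left--;
             return LOAD_BUSY;
         }
         return LOAD_OK;
     }

     static void no_delay(unsigned int ms)
     {
         (void)ms;   /* a real caller would sleep or poll here */
     }

  A caller that passes a budget of several minutes gets the "eventually
  rather than immediately" behaviour described for this change; hard
  errors still fail fast.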
- fast-reboot: occ: Call occ_pstates_init() on fast-reset on all machines

  Commit 815417dcda2e ("init, occ: Initialise OCC earlier on BMC systems")
  conditionally invoked occ_pstates_init() only on FSP based systems in
  load_and_boot_kernel(). Because of this, the pstate table is re-parsed on
  FSP systems and skipped on BMC systems during fast-reboot. This patch fixes
  this by invoking occ_pstates_init() on all boxes during fast-reboot.
- opal/hmi: Don't retry TOD recovery if it is already in failed state.

  On TOD failure, all cores/threads receive an HMI, and the very first thread
  that gets the interrupt fixes the TOD whereas the others just reset the
  respective HMER error bit and return. But when the TOD is unrecoverable,
  all the threads try to do TOD recovery one by one, causing threads to spend
  more time inside OPAL. Set a global flag when the TOD is unrecoverable so
  that the rest of the threads go back to Linux immediately, avoiding
  lock-ups in the system reboot/panic path.
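
  A much simplified sketch of the "remember that recovery already failed"
  idea follows. All names here are hypothetical, and real firmware would
  also need locking around the flag; the sketch only shows that later
  callers skip the expensive recovery attempt once it has failed.

  .. code-block:: c

     #include <stdbool.h>

     /* Hypothetical global flag: set once TOD recovery has failed. */
     static bool tod_unrecoverable;

     /* Hypothetical recovery hook: returns true if the TOD was fixed. */
     typedef bool (*tod_recover_fn)(void);

     /* Returns true if the TOD is usable after handling the error. */
     static bool handle_tod_error(tod_recover_fn try_recover)
     {
         if (tod_unrecoverable)
             return false;          /* already known bad: return at once */
         if (try_recover())
             return true;
         tod_unrecoverable = true;  /* later callers skip recovery */
         return false;
     }

     /* Fake recovery routine for demonstration: always fails. */
     static int recovery_attempts;

     static bool failing_recovery(void)
     {
         recovery_attempts++;
         return false;
     }

  Calling handle_tod_error(failing_recovery) twice runs the recovery
  routine only once.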
- hw/bt: Do not disable ipmi message retry during OPAL boot

  Currently OPAL doesn't know whether the BMC is functioning or not. If the
  BMC is down (e.g. during a BMC reboot), we keep retrying to send the
  message to the BMC. So in some corner cases we may hit a hard lockup issue
  in the kernel.

  Ideally we should avoid using the synchronous path as much as possible. But
  for now commit 01f977c3 added an option to disable message retry in the
  synchronous path. This fix is not required during boot, however. Hence, do
  not disable IPMI message retry during OPAL boot.
- hdata/memory: Fix warning message

  Even though we added memory to the device tree, we were getting the warning below. ::

    [ 57.136949696,3] Unable to use memory range 0 from MSAREA 0
    [ 57.137049753,3] Unable to use memory range 0 from MSAREA 1
    [ 57.137152335,3] Unable to use memory range 0 from MSAREA 2
    [ 57.137251218,3] Unable to use memory range 0 from MSAREA 3
- hw/bt: Add backend interface to disable ipmi message retry option

  During boot OPAL makes an IPMI_GET_BT_CAPS call to the BMC to get the BT
  interface capabilities, which include the IPMI message max resend count,
  message timeout, etc. Most of the time OPAL gets a response from the BMC
  within the specified timeout. In some corner cases (like an mboxd daemon
  reset in the BMC, a BMC reboot, etc.) OPAL may not get a response within
  the timeout period. In such scenarios, OPAL resends the message until the
  max resend count is reached.

  OPAL uses synchronous IPMI messages (ipmi_queue_msg_sync()) for a few
  operations like flash read, write, etc. The thread will wait in OPAL until
  it gets a response from the BMC. In some corner cases like a BMC reboot,
  the thread may wait in OPAL for a long time (more than 20 seconds),
  resulting in a kernel hard lockup.

  This patch introduces a new interface to disable the message resend
  option. We will disable the message resend option for synchronous
  messages. This will greatly reduce kernel hard lockup issues.

  This is a short term fix. The long term solution is to convert all
  synchronous messages to asynchronous ones.
- ipmi/power: Fix system reboot issue

  The kernel makes a reboot/shutdown OPAL call for reboot/shutdown. Once the
  kernel gets a response from OPAL it runs opal_poll_events() until firmware
  handles the request.

  On BMC based systems, OPAL makes an IPMI call (IPMI_CHASSIS_CONTROL) to
  initiate system reboot/shutdown. At present OPAL queues the IPMI message
  and returns SUCCESS to the host. If the BMC is not ready to accept the
  command (e.g. during a BMC reboot), then the message will fail and we have
  to manually reboot/shutdown the system using the BMC interface.

  This patch adds logic to validate the message return value. If the message
  failed, it is resent. At some stage the BMC will be ready to accept the
  message and handle it.
- firmware-versions: Add test case for parsing VERSION

  Also make it possible to use with afl-lop/afl-fuzz just to help make
  *sure* we're all good.

  Additionally, if we hit an entry in VERSION that is larger than our
  buffer size, we skip over it gracefully rather than overwriting the
  stack. This is only a problem if VERSION isn't trusted, which as of
  4b8cc05a94513816d43fb8bd6178896b430af08f it is verified as part of
  Secure Boot.
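
  To make the "skip over it gracefully" behaviour concrete, here is a small
  sketch of line-by-line parsing into a fixed-size buffer. It is not the
  skiboot parser: ENTRY_MAX, the callback type and the function names are
  assumptions chosen for the example.

  .. code-block:: c

     #include <stddef.h>
     #include <string.h>

     #define ENTRY_MAX 32   /* hypothetical buffer size */

     typedef void (*entry_fn)(const char *entry, void *priv);

     /*
      * Walk newline-separated entries in data[0..len). Entries that fit in
      * the local buffer are copied, NUL-terminated and handed to cb;
      * entries that do not fit are skipped without ever being copied.
      * Returns the number of skipped (overlong) entries.
      */
     static size_t for_each_version_entry(const char *data, size_t len,
                                          entry_fn cb, void *priv)
     {
         char buf[ENTRY_MAX];
         size_t pos = 0, skipped = 0;

         while (pos < len) {
             const char *start = data + pos;
             const char *nl = memchr(start, '\n', len - pos);
             size_t n = nl ? (size_t)(nl - start) : len - pos;

             if (n >= sizeof(buf)) {
                 skipped++;             /* too long for buf: skip it */
             } else if (n > 0) {
                 memcpy(buf, start, n);
                 buf[n] = '\0';
                 cb(buf, priv);
             }
             pos += n + (nl ? 1 : 0);   /* step past the newline, if any */
         }
         return skipped;
     }

     /* Demonstration callback: count the entries that were accepted. */
     static void count_entry(const char *entry, void *priv)
     {
         (void)entry;
         (*(int *)priv)++;
     }

  The length check happens before any copy, so an untrusted, overlong
  entry can at worst be ignored.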
- core/fast-reboot: improve NMI handling during fast reset

  Improve sreset and MCE handling in fast reboot. Switch the HILE bit
  off before copying OPAL's exception vectors, so NMIs can be handled
  properly. Also disable MSR[ME] while the vectors are being overwritten.
- core/cpu: HID update race

  If the per-core HID register is updated concurrently by multiple
  threads, updates can get lost. This has been observed during fast
  reboot where the HILE bit does not get cleared on all cores, which
  can cause machine check exception interrupts to crash.

  Fix this by only updating HID on thread0.
- SLW: Print verbose info on errors only

  Change print level from debug to warning for reporting
  bad EC_PPM_SPECIAL_WKUP_* scom values. To reduce cluttering
  in the log print only on error.

Since v6.3-rc2:

- hw/xscom: add missing P9P chip name
- asm/head: balance branches to avoid link stack predictor mispredicts

  The Linux wrapper for OPAL call and return is arranged like this: ::

    __opal_call:
        mflr r0
        std r0,PPC_STK_LROFF(r1)
        LOAD_REG_ADDR(r11, opal_return)
        mtlr r11
        hrfid -> OPAL

    opal_return:
        ld r0,PPC_STK_LROFF(r1)
        mtlr r0
        blr

  When skiboot returns to Linux, it branches to LR (i.e., opal_return)
  with a blr. This unbalances the link stack predictor and will cause
  mispredicts back up the return stack.
- external/mambo: also invoke readline for the non-autorun case
- asm/head.S: set POWER9 radix HID bit at entry

  When running in virtual memory mode, the radix MMU hid bit should not
  be changed, so set this in the initial boot SPR setup.

  As a side effect, fast reboot also has HID0:RADIX bit set by the
  shared spr init, so no need for an explicit call.
- build: link with --orphan-handling=warn

  The linker can warn when the linker script does not explicitly place
  all sections. These orphan sections are placed according to
  heuristics, which may not always be desirable. Enable this warning.
- build: -fno-asynchronous-unwind-tables

  skiboot does not use unwind tables, so this option saves about 100kB,
  mostly from .text.
- opal/hmi: Initialize the hmi event with old value of TFMR.

  Do this before we fix TFAC errors. Otherwise the event at host console
  shows no thread error reported in TFMR register.

  Without this patch the console event shows TFMR with no thread error:
  (DEC parity error TFMR[59] injection) ::

    [ 53.737572] Severe Hypervisor Maintenance interrupt [Recovered]
    [ 53.737596] Error detail: Timer facility experienced an error
    [ 53.737611] HMER: 0840000000000000
    [ 53.737621] TFMR: 3212000870e04000

  After this patch it shows old TFMR value on host console: ::

    [ 2302.267271] Severe Hypervisor Maintenance interrupt [Recovered]
    [ 2302.267305] Error detail: Timer facility experienced an error
    [ 2302.267320] HMER: 0840000000000000
    [ 2302.267330] TFMR: 3212000870e14010


IBM FSP based platforms
-----------------------

- platforms/firenze: Rework I2C controller fixups
- platforms/zz: Re-enable LXVPD slot information parsing

  From memory this was disabled in the distant past since we were waiting
  for an update to the LXVPD format. It looks like that never happened
  so re-enable it for the ZZ platform so that we can get PCI slot location
  codes on ZZ.

HIOMAP
------

- astbmc: Try IPMI HIOMAP for P8

  The HIOMAP protocol was developed after the release of P8 in preparation
  for P9. As a consequence P9 always uses it, but it has rarely been
  enabled for P8. P8DTU has recently added IPMI HIOMAP support to its BMC
  firmware, so enable its use in skiboot with P8 machines. Doing so
  requires some rework to ensure fallback works correctly as in the past
  the fallback was to mbox, which will only work for P9.
- libflash/ipmi-hiomap: Enforce message size for empty response

  The protocol defines the response to the associated messages as empty
  except for the command ID and sequence fields. If the BMC is returning
  extra data consider the message malformed.
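
  As a sketch of what such a size check might look like, assuming a
  response layout of one command ID byte followed by one sequence byte
  (the layout, the constant and the function name are assumptions for this
  example, not taken from libflash):

  .. code-block:: c

     #include <stdbool.h>
     #include <stddef.h>

     /* Assumed layout: byte 0 = command ID, byte 1 = sequence number. */
     #define EMPTY_RESP_LEN 2

     /* Hypothetical validator for responses that carry no payload. */
     static bool empty_resp_valid(const unsigned char *resp, size_t resp_len,
                                  unsigned char cmd, unsigned char seq)
     {
         if (resp_len != EMPTY_RESP_LEN)
             return false;   /* short, or trailing data: malformed */
         return resp[0] == cmd && resp[1] == seq;
     }

  Checking for an exact length, rather than a minimum, is what rejects a
  BMC that returns extra data.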
- libflash/ipmi-hiomap: Remove unused close handling

  Issuing a HIOMAP_C_CLOSE is not required by the protocol specification,
  rather a close can be implicit in a subsequent
  CREATE_{READ,WRITE}_WINDOW request. The implicit close provides an
  opportunity to reduce LPC traffic and the implementation takes up that
  optimisation, so remove the case from the IPMI callback handler.
- libflash/ipmi-hiomap: Overhaul event handling

  Reworking the event handling was inspired by a bug report by Vasant
  where the host would get wedged on multiple flash access attempts in the
  face of a persistent error state on the BMC-side. The cause of this bug
  was the early-exit based on ctx->update, which erroneously assumed that
  all events had been completely handled in prior calls to
  ipmi_hiomap_handle_events(). This is not true if e.g.
  HIOMAP_E_DAEMON_READY is clear in the prior calls.

  Regardless, there were other correctness and efficiency problems with
  the handling strategy:

  * Ack-able event state was not restored in the face of errors in the
    process of re-establishing protocol state
  * It forced needless window restoration with respect to the context in
    which ipmi_hiomap_handle_events() was called.
  * Tests for HIOMAP_E_DAEMON_READY and HIOMAP_E_FLASH_LOST were redundant
    with the overhauled error handling introduced in the previous patch

  Fix all of the above issues and add comments to explain the event
  handling flow.
- libflash/ipmi-hiomap: Overhaul error handling

  The aim is to improve the robustness with respect to absence of the
  BMC-side daemon. The current error handling roughly mirrors what was
  done for the mailbox implementation, but there's room for improvement.

  Errors are split into two classes, those that affect the transport state
  and those that affect the window validity. From here, we push the
  transport state error checks right to the bottom of the stack, to ensure
  the link is known to be in a good state before any message is sent.
  Window validity tests remain as they were in the hiomap_window_move()
  and ipmi_hiomap_read() functions. Validity tests are not necessary in
  the write and erase paths as we will receive an error response from the
  BMC when performing a dirty or flush on an invalid window.

  Recovery also remains as it was, done on entry to the blocklevel
  callbacks. If an error state is encountered in the middle of an
  operation no attempt is made to recover it on the spot, instead the
  error is returned up the stack and the caller can choose how it wishes
  to respond.
- libflash/ipmi-hiomap: Fix leak of msg in callback

Since v6.3-rc1:

- libflash/ipmi-hiomap: Fix blocks count issue

  We convert the data size to a block count and pass the block count to the
  BMC. If the data size is not block aligned then we end up sending a block
  count smaller than the actual data. The BMC will write partial data to
  flash memory.

  Sample log ::

    [ 594.388458416,7] HIOMAP: Marked flash dirty at 0x42010 for 8
    [ 594.398756487,7] HIOMAP: Flushed writes
    [ 594.409596439,7] HIOMAP: Marked flash dirty at 0x42018 for 3970
    [ 594.419897507,7] HIOMAP: Flushed writes

  In this case HIOMAP sent data with block count=0 and hence the BMC didn't
  flush data to flash.
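
  The arithmetic behind the block count problem can be sketched as follows.
  The function name is made up for this example, and the 4 KiB block size
  (shift of 12) used in the note underneath is an assumption; the
  offsets and lengths are the ones from the sample log.

  .. code-block:: c

     #include <stdint.h>

     /*
      * Number of blocks touched by len bytes starting at byte offset pos,
      * for a block size of (1 << block_shift) bytes. Rounds up, and also
      * accounts for pos not being block aligned.
      */
     static inline uint64_t blocks_spanned(uint64_t pos, uint64_t len,
                                           unsigned int block_shift)
     {
         uint64_t block_size = UINT64_C(1) << block_shift;
         uint64_t offset_in_block = pos & (block_size - 1);

         return (offset_in_block + len + block_size - 1) >> block_shift;
     }

  With 4 KiB blocks, a plain ``len >> 12`` gives 0 for both writes in the
  log (8 and 3970 bytes), whereas the rounded-up count is 1 for each.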

POWER8
------

- hw/phb3/naples: Disable D-states

  Putting "Mellanox Technologies MT27700 Family [ConnectX-4] [15b3:1013]"
  (more precisely, the second of its 2 PCI functions, no matter in what
  order) into the D3 state causes EEH with the "PCT timeout" error.
  This has been noticed on garrison machines only and firestones do not
  seem to have this issue.

  This disables D-states changing for devices on root buses on Naples by
  installing a config space access filter (copied from PHB4).
- cpufeatures: Always advertise POWER8NVL as DD2

  Despite the major version of PVR being 1 (0x004c0100) for POWER8NVL,
  these chips are functionally equivalent to P8/P8E DD2 levels.

  This advertises POWER8NVL as DD2. As a result, skiboot adds
  ibm,powerpc-cpu-features/processor-control-facility for such CPUs and
  the Linux kernel can use hypervisor doorbell messages to wake secondary
  threads; otherwise "KVM: CPU %d seems to be stuck" would appear because
  of missing LPCR_PECEDH.
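
  For readers unfamiliar with the PVR fields, the sketch below decodes the
  example value 0x004c0100 and applies the override. The field layout (type
  in bits 31:16, major version in bits 11:8) and all names are assumptions
  made for this illustration rather than definitions quoted from skiboot.

  .. code-block:: c

     #include <stdint.h>

     /* Assumed PVR layout: type in bits 31:16, major version in 11:8. */
     #define PVR_TYPE(pvr)      (((pvr) >> 16) & 0xffffu)
     #define PVR_VERS_MAJ(pvr)  (((pvr) >> 8) & 0xfu)

     /* Type field of the example PVR 0x004c0100 (POWER8NVL). */
     #define PVR_TYPE_P8NVL     0x004cu

     /* Hypothetical helper: DD major level to advertise for a PVR. */
     static inline unsigned int advertised_dd_major(uint32_t pvr)
     {
         if (PVR_TYPE(pvr) == PVR_TYPE_P8NVL)
             return 2;   /* functionally equivalent to DD2 */
         return PVR_VERS_MAJ(pvr);
     }

  Under these assumptions, 0x004c0100 reports a raw major version of 1 but
  is advertised as DD2.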

p8dtu Platform
^^^^^^^^^^^^^^

- p8dtu: Configure BMC graphics

  We can no longer read the values from the BMC in the way we have in the
  past. Values were provided by Eric Chen of SMC.
- p8dtu: Enable HIOMAP support

Vesnin Platform
^^^^^^^^^^^^^^^

- platforms/vesnin: Disable PCIe port bifurcation

  PCIe ports connected to CPU1 and CPU3 now work as x16 instead of x8x8.

- Fix hang in pnv_platform_error_reboot path due to TOD failure.

  On TOD failure, with the TB stuck, when Linux heads down the
  pnv_platform_error_reboot() path due to an unrecoverable HMI event, the
  panic CPU gets stuck in OPAL inside ipmi_queue_msg_sync(). At this point,
  all the other CPUs are in smp_handle_nmi_ipi() waiting for the panic CPU
  to proceed. But with the panic CPU stuck inside OPAL, Linux never recovers
  or reboots. ::

    p0 c1 t0
    NIA : 0x000000003001dd3c <.time_wait+0x64>
    CFAR : 0x000000003001dce4 <.time_wait+0xc>
    MSR : 0x9000000002803002
    LR : 0x000000003002ecf8 <.ipmi_queue_msg_sync+0xec>

    STACK: SP NIA
    0x0000000031c236e0 0x0000000031c23760 (big-endian)
    0x0000000031c23760 0x000000003002ecf8 <.ipmi_queue_msg_sync+0xec>
    0x0000000031c237f0 0x00000000300aa5f8 <.hiomap_queue_msg_sync+0x7c>
    0x0000000031c23880 0x00000000300aaadc <.hiomap_window_move+0x150>
    0x0000000031c23950 0x00000000300ab1d8 <.ipmi_hiomap_write+0xcc>
    0x0000000031c23a90 0x00000000300a7b18 <.blocklevel_raw_write+0xbc>
    0x0000000031c23b30 0x00000000300a7c34 <.blocklevel_write+0xfc>
    0x0000000031c23bf0 0x0000000030030be0 <.flash_nvram_write+0xd4>
    0x0000000031c23c90 0x000000003002c128 <.opal_write_nvram+0xd0>
    0x0000000031c23d20 0x00000000300051e4 <opal_entry+0x134>
    0xc000001fea6e7870 0xc0000000000a9060 <opal_nvram_write+0x80>
    0xc000001fea6e78c0 0xc000000000030b84 <nvram_write_os_partition+0x94>
    0xc000001fea6e7960 0xc0000000000310b0 <nvram_pstore_write+0xb0>
    0xc000001fea6e7990 0xc0000000004792d4 <pstore_dump+0x1d4>
    0xc000001fea6e7ad0 0xc00000000018a570 <kmsg_dump+0x140>
    0xc000001fea6e7b40 0xc000000000028e5c <panic_flush_kmsg_end+0x2c>
    0xc000001fea6e7b60 0xc0000000000a7168 <pnv_platform_error_reboot+0x68>
    0xc000001fea6e7bd0 0xc0000000000ac9b8 <hmi_event_handler+0x1d8>
    0xc000001fea6e7c80 0xc00000000012d6c8 <process_one_work+0x1b8>
    0xc000001fea6e7d20 0xc00000000012da28 <worker_thread+0x88>
    0xc000001fea6e7db0 0xc0000000001366f4 <kthread+0x164>
    0xc000001fea6e7e20 0xc00000000000b65c <ret_from_kernel_thread+0x5c>

  This happens because there is a while loop towards the end of
  ipmi_queue_msg_sync() which keeps looping until "sync_msg" no longer
  matches "msg". It calls time_wait_ms() repeatedly until the exit condition
  is met. In the normal case, time_wait_ms() runs the pollers so that the
  IPMI backend gets a chance to check for the IPMI response and set sync_msg
  to NULL. ::

    while (sync_msg == msg)
            time_wait_ms(10);

  But when the TB is in a failed state, time_wait_ms()->time_wait_poll()
  returns immediately without calling the pollers, and hence we end up
  looping forever. This patch fixes the hang by calling opal_run_pollers()
  in the TB failed state as well.


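  The effect of the fix can be modelled in a few lines of standalone C. This
  is a simplified sketch and not the skiboot source: ``tb_failed``,
  ``fake_backend_poller()`` and the ``model_`` helpers are invented for
  illustration, and only the loop shape and the idea of running the pollers
  in the TB failed state are taken from the description above.

  .. code-block:: c

     #include <stdbool.h>
     #include <stddef.h>

     struct msg { int id; };

     static struct msg *sync_msg;  /* message currently being waited on */
     static bool tb_failed;        /* hypothetical "timebase is stuck" flag */
     static int polls_left = 3;    /* stand-in: backend answers on 3rd poll */

     /* Stand-in for the IPMI backend poller: completes the message. */
     static void fake_backend_poller(void)
     {
             if (sync_msg && --polls_left == 0)
                     sync_msg = NULL;
     }

     /* Model of time_wait_ms(): pollers run even if the TB has failed. */
     static void model_time_wait_ms(unsigned int ms)
     {
             (void)ms;
             if (tb_failed) {
                     fake_backend_poller();  /* the fix: still poll */
                     return;
             }
             /* Normal path: a real implementation would also delay here. */
             fake_backend_poller();
     }

     /* Model of the tail of ipmi_queue_msg_sync(); returns loop count. */
     static int model_queue_msg_sync(struct msg *m)
     {
             int spins = 0;

             sync_msg = m;
             while (sync_msg == m) {
                     model_time_wait_ms(10);
                     spins++;
             }
             return spins;
     }

  With ``tb_failed`` set, ``model_queue_msg_sync()`` still terminates because
  every pass through the loop gives the poller a turn.
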
.. _skiboot-6.3-power9:

POWER9
------

- Retry link training at PCIe GEN1 if presence detected but training repeatedly failed

  Certain older PCIe 1.0 devices will not train unless the training process
  starts at GEN1 speeds. As a last resort when a device will not train, fall
  back to GEN1 speed for the last training attempt.

  This is verified to fix devices based on the Conexant CX23888 on the
  Talos II platform.
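  The retry policy can be sketched as follows. This is an illustration of
  the idea only: the ``train_fn`` callback, the attempt count and the
  function names are assumptions and do not correspond to the PHB4 code.

  .. code-block:: c

     #include <stdbool.h>

     enum link_speed { GEN1 = 1, GEN2, GEN3, GEN4 };

     /* Hypothetical callback: try training at 'speed', true on success. */
     typedef bool (*train_fn)(enum link_speed speed);

     /*
      * Try up to 'max_attempts' times at 'max_speed'; the final attempt
      * falls back to GEN1. Returns the speed that trained, 0 on failure.
      */
     static int train_with_gen1_fallback(train_fn train,
                                         enum link_speed max_speed,
                                         int max_attempts)
     {
             for (int attempt = 1; attempt <= max_attempts; attempt++) {
                     enum link_speed speed =
                             (attempt == max_attempts) ? GEN1 : max_speed;

                     if (train(speed))
                             return speed;
             }
             return 0;
     }

     /* Example device model: only trains when started at GEN1. */
     static bool gen1_only_device(enum link_speed speed)
     {
             return speed == GEN1;
     }

  A device that only trains from GEN1 fails every full-speed attempt and is
  picked up by the last one.
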
- hw/phb4: Drop FRESET_DEASSERT_DELAY state

  The delay between the ASSERT_DELAY and DEASSERT_DELAY states is set to
  one timebase tick. This state seems to have been a holdover from PHB3,
  where it was used to add a 1s delay between de-asserting PERST and
  polling the link for the CAPI FPGA. There's no requirement for that here
  since the link polling on PHB4 is a bit smarter, so we should be fine.
- hw/phb4: Factor out PERST control

  Some time ago Mikey added some code to work around a bug we found where a
  certain RAID card wouldn't come back again after a fast-reboot. The
  workaround is to set the Link Disable bit before asserting PERST and
  clear it after de-asserting PERST.

  Currently we do this in the FRESET path, but not in the CRESET path.
  This patch moves the PERST control into its own function to reduce
  duplication and so that the workaround is applied in all circumstances.
- hw/phb4: Remove FRESET presence check

  When we do an FRESET the first step is to check if a card is present in
  the slot. However, this only occurs when we enter phb4_freset() with the
  slot state set to SLOT_NORMAL. This occurs in:

  a) The CRESET path, and
  b) When the OS manually requests an FRESET via an OPAL call.

  (a) is problematic because in the boot path the generic code will put the
  slot into FRESET_START manually before calling into phb4_freset(). This
  can result in a situation where a device is detected on boot, but not
  after a CRESET.

  I've noticed this occurring on systems where the PHB's slot presence
  detect signal is not wired to an adapter. In this situation we can rely
  on the in-band presence mechanism, but the presence check will make
  us exit before that has a chance to work.

  Additionally, if we enter from the CRESET path this early exit leaves
  the slot's PERST signal asserted. This isn't currently an issue, but if
  we want to support hotplug of devices into the root port it will be.
- hw/phb4: Skip FRESET PERST when coming from CRESET

  PERST is asserted at the beginning of the CRESET process to prevent
  the downstream device from interacting with the host while the PHB logic
  is being reset and re-initialised. There is at least a 100ms wait during
  the CRESET processing so it's not necessary to wait this time again
  in the FRESET handler.

  This patch extends the delay after resetting the PHB logic to the 250ms
  PERST wait period that we typically use, and sets the skip_perst flag so
  that we don't wait this time again in the FRESET handler.
- hw/phb4: Look for the hub-id in the PBCQ node

  The hub-id is stored in the PBCQ node rather than the stack node so we
  never add it to the PHB node. This breaks the lxvpd slot lookup code
  since the hub-id is encoded in the VPD record that we need to find the
  slot information.
- hdata/iohub: Look for IOVPD on P9

  P8 and P9 use the same IO VPD setup, so we need to load the IOHUB VPD on
  P9 systems too.

Since v6.3-rc2:

- hw/phb4: Squash the IO bridge window

  The PCI-PCI bridge spec says that bridges that do not implement an IO
  window should hardcode the IO base and limit registers to zero.
  Unfortunately, these registers only define the upper bits of the IO
  window and the low bits are assumed to be 0 for the base and 1 for the
  limit address. As a result, setting both to zero can be misinterpreted
  as a 4K IO window.

  This patch fixes the problem the same way PHB3 does. It sets the IO base
  and limit values to 0xf000 and 0x1000 respectively, which most software
  interprets as a disabled window.

  lspci before patch: ::

    0000:00:00.0 PCI bridge: IBM Device 04c1 (prog-if 00 [Normal decode])
            I/O behind bridge: 00000000-00000fff

  lspci after patch: ::

    0000:00:00.0 PCI bridge: IBM Device 04c1 (prog-if 00 [Normal decode])
            I/O behind bridge: None

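  The arithmetic behind the two lspci outputs can be worked through with a
  small decoder. This is a sketch under simplifying assumptions: it treats
  the base and limit as 16-bit addresses with only bits 15:12 programmable,
  ignores the upper 16-bit extension registers, and ``decode_io_window()``
  is a name made up for the example.

  .. code-block:: c

     #include <stdbool.h>
     #include <stdint.h>

     struct io_window {
             bool enabled;
             uint32_t start;
             uint32_t end;
     };

     /*
      * Only address bits 15:12 are programmable; the low 12 bits are
      * taken as 0x000 for the base and 0xfff for the limit.
      */
     static struct io_window decode_io_window(uint16_t base, uint16_t limit)
     {
             struct io_window w;

             w.start = base & 0xf000;
             w.end = (limit & 0xf000) | 0x0fff;
             w.enabled = w.start <= w.end;  /* base above limit: closed */
             return w;
     }

  Base and limit both zero decode to 0x0000-0x0fff, the spurious 4K window,
  while 0xf000 and 0x1000 put the base above the limit so the window reads
  as disabled.
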
- hw/xscom: Enable sw xstop by default on p9

  This was disabled at some point during bringup to make life easier for
  the lab folks trying to debug NVLink issues. This hack really should
  have never made it out into the wild though, so we now have the
  following situation occurring in the field:

  1) Something bad happens
  2) The host kernel receives an unrecoverable HMI and calls into OPAL to
     request a platform reboot.
  3) OPAL rejects the reboot attempt and returns to the kernel with
     OPAL_PARAMETER.
  4) Kernel panics and attempts to kexec into a kdump kernel.

  A side effect of the HMI seems to be CPUs becoming stuck, which results
  in the initialisation of the kdump kernel taking an extremely long time
  (6+ hours). It's also been observed that after performing a dump the
  kdump kernel then crashes itself because OPAL has ended up in a bad
  state as a side effect of the HMI.

  All up, it's not very good so re-enable the software checkstop by
  default. If people still want to turn it off they can using the nvram
  override.


CAPI2
^^^^^
- capp/phb4: Prevent HMI from getting triggered when disabling CAPP

  While disabling CAPP an HMI gets triggered as soon as the ETU is put in
  reset mode. This happens because, before we can disable CAPP, it detects
  the PHB link going down and triggers an HMI requesting OPAL to perform
  CAPP recovery. This has the unintended side effect of spamming the
  OPAL logs with malfunction alert messages and may also confuse the
  user.

  To prevent this we mask the CAPP FIR error 'PHB Link Down' Bit(31)
  when we are disabling CAPP, just before we put the ETU in reset in
  phb4_creset(). Also, since bringing down the PHB link will no longer
  trigger an HMI and CAPP recovery, we manually set the
  PHB4_CAPP_RECOVERY flag on the phb to force recovery during creset.

- phb4/capp: Implement sequence to disable CAPP and enable fast-reset

  We implement the h/w sequence to disable CAPP in disable_capi_mode() and
  with it also enable fast-reset for CAPI mode in phb4_set_capi_mode().

  The sequence to disable CAPP is executed in three phases. The first two
  phases are implemented in disable_capi_mode(), where we reset the CAPP
  registers followed by the PEC registers to their init values. The third
  and final phase is to reset the PHB CAPI Compare/Mask Register and
  is done in phb4_init_ioda3(). The reason to move the PHB reset to
  phb4_init_ioda3() is that by the time the OPAL PCI reset state machine
  reaches this function the PHB is already un-fenced and its
  configuration registers are accessible via mmio.
- capp/phb4: Force CAPP to PCIe mode during kernel shutdown

  This patch introduces a new opal syncer for PHB4 named
  phb4_host_sync_reset(). We register this opal syncer when CAPP is
  activated successfully in phb4_set_capi_mode() so that it will be
  called at kernel shutdown during fast-reset.

  During kernel shutdown the function will then repeatedly call
  phb->ops->set_capi_mode() to switch CAPP to PCIe mode. If
  set_capi_mode() returns OPAL_BUSY, which indicates that CAPP is
  still transitioning to the new state, it calls slot->ops.run_sm() to
  ensure that the OPAL slot reset state machine makes forward progress.


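  The retry loop described above has roughly the following shape. This is a
  self-contained model, not the phb4 code: ``struct model_phb``, the
  ``model_`` functions and the way busy calls are counted are all invented
  so that the control flow can be run on its own, and the return code
  values are local to the sketch.

  .. code-block:: c

     #define OPAL_SUCCESS   0
     #define OPAL_BUSY     -2

     struct model_phb {
             int busy_calls;    /* times set_capi_mode() reports busy */
             int sm_runs;       /* times the slot state machine ran */
             int mode_is_pcie;
     };

     /* Stand-in for phb->ops->set_capi_mode(..., PCIe mode). */
     static int model_set_capi_mode_pcie(struct model_phb *phb)
     {
             if (phb->busy_calls > 0) {
                     phb->busy_calls--;
                     return OPAL_BUSY;
             }
             phb->mode_is_pcie = 1;
             return OPAL_SUCCESS;
     }

     /* Stand-in for slot->ops.run_sm(): lets the reset progress. */
     static void model_run_sm(struct model_phb *phb)
     {
             phb->sm_runs++;
     }

     /* Model of the syncer: retry until CAPP has left the busy state. */
     static int model_host_sync_reset(struct model_phb *phb)
     {
             int rc;

             while ((rc = model_set_capi_mode_pcie(phb)) == OPAL_BUSY)
                     model_run_sm(phb);
             return rc;
     }

  Each busy return is paired with one state machine step, which is what
  keeps the loop from spinning without making progress.
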
Witherspoon Platform
^^^^^^^^^^^^^^^^^^^^
- platforms/witherspoon: Make PCIe shared slot error message more informative

  If we're missing chips for some reason, we print a warning when configuring
  the PCIe shared slot.

  The warning doesn't really make it clear what "shared slot" is, and if it's
  printed, it'll come right after a bunch of messages about NPU setup, so
  let's clarify the message to explicitly mention PCI.
- witherspoon: Add nvlink2 interconnect information

  See :ref:`skiboot-6.3-new-features` for details.

Zaius Platform
^^^^^^^^^^^^^^

- zaius: Add BMC description

  Frederic reported that Zaius was failing with a NULL dereference when
  trying to initialise IPMI HIOMAP. It turns out that the BMC wasn't
  described at all, so add a description.

p9dsu platform
^^^^^^^^^^^^^^
- p9dsu: Fix p9dsu default variant

  Add the default when no riser_id is returned from the ipmi query.

  Allow a little more time for the BMC reply and clean up some label strings.


PCIe
----

See :ref:`skiboot-6.3-power9` for POWER9 specific PCIe changes.

- core/pcie-slot: Don't bail early in the power on case

  Exiting early in the power off case makes sense since we can't disable
  slot power (or assert PERST) for surprise hotplug slots. However, we
  should not exit early in the power-on case since it's possible slot
  power may have been disabled (or just not enabled at boot time).
- firenze-pci: Always init slot info from LXVPD

  We can get slot information from the LXVPD without having power control
  information about that slot. This patch changes the init path so that
  we always override the add_properties() call rather than only when we
  have power control information about the slot.
- fsp/lxvpd: Print more LXVPD slot information

  Useful to know since it changes the behaviour of the slot core.
- core/pcie-slot: Set power state from the PWRCTL flag

  For some reason we look at the power control indicator and use that to
  determine if the slot is "off" rather than the power control flag that
  is used to power down the slot.

  While we're here change the default behaviour so that the slot is
  assumed to be powered on if there's no slot capability, or if there's
  no power control available.
- core/pci: Increase the max slot string size

  The maximum string length for the slot label / device location code in
  the PCI summary is currently 32 characters. This results in some IBM
  location codes being truncated due to their length, e.g. ::

    PHB#0001:02:11.0 [SWDN] SLOT=C11 x8
    PHB#0001:13:00.0 [EP ] *snip* LOC_CODE=U78D3.ND1.WZS004A-P1-C
    PHB#0001:13:00.1 [EP ] *snip* LOC_CODE=U78D3.ND1.WZS004A-P1-C
    PHB#0001:13:00.2 [EP ] *snip* LOC_CODE=U78D3.ND1.WZS004A-P1-C
    PHB#0001:13:00.3 [EP ] *snip* LOC_CODE=U78D3.ND1.WZS004A-P1-C

  This obscures the actual location of the card, and it looks bad. This
  patch increases the maximum length of the label string to 80 characters
  since that's the maximum length for a location code.


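  The truncation is easy to reproduce with a fixed-size buffer. The snippet
  below is only a demonstration of the effect: ``format_loc_code()``, the
  two size macros and the full location code used in the usage note are
  hypothetical, and the real PCI summary code builds its label differently.

  .. code-block:: c

     #include <stdio.h>
     #include <string.h>

     #define OLD_SLOT_STR_MAX 32
     #define NEW_SLOT_STR_MAX 80

     /* Format a location code; returns 1 if it had to be cut short. */
     static int format_loc_code(char *buf, size_t size, const char *loc)
     {
             int needed = snprintf(buf, size, "LOC_CODE=%s", loc);

             return needed < 0 || (size_t)needed >= size;
     }

  With a made-up code such as ``U78D3.ND1.WZS004A-P1-C11-T1``, a buffer of
  ``OLD_SLOT_STR_MAX`` bytes keeps only ``LOC_CODE=U78D3.ND1.WZS004A-P1-C``,
  while ``NEW_SLOT_STR_MAX`` bytes hold the whole string.
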
Since v6.3-rc3:

- pci: Try harder to add meaningful ibm,loc-code

  We keep the existing logic of looking to the parent for the slot-label or
  slot-location-code, but if all that fails we now look directly for the
  slot-location-code (as this should give us the correct loc code for
  things directly under the PHB), and otherwise we just look for a
  loc-code.

  The applicable bit of PAPR here is:

    R1–12.1–1. Each instance of a hardware entity (FRU) has a platform
    unique location code and any node in the OF device tree that describes
    a part of a hardware entity must include the “ibm,loc-code” property
    with a value that represents the location code for that hardware
    entity.

  which we weren't really fully obeying at any recent (ever?) point in
  time. Now we should do okay, at least for PCI.

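  One reading of that lookup order, written out as code. This is a sketch
  of the paragraph above rather than the skiboot implementation:
  ``struct model_node`` replaces the device tree API, only the immediate
  parent is consulted, and the relative order of the first two checks is an
  assumption.

  .. code-block:: c

     #include <stddef.h>

     struct model_node {
             const struct model_node *parent;
             const char *slot_label;          /* slot label, if any */
             const char *slot_location_code;  /* slot location code */
             const char *loc_code;            /* plain loc-code */
     };

     /* Returns the best available location string, or NULL. */
     static const char *pick_loc_code(const struct model_node *np)
     {
             const struct model_node *p = np->parent;

             /* Existing logic: ask the parent first. */
             if (p && p->slot_label)
                     return p->slot_label;
             if (p && p->slot_location_code)
                     return p->slot_location_code;
             /* New fallbacks: the node itself. */
             if (np->slot_location_code)
                     return np->slot_location_code;
             return np->loc_code;
     }

  A node directly under the PHB has no useful parent properties, so it ends
  up with its own slot location code instead of nothing at all.
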
Since v6.3-rc2:

- core/pci: Use PHB io-base-location by default for PHB slots

  On witherspoon only the GPU slots and the three pluggable PCI slots
  (SLOT0, 1, 2) have platform defined slot names. For builtin devices such
  as the SATA controller or the PLX switch that fans out to the GPU slots
  we have no location codes, which some people consider an issue.

  This patch addresses the problem by making the ibm,slot-location-code for
  the root port device default to the ibm,io-base-location-code which is
  typically the location code for the system itself.

  e.g. ::

    pciex@600c3c0100000/ibm,loc-code
                     "UOPWR.0000000-Node0-Proc0"

    pciex@600c3c0100000/pci@0/ibm,loc-code
                     "UOPWR.0000000-Node0-Proc0"

    pciex@600c3c0100000/pci@0/usb-xhci@0/ibm,loc-code
                     "UOPWR.0000000-Node0"

  The PHB node and the root complex nodes have the loc code of the
  processor they are attached to, while the usb-xhci device under the
  root port has the location code of the system itself.

- hw/phb4: Read ibm,loc-code from PBCQ node

  On P9 the PBCQs are subdivided by stacks which implement the PCI Express
  logic. When phb4 was forked from phb3 most of the properties that were
  in the pbcq node moved into the stack node, but ibm,loc-code was not one
  of them. This patch fixes the phb4 init sequence to read the base
  location code from the PBCQ node (parent of the stack node) rather than
  the stack node itself.


.. _skiboot-6.3-OpenCAPI:

OpenCAPI
--------
- npu2/hw-procedures: Fix parallel zcal for opencapi

  For opencapi, we currently do impedance calibration when initializing
  the PHY for the device, which could run in parallel if we have
  multiple opencapi devices. But if 2 devices are on the same
  obus, the 2 calibration sequences could overlap, which likely yields
  bad results and is useless anyway since it only needs to be done once
  per obus.

  This patch splits the opencapi PHY reset in 2 parts:

  - an 'init' part called serially at boot. That's when zcal is done. If
    we have 2 devices on the same socket, the zcal won't be redone,
    since we're called serially and we'll see it has already been done for
    the obus
  - a 'reset' part called during fundamental reset as a prereq for link
    training. It does the PHY setup for a set of lanes and the dccal.

  The PHY team confirmed there's no dependency between zcal and the
  other reset steps and it can be moved earlier.
- npu2-hw-procedures: Fix zcal in mixed opencapi and nvlink mode

  The zcal procedure needs to be run once per obus. We keep track of
  which obus is already calibrated in an array indexed by the obus
  number. However, the obus number is inferred from the brick index,
  which works well for nvlink but not for opencapi.

  Create an obus_index() function, which, from a device, returns the
  correct obus index, irrespective of the device type.
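  The "once per obus" bookkeeping amounts to a small array of flags. The
  sketch below is a model only: ``MODEL_MAX_OBUS``, ``model_phy_init()`` and
  the run counter are hypothetical, and the caller is assumed to pass an
  obus index that is already correct for the device type.

  .. code-block:: c

     #include <stdbool.h>

     #define MODEL_MAX_OBUS 4  /* hypothetical number of obus instances */

     static bool zcal_done[MODEL_MAX_OBUS];
     static int zcal_runs;     /* counts calibration sequences started */

     /* 'init' part: called serially at boot, so no locking is modelled. */
     static void model_phy_init(int obus)
     {
             if (zcal_done[obus])
                     return;   /* another device on this obus did it */
             zcal_runs++;      /* stand-in for the calibration itself */
             zcal_done[obus] = true;
     }

  Two devices on obus 0 and one on obus 1 result in two calibrations, not
  three.
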
- npu2-opencapi: Fix adapter reset when using 2 adapters

  If two opencapi adapters are on the same obus, we may try to train the
  two links in parallel at boot time, when all the PCI links are being
  trained. Both links use the same i2c controller to handle the reset
  signal, so some care is needed to make sure resetting one doesn't
  interfere with the reset of the other. We need to keep track of the
  current state of the i2c controller (and use locking).

  This went mostly unnoticed as you need to have 2 opencapi cards on the
  same socket and links tended to train anyway because of the retries.
- npu2-opencapi: Extend delay after releasing reset on adapter

  Give more time to the FPGA to process the reset signal. The previous
  delay, 5ms, is too short for newer adapters with bigger FPGAs. Extend
  it to 250ms.
  Ultimately, that delay will likely end up being added to the opencapi
  specification, but we are not there yet.
- npu2-opencapi: ODL should be in reset when enabled

  We haven't hit any problem so far, but according to the ODL designer, the
  ODL should be in reset when it is enabled.

  The ODL remains in reset until we start a fundamental reset to
  initiate link training. We still assert and deassert the ODL reset
  signal as part of the normal procedure just before training the
  link. Asserting is therefore useless at boot, since the ODL is already
  in reset, but we keep it as it's only a scom write and it's needed
  when we reset/retrain from the OS.
- npu2-opencapi: Keep ODL and adapter in reset at the same time

  Split the function to assert and deassert the reset signal on the ODL,
  so that we can keep the ODL in reset while we reset the adapter,
  therefore having a window where both sides are in reset.

  It is actually not required with our current DLx at boot time, but I
  need to split the ODL reset function for the following patch and it
  will become useful/required later when we introduce resetting an
  opencapi link from the OS.
- npu2-opencapi: Setup perf counters to detect CRC errors

  It's possible to set up performance counters for the PLL to detect
  various conditions for the links in nvlink or opencapi mode. Since
  those counters are currently unused, let's configure them when an obus
  is in opencapi mode to detect CRC errors on the link. Each link has
  two counters:

  - CRC error detected by the host
  - CRC error detected by the DLx (NAK received by the host)

  We also dump the counters shortly after the link trains, but they can
  be read multiple times through cronus, pdbg or linux. The counters are
  configured to be reset after each read.

Since v6.3-rc1:

- opal/hmi: Never trust a cow!

  With opencapi, it's fairly common to trigger HMIs during AFU
  development on the FPGA, by not replying in time to an NPU command,
  for example. So shift the blame reported by that cow to avoid crowding
  my mailbox.
- hw/npu2: Dump (more) npu2 registers on link error and HMIs

  We were already logging some NPU registers during an HMI. This patch
  cleans up how that is done and separates what is global from what
  is specific to nvlink or opencapi.

  Since we can now receive an error interrupt when an opencapi link goes
  down unexpectedly, we also dump the NPU state but we limit it to the
  registers of the brick which hit the error.

  The list of registers to dump was worked out with the hw team to
  allow for proper debugging. For each register, we print the name as
  found in the NPU workbook, the scom address and the register value.
- hw/npu2: Report errors to the OS if an OpenCAPI brick is fenced

  Now that the NPU may report interrupts due to the link going down
  unexpectedly, report those errors to the OS when queried by the
  'next_error' PHB callback.

  The hardware doesn't support recovery of the link when it goes down
  unexpectedly. So we report the PHB as dead, so that the OS can log the
  proper message, notify the drivers and take the devices down.
- hw/npu2: Fix OpenCAPI PE assignment

  When we support mixing NVLink and OpenCAPI devices on the same NPU, we're
  going to have to share the same range of 16 PE numbers between NVLink and
  OpenCAPI PHBs.

  For OpenCAPI devices, PE assignment is only significant for determining
  which System Interrupt Log register is used for a particular brick - unlike
  NVLink, it doesn't play any role in determining how links are fenced.

  Split the PE range into a lower half which is used for NVLink, and an upper
  half that is used for OpenCAPI, with a fixed PE number assigned per brick.

  As the PE assignment for OpenCAPI devices is fixed, set the PE once
  during device init and then ignore calls to the set_pe() operation.

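  As an illustration of such a split, assuming the 16 PE numbers are divided
  exactly in the middle and that the OpenCAPI PE is simply the brick index
  offset into the upper half. Both the split point and the ``MODEL_`` and
  ``model_`` names are assumptions made for this sketch; the values used by
  the NPU2 code may differ.

  .. code-block:: c

     #define MODEL_NPU_PE_COUNT   16
     #define MODEL_OCAPI_PE_BASE  (MODEL_NPU_PE_COUNT / 2)

     /* Fixed PE for an OpenCAPI brick: upper half, one PE per brick. */
     static int model_ocapi_pe(int brick_index)
     {
             return MODEL_OCAPI_PE_BASE + brick_index;
     }

     /* NVLink PE numbers are expected to stay in the lower half. */
     static int model_nvlink_pe_valid(int pe)
     {
             return pe >= 0 && pe < MODEL_OCAPI_PE_BASE;
     }

  Because the OpenCAPI PE depends only on the brick, it can be programmed
  once at init and a later set_pe() request has nothing left to change.
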
- opal-api: Reserve 2 OPAL API calls for future OpenCAPI LPC use

  OpenCAPI Lowest Point of Coherency (LPC) memory is going to require
  some extra OPAL calls to set up NPU BARs. These calls will most likely be
  called OPAL_NPU_LPC_ALLOC and OPAL_NPU_LPC_RELEASE; we're not quite ready
  to upstream that code yet, though.



NVLINK2
-------
- npu2: Allow ATSD for LPAR other than 0

  Each XTS MMIO ATSD# register is accompanied by another register -
  XTS MMIO ATSD0 LPARID# - which controls LPID filtering for ATSD
  transactions.

  When a host system passes a GPU through to a guest, we need to enable
  some ATSD for an LPAR. At the moment the host assigns one ATSD to
  an NVLink bridge and this maps it to an LPAR when the GPU is assigned to
  the LPAR. The link number is used as the ATSD index.

  ATSD6&7 stay mapped to the host (LPAR=0) all the time, which seems to be
  an acceptable price for the simplicity.
- npu2: Add XTS_BDF_MAP wildcard refcount

  Currently the PID wildcard is programmed into the NPU once and never
  cleared. This works for bare metal as the MSR does not change while the
  host OS is running.

  However, with device virtualization we need to keep track of the use of
  wildcard entries and clear them before switching a GPU from a host to
  a guest or vice versa.

  This adds a refcount to the NPU2, one counter per wildcard entry. The
  index is a short lparid (4 bits long) which is allocated in
  opal_npu_map_lpar() and should be smaller than NPU2_XTS_BDF_MAP_SIZE
  (defined as 16).

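  A per-entry refcount of this kind can be sketched as below. It is a
  stand-alone model, not the NPU2 code: the ``model_wildcard_*`` helpers and
  the ``wildcard_programmed`` array (standing in for the hardware register
  write) are hypothetical, and only the table size of 16 comes from the
  text above.

  .. code-block:: c

     #define MODEL_XTS_BDF_MAP_SIZE 16

     static unsigned int wildcard_refcnt[MODEL_XTS_BDF_MAP_SIZE];
     static int wildcard_programmed[MODEL_XTS_BDF_MAP_SIZE];

     /* Take a reference; the first user programs the wildcard entry. */
     static int model_wildcard_get(unsigned int lparid)
     {
             if (lparid >= MODEL_XTS_BDF_MAP_SIZE)
                     return -1;
             if (wildcard_refcnt[lparid]++ == 0)
                     wildcard_programmed[lparid] = 1;
             return 0;
     }

     /* Drop a reference; the last user clears the wildcard entry. */
     static int model_wildcard_put(unsigned int lparid)
     {
             if (lparid >= MODEL_XTS_BDF_MAP_SIZE ||
                 wildcard_refcnt[lparid] == 0)
                     return -1;
             if (--wildcard_refcnt[lparid] == 0)
                     wildcard_programmed[lparid] = 0;
             return 0;
     }

  The entry stays programmed for as long as at least one user holds a
  reference and is cleared when the last one goes away.
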
Since v6.3-rc2:

- npu2: Disable Probe-to-Invalid-Return-Modified-or-Owned snarfing by default

  V100 GPUs are known to violate the NVLink2 protocol in some cases (one is
  when memory was accessed by the CPU and then by the GPU using a so-called
  block linear mapping) and issue double probes to the NPU. The NPU can
  cope with this problem only if CONFIG_ENABLE_SNARF_CPM
  ("disable/enable Probe.I.MO snarfing a cp_m") is not set in the CQ_SM
  Misc Config register #0. If the bit is set (which is the case today), the
  NPU issues a machine check stop.

  The snarfing feature is designed to detect 2 probes in flight and combine
  them into one.

  This adds a new "opal-npu2-snarf-cpm" nvram variable which controls
  CONFIG_ENABLE_SNARF_CPM for all NVLinks to prevent the machine check
  stop from happening.

  This disables snarfing by default as otherwise a broken GPU driver can
  crash the entire box even when a GPU is passed through to a guest.
  This provides a dial to allow regression tests (this might be useful on
  bare metal). To enable snarfing, the user needs to run: ::

    sudo nvram -p ibm,skiboot --update-config opal-npu2-snarf-cpm=enable

  and reboot the host system.

- hw/npu2: Show name of opencapi error interrupts


Debugging and simulation
------------------------

- external/mambo: Error out if kernel is too large

  If you're trying to boot a gigantic kernel in mambo (which you can
  reproduce by building a kernel with CONFIG_MODULES=n) you'll get
  misleading errors like: ::

    WARNING: 0: (0): [0:0]: Invalid/unsupported instr 0x00000000[INVALID]
    WARNING: 0: (0): PC(EA): 0x0000000030000010 PC(RA):0x0000000030000010 MSR: 0x9000000000000000 LR: 0x0000000000000000
    WARNING: 0: (0): numInstructions = 0
    WARNING: 1: (1): [0:0]: Invalid/unsupported instr 0x00000000[INVALID]
    WARNING: 1: (1): PC(EA): 0x0000000000000E40 PC(RA):0x0000000000000E40 MSR: 0x9000000000000000 LR: 0x0000000000000000
    WARNING: 1: (1): numInstructions = 1
    WARNING: 1: (1): Interrupt to 0x0000000000000E40 from 0x0000000000000E40
    INFO: 1: (2): ** Execution stopped: Continuous Interrupt, Instruction caused exception, **

  So add an error to skiboot.tcl to warn the user before this happens.
  Moving PAYLOAD_ADDR further back is one way to work around this, but if
  there's a less gross way to generally work around this very niche problem,
  I can suggest that instead.
- external/mambo: Populate kernel-base-address in the DT

  skiboot.tcl defines PAYLOAD_ADDR as 0x20000000, which is also the default
  in skiboot unless kernel-base-address is set in the device tree.

  If you change PAYLOAD_ADDR to something else for mambo, skiboot won't
  see it because skiboot.tcl doesn't set that DT property, so fix it so
  that it does.
- external/mambo: allow CPU targeting for most debug utils

  Debug util functions target CPU 0:0:0 by default. Some can be
  overridden explicitly per invocation, and others can't at all.
  Even for those that can be overridden, it is a pain to type
  them out when you're debugging a particular thread.

  Provide a new 'target' function that allows the default CPU
  target to be changed. Wire that default up to all the other utils.
  Provide a new 'S' step command which only steps the target CPU.
- qemu: bt device isn't always hanging off /

  Just use the normal for_each_compatible instead.

  Otherwise in the qemu model as executed by op-test,
  we wouldn't go down the astbmc_init() path, thus not having flash.
- devicetree: Add p9-simics.dts

  Add a p9-based devicetree that's suitable for use with Simics.
- devicetree: Move power9-phb4.dts

  Clean up the formatting of power9-phb4.dts and move it to
  external/devicetree/p9.dts. This sets us up to include it as the basis
  for other trees.
- devicetree: Add nx node to power9-phb4.dts

  A (non-qemu) p9 without an nx node will assert in p9_darn_init(): ::

    dt_for_each_compatible(dt_root, nx, "ibm,power9-nx")
            break;
    if (!nx) {
            if (!dt_node_is_compatible(dt_root, "qemu,powernv"))
                    assert(nx);
            return;
    }

  Since NX is this essential, add it to the device tree.
- devicetree: Fix typo in power9-phb4.dts

  Change "impi" to "ipmi".
- devicetree: Fix syntax error in power9-phb4.dts

  Remove the extra space causing this: ::

    Error: power9-phb4.dts:156.15-16 syntax error
    FATAL ERROR: Unable to parse input tree
- core/init: enable machine check on secondaries

  Secondary CPUs currently run with MSR[ME]=0 during boot, which means
  if they take a machine check, the system will checkstop.

  Enable ME where possible and allow them to print registers.

Utilities
---------
- pflash: Don't try to update RO ToC

  In the future it's likely the ToC will be marked as read-only. Don't
  error out by assuming it's writable.
- pflash: Support encoding/decoding ECC'd partitions

  With the new --ecc option, pflash can add/remove ECC when
  reading/writing flash partitions protected by ECC.

  This is *not* flawless with current PNORs out in the wild though, as
  they do not typically fill the whole partition with valid ECC data, so
  you have to know how big the valid ECC'd data is and specify the size
  manually. Note that for some partitions this is practically impossible
  without knowing the details of the content of the partition.

  A future patch is likely to introduce an option to "stop reading data
  when ECC starts failing and assume everything is okay rather than error
  out" to support reading the "valid" data from existing PNOR images.

Since v6.3-rc2:

- opal-prd: Fix memory leak in is-fsp-system check
- opal-prd: Check malloc return value