979 lines
39 KiB
ReStructuredText
979 lines
39 KiB
ReStructuredText
.. _skiboot-5.7-rc1:
|
|
|
|
skiboot-5.7-rc1
|
|
===============
|
|
|
|
skiboot v5.7-rc1 was released on Monday July 3rd 2017. It is the first
|
|
release candidate of skiboot 5.7, which will become the new stable release
|
|
of skiboot following the 5.6 release, first released 24th May 2017.
|
|
|
|
skiboot v5.7-rc1 contains all bug fixes as of :ref:`skiboot-5.4.6`
|
|
and :ref:`skiboot-5.1.19` (the currently maintained stable releases). We
|
|
do not currently expect to do any 5.6.x stable releases.
|
|
|
|
For how the skiboot stable releases work, see :ref:`stable-rules` for details.
|
|
|
|
The current plan is to cut the final 5.7 by July 12th, with skiboot 5.7
|
|
being for all POWER8 and POWER9 platforms in op-build v1.18 (Due July 12th).
|
|
This is a short cycle as this release is mainly targetted towards POWER9
|
|
bringup efforts.
|
|
|
|
This is the second release using the new regular six week release cycle,
|
|
similar to op-build, but slightly offset to allow for a short stabilisation
|
|
period. Expected release dates and contents are tracked using GitHub milestone
|
|
and issues: https://github.com/open-power/skiboot/milestones
|
|
|
|
Over skiboot-5.6, we have the following changes:
|
|
|
|
New Features
|
|
------------
|
|
|
|
New features in this release for POWER9 systems:
|
|
|
|
- In Memory Counters (IMC) (See :ref:`imc` for details)
|
|
- phb4: Activate shared PCI slot on witherspoon (see :ref:`Shared Slot <shared-slot-5.7-rc1-rn>`)
|
|
- phb4 capi (i.e. CAPI2): Enable capi mode for PHB4 (see :ref:`CAPI on PHB4 <capi2-5.7-rc1-rn>`)
|
|
|
|
New feature for IBM FSP based systems:
|
|
|
|
- fsp/tpo: Provide support for disabling TPO alarm
|
|
|
|
This patch adds support for disabling a preconfigured
|
|
Timed-Power-On(TPO) alarm on FSP based systems. Presently once a TPO alarm
|
|
is configured from the kernel it will be triggered even if its
|
|
subsequently disabled.
|
|
|
|
With this patch a TPO alarm can be disabled by passing
|
|
y_m_d==hr_min==0 to fsp_opal_tpo_write(). A branch is added to the
|
|
function to handle this case by sending FSP_CMD_TPO_DISABLE message to
|
|
the FSP instead of usual FSP_CMD_TPO_WRITE message. The kernel is
|
|
expected to call opal_tpo_write() with y_m_d==hr_min==0 to request
|
|
opal to disable TPO alarm.
|
|
|
|
POWER9
|
|
------
|
|
|
|
Development on POWER9 systems continues in earnest.
|
|
|
|
This release includes the first support for POWER9 DD2 chips. Future releases
|
|
will likely contain more bug fixes, this release has booted on real hardware.
|
|
|
|
- hdata: Reserve Trace Areas
|
|
|
|
When hostboot is configured to setup in memory tracing it will reserve
|
|
some memory for use by the hardware tracing facility. We need to mark
|
|
these areas as off limits to the operating system and firmware.
|
|
- hdata: Make out-of-range idata print at PR_DEBUG
|
|
|
|
Some fields just aren't populated on some systems.
|
|
|
|
- hdata: Ignore unnamed memory reservations.
|
|
|
|
Hostboot should name any and all memory reservations that it provides.
|
|
Currently some hostboots export a broken reservation covering the first
|
|
256MB of memory and this causes the system to crash at boot due to an
|
|
invalid free because this overlaps with the static "ibm,os-reserve"
|
|
region (which covers the first 768MB of memory).
|
|
|
|
According to the hostboot team unnamed reservations are invalid and can
|
|
be ignored.
|
|
|
|
- hdata: Check the Host I2C devices array version
|
|
|
|
Currently this is not populated on FSP machines which causes some
|
|
obnoxious errors to appear in the boot log. We also only want to
|
|
parse version 1 of this structure since future versions will completely
|
|
change the array item format.
|
|
|
|
- Ensure P9 DD1 workarounds apply only to Nimbus
|
|
|
|
The workarounds for P9 DD1 are only needed for Nimbus. P9 Cumulus will
|
|
be DD1 but don't need these same workarounds.
|
|
|
|
This patch ensures the P9 DD1 workarounds only apply to Nimbus. It
|
|
also renames some things to make clear what's what.
|
|
|
|
- cpu: Cleanup AMR and IAMR when re-initializing CPUs
|
|
|
|
There's a bug in current Linux kernels leaving crap in those registers
|
|
accross kexec and not sanitizing them on boot. This breaks kexec under
|
|
some circumstances (such as booting a hash kernel from a radix one
|
|
on P9 DD2.0).
|
|
|
|
The long term fix is in Linux, but this workaround is a reasonable
|
|
way of "sanitizing" those SPRs when Linux calls opal_reinit_cpus()
|
|
and shouldn't have adverse effects.
|
|
|
|
We could also use that same mechanism to cleanup other things as
|
|
well such as restoring some other SPRs to their default value in
|
|
the future.
|
|
|
|
- Set POWER9 RPR SPR to 0x00000103070F1F3F. Same value as P8.
|
|
|
|
Without this, thread priorities inside a core don't work.
|
|
|
|
- cpu: Support setting HID[RADIX] and set it by default on P9
|
|
|
|
This adds new opal_reinit_cpus() flags to setup radix or hash
|
|
mode in HID[8] on POWER9.
|
|
|
|
By default HID[8] will be set. On P9 DD1.0, Linux will change
|
|
it as needed. On P9 DD2.0 hash works in radix mode (radix is
|
|
really "dual" mode) so KVM won't break and existing kernels
|
|
will work.
|
|
|
|
Newer kernels built for hash will call this to clear the HID bit
|
|
and thus get the full size of the TLB as an optimization.
|
|
|
|
- Add "cleanup_global_tlb" for P9 and later
|
|
|
|
Uses broadcast TLBIE's to cleanup the TLB on all cores and on
|
|
the nest MMU
|
|
|
|
- xive: DD2.0 updates
|
|
|
|
Add support for StoreEOI, fix StoreEOI MMIO offset in ESB page,
|
|
and other cleanups
|
|
|
|
- Update default TSCR value for P9 as recommended by HW folk.
|
|
|
|
- xive: Fix initialisation of xive_cpu_state struct
|
|
|
|
When using XIVE emulation with DEBUG=1, we run into crashes in log_add()
|
|
due to the xive_cpu_state->log_pos being uninitialised (and thus, with
|
|
DEBUG enabled, initialised to the poison value of 0x99999999).
|
|
|
|
OCC/Power Management
|
|
^^^^^^^^^^^^^^^^^^^^
|
|
|
|
With this release, it's possible to boot POWER9 systems with the OCC
|
|
enabled and change CPU frequencies. Doing so does require other firmware
|
|
components to also support this (otherwise the frequency will not be set).
|
|
|
|
- occ: Skip setting cores to nominal frequency in P9
|
|
|
|
In P9, once OCC is up, it is supposed to setup the cores to nominal
|
|
frequency. So skip this step in OPAL.
|
|
- occ: Fix Pstate ordering for P9
|
|
|
|
In P9 the pstate values are positive. They are continuous set of
|
|
unsigned integers [0 to +N] where Pmax is 0 and Pmin is N. The
|
|
linear ordering of pstates for P9 has changed compared to P8.
|
|
P8 has neagtive pstate values advertised as [0 to -N] where Pmax
|
|
is 0 and Pmin is -N. This patch adds helper routines to abstract
|
|
pstate comparison with pmax and adds sanity pstate limit checks.
|
|
This patch also fixes pstate arithmetic by using labs().
|
|
- p8-i2c: occ: Add support for OCC to use I2C engines
|
|
|
|
This patch adds support to share the I2C engines with host and OCC.
|
|
OCC uses I2C engines to read DIMM temperatures and to communicate with
|
|
GPU. OCC Flag register is used for locking between host and OCC. Host
|
|
requests for the bus by setting a bit in OCC Flag register. OCC sends
|
|
an interrupt to indicate the change in ownership.
|
|
|
|
opal-prd/PRD
|
|
^^^^^^^^^^^^
|
|
|
|
- opal-prd: Handle SBE passthrough message passing
|
|
|
|
This patch adds support to send SBE pass through command to HBRT.
|
|
- SBE: Add passthrough command support
|
|
|
|
SBE sends passthrough command. We have to capture this interrupt and
|
|
send event to HBRT via opal-prd (user space daemon).
|
|
- opal-prd: hook up reset_pm_complex
|
|
|
|
This change provides the facility to invoke HBRT's reset_pm_complex, in
|
|
the same manner is done with process_occ_reset previously.
|
|
|
|
We add a control command for `opal-prd pm-complex reset`, which is just
|
|
an alias for occ_reset at this stage.
|
|
|
|
- prd: Implement firmware side of opaque PRD channel
|
|
|
|
This change introduces the firmware side of the opaque HBRT <--> OPAL
|
|
message channel. We define a base message format to be shared with HBRT
|
|
(in include/prd-fw-msg.h), and allow firmware requests and responses to
|
|
be sent over this channel.
|
|
|
|
We don't currently have any notifications defined, so have nothing to do
|
|
for firmware_notify() at this stage.
|
|
|
|
- opal-prd: Add firmware_request & firmware_notify implementations
|
|
|
|
This change adds the implementation of firmware_request() and
|
|
firmware_notify(). To do this, we need to add a message queue, so that
|
|
we can properly handle out-of-order messages coming from firmware.
|
|
|
|
- opal-prd: Add support for variable-sized messages
|
|
|
|
With the introductuion of the opaque firmware channel, we want to
|
|
support variable-sized messages. Rather than expecting to read an
|
|
entire 'struct opal_prd_msg' in one read() call, we can split this
|
|
over mutiple reads, potentially expanding our message buffer.
|
|
|
|
- opal-prd: Sync hostboot interfaces with HBRT
|
|
|
|
This change adds new callbacks defined for p9, and the base thunks for
|
|
the added calls.
|
|
|
|
- opal-prd: interpret log level prefixes from HBRT
|
|
|
|
Interpret the (optional) \*_MRK log prefixes on HBRT messages, and set
|
|
the syslog log priority to suit.
|
|
|
|
- opal-prd: Add occ reset to usage text
|
|
- opal-prd: allow different chips for occ control actions
|
|
|
|
The `occ reset` and `occ error` actions can both take a chip id
|
|
argument, but we're currently just using zero. This change changes the
|
|
control message format to pass the chip ID from the control process to
|
|
the opal-prd daemon.
|
|
|
|
|
|
PCI/PHB4
|
|
^^^^^^^^
|
|
|
|
- phb4: Fix number of index bits in IODA tables
|
|
|
|
On PHB4 the number of index bits in the IODA table address register
|
|
was bumped to 10 bits to accomodate for 1024 MSIs and 1024 TVEs (DD2).
|
|
|
|
However our macro only defined the field to be 9 bits, thus causing
|
|
"interesting" behaviours on some systems.
|
|
|
|
- phb4: Harden init with bad PHBs
|
|
|
|
Currently if we read all 1's from the EEH or IRQ capabilities, we end
|
|
up train wrecking on some other random code (eg. an assert() in xive).
|
|
|
|
This hardens the PHB4 code to look for these bad reads and more
|
|
gracefully fails the init for that PHB alone. This allows the rest of
|
|
the system to boot and ignore those bad PHBs.
|
|
|
|
- phb4 capi (i.e. CAPI2): Handle HMI events
|
|
|
|
Find the CAPP on the chip associated with the HMI event for PHB4.
|
|
The recovery mode (re-initialization of the capp, resume of functional
|
|
operations) is only available with P9 DD2. A new patch will be provided
|
|
to support this feature.
|
|
|
|
.. _capi2-5.7-rc1-rn:
|
|
|
|
- phb4 capi (i.e. CAPI2): Enable capi mode for PHB4
|
|
|
|
Enable the Coherently attached processor interface. The PHB is used as
|
|
a CAPI interface.
|
|
CAPI Adapters can be connected to either PEC0 or PEC2. Single port
|
|
CAPI adapter can be connected to either PEC0 or PEC2, but Dual-Port
|
|
Adapter can be only connected to PEC2
|
|
* CAPP0 attached to PHB0(PEC0 - single port)
|
|
* CAPP1 attached to PHB3(PEC2 - single or dual port)
|
|
|
|
- hw/phb4: Rework phb4_get_presence_state()
|
|
|
|
There are two issues in current implementation: It should return errcode
|
|
visibile to Linux, which has prefix OPAL_*. The code isn't very obvious.
|
|
|
|
This returns OPAL_HARDWARE when the PHB is broken. Otherwise, OPAL_SUCCESS
|
|
is always returned. In the mean while, It refactors the code to make it
|
|
obvious: OPAL_PCI_SLOT_PRESENT is returned when the presence signal (low active)
|
|
or PCIe link is active. Otherwise, OPAL_PCI_SLOT_EMPTY is returned.
|
|
|
|
- phb4: Error injection for config space
|
|
|
|
Implement CFG (config space) error injection.
|
|
|
|
This works the same as PHB3. MMIO and DMA error injection require a
|
|
rewrite, so they're unsupported for now.
|
|
|
|
While it's not feature complete, this at least provides an easy way to
|
|
inject an error that will trigger EEH.
|
|
|
|
- phb4: Error clear implementation
|
|
- phb4: Mask link down errors during reset
|
|
|
|
During a hot reset the PCI link will drop, so we need to mask link down
|
|
events to prevent unnecessary errors.
|
|
- phb4: Implement root port initialization
|
|
|
|
phb4_root_port_init() was a NOP before, so fix that.
|
|
- phb4: Complete reset implementation
|
|
|
|
This implements complete reset (creset) functionality for POWER9 DD1.
|
|
|
|
Only partially tested and contends with some DD1 errata, but it's a start.
|
|
|
|
.. _shared-slot-5.7-rc1-rn:
|
|
|
|
- phb4: Activate shared PCI slot on witherspoon
|
|
|
|
Witherspoon systems come with a 'shared' PCI slot: physically, it
|
|
looks like a x16 slot, but it's actually two x8 slots connected to two
|
|
PHBs of two different chips. Taking advantage of it requires some
|
|
logic on the PCI adapter. Only the Mellanox CX5 adapter is known to
|
|
support it at the time of this writing.
|
|
|
|
This patch enables support for the shared slot on witherspoon if a x16
|
|
adapter is detected. Each x8 slot has a presence bit, so both bits
|
|
need to be set for the activation to take place. Slot sharing is
|
|
activated through a gpio.
|
|
|
|
Note that there's no easy way to be sure that the card is indeed a
|
|
shared-slot compatible PCI adapter and not a normal x16 card. Plugging
|
|
a normal x16 adapter on the shared slot should be avoided on
|
|
witherspoon, as the link won't train on the second slot, resulting in
|
|
a timeout and a longer boot time. Only the first slot is usable and
|
|
the x16 adapter will end up using only half the lines.
|
|
|
|
If the PCI card plugged on the physical slot is only x8 (or less),
|
|
then the presence bit of the second slot is not set, so this patch
|
|
does nothing. The x8 (or less) adapter should work like on any other
|
|
physical slot.
|
|
|
|
- phb4: Block D-state power management on direct slots
|
|
|
|
As current revisions of PHB4 don't properly handle the resulting
|
|
L1 link transition.
|
|
|
|
- phb4: Call pci config filters
|
|
|
|
- phb4: Mask out write-1-to-clear registers in RC cfg
|
|
|
|
The root complex config space only supports 4-byte accesses. Thus, when
|
|
the client requests a smaller size write, we do a read-modify-write to
|
|
the register.
|
|
|
|
However, some register have bits defined as "write 1 to clear".
|
|
|
|
If we do a RMW cycles on such a register and such bits are 1 in the
|
|
part that the client doesn't intend to modify, we will accidentally
|
|
write back those 1's and clear the corresponding bit.
|
|
|
|
This avoids it by masking out those magic bits from the "old" value
|
|
read from the register.
|
|
|
|
- phb4: Properly mask out link down errors during reset
|
|
- phb3/4: Silence a useless warning
|
|
|
|
PHB's don't have base location codes on non-FSP systems and it's
|
|
normal.
|
|
|
|
- phb4: Workaround bug in spec 053
|
|
|
|
Wait for DLP PGRESET to clear *after* lifting the PCIe core reset
|
|
|
|
- phb4: DD2.0 updates
|
|
|
|
Support StoreEOI, full complements of PEs (twice as big TVT)
|
|
and other updates.
|
|
|
|
Also renumber init steps to match spec 063
|
|
|
|
|
|
NPU2
|
|
^^^^
|
|
|
|
Note that currently NPU2 support is limited to POWER9 DD1 hardware.
|
|
|
|
- platforms/astbmc/witherspoon.c: Add NPU2 slot mappings
|
|
|
|
For NVLink2 to function PCIe devices need to be associated with the right
|
|
NVLinks. This association is supposed to be passed down to Skiboot via HDAT but
|
|
those fields are still not correctly filled out. To work around this we add slot
|
|
tables for the NVLinks similar to what we have for P8+.
|
|
|
|
- hw/npu2.c: Fix device aperture calculation
|
|
|
|
The POWER9 NPU2 implements an address compression scheme to compress 56-bit P9
|
|
physical addresses to 47-bit GPU addresses. System software needs to know both
|
|
addresses, unfortunately the calculation of the compressed address was
|
|
incorrect. Fix it here.
|
|
|
|
- hw/npu2.c: Change MCD BAR allocation order
|
|
|
|
MCD BARs need to be correctly aligned to the size of the region. As GPU
|
|
memory is allocated from the top of memory down we should start allocating
|
|
from the highest GPU memory address to the lowest to ensure correct
|
|
alignment.
|
|
|
|
- NPU2: Add flag to nvlink config space indicating DL reset state
|
|
|
|
Device drivers need to be able to determine if the DL is out of reset or
|
|
not so they can safely probe to see if links have already been trained.
|
|
This patch adds a flag to the vendor specific config space indicating if
|
|
the DL is out of reset.
|
|
|
|
- hw/npu2.c: Hardcode MSR_SF when setting up npu XTS contexts
|
|
|
|
We don't support anything other than 64-bit mode for address translations so we
|
|
can safely hardcode it.
|
|
|
|
- hw/npu2-hw-procedures.c: Add nvram option to override zcal calculations
|
|
|
|
In some rare cases the zcal state machine may fail and flag an error. According
|
|
to hardware designers it is sometimes ok to ignore this failure and use nominal
|
|
values for the calculations. In this case we add a nvram variable
|
|
(nv_zcal_override) which will cause skiboot to ignore the failure and use the
|
|
nominal value specified in nvram.
|
|
- npu2: Fix npu2_{read,write}_4b()
|
|
|
|
When writing or reading 4-byte values, we need to use the upper half of
|
|
the 64-bit SCOM register.
|
|
|
|
Fix npu2_{read,write}_4b() and their callers to use uint32_t, and
|
|
appropriately shift the value being written or returned.
|
|
|
|
|
|
- hw/npu2.c: Fix opal_npu_map_lpar to search for existing BDF
|
|
- hw/npu2-hw-procedures.c: Fix running of zcal procedure
|
|
|
|
The zcal procedure should only be run once per obus (ie. once per group of 3
|
|
links). Clean up the code and fix the potential buffer overflow due to a typo.
|
|
Also updates the zcal settings to their proper values.
|
|
- hw/npu2.c: Add memory coherence directory programming
|
|
|
|
The memory coherence directory (MCD) needs to know which system memory addresses
|
|
belong to the GPU. This amounts to setting a BAR and a size in the MCD to cover
|
|
the addresses assigned to each of the GPUs. To ease assignment we assume GPUs
|
|
are assigned memory in a contiguous block per chip.
|
|
|
|
|
|
pflash/libflash
|
|
---------------
|
|
|
|
- libflash/libffs: Zero checksum words
|
|
|
|
On writing ffs entries to flash libffs doesn't zero checksum words
|
|
before calculating the checksum across the entire structure. This causes
|
|
an inaccurate calculation of the checksum as it may calculate a checksum
|
|
on non-zero checksum bytes.
|
|
|
|
- libffs: Fix ffs_lookup_part() return value
|
|
|
|
It would return success when the part wasn't found
|
|
- libflash/libffs: Correctly update the actual size of the partition
|
|
|
|
libffs has been updating FFS partition information in the wrong place
|
|
which leads to incomplete erases and corruption.
|
|
- libflash: Initialise entries list earlier
|
|
|
|
In the bail-out path we call ffs_close() to tear down the partially
|
|
initialised ffs_handle. ffs_close() expects the entries list to be
|
|
initialised so we need to do that earlier to prevent a null pointer
|
|
dereference.
|
|
|
|
mbox-flash
|
|
----------
|
|
|
|
mbox-flash is the emerging standard way of talking to host PNOR flash
|
|
on POWER9 systems.
|
|
|
|
- libflash/mbox-flash: Implement MARK_WRITE_ERASED mbox call
|
|
|
|
Version two of the mbox-flash protocol defines a new command:
|
|
MARK_WRITE_ERASED.
|
|
|
|
This command provides a simple way to mark a region of flash as all 0xff
|
|
without the need to go and write all 0xff. This is an optimisation as
|
|
there is no need for an erase before a write, it is the responsibility of
|
|
the BMC to deal with the flash correctly, however in v1 it was ambiguous
|
|
what a client should do if the flash should be erased but not actually
|
|
written to. This allows of a optimal path to resolve this problem.
|
|
|
|
- libflash/mbox-flash: Update to V2 of the protocol
|
|
|
|
Updated version 2 of the protocol can be found at:
|
|
https://github.com/openbmc/mboxbridge/blob/master/Documentation/mbox_protocol.md
|
|
|
|
This commit changes mbox-flash such that it will preferentially talk
|
|
version 2 to any capable daemon but still remain capable of talking to
|
|
v1 daemons.
|
|
|
|
Version two changes some of the command definitions for increased
|
|
consistency and usability.
|
|
Version two includes more attention bits - these are now dealt with at a
|
|
simple level.
|
|
- libflash/mbox-flash: Implement MARK_WRITE_ERASED mbox call
|
|
|
|
Version two of the mbox-flash protocol defines a new command:
|
|
MARK_WRITE_ERASED.
|
|
|
|
This command provides a simple way to mark a region of flash as all 0xff
|
|
without the need to go and write all 0xff. This is an optimisation as
|
|
there is no need for an erase before a write, it is the responsibility of
|
|
the BMC to deal with the flash correctly, however in v1 it was ambiguous
|
|
what a client should do if the flash should be erased but not actually
|
|
written to. This allows of a optimal path to resolve this problem.
|
|
|
|
- libflash/mbox-flash: Update to V2 of the protocol
|
|
|
|
Updated version 2 of the protocol can be found at:
|
|
https://github.com/openbmc/mboxbridge/blob/master/Documentation/mbox_protocol.md
|
|
|
|
This commit changes mbox-flash such that it will preferentially talk
|
|
version 2 to any capable daemon but still remain capable of talking to
|
|
v1 daemons.
|
|
|
|
Version two changes some of the command definitions for increased
|
|
consistency and usability.
|
|
Version two includes more attention bits - these are now dealt with at a
|
|
simple level.
|
|
|
|
- hw/lpc-mbox: Use message registers for interrupts
|
|
|
|
Currently the BMC raises the interrupt using the BMC control register.
|
|
It does so on all accesses to the 16 'data' registers meaning that when
|
|
the BMC only wants to set the ATTN (on which we have interrupts enabled)
|
|
bit we will also get a control register based interrupt.
|
|
|
|
The solution here is to mask that interrupt permanantly and enable
|
|
interrupts on the protocol defined 'response' data byte.
|
|
|
|
General fixes
|
|
-------------
|
|
|
|
- Reduce log level on non-error log messages
|
|
|
|
90% of what we print isn't useful to a normal user. This
|
|
dramatically reduces the amount of messages printed by
|
|
OPAL in normal circumstances.
|
|
|
|
- init: Silence messages and call ourselves "OPAL"
|
|
- psi: Switch to ESB mode later
|
|
|
|
There's an errata, if we switch to ESB mode before setting up
|
|
the various ESB mode related registers, a pending interrupts
|
|
can go wrong.
|
|
|
|
- lpc: Enable "new" SerIRQ mode
|
|
- hw/ipmi/ipmi-sel: missing newline in prlog warning
|
|
|
|
- p8-i2c OCC lock: fix locking in p9_i2c_bus_owner_change
|
|
- Convert important polling loops to spin at lowest SMT priority
|
|
|
|
The pattern of calling cpu_relax() inside a polling loop does
|
|
not suit the powerpc SMT priority instructions. Prefrred is to
|
|
set a low priority then spin until break condition is reached,
|
|
then restore priority.
|
|
|
|
- Improve cpu_idle when PM is disabled
|
|
|
|
Split cpu_idle() into cpu_idle_delay() and cpu_idle_job() rather than
|
|
requesting the idle type as a function argument. Have those functions
|
|
provide a default polling (non-PM) implentation which spin at the
|
|
lowest SMT priority.
|
|
|
|
- core/fdt: Always add a reserve map
|
|
|
|
Currently we skip adding the reserved ranges block to the generated
|
|
FDT blob if we are excluding the root node. This can result in a DTB
|
|
that dtc will barf on because the reserved memory ranges overlap with
|
|
the start of the dt_struct block. As an example: ::
|
|
|
|
$ fdtdump broken.dtb -d
|
|
/dts-v1/;
|
|
// magic: 0xd00dfeed
|
|
// totalsize: 0x7f3 (2035)
|
|
// off_dt_struct: 0x30 <----\
|
|
// off_dt_strings: 0x7b8 | this is bad!
|
|
// off_mem_rsvmap: 0x30 <----/
|
|
// version: 17
|
|
// last_comp_version: 16
|
|
// boot_cpuid_phys: 0x0
|
|
// size_dt_strings: 0x3b
|
|
// size_dt_struct: 0x788
|
|
|
|
/memreserve/ 0x100000000 0x300000004;
|
|
/memreserve/ 0x3300000001 0x169626d2c;
|
|
/memreserve/ 0x706369652d736c6f 0x7473000000000003;
|
|
*continues*
|
|
|
|
With this patch: ::
|
|
|
|
$ fdtdump working.dtb -d
|
|
/dts-v1/;
|
|
// magic: 0xd00dfeed
|
|
// totalsize: 0x803 (2051)
|
|
// off_dt_struct: 0x40
|
|
// off_dt_strings: 0x7c8
|
|
// off_mem_rsvmap: 0x30
|
|
// version: 17
|
|
// last_comp_version: 16
|
|
// boot_cpuid_phys: 0x0
|
|
// size_dt_strings: 0x3b
|
|
// size_dt_struct: 0x788
|
|
|
|
// 0040: tag: 0x00000001 (FDT_BEGIN_NODE)
|
|
/ {
|
|
// 0048: tag: 0x00000003 (FDT_PROP)
|
|
// 07fb: string: phandle
|
|
// 0054: value
|
|
phandle = <0x00000001>;
|
|
*continues*
|
|
|
|
- hw/lpc-mbox: Use message registers for interrupts
|
|
|
|
Currently the BMC raises the interrupt using the BMC control register.
|
|
It does so on all accesses to the 16 'data' registers meaning that when
|
|
the BMC only wants to set the ATTN (on which we have interrupts enabled)
|
|
bit we will also get a control register based interrupt.
|
|
|
|
The solution here is to mask that interrupt permanantly and enable
|
|
interrupts on the protocol defined 'response' data byte.
|
|
|
|
|
|
PCI
|
|
---
|
|
- pci: Wait 20ms before checking presence detect on PCIe
|
|
|
|
As the PHB presence logic has a debounce timer that can take
|
|
a while to settle.
|
|
|
|
- phb3+iov: Fixup support for config space filters
|
|
|
|
The filter should be called before the HW access and its
|
|
return value control whether to perform the access or not
|
|
- core/pci: Use PCI slot's power facality in pci_enable_bridge()
|
|
|
|
The current implmentation has incorrect assumptions: there is
|
|
always a PCI slot associated with root port and PCIe switch
|
|
downstream port and all of them are capable to change its
|
|
power state by register PCICAP_EXP_SLOTCTL. Firstly, there
|
|
might not a PCI slot associated with the root port or PCIe
|
|
switch downstream port. Secondly, the power isn't controlled
|
|
by standard config register (PCICAP_EXP_SLOTCTL). There are
|
|
I2C slave devices used to control the power states on Tuleta.
|
|
|
|
In order to use the PCI slot's methods to manage the power
|
|
states, this does:
|
|
|
|
* Introduce PCI_SLOT_FLAG_ENFORCE, indicates the request operation
|
|
is enforced to be applied.
|
|
* pci_enable_bridge() is split into 3 functions: pci_bridge_power_on()
|
|
to power it on; pci_enable_bridge() as a place holder and
|
|
pci_bridge_wait_link() to wait the downstream link to come up.
|
|
* In pci_bridge_power_on(), the PCI slot's specific power management
|
|
methods are used if there is a PCI slot associated with the PCIe
|
|
switch downstream port or root port.
|
|
- platforms/astbmc/slots.c: Allow comparison of bus numbers when matching slots
|
|
|
|
When matching devices on multiple down stream PLX busses we need to compare more
|
|
than just the device-id of the PCIe BDFN, so increase the mask to do so.
|
|
|
|
Tests and simulators
|
|
--------------------
|
|
|
|
- boot-tests: add OpenBMC support
|
|
- boot_test.sh: Add SMC BMC support
|
|
|
|
Your BMC needs a special debug image flashed to use this, the exact
|
|
image and methods aren't something I can publish here, but if you work
|
|
for IBM or SMC you can find out from the right sources.
|
|
|
|
A few things are needed to move around to be able to flash to a SMC BMC.
|
|
|
|
For a start, the SSH daemon will only accept connections after a special
|
|
incantation (which I also can't share), but you should put that in the
|
|
~/.skiboot_boot_tests file along with some other default login information
|
|
we don't publicise too broadly (because Security Through Obscurity is
|
|
*obviously* a good idea....)
|
|
|
|
We also can't just directly "ssh /bin/true", we need an expect script,
|
|
and we can't scp, but we can anonymous rsync!
|
|
|
|
You also need a pflash binary to copy over.
|
|
- hdata_to_dt: Add PVR overrides to the usage text
|
|
- mambo: Add a reservation for the initramfs
|
|
|
|
On most systems the initramfs is loaded inside the part of memory
|
|
reserved for the OS [0x0-0x30000000] and skiboot will never touch it.
|
|
On mambo it's loaded at 0x80000000 and if you're unlucky skiboot can
|
|
allocate over the top of it and corrupt the initramfs blob.
|
|
|
|
There might be the downside that the kernel cannot re-use the initramfs
|
|
memory since it's marked as reserved, but the kernel might also free it
|
|
anyway.
|
|
- mambo: Update P9 PVR to reflect Scale out 24 core chips
|
|
|
|
The P9 PVR bits 48:51 don't indicate a revision but instead different
|
|
configurations. From BookIV we have:
|
|
|
|
==== ===================
|
|
Bits Configuration
|
|
==== ===================
|
|
0 Scale out 12 cores
|
|
1 Scale out 24 cores
|
|
2 Scale up 12 cores
|
|
3 Scale up 24 cores
|
|
==== ===================
|
|
|
|
Skiboot will mostly the use "Scale out 24 core" configuration
|
|
(ie. SMT4 not SMT8) so reflect this in mambo.
|
|
- core: Move enable_mambo_console() into chip initialisation
|
|
|
|
Rather than having a wart in main_cpu_entry() that initialises the mambo
|
|
console, we can move it into init_chips() which is where we discover that we're
|
|
on mambo.
|
|
|
|
- mambo: Create multiple chips when we have multiple CPUs
|
|
|
|
Currently when we boot mambo with multiple CPUs, we create multiple CPU nodes in
|
|
the device tree, and each claims to be on a separate chip.
|
|
|
|
However we don't create multiple xscom nodes, which means skiboot only knows
|
|
about a single chip, and all CPUs end up on it. At the moment mambo is not able
|
|
to create multiple xscom controllers. We can create fake ones, just by faking
|
|
the device tree up, but that seems uglier than this solution.
|
|
|
|
So create a mambo-chip for each CPU other than 0, to tell skiboot we want a
|
|
separate chip created. This then enables Linux to see multiple chips: ::
|
|
|
|
smp: Brought up 2 nodes, 2 CPUs
|
|
numa: Node 0 CPUs: 0
|
|
numa: Node 1 CPUs: 1
|
|
|
|
- chip: Add support for discovering chips on mambo
|
|
|
|
Currently the only way for skiboot to discover chips is by looking for xscom
|
|
nodes. But on mambo it's currently not possible to create multiple xscom nodes,
|
|
which means we can only simulate a single chip system.
|
|
|
|
However it seems we can fairly cleanly add support for a special mambo chip
|
|
node, and use that to instantiate multiple chips.
|
|
|
|
Add a check in init_chip() that we're not clobbering an already initialised
|
|
chip, now that we have two places that initialise chips.
|
|
- mambo: Make xscom claim to be DD 2.0
|
|
|
|
In the mambo tcl we set the CPU version to DD 2.0, because mambo is not
|
|
bug compatible with DD 1.
|
|
|
|
But in xscom_read_cfam_chipid() we have a hard coded value, to work
|
|
around the lack of the f000f register, which claims to be P9 DD 1.0.
|
|
|
|
This doesn't seem to cause crashes or anything, but at boot we do see: ::
|
|
|
|
[ 0.003893084,5] XSCOM: chip 0x0 at 0x1a0000000000 [P9N DD1.0]
|
|
|
|
So fix it to claim that the xscom is also DD 2.0 to match the CPU.
|
|
|
|
- mambo: Match whole string when looking up symbols with linsym/skisym
|
|
|
|
linsym/skisym use a regex to match the symbol name, and accepts a
|
|
partial match against the entry in the symbol map, which can lead to
|
|
somewhat confusing results, eg: ::
|
|
|
|
systemsim % linsym early_setup
|
|
0xc000000000027890
|
|
systemsim % linsym early_setup$
|
|
0xc000000000aa8054
|
|
systemsim % linsym early_setup_secondary
|
|
0xc000000000027890
|
|
|
|
I don't think that's the behaviour we want, so append a $ to the name so
|
|
that the symbol has to match against the whole entry, eg: ::
|
|
|
|
systemsim % linsym early_setup
|
|
0xc000000000aa8054
|
|
|
|
- Disable nap on P8 Mambo, public release has bugs
|
|
- mambo: Allow loading multiple CPIOs
|
|
|
|
Currently we have support for loading a single CPIO and telling Linux to
|
|
use it as the initrd. But the Linux code actually supports having
|
|
multiple CPIOs contiguously in memory, between initrd-start and end, and
|
|
will unpack them all in order. That is a really nice feature as it means
|
|
you can have a base CPIO with your root filesystem, and then tack on
|
|
others as you need for various tests etc.
|
|
|
|
So expand the logic to handle SKIBOOT_INITRD, and treat it as a comma
|
|
separated list of CPIOs to load. I chose comma as it's fairly rare in
|
|
filenames, but we could make it space, colon, whatever. Or we could add
|
|
a new environment variable entirely. The code also supports trimming
|
|
whitespace from the values, so you can have "cpio1, cpio2".
|
|
- hdata/test: Add memory reservations to hdata_to_dt
|
|
|
|
Currently memory reservations are parsed, but since they are not
|
|
processed until mem_region_init() they don't appear in the output
|
|
device tree blob. Several bugs have been found with memory reservations
|
|
so we want them to be part of the test output.
|
|
|
|
Add them and clean up several usages of printf() since we want only the
|
|
dtb to appear in standard out.
|
|
|
|
IBM FSP systems
|
|
---------------
|
|
|
|
- FSP/CONSOLE: Fix possible NULL dereference
|
|
- platforms/ibm-fsp/firenze: Fix PCI slot power-off pattern
|
|
|
|
When powering off the PCI slot, the corresponding bits should
|
|
be set to 0bxx00xx00 instead of 0bxx11xx11. Otherwise, the
|
|
specified PCI slot can't be put into power-off state. Fortunately,
|
|
it didn't introduce any side-effects so far.
|
|
- FSP/CONSOLE: Workaround for unresponsive ipmi daemon
|
|
|
|
We use TCE mapped area to write data to console. Console header
|
|
(fsp_serbuf_hdr) is modified by both FSP and OPAL (OPAL updates
|
|
next_in pointer in fsp_serbuf_hdr and FSP updates next_out pointer).
|
|
|
|
Kernel makes opal_console_write() OPAL call to write data to console.
|
|
OPAL write data to TCE mapped area and sends MBOX command to FSP.
|
|
If our console becomes full and we have data to write to console,
|
|
we keep on waiting until FSP reads data.
|
|
|
|
In some corner cases, where FSP is active but not responding to
|
|
console MBOX message (due to buggy IPMI) and we have heavy console
|
|
write happening from kernel, then eventually our console buffer
|
|
becomes full. At this point OPAL starts sending OPAL_BUSY_EVENT to
|
|
kernel. Kernel will keep on retrying. This is creating kernel soft
|
|
lockups. In some extreme case when every CPU is trying to write to
|
|
console, user will not be able to ssh and thinks system is hang.
|
|
|
|
If we reset FSP or restart IPMI daemon on FSP, system recovers and
|
|
everything becomes normal.
|
|
|
|
This patch adds workaround to above issue by returning OPAL_HARDWARE
|
|
when cosole is full. Side effect of this patch is, we may endup dropping
|
|
latest console data. But better to drop console data than system hang.
|
|
|
|
- FSP: Set status field in response message for timed out message
|
|
|
|
For timed out FSP messages, we set message status as "fsp_msg_timeout".
|
|
But most FSP driver users (like surviellance) are ignoring this field.
|
|
They always look for FSP returned status value in callback function
|
|
(second byte in word1). So we endup treating timed out message as success
|
|
response from FSP.
|
|
|
|
Sample output: ::
|
|
|
|
[69902.432509048,7] SURV: Sending the heartbeat command to FSP
|
|
[70023.226860117,4] FSP: Response from FSP timed out, word0 = d66a00d7, word1 = 0 state: 3
|
|
....
|
|
[70023.226901445,7] SURV: Received heartbeat acknowledge from FSP
|
|
[70023.226903251,3] FSP: fsp_trigger_reset() entry
|
|
|
|
Here SURV code thought it got valid response from FSP. But actually we didn't
|
|
receive response from FSP.
|
|
|
|
This patch fixes above issue by updating status field in response structure.
|
|
|
|
- FSP: Improve timeout message
|
|
|
|
- FSP/RTC: Fix possible FSP R/R issue in rtc write path
|
|
- hw/fsp/rtc: read/write cached rtc tod on fsp hir.
|
|
|
|
Currently fsp-rtc reads/writes the cached RTC TOD on an fsp
|
|
reset. Use latest fsp_in_rr() function to properly read the cached rtc
|
|
value when fsp reset initiated by the hir.
|
|
|
|
Below is the kernel trace when we set hw clock, when hir process starts. ::
|
|
|
|
[ 1727.775824] NMI watchdog: BUG: soft lockup - CPU#57 stuck for 23s! [hwclock:7688]
|
|
[ 1727.775856] Modules linked in: vmx_crypto ibmpowernv ipmi_powernv uio_pdrv_genirq ipmi_devintf powernv_op_panel uio ipmi_msghandler powernv_rng leds_powernv ip_tables x_tables autofs4 ses enclosure scsi_transport_sas crc32c_vpmsum lpfc ipr tg3 scsi_transport_fc
|
|
[ 1727.775883] CPU: 57 PID: 7688 Comm: hwclock Not tainted 4.10.0-14-generic #16-Ubuntu
|
|
[ 1727.775883] task: c000000fdfdc8400 task.stack: c000000fdfef4000
|
|
[ 1727.775884] NIP: c00000000090540c LR: c0000000000846f4 CTR: 000000003006dd70
|
|
[ 1727.775885] REGS: c000000fdfef79a0 TRAP: 0901 Not tainted (4.10.0-14-generic)
|
|
[ 1727.775886] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>
|
|
[ 1727.775889] CR: 28024442 XER: 20000000
|
|
[ 1727.775890] CFAR: c00000000008472c SOFTE: 1
|
|
GPR00: 0000000030005128 c000000fdfef7c20 c00000000144c900 fffffffffffffff4
|
|
GPR04: 0000000028024442 c00000000090540c 9000000000009033 0000000000000000
|
|
GPR08: 0000000000000000 0000000031fc4000 c000000000084710 9000000000001003
|
|
GPR12: c0000000000846e8 c00000000fba0100
|
|
[ 1727.775897] NIP [c00000000090540c] opal_set_rtc_time+0x4c/0xb0
|
|
[ 1727.775899] LR [c0000000000846f4] opal_return+0xc/0x48
|
|
[ 1727.775899] Call Trace:
|
|
[ 1727.775900] [c000000fdfef7c20] [c00000000090540c] opal_set_rtc_time+0x4c/0xb0 (unreliable)
|
|
[ 1727.775901] [c000000fdfef7c60] [c000000000900828] rtc_set_time+0xb8/0x1b0
|
|
[ 1727.775903] [c000000fdfef7ca0] [c000000000902364] rtc_dev_ioctl+0x454/0x630
|
|
[ 1727.775904] [c000000fdfef7d40] [c00000000035b1f4] do_vfs_ioctl+0xd4/0x8c0
|
|
[ 1727.775906] [c000000fdfef7de0] [c00000000035bab4] SyS_ioctl+0xd4/0xf0
|
|
[ 1727.775907] [c000000fdfef7e30] [c00000000000b184] system_call+0x38/0xe0
|
|
[ 1727.775908] Instruction dump:
|
|
[ 1727.775909] f821ffc1 39200000 7c832378 91210028 38a10020 39200000 38810028 f9210020
|
|
[ 1727.775911] 4bfffe6d e8810020 80610028 4b77f61d <60000000> 7c7f1b78 3860000a 2fbffff4
|
|
|
|
This is found when executing the testcase
|
|
https://github.com/open-power/op-test-framework/blob/master/testcases/fspresetReload.py
|
|
|
|
With this fix ran fsp hir torture testcase in the above test
|
|
which is working fine.
|
|
- occ: Set return variable to correct value
|
|
|
|
When entering this section of code rc will be zero. If fsp_mkmsg() fails
|
|
the code responsible for printing an error message won't be set.
|
|
Resetting rc should allow for the error case to trigger if fsp_mkmsg
|
|
fails.
|
|
- capp: Fix hang when CAPP microcode LID is missing on FSP machine
|
|
|
|
When the LID is absent, we fail early with an error from
|
|
start_preload_resource. In that case, capp_ucode_info.load_result
|
|
isn't set properly causing a subsequent capp_lid_download() to
|
|
call wait_for_resource_loaded() on something that isn't being
|
|
loaded, thus hanging.
|
|
|
|
- FSP: Add check to detect FSP R/R inside fsp_sync_msg()
|
|
|
|
OPAL sends MBOX message to FSP and updates message state from fsp_msg_queued
|
|
-> fsp_msg_sent. fsp_sync_msg() queues message and waits until we get response
|
|
from FSP. During FSP R/R we move outstanding MBOX messages from msgq to rr_queue
|
|
including inflight message (fsp_reset_cmdclass()). But we are not resetting
|
|
inflight message state.
|
|
|
|
In extreme croner case where we sent message to FSP via fsp_sync_msg() path
|
|
and FSP R/R happens before getting respose from FSP, then we will endup waiting
|
|
in fsp_sync_msg() until everything becomes normal.
|
|
|
|
This patch adds fsp_in_rr() check to fsp_sync_msg() and return error to caller
|
|
if FSP is in R/R.
|
|
- FSP: Add check to detect FSP R/R inside fsp_sync_msg()
|
|
|
|
OPAL sends MBOX message to FSP and updates message state from fsp_msg_queued
|
|
-> fsp_msg_sent. fsp_sync_msg() queues message and waits until we get response
|
|
from FSP. During FSP R/R we move outstanding MBOX messages from msgq to rr_queue
|
|
including inflight message (fsp_reset_cmdclass()). But we are not resetting
|
|
inflight message state.
|
|
|
|
In extreme croner case where we sent message to FSP via fsp_sync_msg() path
|
|
and FSP R/R happens before getting respose from FSP, then we will endup waiting
|
|
in fsp_sync_msg() until everything becomes normal.
|
|
|
|
This patch adds fsp_in_rr() check to fsp_sync_msg() and return error to caller
|
|
if FSP is in R/R.
|
|
- capp: Fix hang when CAPP microcode LID is missing on FSP machine
|
|
|
|
When the LID is absent, we fail early with an error from
|
|
start_preload_resource. In that case, capp_ucode_info.load_result
|
|
isn't set properly causing a subsequent capp_lid_download() to
|
|
call wait_for_resource_loaded() on something that isn't being
|
|
loaded, thus hanging.
|
|
- FSP/CONSOLE: Do not free fsp_msg in error path
|
|
|
|
as we reuse same msg to send next output message.
|
|
|
|
- platform/zz: Acknowledge OCC_LOAD mbox message in ZZ
|
|
|
|
In P9 FSP box, OCC image is pre-loaded. So do not handle the load
|
|
command and send SUCCESS to FSP on recieving OCC_LOAD mbox message.
|
|
|
|
- FSP/RTC: Improve error log
|
|
|
|
astbmc systems
|
|
--------------
|
|
|
|
- platforms/astbmc: Don't validate model on palmetto
|
|
|
|
The platform isn't compatible with palmetto until the root device-tree
|
|
node's "model" property is NULL or "palmetto". However, we could have
|
|
"TN71-BP012" for the property on palmetto. ::
|
|
|
|
linux# cat /proc/device-tree/model
|
|
TN71-BP012
|
|
|
|
This skips the validation on root device-tree node's "model" property
|
|
on palmetto, meaning we check the "compatible" property only.
|
|
|
|
|