Commit Graph

886869 Commits

Author SHA1 Message Date
Mika Westerberg
ad9001f2f4 PCI/PM: Add missing link delays required by the PCIe spec
Currently Linux does not follow PCIe spec regarding the required delays
after reset. A concrete example is a Thunderbolt add-in-card that consists
of a PCIe switch and two PCIe endpoints:

  +-1b.0-[01-6b]----00.0-[02-6b]--+-00.0-[03]----00.0 TBT controller
                                  +-01.0-[04-36]-- DS hotplug port
                                  +-02.0-[37]----00.0 xHCI controller
                                  \-04.0-[38-6b]-- DS hotplug port

The root port (1b.0) and the PCIe switch downstream ports are all PCIe Gen3
so they support 8GT/s link speeds.

We wait for the PCIe hierarchy to enter D3cold (runtime):

  pcieport 0000:00:1b.0: power state changed by ACPI to D3cold

When it wakes up from D3cold, according to the PCIe 5.0 section 5.8 the
PCIe switch is put to reset and its power is re-applied. This means that we
must follow the rules in PCIe 5.0 section 6.6.1.

For the PCIe Gen3 ports we are dealing with here, the following applies:

  With a Downstream Port that supports Link speeds greater than 5.0 GT/s,
  software must wait a minimum of 100 ms after Link training completes
  before sending a Configuration Request to the device immediately below
  that Port. Software can determine when Link training completes by polling
  the Data Link Layer Link Active bit or by setting up an associated
  interrupt (see Section 6.7.3.3).

Translating this into the above topology we would need to do this (DLLLA
stands for Data Link Layer Link Active):

  0000:00:1b.0: wait for 100 ms after DLLLA is set before access to 0000:01:00.0
  0000:02:00.0: wait for 100 ms after DLLLA is set before access to 0000:03:00.0
  0000:02:02.0: wait for 100 ms after DLLLA is set before access to 0000:37:00.0

I've instrumented the kernel with some additional logging so we can see the
actual delays performed:

  pcieport 0000:00:1b.0: power state changed by ACPI to D0
  pcieport 0000:00:1b.0: waiting for D3cold delay of 100 ms
  pcieport 0000:00:1b.0: waiting for D3hot delay of 10 ms
  pcieport 0000:02:01.0: waiting for D3hot delay of 10 ms
  pcieport 0000:02:04.0: waiting for D3hot delay of 10 ms

For the switch upstream port (01:00.0 reachable through 00:1b.0 root port)
we wait for 100 ms but not taking into account the DLLLA requirement. We
then wait 10 ms for D3hot -> D0 transition of the root port and the two
downstream hotplug ports. This means that we deviate from what the spec
requires.

Performing the same check for system sleep (s2idle) transitions it turns
out to be even worse. None of the mandatory delays are performed. If this
would be S3 instead of s2idle then according to PCI FW spec 3.2 section
4.6.8. there is a specific _DSM that allows the OS to skip the delays but
this platform does not provide the _DSM and does not go to S3 anyway so no
firmware is involved that could already handle these delays.

On this particular platform these delays are not actually needed because
there is an additional delay as part of the ACPI power resource that is
used to turn on power to the hierarchy but since that additional delay is
not required by any of standards (PCIe, ACPI) it is not present in the
Intel Ice Lake, for example where missing the mandatory delays causes
pciehp to start tearing down the stack too early (links are not yet
trained). Below is an example how it looks like when this happens:

  pcieport 0000:83:04.0: pciehp: Slot(4): Card not present
  pcieport 0000:87:04.0: PME# disabled
  pcieport 0000:83:04.0: pciehp: pciehp_unconfigure_device: domain🚌dev = 0000:86:00
  pcieport 0000:86:00.0: Refused to change power state, currently in D3
  pcieport 0000:86:00.0: restoring config space at offset 0x3c (was 0xffffffff, writing 0x201ff)
  pcieport 0000:86:00.0: restoring config space at offset 0x38 (was 0xffffffff, writing 0x0)
  ...

There is also one reported case (see the bugzilla link below) where the
missing delay causes xHCI on a Titan Ridge controller fail to runtime
resume when USB-C dock is plugged. This does not involve pciehp but instead
the PCI core fails to runtime resume the xHCI device:

  pcieport 0000:04:02.0: restoring config space at offset 0xc (was 0x10000, writing 0x10020)
  pcieport 0000:04:02.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100406)
  xhci_hcd 0000:39:00.0: Refused to change power state, currently in D3
  xhci_hcd 0000:39:00.0: restoring config space at offset 0x3c (was 0xffffffff, writing 0x1ff)
  xhci_hcd 0000:39:00.0: restoring config space at offset 0x38 (was 0xffffffff, writing 0x0)
  ...

Add a new function pci_bridge_wait_for_secondary_bus() that is called on
PCI core resume and runtime resume paths accordingly if the bridge entered
D3cold (and thus went through reset).

This is second attempt to add the missing delays. The previous solution in
c2bf1fc212 ("PCI: Add missing link delays required by the PCIe spec") was
reverted because of two issues it caused:

  1. One system become unresponsive after S3 resume due to PME service
     spinning in pcie_pme_work_fn(). The root port in question reports that
     the xHCI sent PME but the xHCI device itself does not have PME status
     set. The PME status bit is never cleared in the root port resulting
     the indefinite loop in pcie_pme_work_fn().

  2. Slows down resume if the root/downstream port does not support Data
     Link Layer Active Reporting because pcie_wait_for_link_delay() waits
     1100 ms in that case.

This version should avoid the above issues because we restrict the delay to
happen only if the port went into D3cold.

Link: https://lore.kernel.org/linux-pci/SL2P216MB01878BBCD75F21D882AEEA2880C60@SL2P216MB0187.KORP216.PROD.OUTLOOK.COM/
Link: https://bugzilla.kernel.org/show_bug.cgi?id=203885
Link: https://lore.kernel.org/r/20191112091617.70282-3-mika.westerberg@linux.intel.com
Reported-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2019-11-20 17:37:24 -06:00
Mika Westerberg
4827d63891 PCI/PM: Add pcie_wait_for_link_delay()
Add pcie_wait_for_link_delay().  Similar to pcie_wait_for_link() but allows
passing custom activation delay in milliseconds.

Link: https://lore.kernel.org/r/20191112091617.70282-2-mika.westerberg@linux.intel.com
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-20 17:37:20 -06:00
Bjorn Helgaas
327ccbbcc1 PCI/PM: Return error when changing power state from D3cold
pci_raw_set_power_state() uses the Power Management capability to change a
device's power state.  The capability is in config space, which is
accessible in D0, D1, D2, and D3hot, but not in D3cold.

If we call pci_raw_set_power_state() on a device that's in D3cold, config
reads fail and return ~0 data, which we erroneously interpreted as "the
device is in D3hot", leading to messages like this:

  pcieport 0000:03:00.0: Refused to change power state, currently in D3

The PCI_PM_CTRL has several RsvdP fields, so ~0 is never a valid register
value.  If we get that value, print a more informative message and return
an error.

Changing the power state of a device from D3cold must be done by a platform
power management method or some other non-config space mechanism.

Link: https://lore.kernel.org/r/20190822200551.129039-4-helgaas@kernel.org
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
2019-11-20 17:36:48 -06:00
Bjorn Helgaas
e43f15ea2f PCI/PM: Decode D3cold power state correctly
Use pci_power_name() to print pci_power_t correctly.  This changes:

  "state 0" or "D0"   to   "D0"
  "state 1" or "D1"   to   "D1"
  "state 2" or "D2"   to   "D2"
  "state 3" or "D3"   to   "D3hot"
  "state 4" or "D4"   to   "D3cold"

Changes dmesg logging only, no other functional change intended.

Link: https://lore.kernel.org/r/20190822200551.129039-3-helgaas@kernel.org
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
2019-11-20 17:36:11 -06:00
Rafael J. Wysocki
9c77e63bd8 PCI/PM: Fold __pci_complete_power_transition() into its caller
Because pci_set_power_state() has become the only caller of
__pci_complete_power_transition(), there is no need for the latter to
be a separate function any more, so fold it into the former, drop a
redundant check and reduce the number of lines of code somewhat.

Code rearrangement, no intentional functional impact.

Link: https://lore.kernel.org/r/15576968.k611qn3UU0@kreacher
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
2019-11-20 17:36:08 -06:00
Rafael J. Wysocki
d6aa37cd04 PCI/PM: Avoid exporting __pci_complete_power_transition()
Notice that radeon_set_suspend(), which is the only caller of
__pci_complete_power_transition() outside of pci.c, really only
cares about the pci_platform_power_transition() invoked by it,
so export the latter instead of it, update the radeon driver to
call pci_platform_power_transition() directly and make
__pci_complete_power_transition() static.

Code rearrangement, no intentional functional impact.

Link: https://lore.kernel.org/r/1731661.ykamz2Tiuf@kreacher
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
2019-11-20 17:35:52 -06:00
Rafael J. Wysocki
dc2256b073 PCI/PM: Fold __pci_start_power_transition() into its caller
Because pci_power_up() has become the only caller of
__pci_start_power_transition(), there is no need for the latter to
be a separate function any more, so fold it into the former, drop a
redundant check and reduce the number of lines of code somewhat.

Code rearrangement, no intentional functional impact.

Link: https://lore.kernel.org/r/3458080.lsoDbfkST9@kreacher
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
2019-11-20 17:35:48 -06:00
Rafael J. Wysocki
adfac8f6b7 PCI/PM: Use pci_power_up() in pci_set_power_state()
Make it explicitly clear that the code to put devices into D0 in
pci_set_power_state() and in pci_pm_default_resume_early() is the
same by making the latter use pci_power_up() for transitions into D0.

Code rearrangement, no intentional functional impact.

Link: https://lore.kernel.org/r/2520019.OZ1nXS5aSj@kreacher
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
2019-11-20 17:35:41 -06:00
Rafael J. Wysocki
81cfa5908f PCI/PM: Move power state update away from pci_power_up()
Move the invocation of pci_update_current_state() from pci_power_up() to
pci_pm_default_resume_early(), which is the only caller of that function.

Preparatory change, no functional impact.

Link: https://lore.kernel.org/r/37482337.udjOGdOKNb@kreacher
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
2019-11-20 17:35:34 -06:00
Bjorn Helgaas
1a1daf097e PCI/PM: Remove unused pci_driver.suspend_late() hook
The struct pci_driver.suspend_late() hook is one of the legacy PCI power
management callbacks, and there are no remaining users of it.  Remove it.

Link: https://lore.kernel.org/r/20191101204558.210235-7-helgaas@kernel.org
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-20 17:35:25 -06:00
Bjorn Helgaas
89cdbc3546 PCI/PM: Remove unused pci_driver.resume_early() hook
The struct pci_driver.resume_early() hook is one of the legacy PCI power
management callbacks, and there are no remaining users of it.  Remove it.

Link: https://lore.kernel.org/r/20191101204558.210235-6-helgaas@kernel.org
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-20 17:35:14 -06:00
Bjorn Helgaas
77b84bb306 xen-platform: Convert to generic power management
Convert xen-platform from the legacy PCI power management callbacks to the
generic operations.  This is one step towards removing support for the
legacy PCI callbacks.

The generic .resume_noirq() operation is called by pci_pm_resume_noirq() at
the same point the legacy PCI .resume_early() callback was, so this patch
should not change the xen-platform behavior.

Link: https://lore.kernel.org/r/20191101204558.210235-5-helgaas@kernel.org
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: KarimAllah Ahmed <karahmed@amazon.de>
2019-11-20 17:35:03 -06:00
Bjorn Helgaas
baef7f8e5e PCI/PM: Simplify pci_set_power_state()
Check for the PCI_DEV_FLAGS_NO_D3 quirk early, before calling
__pci_start_power_transition().  This way all the cases where we don't need
to do anything at all are checked up front.

This doesn't fix anything because if the caller requested D3hot or D3cold,
__pci_start_power_transition() is a no-op.  But calling it is pointless and
makes the code harder to analyze.

Link: https://lore.kernel.org/r/20191101204558.210235-4-helgaas@kernel.org
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-20 17:34:47 -06:00
Bjorn Helgaas
993cc6d1bd PCI/PM: Expand PM reset messages to mention D3hot (not just D3)
pci_pm_reset() resets a device by putting it in D3hot and bringing it back
to D0.  Clarify related messages to mention "D3hot" explicitly instead of
just "D3".

Link: https://lore.kernel.org/r/20191101204558.210235-3-helgaas@kernel.org
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-20 17:34:36 -06:00
Bjorn Helgaas
7e24bc347e PCI/PM: Apply D2 delay as milliseconds, not microseconds
PCI_PM_D2_DELAY is defined as 200, which is milliseconds, but previously we
used udelay(), which only waited for 200 microseconds.  Use msleep()
instead so we wait the correct amount of time.  See PCIe r5.0, sec 5.9.

Link: https://lore.kernel.org/r/20191101204558.210235-2-helgaas@kernel.org
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-20 17:34:26 -06:00
Bjorn Helgaas
12bcae44bf PCI/PM: Use pci_WARN() to include device information
Add and use pci_WARN() wrappers so warnings include device information.

Link: https://lore.kernel.org/r/20191017212851.54237-3-helgaas@kernel.org
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-20 17:34:17 -06:00
Bjorn Helgaas
6941a0c2bd PCI/PM: Use PCI dev_printk() wrappers for consistency
Use the PCI dev_printk() wrappers for consistency with the rest of the PCI
core.  No functional change intended.

Link: https://lore.kernel.org/r/20191017212851.54237-2-helgaas@kernel.org
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-20 17:34:06 -06:00
Bjorn Helgaas
b64cf7a171 PCI/PM: Wrap long lines in documentation
Documentation/power/pci.rst is wrapped to fit in 80 columns, but directory
structure changes made a few lines longer.  Wrap them so they all fit in 80
columns again.

Link: https://lore.kernel.org/r/20191014230016.240912-7-helgaas@kernel.org
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-20 17:33:56 -06:00
Bjorn Helgaas
85a9b0507d PCI/PM: Note that PME can be generated from D0
Per PCIe r5.0 sec 7.5.2.1, PME may be generated from D0, so update
Documentation/power/pci.rst to reflect that.

Link: https://lore.kernel.org/r/20191016194450.68959-1-helgaas@kernel.org
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-20 17:33:46 -06:00
Bjorn Helgaas
6da2f2ccfd PCI/PM: Make power management op coding style consistent
Some of the power management ops use this style:

  struct device_driver *drv = dev->driver;
  if (drv && drv->pm && drv->pm->prepare(dev))
    drv->pm->prepare(dev);

while others use this:

  const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
  if (pm && pm->runtime_resume)
    pm->runtime_resume(dev);

Convert the first style to the second so they're all consistent.  Remove
local "error" variables when unnecessary.  No functional change intended.

Link: https://lore.kernel.org/r/20191014230016.240912-6-helgaas@kernel.org
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-20 17:33:32 -06:00
Bjorn Helgaas
f7b32a86e4 PCI/PM: Run resume fixups before disabling wakeup events
pci_pm_resume() and pci_pm_restore() call pci_pm_default_resume(), which
runs resume fixups before disabling wakeup events:

  static void pci_pm_default_resume(struct pci_dev *pci_dev)
  {
    pci_fixup_device(pci_fixup_resume, pci_dev);
    pci_enable_wake(pci_dev, PCI_D0, false);
  }

pci_pm_runtime_resume() does both of these, but in the opposite order:

  pci_enable_wake(pci_dev, PCI_D0, false);
  pci_fixup_device(pci_fixup_resume, pci_dev);

We should always use the same ordering unless there's a reason to do
otherwise.  Change pci_pm_runtime_resume() to call pci_pm_default_resume()
instead of open-coding this, so the fixups are always done before disabling
wakeup events.

pci_pm_default_resume() is called from pci_pm_runtime_resume(), which is
under #ifdef CONFIG_PM.  If SUSPEND and HIBERNATION are disabled, PM_SLEEP
is disabled also, so move pci_pm_default_resume() from #ifdef
CONFIG_PM_SLEEP to #ifdef CONFIG_PM.

Link: https://lore.kernel.org/r/20191014230016.240912-5-helgaas@kernel.org
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-20 17:33:15 -06:00
Bjorn Helgaas
ec6a75ef8e PCI/PM: Clear PCIe PME Status even for legacy power management
Previously, pci_pm_resume_noirq() cleared the PME Status bit in the Root
Status register only if the device had no driver or the driver did not
implement legacy power management.  It should clear PME Status regardless
of what sort of power management the driver supports, so do this before
checking for legacy power management.

This affects Root Ports and Root Complex Event Collectors, for which the
usual driver is the PCIe portdrv, which implements new power management, so
this change is just on principle, not to fix any actual defects.

Fixes: a39bd851dc ("PCI/PM: Clear PCIe PME Status bit in core, not PCIe port driver")
Link: https://lore.kernel.org/r/20191014230016.240912-4-helgaas@kernel.org
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-20 17:32:57 -06:00
Bjorn Helgaas
dc68b40678 PCI/PM: Correct pci_pm_thaw_noirq() documentation
According to the documentation, pci_pm_thaw_noirq() did not put the device
into the full-power state and restore its standard configuration registers.
This is incorrect, so update the documentation to match the code.

Link: https://lore.kernel.org/r/20191014230016.240912-3-helgaas@kernel.org
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-20 17:32:30 -06:00
Dexuan Cui
f2c33ccacb PCI/PM: Always return devices to D0 when thawing
pci_pm_thaw_noirq() is supposed to return the device to D0 and restore its
configuration registers, but previously it only did that for devices whose
drivers implemented the new power management ops.

Hibernation, e.g., via "echo disk > /sys/power/state", involves freezing
devices, creating a hibernation image, thawing devices, writing the image,
and powering off.  The fact that thawing did not return devices with legacy
power management to D0 caused errors, e.g., in this path:

  pci_pm_thaw_noirq
    if (pci_has_legacy_pm_support(pci_dev)) # true for Mellanox VF driver
      return pci_legacy_resume_early(dev)   # ... legacy PM skips the rest
    pci_set_power_state(pci_dev, PCI_D0)
    pci_restore_state(pci_dev)
  pci_pm_thaw
    if (pci_has_legacy_pm_support(pci_dev))
      pci_legacy_resume
	drv->resume
	  mlx4_resume
	    ...
	      pci_enable_msix_range
	        ...
		  if (dev->current_state != PCI_D0)  # <---
		    return -EINVAL;

which caused these warnings:

  mlx4_core a6d1:00:02.0: INTx is not supported in multi-function mode, aborting
  PM: dpm_run_callback(): pci_pm_thaw+0x0/0xd7 returns -95
  PM: Device a6d1:00:02.0 failed to thaw: error -95

Return devices to D0 and restore config registers for all devices, not just
those whose drivers support new power management.

[bhelgaas: also call pci_restore_state() before pci_legacy_resume_early(),
update comment, add stable tag, commit log]
Link: https://lore.kernel.org/r/KU1P153MB016637CAEAD346F0AA8E3801BFAD0@KU1P153MB0166.APCP153.PROD.OUTLOOK.COM
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: stable@vger.kernel.org	# v4.13+
2019-11-20 17:32:25 -06:00
Dave Airlie
30c185da76 Merge tag 'drm-intel-next-fixes-2019-11-20' of git://anongit.freedesktop.org/drm/drm-intel into drm-next
- Includes gvt-next-fixes-2019-11-12
- Fix Bugzilla #112051: Fix detection for a CMP-V PCH
- Fix Bugzilla #112256: Corrupted page table at address on plymouth splash
- Fix Bugzilla #111594: Avoid losing RC6 when HuC authentication is used
- Fix for OA/perf metric coherency, restore GT coarse power gating workaround
- Avoid atomic context on error capture
- Avoid MST bitmask overflowing to EDP/DPI input select
- Fixes to CI found dmesg splats

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191120204035.GA14908@jlahtine-desk.ger.corp.intel.com
2019-11-21 09:17:03 +10:00
Dmitry Monakhov
40d47c155e block,bfq: Skip tracing hooks if possible
In most cases blk_tracing is not active, but  bfq_log_bfqq macro
generate pid_str unconditionally, which result in significant overhead.

## Test
modprobe null_blk
echo bfq > /sys/block/nullb0/queue/scheduler
fio --name=t --ioengine=libaio --direct=1 --filename=/dev/nullb0 \
   --runtime=30 --time_based=1 --rw=write --iodepth=128 --bs=4k

# Results
|        | baseline | w/ patch | gain |
| iops   | 113.19K  | 126.42K  | +11% |

Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Dmitry Monakhov <dmonakhov@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-20 16:10:29 -07:00
Dave Airlie
c22fe762ba Merge tag 'drm-next-5.5-2019-11-15' of git://people.freedesktop.org/~agd5f/linux into drm-next
drm-next-5.5-2019-11-15:

amdgpu:
- Fix AVFS handling on SMU7 parts with custom power tables
- Enable Overdrive sysfs interface for Navi parts
- Fix power limit handling on smu11 parts
- Fix pcie link sysfs output for Navi
- Probably cancel MM worker threads on shutdown

radeon:
- Cleanup for ppc change

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexdeucher@gmail.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191115163516.3714-1-alexander.deucher@amd.com
2019-11-21 08:52:27 +10:00
Alexander Duyck
e3cb0c7102 x86/ioperm: Fix use of deprecated config option
The commit

  111e7b15cf ("x86/ioperm: Extend IOPL config to control ioperm() as well")

replaced X86_IOPL_EMULATION with X86_IOPL_IOPERM. However it appears
that there was at least one spot missed as tss_update_io_bitmap() still
had a reference to it contained in the code.

The result of this is that it exposed a NULL pointer dereference as seen
below with a linux-next next-20191120 kernel:

  BUG: kernel NULL pointer dereference, address: 0000000000000000
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: 0000 [#1] SMP PTI
  CPU: 5 PID: 1542 Comm: ovs-vswitchd Tainted: G        W 5.4.0-rc8-next-20191120 #125
  RIP: 0010:tss_update_io_bitmap+0x4e/0x180
  Code: 10 31 c0 65 48 03 1d 69 54 5d 6d 65 48 8b 04 25 40 8c 01 00 48 8b 10 \
	  f7 c2 00 00 40 00 0f 84 8c 00 00 00 4c 8b a0 c0 22 00 00 <49> 8b 04 \
	  24 48 39 43 68 74 2e 8b 53 70 41 39 54 24 0c 48 8d 7b 78
  RSP: 0018:ffffb8888a0ebf08 EFLAGS: 00010006
  RAX: ffff8a429811a680 RBX: ffff8a4c3f946000 RCX: 0000000000000011
  RDX: 0000000000400080 RSI: 0000000000400080 RDI: 0000000000000000
  RBP: ffffb8888a0ebf30 R08: 00007ffffb5d7ce0 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
  R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
  FS:  00007f68a9635c40(0000) GS:ffff8a4c3f940000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000000 CR3: 000000103572a001 CR4: 00000000001606e0
  Call Trace:
   ? syscall_slow_exit_work+0x39/0xdb
   do_syscall_64+0x1a5/0x200
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f68a7aff797

Fixes: 111e7b15cf ("x86/ioperm: Extend IOPL config to control ioperm() as well")
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20191120222426.3060.18462.stgit@localhost.localdomain
2019-11-20 23:40:05 +01:00
Mike Snitzer
f612b2132d Revert "dm crypt: use WQ_HIGHPRI for the IO and crypt workqueues"
This reverts commit a1b89132dc.

Revert required hand-patching due to subsequent changes that were
applied since commit a1b89132dc.

Requires: ed0302e830 ("dm crypt: make workqueue names device-specific")
Cc: stable@vger.kernel.org
Bug: https://bugzilla.kernel.org/show_bug.cgi?id=199857
Reported-by: Vito Caputo <vcaputo@pengaru.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-20 17:27:39 -05:00
Daniel Borkmann
196e8ca748 bpf: Switch bpf_map_{area_alloc,area_mmapable_alloc}() to u64 size
Given we recently extended the original bpf_map_area_alloc() helper in
commit fc9702273e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY"),
we need to apply the same logic as in ff1c08e1f7 ("bpf: Change size
to u64 for bpf_map_{area_alloc, charge_init}()"). To avoid conflicts,
extend it for bpf-next.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-11-20 23:18:58 +01:00
Daniel Borkmann
91e6015b08 bpf: Emit audit messages upon successful prog load and unload
Allow for audit messages to be emitted upon BPF program load and
unload for having a timeline of events. The load itself is in
syscall context, so additional info about the process initiating
the BPF prog creation can be logged and later directly correlated
to the unload event.

The only info really needed from BPF side is the globally unique
prog ID where then audit user space tooling can query / dump all
info needed about the specific BPF program right upon load event
and enrich the record, thus these changes needed here can be kept
small and non-intrusive to the core.

Raw example output:

  # auditctl -D
  # auditctl -a always,exit -F arch=x86_64 -S bpf
  # ausearch --start recent -m 1334
  [...]
  ----
  time->Wed Nov 20 12:45:51 2019
  type=PROCTITLE msg=audit(1574271951.590:8974): proctitle="./test_verifier"
  type=SYSCALL msg=audit(1574271951.590:8974): arch=c000003e syscall=321 success=yes exit=14 a0=5 a1=7ffe2d923e80 a2=78 a3=0 items=0 ppid=742 pid=949 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=2 comm="test_verifier" exe="/root/bpf-next/tools/testing/selftests/bpf/test_verifier" subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key=(null)
  type=UNKNOWN[1334] msg=audit(1574271951.590:8974): auid=0 uid=0 gid=0 ses=2 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 pid=949 comm="test_verifier" exe="/root/bpf-next/tools/testing/selftests/bpf/test_verifier" prog-id=3260 event=LOAD
  ----
  time->Wed Nov 20 12:45:51 2019
type=UNKNOWN[1334] msg=audit(1574271951.590:8975): prog-id=3260 event=UNLOAD
  ----
  [...]

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191120213816.8186-1-jolsa@kernel.org
2019-11-20 13:44:51 -08:00
Bjorn Helgaas
9d09e5a95c PCI: Fix typos
Fix typos in drivers/pci.  Comment and whitespace changes only.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2019-11-20 15:20:21 -06:00
David S. Miller
064a18998b mlx5-fixes-2019-11-20
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAl3VowIACgkQSD+KveBX
 +j7cZgf/aEWFOw6e9oaELHqsWYWaqBabAh/celXLIVx7in1JR5oCGSQHRAH/5JB2
 HHfvXvN5Yk9YNga5HtT4mqZS6NsgYksG3pneuJApLcbY627pAzlw2i90yqIKz8In
 svulz/BBv22msxk/F2ZyQ04zltiNplrHI1ESbxmMhLuhRA5M9AwRTjfa2Uk6kbLj
 pmwmAEymNsxNfRfL4/sVMXgbUcTOkf38h4qAWKTnUZeCFCVk2pbIjBkNTQ6eG+kY
 rbYYqMyhqusGhvkiP083rdnu9aKmVUG55jIyd00PhrrVow9HSTKwNqlcq/+qtYpE
 H5dhgxJexrXJd02m+KK2h+DZxPsHDA==
 =t0rV
 -----END PGP SIGNATURE-----

Merge tag 'mlx5-fixes-2019-11-20' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
Mellanox, mlx5 fixes 2019-11-20

This series introduces some fixes to mlx5 driver.

Please pull and let me know if there is any problem.

For -stable v4.9:
 ('net/mlx5e: Fix set vf link state error flow')

For -stable v4.14
 ('net/mlxfw: Verify FSM error code translation doesn't exceed array size')

For -stable v4.19
 ('net/mlx5: Fix auto group size calculation')

For -stable v5.3
 ('net/mlx5e: Fix error flow cleanup in mlx5e_tc_tun_create_header_ipv4/6')
 ('net/mlx5e: Do not use non-EXT link modes in EXT mode')
 ('net/mlx5: Update the list of the PCI supported devices')
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-20 12:56:32 -08:00
David S. Miller
e2193c9334 Merge branch 'r8169-smaller-improvements-to-firmware-handling'
Heiner Kallweit says:

====================
r8169: smaller improvements to firmware handling

This series includes few smaller improvements to firmware handling.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-20 12:50:25 -08:00
Heiner Kallweit
df0120f12f r8169: add check for PHY_MDIO_CHG to rtl_nic_fw_data_ok
Only values 0 and 1 are currently defined as parameters for
PHY_MDIO_CHG. Instead of silently ignoring unknown values and
misinterpreting the firmware code let's explicitly check.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-20 12:50:24 -08:00
Heiner Kallweit
cfccde80e8 r8169: use macro FIELD_SIZEOF in definition of FW_OPCODE_SIZE
Using macro FIELD_SIZEOF makes this define easier understandable.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-20 12:50:24 -08:00
Heiner Kallweit
e20c43dbdf r8169: change mdelay to msleep in rtl_fw_write_firmware
We're not in atomic context here, therefore switch to msleep.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-20 12:50:24 -08:00
Prashant Malani
8481141246 r8152: Re-order napi_disable in rtl8152_close
Both rtl_work_func_t() and rtl8152_close() call napi_disable().
Since the two calls aren't protected by a lock, if the close
function starts executing before the work function, we can get into a
situation where the napi_disable() function is called twice in
succession (first by rtl8152_close(), then by set_carrier()).

In such a situation, the second call would loop indefinitely, since
rtl8152_close() doesn't call napi_enable() to clear the NAPI_STATE_SCHED
bit.

The rtl8152_close() function in turn issues a
cancel_delayed_work_sync(), and so it would wait indefinitely for the
rtl_work_func_t() to complete. Since rtl8152_close() is called by a
process holding rtnl_lock() which is requested by other processes, this
eventually leads to a system deadlock and crash.

Re-order the napi_disable() call to occur after the work function
disabling and urb cancellation calls are issued.

Change-Id: I6ef0b703fc214998a037a68f722f784e1d07815e
Reported-by: http://crbug.com/1017928
Signed-off-by: Prashant Malani <pmalani@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-20 12:48:13 -08:00
David S. Miller
b172845a40 Merge branch 'qca_spi-fixes'
Stefan Wahren says:

====================
net: qca_spi: Fix receive and reset issues

This small patch series fixes two major issues in the SPI driver for the
QCA700x.

It has been tested on a Charge Control C 300 (NXP i.MX6ULL +
2x QCA7000).
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-20 12:42:23 -08:00
Stefan Wahren
bc19c32904 net: qca_spi: Move reset_count to struct qcaspi
The reset counter is specific for every QCA700x chip. So move this
into the private driver struct. Otherwise we get unpredictable reset
behavior in setups with multiple QCA700x chips.

Fixes: 291ab06ecf (net: qualcomm: new Ethernet over SPI driver for QCA7000)
Signed-off-by: Stefan Wahren <stefan.wahren@in-tech.com>
Signed-off-by: Stefan Wahren <wahrenst@gmx.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-20 12:42:23 -08:00
Michael Heimpold
3e7e676c81 net: qca_spi: fix receive buffer size check
When receiving many or larger packets, e.g. when doing a file download,
it was observed that the read buffer size register reports up to 4 bytes
more than the current define allows in the check.
If this is the case, then no data transfer is initiated to receive the
packets (and thus to empty the buffer) which results in a stall of the
interface.

These 4 bytes are a hardware generated frame length which is prepended
to the actual frame, thus we have to respect it during our check.

Fixes: 026b907d58 ("net: qca_spi: Add available buffer space verification")
Signed-off-by: Michael Heimpold <michael.heimpold@in-tech.com>
Signed-off-by: Stefan Wahren <wahrenst@gmx.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-20 12:42:23 -08:00
Thomas Bogendoerfer
e2ffe3ff6f net: ipconfig: Wait for deferred device probes
If network device drives are using deferred probing, it was possible
that waiting for devices to show up in ipconfig was already over,
when the device eventually showed up. By calling wait_for_device_probe()
we now make sure deferred probing is done before checking for available
devices.

Signed-off-by: Thomas Bogendoerfer <tbogendoerfer@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-20 12:39:53 -08:00
Mao Wenan
2be8ca97d0 vsock/vmci: make vmci_vsock_cb_host_called static
When using make C=2 drivers/misc/vmw_vmci/vmci_driver.o
to compile, below warning can be seen:
drivers/misc/vmw_vmci/vmci_driver.c:33:6: warning:
symbol 'vmci_vsock_cb_host_called' was not declared. Should it be static?

This patch make symbol vmci_vsock_cb_host_called static.

Fixes: b1bba80a43 ("vsock/vmci: register vmci_transport only when VMCI guest/host are active")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-20 12:39:29 -08:00
David S. Miller
aee024f610 Merge branch 'ibmvnic-regression'
Juliet Kim says:

====================
Support both XIVE and XICS modes in ibmvnic

This series aims to support both XICS and XIVE with avoiding
a regression in behavior when a system runs in XICS mode.

Patch 1 reverts commit 11d49ce9f7
(“net/ibmvnic: Fix EOI when running in XIVE mode.”)

Patch 2 Ignore H_FUNCTION return from H_EOI to tolerate XIVE mode
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-20 12:37:15 -08:00
Juliet Kim
2df5c60e19 net/ibmvnic: Ignore H_FUNCTION return from H_EOI to tolerate XIVE mode
Reversion of commit 11d49ce9f7
(“net/ibmvnic: Fix EOI when running in XIVE mode.”) leaves us
calling H_EOI even in XIVE mode. That will fail with H_FUNCTION
because H_EOI is not supported in that mode. That failure is
harmless. Ignore it so we can use common code for both XICS and
XIVE.

Signed-off-by: Juliet Kim <julietk@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-20 12:37:15 -08:00
Juliet Kim
284f87d2f3 Revert "net/ibmvnic: Fix EOI when running in XIVE mode"
This reverts commit 11d49ce9f7
(“net/ibmvnic: Fix EOI when running in XIVE mode.”) since that
has the unintended effect of changing the interrupt priority
and emits warning when running in legacy XICS mode.

Signed-off-by: Juliet Kim <julietk@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-20 12:37:15 -08:00
David S. Miller
e07e75412b Merge branch 'page_pool-DMA-sync'
Lorenzo Bianconi says:

====================
add DMA-sync-for-device capability to page_pool API

Introduce the possibility to sync DMA memory for device in the page_pool API.
This feature allows to sync proper DMA size and not always full buffer
(dma_sync_single_for_device can be very costly).
Please note DMA-sync-for-CPU is still device driver responsibility.
Relying on page_pool DMA sync mvneta driver improves XDP_DROP pps of
about 170Kpps:

- XDP_DROP DMA sync managed by mvneta driver:	~420Kpps
- XDP_DROP DMA sync managed by page_pool API:	~585Kpps

Do not change naming convention for the moment since the changes will hit other
drivers as well. I will address it in another series.

Changes since v4:
- do not allow the driver to set max_len to 0
- convert PP_FLAG_DMA_MAP/PP_FLAG_DMA_SYNC_DEV to BIT() macro

Changes since v3:
- move dma_sync_for_device before putting the page in ptr_ring in
  __page_pool_recycle_into_ring since ptr_ring can be consumed
  concurrently. Simplify the code moving dma_sync_for_device
  before running __page_pool_recycle_direct/__page_pool_recycle_into_ring

Changes since v2:
- rely on PP_FLAG_DMA_SYNC_DEV flag instead of dma_sync

Changes since v1:
- rename sync in dma_sync
- set dma_sync_size to 0xFFFFFFFF in page_pool_recycle_direct and
  page_pool_put_page routines
- Improve documentation
====================

Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-20 12:34:37 -08:00
Lorenzo Bianconi
07e13edbb6 net: mvneta: get rid of huge dma sync in mvneta_rx_refill
Get rid of costly dma_sync_single_for_device in mvneta_rx_refill
since now the driver can let page_pool API to manage needed DMA
sync with a proper size.

- XDP_DROP DMA sync managed by mvneta driver:	~420Kpps
- XDP_DROP DMA sync managed by page_pool API:	~585Kpps

Tested-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-20 12:34:29 -08:00
Lorenzo Bianconi
e68bc75691 net: page_pool: add the possibility to sync DMA memory for device
Introduce the following parameters in order to add the possibility to sync
DMA memory for device before putting allocated pages in the page_pool
caches:
- PP_FLAG_DMA_SYNC_DEV: if set in page_pool_params flags, all pages that
  the driver gets from page_pool will be DMA-synced-for-device according
  to the length provided by the device driver. Please note DMA-sync-for-CPU
  is still device driver responsibility
- offset: DMA address offset where the DMA engine starts copying rx data
- max_len: maximum DMA memory size page_pool is allowed to flush. This
  is currently used in __page_pool_alloc_pages_slow routine when pages
  are allocated from page allocator
These parameters are supposed to be set by device drivers.

This optimization reduces the length of the DMA-sync-for-device.
The optimization is valid because pages are initially
DMA-synced-for-device as defined via max_len. At RX time, the driver
will perform a DMA-sync-for-CPU on the memory for the packet length.
What is important is the memory occupied by packet payload, because
this is the area CPU is allowed to read and modify. As we don't track
cache-lines written into by the CPU, simply use the packet payload length
as dma_sync_size at page_pool recycle time. This also take into account
any tail-extend.

Tested-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-20 12:34:28 -08:00
Lorenzo Bianconi
f383b29500 net: mvneta: rely on page_pool_recycle_direct in mvneta_run_xdp
Rely on page_pool_recycle_direct and not on xdp_return_buff in
mvneta_run_xdp. This is a preliminary patch to limit the dma sync len
to the one strictly necessary

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-20 12:34:17 -08:00