Commit Graph

1042671 Commits

Author SHA1 Message Date
Dongliang Mu
6fa54bc713 media: em28xx-input: fix refcount bug in em28xx_usb_disconnect
If em28xx_ir_init fails, it would decrease the refcount of dev. However,
in the em28xx_ir_fini, when ir is NULL, it goes to ref_put and decrease
the refcount of dev. This will lead to a refcount bug.

Fix this bug by removing the kref_put in the error handling code
of em28xx_ir_init.

refcount_t: underflow; use-after-free.
WARNING: CPU: 0 PID: 7 at lib/refcount.c:28 refcount_warn_saturate+0x18e/0x1a0 lib/refcount.c:28
Modules linked in:
CPU: 0 PID: 7 Comm: kworker/0:1 Not tainted 5.13.0 #3
Workqueue: usb_hub_wq hub_event
RIP: 0010:refcount_warn_saturate+0x18e/0x1a0 lib/refcount.c:28
Call Trace:
  kref_put.constprop.0+0x60/0x85 include/linux/kref.h:69
  em28xx_usb_disconnect.cold+0xd7/0xdc drivers/media/usb/em28xx/em28xx-cards.c:4150
  usb_unbind_interface+0xbf/0x3a0 drivers/usb/core/driver.c:458
  __device_release_driver drivers/base/dd.c:1201 [inline]
  device_release_driver_internal+0x22a/0x230 drivers/base/dd.c:1232
  bus_remove_device+0x108/0x160 drivers/base/bus.c:529
  device_del+0x1fe/0x510 drivers/base/core.c:3540
  usb_disable_device+0xd1/0x1d0 drivers/usb/core/message.c:1419
  usb_disconnect+0x109/0x330 drivers/usb/core/hub.c:2221
  hub_port_connect drivers/usb/core/hub.c:5151 [inline]
  hub_port_connect_change drivers/usb/core/hub.c:5440 [inline]
  port_event drivers/usb/core/hub.c:5586 [inline]
  hub_event+0xf81/0x1d40 drivers/usb/core/hub.c:5668
  process_one_work+0x2c9/0x610 kernel/workqueue.c:2276
  process_scheduled_works kernel/workqueue.c:2338 [inline]
  worker_thread+0x333/0x5b0 kernel/workqueue.c:2424
  kthread+0x188/0x1d0 kernel/kthread.c:319
  ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295

Reported-by: Dongliang Mu <mudongliangabcd@gmail.com>
Fixes: ac56886371 ("media: em28xx: Fix possible memory leak of em28xx struct")
Signed-off-by: Dongliang Mu <mudongliangabcd@gmail.com>
Signed-off-by: Sean Young <sean@mess.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
2021-08-04 14:43:49 +02:00
Viktor Prutyanov
49be1c78d5 media: rc: introduce Meson IR TX driver
This patch adds the driver for Amlogic Meson IR transmitter.

Some Amlogic SoCs such as A311D and T950D4 have IR transmitter
(also called blaster) controller onboard. It is capable of sending
IR signals with arbitrary carrier frequency and duty cycle.

The driver supports 2 modulation clock sources:
 - xtal3 clock (xtal divided by 3)
 - 1us clock

Signed-off-by: Viktor Prutyanov <viktor.prutyanov@phystech.edu>
Signed-off-by: Sean Young <sean@mess.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
2021-08-04 14:43:49 +02:00
Viktor Prutyanov
e9f504f7b5 media: rc: meson-ir-tx: document device tree bindings
This patch adds binding documentation for the IR transmitter
available in Amlogic Meson SoCs.

Signed-off-by: Viktor Prutyanov <viktor.prutyanov@phystech.edu>
Acked-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: Sean Young <sean@mess.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
2021-08-04 14:43:49 +02:00
Arnd Bergmann
8e816b9915 Merge tag 'at91-dt-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/at91/linux into arm/dt
AT91 dt for 5.15:

- add sama7g5 SoC and associated evaluation kit, the sama7g5-ek
- adaptation of some DT for sama5d27 som1 ek, sama5d4 xplained and
  sama5d2 icp boards
- fixes to gpio and shutdown controller nodes for all boards

* tag 'at91-dt-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/at91/linux:
  ARM: dts: at91: use the right property for shutdown controller
  ARM: dts: at91: sama5d2_icp: enable digital filter for I2C nodes
  ARM: dts: at91: sama5d4_xplained: change the key code of the gpio key
  ARM: dts: at91: add conflict note for d3
  ARM: dts: at91: add pinctrl-{names, 0} for all gpios
  ARM: dts: at91: sama5d27_som1_ek: enable ADC node
  ARM: dts: at91: sama5d4_xplained: Remove spi0 node
  dt-bindings: atmel-sysreg: add bindings for sama7g5
  ARM: dts: at91: add sama7g5 SoC DT and sama7g5-ek
  dt-bindings: ARM: at91: document sama7g5ek board

Link: https://lore.kernel.org/r/20210804085000.13233-1-nicolas.ferre@microchip.com
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2021-08-04 14:31:32 +02:00
Arnd Bergmann
72ee3b4dc2 Merge tag 'ux500-dts-v5.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-nomadik into arm/dt
Ux500 Device Tree updates for the v5.15 kernel cycle:

- New device trees for these mobile phones:
  - Samsung Gavini
  - Samsung Codina
  - Samsung Kyle
- Flag eMMC cards as non-SD non-SDIO to save time
- Link USB PHY to USB controller in the device tree
- Fix up the operating points to the actual clock frequencies

* tag 'ux500-dts-v5.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-nomadik:
  ARM: dts: ux500: Adjust operating points to reality
  ARM: dts: ux500: Add a device tree for Kyle
  ARM: dts: ux500: Add devicetree for Codina
  ARM: dts: ux500: ab8500: Link USB PHY to USB controller node
  ARM: dts: ux500: Flag eMMCs as non-SDIO/SD
  ARM: dts: ux500: Add device tree for Samsung Gavini

Link: https://lore.kernel.org/r/CACRpkdbjBv5ywZZD8rK07d5sLcHsG8o4iYD-3jHO=HLg6-nKnA@mail.gmail.com
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2021-08-04 14:30:29 +02:00
Takashi Iwai
f2553d4678 ASoC: amd: vangogh: Drop superfluous mmap callback
The mmap callback of vangogh driver just calls the default mmap
handler, and it's superfluous, as the PCM core would call it if not
set.  Let's drop the superfluous mmap callback.

Fixes: 361414dc1f ("ASoC: amd: add vangogh i2s dma driver pm ops")
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Link: https://lore.kernel.org/r/20210804075223.9823-1-tiwai@suse.de
Signed-off-by: Mark Brown <broonie@kernel.org>
2021-08-04 13:20:14 +01:00
Marc Zyngier
47e6223c84 KVM: arm64: Unregister HYP sections from kmemleak in protected mode
Booting a KVM host in protected mode with kmemleak quickly results
in a pretty bad crash, as kmemleak doesn't know that the HYP sections
have been taken away. This is specially true for the BSS section,
which is part of the kernel BSS section and registered at boot time
by kmemleak itself.

Unregister the HYP part of the BSS before making that section
HYP-private. The rest of the HYP-specific data is obtained via
the page allocator or lives in other sections, none of which is
subjected to kmemleak.

Fixes: 90134ac9ca ("KVM: arm64: Protect the .hyp sections from the host")
Reviewed-by: Quentin Perret <qperret@google.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org # 5.13
Link: https://lore.kernel.org/r/20210802123830.2195174-3-maz@kernel.org
2021-08-04 13:10:47 +01:00
Marc Zyngier
eb48d154cd arm64: Move .hyp.rodata outside of the _sdata.._edata range
The HYP rodata section is currently lumped together with the BSS,
which isn't exactly what is expected (it gets registered with
kmemleak, for example).

Move it away so that it is actually marked RO. As an added
benefit, it isn't registered with kmemleak anymore.

Fixes: 380e18ade4 ("KVM: arm64: Introduce a BSS section for use at Hyp")
Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org #5.13
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20210802123830.2195174-2-maz@kernel.org
2021-08-04 13:09:20 +01:00
Praveen Kumar
e5d9b714fe x86/hyperv: fix root partition faults when writing to VP assist page MSR
For root partition the VP assist pages are pre-determined by the
hypervisor. The root kernel is not allowed to change them to
different locations. And thus, we are getting below stack as in
current implementation root is trying to perform write to specific
MSR.

    [ 2.778197] unchecked MSR access error: WRMSR to 0x40000073 (tried to write 0x0000000145ac5001) at rIP: 0xffffffff810c1084 (native_write_msr+0x4/0x30)
    [ 2.784867] Call Trace:
    [ 2.791507] hv_cpu_init+0xf1/0x1c0
    [ 2.798144] ? hyperv_report_panic+0xd0/0xd0
    [ 2.804806] cpuhp_invoke_callback+0x11a/0x440
    [ 2.811465] ? hv_resume+0x90/0x90
    [ 2.818137] cpuhp_issue_call+0x126/0x130
    [ 2.824782] __cpuhp_setup_state_cpuslocked+0x102/0x2b0
    [ 2.831427] ? hyperv_report_panic+0xd0/0xd0
    [ 2.838075] ? hyperv_report_panic+0xd0/0xd0
    [ 2.844723] ? hv_resume+0x90/0x90
    [ 2.851375] __cpuhp_setup_state+0x3d/0x90
    [ 2.858030] hyperv_init+0x14e/0x410
    [ 2.864689] ? enable_IR_x2apic+0x190/0x1a0
    [ 2.871349] apic_intr_mode_init+0x8b/0x100
    [ 2.878017] x86_late_time_init+0x20/0x30
    [ 2.884675] start_kernel+0x459/0x4fb
    [ 2.891329] secondary_startup_64_no_verify+0xb0/0xbb

Since the hypervisor already provides the VP assist pages for root
partition, we need to memremap the memory from hypervisor for root
kernel to use. The mapping is done in hv_cpu_init during bringup and is
unmapped in hv_cpu_die during teardown.

Signed-off-by: Praveen Kumar <kumarpraveen@linux.microsoft.com>
Reviewed-by: Sunil Muthuswamy <sunilmut@microsoft.com>
Link: https://lore.kernel.org/r/20210731120519.17154-1-kumarpraveen@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2021-08-04 11:56:53 +00:00
Nick Richardson
c2eecaa193 pktgen: Remove redundant clone_skb override
When the netif_receive xmit_mode is set, a line is supposed to set
clone_skb to a default 0 value. This line is made redundant due to a
preceding line that checks if clone_skb is more than zero and returns
-ENOTSUPP.

Overriding clone_skb to 0 does not make any difference to the behavior
because if it was positive we return error. So it can be either 0 or
negative, and in both cases the behavior is the same.

Remove redundant line that sets clone_skb to zero.

Signed-off-by: Nick Richardson <richardsonnick@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-04 12:54:09 +01:00
Jonathan Lemon
773bda9649 ptp: ocp: Expose various resources on the timecard.
The OpenCompute timecard driver has additional functionality besides
a clock.  Make the following resources available:

 - The external timestamp channels (ts0/ts1)
 - devlink support for flashing and health reporting
 - GPS and MAC serial ports
 - board serial number (obtained from i2c device)

Also add watchdog functionality for when GNSS goes into holdover.

The resources are collected under a timecard class directory:

  [jlemon@timecard ~]$ ls -g /sys/class/timecard/ocp1/
  total 0
  -r--r--r--. 1 root 4096 Aug  3 19:49 available_clock_sources
  -rw-r--r--. 1 root 4096 Aug  3 19:49 clock_source
  lrwxrwxrwx. 1 root    0 Aug  3 19:49 device -> ../../../0000:04:00.0/
  -r--r--r--. 1 root 4096 Aug  3 19:49 gps_sync
  lrwxrwxrwx. 1 root    0 Aug  3 19:49 i2c -> ../../xiic-i2c.1024/i2c-2/
  drwxr-xr-x. 2 root    0 Aug  3 19:49 power/
  lrwxrwxrwx. 1 root    0 Aug  3 19:49 pps ->
  ../../../../../virtual/pps/pps1/
  lrwxrwxrwx. 1 root    0 Aug  3 19:49 ptp -> ../../ptp/ptp2/
  -r--r--r--. 1 root 4096 Aug  3 19:49 serialnum
  lrwxrwxrwx. 1 root    0 Aug  3 19:49 subsystem ->
  ../../../../../../class/timecard/
  lrwxrwxrwx. 1 root    0 Aug  3 19:49 ttyGPS -> ../../tty/ttyS7/
  lrwxrwxrwx. 1 root    0 Aug  3 19:49 ttyMAC -> ../../tty/ttyS8/
  -rw-r--r--. 1 root 4096 Aug  3 19:39 uevent

The labeling is needed at the minimum, in order to tell the serial
devices apart.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-04 12:52:50 +01:00
Pavel Tikhomirov
04190bf894 sock: allow reading and changing sk_userlocks with setsockopt
SOCK_SNDBUF_LOCK and SOCK_RCVBUF_LOCK flags disable automatic socket
buffers adjustment done by kernel (see tcp_fixup_rcvbuf() and
tcp_sndbuf_expand()). If we've just created a new socket this adjustment
is enabled on it, but if one changes the socket buffer size by
setsockopt(SO_{SND,RCV}BUF*) it becomes disabled.

CRIU needs to call setsockopt(SO_{SND,RCV}BUF*) on each socket on
restore as it first needs to increase buffer sizes for packet queues
restore and second it needs to restore back original buffer sizes. So
after CRIU restore all sockets become non-auto-adjustable, which can
decrease network performance of restored applications significantly.

CRIU need to be able to restore sockets with enabled/disabled adjustment
to the same state it was before dump, so let's add special setsockopt
for it.

Let's also export SOCK_SNDBUF_LOCK and SOCK_RCVBUF_LOCK flags to uAPI so
that using these interface one can reenable automatic socket buffer
adjustment on their sockets.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-04 12:52:03 +01:00
Ivan T. Ivanov
6b67d4d63e net: usb: lan78xx: don't modify phy_device state concurrently
Currently phy_device state could be left in inconsistent state shown
by following alert message[1]. This is because phy_read_status could
be called concurrently from lan78xx_delayedwork, phy_state_machine and
__ethtool_get_link. Fix this by making sure that phy_device state is
updated atomically.

[1] lan78xx 1-1.1.1:1.0 eth0: No phy led trigger registered for speed(-1)

Signed-off-by: Ivan T. Ivanov <iivanov@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-04 12:51:14 +01:00
Jakub Kicinski
396492b4c5 docs: networking: netdevsim rules
There are aspects of netdevsim which are commonly
misunderstood and pointed out in review. Cong
suggest we document them.

Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-04 12:43:27 +01:00
Peilin Ye
625af9f029 tc-testing: Add control-plane selftests for sch_mq
Recently we added multi-queue support to netdevsim in commit d4861fc6be
("netdevsim: Add multi-queue support"); add a few control-plane selftests
for sch_mq using this new feature.

Use nsPlugin.py to avoid network interface name collisions.

Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-04 12:42:27 +01:00
Vladimir Oltean
a54182b2a5 Revert "net: build all switchdev drivers as modules when the bridge is a module"
This reverts commit b0e8181762. Explicit
driver dependency on the bridge is no longer needed since
switchdev_bridge_port_{,un}offload() is no longer implemented by the
bridge driver but by switchdev.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-04 12:35:07 +01:00
Vladimir Oltean
957e2235e5 net: make switchdev_bridge_port_{,unoffload} loosely coupled with the bridge
With the introduction of explicit offloading API in switchdev in commit
2f5dc00f7a ("net: bridge: switchdev: let drivers inform which bridge
ports are offloaded"), we started having Ethernet switch drivers calling
directly into a function exported by net/bridge/br_switchdev.c, which is
a function exported by the bridge driver.

This means that drivers that did not have an explicit dependency on the
bridge before, like cpsw and am65-cpsw, now do - otherwise it is not
possible to call a symbol exported by a driver that can be built as
module unless you are a module too.

There was an attempt to solve the dependency issue in the form of commit
b0e8181762 ("net: build all switchdev drivers as modules when the
bridge is a module"). Grygorii Strashko, however, says about it:

| In my opinion, the problem is a bit bigger here than just fixing the
| build :(
|
| In case, of ^cpsw the switchdev mode is kinda optional and in many
| cases (especially for testing purposes, NFS) the multi-mac mode is
| still preferable mode.
|
| There were no such tight dependency between switchdev drivers and
| bridge core before and switchdev serviced as independent, notification
| based layer between them, so ^cpsw still can be "Y" and bridge can be
| "M". Now for mostly every kernel build configuration the CONFIG_BRIDGE
| will need to be set as "Y", or we will have to update drivers to
| support build with BRIDGE=n and maintain separate builds for
| networking vs non-networking testing.  But is this enough?  Wouldn't
| it cause 'chain reaction' required to add more and more "Y" options
| (like CONFIG_VLAN_8021Q)?
|
| PS. Just to be sure we on the same page - ARM builds will be forced
| (with this patch) to have CONFIG_TI_CPSW_SWITCHDEV=m and so all our
| automation testing will just fail with omap2plus_defconfig.

In the light of this, it would be desirable for some configurations to
avoid dependencies between switchdev drivers and the bridge, and have
the switchdev mode as completely optional within the driver.

Arnd Bergmann also tried to write a patch which better expressed the
build time dependency for Ethernet switch drivers where the switchdev
support is optional, like cpsw/am65-cpsw, and this made the drivers
follow the bridge (compile as module if the bridge is a module) only if
the optional switchdev support in the driver was enabled in the first
place:
https://patchwork.kernel.org/project/netdevbpf/patch/20210802144813.1152762-1-arnd@kernel.org/

but this still did not solve the fact that cpsw and am65-cpsw now must
be built as modules when the bridge is a module - it just expressed
correctly that optional dependency. But the new behavior is an apparent
regression from Grygorii's perspective.

So to support the use case where the Ethernet driver is built-in,
NET_SWITCHDEV (a bool option) is enabled, and the bridge is a module, we
need a framework that can handle the possible absence of the bridge from
the running system, i.e. runtime bloatware as opposed to build-time
bloatware.

Luckily we already have this framework, since switchdev has been using
it extensively. Events from the bridge side are transmitted to the
driver side using notifier chains - this was originally done so that
unrelated drivers could snoop for events emitted by the bridge towards
ports that are implemented by other drivers (think of a switch driver
with LAG offload that listens for switchdev events on a bonding/team
interface that it offloads).

There are also events which are transmitted from the driver side to the
bridge side, which again are modeled using notifiers.
SWITCHDEV_FDB_ADD_TO_BRIDGE is an example of this, and deals with
notifying the bridge that a MAC address has been dynamically learned.
So there is a precedent we can use for modeling the new framework.

The difference compared to SWITCHDEV_FDB_ADD_TO_BRIDGE is that the work
that the bridge needs to do when a port becomes offloaded is blocking in
its nature: replay VLANs, MDBs etc. The calling context is indeed
blocking (we are under rtnl_mutex), but the existing switchdev
notification chain that the bridge is subscribed to is only the atomic
one. So we need to subscribe the bridge to the blocking switchdev
notification chain too.

This patch:
- keeps the driver-side perception of the switchdev_bridge_port_{,un}offload
  unchanged
- moves the implementation of switchdev_bridge_port_{,un}offload from
  the bridge module into the switchdev module.
- makes everybody that is subscribed to the switchdev blocking notifier
  chain "hear" offload & unoffload events
- makes the bridge driver subscribe and handle those events
- moves the bridge driver's handling of those events into 2 new
  functions called br_switchdev_port_{,un}offload. These functions
  contain in fact the core of the logic that was previously in
  switchdev_bridge_port_{,un}offload, just that now we go through an
  extra indirection layer to reach them.

Unlike all the other switchdev notification structures, the structure
used to carry the bridge port information, struct
switchdev_notifier_brport_info, does not contain a "bool handled".
This is because in the current usage pattern, we always know that a
switchdev bridge port offloading event will be handled by the bridge,
because the switchdev_bridge_port_offload() call was initiated by a
NETDEV_CHANGEUPPER event in the first place, where info->upper_dev is a
bridge. So if the bridge wasn't loaded, then the CHANGEUPPER event
couldn't have happened.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-04 12:35:07 +01:00
Arnd Bergmann
12c3dca25d ARM: ep93xx: remove MaverickCrunch support
The MaverickCrunch support for ep93xx never made it into glibc and
was removed from gcc in its 4.8 release in 2012. It is now one of
the last parts of arch/arm/ that fails to build with the clang
integrated assembler, which is unlikely to ever want to support it.

The two alternatives are to force the use of binutils/gas when
building the crunch support, or to remove it entirely.

According to Hartley Sweeten:

 "Martin Guy did a lot of work trying to get the maverick crunch working
  but I was never able to successfully use it for anything. It "kind"
  of works but depending on the EP93xx silicon revision there are still
  a number of hardware bugs that either give imprecise or garbage results.

  I have no problem with removing the kernel support for the maverick
  crunch."

Unless someone else comes up with a good reason to keep it around,
remove it now. This touches mostly the ep93xx platform, but removes
a bit of code from ARM common ptrace and signal frame handling as well.

If there are remaining users of MaverickCrunch, they can use LTS
kernels for at least another five years before kernel support ends.

Link: https://lore.kernel.org/linux-arm-kernel/20210802141245.1146772-1-arnd@kernel.org/
Link: https://lore.kernel.org/linux-arm-kernel/20210226164345.3889993-1-arnd@kernel.org/
Link: https://github.com/ClangBuiltLinux/linux/issues/1272
Link: https://gcc.gnu.org/legacy-ml/gcc/2008-03/msg01063.html
Cc: "Martin Guy" <martinwguy@martinwguy@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2021-08-04 13:30:04 +02:00
Om Prakash Singh
f62750e691 PCI: tegra194: Cleanup unused code
Remove unused code from function tegra_pcie_config_ep.

Link: https://lore.kernel.org/r/20210623100525.19944-6-omp@nvidia.com
Signed-off-by: Om Prakash Singh <omp@nvidia.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Reviewed-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Vidya Sagar <vidyas@nvidia.com>
2021-08-04 12:28:17 +01:00
Om Prakash Singh
de2bbf2b71 PCI: tegra194: Don't allow suspend when Tegra PCIe is in EP mode
When Tegra PCIe is in endpoint mode it should be available for root port.
PCIe link up by root port fails if it is in suspend state. So, don't allow
Tegra to suspend when endpoint mode is enabled.

Link: https://lore.kernel.org/r/20210623100525.19944-5-omp@nvidia.com
Signed-off-by: Om Prakash Singh <omp@nvidia.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Reviewed-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Vidya Sagar <vidyas@nvidia.com>
2021-08-04 12:28:17 +01:00
Om Prakash Singh
834c5cf2b5 PCI: tegra194: Disable interrupts before entering L2
In suspend_noirq() call if link doesn't goto L2, PERST# is asserted
to bring link to detect state. However, this is causing surprise
link down AER error. Since Kernel is executing noirq suspend calls,
AER interrupt is not processed. PME and AER are shared interrupts
and PCIe subsystem driver enables wake capability of PME irq during
suspend. So this AER will cause suspend failure due to pending
AER interrupt.

After PCIe link is in L2, interrupts are not expected since PCIe
controller will be in reset state. Disable PCIe interrupts before
going to L2 state to avoid pending AER interrupt.

Link: https://lore.kernel.org/r/20210623100525.19944-4-omp@nvidia.com
Signed-off-by: Om Prakash Singh <omp@nvidia.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Reviewed-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Vidya Sagar <vidyas@nvidia.com>
2021-08-04 12:28:17 +01:00
Om Prakash Singh
43537cf7e3 PCI: tegra194: Fix MSI-X programming
Lower order MSI-X address is programmed in MSIX_ADDR_MATCH_HIGH_OFF
DBI register instead of higher order address. This patch fixes this
programming mistake.

Link: https://lore.kernel.org/r/20210623100525.19944-3-omp@nvidia.com
Signed-off-by: Om Prakash Singh <omp@nvidia.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Reviewed-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Vidya Sagar <vidyas@nvidia.com>
2021-08-04 12:28:17 +01:00
Om Prakash Singh
ceb1412c1c PCI: tegra194: Fix handling BME_CHGED event
In tegra_pcie_ep_hard_irq(), APPL_INTR_STATUS_L0 is stored in val and again
APPL_INTR_STATUS_L1_0_0 is also stored in val. So when execution reaches
"if (val & APPL_INTR_STATUS_L0_PCI_CMD_EN_INT)", val is not correct.

Link: https://lore.kernel.org/r/20210623100525.19944-2-omp@nvidia.com
Signed-off-by: Om Prakash Singh <omp@nvidia.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Reviewed-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Vidya Sagar <vidyas@nvidia.com>
2021-08-04 12:28:16 +01:00
Miklos Szeredi
badc741459 fuse: move option checking into fuse_fill_super()
Checking whether the "fd=", "rootmode=", "user_id=" and "group_id=" mount
options are present can be moved from fuse_get_tree() into
fuse_fill_super() where the value of the options are consumed.

This relaxes semantics of reusing a fuse blockdev mount using the device
name.  Before this patch presence of these options were enforced but values
ignored, after this patch these options are completely ignored in this
case.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-08-04 13:22:58 +02:00
Miklos Szeredi
84c215075b fuse: name fs_context consistently
Naming convention under fs/fuse/:

	struct fuse_conn *fc;
	struct fs_context *fsc;

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-08-04 13:22:58 +02:00
Miklos Szeredi
e1e71c1688 fuse: fix use after free in fuse_read_interrupt()
There is a potential race between fuse_read_interrupt() and
fuse_request_end().

TASK1
  in fuse_read_interrupt(): delete req->intr_entry (while holding
  fiq->lock)

TASK2
  in fuse_request_end(): req->intr_entry is empty -> skip fiq->lock
  wake up TASK3

TASK3
  request is freed

TASK1
  in fuse_read_interrupt(): dereference req->in.h.unique ***BAM***


Fix by always grabbing fiq->lock if the request was ever interrupted
(FR_INTERRUPTED set) thereby serializing with concurrent
fuse_read_interrupt() calls.

FR_INTERRUPTED is set before the request is queued on fiq->interrupts.
Dequeing the request is done with list_del_init() but FR_INTERRUPTED is not
cleared in this case.

Reported-by: lijiazi <lijiazi@xiaomi.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-08-04 13:22:58 +02:00
Rob Herring
aeaea8969b PCI: iproc: Fix BCMA probe resource handling
In commit 7ef1c871da ("PCI: iproc: Use
pci_parse_request_of_pci_ranges()"), calling
devm_request_pci_bus_resources() was dropped from the common iProc
probe code, but is still needed for BCMA bus probing. Without it, there
will be lots of warnings like this:

pci 0000:00:00.0: BAR 8: no space for [mem size 0x00c00000]
pci 0000:00:00.0: BAR 8: failed to assign [mem size 0x00c00000]

Add back calling devm_request_pci_bus_resources() and adding the
resources to pci_host_bridge.windows for BCMA bus probe.

Link: https://lore.kernel.org/r/20210803215656.3803204-2-robh@kernel.org
Fixes: 7ef1c871da ("PCI: iproc: Use pci_parse_request_of_pci_ranges()")
Reported-by: Rafał Miłecki <zajec5@gmail.com>
Tested-by: Rafał Miłecki <rafal@milecki.pl>
Signed-off-by: Rob Herring <robh@kernel.org>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Cc: Srinath Mannam <srinath.mannam@broadcom.com>
Cc: Roman Bacik <roman.bacik@broadcom.com>
Cc: Bharat Gooty <bharat.gooty@broadcom.com>
Cc: Abhishek Shah <abhishek.shah@broadcom.com>
Cc: Jitendra Bhivare <jitendra.bhivare@broadcom.com>
Cc: Ray Jui <ray.jui@broadcom.com>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: BCM Kernel Feedback <bcm-kernel-feedback-list@broadcom.com>
Cc: Scott Branden <sbranden@broadcom.com>
Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Cc: "Krzysztof Wilczyński" <kw@linux.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
2021-08-04 12:20:00 +01:00
Rob Herring
d277f6e88c PCI: of: Don't fail devm_pci_alloc_host_bridge() on missing 'ranges'
Commit 669cbc7081 ("PCI: Move DT resource setup into
devm_pci_alloc_host_bridge()") made devm_pci_alloc_host_bridge() fail on
any DT resource parsing errors, but Broadcom iProc uses
devm_pci_alloc_host_bridge() on BCMA bus devices that don't have DT
resources. In particular, there is no 'ranges' property. Fix iProc by
making 'ranges' optional.

If 'ranges' is required by a platform, there's going to be more errors
latter on if it is missing.

Link: https://lore.kernel.org/r/20210803215656.3803204-1-robh@kernel.org
Fixes: 669cbc7081 ("PCI: Move DT resource setup into devm_pci_alloc_host_bridge()")
Reported-by: Rafał Miłecki <zajec5@gmail.com>
Tested-by: Rafał Miłecki <rafal@milecki.pl>
Signed-off-by: Rob Herring <robh@kernel.org>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: Srinath Mannam <srinath.mannam@broadcom.com>
Cc: Roman Bacik <roman.bacik@broadcom.com>
Cc: Bharat Gooty <bharat.gooty@broadcom.com>
Cc: Abhishek Shah <abhishek.shah@broadcom.com>
Cc: Jitendra Bhivare <jitendra.bhivare@broadcom.com>
Cc: Ray Jui <ray.jui@broadcom.com>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: BCM Kernel Feedback <bcm-kernel-feedback-list@broadcom.com>
Cc: Scott Branden <sbranden@broadcom.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
2021-08-04 12:20:00 +01:00
Shaik Sajida Bhanu
67b13f3e22 mmc: sdhci-msm: Update the software timeout value for sdhc
Whenever SDHC run at clock rate 50MHZ or below, the hardware data
timeout value will be 21.47secs, which is approx. 22secs and we have
a current software timeout value as 10secs. We have to set software
timeout value more than the hardware data timeout value to avioid seeing
the below register dumps.

[  332.953670] mmc2: Timeout waiting for hardware interrupt.
[  332.959608] mmc2: sdhci: ============ SDHCI REGISTER DUMP ===========
[  332.966450] mmc2: sdhci: Sys addr:  0x00000000 | Version:  0x00007202
[  332.973256] mmc2: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000001
[  332.980054] mmc2: sdhci: Argument:  0x00000000 | Trn mode: 0x00000027
[  332.986864] mmc2: sdhci: Present:   0x01f801f6 | Host ctl: 0x0000001f
[  332.993671] mmc2: sdhci: Power:     0x00000001 | Blk gap:  0x00000000
[  333.000583] mmc2: sdhci: Wake-up:   0x00000000 | Clock:    0x00000007
[  333.007386] mmc2: sdhci: Timeout:   0x0000000e | Int stat: 0x00000000
[  333.014182] mmc2: sdhci: Int enab:  0x03ff100b | Sig enab: 0x03ff100b
[  333.020976] mmc2: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000
[  333.027771] mmc2: sdhci: Caps:      0x322dc8b2 | Caps_1:   0x0000808f
[  333.034561] mmc2: sdhci: Cmd:       0x0000183a | Max curr: 0x00000000
[  333.041359] mmc2: sdhci: Resp[0]:   0x00000900 | Resp[1]:  0x00000000
[  333.048157] mmc2: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
[  333.054945] mmc2: sdhci: Host ctl2: 0x00000000
[  333.059657] mmc2: sdhci: ADMA Err:  0x00000000 | ADMA Ptr:
0x0000000ffffff218
[  333.067178] mmc2: sdhci_msm: ----------- VENDOR REGISTER DUMP
-----------
[  333.074343] mmc2: sdhci_msm: DLL sts: 0x00000000 | DLL cfg:
0x6000642c | DLL cfg2: 0x0020a000
[  333.083417] mmc2: sdhci_msm: DLL cfg3: 0x00000000 | DLL usr ctl:
0x00000000 | DDR cfg: 0x80040873
[  333.092850] mmc2: sdhci_msm: Vndr func: 0x00008a9c | Vndr func2 :
0xf88218a8 Vndr func3: 0x02626040
[  333.102371] mmc2: sdhci: ============================================

So, set software timeout value more than hardware timeout value.

Signed-off-by: Shaik Sajida Bhanu <sbhanu@codeaurora.org>
Acked-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/1626435974-14462-1-git-send-email-sbhanu@codeaurora.org
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2021-08-04 13:17:21 +02:00
Christophe Kerello
d8e193f13b mmc: mmci: stm32: Check when the voltage switch procedure should be done
If the card has not been power cycled, it may still be using 1.8V
signaling. This situation is detected in mmc_sd_init_card function and
should be handled in mmci stm32 variant.  The host->pwr_reg variable is
also correctly protected with spin locks.

Fixes: 94b94a93e3 ("mmc: mmci_sdmmc: Implement signal voltage callbacks")
Signed-off-by: Christophe Kerello <christophe.kerello@foss.st.com>
Signed-off-by: Yann Gautier <yann.gautier@foss.st.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210701143353.13188-1-yann.gautier@foss.st.com
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2021-08-04 12:46:02 +02:00
Vincent Whitchurch
25f8203b4b mmc: dw_mmc: Fix hang on data CRC error
When a Data CRC interrupt is received, the driver disables the DMA, then
sends the stop/abort command and then waits for Data Transfer Over.

However, sometimes, when a data CRC error is received in the middle of a
multi-block write transfer, the Data Transfer Over interrupt is never
received, and the driver hangs and never completes the request.

The driver sets the BMOD.SWR bit (SDMMC_IDMAC_SWRESET) when stopping the
DMA, but according to the manual CMD.STOP_ABORT_CMD should be programmed
"before assertion of SWR".  Do these operations in the recommended
order.  With this change the Data Transfer Over is always received
correctly in my tests.

Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Reviewed-by: Jaehoon Chung <jh80.chung@samsung.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210630102232.16011-1-vincent.whitchurch@axis.com
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2021-08-04 12:41:20 +02:00
David S. Miller
9c0532f9cc Merge tag 'linux-can-next-for-5.15-20210804' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next
Marc Kleine-Budde says:

====================
pull-request: can-next 2021-08-04

this is a pull request of 5 patches for net-next/master.

The first patch is by me and fixes a typo in a comment in the CAN
J1939 protocol.

The next 2 patches are by Oleksij Rempel and update the CAN J1939
protocol to send RX status updates via the error queue mechanism.

The next patch is by me and adds a missing variable initialization to
the flexcan driver (the problem was introduced in the current net-next
cycle).

The last patch is by Aswath Govindraju and adds power-domains to the
Bosch m_can DT binding documentation.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-04 11:30:09 +01:00
Linus Walleij
47adef20e6 pata: ixp4xx: Rewrite to use device tree
This rewrites the IXP4xx CF (IDE) libata driver to use the
device tree exclusively to look up its resources:

- Probe exclusively from the device tree and look up all
  resources from there.
- Allocate a local state container with devres and pass
  this around in .private_data.
- Initialize with struct ata_port_info.
- Use the .set_piomode() callback instead of the much
  wider .set_mode(), we only support PIO after all.
- Bump driver version number from 0.2 to 1.0 to reflect this
  wider change.
- Get a handle on the expansion bus syscon regmap to alter
  the timings on the chip select.
- Put in the more elaborate timing adjustment code for PIO0
  to PIO4 in 8 and 16bit mode from the downstream OpenWrt
  patch.

The board file initialization path and platform data include
is dropped because the board files will be deleted at the same
time as this patch is merged.

The platform data file is not deleted right now so as not to
conflict with the removal of board files.

Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2021-08-04 12:20:44 +02:00
Linus Walleij
be470496ee pata: ixp4xx: Add DT bindings
This adds device tree bindings for the Intel IXP4xx compact flash card
interface.

Cc: devicetree@vger.kernel.org
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2021-08-04 12:20:37 +02:00
Linus Walleij
8e3d25a623 pata: ixp4xx: Refer to cmd and ctl rather than csN
The two "cs0" and "cs1" are "chip selects" but on some
platforms such as GW2358 they are actually both in CS3
making this terminology very confusing. Call the
addresses "cmd" and "ctl" after function instead.

Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2021-08-04 12:20:32 +02:00
Linus Walleij
d2b507acc6 pata: ixp4xx: Use IS_ENABLED() to determine endianness
Instead of an ARM-specific ifdef, use the global CPU config
and if (IS_ENABLED()).

Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2021-08-04 12:20:27 +02:00
Linus Walleij
f62b38965a pata: ixp4xx: Use local dev variable
Let's simplify all &pdev->dev references by creating a
local struct device *dev variable.

Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2021-08-04 12:20:21 +02:00
Linus Walleij
21a0a29d16 watchdog: ixp4xx: Rewrite driver to use core
This rewrites the IXP4xx watchdog driver as follows:

- Spawn the watchdog driver as a platform device from the timer
  driver. It's one device in the hardware, and the fact that
  Linux splits the handling into two different devices is
  a Linux pecularity, and thus it becomes a Linux pecularity
  to spawn a separate watchdog driver.

- Spawn the watchdog driver from the timer driver at probe().
  This is well after the timer driver as actually registered and
  started and we know the register base is available.

- Instead of looping back callbacks to the timer drivers for all
  watchdog calls, pass the register base to the watchdog driver
  and manage the registers there. The two drivers aren't even
  interested in the same register so the spinlock is totally
  surplus, delete it.

- Replace pretty much all of the content in the watchdog driver
  with a simple, modern watchdog driver utilizing the watchdog
  core instead of registering its own misc device and ioctl()
  handling.

- Drop module parameters as the same already exist in the
  watchdog core.

What remains is a slim elegant (IMO) watchdog driver using the
watchdog core, spawning from device tree or boardfile alike.

Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2021-08-04 12:20:13 +02:00
Linus Walleij
1c953bda90 bus: ixp4xx: Add a driver for IXP4xx expansion bus
The Intel IXP4xx SoCs have an expansion bus that is usually just
used for flash memory and configured by the boot loaders and can
be accessed using the "simple-bus".

However some devices need more elaborate configuration and then we
need to provide a proper 3-unit address space indicating chip
select for each device and provide timing and similar information.

Cc: Marc Zyngier <maz@kernel.org>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2021-08-04 12:20:06 +02:00
Linus Walleij
3fbcc6763b bus: ixp4xx: Add DT bindings for the IXP4xx expansion bus
This adds device tree bindings for the IXP4xx expansion bus controller.

Cc: Marc Zyngier <maz@kernel.org>
Cc: devicetree@vger.kernel.org
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2021-08-04 12:19:58 +02:00
Aswath Govindraju
d85165b238 dt-bindings: net: can: Document power-domains property
Document power-domains property for adding the Power domain provider.

Link: https://lore.kernel.org/r/20210802091822.16407-1-a-govindraju@ti.com
Signed-off-by: Aswath Govindraju <a-govindraju@ti.com>
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2021-08-04 12:11:57 +02:00
Marc Kleine-Budde
3362666972 can: flexcan: flexcan_clks_enable(): add missing variable initialization
This patch adds the missing initialization of the "err" variable in
the flexcan_clks_enable() function.

Fixes: d9cead75b1 ("can: flexcan: add mcf5441x support")
Link: https://lore.kernel.org/r/20210728075428.1493568-1-mkl@pengutronix.de
Reported-by: kernel test robot <lkp@intel.com>
Cc: Angelo Dureghello <angelo@kernel-space.org>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2021-08-04 12:11:56 +02:00
Oleksij Rempel
5b9272e93f can: j1939: extend UAPI to notify about RX status
To be able to create applications with user friendly feedback, we need be
able to provide receive status information.

Typical ETP transfer may take seconds or even hours. To give user some
clue or show a progress bar, the stack should push status updates.
Same as for the TX information, the socket error queue will be used with
following new signals:
- J1939_EE_INFO_RX_RTS   - received and accepted request to send signal.
- J1939_EE_INFO_RX_DPO   - received data package offset signal
- J1939_EE_INFO_RX_ABORT - RX session was aborted

Instead of completion signal, user will get data package.
To activate this signals, application should set
SOF_TIMESTAMPING_RX_SOFTWARE to the SO_TIMESTAMPING socket option. This
will avoid unpredictable application behavior for the old software.

Link: https://lore.kernel.org/r/20210707094854.30781-3-o.rempel@pengutronix.de
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2021-08-04 12:11:52 +02:00
Oleksij Rempel
cd85d3aed5 can: j1939: rename J1939_ERRQUEUE_* to J1939_ERRQUEUE_TX_*
Prepare the world for the J1939_ERRQUEUE_RX_ version

Link: https://lore.kernel.org/r/20210707094854.30781-2-o.rempel@pengutronix.de
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2021-08-04 12:11:48 +02:00
Sean Christopherson
179c6c27bf KVM: SVM: Fix off-by-one indexing when nullifying last used SEV VMCB
Use the raw ASID, not ASID-1, when nullifying the last used VMCB when
freeing an SEV ASID.  The consumer, pre_sev_run(), indexes the array by
the raw ASID, thus KVM could get a false negative when checking for a
different VMCB if KVM manages to reallocate the same ASID+VMCB combo for
a new VM.

Note, this cannot cause a functional issue _in the current code_, as
pre_sev_run() also checks which pCPU last did VMRUN for the vCPU, and
last_vmentry_cpu is initialized to -1 during vCPU creation, i.e. is
guaranteed to mismatch on the first VMRUN.  However, prior to commit
8a14fe4f0c ("kvm: x86: Move last_cpu into kvm_vcpu_arch as
last_vmentry_cpu"), SVM tracked pCPU on its own and zero-initialized the
last_cpu variable.  Thus it's theoretically possible that older versions
of KVM could miss a TLB flush if the first VMRUN is on pCPU0 and the ASID
and VMCB exactly match those of a prior VM.

Fixes: 70cd94e60c ("KVM: SVM: VMRUN should use associated ASID when SEV is enabled")
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-04 06:02:09 -04:00
Paolo Bonzini
85cd39af14 KVM: Do not leak memory for duplicate debugfs directories
KVM creates a debugfs directory for each VM in order to store statistics
about the virtual machine.  The directory name is built from the process
pid and a VM fd.  While generally unique, it is possible to keep a
file descriptor alive in a way that causes duplicate directories, which
manifests as these messages:

  [  471.846235] debugfs: Directory '20245-4' with parent 'kvm' already present!

Even though this should not happen in practice, it is more or less
expected in the case of KVM for testcases that call KVM_CREATE_VM and
close the resulting file descriptor repeatedly and in parallel.

When this happens, debugfs_create_dir() returns an error but
kvm_create_vm_debugfs() goes on to allocate stat data structs which are
later leaked.  The slow memory leak was spotted by syzkaller, where it
caused OOM reports.

Since the issue only affects debugfs, do a lookup before calling
debugfs_create_dir, so that the message is downgraded and rate-limited.
While at it, ensure kvm->debugfs_dentry is NULL rather than an error
if it is not created.  This fixes kvm_destroy_vm_debugfs, which was not
checking IS_ERR_OR_NULL correctly.

Cc: stable@vger.kernel.org
Fixes: 536a6f88c4 ("KVM: Create debugfs dir and stat files for each VM")
Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-04 06:02:03 -04:00
Like Xu
e79f49c37c KVM: x86/pmu: Introduce pmc->is_paused to reduce the call time of perf interfaces
Based on our observations, after any vm-exit associated with vPMU, there
are at least two or more perf interfaces to be called for guest counter
emulation, such as perf_event_{pause, read_value, period}(), and each one
will {lock, unlock} the same perf_event_ctx. The frequency of calls becomes
more severe when guest use counters in a multiplexed manner.

Holding a lock once and completing the KVM request operations in the perf
context would introduce a set of impractical new interfaces. So we can
further optimize the vPMU implementation by avoiding repeated calls to
these interfaces in the KVM context for at least one pattern:

After we call perf_event_pause() once, the event will be disabled and its
internal count will be reset to 0. So there is no need to pause it again
or read its value. Once the event is paused, event period will not be
updated until the next time it's resumed or reprogrammed. And there is
also no need to call perf_event_period twice for a non-running counter,
considering the perf_event for a running counter is never paused.

Based on this implementation, for the following common usage of
sampling 4 events using perf on a 4u8g guest:

  echo 0 > /proc/sys/kernel/watchdog
  echo 25 > /proc/sys/kernel/perf_cpu_time_max_percent
  echo 10000 > /proc/sys/kernel/perf_event_max_sample_rate
  echo 0 > /proc/sys/kernel/perf_cpu_time_max_percent
  for i in `seq 1 1 10`
  do
  taskset -c 0 perf record \
  -e cpu-cycles -e instructions -e branch-instructions -e cache-misses \
  /root/br_instr a
  done

the average latency of the guest NMI handler is reduced from
37646.7 ns to 32929.3 ns (~1.14x speed up) on the Intel ICX server.
Also, in addition to collecting more samples, no loss of sampling
accuracy was observed compared to before the optimization.

Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20210728120705.6855-1-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
2021-08-04 05:55:56 -04:00
Peter Xu
a75b540451 KVM: X86: Optimize zapping rmap
Using rmap_get_first() and rmap_remove() for zapping a huge rmap list could be
slow.  The easy way is to travers the rmap list, collecting the a/d bits and
free the slots along the way.

Provide a pte_list_destroy() and do exactly that.

Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210730220605.26377-1-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-04 05:55:56 -04:00
Peter Xu
13236e25eb KVM: X86: Optimize pte_list_desc with per-array counter
Add a counter field into pte_list_desc, so as to simplify the add/remove/loop
logic.  E.g., we don't need to loop over the array any more for most reasons.

This will make more sense after we've switched the array size to be larger
otherwise the counter will be a waste.

Initially I wanted to store a tail pointer at the head of the array list so we
don't need to traverse the list at least for pushing new ones (if without the
counter we traverse both the list and the array).  However that'll need
slightly more change without a huge lot benefit, e.g., after we grow entry
numbers per array the list traversing is not so expensive.

So let's be simple but still try to get as much benefit as we can with just
these extra few lines of changes (not to mention the code looks easier too
without looping over arrays).

I used the same a test case to fork 500 child and recycle them ("./rmap_fork
500" [1]), this patch further speeds up the total fork time of about 4%, which
is a total of 33% of vanilla kernel:

        Vanilla:      473.90 (+-5.93%)
        3->15 slots:  366.10 (+-4.94%)
        Add counter:  351.00 (+-3.70%)

[1] 825436f825

Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210730220602.26327-1-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-04 05:55:56 -04:00
Peter Xu
dc1cff9691 KVM: X86: MMU: Tune PTE_LIST_EXT to be bigger
Currently rmap array element only contains 3 entries.  However for EPT=N there
could have a lot of guest pages that got tens of even hundreds of rmap entry.

A normal distribution of a 6G guest (even if idle) shows this with rmap count
statistics:

Rmap_Count:     0       1       2-3     4-7     8-15    16-31   32-63   64-127  128-255 256-511 512-1023
Level=4K:       3089171 49005   14016   1363    235     212     15      7       0       0       0
Level=2M:       5951    227     0       0       0       0       0       0       0       0       0
Level=1G:       32      0       0       0       0       0       0       0       0       0       0

If we do some more fork some pages will grow even larger rmap counts.

This patch makes PTE_LIST_EXT bigger so it'll be more efficient for the general
use case of EPT=N as we do list reference less and the loops over PTE_LIST_EXT
will be slightly more efficient; but still not too large so less waste when
array not full.

It should not affecting EPT=Y since EPT normally only has zero or one rmap
entry for each page, so no array is even allocated.

With a test case to fork 500 child and recycle them ("./rmap_fork 500" [1]),
this patch speeds up fork time of about 29%.

    Before: 473.90 (+-5.93%)
    After:  366.10 (+-4.94%)

[1] 825436f825

Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210730220455.26054-6-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-04 05:55:56 -04:00