linux

Author	SHA1	Message	Date
Dongliang Mu	6fa54bc713	media: em28xx-input: fix refcount bug in em28xx_usb_disconnect If em28xx_ir_init fails, it would decrease the refcount of dev. However, in the em28xx_ir_fini, when ir is NULL, it goes to ref_put and decrease the refcount of dev. This will lead to a refcount bug. Fix this bug by removing the kref_put in the error handling code of em28xx_ir_init. refcount_t: underflow; use-after-free. WARNING: CPU: 0 PID: 7 at lib/refcount.c:28 refcount_warn_saturate+0x18e/0x1a0 lib/refcount.c:28 Modules linked in: CPU: 0 PID: 7 Comm: kworker/0:1 Not tainted 5.13.0 #3 Workqueue: usb_hub_wq hub_event RIP: 0010:refcount_warn_saturate+0x18e/0x1a0 lib/refcount.c:28 Call Trace: kref_put.constprop.0+0x60/0x85 include/linux/kref.h:69 em28xx_usb_disconnect.cold+0xd7/0xdc drivers/media/usb/em28xx/em28xx-cards.c:4150 usb_unbind_interface+0xbf/0x3a0 drivers/usb/core/driver.c:458 __device_release_driver drivers/base/dd.c:1201 [inline] device_release_driver_internal+0x22a/0x230 drivers/base/dd.c:1232 bus_remove_device+0x108/0x160 drivers/base/bus.c:529 device_del+0x1fe/0x510 drivers/base/core.c:3540 usb_disable_device+0xd1/0x1d0 drivers/usb/core/message.c:1419 usb_disconnect+0x109/0x330 drivers/usb/core/hub.c:2221 hub_port_connect drivers/usb/core/hub.c:5151 [inline] hub_port_connect_change drivers/usb/core/hub.c:5440 [inline] port_event drivers/usb/core/hub.c:5586 [inline] hub_event+0xf81/0x1d40 drivers/usb/core/hub.c:5668 process_one_work+0x2c9/0x610 kernel/workqueue.c:2276 process_scheduled_works kernel/workqueue.c:2338 [inline] worker_thread+0x333/0x5b0 kernel/workqueue.c:2424 kthread+0x188/0x1d0 kernel/kthread.c:319 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 Reported-by: Dongliang Mu <mudongliangabcd@gmail.com> Fixes: `ac56886371` ("media: em28xx: Fix possible memory leak of em28xx struct") Signed-off-by: Dongliang Mu <mudongliangabcd@gmail.com> Signed-off-by: Sean Young <sean@mess.org> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>	2021-08-04 14:43:49 +02:00
Viktor Prutyanov	49be1c78d5	media: rc: introduce Meson IR TX driver This patch adds the driver for Amlogic Meson IR transmitter. Some Amlogic SoCs such as A311D and T950D4 have IR transmitter (also called blaster) controller onboard. It is capable of sending IR signals with arbitrary carrier frequency and duty cycle. The driver supports 2 modulation clock sources: - xtal3 clock (xtal divided by 3) - 1us clock Signed-off-by: Viktor Prutyanov <viktor.prutyanov@phystech.edu> Signed-off-by: Sean Young <sean@mess.org> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>	2021-08-04 14:43:49 +02:00
Viktor Prutyanov	e9f504f7b5	media: rc: meson-ir-tx: document device tree bindings This patch adds binding documentation for the IR transmitter available in Amlogic Meson SoCs. Signed-off-by: Viktor Prutyanov <viktor.prutyanov@phystech.edu> Acked-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Reviewed-by: Rob Herring <robh@kernel.org> Signed-off-by: Sean Young <sean@mess.org> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>	2021-08-04 14:43:49 +02:00
Arnd Bergmann	8e816b9915	Merge tag 'at91-dt-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/at91/linux into arm/dt AT91 dt for 5.15: - add sama7g5 SoC and associated evaluation kit, the sama7g5-ek - adaptation of some DT for sama5d27 som1 ek, sama5d4 xplained and sama5d2 icp boards - fixes to gpio and shutdown controller nodes for all boards * tag 'at91-dt-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/at91/linux: ARM: dts: at91: use the right property for shutdown controller ARM: dts: at91: sama5d2_icp: enable digital filter for I2C nodes ARM: dts: at91: sama5d4_xplained: change the key code of the gpio key ARM: dts: at91: add conflict note for d3 ARM: dts: at91: add pinctrl-{names, 0} for all gpios ARM: dts: at91: sama5d27_som1_ek: enable ADC node ARM: dts: at91: sama5d4_xplained: Remove spi0 node dt-bindings: atmel-sysreg: add bindings for sama7g5 ARM: dts: at91: add sama7g5 SoC DT and sama7g5-ek dt-bindings: ARM: at91: document sama7g5ek board Link: https://lore.kernel.org/r/20210804085000.13233-1-nicolas.ferre@microchip.com Signed-off-by: Arnd Bergmann <arnd@arndb.de>	2021-08-04 14:31:32 +02:00
Arnd Bergmann	72ee3b4dc2	Merge tag 'ux500-dts-v5.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-nomadik into arm/dt Ux500 Device Tree updates for the v5.15 kernel cycle: - New device trees for these mobile phones: - Samsung Gavini - Samsung Codina - Samsung Kyle - Flag eMMC cards as non-SD non-SDIO to save time - Link USB PHY to USB controller in the device tree - Fix up the operating points to the actual clock frequencies * tag 'ux500-dts-v5.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-nomadik: ARM: dts: ux500: Adjust operating points to reality ARM: dts: ux500: Add a device tree for Kyle ARM: dts: ux500: Add devicetree for Codina ARM: dts: ux500: ab8500: Link USB PHY to USB controller node ARM: dts: ux500: Flag eMMCs as non-SDIO/SD ARM: dts: ux500: Add device tree for Samsung Gavini Link: https://lore.kernel.org/r/CACRpkdbjBv5ywZZD8rK07d5sLcHsG8o4iYD-3jHO=HLg6-nKnA@mail.gmail.com Signed-off-by: Arnd Bergmann <arnd@arndb.de>	2021-08-04 14:30:29 +02:00
Takashi Iwai	f2553d4678	ASoC: amd: vangogh: Drop superfluous mmap callback The mmap callback of vangogh driver just calls the default mmap handler, and it's superfluous, as the PCM core would call it if not set. Let's drop the superfluous mmap callback. Fixes: `361414dc1f` ("ASoC: amd: add vangogh i2s dma driver pm ops") Signed-off-by: Takashi Iwai <tiwai@suse.de> Link: https://lore.kernel.org/r/20210804075223.9823-1-tiwai@suse.de Signed-off-by: Mark Brown <broonie@kernel.org>	2021-08-04 13:20:14 +01:00
Marc Zyngier	47e6223c84	KVM: arm64: Unregister HYP sections from kmemleak in protected mode Booting a KVM host in protected mode with kmemleak quickly results in a pretty bad crash, as kmemleak doesn't know that the HYP sections have been taken away. This is specially true for the BSS section, which is part of the kernel BSS section and registered at boot time by kmemleak itself. Unregister the HYP part of the BSS before making that section HYP-private. The rest of the HYP-specific data is obtained via the page allocator or lives in other sections, none of which is subjected to kmemleak. Fixes: `90134ac9ca` ("KVM: arm64: Protect the .hyp sections from the host") Reviewed-by: Quentin Perret <qperret@google.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Cc: stable@vger.kernel.org # 5.13 Link: https://lore.kernel.org/r/20210802123830.2195174-3-maz@kernel.org	2021-08-04 13:10:47 +01:00
Marc Zyngier	eb48d154cd	arm64: Move .hyp.rodata outside of the _sdata.._edata range The HYP rodata section is currently lumped together with the BSS, which isn't exactly what is expected (it gets registered with kmemleak, for example). Move it away so that it is actually marked RO. As an added benefit, it isn't registered with kmemleak anymore. Fixes: `380e18ade4` ("KVM: arm64: Introduce a BSS section for use at Hyp") Suggested-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Cc: stable@vger.kernel.org #5.13 Acked-by: Catalin Marinas <catalin.marinas@arm.com> Link: https://lore.kernel.org/r/20210802123830.2195174-2-maz@kernel.org	2021-08-04 13:09:20 +01:00
Praveen Kumar	e5d9b714fe	x86/hyperv: fix root partition faults when writing to VP assist page MSR For root partition the VP assist pages are pre-determined by the hypervisor. The root kernel is not allowed to change them to different locations. And thus, we are getting below stack as in current implementation root is trying to perform write to specific MSR. [ 2.778197] unchecked MSR access error: WRMSR to 0x40000073 (tried to write 0x0000000145ac5001) at rIP: 0xffffffff810c1084 (native_write_msr+0x4/0x30) [ 2.784867] Call Trace: [ 2.791507] hv_cpu_init+0xf1/0x1c0 [ 2.798144] ? hyperv_report_panic+0xd0/0xd0 [ 2.804806] cpuhp_invoke_callback+0x11a/0x440 [ 2.811465] ? hv_resume+0x90/0x90 [ 2.818137] cpuhp_issue_call+0x126/0x130 [ 2.824782] __cpuhp_setup_state_cpuslocked+0x102/0x2b0 [ 2.831427] ? hyperv_report_panic+0xd0/0xd0 [ 2.838075] ? hyperv_report_panic+0xd0/0xd0 [ 2.844723] ? hv_resume+0x90/0x90 [ 2.851375] __cpuhp_setup_state+0x3d/0x90 [ 2.858030] hyperv_init+0x14e/0x410 [ 2.864689] ? enable_IR_x2apic+0x190/0x1a0 [ 2.871349] apic_intr_mode_init+0x8b/0x100 [ 2.878017] x86_late_time_init+0x20/0x30 [ 2.884675] start_kernel+0x459/0x4fb [ 2.891329] secondary_startup_64_no_verify+0xb0/0xbb Since the hypervisor already provides the VP assist pages for root partition, we need to memremap the memory from hypervisor for root kernel to use. The mapping is done in hv_cpu_init during bringup and is unmapped in hv_cpu_die during teardown. Signed-off-by: Praveen Kumar <kumarpraveen@linux.microsoft.com> Reviewed-by: Sunil Muthuswamy <sunilmut@microsoft.com> Link: https://lore.kernel.org/r/20210731120519.17154-1-kumarpraveen@linux.microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org>	2021-08-04 11:56:53 +00:00
Nick Richardson	c2eecaa193	pktgen: Remove redundant clone_skb override When the netif_receive xmit_mode is set, a line is supposed to set clone_skb to a default 0 value. This line is made redundant due to a preceding line that checks if clone_skb is more than zero and returns -ENOTSUPP. Overriding clone_skb to 0 does not make any difference to the behavior because if it was positive we return error. So it can be either 0 or negative, and in both cases the behavior is the same. Remove redundant line that sets clone_skb to zero. Signed-off-by: Nick Richardson <richardsonnick@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-04 12:54:09 +01:00
Jonathan Lemon	773bda9649	ptp: ocp: Expose various resources on the timecard. The OpenCompute timecard driver has additional functionality besides a clock. Make the following resources available: - The external timestamp channels (ts0/ts1) - devlink support for flashing and health reporting - GPS and MAC serial ports - board serial number (obtained from i2c device) Also add watchdog functionality for when GNSS goes into holdover. The resources are collected under a timecard class directory: [jlemon@timecard ~]$ ls -g /sys/class/timecard/ocp1/ total 0 -r--r--r--. 1 root 4096 Aug 3 19:49 available_clock_sources -rw-r--r--. 1 root 4096 Aug 3 19:49 clock_source lrwxrwxrwx. 1 root 0 Aug 3 19:49 device -> ../../../0000:04:00.0/ -r--r--r--. 1 root 4096 Aug 3 19:49 gps_sync lrwxrwxrwx. 1 root 0 Aug 3 19:49 i2c -> ../../xiic-i2c.1024/i2c-2/ drwxr-xr-x. 2 root 0 Aug 3 19:49 power/ lrwxrwxrwx. 1 root 0 Aug 3 19:49 pps -> ../../../../../virtual/pps/pps1/ lrwxrwxrwx. 1 root 0 Aug 3 19:49 ptp -> ../../ptp/ptp2/ -r--r--r--. 1 root 4096 Aug 3 19:49 serialnum lrwxrwxrwx. 1 root 0 Aug 3 19:49 subsystem -> ../../../../../../class/timecard/ lrwxrwxrwx. 1 root 0 Aug 3 19:49 ttyGPS -> ../../tty/ttyS7/ lrwxrwxrwx. 1 root 0 Aug 3 19:49 ttyMAC -> ../../tty/ttyS8/ -rw-r--r--. 1 root 4096 Aug 3 19:39 uevent The labeling is needed at the minimum, in order to tell the serial devices apart. Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-04 12:52:50 +01:00
Pavel Tikhomirov	04190bf894	sock: allow reading and changing sk_userlocks with setsockopt SOCK_SNDBUF_LOCK and SOCK_RCVBUF_LOCK flags disable automatic socket buffers adjustment done by kernel (see tcp_fixup_rcvbuf() and tcp_sndbuf_expand()). If we've just created a new socket this adjustment is enabled on it, but if one changes the socket buffer size by setsockopt(SO_{SND,RCV}BUF) it becomes disabled. CRIU needs to call setsockopt(SO_{SND,RCV}BUF) on each socket on restore as it first needs to increase buffer sizes for packet queues restore and second it needs to restore back original buffer sizes. So after CRIU restore all sockets become non-auto-adjustable, which can decrease network performance of restored applications significantly. CRIU need to be able to restore sockets with enabled/disabled adjustment to the same state it was before dump, so let's add special setsockopt for it. Let's also export SOCK_SNDBUF_LOCK and SOCK_RCVBUF_LOCK flags to uAPI so that using these interface one can reenable automatic socket buffer adjustment on their sockets. Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-04 12:52:03 +01:00
Ivan T. Ivanov	6b67d4d63e	net: usb: lan78xx: don't modify phy_device state concurrently Currently phy_device state could be left in inconsistent state shown by following alert message[1]. This is because phy_read_status could be called concurrently from lan78xx_delayedwork, phy_state_machine and __ethtool_get_link. Fix this by making sure that phy_device state is updated atomically. [1] lan78xx 1-1.1.1:1.0 eth0: No phy led trigger registered for speed(-1) Signed-off-by: Ivan T. Ivanov <iivanov@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-04 12:51:14 +01:00
Jakub Kicinski	396492b4c5	docs: networking: netdevsim rules There are aspects of netdevsim which are commonly misunderstood and pointed out in review. Cong suggest we document them. Suggested-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-04 12:43:27 +01:00
Peilin Ye	625af9f029	tc-testing: Add control-plane selftests for sch_mq Recently we added multi-queue support to netdevsim in commit `d4861fc6be` ("netdevsim: Add multi-queue support"); add a few control-plane selftests for sch_mq using this new feature. Use nsPlugin.py to avoid network interface name collisions. Reviewed-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-04 12:42:27 +01:00
Vladimir Oltean	a54182b2a5	Revert "net: build all switchdev drivers as modules when the bridge is a module" This reverts commit `b0e8181762`. Explicit driver dependency on the bridge is no longer needed since switchdev_bridge_port_{,un}offload() is no longer implemented by the bridge driver but by switchdev. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Tested-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-04 12:35:07 +01:00
Vladimir Oltean	957e2235e5	net: make switchdev_bridge_port_{,unoffload} loosely coupled with the bridge With the introduction of explicit offloading API in switchdev in commit `2f5dc00f7a` ("net: bridge: switchdev: let drivers inform which bridge ports are offloaded"), we started having Ethernet switch drivers calling directly into a function exported by net/bridge/br_switchdev.c, which is a function exported by the bridge driver. This means that drivers that did not have an explicit dependency on the bridge before, like cpsw and am65-cpsw, now do - otherwise it is not possible to call a symbol exported by a driver that can be built as module unless you are a module too. There was an attempt to solve the dependency issue in the form of commit `b0e8181762` ("net: build all switchdev drivers as modules when the bridge is a module"). Grygorii Strashko, however, says about it: \| In my opinion, the problem is a bit bigger here than just fixing the \| build :( \| \| In case, of ^cpsw the switchdev mode is kinda optional and in many \| cases (especially for testing purposes, NFS) the multi-mac mode is \| still preferable mode. \| \| There were no such tight dependency between switchdev drivers and \| bridge core before and switchdev serviced as independent, notification \| based layer between them, so ^cpsw still can be "Y" and bridge can be \| "M". Now for mostly every kernel build configuration the CONFIG_BRIDGE \| will need to be set as "Y", or we will have to update drivers to \| support build with BRIDGE=n and maintain separate builds for \| networking vs non-networking testing. But is this enough? Wouldn't \| it cause 'chain reaction' required to add more and more "Y" options \| (like CONFIG_VLAN_8021Q)? \| \| PS. Just to be sure we on the same page - ARM builds will be forced \| (with this patch) to have CONFIG_TI_CPSW_SWITCHDEV=m and so all our \| automation testing will just fail with omap2plus_defconfig. In the light of this, it would be desirable for some configurations to avoid dependencies between switchdev drivers and the bridge, and have the switchdev mode as completely optional within the driver. Arnd Bergmann also tried to write a patch which better expressed the build time dependency for Ethernet switch drivers where the switchdev support is optional, like cpsw/am65-cpsw, and this made the drivers follow the bridge (compile as module if the bridge is a module) only if the optional switchdev support in the driver was enabled in the first place: https://patchwork.kernel.org/project/netdevbpf/patch/20210802144813.1152762-1-arnd@kernel.org/ but this still did not solve the fact that cpsw and am65-cpsw now must be built as modules when the bridge is a module - it just expressed correctly that optional dependency. But the new behavior is an apparent regression from Grygorii's perspective. So to support the use case where the Ethernet driver is built-in, NET_SWITCHDEV (a bool option) is enabled, and the bridge is a module, we need a framework that can handle the possible absence of the bridge from the running system, i.e. runtime bloatware as opposed to build-time bloatware. Luckily we already have this framework, since switchdev has been using it extensively. Events from the bridge side are transmitted to the driver side using notifier chains - this was originally done so that unrelated drivers could snoop for events emitted by the bridge towards ports that are implemented by other drivers (think of a switch driver with LAG offload that listens for switchdev events on a bonding/team interface that it offloads). There are also events which are transmitted from the driver side to the bridge side, which again are modeled using notifiers. SWITCHDEV_FDB_ADD_TO_BRIDGE is an example of this, and deals with notifying the bridge that a MAC address has been dynamically learned. So there is a precedent we can use for modeling the new framework. The difference compared to SWITCHDEV_FDB_ADD_TO_BRIDGE is that the work that the bridge needs to do when a port becomes offloaded is blocking in its nature: replay VLANs, MDBs etc. The calling context is indeed blocking (we are under rtnl_mutex), but the existing switchdev notification chain that the bridge is subscribed to is only the atomic one. So we need to subscribe the bridge to the blocking switchdev notification chain too. This patch: - keeps the driver-side perception of the switchdev_bridge_port_{,un}offload unchanged - moves the implementation of switchdev_bridge_port_{,un}offload from the bridge module into the switchdev module. - makes everybody that is subscribed to the switchdev blocking notifier chain "hear" offload & unoffload events - makes the bridge driver subscribe and handle those events - moves the bridge driver's handling of those events into 2 new functions called br_switchdev_port_{,un}offload. These functions contain in fact the core of the logic that was previously in switchdev_bridge_port_{,un}offload, just that now we go through an extra indirection layer to reach them. Unlike all the other switchdev notification structures, the structure used to carry the bridge port information, struct switchdev_notifier_brport_info, does not contain a "bool handled". This is because in the current usage pattern, we always know that a switchdev bridge port offloading event will be handled by the bridge, because the switchdev_bridge_port_offload() call was initiated by a NETDEV_CHANGEUPPER event in the first place, where info->upper_dev is a bridge. So if the bridge wasn't loaded, then the CHANGEUPPER event couldn't have happened. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Tested-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-04 12:35:07 +01:00
Arnd Bergmann	12c3dca25d	ARM: ep93xx: remove MaverickCrunch support The MaverickCrunch support for ep93xx never made it into glibc and was removed from gcc in its 4.8 release in 2012. It is now one of the last parts of arch/arm/ that fails to build with the clang integrated assembler, which is unlikely to ever want to support it. The two alternatives are to force the use of binutils/gas when building the crunch support, or to remove it entirely. According to Hartley Sweeten: "Martin Guy did a lot of work trying to get the maverick crunch working but I was never able to successfully use it for anything. It "kind" of works but depending on the EP93xx silicon revision there are still a number of hardware bugs that either give imprecise or garbage results. I have no problem with removing the kernel support for the maverick crunch." Unless someone else comes up with a good reason to keep it around, remove it now. This touches mostly the ep93xx platform, but removes a bit of code from ARM common ptrace and signal frame handling as well. If there are remaining users of MaverickCrunch, they can use LTS kernels for at least another five years before kernel support ends. Link: https://lore.kernel.org/linux-arm-kernel/20210802141245.1146772-1-arnd@kernel.org/ Link: https://lore.kernel.org/linux-arm-kernel/20210226164345.3889993-1-arnd@kernel.org/ Link: https://github.com/ClangBuiltLinux/linux/issues/1272 Link: https://gcc.gnu.org/legacy-ml/gcc/2008-03/msg01063.html Cc: "Martin Guy" <martinwguy@martinwguy@gmail.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>	2021-08-04 13:30:04 +02:00
Om Prakash Singh	f62750e691	PCI: tegra194: Cleanup unused code Remove unused code from function tegra_pcie_config_ep. Link: https://lore.kernel.org/r/20210623100525.19944-6-omp@nvidia.com Signed-off-by: Om Prakash Singh <omp@nvidia.com> Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Reviewed-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Vidya Sagar <vidyas@nvidia.com>	2021-08-04 12:28:17 +01:00
Om Prakash Singh	de2bbf2b71	PCI: tegra194: Don't allow suspend when Tegra PCIe is in EP mode When Tegra PCIe is in endpoint mode it should be available for root port. PCIe link up by root port fails if it is in suspend state. So, don't allow Tegra to suspend when endpoint mode is enabled. Link: https://lore.kernel.org/r/20210623100525.19944-5-omp@nvidia.com Signed-off-by: Om Prakash Singh <omp@nvidia.com> Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Reviewed-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Vidya Sagar <vidyas@nvidia.com>	2021-08-04 12:28:17 +01:00
Om Prakash Singh	834c5cf2b5	PCI: tegra194: Disable interrupts before entering L2 In suspend_noirq() call if link doesn't goto L2, PERST# is asserted to bring link to detect state. However, this is causing surprise link down AER error. Since Kernel is executing noirq suspend calls, AER interrupt is not processed. PME and AER are shared interrupts and PCIe subsystem driver enables wake capability of PME irq during suspend. So this AER will cause suspend failure due to pending AER interrupt. After PCIe link is in L2, interrupts are not expected since PCIe controller will be in reset state. Disable PCIe interrupts before going to L2 state to avoid pending AER interrupt. Link: https://lore.kernel.org/r/20210623100525.19944-4-omp@nvidia.com Signed-off-by: Om Prakash Singh <omp@nvidia.com> Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Reviewed-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Vidya Sagar <vidyas@nvidia.com>	2021-08-04 12:28:17 +01:00
Om Prakash Singh	43537cf7e3	PCI: tegra194: Fix MSI-X programming Lower order MSI-X address is programmed in MSIX_ADDR_MATCH_HIGH_OFF DBI register instead of higher order address. This patch fixes this programming mistake. Link: https://lore.kernel.org/r/20210623100525.19944-3-omp@nvidia.com Signed-off-by: Om Prakash Singh <omp@nvidia.com> Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Reviewed-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Vidya Sagar <vidyas@nvidia.com>	2021-08-04 12:28:17 +01:00
Om Prakash Singh	ceb1412c1c	PCI: tegra194: Fix handling BME_CHGED event In tegra_pcie_ep_hard_irq(), APPL_INTR_STATUS_L0 is stored in val and again APPL_INTR_STATUS_L1_0_0 is also stored in val. So when execution reaches "if (val & APPL_INTR_STATUS_L0_PCI_CMD_EN_INT)", val is not correct. Link: https://lore.kernel.org/r/20210623100525.19944-2-omp@nvidia.com Signed-off-by: Om Prakash Singh <omp@nvidia.com> Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Reviewed-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Vidya Sagar <vidyas@nvidia.com>	2021-08-04 12:28:16 +01:00
Miklos Szeredi	badc741459	fuse: move option checking into fuse_fill_super() Checking whether the "fd=", "rootmode=", "user_id=" and "group_id=" mount options are present can be moved from fuse_get_tree() into fuse_fill_super() where the value of the options are consumed. This relaxes semantics of reusing a fuse blockdev mount using the device name. Before this patch presence of these options were enforced but values ignored, after this patch these options are completely ignored in this case. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2021-08-04 13:22:58 +02:00
Miklos Szeredi	84c215075b	fuse: name fs_context consistently Naming convention under fs/fuse/: struct fuse_conn fc; struct fs_context fsc; Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2021-08-04 13:22:58 +02:00
Miklos Szeredi	e1e71c1688	fuse: fix use after free in fuse_read_interrupt() There is a potential race between fuse_read_interrupt() and fuse_request_end(). TASK1 in fuse_read_interrupt(): delete req->intr_entry (while holding fiq->lock) TASK2 in fuse_request_end(): req->intr_entry is empty -> skip fiq->lock wake up TASK3 TASK3 request is freed TASK1 in fuse_read_interrupt(): dereference req->in.h.unique *BAM* Fix by always grabbing fiq->lock if the request was ever interrupted (FR_INTERRUPTED set) thereby serializing with concurrent fuse_read_interrupt() calls. FR_INTERRUPTED is set before the request is queued on fiq->interrupts. Dequeing the request is done with list_del_init() but FR_INTERRUPTED is not cleared in this case. Reported-by: lijiazi <lijiazi@xiaomi.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2021-08-04 13:22:58 +02:00
Rob Herring	aeaea8969b	PCI: iproc: Fix BCMA probe resource handling In commit `7ef1c871da` ("PCI: iproc: Use pci_parse_request_of_pci_ranges()"), calling devm_request_pci_bus_resources() was dropped from the common iProc probe code, but is still needed for BCMA bus probing. Without it, there will be lots of warnings like this: pci 0000:00:00.0: BAR 8: no space for [mem size 0x00c00000] pci 0000:00:00.0: BAR 8: failed to assign [mem size 0x00c00000] Add back calling devm_request_pci_bus_resources() and adding the resources to pci_host_bridge.windows for BCMA bus probe. Link: https://lore.kernel.org/r/20210803215656.3803204-2-robh@kernel.org Fixes: `7ef1c871da` ("PCI: iproc: Use pci_parse_request_of_pci_ranges()") Reported-by: Rafał Miłecki <zajec5@gmail.com> Tested-by: Rafał Miłecki <rafal@milecki.pl> Signed-off-by: Rob Herring <robh@kernel.org> Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Cc: Srinath Mannam <srinath.mannam@broadcom.com> Cc: Roman Bacik <roman.bacik@broadcom.com> Cc: Bharat Gooty <bharat.gooty@broadcom.com> Cc: Abhishek Shah <abhishek.shah@broadcom.com> Cc: Jitendra Bhivare <jitendra.bhivare@broadcom.com> Cc: Ray Jui <ray.jui@broadcom.com> Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: BCM Kernel Feedback <bcm-kernel-feedback-list@broadcom.com> Cc: Scott Branden <sbranden@broadcom.com> Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Cc: "Krzysztof Wilczyński" <kw@linux.com> Cc: Bjorn Helgaas <bhelgaas@google.com>	2021-08-04 12:20:00 +01:00
Rob Herring	d277f6e88c	PCI: of: Don't fail devm_pci_alloc_host_bridge() on missing 'ranges' Commit `669cbc7081` ("PCI: Move DT resource setup into devm_pci_alloc_host_bridge()") made devm_pci_alloc_host_bridge() fail on any DT resource parsing errors, but Broadcom iProc uses devm_pci_alloc_host_bridge() on BCMA bus devices that don't have DT resources. In particular, there is no 'ranges' property. Fix iProc by making 'ranges' optional. If 'ranges' is required by a platform, there's going to be more errors latter on if it is missing. Link: https://lore.kernel.org/r/20210803215656.3803204-1-robh@kernel.org Fixes: `669cbc7081` ("PCI: Move DT resource setup into devm_pci_alloc_host_bridge()") Reported-by: Rafał Miłecki <zajec5@gmail.com> Tested-by: Rafał Miłecki <rafal@milecki.pl> Signed-off-by: Rob Herring <robh@kernel.org> Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Cc: Srinath Mannam <srinath.mannam@broadcom.com> Cc: Roman Bacik <roman.bacik@broadcom.com> Cc: Bharat Gooty <bharat.gooty@broadcom.com> Cc: Abhishek Shah <abhishek.shah@broadcom.com> Cc: Jitendra Bhivare <jitendra.bhivare@broadcom.com> Cc: Ray Jui <ray.jui@broadcom.com> Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: BCM Kernel Feedback <bcm-kernel-feedback-list@broadcom.com> Cc: Scott Branden <sbranden@broadcom.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>	2021-08-04 12:20:00 +01:00
Shaik Sajida Bhanu	67b13f3e22	mmc: sdhci-msm: Update the software timeout value for sdhc Whenever SDHC run at clock rate 50MHZ or below, the hardware data timeout value will be 21.47secs, which is approx. 22secs and we have a current software timeout value as 10secs. We have to set software timeout value more than the hardware data timeout value to avioid seeing the below register dumps. [ 332.953670] mmc2: Timeout waiting for hardware interrupt. [ 332.959608] mmc2: sdhci: ============ SDHCI REGISTER DUMP =========== [ 332.966450] mmc2: sdhci: Sys addr: 0x00000000 \| Version: 0x00007202 [ 332.973256] mmc2: sdhci: Blk size: 0x00000200 \| Blk cnt: 0x00000001 [ 332.980054] mmc2: sdhci: Argument: 0x00000000 \| Trn mode: 0x00000027 [ 332.986864] mmc2: sdhci: Present: 0x01f801f6 \| Host ctl: 0x0000001f [ 332.993671] mmc2: sdhci: Power: 0x00000001 \| Blk gap: 0x00000000 [ 333.000583] mmc2: sdhci: Wake-up: 0x00000000 \| Clock: 0x00000007 [ 333.007386] mmc2: sdhci: Timeout: 0x0000000e \| Int stat: 0x00000000 [ 333.014182] mmc2: sdhci: Int enab: 0x03ff100b \| Sig enab: 0x03ff100b [ 333.020976] mmc2: sdhci: ACmd stat: 0x00000000 \| Slot int: 0x00000000 [ 333.027771] mmc2: sdhci: Caps: 0x322dc8b2 \| Caps_1: 0x0000808f [ 333.034561] mmc2: sdhci: Cmd: 0x0000183a \| Max curr: 0x00000000 [ 333.041359] mmc2: sdhci: Resp[0]: 0x00000900 \| Resp[1]: 0x00000000 [ 333.048157] mmc2: sdhci: Resp[2]: 0x00000000 \| Resp[3]: 0x00000000 [ 333.054945] mmc2: sdhci: Host ctl2: 0x00000000 [ 333.059657] mmc2: sdhci: ADMA Err: 0x00000000 \| ADMA Ptr: 0x0000000ffffff218 [ 333.067178] mmc2: sdhci_msm: ----------- VENDOR REGISTER DUMP ----------- [ 333.074343] mmc2: sdhci_msm: DLL sts: 0x00000000 \| DLL cfg: 0x6000642c \| DLL cfg2: 0x0020a000 [ 333.083417] mmc2: sdhci_msm: DLL cfg3: 0x00000000 \| DLL usr ctl: 0x00000000 \| DDR cfg: 0x80040873 [ 333.092850] mmc2: sdhci_msm: Vndr func: 0x00008a9c \| Vndr func2 : 0xf88218a8 Vndr func3: 0x02626040 [ 333.102371] mmc2: sdhci: ============================================ So, set software timeout value more than hardware timeout value. Signed-off-by: Shaik Sajida Bhanu <sbhanu@codeaurora.org> Acked-by: Adrian Hunter <adrian.hunter@intel.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/1626435974-14462-1-git-send-email-sbhanu@codeaurora.org Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>	2021-08-04 13:17:21 +02:00
Christophe Kerello	d8e193f13b	mmc: mmci: stm32: Check when the voltage switch procedure should be done If the card has not been power cycled, it may still be using 1.8V signaling. This situation is detected in mmc_sd_init_card function and should be handled in mmci stm32 variant. The host->pwr_reg variable is also correctly protected with spin locks. Fixes: `94b94a93e3` ("mmc: mmci_sdmmc: Implement signal voltage callbacks") Signed-off-by: Christophe Kerello <christophe.kerello@foss.st.com> Signed-off-by: Yann Gautier <yann.gautier@foss.st.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210701143353.13188-1-yann.gautier@foss.st.com Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>	2021-08-04 12:46:02 +02:00
Vincent Whitchurch	25f8203b4b	mmc: dw_mmc: Fix hang on data CRC error When a Data CRC interrupt is received, the driver disables the DMA, then sends the stop/abort command and then waits for Data Transfer Over. However, sometimes, when a data CRC error is received in the middle of a multi-block write transfer, the Data Transfer Over interrupt is never received, and the driver hangs and never completes the request. The driver sets the BMOD.SWR bit (SDMMC_IDMAC_SWRESET) when stopping the DMA, but according to the manual CMD.STOP_ABORT_CMD should be programmed "before assertion of SWR". Do these operations in the recommended order. With this change the Data Transfer Over is always received correctly in my tests. Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com> Reviewed-by: Jaehoon Chung <jh80.chung@samsung.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210630102232.16011-1-vincent.whitchurch@axis.com Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>	2021-08-04 12:41:20 +02:00
David S. Miller	9c0532f9cc	Merge tag 'linux-can-next-for-5.15-20210804' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next Marc Kleine-Budde says: ==================== pull-request: can-next 2021-08-04 this is a pull request of 5 patches for net-next/master. The first patch is by me and fixes a typo in a comment in the CAN J1939 protocol. The next 2 patches are by Oleksij Rempel and update the CAN J1939 protocol to send RX status updates via the error queue mechanism. The next patch is by me and adds a missing variable initialization to the flexcan driver (the problem was introduced in the current net-next cycle). The last patch is by Aswath Govindraju and adds power-domains to the Bosch m_can DT binding documentation. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2021-08-04 11:30:09 +01:00
Linus Walleij	47adef20e6	pata: ixp4xx: Rewrite to use device tree This rewrites the IXP4xx CF (IDE) libata driver to use the device tree exclusively to look up its resources: - Probe exclusively from the device tree and look up all resources from there. - Allocate a local state container with devres and pass this around in .private_data. - Initialize with struct ata_port_info. - Use the .set_piomode() callback instead of the much wider .set_mode(), we only support PIO after all. - Bump driver version number from 0.2 to 1.0 to reflect this wider change. - Get a handle on the expansion bus syscon regmap to alter the timings on the chip select. - Put in the more elaborate timing adjustment code for PIO0 to PIO4 in 8 and 16bit mode from the downstream OpenWrt patch. The board file initialization path and platform data include is dropped because the board files will be deleted at the same time as this patch is merged. The platform data file is not deleted right now so as not to conflict with the removal of board files. Signed-off-by: Linus Walleij <linus.walleij@linaro.org>	2021-08-04 12:20:44 +02:00
Linus Walleij	be470496ee	pata: ixp4xx: Add DT bindings This adds device tree bindings for the Intel IXP4xx compact flash card interface. Cc: devicetree@vger.kernel.org Reviewed-by: Rob Herring <robh@kernel.org> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>	2021-08-04 12:20:37 +02:00
Linus Walleij	8e3d25a623	pata: ixp4xx: Refer to cmd and ctl rather than csN The two "cs0" and "cs1" are "chip selects" but on some platforms such as GW2358 they are actually both in CS3 making this terminology very confusing. Call the addresses "cmd" and "ctl" after function instead. Signed-off-by: Linus Walleij <linus.walleij@linaro.org>	2021-08-04 12:20:32 +02:00
Linus Walleij	d2b507acc6	pata: ixp4xx: Use IS_ENABLED() to determine endianness Instead of an ARM-specific ifdef, use the global CPU config and if (IS_ENABLED()). Signed-off-by: Linus Walleij <linus.walleij@linaro.org>	2021-08-04 12:20:27 +02:00
Linus Walleij	f62b38965a	pata: ixp4xx: Use local dev variable Let's simplify all &pdev->dev references by creating a local struct device *dev variable. Signed-off-by: Linus Walleij <linus.walleij@linaro.org>	2021-08-04 12:20:21 +02:00
Linus Walleij	21a0a29d16	watchdog: ixp4xx: Rewrite driver to use core This rewrites the IXP4xx watchdog driver as follows: - Spawn the watchdog driver as a platform device from the timer driver. It's one device in the hardware, and the fact that Linux splits the handling into two different devices is a Linux pecularity, and thus it becomes a Linux pecularity to spawn a separate watchdog driver. - Spawn the watchdog driver from the timer driver at probe(). This is well after the timer driver as actually registered and started and we know the register base is available. - Instead of looping back callbacks to the timer drivers for all watchdog calls, pass the register base to the watchdog driver and manage the registers there. The two drivers aren't even interested in the same register so the spinlock is totally surplus, delete it. - Replace pretty much all of the content in the watchdog driver with a simple, modern watchdog driver utilizing the watchdog core instead of registering its own misc device and ioctl() handling. - Drop module parameters as the same already exist in the watchdog core. What remains is a slim elegant (IMO) watchdog driver using the watchdog core, spawning from device tree or boardfile alike. Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>	2021-08-04 12:20:13 +02:00
Linus Walleij	1c953bda90	bus: ixp4xx: Add a driver for IXP4xx expansion bus The Intel IXP4xx SoCs have an expansion bus that is usually just used for flash memory and configured by the boot loaders and can be accessed using the "simple-bus". However some devices need more elaborate configuration and then we need to provide a proper 3-unit address space indicating chip select for each device and provide timing and similar information. Cc: Marc Zyngier <maz@kernel.org> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>	2021-08-04 12:20:06 +02:00
Linus Walleij	3fbcc6763b	bus: ixp4xx: Add DT bindings for the IXP4xx expansion bus This adds device tree bindings for the IXP4xx expansion bus controller. Cc: Marc Zyngier <maz@kernel.org> Cc: devicetree@vger.kernel.org Reviewed-by: Rob Herring <robh@kernel.org> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>	2021-08-04 12:19:58 +02:00
Aswath Govindraju	d85165b238	dt-bindings: net: can: Document power-domains property Document power-domains property for adding the Power domain provider. Link: https://lore.kernel.org/r/20210802091822.16407-1-a-govindraju@ti.com Signed-off-by: Aswath Govindraju <a-govindraju@ti.com> Acked-by: Rob Herring <robh@kernel.org> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>	2021-08-04 12:11:57 +02:00
Marc Kleine-Budde	3362666972	can: flexcan: flexcan_clks_enable(): add missing variable initialization This patch adds the missing initialization of the "err" variable in the flexcan_clks_enable() function. Fixes: `d9cead75b1` ("can: flexcan: add mcf5441x support") Link: https://lore.kernel.org/r/20210728075428.1493568-1-mkl@pengutronix.de Reported-by: kernel test robot <lkp@intel.com> Cc: Angelo Dureghello <angelo@kernel-space.org> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>	2021-08-04 12:11:56 +02:00
Oleksij Rempel	5b9272e93f	can: j1939: extend UAPI to notify about RX status To be able to create applications with user friendly feedback, we need be able to provide receive status information. Typical ETP transfer may take seconds or even hours. To give user some clue or show a progress bar, the stack should push status updates. Same as for the TX information, the socket error queue will be used with following new signals: - J1939_EE_INFO_RX_RTS - received and accepted request to send signal. - J1939_EE_INFO_RX_DPO - received data package offset signal - J1939_EE_INFO_RX_ABORT - RX session was aborted Instead of completion signal, user will get data package. To activate this signals, application should set SOF_TIMESTAMPING_RX_SOFTWARE to the SO_TIMESTAMPING socket option. This will avoid unpredictable application behavior for the old software. Link: https://lore.kernel.org/r/20210707094854.30781-3-o.rempel@pengutronix.de Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>	2021-08-04 12:11:52 +02:00
Oleksij Rempel	cd85d3aed5	can: j1939: rename J1939_ERRQUEUE_* to J1939_ERRQUEUE_TX_* Prepare the world for the J1939_ERRQUEUE_RX_ version Link: https://lore.kernel.org/r/20210707094854.30781-2-o.rempel@pengutronix.de Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>	2021-08-04 12:11:48 +02:00
Sean Christopherson	179c6c27bf	KVM: SVM: Fix off-by-one indexing when nullifying last used SEV VMCB Use the raw ASID, not ASID-1, when nullifying the last used VMCB when freeing an SEV ASID. The consumer, pre_sev_run(), indexes the array by the raw ASID, thus KVM could get a false negative when checking for a different VMCB if KVM manages to reallocate the same ASID+VMCB combo for a new VM. Note, this cannot cause a functional issue _in the current code_, as pre_sev_run() also checks which pCPU last did VMRUN for the vCPU, and last_vmentry_cpu is initialized to -1 during vCPU creation, i.e. is guaranteed to mismatch on the first VMRUN. However, prior to commit `8a14fe4f0c` ("kvm: x86: Move last_cpu into kvm_vcpu_arch as last_vmentry_cpu"), SVM tracked pCPU on its own and zero-initialized the last_cpu variable. Thus it's theoretically possible that older versions of KVM could miss a TLB flush if the first VMRUN is on pCPU0 and the ASID and VMCB exactly match those of a prior VM. Fixes: `70cd94e60c` ("KVM: SVM: VMRUN should use associated ASID when SEV is enabled") Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-08-04 06:02:09 -04:00
Paolo Bonzini	85cd39af14	KVM: Do not leak memory for duplicate debugfs directories KVM creates a debugfs directory for each VM in order to store statistics about the virtual machine. The directory name is built from the process pid and a VM fd. While generally unique, it is possible to keep a file descriptor alive in a way that causes duplicate directories, which manifests as these messages: [ 471.846235] debugfs: Directory '20245-4' with parent 'kvm' already present! Even though this should not happen in practice, it is more or less expected in the case of KVM for testcases that call KVM_CREATE_VM and close the resulting file descriptor repeatedly and in parallel. When this happens, debugfs_create_dir() returns an error but kvm_create_vm_debugfs() goes on to allocate stat data structs which are later leaked. The slow memory leak was spotted by syzkaller, where it caused OOM reports. Since the issue only affects debugfs, do a lookup before calling debugfs_create_dir, so that the message is downgraded and rate-limited. While at it, ensure kvm->debugfs_dentry is NULL rather than an error if it is not created. This fixes kvm_destroy_vm_debugfs, which was not checking IS_ERR_OR_NULL correctly. Cc: stable@vger.kernel.org Fixes: `536a6f88c4` ("KVM: Create debugfs dir and stat files for each VM") Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru> Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-08-04 06:02:03 -04:00
Like Xu	e79f49c37c	KVM: x86/pmu: Introduce pmc->is_paused to reduce the call time of perf interfaces Based on our observations, after any vm-exit associated with vPMU, there are at least two or more perf interfaces to be called for guest counter emulation, such as perf_event_{pause, read_value, period}(), and each one will {lock, unlock} the same perf_event_ctx. The frequency of calls becomes more severe when guest use counters in a multiplexed manner. Holding a lock once and completing the KVM request operations in the perf context would introduce a set of impractical new interfaces. So we can further optimize the vPMU implementation by avoiding repeated calls to these interfaces in the KVM context for at least one pattern: After we call perf_event_pause() once, the event will be disabled and its internal count will be reset to 0. So there is no need to pause it again or read its value. Once the event is paused, event period will not be updated until the next time it's resumed or reprogrammed. And there is also no need to call perf_event_period twice for a non-running counter, considering the perf_event for a running counter is never paused. Based on this implementation, for the following common usage of sampling 4 events using perf on a 4u8g guest: echo 0 > /proc/sys/kernel/watchdog echo 25 > /proc/sys/kernel/perf_cpu_time_max_percent echo 10000 > /proc/sys/kernel/perf_event_max_sample_rate echo 0 > /proc/sys/kernel/perf_cpu_time_max_percent for i in `seq 1 1 10` do taskset -c 0 perf record \ -e cpu-cycles -e instructions -e branch-instructions -e cache-misses \ /root/br_instr a done the average latency of the guest NMI handler is reduced from 37646.7 ns to 32929.3 ns (~1.14x speed up) on the Intel ICX server. Also, in addition to collecting more samples, no loss of sampling accuracy was observed compared to before the optimization. Signed-off-by: Like Xu <likexu@tencent.com> Message-Id: <20210728120705.6855-1-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org>	2021-08-04 05:55:56 -04:00
Peter Xu	a75b540451	KVM: X86: Optimize zapping rmap Using rmap_get_first() and rmap_remove() for zapping a huge rmap list could be slow. The easy way is to travers the rmap list, collecting the a/d bits and free the slots along the way. Provide a pte_list_destroy() and do exactly that. Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20210730220605.26377-1-peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-08-04 05:55:56 -04:00
Peter Xu	13236e25eb	KVM: X86: Optimize pte_list_desc with per-array counter Add a counter field into pte_list_desc, so as to simplify the add/remove/loop logic. E.g., we don't need to loop over the array any more for most reasons. This will make more sense after we've switched the array size to be larger otherwise the counter will be a waste. Initially I wanted to store a tail pointer at the head of the array list so we don't need to traverse the list at least for pushing new ones (if without the counter we traverse both the list and the array). However that'll need slightly more change without a huge lot benefit, e.g., after we grow entry numbers per array the list traversing is not so expensive. So let's be simple but still try to get as much benefit as we can with just these extra few lines of changes (not to mention the code looks easier too without looping over arrays). I used the same a test case to fork 500 child and recycle them ("./rmap_fork 500" [1]), this patch further speeds up the total fork time of about 4%, which is a total of 33% of vanilla kernel: Vanilla: 473.90 (+-5.93%) 3->15 slots: 366.10 (+-4.94%) Add counter: 351.00 (+-3.70%) [1] `825436f825` Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20210730220602.26327-1-peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-08-04 05:55:56 -04:00
Peter Xu	dc1cff9691	KVM: X86: MMU: Tune PTE_LIST_EXT to be bigger Currently rmap array element only contains 3 entries. However for EPT=N there could have a lot of guest pages that got tens of even hundreds of rmap entry. A normal distribution of a 6G guest (even if idle) shows this with rmap count statistics: Rmap_Count: 0 1 2-3 4-7 8-15 16-31 32-63 64-127 128-255 256-511 512-1023 Level=4K: 3089171 49005 14016 1363 235 212 15 7 0 0 0 Level=2M: 5951 227 0 0 0 0 0 0 0 0 0 Level=1G: 32 0 0 0 0 0 0 0 0 0 0 If we do some more fork some pages will grow even larger rmap counts. This patch makes PTE_LIST_EXT bigger so it'll be more efficient for the general use case of EPT=N as we do list reference less and the loops over PTE_LIST_EXT will be slightly more efficient; but still not too large so less waste when array not full. It should not affecting EPT=Y since EPT normally only has zero or one rmap entry for each page, so no array is even allocated. With a test case to fork 500 child and recycle them ("./rmap_fork 500" [1]), this patch speeds up fork time of about 29%. Before: 473.90 (+-5.93%) After: 366.10 (+-4.94%) [1] `825436f825` Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20210730220455.26054-6-peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-08-04 05:55:56 -04:00

... 151 152 153 154 155 ...

1042671 Commits