Previously nbd_index_mutex was held during whole add/remove/lookup
operations in order to guarantee that partially initialized devices are
not reachable via idr_find() or idr_for_each(). But now that partially
initialized devices become reachable as soon as idr_alloc() succeeds,
we need to skip partially initialized devices. Since it seems that
all functions use refcount_inc_not_zero(&nbd->refs) in order to skip
destroying devices, update nbd->refs from zero to non-zero as the last
step of device initialization in order to also skip partially initialized
devices.
Fixes: 6e4df4c648 ("nbd: reduce the nbd_index_mutex scope")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
[hch: split from a larger patch, added comments]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210825163108.50713-4-hch@lst.de
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Clean up the following includecheck warning:
./tools/testing/selftests/sched/cs_prctl_test.c:
Include files sys/types.h and sys/wait.h are included more than
once.
No functional change.
Fixed commit header and log:
Shuah Khan <skhan@linuxfoundation.org>
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Changcheng Deng <deng.changcheng@zte.com.cn>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
The openat2 test suite fails on ARM64 because the definition of
O_LARGEFILE is different on ARM64. Fix the problem by defining
the correct O_LARGEFILE definition on ARM64.
"openat2 unexpectedly returned # 3['.../tools/testing/selftests/openat2']
with 208000 (!= 208000)
not ok 102 openat2 with incompatible flags (O_PATH | O_LARGEFILE) fails
with -22 (Invalid argument)"
Fixed change log to improve formatting and clarity:
Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Aleksa Sarai <cyphar@cyphar.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
Merge fixes from Andrew Morton:
"2 patches.
Subsystems affected by this patch series: mm/memory-hotplug and
MAINTAINERS"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
MAINTAINERS: exfat: update my email address
mm/memory_hotplug: fix potential permanent lru cache disable
Magnus Karlsson says:
====================
This patch set mainly contains various simplifications to the xsk
selftests. The only exception is the introduction of packet streams
that describes what the Tx process should send and what the Rx process
should receive. If it receives anything else, the test fails. This
mechanism can be used to produce tests were all packets are not
received by the Rx thread or modified in some way. An example of this
is if an XDP program does XDP_PASS on some of the packets.
This patch set will be followed by another patch set that implements a
new structure that will facilitate adding new tests. A couple of new
tests will also be included in that patch set.
v2 -> v3:
* Reworked patch 12 so that it now has functions for creating and
destroying ifobjects. Simplifies the code. [Maciej]
* The packet stream now allocates the supplied buffer array length,
instead of the default one. [Maciej]
* pkt_stream_get_pkt() now returns NULL when indexing a non-existing
packet. [Maciej]
* pkt_validate() is now is_pkt_valid(). [Maciej]
* Slowed down packet sending speed even more in patch 11 so that slow
systems do not silenty drop packets in skb mode.
v1 -> v2:
* Dropped the patch with per process limit changes as it is not needed
[Yonghong]
* Improved the commit message of patch 1 [Yonghong]
* Fixed a spelling error in patch 9
Thanks: Magnus
====================
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Generate packets from a specification instead of something hard
coded. The idea is that a test generates one or more packet
specifications and provides it/them to both Tx and Rx. The Tx thread
will generate from this specification and Rx will validate that it
receives what is in the specification. The specification can be the
same on both ends, meaning that everything that was sent should be
received, or different which means that Rx will only receive part of
the sent packets.
Currently, the packet specification is the same for both Rx and Tx and
the same for each test. This will change in later work as features
and tests are added.
The data path functions are also renamed to better reflect what
actions they are performing after introducing this feature.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210825093722.10219-15-magnus.karlsson@gmail.com
Decrease sending speed to avoid potentially overflowing some buffers
in the skb case that leads to dropped packets we cannot control (and
thus the tests may generate false negatives). Decrease batch size and
introduce a usleep in the transmit thread to not overflow the
receiver.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210825093722.10219-12-magnus.karlsson@gmail.com
Simplify packet validation in the xsk selftests by performing it at
once for every packet. The current code performed this per batch and
did this on copied packet data. Make it simpler and faster by
validating it at once and on the umem packet data thus skipping the
copy and the memory allocation for the temprary buffer.
The optional packet dump feature is also simplified in the same
manner. Memory allocation and copying is removed and the dump is
performed directly on the umem data.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210825093722.10219-10-magnus.karlsson@gmail.com
Commit adae1e931a ("Drivers: hv: vmbus: Copy packets sent by Hyper-V out
of the ring buffer") introduced a notion of maximum packet size and for
KVM and FCOPY drivers set it to the length of the receive buffer. VSS
driver wasn't updated, this means that the maximum packet size is now
VMBUS_DEFAULT_MAX_PKT_SIZE (4k). Apparently, this is not enough. I'm
observing a packet of 6304 bytes which is being truncated to 4096. When
VSS driver tries to read next packet from ring buffer it starts from the
wrong offset and receives garbage.
Set the maximum packet size to 'HV_HYP_PAGE_SIZE * 2' in VSS driver. This
matches the length of the receive buffer and is in line with other utils
drivers.
Fixes: adae1e931a ("Drivers: hv: vmbus: Copy packets sent by Hyper-V out of the ring buffer")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20210825133857.847866-1-vkuznets@redhat.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
GENPD core doesn't support handling performance state changes while
consumer device is runtime-suspended or when runtime PM is disabled.
GENPD core may override performance state that was configured by device
driver while RPM of the device was disabled or device was RPM-suspended.
Let's close that gap by allowing drivers to control performance state
while RPM of a consumer device is disabled and to set up performance
state of RPM-suspended device that will be applied by GENPD core on
RPM-resume of the device.
Fixes: 5937c3ce21 ("PM: domains: Drop/restore performance state votes for devices at runtime PM")
Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
On ia64/allmodconfig:
drivers/of/fdt.c:609:20: error: conflicting types for 'reserve_elfcorehdr'; have 'void(void)'
609 | static void __init reserve_elfcorehdr(void)
| ^~~~~~~~~~~~~~~~~~
arch/ia64/include/asm/meminit.h:43:12: note: previous declaration of 'reserve_elfcorehdr' with type 'int(u64 *, u64 *)' {aka 'int(long long unsigned int *, long long unsigned int *)'}
43 | extern int reserve_elfcorehdr(u64 *start, u64 *end);
| ^~~~~~~~~~~~~~~~~~
Fix this by prefixing the FDT function name with "fdt_".
Fixes: f7e7ce93aa ("of: fdt: Add generic support for handling elf core headers property")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://lore.kernel.org/r/f6eabbbce0fba6da3da0264c1e1cf23c01173999.1629884393.git.geert+renesas@glider.be
Signed-off-by: Rob Herring <robh@kernel.org>
It is possible that HWP guaranteed ratio is changed in response to
change in power and thermal limits. For example when Intel Speed Select
performance profile is changed or there is change in TDP, hardware can
send notifications. It is possible that the guaranteed ratio is
increased. This creates an issue when turbo is disabled, as the old
limits set in MSR_HWP_REQUEST are still lower and hardware will clip
to older limits.
This change enables HWP interrupt and process HWP interrupts. When
guaranteed is changed, calls cpufreq_update_policy() so that driver
callbacks are called to update to new HWP limits. This callback
is called from a delayed workqueue of 10ms to avoid frequent updates.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Add a weak function to process HWP (Hardware P-states) notifications and
move updating HWP_STATUS MSR to this function.
This allows HWP interrupts to be processed by the intel_pstate driver in
HWP mode by overriding the implementation.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The Lenovo Yoga 9 (14INTL5)'s ACPI _LID is bugged:
After hibernation the lid is initially reported as closed.
Once closing and then reopening the lid reports the lid as
open again. This leads to the conclusion that the initial
notification of the lid is missing but subsequent
notifications are correct.
In order fo the Linux LID code to handle this device properly
the lid_init_state must be set to ACPI_BUTTON_LID_INIT_OPEN.
Signed-off-by: Ulrich Huber <ulrich@huberulrich.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The memory attributes attached to memory regions depend on architecture
specific mappings.
For some memory regions, the attributes specified by firmware (eg
uncached) are not sufficient to determine how a memory region should be
mapped by an OS (for instance a region that is define as uncached in
firmware can be mapped as Normal or Device memory on arm64) and
therefore the OS must be given control on how to map the region to match
the expected mapping behaviour (eg if a mapping is requested with memory
semantics, it must allow unaligned accesses).
Rework acpi_os_map_memory() and acpi_os_ioremap() back-end to split
them into two separate code paths:
acpi_os_memmap() -> memory semantics
acpi_os_ioremap() -> MMIO semantics
The split allows the architectural implementation back-ends to detect
the default memory attributes required by the mapping in question
(ie the mapping API defines the semantics memory vs MMIO) and map the
memory accordingly.
Link: https://lore.kernel.org/linux-arm-kernel/31ffe8fc-f5ee-2858-26c5-0fd8bdd68702@arm.com
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Daniel Xu says:
====================
The motivation behind this helper is to access userspace pt_regs in a
kprobe handler.
uprobe's ctx is the userspace pt_regs. kprobe's ctx is the kernelspace
pt_regs. bpf_task_pt_regs() allows accessing userspace pt_regs in a
kprobe handler. The final case (kernelspace pt_regs in uprobe) is
pretty rare (usermode helper) so I think that can be solved later if
necessary.
More concretely, this helper is useful in doing BPF-based DWARF stack
unwinding. Currently the kernel can only do framepointer based stack
unwinds for userspace code. This is because the DWARF state machines are
too fragile to be computed in kernelspace [0]. The idea behind
DWARF-based stack unwinds w/ BPF is to copy a chunk of the userspace
stack (while in prog context) and send it up to userspace for unwinding
(probably with libunwind) [1]. This would effectively enable profiling
applications with -fomit-frame-pointer using kprobes and uprobes.
[0]: https://lkml.org/lkml/2012/2/10/356
[1]: https://github.com/danobi/bpf-dwarf-walk
Changes from v1:
- Conwolidate BTF_ID decls for task_struct
- Enable bpf_get_current_task_btf() for all prog types
- Enable bpf_task_pt_regs() for all prog types
- Use ASSERT_* macros instead of CHECK
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The motivation behind this helper is to access userspace pt_regs in a
kprobe handler.
uprobe's ctx is the userspace pt_regs. kprobe's ctx is the kernelspace
pt_regs. bpf_task_pt_regs() allows accessing userspace pt_regs in a
kprobe handler. The final case (kernelspace pt_regs in uprobe) is
pretty rare (usermode helper) so I think that can be solved later if
necessary.
More concretely, this helper is useful in doing BPF-based DWARF stack
unwinding. Currently the kernel can only do framepointer based stack
unwinds for userspace code. This is because the DWARF state machines are
too fragile to be computed in kernelspace [0]. The idea behind
DWARF-based stack unwinds w/ BPF is to copy a chunk of the userspace
stack (while in prog context) and send it up to userspace for unwinding
(probably with libunwind) [1]. This would effectively enable profiling
applications with -fomit-frame-pointer using kprobes and uprobes.
[0]: https://lkml.org/lkml/2012/2/10/356
[1]: https://github.com/danobi/bpf-dwarf-walk
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/e2718ced2d51ef4268590ab8562962438ab82815.1629772842.git.dxu@dxuuu.xyz
It turns out that the SIGIO/FASYNC situation is almost exactly the same
as the EPOLLET case was: user space really wants to be notified after
every operation.
Now, in a perfect world it should be sufficient to only notify user
space on "state transitions" when the IO state changes (ie when a pipe
goes from unreadable to readable, or from unwritable to writable). User
space should then do as much as possible - fully emptying the buffer or
what not - and we'll notify it again the next time the state changes.
But as with EPOLLET, we have at least one case (stress-ng) where the
kernel sent SIGIO due to the pipe being marked for asynchronous
notification, but the user space signal handler then didn't actually
necessarily read it all before returning (it read more than what was
written, but since there could be multiple writes, it could leave data
pending).
The user space code then expected to get another SIGIO for subsequent
writes - even though the pipe had been readable the whole time - and
would only then read more.
This is arguably a user space bug - and Colin King already fixed the
stress-ng code in question - but the kernel regression rules are clear:
it doesn't matter if kernel people think that user space did something
silly and wrong. What matters is that it used to work.
So if user space depends on specific historical kernel behavior, it's a
regression when that behavior changes. It's on us: we were silly to
have that non-optimal historical behavior, and our old kernel behavior
was what user space was tested against.
Because of how the FASYNC notification was tied to wakeup behavior, this
was first broken by commits f467a6a664 and 1b6b26ae70 ("pipe: fix
and clarify pipe read/write wakeup logic"), but at the time it seems
nobody noticed. Probably because the stress-ng problem case ends up
being timing-dependent too.
It was then unwittingly fixed by commit 3a34b13a88 ("pipe: make pipe
writes always wake up readers") only to be broken again when by commit
3b844826b6 ("pipe: avoid unnecessary EPOLLET wakeups under normal
loads").
And at that point the kernel test robot noticed the performance
refression in the stress-ng.sigio.ops_per_sec case. So the "Fixes" tag
below is somewhat ad hoc, but it matches when the issue was noticed.
Fix it for good (knock wood) by simply making the kill_fasync() case
separate from the wakeup case. FASYNC is quite rare, and we clearly
shouldn't even try to use the "avoid unnecessary wakeups" logic for it.
Link: https://lore.kernel.org/lkml/20210824151337.GC27667@xsang-OptiPlex-9020/
Fixes: 3b844826b6 ("pipe: avoid unnecessary EPOLLET wakeups under normal loads")
Reported-by: kernel test robot <oliver.sang@intel.com>
Tested-by: Oliver Sang <oliver.sang@intel.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull ucount fixes from Eric Biederman:
"This branch fixes a regression that made it impossible to increase
rlimits that had been converted to the ucount infrastructure, and also
fixes a reference counting bug where the reference was not incremented
soon enough.
The fixes are trivial and the bugs have been encountered in the wild,
and the fixes have been tested"
* 'for-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
ucounts: Increase ucounts reference counter before the security hook
ucounts: Fix regression preventing increasing of rlimits in init_user_ns
dip_idx and dgid should be a one-to-one mapping relationship, but when
qp_num loops back to the start number, it may happen that two different
dgid are assiociated to the same dip_idx incorrectly.
One solution is to store the qp_num that is not assigned to dip_idx in an
array. When a dip_idx needs to be allocated to a new dgid, an spare qp_num
is extracted and assigned to dip_idx.
Fixes: f91696f2f0 ("RDMA/hns: Support congestion control type selection according to the FW")
Link: https://lore.kernel.org/r/1629884592-23424-4-git-send-email-liangwenpeng@huawei.com
Signed-off-by: Junxian Huang <huangjunxian4@hisilicon.com>
Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>