For requeue PI it's required to establish PI state for the PI futex to
which waiters are requeued. This either acquires the user space futex on
behalf of the top most waiter on the inner 'waitqueue' futex, or attaches to
the PI state of an existing waiter, or creates on attached to the owner of
the futex.
This code can retry in case of failure, but retry can never happen when the
pi state was successfully created. The condition to run this code is:
(task_count - nr_wake) < nr_requeue
which is always true because:
task_count = 0
nr_wake = 1
nr_requeue >= 0
Remove it completely.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210815211305.362730187@linutronix.de
When requeuing to a PI futex, then the requeue code tries to trylock the PI
futex on behalf of the topmost waiter on the inner 'waitqueue' futex. If
that succeeds, then PI state has to be allocated in order to requeue further
waiters to the PI futex.
The comment and the code are confusing, as the PI state allocation uses
lookup_pi_state(), which either attaches to an existing waiter or to the
owner. As the PI futex was just acquired, there cannot be a waiter on the
PI futex because the hash bucket lock is held.
Clarify the comment and use attach_to_pi_owner() directly. As the task on
which behalf the PI futex has been acquired is guaranteed to be alive and
not exiting, this call must succeed. Add a WARN_ON() in case that fails.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210815211305.305142462@linutronix.de
As follow-up to the discussion with Jakub Kicinski about iavf locking
being insufficient [1] convert iavf to use mutexes instead of bitops.
The locking logic is kept as is, just a drop-in replacement of
enum iavf_critical_section_t with separate mutexes.
The only difference is that the mutexes will be destroyed before the
module is unloaded.
[1] https://lwn.net/ml/netdev/20210316150210.00007249%40intel.com/
Signed-off-by: Stefan Assmann <sassmann@kpanic.de>
Tested-by: Marek Szlosek <marek.szlosek@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Since 2431c4f5b4 ("mtd: Implement mtd_{read,write}() as wrappers
around mtd_{read,write}_oob()") don't allow _write|_read and
_write_oob|_read_oob existing at the same time, we should check the
existence of callbacks "_read and _write" from subdev's master device
(We can trust master device since it has been registered) before
assigning, otherwise following warning occurs while making
concatenated device:
WARNING: CPU: 2 PID: 6728 at drivers/mtd/mtdcore.c:595
add_mtd_device+0x7f/0x7b0
Fixes: 2431c4f5b4 ("mtd: Implement mtd_{read,write}() around ...")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Link: https://lore.kernel.org/linux-mtd/20210817114857.2784825-3-chengzhihao1@huawei.com
Since commit 46b5889cc2c5("mtd: implement proper partition handling")
applied, mtd partition device won't hold some callback functions, such
as _block_isbad, _block_markbad, etc. Besides, function mtd_block_isbad()
will get mtd device's master mtd device, then invokes master mtd device's
callback function. So, following process may result mtd_block_isbad()
always return 0, even though mtd device has bad blocks:
1. Split a mtd device into 3 partitions: PA, PB, PC
[ Each mtd partition device won't has callback function _block_isbad(). ]
2. Concatenate PA and PB as a new mtd device PN
[ mtd_concat_create() finds out each subdev has no callback function
_block_isbad(), so PN won't be assigned callback function
concat_block_isbad(). ]
Then, mtd_block_isbad() checks "!master->_block_isbad" is true, will
always return 0.
Reproducer:
// reproduce.c
static int __init init_diy_module(void)
{
struct mtd_info *mtd[2];
struct mtd_info *mtd_combine = NULL;
mtd[0] = get_mtd_device_nm("NAND simulator partition 0");
if (!mtd[0]) {
pr_err("cannot find mtd1\n");
return -EINVAL;
}
mtd[1] = get_mtd_device_nm("NAND simulator partition 1");
if (!mtd[1]) {
pr_err("cannot find mtd2\n");
return -EINVAL;
}
put_mtd_device(mtd[0]);
put_mtd_device(mtd[1]);
mtd_combine = mtd_concat_create(mtd, 2, "Combine mtd");
if (mtd_combine == NULL) {
pr_err("combine failed\n");
return -EINVAL;
}
mtd_device_register(mtd_combine, NULL, 0);
pr_info("Combine success\n");
return 0;
}
1. ID="0x20,0xac,0x00,0x15"
2. modprobe nandsim id_bytes=$ID parts=50,100 badblocks=100
3. insmod reproduce.ko
4. flash_erase /dev/mtd3 0 0
libmtd: error!: MEMERASE64 ioctl failed for eraseblock 100 (mtd3)
error 5 (Input/output error)
// Should be "flash_erase: Skipping bad block at 00c80000"
Fixes: 46b5889cc2 ("mtd: implement proper partition handling")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Link: https://lore.kernel.org/linux-mtd/20210817114857.2784825-2-chengzhihao1@huawei.com
I proposed this change 16 years ago before discard was a feature in
the block layer: https://lwn.net/Articles/162776/
Now that the block layer has discard, we can finally merge this change.
Discard is also known as trim. By implementing discard, both fstrim and
the discard filesystem option can be used.
Implementing discard in the ftl means that when files are removed, there
is less data in the ftl mapping. This means less stuff to move around for
erasing and also less erasing to do; this means improved wear levelling
and improved performance.
Signed-off-by: Sean Young <sean@mess.org>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Link: https://lore.kernel.org/linux-mtd/20210807214538.14484-3-sean@mess.org
There is a surprisingly large number of tutorials
that suggest using mtdblock to mount SquashFS filesystems
on flash devices, including NAND devices.
This approach is suboptimal than using UBI. If the flash device
is NAND, this is specially true, due to wear leveling, bit-flips and
badblocks. In this case UBI is strongly preferred, so be nice to users
and print a warning suggesting to consider UBI block, if mtdblock
is added for a NAND device.
Signed-off-by: Ezequiel Garcia <ezequiel@collabora.com>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Link: https://lore.kernel.org/linux-mtd/20210801234509.18774-8-ezequiel@collabora.com
Current PCIe MEM space of size 16 MB is not enough for some combination
of PCIe cards (e.g. NVMe disk together with ath11k wifi card). ARM Trusted
Firmware for Armada 3700 platform already assigns 128 MB for PCIe window,
so extend PCIe MEM space to the end of 128 MB PCIe window which allows to
allocate more PCIe BARs for more PCIe cards.
Without this change some combination of PCIe cards cannot be used and
kernel show error messages in dmesg during initialization:
pci 0000:00:00.0: BAR 8: no space for [mem size 0x01800000]
pci 0000:00:00.0: BAR 8: failed to assign [mem size 0x01800000]
pci 0000:00:00.0: BAR 6: assigned [mem 0xe8000000-0xe80007ff pref]
pci 0000:01:00.0: BAR 8: no space for [mem size 0x01800000]
pci 0000:01:00.0: BAR 8: failed to assign [mem size 0x01800000]
pci 0000:02:03.0: BAR 8: no space for [mem size 0x01000000]
pci 0000:02:03.0: BAR 8: failed to assign [mem size 0x01000000]
pci 0000:02:07.0: BAR 8: no space for [mem size 0x00100000]
pci 0000:02:07.0: BAR 8: failed to assign [mem size 0x00100000]
pci 0000:03:00.0: BAR 0: no space for [mem size 0x01000000 64bit]
pci 0000:03:00.0: BAR 0: failed to assign [mem size 0x01000000 64bit]
Due to bugs in U-Boot port for Turris Mox, the second range in Turris Mox
kernel DTS file for PCIe must start at 16 MB offset. Otherwise U-Boot
crashes during loading of kernel DTB file. This bug is present only in
U-Boot code for Turris Mox and therefore other Armada 3700 devices are not
affected by this bug. Bug is fixed in U-Boot version 2021.07.
To not break booting new kernels on existing versions of U-Boot on Turris
Mox, use first 16 MB range for IO and second range with rest of PCIe window
for MEM.
Signed-off-by: Pali Rohár <pali@kernel.org>
Fixes: 76f6386b25 ("arm64: dts: marvell: Add Aardvark PCIe support for Armada 3700")
Signed-off-by: Gregory CLEMENT <gregory.clement@bootlin.com>
Ensure all !RT tasks have the same prio such that they end up in FIFO
order and aren't split up according to nice level.
The reason why nice levels were taken into account so far is historical. In
the early days of the rtmutex code it was done to give the PI boosting and
deboosting a larger coverage.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210815211303.938676930@linutronix.de
Similar to rw_semaphores, on RT the rwlock substitution is not writer fair,
because it's not feasible to have a writer inherit its priority to
multiple readers. Readers blocked on a writer follow the normal rules of
priority inheritance. Like RT spinlocks, RT rwlocks are state preserving
across the slow lock operations (contended case).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210815211303.882793524@linutronix.de
This should use the network-namespace-wide client_lock, not the
per-client cl_lock.
You shouldn't see any bugs unless you're actually using the
forced-expiry interface introduced by 89c905becc.
Fixes: 89c905becc "nfsd: allow forced expiration of NFSv4 clients"
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The failure case here should be rare, but it's obviously wrong.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
nsm_use_hostnames is a module parameter and it will be exported to sysctl
procfs. This is to let user sometimes change it from userspace. But the
minimal unit for sysctl procfs read/write it sizeof(int).
In big endian system, the converting from/to bool to/from int will cause
error for proc items.
This patch use a new proc_handler proc_dobool to fix it.
Signed-off-by: Jia He <hejianet@gmail.com>
Reviewed-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
[thuth: Fix typo in commit message]
Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
This is to let bool variable could be correctly displayed in
big/little endian sysctl procfs. sizeof(bool) is arch dependent,
proc_dobool should work in all arches.
Suggested-by: Pan Xinhui <xinhui@linux.vnet.ibm.com>
Signed-off-by: Jia He <hejianet@gmail.com>
[thuth: rebased the patch to the current kernel version]
Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Some paths through svc_process() leave rqst->rq_procinfo set to
NULL, which triggers a crash if tracing happens to be enabled.
Fixes: 89ff87494c ("SUNRPC: Display RPC procedure names instead of proc numbers")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Including one's name in copyright claims is appropriate. Including it
in random comments is just vanity. After 2 decades, it is time for
these to be gone.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Relieve contention on sc_rw_ctxt_lock by converting rdma->sc_rw_ctxts
to an llist.
The goal is to reduce the average overhead of Send completions,
because a transport's completion handlers are single-threaded on
one CPU core. This change reduces CPU utilization of each Send
completion by 2-3% on my server.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-By: Tom Talpey <tom@talpey.com>
/proc/lock_stat indicates the the sc_send_lock is heavily
contended when the server is under load from a single client.
To address this, convert the send_ctxt free list to an llist.
Returning an item to the send_ctxt cache is now waitless, which
reduces the instruction path length in the single-threaded Send
handler (svc_rdma_wc_send).
The goal is to enable the ib_comp_wq worker to handle a higher
RPC/RDMA Send completion rate given the same CPU resources. This
change reduces CPU utilization of Send completion by 2-3% on my
server.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-By: Tom Talpey <tom@talpey.com>