The nvme multipath error handling defaults to controller reset if the
error is unknown. There are, however, no existing nvme status codes that
indicate a reset should be used, and resetting causes unnecessary
disruption to the rest of IO.
Change nvme's error handling to first check if failover should happen.
If not, let the normal error handling take over rather than reset the
controller.
Based-on-a-patch-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: John Meneghini <johnm@netapp.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Current nvmet-rdma code allocates MR pool budget based on queue size,
assuming both host and target use the same "max_pages_per_mr" count.
After limiting the mdts value for RDMA controllers, we know the factor
of maximum MR's per IO operation. Thus, make sure MR pool will be
sufficient for the required IO depth and IO size.
That is, say host's SQ size is 100, then the MR pool budget allocated
currently at target will also be 100 MRs. But 100 IO WRITE Requests
with 256 sg_count(IO size above 1MB) require 200 MRs when target's
"max_pages_per_mr" is 128.
Reported-by: Krishnamraju Eraparaju <krishna2@chelsio.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Set the maximal data transfer size to be 1MB (currently mdts is
unlimited). This will allow calculating the amount of MR's that
one ctrl should allocate to fulfill it's capabilities.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Some transports, such as RDMA, would like to set the Maximum Data
Transfer Size (MDTS) according to device/port/ctrl characteristics.
This will enable the transport to set the optimal MDTS according to
controller needs and device capabilities. Add a new nvmet transport
op that is called during ctrl identification. This will not effect
transports that don't implement this option. The return value of the new
op is according to the NVMe spec definition for MDTS.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
If we failed to receive data from the socket, don't try
to further process it, we will for sure be handling a queue
error at this point. While no issue was seen with the
current behavior thus far, its safer to cease socket processing
if we detected an error.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Consolidate the request failure handling code to where
it is being fetched (nvme_tcp_try_send).
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
MAXH2CDATA is not zero based. Also no reason to limit ourselves to
1M transfers as we can do more easily. Make this an arbitrary limit
of 16M.
Reported-by: Wenhua Liu <liuw@vmware.com>
Cc: stable@vger.kernel.org # v5.0+
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Currently, queue io_cpu assignment is done sequentially for default,
read and poll queues based on queue id. This causes miss-alignment between
context of CPU initiating I/O and the I/O worker thread processing
queued requests or completions.
Change to modify queue io_cpu assignment to take into account queue
maps offset. Each queue io_cpu will start at zero for each queue map.
This essentially aligns read/poll queues to start over the same range as
default queues.
Testing performed by Mark with:
- ram device (nvmet)
- single CPU core (pinned)
- 100% 4k reads
- engine io_uring (not using sq_thread option)
- hipri flag set
Micro-benchmark results show a net gain of:
- increase of 18%-29% in IOPs
- reduction of 16%-22% in average latency
- reduction of 7%-23% in 99.99% latency
Baseline:
========
QDepth/Batch | IOPs [k] | Avg. Lat [us] | 99.99% Lat [us]
-----------------------------------------------------------------
1/1 | 32.4 | 30.11 | 50.94
32/8 | 179 | 168.20 | 371
CPU alignment:
=============
QDepth/Batch | IOPs [k] | Avg. Lat [us] | 99.99% Lat [us]
-----------------------------------------------------------------
1/1 | 38.5 | 25.18 | 39.16
32/8 | 231 | 130.75 | 343
Reported-by: Mark Wunderlich <mark.wunderlich@intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
The timeout handler can use the existing nvme_poll() if it needs to
check a polled queue, allowing nvme_poll_irqdisable() to handle only
irq driven queues for the remaining callers.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Completion handling had been done in two steps: find all new completions
under a lock, then handle those completions outside the lock. This was
done to make the locked section as short as possible so that other
threads using the same lock wait less time.
The driver no longer shares locks during completion, and is in fact
lockless for interrupt driven queues, so the optimization no longer
serves its original purpose. Replace the two-pass completion queue
handler with a single pass that completes entries immediately.
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
The only user for tagged completion was for timeout handling. That user,
though, really only cares if the timed out command is completed, which
we can safely check within the timeout handler.
Remove the tag check to simplify completion handling.
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
For set feature command when setting up NVME_FEAT_NUM_QUEUES, check
Number of I/O Completion Queues Requested (NCQR) and Number of I/O
Submission Queues Requested (NSQR) before we proceed, for invalid values
(i.e. 65535) return an appropriate NVMe invalid field status.
Signed-off-by: Amit Engel <Amit.Engel@dell.com>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
After initialization, nvme_wait_ready checks for readiness every 100ms,
even though the drive may be ready far sooner than that. This delays
system boot by hundreds of milliseconds. Reduce the delay, checking for
readiness every millisecond instead.
Boot-time tests on an AWS c5.12xlarge:
Before:
[ 0.546936] initcall nvme_init+0x0/0x5b returned 0 after 37 usecs
...
[ 0.764178] nvme nvme0: 2/0/0 default/read/poll queues
[ 0.768424] nvme0n1: p1
[ 0.774132] EXT4-fs (nvme0n1p1): mounted filesystem with ordered data mode. Opts: (null)
[ 0.774146] VFS: Mounted root (ext4 filesystem) on device 259:1.
...
[ 0.788141] Run /sbin/init as init process
After:
[ 0.537088] initcall nvme_init+0x0/0x5b returned 0 after 37 usecs
...
[ 0.543457] nvme nvme0: 2/0/0 default/read/poll queues
[ 0.548473] nvme0n1: p1
[ 0.554339] EXT4-fs (nvme0n1p1): mounted filesystem with ordered data mode. Opts: (null)
[ 0.554344] VFS: Mounted root (ext4 filesystem) on device 259:1.
...
[ 0.567931] Run /sbin/init as init process
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Log the controller status to know more about issue if it
lies within kernel nvme subsytem or controller is unhealthy.
Signed-off-by: Rupesh Girase <rgirase@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulakrni@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
The function nvme_identify_ns_desc() has 3 levels of nesting which make
error message to exceeded > 80 char per line which is not aligned with
the kernel code standards and rest of the NVMe subsystem code.
Add a helper function to move the processing of the log when the
command is successful by reducing the nesting and keeping the
code < 80 char per line.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
I see no good reason for the "If unsure, say N" advice in the description
of the NVME_HWMON configuration option. It is not dangerous, it does
not select any other option, and has a fairly low overhead.
As the option is already not enabled by default, further suggesting
hesitant users to not enable it is not useful anyway. Unlike some other
options where the description alone may not be sufficient for users to
make a decision, NVME_HWMON is pretty simple to grasp in my opinion,
so just let the user do what they want.
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
We allow userspace to connect with a custom hostid which is useful for
certain use-cases. However there is is no way to tell what is the hostid
used to connect to a given controller.
Expose this so userspace can correlate controllers based on hostid.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
We allow userspace to connect with a custom hostnqn which is useful for
certain use-cases. However there is no way to tell what is the hostnqn
used to connect to a given controller.
Expose this so userspace can correlate controllers based on hostnqn.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Enable ability to associate all sockets related to NVMf TCP traffic
to a priority group that will perform optimized network processing for
this traffic class. Maintain initial default behavior of using priority
of zero.
Signed-off-by: Kiran Patil <kiran.patil@intel.com>
Signed-off-by: Mark Wunderlich <mark.wunderlich@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Enable ability to associate all sockets related to NVMf TCP traffic
to a priority group that will perform optimized network processing for
this traffic class. Maintain initial default behavior of using priority
of zero.
Signed-off-by: Kiran Patil <kiran.patil@intel.com>
Signed-off-by: Mark Wunderlich <mark.wunderlich@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
For nvmet in configfs.c we check return values for all the sscanf()
calls. Add similar check into the nvmet_subsys_attr_serial_store().
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
This patch adds a new target subsys attribute which allows user to
optionally specify model name which then used in the
nvmet_execute_identify_ctrl() to fill up the nvme_id_ctrl structure.
The default value for the model is set to "Linux" for backward
compatibility.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Mark Ruijter <MRuijter@onestopsystems.com>
[chaitanya.kulkarni@wdc.com
*Use macro for default model, coding style fixes.
*Use RCU for accessing model in for configfs and in
nvmet_execute_identify_ctrl().
]
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
This patch adds a new target subsys attribute which allows user to
optionally specify target controller IDs which then used in the
nvmet_execute_identify_ctrl() to fill up the nvme_id_ctrl structure.
For example, when using a cluster setup with two nodes, with a dual
ported NVMe drive and exporting the drive from both the nodes,
The connection to the host fails due to the same controller ID and
results in the following error message:-
"nvme nvmeX: Duplicate cntlid XXX with nvmeX, rejecting"
With this patch now user can partition the controller IDs for each
subsystem by setting up the cntlid_min and cntlid_max. These values
will be used at the time of the controller ID creation. By partitioning
the ctrl-ids for each subsystem results in the unique ctrl-id space
which avoids the collision.
When new attribute is not specified target will fall back to original
cntlid calculation method.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
This is a pure code cleanup patch which does not change any
functionality. This patch removes the extra lines, get rid of
else which is duplicate for return.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
The return code of nvme_alloc_ns is never used, so change it
to void.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Edmund Nadolski <edmund.nadolski@intel.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl5dnvEACgkQiiy9cAdy
T1FaWAv/XnyYfYh6H4fhtgtfNxW9xt9mkHo/AohHcf2rk2erqjVz0lHVe7SuS9C5
EpDYnijZKa//aiIV6VzDymPaMrXQZ+oCAExAzLPmWZnLeZ65Q02K2P1F3KvURdue
4nLjuOyzyG4YYkoBi4wKneu1Ji377m9L6BpSfM+MzPScCOl8OV/vv/nBRY1N6gIY
Rreq5iipRaDhifsaOgiA501sUu7mvpPEHNpluCtFmY4iTHQzYqjWZ5ZGXr2xz63n
5VV8KWWn/p3nhJGt7L/1aynws59AdEd5GNZ5FbDQHokx9n3MMnyl4QGDzUehnhlY
Ym6n50QA5QMn9I9NLg8I2aD6z4vNIj9kZxersoHduf4UsA9CyPaucUIyV81mt683
AZIqtz8H21fgJXOQ3nv4uNc8Yyt1SGQfFDo1EfphwLl6LaE8rx3CFEnVoNLM+jqb
nyRB/NxLtDWVQhYM8Bg/TP7iMqknHtarfZirv48LFdXLlhb83+qpSSHy0zVy9vli
y/0B7rEI
=zLW4
-----END PGP SIGNATURE-----
Merge tag '5.6-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Five small cifs/smb3 fixes, two for stable (one for a reconnect
problem and the other fixes a use case when renaming an open file)"
* tag '5.6-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Use #define in cifs_dbg
cifs: fix rename() by ensuring source handle opened with DELETE bit
cifs: add missing mount option to /proc/mounts
cifs: fix potential mismatch of UNC paths
cifs: don't leak -EAGAIN for stat() during reconnect
Pull x86 fixes from Ingo Molnar:
"Misc fixes: a pkeys fix for a bug that triggers with weird BIOS
settings, and two Xen PV fixes: a paravirt interface fix, and
pagetable dumping fix"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Fix dump_pagetables with Xen PV
x86/ioperm: Add new paravirt function update_io_bitmap()
x86/pkeys: Manually set X86_FEATURE_OSPKE to preserve existing changes
Pull scheduler fix from Ingo Molnar:
"Fix a scheduler statistics bug"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Fix statistics for find_idlest_group()
Pull perf fixes from Ingo Molnar:
"No kernel side changes, all tooling fixes plus two tooling cleanups
that were committed late in the merge window alongside the perf
annotate fixes, delayed by Arnaldo's European trip"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
perf annotate: Fix segfault with source toggle
perf annotate: Align struct annotate_args
perf annotate: Simplify disasm_line allocation and freeing code
perf annotate: Remove privsize from symbol__annotate() args
perf probe: Check return value of strlist__add() for -ENOMEM
perf config: Document missing config options
perf annotate: Fix perf config option description
perf annotate: Prefer cmdline option over default config
perf annotate: Make perf config effective
perf config: Introduce perf_config_u8()
perf annotate: Fix --show-nr-samples for tui/stdio2
perf annotate: Fix --show-total-period for tui/stdio2
perf annotate/tui: Re-render title bar after switching back from script browser
tools headers UAPI: Update tools's copy of kvm.h headers
tools arch x86: Sync the msr-index.h copy with the kernel sources
perf arch powerpc: Sync powerpc syscall.tbl with the kernel sources
perf auxtrace: Add auxtrace_record__read_finish()
perf arm-spe: Fix endless record after being terminated
perf cs-etm: Fix endless record after being terminated
perf intel-bts: Fix endless record after being terminated
...
Pull EFI fixes from Ingo Molnar:
"Three fixes to EFI mixed boot mode, mostly related to x86-64 vmap
stacks activated years ago, bug-fixed recently for EFI, which had
knock-on effects of various 1:1 mapping assumptions in mixed mode.
There's also a READ_ONCE() fix for reading an mmap-ed EFI firmware
data field only once, out of caution"
* 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efi: READ_ONCE rng seed size before munmap
efi/x86: Handle by-ref arguments covering multiple pages in mixed mode
efi/x86: Remove support for EFI time and counter services in mixed mode
efi/x86: Align GUIDs to their size in the mixed mode runtime wrapper
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAl5cMPoACgkQ8vlZVpUN
gaNYmgf/WX4/jMSYQu2fICudCqLr5fkLqsybvYGZGei3F8BaJ90zohQAQybNznWS
iyF0JzrOp37b/o0haz7KfDr7xVB3lAVsKu9Bglq+zL8mc9IkPmjhCXuLbknUtOUw
j3aVdntt4d6S3szbtP4PIZxNqh+/4KJDS2soWvuNWRpYMOv2yoMClptWWQtsimAt
3fYpxasSz0Jrhtbuf+I1oID++wOycDT3RKiko5tpLlQiFVoKBzfou+0ZdkC4+UIl
KvcpMBm1ijdGAaN9jfb2L2KCY5UdSvmeVui3sMXtHBEpKMJl2QsClylR1wGfgBKi
+YMEsjBONxKo3kH2DaPJaU6LEm8JuQ==
=rszH
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"Two more bug fixes (including a regression) for 5.6"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: potential crash on allocation error in ext4_alloc_flex_bg_array()
jbd2: fix data races at struct journal_head
as too large frame sizes on some configurations. On the
ARM side, the compiler was messing up shadow stacks between
EL1 and EL2 code, which is easily fixed with __always_inline.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJeXAT4AAoJEL/70l94x66DWywH/1kv4MmeGo6PI0Nxk/yvA7X8
78iqIBchtxZX0v/9kqpTB7bYmHyTgmZHM+IkwtIUANDSaOvWqJwU+TLUfduOiuXF
NxBHcZDyuMoftX5CSQ+bJ5PwxKijAdJsIkCZ13CnsTCkwcfamSGypFUCK8LacPeq
WHvV5Ws5pFc51xrP3CH1DrRhLoulaBmt5xxqK9fxWtslrlsnm1uNza5vs8As8CzM
apnmdRIf5p4v91Zic3PFH7/GXES0m1tjIBKdtZ4YHb8yrXV/kBsEVhhTjqE9mrUq
qtRRl5waOFoP4yc9ey52PAbMm1x1Ho/pyunpM0xh40Yq8OPFwqXBPTnWfobSoiM=
=LNQc
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
"More bugfixes, including a few remaining "make W=1" issues such as too
large frame sizes on some configurations.
On the ARM side, the compiler was messing up shadow stacks between EL1
and EL2 code, which is easily fixed with __always_inline"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: VMX: check descriptor table exits on instruction emulation
kvm: x86: Limit the number of "kvm: disabled by bios" messages
KVM: x86: avoid useless copy of cpufreq policy
KVM: allow disabling -Werror
KVM: x86: allow compiling as non-module with W=1
KVM: Pre-allocate 1 cpumask variable per cpu for both pv tlb and pv ipis
KVM: Introduce pv check helpers
KVM: let declaration of kvm_get_running_vcpus match implementation
KVM: SVM: allocate AVIC data structures based on kvm_amd module parameter
arm64: Ask the compiler to __always_inline functions used by KVM at HYP
KVM: arm64: Define our own swab32() to avoid a uapi static inline
KVM: arm64: Ask the compiler to __always_inline functions used at HYP
kvm: arm/arm64: Fold VHE entry/exit work into kvm_vcpu_run_vhe()
KVM: arm/arm64: Fix up includes for trace.h
KVM emulates UMIP on hardware that doesn't support it by setting the
'descriptor table exiting' VM-execution control and performing
instruction emulation. When running nested, this emulation is broken as
KVM refuses to emulate L2 instructions by default.
Correct this regression by allowing the emulation of descriptor table
instructions if L1 hasn't requested 'descriptor table exiting'.
Fixes: 07721feee4 ("KVM: nVMX: Don't emulate instructions in guest mode")
Reported-by: Jan Kiszka <jan.kiszka@web.de>
Cc: stable@vger.kernel.org
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Oliver Upton <oupton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pull i2c fixes from Wolfram Sang:
"I2C has three driver bugfixes for you. We agreed on the Mac regression
to go in via I2C"
* 'i2c/for-current-fixed' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
macintosh: therm_windtunnel: fix regression when instantiating devices
i2c: altera: Fix potential integer overflow
i2c: jz4780: silence log flood on txabrt
If sbi->s_flex_groups_allocated is zero and the first allocation fails
then this code will crash. The problem is that "i--" will set "i" to
-1 but when we compare "i >= sbi->s_flex_groups_allocated" then the -1
is type promoted to unsigned and becomes UINT_MAX. Since UINT_MAX
is more than zero, the condition is true so we call kvfree(new_groups[-1]).
The loop will carry on freeing invalid memory until it crashes.
Fixes: 7c990728b9 ("ext4: fix potential race between s_flex_groups online resizing and access")
Reviewed-by: Suraj Jitindar Singh <surajjs@amazon.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20200228092142.7irbc44yaz3by7nb@kili.mountain
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Removing attach_adapter from this driver caused a regression for at
least some machines. Those machines had the sensors described in their
DT, too, so they didn't need manual creation of the sensor devices. The
old code worked, though, because manual creation came first. Creation of
DT devices then failed later and caused error logs, but the sensors
worked nonetheless because of the manually created devices.
When removing attach_adaper, manual creation now comes later and loses
the race. The sensor devices were already registered via DT, yet with
another binding, so the driver could not be bound to it.
This fix refactors the code to remove the race and only manually creates
devices if there are no DT nodes present. Also, the DT binding is updated
to match both, the DT and manually created devices. Because we don't
know which device creation will be used at runtime, the code to start
the kthread is moved to do_probe() which will be called by both methods.
Fixes: 3e7bed5271 ("macintosh: therm_windtunnel: drop using attach_adapter")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=201723
Reported-by: Erhard Furtner <erhard_f@mailbox.org>
Tested-by: Erhard Furtner <erhard_f@mailbox.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
Cc: stable@kernel.org # v4.19+
journal_head::b_transaction and journal_head::b_next_transaction could
be accessed concurrently as noticed by KCSAN,
LTP: starting fsync04
/dev/zero: Can't open blockdev
EXT4-fs (loop0): mounting ext3 file system using the ext4 subsystem
EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null)
==================================================================
BUG: KCSAN: data-race in __jbd2_journal_refile_buffer [jbd2] / jbd2_write_access_granted [jbd2]
write to 0xffff99f9b1bd0e30 of 8 bytes by task 25721 on cpu 70:
__jbd2_journal_refile_buffer+0xdd/0x210 [jbd2]
__jbd2_journal_refile_buffer at fs/jbd2/transaction.c:2569
jbd2_journal_commit_transaction+0x2d15/0x3f20 [jbd2]
(inlined by) jbd2_journal_commit_transaction at fs/jbd2/commit.c:1034
kjournald2+0x13b/0x450 [jbd2]
kthread+0x1cd/0x1f0
ret_from_fork+0x27/0x50
read to 0xffff99f9b1bd0e30 of 8 bytes by task 25724 on cpu 68:
jbd2_write_access_granted+0x1b2/0x250 [jbd2]
jbd2_write_access_granted at fs/jbd2/transaction.c:1155
jbd2_journal_get_write_access+0x2c/0x60 [jbd2]
__ext4_journal_get_write_access+0x50/0x90 [ext4]
ext4_mb_mark_diskspace_used+0x158/0x620 [ext4]
ext4_mb_new_blocks+0x54f/0xca0 [ext4]
ext4_ind_map_blocks+0xc79/0x1b40 [ext4]
ext4_map_blocks+0x3b4/0x950 [ext4]
_ext4_get_block+0xfc/0x270 [ext4]
ext4_get_block+0x3b/0x50 [ext4]
__block_write_begin_int+0x22e/0xae0
__block_write_begin+0x39/0x50
ext4_write_begin+0x388/0xb50 [ext4]
generic_perform_write+0x15d/0x290
ext4_buffered_write_iter+0x11f/0x210 [ext4]
ext4_file_write_iter+0xce/0x9e0 [ext4]
new_sync_write+0x29c/0x3b0
__vfs_write+0x92/0xa0
vfs_write+0x103/0x260
ksys_write+0x9d/0x130
__x64_sys_write+0x4c/0x60
do_syscall_64+0x91/0xb05
entry_SYSCALL_64_after_hwframe+0x49/0xbe
5 locks held by fsync04/25724:
#0: ffff99f9911093f8 (sb_writers#13){.+.+}, at: vfs_write+0x21c/0x260
#1: ffff99f9db4c0348 (&sb->s_type->i_mutex_key#15){+.+.}, at: ext4_buffered_write_iter+0x65/0x210 [ext4]
#2: ffff99f5e7dfcf58 (jbd2_handle){++++}, at: start_this_handle+0x1c1/0x9d0 [jbd2]
#3: ffff99f9db4c0168 (&ei->i_data_sem){++++}, at: ext4_map_blocks+0x176/0x950 [ext4]
#4: ffffffff99086b40 (rcu_read_lock){....}, at: jbd2_write_access_granted+0x4e/0x250 [jbd2]
irq event stamp: 1407125
hardirqs last enabled at (1407125): [<ffffffff980da9b7>] __find_get_block+0x107/0x790
hardirqs last disabled at (1407124): [<ffffffff980da8f9>] __find_get_block+0x49/0x790
softirqs last enabled at (1405528): [<ffffffff98a0034c>] __do_softirq+0x34c/0x57c
softirqs last disabled at (1405521): [<ffffffff97cc67a2>] irq_exit+0xa2/0xc0
Reported by Kernel Concurrency Sanitizer on:
CPU: 68 PID: 25724 Comm: fsync04 Tainted: G L 5.6.0-rc2-next-20200221+ #7
Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
The plain reads are outside of jh->b_state_lock critical section which result
in data races. Fix them by adding pairs of READ|WRITE_ONCE().
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Qian Cai <cai@lca.pw>
Link: https://lore.kernel.org/r/20200222043111.2227-1-cai@lca.pw
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Four small fixes. Three are in drivers for fairly obvious bugs. The
fourth is a set of regressions introduced by the compat_ioctl changes
because some of the compat updates wrongly replaced .ioctl instead of
.compat_ioctl.
Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
-----BEGIN PGP SIGNATURE-----
iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCXlpxDCYcamFtZXMuYm90
dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishSXsAPwOGPkU
ObFbUs75Tdmk1M7jqtxgBsNhuNta0S8d7dJ3aAEA/YBtGGQWoeEGivUKwzwA4cwL
1w1GbhPEblpMNO8keVA=
=I7qk
-----END PGP SIGNATURE-----
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Four small fixes.
Three are in drivers for fairly obvious bugs. The fourth is a set of
regressions introduced by the compat_ioctl changes because some of the
compat updates wrongly replaced .ioctl instead of .compat_ioctl"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: compat_ioctl: cdrom: Replace .ioctl with .compat_ioctl in four appropriate places
scsi: zfcp: fix wrong data and display format of SFP+ temperature
scsi: sd_sbc: Fix sd_zbc_report_zones()
scsi: libfc: free response frame from GPN_ID
Commit 2ae27137b2 ("x86: mm: convert dump_pagetables to use
walk_page_range") broke Xen PV guests as the hypervisor reserved hole in
the memory map was not taken into account.
Fix that by starting the kernel range only at GUARD_HOLE_END_ADDR.
Fixes: 2ae27137b2 ("x86: mm: convert dump_pagetables to use walk_page_range")
Reported-by: Julien Grall <julien@xen.org>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Julien Grall <julien@xen.org>
Link: https://lkml.kernel.org/r/20200221103851.7855-1-jgross@suse.com
Commit 111e7b15cf ("x86/ioperm: Extend IOPL config to control ioperm()
as well") reworked the iopl syscall to use I/O bitmaps.
Unfortunately this broke Xen PV domains using that syscall as there is
currently no I/O bitmap support in PV domains.
Add I/O bitmap support via a new paravirt function update_io_bitmap which
Xen PV domains can use to update their I/O bitmaps via a hypercall.
Fixes: 111e7b15cf ("x86/ioperm: Extend IOPL config to control ioperm() as well")
Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Cc: <stable@vger.kernel.org> # 5.5
Link: https://lkml.kernel.org/r/20200218154712.25490-1-jgross@suse.com
perf annotate:
Ravi Bangoria:
- Fix segfault with source toggle.
- Fix --show-total-period and --show-nr-samples for tui/stdio2.
- Fix handling of settings in ~/.perfconfig versus the ones passed
in the command line
- Re-render title bar after switching back from script browser.
- Fix options man page, document some missing ones.
perf probe:
He Zhe:
- Check return value of strlist__add() for -ENOMEM.
tools UAPI:
Arnaldo Carvalho de Melo:
- Sync x86's msr-index.h copy with the kernel sources.
- Update tools's copy of x86's kvm.h headers.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQR2GiIUctdOfX2qHhGyPKLppCJ+JwUCXlkcOwAKCRCyPKLppCJ+
J36pAP9bTih+YOItpzt2hX+j3yshW2DAhc53h5McYkJnX4oLOAD+Jo+OQgwp3ort
ceQOKzghA+P0QB7b8Ntbi34qH26OzAE=
=OQg+
-----END PGP SIGNATURE-----
Merge tag 'perf-urgent-for-mingo-5.6-20200228' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent
Pull perf/urgent fixes from Arnaldo Carvalho de Melo:
perf annotate:
Ravi Bangoria:
- Fix segfault with source toggle.
- Fix --show-total-period and --show-nr-samples for tui/stdio2.
- Fix handling of settings in ~/.perfconfig versus the ones passed
in the command line
- Re-render title bar after switching back from script browser.
- Fix options man page, document some missing ones.
perf probe:
He Zhe:
- Check return value of strlist__add() for -ENOMEM.
tools UAPI:
Arnaldo Carvalho de Melo:
- Sync x86's msr-index.h copy with the kernel sources.
- Update tools's copy of x86's kvm.h headers.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl5ZXkgQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgprZqEACOvhiprH9Q75Pp+ZwQknM3xGyJRWI3Mbj9
ZOyTVTK0qhTeaq6rN4MLSYevXOh+L68x5WRt1YJ1UnQRE0i8+ZQyZczqLKxxl8gF
trhbYDXjvXIWr9zvdtiL01PKKu4Vjjp6eZAomrbxCTFku0qn76fo9wDgGPRGL+Kx
lNO/6QvCXr9EjDniEUhlQsxTad5xc4sL0cnL4s2i7RlTCYtW4WJXJMC/4Gkg69j+
W5GBZyjJDa8Sj3pEbLjtDtA4ooE9VMaldb7ZvR62ONUVwGpftPsbN7UhVlhyhpW+
8v4ZEf07CxB246+hj7oL0RvEW3+/nB2hym1ySMXyBzpbx4O1JOUG7hQtNgdLRbCZ
27IOg2O36qbUKM1hUwn7Qm3XAfBPQdFpVmqE2+E9MEOKzigLzhRP6Bu5d9x9VQGh
JDxsm3B8PRHFJVAasiYu0p7mlx/+BCLjB84UrMB3I9UCBuVfk4mtmuwZX+mcK2PR
pV1xJlEMYKme3cz2/u6uB8p3Nq6ipE1nSVrI6AnfEvJbQ9sFL61KaG4wHKPvtb0y
mlNgc4seSjiWcBR2/84561a4CSmlXAn9dWMIGdHFFA43mTPYGc5omTcM8FwcEDkW
cTFGB8sFukcTNmOw62HUHYI1vPpowX6apV08lEQrScz7GiK5piTYqTFNneqEzcwZ
3bIMisH3Gg==
=WheR
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.6-2020-02-28' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
- Fix for a race with IOPOLL used with SQPOLL (Xiaoguang)
- Only show ->fdinfo if procfs is enabled (Tobias)
- Fix for a chain with multiple personalities in the SQEs
- Fix for a missing free of personality idr on exit
- Removal of the spin-for-work optimization
- Fix for next work lookup on request completion
- Fix for non-vec read/write result progation in case of links
- Fix for a fileset references on switch
- Fix for a recvmsg/sendmsg 32-bit compatability mode
* tag 'io_uring-5.6-2020-02-28' of git://git.kernel.dk/linux-block:
io_uring: fix 32-bit compatability with sendmsg/recvmsg
io_uring: define and set show_fdinfo only if procfs is enabled
io_uring: drop file set ref put/get on switch
io_uring: import_single_range() returns 0/-ERROR
io_uring: pick up link work on submit reference drop
io-wq: ensure work->task_pid is cleared on init
io-wq: remove spin-for-work optimization
io_uring: fix poll_list race for SETUP_IOPOLL|SETUP_SQPOLL
io_uring: fix personality idr leak
io_uring: handle multiple personalities in link chains
Fix a couple of configuration issues in the ACPI watchdog (WDAT)
driver (Mika Westerberg) and make it possible to disable that
driver at boot time in case it still does not work as
expected (Jean Delvare).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl5Y6UYSHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxj7YP/1qVMcIJo3CcItiQVV7+2S+WS+kK+Rf8
t6g+8F5KL0DBrDqji7l19kv0O+kPpn2uGnREcNk3T3mHEu0jSywghvBI+nzrym2P
Q+p2+2Fryy5zgiNTyhq7imA3IH4Zw5CVt6GRJjITvzFwhMGM38NQsDUbZEpZtJrz
HTVQyqPgmhvox5sa2Qk1+vZOtclWlkPenWmtJEs8m+Pg8w+Ejfs4Lj+CQYlI5RB8
GGhTgBiBMS7A80yTVxl8LULsrini6PWAGGY2yMScwDUYNZTAUuT8QSUx4/QBmyyP
U5K1sIK2pd/JuE2QqSBztrYC7x1fpfoBJylF8B0HK5uNBpZth7eXhe6t+xDYVTpw
U31unYfmcx6UT47lF/yWWNuhP6EDvdvZ0Xx4DHlgBJqKNuyfhz4Uazsr7HyLi2sI
/YrBPimtgvICFHYorrNVwEheTUEp+QXtNZp6aTugLGVX6zbQ3nZp7ajifNO7hdsQ
K+IjjRgH/YsVi9GQY/U3A/zTO+39fSupQgKiH33pG8Yp5bZbreoo1kPKzlnt23JJ
DXuaN7wH74sgT2EqZm30lFapF/0wv+iIpdWmorlE/4WwTwIeDck/ULeBiGFqpUI6
9Di10O6w+TJ+VevyJ5Hqkzukjlj93Vq+kCY8eOiWVpTga2ieQiYKFSbxDPE3OC1k
j2wkDwKqkh0o
=eIlw
-----END PGP SIGNATURE-----
Merge tag 'acpi-5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
"Fix a couple of configuration issues in the ACPI watchdog (WDAT)
driver (Mika Westerberg) and make it possible to disable that driver
at boot time in case it still does not work as expected (Jean
Delvare)"
* tag 'acpi-5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: watchdog: Set default timeout in probe
ACPI: watchdog: Fix gas->access_width usage
ACPICA: Introduce ACPI_ACCESS_BYTE_WIDTH() macro
ACPI: watchdog: Allow disabling WDAT at boot
Fix a recent cpufreq initialization regression (Rafael Wysocki),
revert a devfreq commit that made incompatible changes and broke
user land on some systems (Orson Zhai), drop a stale reference to
a document that has gone away recently (Jonathan Neuschäfer) and
fix a typo in a hibernation code comment (Alexandre Belloni).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl5Y6MYSHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxqCAP/3zmwDsBMG07c+g9kOEJzuGLkpUrGIFp
t1vR2+EmV7bCq7onCLSBz090wvogLCmNPpXmq45Ddt5hx3ltZSx4stwYdbsS/xmU
t+9pFiCSSFufs1vaUzZqCuw53jpizRD4KKoUbCT2kpCgH4wjF7ftRIC5y2DYI57Y
hOIfrI9XzwR2UZYRWZqCYtwpPkMicao8lXLfhs1mrZFk+AxwwwUFSp8/Gw94LBPD
+ESctvsrGfmqEDEvcZUaYd2i0i5PsqnbnVy6wxWb3rhmhSXJTfiCzCGECItBJuSA
HcZj3m2dohOXDxXK/Yevwiv+fBkqeeeP+KVlvtjzmYzTDW1bryLYQpxAsdpeYGEE
bANjAl6MARqBDKUjvw+N6yDdAPPNbsT6zEK3f0w4eQNRpfmjhHP7TxrOsjC0DRJu
EZpN6hWGQs1Vmndr+xaK0ITeRXB+7IQeXG6t6XDYZg2gS65hkhF5z1E5vjmn1Cen
wpQsqvek6K7/DqqLqCH27yATNNKff7PNNCfpIazPE6CjDKBAiM9jsMpWkQf4E9/7
VKJFzmhy5P9VsHnUNonRN1K8Vl5eo+b7YOfmR8kvfsmNpfvKqaXM813owVMkOEq6
44kc2D38t3TnlH+GNgGfyPBnHchqlw/P0y0mypQ73Wyg+PQB+fx+W9vAXWDe6fBW
/tIhZYILu90f
=ySea
-----END PGP SIGNATURE-----
Merge tag 'pm-5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"Fix a recent cpufreq initialization regression (Rafael Wysocki),
revert a devfreq commit that made incompatible changes and broke user
land on some systems (Orson Zhai), drop a stale reference to a
document that has gone away recently (Jonathan Neuschäfer), and fix a
typo in a hibernation code comment (Alexandre Belloni)"
* tag 'pm-5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: Fix policy initialization for internal governor drivers
Revert "PM / devfreq: Modify the device name as devfreq(X) for sysfs"
PM / hibernate: fix typo "reserverd_size" -> "reserved_size"
Documentation: power: Drop reference to interface.rst