Patch series "hfs: fix various errors", v2.
This series ultimately aims to address a lockdep warning in
hfs_find_init reported by Syzbot [1].
The work done for this led to the discovery of another bug, and the
Syzkaller repro test also reveals an invalid memory access error after
clearing the lockdep warning. Hence, this series is broken up into
three patches:
1. Add a missing call to hfs_find_exit for an error path in
hfs_fill_super
2. Fix memory mapping in hfs_bnode_read by fixing calls to kmap
3. Add lock nesting notation to tell lockdep that the observed locking
hierarchy is safe
This patch (of 3):
Before exiting hfs_fill_super, the struct hfs_find_data used in
hfs_find_init should be passed to hfs_find_exit to be cleaned up, and to
release the lock held on the btree.
The call to hfs_find_exit is missing from an error path. We add it back
in by consolidating calls to hfs_find_exit for error paths.
Link: https://syzkaller.appspot.com/bug?id=f007ef1d7a31a469e3be7aeb0fde0769b18585db [1]
Link: https://lkml.kernel.org/r/20210701030756.58760-1-desmondcheongzx@gmail.com
Link: https://lkml.kernel.org/r/20210701030756.58760-2-desmondcheongzx@gmail.com
Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
Reviewed-by: Viacheslav Dubeyko <slava@dubeyko.com>
Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The HMM selftests use atomic_check_access() to check atomic access to a
page has been revoked. It doesn't matter if the page mapping has been
removed from the mirrored page tables as that also implies atomic access
has been revoked. Therefore remove the unused page variable to fix this
compiler warning:
lib/test_hmm.c:631:16: warning: variable `page' set but not used [-Wunused-but-set-variable]
Link: https://lkml.kernel.org/r/20210706025603.4059-1-apopple@nvidia.com
Fixes: b659baea75 ("mm: selftests for exclusive device memory")
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reported-by: Hulk Robot <hulkci@huawei.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Reported-by: Yang Yingliang <yangyingliang@huawei.com>
Acked-by: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Adjust the nommu stub of try_to_unmap to match the changed protype for the
full version. Turn it into an inline instead of a macro to generally
improve the type checking.
Link: https://lkml.kernel.org/r/20210705053944.885828-1-hch@lst.de
Fixes: 1fb08ac63b ("mm: rmap: make try_to_unmap() void function")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The author of commit b3b64ebd38 ("mm/page_alloc: do bulk array
bounds check after checking populated elements") was possibly
confused by the mixture of return values throughout the function.
The API contract is clear that the function "Returns the number of pages
on the list or array." It does not list zero as a unique return value with
a special meaning. Therefore zero is a plausible return value only if
@nr_pages is zero or less.
Clean up the return logic to make it clear that the returned value is
always the total number of pages in the array/list, not the number of
pages that were allocated during this call.
The only change in behavior with this patch is the value returned if
prepare_alloc_pages() fails. To match the API contract, the number of
pages currently in the array/list is returned in this case.
The call site in __page_pool_alloc_pages_slow() also seems to be confused
on this matter. It should be attended to by someone who is familiar with
that code.
[mel@techsingularity.net: Return nr_populated if 0 pages are requested]
Link: https://lkml.kernel.org/r/20210713152100.10381-4-mgorman@techsingularity.net
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
Cc: Zhang Qiang <Qiang.Zhang@windriver.com>
Cc: Yanfei Xu <yanfei.xu@windriver.com>
Cc: Matteo Croce <mcroce@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Syzbot is reporting potential deadlocks due to pagesets.lock when
PAGE_OWNER is enabled. One example from Desmond Cheong Zhi Xi is as
follows
__alloc_pages_bulk()
local_lock_irqsave(&pagesets.lock, flags) <---- outer lock here
prep_new_page():
post_alloc_hook():
set_page_owner():
__set_page_owner():
save_stack():
stack_depot_save():
alloc_pages():
alloc_page_interleave():
__alloc_pages():
get_page_from_freelist():
rm_queue():
rm_queue_pcplist():
local_lock_irqsave(&pagesets.lock, flags);
*** DEADLOCK ***
Zhang, Qiang also reported
BUG: sleeping function called from invalid context at mm/page_alloc.c:5179
in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 1, name: swapper/0
.....
__dump_stack lib/dump_stack.c:79 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:96
___might_sleep.cold+0x1f1/0x237 kernel/sched/core.c:9153
prepare_alloc_pages+0x3da/0x580 mm/page_alloc.c:5179
__alloc_pages+0x12f/0x500 mm/page_alloc.c:5375
alloc_page_interleave+0x1e/0x200 mm/mempolicy.c:2147
alloc_pages+0x238/0x2a0 mm/mempolicy.c:2270
stack_depot_save+0x39d/0x4e0 lib/stackdepot.c:303
save_stack+0x15e/0x1e0 mm/page_owner.c:120
__set_page_owner+0x50/0x290 mm/page_owner.c:181
prep_new_page mm/page_alloc.c:2445 [inline]
__alloc_pages_bulk+0x8b9/0x1870 mm/page_alloc.c:5313
alloc_pages_bulk_array_node include/linux/gfp.h:557 [inline]
vm_area_alloc_pages mm/vmalloc.c:2775 [inline]
__vmalloc_area_node mm/vmalloc.c:2845 [inline]
__vmalloc_node_range+0x39d/0x960 mm/vmalloc.c:2947
__vmalloc_node mm/vmalloc.c:2996 [inline]
vzalloc+0x67/0x80 mm/vmalloc.c:3066
There are a number of ways it could be fixed. The page owner code could
be audited to strip GFP flags that allow sleeping but it'll impair the
functionality of PAGE_OWNER if allocations fail. The bulk allocator could
add a special case to release/reacquire the lock for prep_new_page and
lookup PCP after the lock is reacquired at the cost of performance. The
pages requiring prep could be tracked using the least significant bit and
looping through the array although it is more complicated for the list
interface. The options are relatively complex and the second one still
incurs a performance penalty when PAGE_OWNER is active so this patch takes
the simple approach -- disable bulk allocation of PAGE_OWNER is active.
The caller will be forced to allocate one page at a time incurring a
performance penalty but PAGE_OWNER is already a performance penalty.
Link: https://lkml.kernel.org/r/20210708081434.GV3840@techsingularity.net
Fixes: dbbee9d5cd ("mm/page_alloc: convert per-cpu list protection to local_lock")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
Reported-by: "Zhang, Qiang" <Qiang.Zhang@windriver.com>
Reported-by: syzbot+127fd7828d6eeb611703@syzkaller.appspotmail.com
Tested-by: syzbot+127fd7828d6eeb611703@syzkaller.appspotmail.com
Acked-by: Rafael Aquini <aquini@redhat.com>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The <linux/kasan.h> header relies on _RET_IP_ being defined, and had been
receiving that definition via inclusion of bug.h which includes kernel.h.
However, since f39650de68 ("kernel.h: split out panic and oops helpers")
that is no longer the case and get the following build error when building
CONFIG_KASAN_HW_TAGS on arm64:
In file included from arch/arm64/mm/kasan_init.c:10:
include/linux/kasan.h: In function 'kasan_slab_free':
include/linux/kasan.h:230:39: error: '_RET_IP_' undeclared (first use in this function)
230 | return __kasan_slab_free(s, object, _RET_IP_, init);
Fix it by including kernel.h from kasan.h.
Link: https://lkml.kernel.org/r/20210705072716.2125074-1-elver@google.com
Fixes: f39650de68 ("kernel.h: split out panic and oops helpers")
Signed-off-by: Marco Elver <elver@google.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Issue: when SLUB debug is on, hwtag kasan_unpoison() would overwrite the
redzone of object with unaligned size.
An additional memzero_explicit() path is added to replacing init by hwtag
instruction for those unaligned size at SLUB debug mode.
The penalty is acceptable since they are only enabled in debug mode, not
production builds. A block of comment is added for explanation.
Link: https://lkml.kernel.org/r/20210705103229.8505-3-yee.lee@mediatek.com
Signed-off-by: Yee Lee <yee.lee@mediatek.com>
Suggested-by: Andrey Konovalov <andreyknvl@gmail.com>
Suggested-by: Marco Elver <elver@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Nicholas Tang <nicholas.tang@mediatek.com>
Cc: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move the helper to check slub_debug_enabled, so that we can confine the
use of #ifdef outside slub.c as well.
Link: https://lkml.kernel.org/r/20210705103229.8505-2-yee.lee@mediatek.com
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Yee Lee <yee.lee@mediatek.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
Cc: Nicholas Tang <nicholas.tang@mediatek.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If we encounter a directory that has been configured to pass on an
extent size hint to a new realtime file and the hint isn't an integer
multiple of the rt extent size, we should flag the hint for
administrative review because that is a misconfiguration (that other
parts of the kernel will fix automatically).
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
During a realtime grow operation, we run a single transaction for each
rt bitmap block added to the filesystem. This means that each step has
to be careful to increase sb_rblocks appropriately.
Fix the integer overflow error in this calculation that can happen when
the extent size is very large. Found by running growfs to add a rt
volume to a filesystem formatted with a 1g rt extent size.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Improve the checking at the start of a realtime grow operation so that
we avoid accidentally set a new extent size that is too large and avoid
adding an rt volume to a filesystem with rmap or reflink because we
don't support rt rmap or reflink yet.
While we're at it, separate the checks so that we're only testing one
aspect at a time.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Commit 603f000b15 changed xfs_ioctl_setattr_check_extsize to reject an
attempt to set an EXTSZINHERIT extent size hint on a directory with
RTINHERIT set if the hint isn't a multiple of the realtime extent size.
However, I have recently discovered that it is possible to change the
realtime extent size when adding a rt device to a filesystem, which
means that the existence of directories with misaligned inherited hints
is not an accident.
As a result, it's possible that someone could have set a valid hint and
added an rt volume with a different rt extent size, which invalidates
the ondisk hints. After such a sequence, FSGETXATTR will report a
misaligned hint, which FSSETXATTR will trip over, causing confusion if
the user was doing the usual GET/SET sequence to change some other
attribute. Change xfs_fill_fsxattr to omit the hint if it isn't aligned
properly.
Fixes: 603f000b15 ("xfs: validate extsz hints against rt extent size when rtinherit is set")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
While auditing the realtime growfs code, I realized that the GROWFSRT
ioctl (and by extension xfs_growfs) has always allowed sysadmins to
change the realtime extent size when adding a realtime section to the
filesystem. Since we also have always allowed sysadmins to set
RTINHERIT and EXTSZINHERIT on directories even if there is no realtime
device, this invalidates the premise laid out in the comments added in
commit 603f000b15.
In other words, this is not a case of inadequate metadata validation.
This is a case of nearly forgotten (and apparently untested) but
supported functionality. Update the comments to reflect what we've
learned, and remove the log message about correcting the misalignment.
Fixes: 603f000b15 ("xfs: validate extsz hints against rt extent size when rtinherit is set")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
While running xfs/168, I noticed a second source of post-shrink
corruption errors causing shutdowns.
Let's say that directory B has a low inode number and is a child of
directory A, which has a high number. If B is empty but open, and
unlinked from A, B's dotdot link continues to point to A. If A is then
unlinked and the filesystem shrunk so that A is no longer a valid inode,
a subsequent AIL push of B will trip the inode verifiers because the
dotdot entry points outside of the filesystem.
To avoid this problem, reset B's dotdot entry to the root directory when
unlinking directories, since the root directory cannot be removed.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
While running xfs/168, I noticed occasional write verifier shutdowns
involving inodes at the very end of the filesystem. Existing inode
btree validation code checks that all inode clusters are fully contained
within the filesystem.
However, due to inadequate checking in the fs shrink code, it's possible
that there could be a sparse inode cluster at the end of the filesystem
where the upper inodes of the cluster are marked as holes and the
corresponding blocks are free. In this case, the last blocks in the AG
are listed in the bnobt. This enables the shrink to proceed but results
in a filesystem that trips the inode verifiers. Fix this by disallowing
the shrink.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Now that we create those objects in iomap_writepage_map when needed,
there's no need to pre-create them in iomap_page_mkwrite_actor anymore.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
In iomap_readpage_actor, don't create iop objects for inline inodes.
Otherwise, iomap_read_inline_data will set PageUptodate without setting
iop->uptodate, and iomap_page_release will eventually complain.
To prevent this kind of bug from occurring in the future, make sure the
page doesn't have private data attached in iomap_read_inline_data.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Create an iop in the writeback path if one doesn't exist. This allows us
to avoid creating the iop in some cases. We'll initially do that for pages
with inline data, but it can be extended to pages which are entirely within
an extent. It also allows for an iop to be removed from pages in the
future (eg page split).
Co-developed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
The length variable is rather pointless given that it can be trivially
deduced from offset and size. Also the initial calculation can lead
to KASAN warnings.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Leizhen (ThunderTown) <thunder.leizhen@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
The length variable is rather pointless given that it can be trivially
deduced from offset and size. Also the initial calculation can lead
to KASAN warnings.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Leizhen (ThunderTown) <thunder.leizhen@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
We suppress KCOV for entry.o rather than entry-common.o. As entry.o is
built from entry.S, this is pointless, and permits instrumentation of
entry-common.o, which is built from entry-common.c.
Fix the Makefile to suppress KCOV for entry-common.o, as we had intended
to begin with. I've verified with objdump that this is working as
expected.
Fixes: bf6fa2c0dd ("arm64: entry: don't instrument entry code with KCOV")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20210715123049.9990-1-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
We intend that all the early exception handling code is marked as
`noinstr`, but we forgot this for __el0_error_handler_common(), which is
called before we have completed entry from user mode. If it were
instrumented, we could run into problems with RCU, lockdep, etc.
Mark it as `noinstr` to prevent this.
The few other functions in entry-common.c which do not have `noinstr` are
called once we've completed entry, and are safe to instrument.
Fixes: bb8e93a287 ("arm64: entry: convert SError handlers to C")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Joey Gouly <joey.gouly@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20210714172801.16475-1-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
Since commit:
bad1e1c663 ("arm64: mte: switch GCR_EL1 in kernel entry and exit")
we saved/restored the user GCR_EL1 value at exception boundaries, and
update_gcr_el1_excl() is no longer used for this. However it is used to
restore the kernel's GCR_EL1 value when returning from a suspend state.
Thus, the comment is misleading (and an ISB is necessary).
When restoring the kernel's GCR value, we need an ISB to ensure this is
used by subsequent instructions. We don't necessarily get an ISB by
other means (e.g. if the kernel is built without support for pointer
authentication). As __cpu_setup() initialised GCR_EL1.Exclude to 0xffff,
until a context synchronization event, allocation tag 0 may be used
rather than the desired set of tags.
This patch drops the misleading comment, adds the missing ISB, and for
clarity folds update_gcr_el1_excl() into its only user.
Fixes: bad1e1c663 ("arm64: mte: switch GCR_EL1 in kernel entry and exit")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20210714143843.56537-2-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
Al reminds us that the usercopy API must only return complete failure
if absolutely nothing could be copied. Currently, if userspace does
something silly like giving us an unaligned pointer to Device memory,
or a size which overruns MTE tag bounds, we may fail to honour that
requirement when faulting on a multi-byte access even though a smaller
access could have succeeded.
Add a mitigation to the fixup routines to fall back to a single-byte
copy if we faulted on a larger access before anything has been written
to the destination, to guarantee making *some* forward progress. We
needn't be too concerned about the overall performance since this should
only occur when callers are doing something a bit dodgy in the first
place. Particularly broken userspace might still be able to trick
generic_perform_write() into an infinite loop by targeting write() at
an mmap() of some read-only device register where the fault-in load
succeeds but any store synchronously aborts such that copy_to_user() is
genuinely unable to make progress, but, well, don't do that...
CC: stable@vger.kernel.org
Reported-by: Chen Huang <chenhuang5@huawei.com>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/dc03d5c675731a1f24a62417dba5429ad744234e.1626098433.git.robin.murphy@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
xen-blkfront has a weird protocol where close message from the remote
side can be delayed, and where hot removals are treated somewhat
differently from regular removals, all leading to potential NULL
pointer removals, and a del_gendisk from the block device release
method, which will deadlock. Fix this by just performing normal hot
removals even when the device is opened like all other Linux block
drivers.
Fixes: c76f48eb5c ("block: take bd_mutex around delete_partitions in del_gendisk")
Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20210715141711.1257293-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- fix various races in nvme-pci when shutting down just after probing
(Casey Chen)
- fix a net_device leak in nvme-tcp (Prabhakar Kushwaha)
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmDwQcQLHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYO1Rw/+MCBsOWNPnlbX3IN4wnk10ySXVb1YKkNibn7riBN5
1hhi8hXHBA5h+H/HRy00FAYvi3OjRyUBVoRN7aEQEc9t0q3WWznCTv81kdfG2vX3
ZkAmKGW/J0+Qc4+g7ul2hUcjoYSpMKNtRV/fpmn6TudDW7myXTwMReu/FFzS/XBj
tP/F+k9DO9/3GCSxZJEZoPyWmo1hQsNTH03D/d/CyMTEffObMkRfcFEJgkOpdLFZ
dFzfE33DbM36t8bU6wAwpXcg/y7X9mdzz2LZONYirIrE9OV+gi31uy4m9WHoXPgY
n7WNQymJCFXOEIT0u9yyju+OeJ0DILKKztgG2K369y6svtwPLD2z9Jsg55fPSXPr
YCA1oFyfqx/+9CezuiSD3WylJBSwSWAn/en7mlqmrIeCtX49VAx2OaEgsiy04Qll
2i8MXZG2Lc3EzTgr+FemA6OC0ESZj8t5v0NoAlP+5hsE+Y1fpu9g0CLXhNOF6M60
fZ/NvqwV3NQtjdTsMUC+rAEDBasTyTqhhfHdJ9NShm1rssUhqQxzDEF4TghaPG7f
NkpmNLUQu3EHr99SJzk269pCZ6dgk3AftJc06kzvwLbBBDfVXTCnhCrQnf9wEgPM
7RPiLjwMLeCk9CLbg0cDtTnEt92SidfIe2mXL/i/0DP8Au5A7/DQhDJ60bWUt5V9
J/Y=
=PZvx
-----END PGP SIGNATURE-----
Merge tag 'nvme-5.14-2021-07-15' of git://git.infradead.org/nvme into block-5.14
Pull NVMe fixes from Christoph:
"nvme fixes for Linux 5.14
- fix various races in nvme-pci when shutting down just after probing
(Casey Chen)
- fix a net_device leak in nvme-tcp (Prabhakar Kushwaha)"
* tag 'nvme-5.14-2021-07-15' of git://git.infradead.org/nvme:
nvme-pci: do not call nvme_dev_remove_admin from nvme_remove
nvme-pci: fix multiple races in nvme_setup_io_queues
nvme-tcp: use __dev_get_by_name instead dev_get_by_name for OPT_HOST_IFACE
We must release the queue before freeing the tagset.
Fixes: 262d431f90 ("pd: use blk_mq_alloc_disk and blk_cleanup_disk")
Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210706010734.1356066-1-guoqing.jiang@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There's no need for fixed strings to be under 'patternProperties', so move
them under 'properties' instead.
Cc: Jean Delvare <jdelvare@suse.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Kishon Vijay Abraham I <kishon@ti.com>
Cc: Vinod Koul <vkoul@kernel.org>
Cc: Saravanan Sekar <sravanhome@gmail.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Jagan Teki <jagan@amarulasolutions.com>
Cc: Troy Kisky <troy.kisky@boundarydevices.com>
Cc: linux-hwmon@vger.kernel.org
Cc: linux-phy@lists.infradead.org
Cc: linux-spi@vger.kernel.org
Signed-off-by: Rob Herring <robh@kernel.org>
Acked-by: Mark Brown <broonie@kernel.org>
Acked-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/r/20210713193514.690894-1-robh@kernel.org
Another round of removing redundant minItems/maxItems from new schema in
the recent merge window.
If a property has an 'items' list, then a 'minItems' or 'maxItems' with the
same size as the list is redundant and can be dropped. Note that is DT
schema specific behavior and not standard json-schema behavior. The tooling
will fixup the final schema adding any unspecified minItems/maxItems.
This condition is partially checked with the meta-schema already, but
only if both 'minItems' and 'maxItems' are equal to the 'items' length.
An improved meta-schema is pending.
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Will Deacon <will@kernel.org>
Cc: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Vignesh Raghavendra <vigneshr@ti.com>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Sureshkumar Relli <naga.sureshkumar.relli@xilinx.com>
Cc: Brian Norris <computersforpeace@gmail.com>
Cc: Kamal Dasu <kdasu.kdev@gmail.com>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Cc: linux-clk@vger.kernel.org
Cc: iommu@lists.linux-foundation.org
Cc: linux-mtd@lists.infradead.org
Cc: linux-rtc@vger.kernel.org
Cc: linux-usb@vger.kernel.org
Signed-off-by: Rob Herring <robh@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Reviewed-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Link: https://lore.kernel.org/r/20210713193453.690290-1-robh@kernel.org
Two additional tests are added:
- SMM triggered from L2 does not currupt L1 host state.
- Save/restore during SMM triggered from L2 does not corrupt guest/host
state.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210628104425.391276-7-vkuznets@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If the VM was migrated while in SMM, no nested state was saved/restored,
and therefore svm_leave_smm has to load both save and control area
of the vmcb12. Save area is already loaded from HSAVE area,
so now load the control area as well from the vmcb12.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210628104425.391276-6-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
VMCB split commit 4995a3685f ("KVM: SVM: Use a separate vmcb for the
nested L2 guest") broke return from SMM when we entered there from guest
(L2) mode. Gen2 WS2016/Hyper-V is known to do this on boot. The problem
manifests itself like this:
kvm_exit: reason EXIT_RSM rip 0x7ffbb280 info 0 0
kvm_emulate_insn: 0:7ffbb280: 0f aa
kvm_smm_transition: vcpu 0: leaving SMM, smbase 0x7ffb3000
kvm_nested_vmrun: rip: 0x000000007ffbb280 vmcb: 0x0000000008224000
nrip: 0xffffffffffbbe119 int_ctl: 0x01020000 event_inj: 0x00000000
npt: on
kvm_nested_intercepts: cr_read: 0000 cr_write: 0010 excp: 40060002
intercepts: fd44bfeb 0000217f 00000000
kvm_entry: vcpu 0, rip 0xffffffffffbbe119
kvm_exit: reason EXIT_NPF rip 0xffffffffffbbe119 info
200000006 1ab000
kvm_nested_vmexit: vcpu 0 reason npf rip 0xffffffffffbbe119 info1
0x0000000200000006 info2 0x00000000001ab000 intr_info 0x00000000
error_code 0x00000000
kvm_page_fault: address 1ab000 error_code 6
kvm_nested_vmexit_inject: reason EXIT_NPF info1 200000006 info2 1ab000
int_info 0 int_info_err 0
kvm_entry: vcpu 0, rip 0x7ffbb280
kvm_exit: reason EXIT_EXCP_GP rip 0x7ffbb280 info 0 0
kvm_emulate_insn: 0:7ffbb280: 0f aa
kvm_inj_exception: #GP (0x0)
Note: return to L2 succeeded but upon first exit to L1 its RIP points to
'RSM' instruction but we're not in SMM.
The problem appears to be that VMCB01 gets irreversibly destroyed during
SMM execution. Previously, we used to have 'hsave' VMCB where regular
(pre-SMM) L1's state was saved upon nested_svm_vmexit() but now we just
switch to VMCB01 from VMCB02.
Pre-split (working) flow looked like:
- SMM is triggered during L2's execution
- L2's state is pushed to SMRAM
- nested_svm_vmexit() restores L1's state from 'hsave'
- SMM -> RSM
- enter_svm_guest_mode() switches to L2 but keeps 'hsave' intact so we have
pre-SMM (and pre L2 VMRUN) L1's state there
- L2's state is restored from SMRAM
- upon first exit L1's state is restored from L1.
This was always broken with regards to svm_get_nested_state()/
svm_set_nested_state(): 'hsave' was never a part of what's being
save and restored so migration happening during SMM triggered from L2 would
never restore L1's state correctly.
Post-split flow (broken) looks like:
- SMM is triggered during L2's execution
- L2's state is pushed to SMRAM
- nested_svm_vmexit() switches to VMCB01 from VMCB02
- SMM -> RSM
- enter_svm_guest_mode() switches from VMCB01 to VMCB02 but pre-SMM VMCB01
is already lost.
- L2's state is restored from SMRAM
- upon first exit L1's state is restored from VMCB01 but it is corrupted
(reflects the state during 'RSM' execution).
VMX doesn't have this problem because unlike VMCB, VMCS keeps both guest
and host state so when we switch back to VMCS02 L1's state is intact there.
To resolve the issue we need to save L1's state somewhere. We could've
created a third VMCB for SMM but that would require us to modify saved
state format. L1's architectural HSAVE area (pointed by MSR_VM_HSAVE_PA)
seems appropriate: L0 is free to save any (or none) of L1's state there.
Currently, KVM does 'none'.
Note, for nested state migration to succeed, both source and destination
hypervisors must have the fix. We, however, don't need to create a new
flag indicating the fact that HSAVE area is now populated as migration
during SMM triggered from L2 was always broken.
Fixes: 4995a3685f ("KVM: SVM: Use a separate vmcb for the nested L2 guest")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Separate the code setting non-VMLOAD-VMSAVE state from
svm_set_nested_state() into its own function. This is going to be
re-used from svm_enter_smm()/svm_leave_smm().
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210628104425.391276-4-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
APM states that "The address written to the VM_HSAVE_PA MSR, which holds
the address of the page used to save the host state on a VMRUN, must point
to a hypervisor-owned page. If this check fails, the WRMSR will fail with
a #GP(0) exception. Note that a value of 0 is not considered valid for the
VM_HSAVE_PA MSR and a VMRUN that is attempted while the HSAVE_PA is 0 will
fail with a #GP(0) exception."
svm_set_msr() already checks that the supplied address is valid, so only
check for '0' is missing. Add it to nested_svm_vmrun().
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210628104425.391276-3-vkuznets@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
APM states that #GP is raised upon write to MSR_VM_HSAVE_PA when
the supplied address is not page-aligned or is outside of "maximum
supported physical address for this implementation".
page_address_valid() check seems suitable. Also, forcefully page-align
the address when it's written from VMM.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210628104425.391276-2-vkuznets@redhat.com>
Cc: stable@vger.kernel.org
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
[Add comment about behavior for host-provided values. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use IS_ERR() instead of checking for a NULL pointer when querying for
sev_pin_memory() failures. sev_pin_memory() always returns an error code
cast to a pointer, or a valid pointer; it never returns NULL.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Steve Rutherford <srutherford@google.com>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Ashish Kalra <ashish.kalra@amd.com>
Fixes: d3d1af85e2 ("KVM: SVM: Add KVM_SEND_UPDATE_DATA command")
Fixes: 15fb7de1a7 ("KVM: SVM: Add KVM_SEV_RECEIVE_UPDATE_DATA command")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210506175826.2166383-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Return -EFAULT if copy_to_user() fails; if accessing user memory faults,
copy_to_user() returns the number of bytes remaining, not an error code.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Steve Rutherford <srutherford@google.com>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Ashish Kalra <ashish.kalra@amd.com>
Fixes: d3d1af85e2 ("KVM: SVM: Add KVM_SEND_UPDATE_DATA command")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210506175826.2166383-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In theory there are no side effects of not intercepting #SMI,
because then #SMI becomes transparent to the OS and the KVM.
Plus an observation on recent Zen2 CPUs reveals that these
CPUs ignore #SMI interception and never deliver #SMI VMexits.
This is also useful to test nested KVM to see that L1
handles #SMIs correctly in case when L1 doesn't intercept #SMI.
Finally the default remains the same, the SMI are intercepted
by default thus this patch doesn't have any effect unless
non default module param value is used.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210707125100.677203-4-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Kernel never sends real INIT even to CPUs, other than on boot.
Thus INIT interception is an error which should be caught
by a check for an unknown VMexit reason.
On top of that, the current INIT VM exit handler skips
the current instruction which is wrong.
That was added in commit 5ff3a351f6 ("KVM: x86: Move trivial
instruction-based exit handlers to common code").
Fixes: 5ff3a351f6 ("KVM: x86: Move trivial instruction-based exit handlers to common code")
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210707125100.677203-3-mlevitsk@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit 5ff3a351f6 ("KVM: x86: Move trivial instruction-based
exit handlers to common code"), unfortunately made a mistake of
treating nop_on_interception and nop_interception in the same way.
Former does truly nothing while the latter skips the instruction.
SMI VM exit handler should do nothing.
(SMI itself is handled by the host when we do STGI)
Fixes: 5ff3a351f6 ("KVM: x86: Move trivial instruction-based exit handlers to common code")
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210707125100.677203-2-mlevitsk@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
vmx_msr_index was used to record the list of MSRs which can be lazily
restored when kvm returns to userspace. It is now reimplemented as
kvm_uret_msrs_list, a common x86 list which is only used inside x86.c.
So just remove the obsolete declaration in vmx.h.
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Message-Id: <20210707235702.31595-1-yu.c.zhang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When the host is using debug registers but the guest is not using them
nor is the guest in guest-debug state, the kvm code does not reset
the host debug registers before kvm_x86->run(). Rather, it relies on
the hardware vmentry instruction to automatically reset the dr7 registers
which ensures that the host breakpoints do not affect the guest.
This however violates the non-instrumentable nature around VM entry
and exit; for example, when a host breakpoint is set on vcpu->arch.cr2,
Another issue is consistency. When the guest debug registers are active,
the host breakpoints are reset before kvm_x86->run(). But when the
guest debug registers are inactive, the host breakpoints are delayed to
be disabled. The host tracing tools may see different results depending
on what the guest is doing.
To fix the problems, we clear %db7 unconditionally before kvm_x86->run()
if the host has set any breakpoints, no matter if the guest is using
them or not.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20210628172632.81029-1-jiangshanlai@gmail.com>
Cc: stable@vger.kernel.org
[Only clear %db7 instead of reloading all debug registers. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit a75a895e64 ("KVM: selftests: Unconditionally use memslot 0 for
vaddr allocations") removed the memslot parameters from vm_vaddr_alloc.
It addressed all callers except one under lib/aarch64/, due to a race
with commit e3db7579ef ("KVM: selftests: Add exception handling
support for aarch64")
Fix the vm_vaddr_alloc call in lib/aarch64/processor.c.
Reported-by: Zenghui Yu <yuzenghui@huawei.com>
Signed-off-by: Ricardo Koller <ricarkol@google.com>
Message-Id: <20210702201042.4036162-1-ricarkol@google.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In commit bc9e9e672d ("KVM: debugfs: Reuse binary stats descriptors")
loop for filling debugfs_stat_data was copy-pasted 2 times, but
in the second loop pointers are saved over pointers allocated
in the first loop. All this causes is a memory leak, fix it.
Fixes: bc9e9e672d ("KVM: debugfs: Reuse binary stats descriptors")
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Reviewed-by: Jing Zhang <jingzhangos@google.com>
Message-Id: <20210701195500.27097-1-paskripkin@gmail.com>
Reviewed-by: Jing Zhang <jingzhangos@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Some of the lines aren't properly indented, causing yamllint to warn
about them:
.../nxp,sja1105.yaml:70:17: [warning] wrong indentation: expected 18 but found 16 (indentation)
Use the proper indentation to fix those warnings.
Signed-off-by: Thierry Reding <treding@nvidia.com>
Fixes: 070f5b701d ("dt-bindings: net: dsa: sja1105: add SJA1110 bindings")
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://lore.kernel.org/r/20210622113327.3613595-1-thierry.reding@gmail.com
Signed-off-by: Rob Herring <robh@kernel.org>