-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlx9czQACgkQxWXV+ddt
WDvC9w/8CxJf1/eZBqb+b+aA38kgZhoaNixMud/IW/IFmIlicX0PoDxk6dh1ZTA+
3uej/7fyfwjNCVvtrPVVxdT8zhZgyJouHrbhG1PlDWtmTEV2VqV5pBG1xQtCwmZy
oinQI5oYYM5Le5EXxRGH8TQs6Z3tFuLx2kcrVWBLFKoZ2kZBZxe6KykGF9izve4a
sVjtOL1CEL1e00vrNLzUmch8qss9Cu0i3qd3k8UANp3SgKIaOkJt4S/HeEcLfy5J
kf6hVKlgPDuakVtAJKyhbLVQsfHVNkfiyvplta9lDot/iJchJITTRkadP6LblVeo
knl8V+VO9kzQUvGauxtu66Q3DJ/7mqbzHUwPISetdKCV9ZXkuPFHnu0AEP577mrx
e1JAPA/a8lF3up5QhqIb0uzH3sczOd8nNN/b1Xnxl7Kogyl8SUjhmX3FFy88borj
/8Ptv/fFMQZs9IJ0QWlkh5TKRXAtSNAzVy2FpkvLaO0k0gJKQjyJuTKV5ezv/PGU
+4m5kDtfpyz//KAOZxq4lERj4EMIEDhHhNbA8Qqmdeoj7oaKZ+gW57enOXohCTbi
gVE6xDr2u4oQ85j3JuQo5W5mZA4uza35Gh4t43n5akdrrkLFVLW584hDxShGx9uS
B0maToGbzOdGJTZXZ2SLHZ5Da14Lzb/TooCufF8GMAISb99vbjw=
=D9zc
-----END PGP SIGNATURE-----
Merge tag 'for-5.1-part1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"This contains usual mix of new features, core changes and fixes; full
list below. I'm planning second pull request, with a few more fixes
that arrived recently but too close to merge window, will send it next
week.
New features:
- support zstd compression levels
- new ioctl to unregister a device from the module (ie. reverse of
device scan)
- scrub prints a message to log when it's about to start or finish
Core changes:
- qgroups can now skip part of a tree that does not get updated
during relocation, because this does not affect the quota
accounting, estimated speedup in run time is about 20%
- the compression workspace management had to be enhanced due to zstd
requirements
- various enospc fixes, when there's high fragmentation the
over-reservation can cause ENOSPC that might not happen after a
flush, in such cases try to wait if the situation improves
Fixes:
- various ioctls could overwrite previous return value if
copy_to_user fails, fix this so the original error is reported
- more reclaim vs GFP_KERNEL fixes
- other cleanups and refactoring
- fix a (valid) lockdep warning in a test when device replace is
destroying worker threads
- make qgroup async transaction commit more aggressive, this avoids
some 'quota limit reached' errors if there are not enough data to
trigger transaction in order to flush
- fix deadlock between snapshot deletion and quotas when backref
walking is called from context that already holds the same locks
- fsync fixes:
- fix fsync after succession of renames of different files
- fix fsync after succession of renames and unlink/rmdir"
* tag 'for-5.1-part1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (92 commits)
btrfs: Remove unnecessary casts in btrfs_read_root_item
Btrfs: remove assertion when searching for a key in a node/leaf
Btrfs: add missing error handling after doing leaf/node binary search
btrfs: drop the lock on error in btrfs_dev_replace_cancel
btrfs: ensure that a DUP or RAID1 block group has exactly two stripes
btrfs: init csum_list before possible free
Btrfs: remove no longer needed range length checks for deduplication
Btrfs: fix fsync after succession of renames and unlink/rmdir
Btrfs: fix fsync after succession of renames of different files
btrfs: honor path->skip_locking in backref code
btrfs: qgroup: Make qgroup async transaction commit more aggressive
btrfs: qgroup: Move reserved data accounting from btrfs_delayed_ref_head to btrfs_qgroup_extent_record
btrfs: scrub: remove unused nocow worker pointer
btrfs: scrub: add assertions for worker pointers
btrfs: scrub: convert scrub_workers_refcnt to refcount_t
btrfs: scrub: add scrub_lock lockdep check in scrub_workers_get
btrfs: scrub: fix circular locking dependency warning
btrfs: fix comment its device list mutex not volume lock
btrfs: extent_io: Kill the forward declaration of flush_write_bio
btrfs: Fix grossly misleading argument names in extent io search
...
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAlx5SzMACgkQnJ2qBz9k
QNnP0AgAl9HDtk436P4QPPFhdeXBB6uRTYU8wgZSropXMxoyUzqZotWkeUqUusHs
I8BAJeQojZeKUnMwET1/RA+dMLsgAlMcxkM3+3eDCLF9wh/+wO2B2G3ywh8KMVed
Sa3C4V/NZvZzAossoGDV/yWmK+ZYrrW8l/DM3LU54GV1NfAL+Khn4FNwtgWiYiP5
S4RkRflzwhIEZJSZByMlCLcsrHl/ehtMJuR1opUPY1c0CY8iAGcIobSSzVFqv4f5
ScRB56rnXqTt6CBGBcDIkWxWqEE9XuTdAn1AVMC7327UezyFNocXUZaqDzJcGcR/
rM3sf+seZftbA4nid8dIhSRoqAf4Yw==
=/DQr
-----END PGP SIGNATURE-----
Merge tag 'fsnotify_for_v5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fanotify updates from Jan Kara:
"Support for fanotify directory events and changes to make waiting for
fanotify permission event response killable"
* tag 'fsnotify_for_v5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (25 commits)
fanotify: Make waits for fanotify events only killable
fanotify: Use interruptible wait when waiting for permission events
fanotify: Track permission event state
fanotify: Simplify cleaning of access_list
fsnotify: Create function to remove event from notification list
fanotify: Move locking inside get_one_event()
fanotify: Fold dequeue_event() into process_access_response()
fanotify: Select EXPORTFS
fanotify: report FAN_ONDIR to listener with FAN_REPORT_FID
fanotify: add support for create/attrib/move/delete events
fanotify: support events with data type FSNOTIFY_EVENT_INODE
fanotify: check FS_ISDIR flag instead of d_is_dir()
fsnotify: report FS_ISDIR flag with MOVE_SELF and DELETE_SELF events
fanotify: use vfs_get_fsid() helper instead of vfs_statfs()
vfs: add vfs_get_fsid() helper
fanotify: cache fsid in fsnotify_mark_connector
fanotify: enable FAN_REPORT_FID init flag
fanotify: copy event fid info to user
fanotify: encode file identifier for FAN_REPORT_FID
fanotify: open code fill_event_metadata()
...
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAlx5SeAACgkQnJ2qBz9k
QNlzLAf/c+1o6fd4mH9uBMqEHwo+g7cKcr76j00h60bZpMJ0N/k91o8KtUKDixLJ
wG1o3FtaFyOpKXInjQOZZ83XQybjpDDFO67pCss/OYZ9bHtWM6ZfrYQzpxpIXu2E
7/FFjZV7MlugmnqJbvYRvMr2Tx7IrqOeWZ0ZIUMRnghuBarLpbiOqFaTbGlqS8e1
haFRhbxv0sA44YN9N40XVpg6P+cRsxJ4cHDSyQn4+X9CoYdKZ69utXyiiaV2L/Gc
iNYn2fkh7IDkgxF8imwHSLhvvAVangWWphhTX/XVnCPq0FKTRw9e2tRdt77IlDlX
w/GCHKnXaM6GnGDj4t83KV4yrdXGsQ==
=xrvx
-----END PGP SIGNATURE-----
Merge tag 'fs_for_v5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull ext2 and udf fixes from Jan Kara:
"A couple of fixes for udf and ext2. Namely:
- fix making ext2 mountable (again) with 64k blocksize
- fix for ext2 statx(2) handling
- fix for udf handling of corrupted filesystem so that it doesn't get
corrupted even further
- couple smaller ext2 and udf cleanups"
* tag 'fs_for_v5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
udf: Drop pointless check from udf_sync_fs()
ext2: support statx syscall
udf: disallow RW mount without valid integrity descriptor
udf: finalize integrity descriptor before writeback
udf: factor out LVID finalization for reuse
ext2: Fix underflow in ext2_max_size()
ext2: Fix a typo in comment
ext2: Remove redundant check for finding no group
ext2: Annotate implicit fall through in __ext2_truncate_blocks
ext2: Set superblock revision when enabling xattr feature
ext2: Remove redundant check on s_inode_size
ext2: set proper return code
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAlx5R3AACgkQnJ2qBz9k
QNlrLQf/f8puq1PgwvxuxnZATtKBWA0O84YCkIvf18LV9GsOIaYGBVOhpd3CNZ0u
WFKKaWxmrWlHtjKb43mAnZbGDLBE7uJmBe3CweIxg/Dgl3i0zvcI1Sz2vgyD3g+Q
cSW8KF8mmG53ltSpQV2NzQOSwtAGuBGfJt9b9aZ25Xl+Tpoq3PlRGNfA8oyVsL+f
iZeiJ9UxB4eRBhO0fEqhpyW1ZvNLoHF1U1qhJaVLK85tBnAAGvRQtlP1n4gFNNXP
/+Hhb0khunkhH5uXrXxYpxp5AX8mciqT28d0PPaFUxHIa4PDtgMDZoTkIjgFCusk
SqiL6TkPOovAG/27rBTH14L2ZMf7bw==
=8mdR
-----END PGP SIGNATURE-----
Merge tag 'dtype_for_v5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull dtype handling cleanups from Jan Kara:
"A reworked dtype cleanup patches based on your feedback to the
previous version of these.
Again the series includes only the generic code and ext2 cleanup as a
sample. The plan is to push cleanups for other filesystems separately
through respective trees once the generic code lands to reduce the
number of conflicts"
* tag 'dtype_for_v5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
ext2: use common file type conversion
fs: common implementation of file type
Here is the big driver core patchset for 5.1-rc1
More patches than "normal" here this merge window, due to some work in
the driver core by Alexander Duyck to rework the async probe
functionality to work better for a number of devices, and independant
work from Rafael for the device link functionality to make it work
"correctly".
Also in here is:
- lots of BUS_ATTR() removals, the macro is about to go away
- firmware test fixups
- ihex fixups and simplification
- component additions (also includes i915 patches)
- lots of minor coding style fixups and cleanups.
All of these have been in linux-next for a while with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXH+euQ8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ynyTgCfbV8CLums843sBnT8NnWrTMTdTCcAn1K4re0m
ep8g+6oRLxJy414hogxQ
=bLs2
-----END PGP SIGNATURE-----
Merge tag 'driver-core-5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the big driver core patchset for 5.1-rc1
More patches than "normal" here this merge window, due to some work in
the driver core by Alexander Duyck to rework the async probe
functionality to work better for a number of devices, and independant
work from Rafael for the device link functionality to make it work
"correctly".
Also in here is:
- lots of BUS_ATTR() removals, the macro is about to go away
- firmware test fixups
- ihex fixups and simplification
- component additions (also includes i915 patches)
- lots of minor coding style fixups and cleanups.
All of these have been in linux-next for a while with no reported
issues"
* tag 'driver-core-5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (65 commits)
driver core: platform: remove misleading err_alloc label
platform: set of_node in platform_device_register_full()
firmware: hardcode the debug message for -ENOENT
driver core: Add missing description of new struct device_link field
driver core: Fix PM-runtime for links added during consumer probe
drivers/component: kerneldoc polish
async: Add cmdline option to specify drivers to be async probed
driver core: Fix possible supplier PM-usage counter imbalance
PM-runtime: Fix __pm_runtime_set_status() race with runtime resume
driver: platform: Support parsing GpioInt 0 in platform_get_irq()
selftests: firmware: fix verify_reqs() return value
Revert "selftests: firmware: remove use of non-standard diff -Z option"
Revert "selftests: firmware: add CONFIG_FW_LOADER_USER_HELPER_FALLBACK to config"
device: Fix comment for driver_data in struct device
kernfs: Allocating memory for kernfs_iattrs with kmem_cache.
sysfs: remove unused include of kernfs-internal.h
driver core: Postpone DMA tear-down until after devres release
driver core: Document limitation related to DL_FLAG_RPM_ACTIVE
PM-runtime: Take suppliers into account in __pm_runtime_set_status()
device.h: Add __cold to dev_<level> logging functions
...
Merge misc updates from Andrew Morton:
- a few misc things
- ocfs2 updates
- most of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (159 commits)
tools/testing/selftests/proc/proc-self-syscall.c: remove duplicate include
proc: more robust bulk read test
proc: test /proc/*/maps, smaps, smaps_rollup, statm
proc: use seq_puts() everywhere
proc: read kernel cpu stat pointer once
proc: remove unused argument in proc_pid_lookup()
fs/proc/thread_self.c: code cleanup for proc_setup_thread_self()
fs/proc/self.c: code cleanup for proc_setup_self()
proc: return exit code 4 for skipped tests
mm,mremap: bail out earlier in mremap_to under map pressure
mm/sparse: fix a bad comparison
mm/memory.c: do_fault: avoid usage of stale vm_area_struct
writeback: fix inode cgroup switching comment
mm/huge_memory.c: fix "orig_pud" set but not used
mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC
mm/memcontrol.c: fix bad line in comment
mm/cma.c: cma_declare_contiguous: correct err handling
mm/page_ext.c: fix an imbalance with kmemleak
mm/compaction: pass pgdat to too_many_isolated() instead of zone
mm: remove zone_lru_lock() function, access ->lru_lock directly
...
Pull scheduler updates from Ingo Molnar:
"The main changes in this cycle were:
- refcount conversions
- Solve the rq->leaf_cfs_rq_list can of worms for real.
- improve power-aware scheduling
- add sysctl knob for Energy Aware Scheduling
- documentation updates
- misc other changes"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
kthread: Do not use TIMER_IRQSAFE
kthread: Convert worker lock to raw spinlock
sched/fair: Use non-atomic cpumask_{set,clear}_cpu()
sched/fair: Remove unused 'sd' parameter from select_idle_smt()
sched/wait: Use freezable_schedule() when possible
sched/fair: Prune, fix and simplify the nohz_balancer_kick() comment block
sched/fair: Explain LLC nohz kick condition
sched/fair: Simplify nohz_balancer_kick()
sched/topology: Fix percpu data types in struct sd_data & struct s_data
sched/fair: Simplify post_init_entity_util_avg() by calling it with a task_struct pointer argument
sched/fair: Fix O(nr_cgroups) in the load balancing path
sched/fair: Optimize update_blocked_averages()
sched/fair: Fix insertion in rq->leaf_cfs_rq_list
sched/fair: Add tmp_alone_branch assertion
sched/core: Use READ_ONCE()/WRITE_ONCE() in move_queued_task()/task_rq_lock()
sched/debug: Initialize sd_sysctl_cpus if !CONFIG_CPUMASK_OFFSTACK
sched/pelt: Skip updating util_est when utilization is higher than CPU's capacity
sched/fair: Update scale invariance of PELT
sched/fair: Move the rq_of() helper function
sched/core: Convert task_struct.stack_refcount to refcount_t
...
Pull locking updates from Ingo Molnar:
"The biggest part of this tree is the new auto-generated atomics API
wrappers by Mark Rutland.
The primary motivation was to allow instrumentation without uglifying
the primary source code.
The linecount increase comes from adding the auto-generated files to
the Git space as well:
include/asm-generic/atomic-instrumented.h | 1689 ++++++++++++++++--
include/asm-generic/atomic-long.h | 1174 ++++++++++---
include/linux/atomic-fallback.h | 2295 +++++++++++++++++++++++++
include/linux/atomic.h | 1241 +------------
I preferred this approach, so that the full call stack of the (already
complex) locking APIs is still fully visible in 'git grep'.
But if this is excessive we could certainly hide them.
There's a separate build-time mechanism to determine whether the
headers are out of date (they should never be stale if we do our job
right).
Anyway, nothing from this should be visible to regular kernel
developers.
Other changes:
- Add support for dynamic keys, which removes a source of false
positives in the workqueue code, among other things (Bart Van
Assche)
- Updates to tools/memory-model (Andrea Parri, Paul E. McKenney)
- qspinlock, wake_q and lockdep micro-optimizations (Waiman Long)
- misc other updates and enhancements"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
locking/lockdep: Shrink struct lock_class_key
locking/lockdep: Add module_param to enable consistency checks
lockdep/lib/tests: Test dynamic key registration
lockdep/lib/tests: Fix run_tests.sh
kernel/workqueue: Use dynamic lockdep keys for workqueues
locking/lockdep: Add support for dynamic keys
locking/lockdep: Verify whether lock objects are small enough to be used as class keys
locking/lockdep: Check data structure consistency
locking/lockdep: Reuse lock chains that have been freed
locking/lockdep: Fix a comment in add_chain_cache()
locking/lockdep: Introduce lockdep_next_lockchain() and lock_chain_count()
locking/lockdep: Reuse list entries that are no longer in use
locking/lockdep: Free lock classes that are no longer in use
locking/lockdep: Update two outdated comments
locking/lockdep: Make it easy to detect whether or not inside a selftest
locking/lockdep: Split lockdep_free_key_range() and lockdep_reset_lock()
locking/lockdep: Initialize the locks_before and locks_after lists earlier
locking/lockdep: Make zap_class() remove all matching lock order entries
locking/lockdep: Reorder struct lock_class members
locking/lockdep: Avoid that add_chain_cache() adds an invalid chain to the cache
...
Android uses ashmem for sharing memory regions. We are looking forward
to migrating all usecases of ashmem to memfd so that we can possibly
remove the ashmem driver in the future from staging while also
benefiting from using memfd and contributing to it. Note staging
drivers are also not ABI and generally can be removed at anytime.
One of the main usecases Android has is the ability to create a region
and mmap it as writeable, then add protection against making any
"future" writes while keeping the existing already mmap'ed
writeable-region active. This allows us to implement a usecase where
receivers of the shared memory buffer can get a read-only view, while
the sender continues to write to the buffer. See CursorWindow
documentation in Android for more details:
https://developer.android.com/reference/android/database/CursorWindow
This usecase cannot be implemented with the existing F_SEAL_WRITE seal.
To support the usecase, this patch adds a new F_SEAL_FUTURE_WRITE seal
which prevents any future mmap and write syscalls from succeeding while
keeping the existing mmap active.
A better way to do F_SEAL_FUTURE_WRITE seal was discussed [1] last week
where we don't need to modify core VFS structures to get the same
behavior of the seal. This solves several side-effects pointed by Andy.
self-tests are provided in later patch to verify the expected semantics.
[1] https://lore.kernel.org/lkml/20181111173650.GA256781@google.com/
Thanks a lot to Andy for suggestions to improve code.
Link: http://lkml.kernel.org/r/20190112203816.85534-2-joel@joelfernandes.org
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Acked-by: John Stultz <john.stultz@linaro.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: J. Bruce Fields <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Marc-Andr Lureau <marcandre.lureau@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Architectures like ppc64 require to do a conditional tlb flush based on
the old and new value of pte. Enable that by passing old pte value as
the arg.
Link: http://lkml.kernel.org/r/20190116085035.29729-3-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "NestMMU pte upgrade workaround for mprotect", v5.
We can upgrade pte access (R -> RW transition) via mprotect. We need to
make sure we follow the recommended pte update sequence as outlined in
commit bd5050e38a ("powerpc/mm/radix: Change pte relax sequence to
handle nest MMU hang") for such updates. This patch series does that.
This patch (of 5):
Some architectures may want to call flush_tlb_range from these helpers.
Link: http://lkml.kernel.org/r/20190116085035.29729-2-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "psi: pressure stall monitors", v3.
Android is adopting psi to detect and remedy memory pressure that
results in stuttering and decreased responsiveness on mobile devices.
Psi gives us the stall information, but because we're dealing with
latencies in the millisecond range, periodically reading the pressure
files to detect stalls in a timely fashion is not feasible. Psi also
doesn't aggregate its averages at a high enough frequency right now.
This patch series extends the psi interface such that users can
configure sensitive latency thresholds and use poll() and friends to be
notified when these are breached.
As high-frequency aggregation is costly, it implements an aggregation
method that is optimized for fast, short-interval averaging, and makes
the aggregation frequency adaptive, such that high-frequency updates
only happen while monitored stall events are actively occurring.
With these patches applied, Android can monitor for, and ward off,
mounting memory shortages before they cause problems for the user. For
example, using memory stall monitors in userspace low memory killer
daemon (lmkd) we can detect mounting pressure and kill less important
processes before device becomes visibly sluggish.
In our memory stress testing psi memory monitors produce roughly 10x
less false positives compared to vmpressure signals. Having ability to
specify multiple triggers for the same psi metric allows other parts of
Android framework to monitor memory state of the device and act
accordingly.
The new interface is straightforward. The user opens one of the
pressure files for writing and writes a trigger description into the
file descriptor that defines the stall state - some or full, and the
maximum stall time over a given window of time. E.g.:
/* Signal when stall time exceeds 100ms of a 1s window */
char trigger[] = "full 100000 1000000";
fd = open("/proc/pressure/memory");
write(fd, trigger, sizeof(trigger));
while (poll() >= 0) {
...
}
close(fd);
When the monitored stall state is entered, psi adapts its aggregation
frequency according to what the configured time window requires in order
to emit event signals in a timely fashion. Once the stalling subsides,
aggregation reverts back to normal.
The trigger is associated with the open file descriptor. To stop
monitoring, the user only needs to close the file descriptor and the
trigger is discarded.
Patches 1-4 prepare the psi code for polling support. Patch 5
implements the adaptive polling logic, the pressure growth detection
optimized for short intervals, and hooks up write() and poll() on the
pressure files.
The patches were developed in collaboration with Johannes Weiner.
This patch (of 5):
Kernfs has a standardized poll/notification mechanism for waking all
pollers on all fds when a filesystem node changes. To allow polling for
custom events, add a .poll callback that can override the default.
This is in preparation for pollable cgroup pressure files which have
per-fd trigger configurations.
Link: http://lkml.kernel.org/r/20190124211518.244221-2-surenb@google.com
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move the memcg_kmem_enabled() checks into memcg kmem charge/uncharge
functions, so, the users don't have to explicitly check that condition.
This is purely code cleanup patch without any functional change. Only
the order of checks in memcg_charge_slab() can potentially be changed
but the functionally it will be same. This should not matter as
memcg_charge_slab() is not in the hot path.
Link: http://lkml.kernel.org/r/20190103161203.162375-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
PG_balloon was introduced to implement page migration/compaction for
pages inflated in virtio-balloon. Nowadays, it is only a marker that a
page is part of virtio-balloon and therefore logically offline.
We also want to make use of this flag in other balloon drivers - for
inflated pages or when onlining a section but keeping some pages offline
(e.g. used right now by XEN and Hyper-V via set_online_page_callback()).
We are going to expose this flag to dump tools like makedumpfile. But
instead of exposing PG_balloon, let's generalize the concept of marking
pages as logically offline, so it can be reused for other purposes later
on.
Rename PG_balloon to PG_offline. This is an indicator that the page is
logically offline, the content stale and that it should not be touched
(e.g. a hypervisor would have to allocate backing storage in order for
the guest to dump an unused page). We can then e.g. exclude such pages
from dumps.
We replace and reuse KPF_BALLOON (23), as this shouldn't really harm
(and for now the semantics stay the same). In following patches, we
will make use of this bit also in other balloon drivers. While at it,
document PGTABLE.
[akpm@linux-foundation.org: fix comment text, per David]
Link: http://lkml.kernel.org/r/20181119101616.8901-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Konstantin Khlebnikov <koct9i@gmail.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Pankaj gupta <pagupta@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Christian Hansen <chansen3@cisco.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Miles Chen <miles.chen@mediatek.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Julien Freche <jfreche@vmware.com>
Cc: Kairui Song <kasong@redhat.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Lianbo Jiang <lijiang@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Xavier Deguillard <xdeguillard@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It seems that commits 5f16f3225b and 00a1a053eb, both with same
commitlog ("ext4: atomically set inode->i_flags in ext4_set_inode_flags()")
introduced the set_mask_bits API, but somehow missed not using it in ext4
in the end.
Also, set_mask_bits() is used in fs quite a bit and we can possibly come
up with a generic llsc based implementation (w/o the cmpxchg loop)
Link: http://lkml.kernel.org/r/1548275584-18096-3-git-send-email-vgupta@synopsys.com
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Reviewed-by: Anthony Yznaga <anthony.yznaga@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Update the code to use a zero-sized array instead of a pointer in
structure ocfs2_slot_info and use struct_size() in kzalloc().
Notice that one of the more common cases of allocation size calculations
is finding the size of a structure that has a zero-sized array at the
end, along with memory for some number of elements for that array. For
example:
struct foo {
int stuff;
void *entry[];
};
instance = kzalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);
Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:
instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL);
This code was detected with the help of Coccinelle.
Link: http://lkml.kernel.org/r/20190108191903.GA22056@embeddedor
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The user reported this problem, the upper application IO was timeout
when fstrim was running on this ocfs2 partition. the application
monitoring resource agent considered that this application did not work,
then this node was fenced by the cluster brain (e.g. pacemaker).
The root cause is that fstrim thread always holds main_bm meta-file
related locks until all the cluster groups are trimmed. This patch will
make fstrim thread release main_bm meta-file related locks when each
cluster group is trimmed, this will let the current application IO has a
chance to claim the clusters from main_bm meta-file.
Link: http://lkml.kernel.org/r/20190111090014.31645-1-ghe@suse.com
Signed-off-by: Gang He <ghe@suse.com>
Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the process of creating a node, it will cause NULL pointer
dereference in kernel if o2cb_ctl failed in the interval (mkdir,
o2cb_set_node_attribute(node_num)] in function o2cb_add_node.
The node num is initialized to 0 in function o2nm_node_group_make_item,
o2nm_node_group_drop_item will mistake the node number 0 for a valid
node number when we delete the node before the node number is set
correctly. If the local node number of the current host happens to be
0, cluster->cl_local_node will be set to O2NM_INVALID_NODE_NUM while
o2hb_thread still running. The panic stack is generated as follows:
o2hb_thread
\-o2hb_do_disk_heartbeat
\-o2hb_check_own_slot
|-slot = ®->hr_slots[o2nm_this_node()];
//o2nm_this_node() return O2NM_INVALID_NODE_NUM
We need to check whether the node number is set when we delete the node.
Link: http://lkml.kernel.org/r/133d8045-72cc-863e-8eae-5013f9f6bc51@huawei.com
Signed-off-by: Jia Guo <guojia12@huawei.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Acked-by: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull year 2038 updates from Thomas Gleixner:
"Another round of changes to make the kernel ready for 2038. After lots
of preparatory work this is the first set of syscalls which are 2038
safe:
403 clock_gettime64
404 clock_settime64
405 clock_adjtime64
406 clock_getres_time64
407 clock_nanosleep_time64
408 timer_gettime64
409 timer_settime64
410 timerfd_gettime64
411 timerfd_settime64
412 utimensat_time64
413 pselect6_time64
414 ppoll_time64
416 io_pgetevents_time64
417 recvmmsg_time64
418 mq_timedsend_time64
419 mq_timedreceiv_time64
420 semtimedop_time64
421 rt_sigtimedwait_time64
422 futex_time64
423 sched_rr_get_interval_time64
The syscall numbers are identical all over the architectures"
* 'timers-2038-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
riscv: Use latest system call ABI
checksyscalls: fix up mq_timedreceive and stat exceptions
unicore32: Fix __ARCH_WANT_STAT64 definition
asm-generic: Make time32 syscall numbers optional
asm-generic: Drop getrlimit and setrlimit syscalls from default list
32-bit userspace ABI: introduce ARCH_32BIT_OFF_T config option
compat ABI: use non-compat openat and open_by_handle_at variants
y2038: add 64-bit time_t syscalls to all 32-bit architectures
y2038: rename old time and utime syscalls
y2038: remove struct definition redirects
y2038: use time32 syscall names on 32-bit
syscalls: remove obsolete __IGNORE_ macros
y2038: syscalls: rename y2038 compat syscalls
x86/x32: use time64 versions of sigtimedwait and recvmmsg
timex: change syscalls to use struct __kernel_timex
timex: use __kernel_timex internally
sparc64: add custom adjtimex/clock_adjtime functions
time: fix sys_timer_settime prototype
time: Add struct __kernel_timex
time: make adjtime compat handling available for 32 bit
...
Pull irq updates from Thomas Gleixner:
"The interrupt departement delivers this time:
- New infrastructure to manage NMIs on platforms which have a sane
NMI delivery, i.e. identifiable NMI vectors instead of a single
lump.
- Simplification of the interrupt affinity management so drivers
don't have to implement ugly loops around the PCI/MSI enablement.
- Speedup for interrupt statistics in /proc/stat
- Provide a function to retrieve the default irq domain
- A new interrupt controller for the Loongson LS1X platform
- Affinity support for the SiFive PLIC
- Better support for the iMX irqsteer driver
- NUMA aware memory allocations for GICv3
- The usual small fixes, improvements and cleanups all over the
place"
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
irqchip/imx-irqsteer: Add multi output interrupts support
irqchip/imx-irqsteer: Change to use reg_num instead of irq_group
dt-bindings: irq: imx-irqsteer: Add multi output interrupts support
dt-binding: irq: imx-irqsteer: Use irq number instead of group number
irqchip/brcmstb-l2: Use _irqsave locking variants in non-interrupt code
irqchip/gicv3-its: Use NUMA aware memory allocation for ITS tables
irqdomain: Allow the default irq domain to be retrieved
irqchip/sifive-plic: Implement irq_set_affinity() for SMP host
irqchip/sifive-plic: Differentiate between PLIC handler and context
irqchip/sifive-plic: Add warning in plic_init() if handler already present
irqchip/sifive-plic: Pre-compute context hart base and enable base
PCI/MSI: Remove obsolete sanity checks for multiple interrupt sets
genirq/affinity: Remove the leftovers of the original set support
nvme-pci: Simplify interrupt allocation
genirq/affinity: Add new callback for (re)calculating interrupt sets
genirq/affinity: Store interrupt sets size in struct irq_affinity
genirq/affinity: Code consolidation
irqchip/irq-sifive-plic: Check and continue in case of an invalid cpuid.
irqchip/i8259: Fix shutdown order by moving syscore_ops registration
dt-bindings: interrupt-controller: loongson ls1x intc
...
We're (finally) phasing out a.out support for good. As Borislav Petkov
points out, we've supported ELF binaries for about 25 years by now, and
coredumping in particular has bitrotted over the years.
None of the tool chains even support generating a.out binaries any more,
and the plan is to deprecate a.out support entirely for the kernel. But
I want to start with just removing the core dumping code, because I can
still imagine that somebody actually might want to support a.out as a
simpler biinary format.
Particularly if you generate some random binaries on the fly, ELF is a
much more complicated format (admittedly ELF also does have a lot of
toolchain support, mitigating that complexity a lot and you really
should have moved over in the last 25 years).
So it's at least somewhat possible that somebody out there has some
workflow that still involves generating and running a.out executables.
In contrast, it's very unlikely that anybody depends on debugging any
legacy a.out core files. But regardless, I want this phase-out to be
done in two steps, so that we can resurrect a.out support (if needed)
without having to resurrect the core file dumping that is almost
certainly not needed.
Jann Horn pointed to the <asm/a.out-core.h> file that my first trivial
cut at this had missed.
And Alan Cox points out that the a.out binary loader _could_ be done in
user space if somebody wants to, but we might keep just the loader in
the kernel if somebody really wants it, since the loader isn't that big
and has no really odd special cases like the core dumping does.
Acked-by: Borislav Petkov <bp@alien8.de>
Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
Cc: Jann Horn <jannh@google.com>
Cc: Richard Weinberger <richard@nod.at>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull crypto update from Herbert Xu:
"API:
- Add helper for simple skcipher modes.
- Add helper to register multiple templates.
- Set CRYPTO_TFM_NEED_KEY when setkey fails.
- Require neither or both of export/import in shash.
- AEAD decryption test vectors are now generated from encryption
ones.
- New option CONFIG_CRYPTO_MANAGER_EXTRA_TESTS that includes random
fuzzing.
Algorithms:
- Conversions to skcipher and helper for many templates.
- Add more test vectors for nhpoly1305 and adiantum.
Drivers:
- Add crypto4xx prng support.
- Add xcbc/cmac/ecb support in caam.
- Add AES support for Exynos5433 in s5p.
- Remove sha384/sha512 from artpec7 as hardware cannot do partial
hash"
[ There is a merge of the Freescale SoC tree in order to pull in changes
required by patches to the caam/qi2 driver. ]
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (174 commits)
crypto: s5p - add AES support for Exynos5433
dt-bindings: crypto: document Exynos5433 SlimSSS
crypto: crypto4xx - add missing of_node_put after of_device_is_available
crypto: cavium/zip - fix collision with generic cra_driver_name
crypto: af_alg - use struct_size() in sock_kfree_s()
crypto: caam - remove redundant likely/unlikely annotation
crypto: s5p - update iv after AES-CBC op end
crypto: x86/poly1305 - Clear key material from stack in SSE2 variant
crypto: caam - generate hash keys in-place
crypto: caam - fix DMA mapping xcbc key twice
crypto: caam - fix hash context DMA unmap size
hwrng: bcm2835 - fix probe as platform device
crypto: s5p-sss - Use AES_BLOCK_SIZE define instead of number
crypto: stm32 - drop pointless static qualifier in stm32_hash_remove()
crypto: chelsio - Fixed Traffic Stall
crypto: marvell - Remove set but not used variable 'ivsize'
crypto: ccp - Update driver messages to remove some confusion
crypto: adiantum - add 1536 and 4096-byte test vectors
crypto: nhpoly1305 - add a test vector with len % 16 != 0
crypto: arm/aes-ce - update IV after partial final CTR block
...
Pull networking updates from David Miller:
"Here we go, another merge window full of networking and #ebpf changes:
1) Snoop DHCPACKS in batman-adv to learn MAC/IP pairs in the DHCP
range without dealing with floods of ARP traffic, from Linus
Lüssing.
2) Throttle buffered multicast packet transmission in mt76, from
Felix Fietkau.
3) Support adaptive interrupt moderation in ice, from Brett Creeley.
4) A lot of struct_size conversions, from Gustavo A. R. Silva.
5) Add peek/push/pop commands to bpftool, as well as bash completion,
from Stanislav Fomichev.
6) Optimize sk_msg_clone(), from Vakul Garg.
7) Add SO_BINDTOIFINDEX, from David Herrmann.
8) Be more conservative with local resends due to local congestion,
from Yuchung Cheng.
9) Allow vetoing of unsupported VXLAN FDBs, from Petr Machata.
10) Add health buffer support to devlink, from Eran Ben Elisha.
11) Add TXQ scheduling API to mac80211, from Toke Høiland-Jørgensen.
12) Add statistics to basic packet scheduler filter, from Cong Wang.
13) Add GRE tunnel support for mlxsw Spectrum-2, from Nir Dotan.
14) Lots of new IP tunneling forwarding tests, also from Nir Dotan.
15) Add 3ad stats to bonding, from Nikolay Aleksandrov.
16) Lots of probing improvements for bpftool, from Quentin Monnet.
17) Various nfp drive #ebpf JIT improvements from Jakub Kicinski.
18) Allow #ebpf programs to access gso_segs from skb shared info, from
Eric Dumazet.
19) Add sock_diag support for AF_XDP sockets, from Björn Töpel.
20) Support 22260 iwlwifi devices, from Luca Coelho.
21) Use rbtree for ipv6 defragmentation, from Peter Oskolkov.
22) Add JMP32 instruction class support to #ebpf, from Jiong Wang.
23) Add spinlock support to #ebpf, from Alexei Starovoitov.
24) Support 256-bit keys and TLS 1.3 in ktls, from Dave Watson.
25) Add device infomation API to devlink, from Jakub Kicinski.
26) Add new timestamping socket options which are y2038 safe, from
Deepa Dinamani.
27) Add RX checksum offloading for various sh_eth chips, from Sergei
Shtylyov.
28) Flow offload infrastructure, from Pablo Neira Ayuso.
29) Numerous cleanups, improvements, and bug fixes to the PHY layer
and many drivers from Heiner Kallweit.
30) Lots of changes to try and make packet scheduler classifiers run
lockless as much as possible, from Vlad Buslov.
31) Support BCM957504 chip in bnxt_en driver, from Erik Burrows.
32) Add concurrency tests to tc-tests infrastructure, from Vlad
Buslov.
33) Add hwmon support to aquantia, from Heiner Kallweit.
34) Allow 64-bit values for SO_MAX_PACING_RATE, from Eric Dumazet.
And I would be remiss if I didn't thank the various major networking
subsystem maintainers for integrating much of this work before I even
saw it. Alexei Starovoitov, Daniel Borkmann, Pablo Neira Ayuso,
Johannes Berg, Kalle Valo, and many others. Thank you!"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2207 commits)
net/sched: avoid unused-label warning
net: ignore sysctl_devconf_inherit_init_net without SYSCTL
phy: mdio-mux: fix Kconfig dependencies
net: phy: use phy_modify_mmd_changed in genphy_c45_an_config_aneg
net: dsa: mv88e6xxx: add call to mv88e6xxx_ports_cmode_init to probe for new DSA framework
selftest/net: Remove duplicate header
sky2: Disable MSI on Dell Inspiron 1545 and Gateway P-79
net/mlx5e: Update tx reporter status in case channels were successfully opened
devlink: Add support for direct reporter health state update
devlink: Update reporter state to error even if recover aborted
sctp: call iov_iter_revert() after sending ABORT
team: Free BPF filter when unregistering netdev
ip6mr: Do not call __IP6_INC_STATS() from preemptible context
isdn: mISDN: Fix potential NULL pointer dereference of kzalloc
net: dsa: mv88e6xxx: support in-band signalling on SGMII ports with external PHYs
cxgb4/chtls: Prefix adapter flags with CXGB4
net-sysfs: Switch to bitmap_zalloc()
mellanox: Switch to bitmap_zalloc()
bpf: add test cases for non-pointer sanitiation logic
mlxsw: i2c: Extend initialization by querying resources data
...
The current implementation of splice() and tee() ignores O_NONBLOCK set
on pipe file descriptors and checks only the SPLICE_F_NONBLOCK flag for
blocking on pipe arguments. This is inconsistent since splice()-ing
from/to non-pipe file descriptors does take O_NONBLOCK into
consideration.
Fix this by promoting O_NONBLOCK, when set on a pipe, to
SPLICE_F_NONBLOCK.
Some context for how the current implementation of splice() leads to
inconsistent behavior. In the ongoing work[1] to add VM tracing
capability to trace-cmd we stream tracing data over named FIFOs or
vsockets from guests back to the host.
When we receive SIGINT from user to stop tracing, we set O_NONBLOCK on
the input file descriptor and set SPLICE_F_NONBLOCK for the next call to
splice(). If splice() was blocked waiting on data from the input FIFO,
after SIGINT splice() restarts with the same arguments (no
SPLICE_F_NONBLOCK) and blocks again instead of returning -EAGAIN when no
data is available.
This differs from the splice() behavior when reading from a vsocket or
when we're doing a traditional read()/write() loop (trace-cmd's
--nosplice argument).
With this patch applied we get the same behavior in all situations after
setting O_NONBLOCK which also matches the behavior of doing a
read()/write() loop instead of splice().
This change does have potential of breaking users who don't expect
EAGAIN from splice() when SPLICE_F_NONBLOCK is not set. OTOH programs
that set O_NONBLOCK and don't anticipate EAGAIN are arguably buggy[2].
[1] https://github.com/skaslev/trace-cmd/tree/vsock
[2] d47e3da175/fs/read_write.c (L1425)
Signed-off-by: Slavomir Kaslev <kaslevs@vmware.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull vfs fixes from Al Viro:
"Assorted fixes that sat in -next for a while, all over the place"
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
aio: Fix locking in aio_poll()
exec: Fix mem leak in kernel_read_file
copy_mount_string: Limit string length to PATH_MAX
cgroup: saner refcounting for cgroup_root
fix cgroup_do_mount() handling of failure exits
Every in-kernel use of this function defined it to KERNEL_DS (either as
an actual define, or as an inline function). It's an entirely
historical artifact, and long long long ago used to actually read the
segment selector valueof '%ds' on x86.
Which in the kernel is always KERNEL_DS.
Inspired by a patch from Jann Horn that just did this for a very small
subset of users (the ones in fs/), along with Al who suggested a script.
I then just took it to the logical extreme and removed all the remaining
gunk.
Roughly scripted with
git grep -l '(get_ds())' -- :^tools/ | xargs sed -i 's/(get_ds())/(KERNEL_DS)/'
git grep -lw 'get_ds' -- :^tools/ | xargs sed -i '/^#define get_ds()/d'
plus manual fixups to remove a few unusual usage patterns, the couple of
inline function cases and to fix up a comment that had become stale.
The 'get_ds()' function remains in an x86 kvm selftest, since in user
space it actually does something relevant.
Inspired-by: Jann Horn <jannh@google.com>
Inspired-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Al Viro root-caused a race where the IOCB_CMD_POLL handling of
fget/fput() could cause us to access the file pointer after it had
already been freed:
"In more details - normally IOCB_CMD_POLL handling looks so:
1) io_submit(2) allocates aio_kiocb instance and passes it to
aio_poll()
2) aio_poll() resolves the descriptor to struct file by req->file =
fget(iocb->aio_fildes)
3) aio_poll() sets ->woken to false and raises ->ki_refcnt of that
aio_kiocb to 2 (bumps by 1, that is).
4) aio_poll() calls vfs_poll(). After sanity checks (basically,
"poll_wait() had been called and only once") it locks the queue.
That's what the extra reference to iocb had been for - we know we
can safely access it.
5) With queue locked, we check if ->woken has already been set to
true (by aio_poll_wake()) and, if it had been, we unlock the
queue, drop a reference to aio_kiocb and bugger off - at that
point it's a responsibility to aio_poll_wake() and the stuff
called/scheduled by it. That code will drop the reference to file
in req->file, along with the other reference to our aio_kiocb.
6) otherwise, we see whether we need to wait. If we do, we unlock the
queue, drop one reference to aio_kiocb and go away - eventual
wakeup (or cancel) will deal with the reference to file and with
the other reference to aio_kiocb
7) otherwise we remove ourselves from waitqueue (still under the
queue lock), so that wakeup won't get us. No async activity will
be happening, so we can safely drop req->file and iocb ourselves.
If wakeup happens while we are in vfs_poll(), we are fine - aio_kiocb
won't get freed under us, so we can do all the checks and locking
safely. And we don't touch ->file if we detect that case.
However, vfs_poll() most certainly *does* touch the file it had been
given. So wakeup coming while we are still in ->poll() might end up
doing fput() on that file. That case is not too rare, and usually we
are saved by the still present reference from descriptor table - that
fput() is not the final one.
But if another thread closes that descriptor right after our fget()
and wakeup does happen before ->poll() returns, we are in trouble -
final fput() done while we are in the middle of a method:
Al also wrote a patch to take an extra reference to the file descriptor
to fix this, but I instead suggested we just streamline the whole file
pointer handling by submit_io() so that the generic aio submission code
simply keeps the file pointer around until the aio has completed.
Fixes: bfe4037e72 ("aio: implement IOCB_CMD_POLL")
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Reported-by: syzbot+503d4cc169fcec1cb18c@syzkaller.appspotmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge misc fixes from Andrew Morton:
"2 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
hugetlbfs: fix races and page leaks during migration
kasan: turn off asan-stack for clang-8 and earlier
hugetlb pages should only be migrated if they are 'active'. The
routines set/clear_page_huge_active() modify the active state of hugetlb
pages.
When a new hugetlb page is allocated at fault time, set_page_huge_active
is called before the page is locked. Therefore, another thread could
race and migrate the page while it is being added to page table by the
fault code. This race is somewhat hard to trigger, but can be seen by
strategically adding udelay to simulate worst case scheduling behavior.
Depending on 'how' the code races, various BUG()s could be triggered.
To address this issue, simply delay the set_page_huge_active call until
after the page is successfully added to the page table.
Hugetlb pages can also be leaked at migration time if the pages are
associated with a file in an explicitly mounted hugetlbfs filesystem.
For example, consider a two node system with 4GB worth of huge pages
available. A program mmaps a 2G file in a hugetlbfs filesystem. It
then migrates the pages associated with the file from one node to
another. When the program exits, huge page counts are as follows:
node0
1024 free_hugepages
1024 nr_hugepages
node1
0 free_hugepages
1024 nr_hugepages
Filesystem Size Used Avail Use% Mounted on
nodev 4.0G 2.0G 2.0G 50% /var/opt/hugepool
That is as expected. 2G of huge pages are taken from the free_hugepages
counts, and 2G is the size of the file in the explicitly mounted
filesystem. If the file is then removed, the counts become:
node0
1024 free_hugepages
1024 nr_hugepages
node1
1024 free_hugepages
1024 nr_hugepages
Filesystem Size Used Avail Use% Mounted on
nodev 4.0G 2.0G 2.0G 50% /var/opt/hugepool
Note that the filesystem still shows 2G of pages used, while there
actually are no huge pages in use. The only way to 'fix' the filesystem
accounting is to unmount the filesystem
If a hugetlb page is associated with an explicitly mounted filesystem,
this information in contained in the page_private field. At migration
time, this information is not preserved. To fix, simply transfer
page_private from old to new page at migration time if necessary.
There is a related race with removing a huge page from a file and
migration. When a huge page is removed from the pagecache, the
page_mapping() field is cleared, yet page_private remains set until the
page is actually freed by free_huge_page(). A page could be migrated
while in this state. However, since page_mapping() is not set the
hugetlbfs specific routine to transfer page_private is not called and we
leak the page count in the filesystem.
To fix that, check for this condition before migrating a huge page. If
the condition is detected, return EBUSY for the page.
Link: http://lkml.kernel.org/r/74510272-7319-7372-9ea6-ec914734c179@oracle.com
Link: http://lkml.kernel.org/r/20190212221400.3512-1-mike.kravetz@oracle.com
Fixes: bcc5422230 ("mm: hugetlb: introduce page_huge_active")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: <stable@vger.kernel.org>
[mike.kravetz@oracle.com: v2]
Link: http://lkml.kernel.org/r/7534d322-d782-8ac6-1c8d-a8dc380eb3ab@oracle.com
[mike.kravetz@oracle.com: update comment and changelog]
Link: http://lkml.kernel.org/r/420bcfd6-158b-38e4-98da-26d0cd85bd01@oracle.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Effective revert commit:
87709e28dc ("fs/locks: Use percpu_down_read_preempt_disable()")
This is causing major pain for PREEMPT_RT.
Sebastian did a lot of lockperf runs on 2 and 4 node machines with all
preemption modes (PREEMPT=n should be an obvious NOP for this patch
and thus serves as a good control) and no results showed significance
over 2-sigma (the PREEMPT=n results were almost empty at 1-sigma).
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When a cell with a volume location server list is added manually by
echoing the details into /proc/net/afs/cells, a record is added but the
flag saying it has been looked up isn't set.
This causes the VL server rotation code to wait forever, with the top of
/proc/pid/stack looking like:
afs_select_vlserver+0x3a6/0x6f3
afs_vl_lookup_vldb+0x4b/0x92
afs_create_volume+0x25/0x1b9
...
with the thread stuck in afs_start_vl_iteration() waiting for
AFS_CELL_FL_NO_LOOKUP_YET to be cleared.
Fix this by clearing AFS_CELL_FL_NO_LOOKUP_YET when setting up a record
if that record's details were supplied manually.
Fixes: 0a5143f2f8 ("afs: Implement VL server rotation")
Reported-by: Dave Botsch <dwb7@cornell.edu>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit 9da3f2b740.
It was well-intentioned, but wrong. Overriding the exception tables for
instructions for random reasons is just wrong, and that is what the new
code did.
It caused problems for tracing, and it caused problems for strncpy_from_user(),
because the new checks made perfectly valid use cases break, rather than
catch things that did bad things.
Unchecked user space accesses are a problem, but that's not a reason to
add invalid checks that then people have to work around with silly flags
(in this case, that 'kernel_uaccess_faults_ok' flag, which is just an
odd way to say "this commit was wrong" and was sprinked into random
places to hide the wrongness).
The real fix to unchecked user space accesses is to get rid of the
special "let's not check __get_user() and __put_user() at all" logic.
Make __{get|put}_user() be just aliases to the regular {get|put}_user()
functions, and make it impossible to access user space without having
the proper checks in places.
The raison d'être of the special double-underscore versions used to be
that the range check was expensive, and if you did multiple user
accesses, you'd do the range check up front (like the signal frame
handling code, for example). But SMAP (on x86) and PAN (on ARM) have
made that optimization pointless, because the _real_ expense is the "set
CPU flag to allow user space access".
Do let's not break the valid cases to catch invalid cases that shouldn't
even exist.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is a messy cast here:
min_t(int, len, (int)sizeof(*item)));
min_t() should normally cast to unsigned. It's not possible for "len"
to be negative, but if it were then we definitely wouldn't want to pass
negatives to read_extent_buffer(). Also there is an extra cast.
This patch shouldn't affect runtime, it's just a clean up.
Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At ctree.c:key_search(), the assertion that verifies the first key on a
child extent buffer corresponds to the key at a specific slot in the
parent has a disadvantage: we effectively hit a BUG_ON() which requires
rebooting the machine later. It also does not tell any information about
which extent buffer is affected, from which root, the expected and found
keys, etc.
However as of commit 581c176041 ("btrfs: Validate child tree block's
level and first key"), that assertion is not needed since at the time we
read an extent buffer from disk we validate that its first key matches the
key, at the respective slot, in the parent extent buffer. Therefore just
remove the assertion at key_search().
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function map_private_extent_buffer() can return an -EINVAL error, and
it is called by generic_bin_search() which will return back the error. The
btrfs_bin_search() function in turn calls generic_bin_search() and the
key_search() function calls btrfs_bin_search(), so both can return the
-EINVAL error coming from the map_private_extent_buffer() function. Some
callers of these functions were ignoring that these functions can return
an error, so fix them to deal with error return values.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We should drop the lock on this error path. This has been found by a
static tool.
The lock needs to be released, it's there to protect access to the
dev_replace members and is not supposed to be left locked. The value of
state that's being switched would need to be artifically changed to an
invalid value so the default: branch is taken.
Fixes: d189dd70e2 ("btrfs: fix use-after-free due to race between replace start and cancel")
CC: stable@vger.kernel.org # 5.0+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We recently had a customer issue with a corrupted filesystem. When
trying to mount this image btrfs panicked with a division by zero in
calc_stripe_length().
The corrupt chunk had a 'num_stripes' value of 1. calc_stripe_length()
takes this value and divides it by the number of copies the RAID profile
is expected to have to calculate the amount of data stripes. As a DUP
profile is expected to have 2 copies this division resulted in 1/2 = 0.
Later then the 'data_stripes' variable is used as a divisor in the
stripe length calculation which results in a division by 0 and thus a
kernel panic.
When encountering a filesystem with a DUP block group and a
'num_stripes' value unequal to 2, refuse mounting as the image is
corrupted and will lead to unexpected behaviour.
Code inspection showed a RAID1 block group has the same issues.
Fixes: e06cd3dd7c ("Btrfs: add validadtion checks for chunk loading")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The scrub_ctx csum_list member must be initialized before scrub_free_ctx
is called. If the csum_list is not initialized beforehand, the
list_empty call in scrub_free_csums will result in a null deref if the
allocation fails in the for loop.
Fixes: a2de733c78 ("btrfs: scrub")
CC: stable@vger.kernel.org # 3.0+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Dan Robertson <dan@dlrobertson.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Comparing the content of the pages in the range to deduplicate is now
done in generic_remap_checks called by the generic helper
generic_remap_file_range_prep(), which takes care of ensuring we do not
compare/deduplicate undefined data beyond a file's EOF (range from EOF
to the next block boundary). So remove these checks which are now
redundant.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After a succession of renames operations of different files and unlinking
one of them, if we fsync one of the renamed files we can end up with a
log that will either fail to replay at mount time or result in a filesystem
that is in an inconsistent state. One example scenario:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir /mnt/testdir
$ touch /mnt/testdir/fname1
$ touch /mnt/testdir/fname2
$ sync
$ mv /mnt/testdir/fname1 /mnt/testdir/fname3
$ rm -f /mnt/testdir/fname2
$ ln /mnt/testdir/fname3 /mnt/testdir/fname2
$ touch /mnt/testdir/fname1
$ xfs_io -c "fsync" /mnt/testdir/fname1
<power failure>
$ mount /dev/sdb /mnt
$ umount /mnt
$ btrfs check /dev/sdb
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space cache
[4/7] checking fs roots
root 5 inode 259 errors 2, no orphan item
ERROR: errors found in fs roots
Opening filesystem to check...
Checking filesystem on /dev/sdc
UUID: 20e4abb8-5a19-4492-8bb4-6084125c2d0d
found 393216 bytes used, error(s) found
total csum bytes: 0
total tree bytes: 131072
total fs tree bytes: 32768
total extent tree bytes: 16384
btree space waste bytes: 122986
file data blocks allocated: 262144
referenced 262144
On a kernel without the first patch in this series, titled
"[PATCH] Btrfs: fix fsync after succession of renames of different files",
we get instead an error when mounting the filesystem due to failure of
replaying the log:
$ mount /dev/sdb /mnt
mount: mount /dev/sdb on /mnt failed: File exists
Fix this by logging the parent directory of an inode whenever we find an
inode that no longer exists (was unlinked in the current transaction),
during the procedure which finds inodes that have old names that collide
with new names of other inodes.
A test case for fstests follows soon.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After a succession of rename operations of different files and fsyncing
one of them, such that each file gets a new name that corresponds to an
old name of another file, we can end up with a log that will cause a
failure when attempted to replay at mount time (an EEXIST error).
We currently have correct behaviour when such succession of renames
involves only two files, but if there are more files involved, we end up
not logging all the inodes that are needed, therefore resulting in a
failure when attempting to replay the log.
Example:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir /mnt/testdir
$ touch /mnt/testdir/fname1
$ touch /mnt/testdir/fname2
$ sync
$ mv /mnt/testdir/fname1 /mnt/testdir/fname3
$ mv /mnt/testdir/fname2 /mnt/testdir/fname4
$ ln /mnt/testdir/fname3 /mnt/testdir/fname2
$ touch /mnt/testdir/fname1
$ xfs_io -c "fsync" /mnt/testdir/fname1
<power failure>
$ mount /dev/sdb /mnt
mount: mount /dev/sdb on /mnt failed: File exists
So fix this by checking all inode dependencies when logging an inode. That
is, if one logged inode A has a new name that matches the old name of some
other inode B, check if inode B has a new name that matches the old name
of some other inode C, and so on. This fix is implemented not by doing any
recursive function calls but by using an iterative method using a linked
list that is used in a first-in-first-out fashion.
A test case for fstests follows soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qgroups will do the old roots lookup at delayed ref time, which could be
while walking down the extent root while running a delayed ref. This
should be fine, except we specifically lock eb's in the backref walking
code irrespective of path->skip_locking, which deadlocks the system.
Fix up the backref code to honor path->skip_locking, nobody will be
modifying the commit_root when we're searching so it's completely safe
to do.
This happens since fb235dc06f ("btrfs: qgroup: Move half of the qgroup
accounting time out of commit trans"), kernel may lockup with quota
enabled.
There is one backref trace triggered by snapshot dropping along with
write operation in the source subvolume. The example can be reliably
reproduced:
btrfs-cleaner D 0 4062 2 0x80000000
Call Trace:
schedule+0x32/0x90
btrfs_tree_read_lock+0x93/0x130 [btrfs]
find_parent_nodes+0x29b/0x1170 [btrfs]
btrfs_find_all_roots_safe+0xa8/0x120 [btrfs]
btrfs_find_all_roots+0x57/0x70 [btrfs]
btrfs_qgroup_trace_extent_post+0x37/0x70 [btrfs]
btrfs_qgroup_trace_leaf_items+0x10b/0x140 [btrfs]
btrfs_qgroup_trace_subtree+0xc8/0xe0 [btrfs]
do_walk_down+0x541/0x5e3 [btrfs]
walk_down_tree+0xab/0xe7 [btrfs]
btrfs_drop_snapshot+0x356/0x71a [btrfs]
btrfs_clean_one_deleted_snapshot+0xb8/0xf0 [btrfs]
cleaner_kthread+0x12b/0x160 [btrfs]
kthread+0x112/0x130
ret_from_fork+0x27/0x50
When dropping snapshots with qgroup enabled, we will trigger backref
walk.
However such backref walk at that timing is pretty dangerous, as if one
of the parent nodes get WRITE locked by other thread, we could cause a
dead lock.
For example:
FS 260 FS 261 (Dropped)
node A node B
/ \ / \
node C node D node E
/ \ / \ / \
leaf F|leaf G|leaf H|leaf I|leaf J|leaf K
The lock sequence would be:
Thread A (cleaner) | Thread B (other writer)
-----------------------------------------------------------------------
write_lock(B) |
write_lock(D) |
^^^ called by walk_down_tree() |
| write_lock(A)
| write_lock(D) << Stall
read_lock(H) << for backref walk |
read_lock(D) << lock owner is |
the same thread A |
so read lock is OK |
read_lock(A) << Stall |
So thread A hold write lock D, and needs read lock A to unlock.
While thread B holds write lock A, while needs lock D to unlock.
This will cause a deadlock.
This is not only limited to snapshot dropping case. As the backref
walk, even only happens on commit trees, is breaking the normal top-down
locking order, makes it deadlock prone.
Fixes: fb235dc06f ("btrfs: qgroup: Move half of the qgroup accounting time out of commit trans")
CC: stable@vger.kernel.org # 4.14+
Reported-and-tested-by: David Sterba <dsterba@suse.com>
Reported-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
[ rebase to latest branch and fix lock assert bug in btrfs/007 ]
Signed-off-by: Qu Wenruo <wqu@suse.com>
[ copy logs and deadlock analysis from Qu's patch ]
Signed-off-by: David Sterba <dsterba@suse.com>