When "backup intent" is requested on the mount (e.g. backupuid or
backupgid mount options), the corresponding flag needs to be set
on opens of directories (and files) but was missing in some
places causing access denied trying to enumerate and backup
servers.
Fixes kernel bugzilla #200953https://bugzilla.kernel.org/show_bug.cgi?id=200953
Reported-and-tested-by: <whh@rubrik.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
When a Mac client saves an item containing a backslash to a file server
the backslash is represented in the CIFS/SMB protocol as as U+F026.
Before this change, listing a directory containing an item with a
backslash in its name will return that item with the backslash
represented with a true backslash character (U+005C) because
convert_sfm_character mapped U+F026 to U+005C when interpretting the
CIFS/SMB protocol response. However, attempting to open or stat the
path using a true backslash will result in an error because
convert_to_sfm_char does not map U+005C back to U+F026 causing the
CIFS/SMB request to be made with the backslash represented as U+005C.
This change simply prevents the U+F026 to U+005C conversion from
happenning. This is analogous to how the code does not do any
translation of UNI_SLASH (U+F000).
Signed-off-by: Jon Kuhn <jkuhn@barracuda.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Pull core fixes from Thomas Gleixner:
"A small set of updates for core code:
- Prevent tracing in functions which are called from trace patching
via stop_machine() to prevent executing half patched function trace
entries.
- Remove old GCC workarounds
- Remove pointless includes of notifier.h"
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Remove workaround for unreachable warnings from old GCC
notifier: Remove notifier header file wherever not used
watchdog: Mark watchdog touch functions as notrace
When mounting the superblock, ext4_fill_super() calculates the free
blocks and free inodes and stores them in the superblock. It's not
strictly necessary, since we don't use them any more, but it's nice to
keep them roughly aligned to reality.
Since it's not critical for file system correctness, the code doesn't
call ext4_commit_super(). The problem is that it's in
ext4_commit_super() that we recalculate the superblock checksum. So
if we're not going to call ext4_commit_super(), we need to call
ext4_superblock_csum_set() to make sure the superblock checksum is
consistent.
Most of the time, this doesn't matter, since we end up calling
ext4_commit_super() very soon thereafter, and definitely by the time
the file system is unmounted. However, it doesn't work in this
sequence:
mke2fs -Fq -t ext4 /dev/vdc 128M
mount /dev/vdc /vdc
cp xfstests/git-versions /vdc
godown /vdc
umount /vdc
mount /dev/vdc
tune2fs -l /dev/vdc
With this commit, the "tune2fs -l" no longer fails.
Reported-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
A maliciously crafted file system can cause an overflow when the
results of a 64-bit calculation is stored into a 32-bit length
parameter.
https://bugzilla.kernel.org/show_bug.cgi?id=200623
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: Wen Xu <wen.xu@gatech.edu>
Cc: stable@vger.kernel.org
Since overlayfs implements stacked file operations, the underlying
filesystems are not supposed to be exposed to the overlayfs file,
whose f_inode is an overlayfs inode.
Assigning an overlayfs file to swap_file results in an attempt of xfs
code to dereference an xfs_inode struct from an ovl_inode pointer:
CPU: 0 PID: 2462 Comm: swapon Not tainted
4.18.0-xfstests-12721-g33e17876ea4e #3402
RIP: 0010:xfs_find_bdev_for_inode+0x23/0x2f
Call Trace:
xfs_iomap_swapfile_activate+0x1f/0x43
__se_sys_swapon+0xb1a/0xee9
Fix this by not assigning the real inode mapping to f_mapping, which
will cause swapon() to return an error (-EINVAL). Although it makes
sense not to allow setting swpafile on an overlayfs file, some users
may depend on it, so we may need to fix this up in the future.
Keeping f_mapping pointing to overlay inode mapping will cause O_DIRECT
open to fail. Fix this by installing ovl_aops with noop_direct_IO in
overlay inode mapping.
Keeping f_mapping pointing to overlay inode mapping will cause other
a_ops related operations to fail (e.g. readahead()). Those will be
fixed by follow up patches.
Suggested-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: f7c72396d0 ("ovl: add O_DIRECT support")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAluGfbAACgkQnJ2qBz9k
QNmkVQgAjv/tCHStwoQ4Hhr6q5wLU9apAW5WC7Or2MyNfqoJsKgyujpikwFMa0xY
wkVqlITYavvUKl/nRlQ6hpZtAY7hi1Y7GvIYs2ci6QO4YAUuBoH7qPLKqyZYzXx+
mBaC68885nMMDqIHvsCcLurwTiDer6XQXXPDmpMoO9g4kyVuhm2e0/M7CPaA4Ue9
WEfBPBROSNdRH7Wtww/MUvtfd+9mezdlpUQZVVO5cdhAVQ8pzW2g2piYtwIpaY7E
DiJ1zcgaqCnni+b69x4wz8TwB0q8wZJjrPfJgf4X7mIGBZrw80yNRlevfe5SxwdJ
mkMDvjaD0TEItvpjgTZReXaAjFOWTA==
=E61M
-----END PGP SIGNATURE-----
Merge tag 'for_v4.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull misc fs fixes from Jan Kara:
- make UDF to properly mount media created by Win7
- make isofs to properly refuse devices with large physical block size
- fix a Spectre gadget in quotactl(2)
- fix a warning in fsnotify code hit by syzkaller
* tag 'for_v4.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
udf: Fix mounting of Win7 created UDF filesystems
udf: Remove dead code from udf_find_fileset()
fs/quota: Fix spectre gadget in do_quotactl
fs/quota: Replace XQM_MAXQUOTAS usage with MAXQUOTAS
isofs: reject hardware sector size > 2048 bytes
fsnotify: fix false positive warning on inode delete
A specially crafted file system can trick empty_inline_dir() into
reading past the last valid entry in a inline directory, and then run
into the end of xattr marker. This will trigger a divide by zero
fault. Fix this by using the size of the inline directory instead of
dir->i_size.
Also clean up error reporting in __ext4_check_dir_entry so that the
message is clearer and more understandable --- and avoids the division
by zero trap if the size passed in is zero. (I'm not sure why we
coded it that way in the first place; printing offset % size is
actually more confusing and less useful.)
https://bugzilla.kernel.org/show_bug.cgi?id=200933
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: Wen Xu <wen.xu@gatech.edu>
Cc: stable@vger.kernel.org
If the destination of the rename(2) system call exists, the inode's
link count (i_nlinks) must be non-zero. If it is, the inode can end
up on the orphan list prematurely, leading to all sorts of hilarity,
including a use-after-free.
https://bugzilla.kernel.org/show_bug.cgi?id=200931
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: Wen Xu <wen.xu@gatech.edu>
Cc: stable@vger.kernel.org
This suppresses some false positives in gcc 8's -Wstringop-truncation
Suggested by Miguel Ojeda (hopefully the __nonstring definition will
eventually get accepted in the compiler-gcc.h header file).
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Pull IDA updates from Matthew Wilcox:
"A better IDA API:
id = ida_alloc(ida, GFP_xxx);
ida_free(ida, id);
rather than the cumbersome ida_simple_get(), ida_simple_remove().
The new IDA API is similar to ida_simple_get() but better named. The
internal restructuring of the IDA code removes the bitmap
preallocation nonsense.
I hope the net -200 lines of code is convincing"
* 'ida-4.19' of git://git.infradead.org/users/willy/linux-dax: (29 commits)
ida: Change ida_get_new_above to return the id
ida: Remove old API
test_ida: check_ida_destroy and check_ida_alloc
test_ida: Convert check_ida_conv to new API
test_ida: Move ida_check_max
test_ida: Move ida_check_leaf
idr-test: Convert ida_check_nomem to new API
ida: Start new test_ida module
target/iscsi: Allocate session IDs from an IDA
iscsi target: fix session creation failure handling
drm/vmwgfx: Convert to new IDA API
dmaengine: Convert to new IDA API
ppc: Convert vas ID allocation to new IDA API
media: Convert entity ID allocation to new IDA API
ppc: Convert mmu context allocation to new IDA API
Convert net_namespace to new IDA API
cb710: Convert to new IDA API
rsxx: Convert to new IDA API
osd: Convert to new IDA API
sd: Convert to new IDA API
...
Pull perf updates from Thomas Gleixner:
"Kernel:
- Improve kallsyms coverage
- Add x86 entry trampolines to kcore
- Fix ARM SPE handling
- Correct PPC event post processing
Tools:
- Make the build system more robust
- Small fixes and enhancements all over the place
- Update kernel ABI header copies
- Preparatory work for converting libtraceevnt to a shared library
- License cleanups"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits)
tools arch: Update arch/x86/lib/memcpy_64.S copy used in 'perf bench mem memcpy'
tools arch x86: Update tools's copy of cpufeatures.h
perf python: Fix pyrf_evlist__read_on_cpu() interface
perf mmap: Store real cpu number in 'struct perf_mmap'
perf tools: Remove ext from struct kmod_path
perf tools: Add gzip_is_compressed function
perf tools: Add lzma_is_compressed function
perf tools: Add is_compressed callback to compressions array
perf tools: Move the temp file processing into decompress_kmodule
perf tools: Use compression id in decompress_kmodule()
perf tools: Store compression id into struct dso
perf tools: Add compression id into 'struct kmod_path'
perf tools: Make is_supported_compression() static
perf tools: Make decompress_to_file() function static
perf tools: Get rid of dso__needs_decompress() call in __open_dso()
perf tools: Get rid of dso__needs_decompress() call in symbol__disassemble()
perf tools: Get rid of dso__needs_decompress() call in read_object_code()
tools lib traceevent: Change to SPDX License format
perf llvm: Allow passing options to llc in addition to clang
perf parser: Improve error message for PMU address filters
...
* memory_failure() gets confused by dev_pagemap backed mappings. The
recovery code has specific enabling for several possible page states
that needs new enabling to handle poison in dax mappings. Teach
memory_failure() about ZONE_DEVICE pages.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE5DAy15EJMCV1R6v9YGjFFmlTOEoFAlt9ui8ACgkQYGjFFmlT
OEpNRw//XGj9s7sezfJFeol4psJlRUd935yii/gmJRgi/yPf2VxxQG9qyM6SMBUc
75jASfOL6FSsfxHz0kplyWzMDNdrTkNNAD+9rv80FmY7GqWgcas9DaJX7jZ994vI
5SRO7pfvNZcXlo7IhqZippDw3yxkIU9Ufi0YQKaEUm7GFieptvCZ0p9x3VYfdvwM
BExrxQe0X1XUF4xErp5P78+WUbKxP47DLcucRDig8Q7dmHELUdyNzo3E1SVoc7m+
3CmvyTj6XuFQgOZw7ZKun1BJYfx/eD5ZlRJLZbx6wJHRtTXv/Uea8mZ8mJ31ykN9
F7QVd0Pmlyxys8lcXfK+nvpL09QBE0/PhwWKjmZBoU8AdgP/ZvBXLDL/D6YuMTg6
T4wwtPNJorfV4lVD06OliFkVI4qbKbmNsfRq43Ns7PCaLueu4U/eMaSwSH99UMaZ
MGbO140XW2RZsHiU9yTRUmZq73AplePEjxtzR8oHmnjo45nPDPy8mucWPlkT9kXA
oUFMhgiviK7dOo19H4eaPJGqLmHM93+x5tpYxGqTr0dUOXUadKWxMsTnkID+8Yi7
/kzQWCFvySz3VhiEHGuWkW08GZT6aCcpkREDomnRh4MEnETlZI8bblcuXYOCLs6c
nNf1SIMtLdlsl7U1fEX89PNeQQ2y237vEDhFQZftaalPeu/JJV0=
=Ftop
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.19_dax-memory-failure' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm memory-failure update from Dave Jiang:
"As it stands, memory_failure() gets thoroughly confused by dev_pagemap
backed mappings. The recovery code has specific enabling for several
possible page states and needs new enabling to handle poison in dax
mappings.
In order to support reliable reverse mapping of user space addresses:
1/ Add new locking in the memory_failure() rmap path to prevent races
that would typically be handled by the page lock.
2/ Since dev_pagemap pages are hidden from the page allocator and the
"compound page" accounting machinery, add a mechanism to determine
the size of the mapping that encompasses a given poisoned pfn.
3/ Given pmem errors can be repaired, change the speculatively
accessed poison protection, mce_unmap_kpfn(), to be reversible and
otherwise allow ongoing access from the kernel.
A side effect of this enabling is that MADV_HWPOISON becomes usable
for dax mappings, however the primary motivation is to allow the
system to survive userspace consumption of hardware-poison via dax.
Specifically the current behavior is:
mce: Uncorrected hardware memory error in user-access at af34214200
{1}[Hardware Error]: It has been corrected by h/w and requires no further action
mce: [Hardware Error]: Machine check events logged
{1}[Hardware Error]: event severity: corrected
Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
[..]
Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
mce: Memory error not recovered
<reboot>
...and with these changes:
Injecting memory failure for pfn 0x20cb00 at process virtual address 0x7f763dd00000
Memory failure: 0x20cb00: Killing dax-pmd:5421 due to hardware memory corruption
Memory failure: 0x20cb00: recovery action for dax page: Recovered
Given all the cross dependencies I propose taking this through
nvdimm.git with acks from Naoya, x86/core, x86/RAS, and of course dax
folks"
* tag 'libnvdimm-for-4.19_dax-memory-failure' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm:
libnvdimm, pmem: Restore page attributes when clearing errors
x86/memory_failure: Introduce {set, clear}_mce_nospec()
x86/mm/pat: Prepare {reserve, free}_memtype() for "decoy" addresses
mm, memory_failure: Teach memory_failure() about dev_pagemap pages
filesystem-dax: Introduce dax_lock_mapping_entry()
mm, memory_failure: Collect mapping size in collect_procs()
mm, madvise_inject_error: Let memory_failure() optionally take a page reference
mm, dev_pagemap: Do not clear ->mapping on final put
mm, madvise_inject_error: Disable MADV_SOFT_OFFLINE for ZONE_DEVICE pages
filesystem-dax: Set page->index
device-dax: Set page->index
device-dax: Enable page_mapping()
device-dax: Convert to vmf_insert_mixed and vm_fault_t
Collection of misc libnvdimm patches for 4.19 submission
* Adding support to read locked nvdimm capacity.
* Change test code to make DSM failure code injection an override.
* Add support for calculate maximum contiguous area for namespace.
* Add support for queueing a short ARS when there is on going ARS for
nvdimm.
* Allow NULL to be passed in to ->direct_access() for kaddr and
pfn params.
* Improve smart injection support for nvdimm emulation testing.
* Fix test code that supports for emulating controller temperature.
* Fix hang on error before devm_memremap_pages()
* Fix a bug that causes user memory corruption when data returned
to user for ars_status.
* Maintainer updates for Ross Zwisler emails and adding Jan Kara to fsdax.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE5DAy15EJMCV1R6v9YGjFFmlTOEoFAlt9uUIACgkQYGjFFmlT
OErL+xAAgWHSGs8w98VtYA9kLDeTYEXutq93wJZQoBu/FMAXuuU3hYmQYnOQU87h
KKYmfDkeusaih1R3IX7mzlegnnzSfQ6MraNSV76M43noJHbRTunknCPZH6ebp4fo
b/eljvWlZF/idM+7YcsnoFMnHSRj2pjJGXmKQDlKedHD+KMxpmk6zEl2s5Y0zvPU
4U7UQLtk3D5IIpLNsLEmxge32BfvNf5IzoSO1aZp7Eqk0+U5Tq3Sq/Tjmd+J0RKt
6WH5yA6NqXQgBh+ayHsYU8YX62RqnbKQZXqVxD35OH64zJEUefnP1fpt9pmaZ9eL
43BPMkpM09eLAikO2ET3/3c2k6h3h9ttz1sH8t/hiroCtfmxs3XgskY06hxpKjZV
EbN+BUmut5Mr+zzYitRr3dbK2aHPVU9IbU7jUw/1Tz23rq3kU5iI7SHHv1b/eWup
1Cr77Z1M6HB8VBhjnJ+R607sbRrnKQUOV7fGzAaIskyUOTWsEvIgTh/6MRiaj9MD
5HXIgc/0y9E+G93s7MsUWwzpB7J6E7EGoybST2SKPtqwtDMPsBNeWRjyA9quBCoN
u1s+e+lWHYutqRW0eisDTTlq3nJwPijSx1nnzhJxw9s1EkCXz3f7KRZhyH1C79Co
7wjiuvKQ79e/HI/oXvGmTnv5lbLEpWYyJ3U3KIFfoUqugeyhr0k=
=5p2n
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.19_misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dave Jiang:
"Collection of misc libnvdimm patches for 4.19 submission:
- Adding support to read locked nvdimm capacity.
- Change test code to make DSM failure code injection an override.
- Add support for calculate maximum contiguous area for namespace.
- Add support for queueing a short ARS when there is on going ARS for
nvdimm.
- Allow NULL to be passed in to ->direct_access() for kaddr and pfn
params.
- Improve smart injection support for nvdimm emulation testing.
- Fix test code that supports for emulating controller temperature.
- Fix hang on error before devm_memremap_pages()
- Fix a bug that causes user memory corruption when data returned to
user for ars_status.
- Maintainer updates for Ross Zwisler emails and adding Jan Kara to
fsdax"
* tag 'libnvdimm-for-4.19_misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm:
libnvdimm: fix ars_status output length calculation
device-dax: avoid hang on error before devm_memremap_pages()
tools/testing/nvdimm: improve emulation of smart injection
filesystem-dax: Do not request kaddr and pfn when not required
md/dm-writecache: Don't request pointer dummy_addr when not required
dax/super: Do not request a pointer kaddr when not required
tools/testing/nvdimm: kaddr and pfn can be NULL to ->direct_access()
s390, dcssblk: kaddr and pfn can be NULL to ->direct_access()
libnvdimm, pmem: kaddr and pfn can be NULL to ->direct_access()
acpi/nfit: queue issuing of ars when an uc error notification comes in
libnvdimm: Export max available extent
libnvdimm: Use max contiguous area for namespace size
MAINTAINERS: Add Jan Kara for filesystem DAX
MAINTAINERS: update Ross Zwisler's email address
tools/testing/nvdimm: Fix support for emulating controller temperature
tools/testing/nvdimm: Make DSM failure code injection an override
acpi, nfit: Prefer _DSM over _LSR for namespace label reads
libnvdimm: Introduce locked DIMM capacity support
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAluAdzkACgkQiiy9cAdy
T1GjlAv/SOsNm2sj9Bcq/Z/CpPoFRFoJBFsLeReF78QdbF/+eUFuQJvq0aIK8BmV
0NRmvlQk9oCGQfWN0pLWeRn7a7xUqMQ7HYKSS1fzW1O+kJGlyA1cFjCbiZe4py6m
AFSlpraPTtL7isX/ZMOyZ1D7YKMj4Fq5wndcHPSnMQXI2GlaUAip5k/zamXUbMmo
dFaGDkXc67un6Y/04v18LsKJtOHgbVIAES2OgO0sjqiwp0cnGATsZl/OGzvsTo31
brstBum/0Ig2Mpr+5IXa4QFoP+naNXDyhv+D69huETwsMSImnjGL6L/GAgUUO5/t
sDN6bpQdM9wqpckNuFcV9hKbBHQ1nZhzr0gUjRnmN8CJFHOgI2Ndo4J4N9slnWo9
wEXJV7RN5/VQh3Ozb7m+DpAYo4K4r3Oexxa3MG7+IFY8t89O39PBc0dEQ6O8ztaJ
NCcubQVpSFxC7fwVjY56IHHofyznS7JLKMAcCe3+ssOHwJt0PZr/iqdBNHC9W4ZE
7fcNZtct
=jMLB
-----END PGP SIGNATURE-----
Merge tag '4.19-rc-smb3' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Three small SMB3 fixes, one for stable"
* tag '4.19-rc-smb3' of git://git.samba.org/sfrench/cifs-2.6:
cifs: update internal module version number for cifs.ko to 2.12
cifs: check kmalloc before use
cifs: check if SMB2 PDU size has been padded and suppress the warning
cifs: create a define for how many iovs we need for an SMB2_open()
At the point where r is being checked for different values, r is always
going to be equal to 2 as the previous if statements jump to end or end1
if r is not 2. Hence the assignment to err can be simplified to just
err an assignment without any checks on the value or r.
Detected by CoverityScan, CID#1226737 ("Logically dead code")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull namespace fixes from Eric Biederman:
"This is a set of four fairly obvious bug fixes:
- a switch from d_find_alias to d_find_any_alias because the xattr
code perversely takes a dentry
- two mutex vs copy_to_user fixes from Jann Horn
- a fix to use a sanitized size not the size userspace passed in from
Christian Brauner"
* 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
getxattr: use correct xattr length
sys: don't hold uts_sem while accessing userspace memory
userns: move user access out of the mutex
cap_inode_getsecurity: use d_find_any_alias() instead of d_find_alias()
Commit 672d599041 ("btrfs: Use wrapper macro for rcu string to remove
duplicate code") replaces some open coded RCU string handling with macro.
It turns out that btrfs_debug_in_rcu() is used for the first time and
the macro lacks lock/unlock of RCU string for non-debug case (i.e. when
the message is not printed), leading to suspicious RCU usage warning
when CONFIG_PROVE_RCU is on.
Fix this by adding a wrapper to call lock/unlock for the non-debug case
too.
Fixes: 672d599041 ("btrfs: Use wrapper macro for rcu string to remove duplicate code")
Reported-by: David Howells <dhowells@redhat.com>
Tested-by: David Howells <dhowells@redhat.com>
Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This empty file sneaked into the tree by mistake.
Remove it.
Fixes: 6eb61d587f ("ubifs: Pass struct ubifs_info to ubifs_assert()")
Signed-off-by: Richard Weinberger <richard@nod.at>
Win7 is creating UDF filesystems with single partition with number 8192.
Current partition descriptor scanning code does not handle this well as
it incorrectly assumes that partition numbers will form mostly contiguous
space of small numbers. This results in unmountable media due to errors
like:
UDF-fs: error (device dm-1): udf_read_tagged: tag version 0x0000 != 0x0002 || 0x0003, block 0
UDF-fs: warning (device dm-1): udf_fill_super: No fileset found
Fix the problem by handling partition descriptors in a way that sparse
partition numbering does not matter.
Reported-and-tested-by: jean-luc malet <jeanluc.malet@gmail.com>
CC: stable@vger.kernel.org
Fixes: 7b78fd02fb
Signed-off-by: Jan Kara <jack@suse.cz>
Merge yet more updates from Andrew Morton:
- the rest of MM
- various misc fixes and tweaks
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (22 commits)
mm: Change return type int to vm_fault_t for fault handlers
lib/fonts: convert comments to utf-8
s390: ebcdic: convert comments to UTF-8
treewide: convert ISO_8859-1 text comments to utf-8
drivers/gpu/drm/gma500/: change return type to vm_fault_t
docs/core-api: mm-api: add section about GFP flags
docs/mm: make GFP flags descriptions usable as kernel-doc
docs/core-api: split memory management API to a separate file
docs/core-api: move *{str,mem}dup* to "String Manipulation"
docs/core-api: kill trailing whitespace in kernel-api.rst
mm/util: add kernel-doc for kvfree
mm/util: make strndup_user description a kernel-doc comment
fs/proc/vmcore.c: hide vmcoredd_mmap_dumps() for nommu builds
treewide: correct "differenciate" and "instanciate" typos
fs/afs: use new return type vm_fault_t
drivers/hwtracing/intel_th/msu.c: change return type to vm_fault_t
mm: soft-offline: close the race against page allocation
mm: fix race on soft-offlining free huge pages
namei: allow restricted O_CREAT of FIFOs and regular files
hfs: prevent crash on exit from failed search
...
Use new return type vm_fault_t for fault handler. For now, this is just
documenting that the function returns a VM_FAULT value rather than an
errno. Once all instances are converted, vm_fault_t will become a
distinct type.
Ref-> commit 1c8f422059 ("mm: change return type to vm_fault_t")
The aim is to change the return type of finish_fault() and
handle_mm_fault() to vm_fault_t type. As part of that clean up return
type of all other recursively called functions have been changed to
vm_fault_t type.
The places from where handle_mm_fault() is getting invoked will be
change to vm_fault_t type but in a separate patch.
vmf_error() is the newly introduce inline function in 4.17-rc6.
[akpm@linux-foundation.org: don't shadow outer local `ret' in __do_huge_pmd_anonymous_page()]
Link: http://lkml.kernel.org/r/20180604171727.GA20279@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Without CONFIG_MMU, we get a build warning:
fs/proc/vmcore.c:228:12: error: 'vmcoredd_mmap_dumps' defined but not used [-Werror=unused-function]
static int vmcoredd_mmap_dumps(struct vm_area_struct *vma, unsigned long dst,
The function is only referenced from an #ifdef'ed caller, so
this uses the same #ifdef around it.
Link: http://lkml.kernel.org/r/20180525213526.2117790-1-arnd@arndb.de
Fixes: 7efe48df8a ("vmcore: append device dumps to vmcore as elf notes")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Ganesh Goudar <ganeshgr@chelsio.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use new return type vm_fault_t for fault handler in struct
vm_operations_struct. For now, this is just documenting that the
function returns a VM_FAULT value rather than an errno. Once all
instances are converted, vm_fault_t will become a distinct type.
See 1c8f422059 ("mm: change return type to vm_fault_t") for reference.
Link: http://lkml.kernel.org/r/20180702152017.GA3780@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Disallows open of FIFOs or regular files not owned by the user in world
writable sticky directories, unless the owner is the same as that of the
directory or the file is opened without the O_CREAT flag. The purpose
is to make data spoofing attacks harder. This protection can be turned
on and off separately for FIFOs and regular files via sysctl, just like
the symlinks/hardlinks protection. This patch is based on Openwall's
"HARDEN_FIFO" feature by Solar Designer.
This is a brief list of old vulnerabilities that could have been prevented
by this feature, some of them even allow for privilege escalation:
CVE-2000-1134
CVE-2007-3852
CVE-2008-0525
CVE-2009-0416
CVE-2011-4834
CVE-2015-1838
CVE-2015-7442
CVE-2016-7489
This list is not meant to be complete. It's difficult to track down all
vulnerabilities of this kind because they were often reported without any
mention of this particular attack vector. In fact, before
hardlinks/symlinks restrictions, fifos/regular files weren't the favorite
vehicle to exploit them.
[s.mesoraca16@gmail.com: fix bug reported by Dan Carpenter]
Link: https://lkml.kernel.org/r/20180426081456.GA7060@mwanda
Link: http://lkml.kernel.org/r/1524829819-11275-1-git-send-email-s.mesoraca16@gmail.com
[keescook@chromium.org: drop pr_warn_ratelimited() in favor of audit changes in the future]
[keescook@chromium.org: adjust commit subjet]
Link: http://lkml.kernel.org/r/20180416175918.GA13494@beast
Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Suggested-by: Solar Designer <solar@openwall.com>
Suggested-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hfs_find_exit() expects fd->bnode to be NULL after a search has failed.
hfs_brec_insert() may instead set it to an error-valued pointer. Fix
this to prevent a crash.
Link: http://lkml.kernel.org/r/53d9749a029c41b4016c495fc5838c9dba3afc52.1530294815.git.ernesto.mnd.fernandez@gmail.com
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Cc: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Cc: Viacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hfs_find_exit() expects fd->bnode to be NULL after a search has failed.
hfs_brec_insert() may instead set it to an error-valued pointer. Fix
this to prevent a crash.
Link: http://lkml.kernel.org/r/803590a35221fbf411b2c141419aea3233a6e990.1530294813.git.ernesto.mnd.fernandez@gmail.com
Signed-off-by: Ernesto A. Fernandez <ernesto.mnd.fernandez@gmail.com>
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Reviewed-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
An HFS+ filesystem can be mounted read-only without having a metadata
directory, which is needed to support hardlinks. But if the catalog
data is corrupted, a directory lookup may still find dentries claiming
to be hardlinks.
hfsplus_lookup() does check that ->hidden_dir is not NULL in such a
situation, but mistakenly does so after dereferencing it for the first
time. Reorder this check to prevent a crash.
This happens when looking up corrupted catalog data (dentry) on a
filesystem with no metadata directory (this could only ever happen on a
read-only mount). Wen Xu sent the replication steps in detail to the
fsdevel list: https://bugzilla.kernel.org/show_bug.cgi?id=200297
Link: http://lkml.kernel.org/r/20180712215344.q44dyrhymm4ajkao@eaf
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Reported-by: Wen Xu <wen.xu@gatech.edu>
Cc: Viacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Stable bufixes:
- v3.17+: Fix an off-by-one in bl_map_stripe()
- v4.9+: NFSv4 client live hangs after live data migration recovery
- v4.18+: xprtrdma: Fix disconnect regression
- v4.14+: Fix locking in pnfs_generic_recover_commit_reqs
- v4.9+: Fix a sleep in atomic context in nfs4_callback_sequence()
Features:
- Add support for asynchronous server-side COPY operations
Other bugfixes and cleanups:
- Optitmizations and fixes involving NFS v4.1 / pNFS layout handling
- Optimize lseek(fd, SEEK_CUR, 0) on directories to avoid locking
- Immediately reschedule writeback when the server replies with an error
- Fix excessive attribute revalidation in nfs_execute_ok()
- Add error checking to nfs_idmap_prepare_message()
- Use new vm_fault_t return type
- Return a delegation when reclaiming one that the server has recalled
- Referrals should inherit proto setting from parents
- Make rpc_auth_create_args a const
- Improvements to rpc_iostats tracking
- Fix a potential reference leak when there is an error processing a callback
- Fix rmdir / mkdir / rename nlink accounting
- Fix updating inode change attribute
- Fix error handling in nfsn4_sp4_select_mode()
- Use an appropriate work queue for direct-write completion
- Don't busy wait if NFSv4 session draining is interrupted
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlt/CYIACgkQ18tUv7Cl
QOu8gBAA0xQWmgRoG6oIdYUxvgYqhuJmMqC4SU1E6mCJ93xEuUSvEFw51X+84KCt
r6UPkp/bKiVe3EIinKTplIzuxgggXNG0EQmO46FYNTl7nqpN85ffLsQoWsiD23fp
j8afqKPFR2zfhHXLKQC7k1oiOpwGqJ+EJWgIW4llE80pSNaErEoEaDqSPds5thMN
dHEjjLr8ef6cbBux6sSPjwWGNbE82uoSu3MDuV2+e62hpGkgvuEYo1vyE6ujeZW5
MUsmw+AHZkwro0msTtNBOHcPZAS0q/2UMPzl1tsDeCWNl2mugqZ6szQLSS2AThKq
Zr6iK9Q5dWjJfrQHcjRMnYJB+SCX1SfPA7ASuU34opwcWPjecbS9Q92BNTByQYwN
o9ngs2K0mZfqpYESMAmf7Il134cCBrtEp3skGko2KopJcYcE5YUFhdKihi1yQQjU
UbOOubMpQk8vY9DpDCAwGbICKwUZwGvq27uuUWL20kFVDb1+jvfHwcV4KjRAJo/E
J9aFtU+qOh4rMPMnYlEVZcAZBGfenlv/DmBl1upRpjzBkteUpUJsAbCmGyAk4616
3RECasehgsjNCQpFIhv3FpUkWzP5jt0T3gRr1NeY6WKJZwYnHEJr9PtapS+EIsCT
tB5DvvaJqFtuHFOxzn+KlGaxdSodHF7klOq7NM3AC0cX8AkWqaU=
=8+9t
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.19-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client updates from Anna Schumaker:
"These patches include adding async support for the v4.2 COPY
operation. I think Bruce is planning to send the server patches for
the next release, but I figured we could get the client side out of
the way now since it's been in my tree for a while. This shouldn't
cause any problems, since the server will still respond with
synchronous copies even if the client requests async.
Features:
- Add support for asynchronous server-side COPY operations
Stable bufixes:
- Fix an off-by-one in bl_map_stripe() (v3.17+)
- NFSv4 client live hangs after live data migration recovery (v4.9+)
- xprtrdma: Fix disconnect regression (v4.18+)
- Fix locking in pnfs_generic_recover_commit_reqs (v4.14+)
- Fix a sleep in atomic context in nfs4_callback_sequence() (v4.9+)
Other bugfixes and cleanups:
- Optimizations and fixes involving NFS v4.1 / pNFS layout handling
- Optimize lseek(fd, SEEK_CUR, 0) on directories to avoid locking
- Immediately reschedule writeback when the server replies with an
error
- Fix excessive attribute revalidation in nfs_execute_ok()
- Add error checking to nfs_idmap_prepare_message()
- Use new vm_fault_t return type
- Return a delegation when reclaiming one that the server has
recalled
- Referrals should inherit proto setting from parents
- Make rpc_auth_create_args a const
- Improvements to rpc_iostats tracking
- Fix a potential reference leak when there is an error processing a
callback
- Fix rmdir / mkdir / rename nlink accounting
- Fix updating inode change attribute
- Fix error handling in nfsn4_sp4_select_mode()
- Use an appropriate work queue for direct-write completion
- Don't busy wait if NFSv4 session draining is interrupted"
* tag 'nfs-for-4.19-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (54 commits)
pNFS: Remove unwanted optimisation of layoutget
pNFS/flexfiles: ff_layout_pg_init_read should exit on error
pNFS: Treat RECALLCONFLICT like DELAY...
pNFS: When updating the stateid in layoutreturn, also update the recall range
NFSv4: Fix a sleep in atomic context in nfs4_callback_sequence()
NFSv4: Fix locking in pnfs_generic_recover_commit_reqs
NFSv4: Fix a typo in nfs4_init_channel_attrs()
NFSv4: Don't busy wait if NFSv4 session draining is interrupted
NFS recover from destination server reboot for copies
NFS add a simple sync nfs4_proc_commit after async COPY
NFS handle COPY ERR_OFFLOAD_NO_REQS
NFS send OFFLOAD_CANCEL when COPY killed
NFS export nfs4_async_handle_error
NFS handle COPY reply CB_OFFLOAD call race
NFS add support for asynchronous COPY
NFS COPY xdr handle async reply
NFS OFFLOAD_CANCEL xdr
NFS CB_OFFLOAD xdr
NFS: Use an appropriate work queue for direct-write completion
NFSv4: Fix error handling in nfs4_sp4_select_mode()
...
missing Chuck's fixes for the problem with callbacks over GSS from
multi-homed servers, and a smaller fix from Laura Abbott.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJbftA8AAoJECebzXlCjuG+QPMQALieEKkX0YoqRhPz5G+RrWFy
KgOBFAoiRcjFQD6wMt9FzD6qYEZqSJ+I2b+K5N3BkdyDDQu845iD0wK0zBGhMgLm
7ith85nphIMbe18+5jPorqAsI9RlfBQjiSGw1MEx5dicLQQzTObHL5q+l5jcWna4
jWS3yUKv1URpOsR1hIryw74ktSnhuH8n//zmntw8aWrCkq3hnXOZK/agtYxZ7Viv
V3kiQsiNpL2FPRcHN7ejhLUTnRkkuD2iYKrzP/SpTT/JfdNEUXlMhKkAySogNpus
nvR9X7hwta8Lgrt7PSB9ibFTXtCupmuICg5mbDWy6nXea2NvpB01QhnTzrlX17Eh
Yfk/18z95b6Qs1v4m3SI8ESmyc6l5dMZozLudtHzifyCqooWZriEhCR1PlQfQ/FJ
4cYQ8U/qiMiZIJXL7N2wpSoSaWR5bqU1rXen29Np1WEDkiv4Nf5u2fsCXzv0ZH2C
ReWpNkbnNxsNiKpp4geBZtlcSEU1pk+1PqE0MagTdBV3iptiUHRSP4jR7qLnc0zT
J1lCvU7Fodnt9vNSxMpt2Jd6XxQ6xtx7n6aMQAiYFnXDs+hP2hPnJVCScnYW3L6R
2r1sHRKKeoOzCJ2thw+zu4lOwMm7WPkJPWAYfv90reWkiKoy2vG0S9P7wsNGoJuW
fuEjB2b9pow1Ffynat6q
=JnLK
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.19-1' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"Chuck Lever fixed a problem with NFSv4.0 callbacks over GSS from
multi-homed servers.
The only new feature is a minor bit of protocol (change_attr_type)
which the client doesn't even use yet.
Other than that, various bugfixes and cleanup"
* tag 'nfsd-4.19-1' of git://linux-nfs.org/~bfields/linux: (27 commits)
sunrpc: Add comment defining gssd upcall API keywords
nfsd: Remove callback_cred
nfsd: Use correct credential for NFSv4.0 callback with GSS
sunrpc: Extract target name into svc_cred
sunrpc: Enable the kernel to specify the hostname part of service principals
sunrpc: Don't use stack buffer with scatterlist
rpc: remove unneeded variable 'ret' in rdma_listen_handler
nfsd: use true and false for boolean values
nfsd: constify write_op[]
fs/nfsd: Delete invalid assignment statements in nfsd4_decode_exchange_id
NFSD: Handle full-length symlinks
NFSD: Refactor the generic write vector fill helper
svcrdma: Clean up Read chunk path
svcrdma: Avoid releasing a page in svc_xprt_release()
nfsd: Mark expected switch fall-through
sunrpc: remove redundant variables 'checksumlen','blocksize' and 'data'
nfsd: fix leaked file lock with nfs exported overlayfs
nfsd: don't advertise a SCSI layout for an unsupported request_queue
nfsd: fix corrupted reply to badly ordered compound
nfsd: clarify check_op_ordering
...
- Year 2038 preparations
- New UBI feature to skip CRC checks of static volumes
- A new Kconfig option to disable xattrs in UBIFS
- Lots of fixes in UBIFS, found by our new test framework
-----BEGIN PGP SIGNATURE-----
iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAlt9zFkWHHJpY2hhcmRA
c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7waiuD/oDYzerOLe0R7n2sRT9zjtY8kCx
LuizRvDYUlmMynI6EVahfyJy2IixcDmXOklGdxJqUkN5igDC/FORWdQjv2X9y56d
qZ2dlS8aBvI0ZKBG2ew4VP1H67CXtCw8H9fE32SGotPmxKRUQqt2vKqo+vgQfapH
eSVPrOaoqoRh+/ieumYXsvFdEUWpa66G3tVMFe4znu+kYRBbGzSszxpuq1ukIls2
P9wewqbWAZpqn+n9A9+RBIv81g+jH87/acfjK2L7/lT9wsFO7BQGKi373dPbnTa5
9WsjGEd+Gt0kb4Kjh5QegY97bPqWjmaMj1BLqeQVpSbQqpzkiFMf9GW5+h3XqAfO
hM1zzgONZMxHdZSKH0bWzIRQbvU6v0d9C4J/elfFuH9ke2XscrxjOtZZQbtbGeYj
tE7FWoZnB8euXubulGAUBKofzWe+gItBe9+iA29EBETNOemrJyHyKjO0Fe9ze5p2
bfVFvN62kHz4ZCJoinwO/OpXnCuA91xrVocLOOIreb4dkZ/kqP+YZWFf70FcE1o5
sPAbAUu+hfb2LbpktEdZHHbhoupfCnJokzfboJMX0NWKRtFXJDONjogJYTFUjrpW
eXS+55+WFHoLWtx9J2IVmcb3cQrj/W/4J83kSg99cUkVjGpil50zmtzhq9bHzsLc
wazngueP7kW2l9bSSg==
=gCyp
-----END PGP SIGNATURE-----
Merge tag 'upstream-4.19-rc1' of git://git.infradead.org/linux-ubifs
Pull UBI/UBIFS updates from Richard Weinberger:
- Year 2038 preparations
- New UBI feature to skip CRC checks of static volumes
- A new Kconfig option to disable xattrs in UBIFS
- Lots of fixes in UBIFS, found by our new test framework
* tag 'upstream-4.19-rc1' of git://git.infradead.org/linux-ubifs: (21 commits)
ubifs: Set default assert action to read-only
ubifs: Allow setting assert action as mount parameter
ubifs: Rework ubifs_assert()
ubifs: Pass struct ubifs_info to ubifs_assert()
ubifs: Turn two ubifs_assert() into a WARN_ON()
ubi: expose the volume CRC check skip flag
ubi: provide a way to skip CRC checks
ubifs: Use kmalloc_array()
ubifs: Check data node size before truncate
Revert "UBIFS: Fix potential integer overflow in allocation"
ubifs: Add comment on c->commit_sem
ubifs: introduce Kconfig symbol for xattr support
ubifs: use swap macro in swap_dirty_idx
ubifs: tnc: use monotonic znode timestamp
ubifs: use timespec64 for inode timestamps
ubifs: xattr: Don't operate on deleted inodes
ubifs: gc: Fix typo
ubifs: Fix memory leak in lprobs self-check
ubi: Initialize Fastmap checkmapping correctly
ubifs: Fix synced_i_size calculation for xattr inodes
...
The kmalloc was not being checked - if it fails issue a warning
and return -ENOMEM to the caller.
Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
Fixes: b8da344b74 ("cifs: dynamic allocation of ntlmssp blob")
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
cc: Stable <stable@vger.kernel.org>`
Some SMB2/3 servers, Win2016 but possibly others too, adds padding
not only between PDUs in a compound but also to the final PDU.
This padding extends the PDU to a multiple of 8 bytes.
Check if the unexpected length looks like this might be the case
and avoid triggering the log messages for :
"SMB2 server sent bad RFC1001 len %d not %d\n"
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
When running in a container with a user namespace, if you call getxattr
with name = "system.posix_acl_access" and size % 8 != 4, then getxattr
silently skips the user namespace fixup that it normally does resulting in
un-fixed-up data being returned.
This is caused by posix_acl_fix_xattr_to_user() being passed the total
buffer size and not the actual size of the xattr as returned by
vfs_getxattr().
This commit passes the actual length of the xattr as returned by
vfs_getxattr() down.
A reproducer for the issue is:
touch acl_posix
setfacl -m user:0:rwx acl_posix
and the compile:
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>
#include <attr/xattr.h>
/* Run in user namespace with nsuid 0 mapped to uid != 0 on the host. */
int main(int argc, void **argv)
{
ssize_t ret1, ret2;
char buf1[128], buf2[132];
int fret = EXIT_SUCCESS;
char *file;
if (argc < 2) {
fprintf(stderr,
"Please specify a file with "
"\"system.posix_acl_access\" permissions set\n");
_exit(EXIT_FAILURE);
}
file = argv[1];
ret1 = getxattr(file, "system.posix_acl_access",
buf1, sizeof(buf1));
if (ret1 < 0) {
fprintf(stderr, "%s - Failed to retrieve "
"\"system.posix_acl_access\" "
"from \"%s\"\n", strerror(errno), file);
_exit(EXIT_FAILURE);
}
ret2 = getxattr(file, "system.posix_acl_access",
buf2, sizeof(buf2));
if (ret2 < 0) {
fprintf(stderr, "%s - Failed to retrieve "
"\"system.posix_acl_access\" "
"from \"%s\"\n", strerror(errno), file);
_exit(EXIT_FAILURE);
}
if (ret1 != ret2) {
fprintf(stderr, "The value of \"system.posix_acl_"
"access\" for file \"%s\" changed "
"between two successive calls\n", file);
_exit(EXIT_FAILURE);
}
for (ssize_t i = 0; i < ret2; i++) {
if (buf1[i] == buf2[i])
continue;
fprintf(stderr,
"Unexpected different in byte %zd: "
"%02x != %02x\n", i, buf1[i], buf2[i]);
fret = EXIT_FAILURE;
}
if (fret == EXIT_SUCCESS)
fprintf(stderr, "Test passed\n");
else
fprintf(stderr, "Test failed\n");
_exit(fret);
}
and run:
./tester acl_posix
On a non-fixed up kernel this should return something like:
root@c1:/# ./t
Unexpected different in byte 16: ffffffa0 != 00
Unexpected different in byte 17: ffffff86 != 00
Unexpected different in byte 18: 01 != 00
and on a fixed kernel:
root@c1:~# ./t
Test passed
Cc: stable@vger.kernel.org
Fixes: 2f6f0654ab ("userns: Convert vfs posix_acl support to use kuids and kgids")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=199945
Reported-by: Colin Watson <cjwatson@ubuntu.com>
Signed-off-by: Christian Brauner <christian@brauner.io>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
The issue here is that btrfs_commit_transaction() frees "trans" on both
the error and the success path. So the problem would be if
btrfs_commit_transaction() succeeds, and then qgroup_rescan_init()
fails. That means that "ret" is non-zero and "trans" is non-NULL and it
leads to a use after free inside the btrfs_end_transaction() macro.
Fixes: 340f1aa27f ("btrfs: qgroups: Move transaction management inside btrfs_quota_enable/disable")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Test case btrfs/164 reports use-after-free:
[ 6712.084324] general protection fault: 0000 [#1] PREEMPT SMP
..
[ 6712.195423] btrfs_update_commit_device_size+0x75/0xf0 [btrfs]
[ 6712.201424] btrfs_commit_transaction+0x57d/0xa90 [btrfs]
[ 6712.206999] btrfs_rm_device+0x627/0x850 [btrfs]
[ 6712.211800] btrfs_ioctl+0x2b03/0x3120 [btrfs]
Reason for this is that btrfs_shrink_device adds the resized device to
the fs_devices::resized_devices after it has called the last commit
transaction.
So the list fs_devices::resized_devices is not empty when
btrfs_shrink_device returns. Now the parent function
btrfs_rm_device calls:
btrfs_close_bdev(device);
call_rcu(&device->rcu, free_device_rcu);
and then does the transactio ncommit. It goes through the
fs_devices::resized_devices in btrfs_update_commit_device_size and
leads to use-after-free.
Fix this by making sure btrfs_shrink_device calls the last needed
btrfs_commit_transaction before the return. This is consistent with what
the grow counterpart does and this makes sure the on-disk state is
persistent when the function returns.
Reported-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Tested-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
After btrfs_qgroup_reserve_meta_prealloc(), num_bytes will be assigned
again by btrfs_calc_trans_metadata_size(). Once block_rsv fails, we
can't properly free the num_bytes of the previous qgroup_reserve. Use a
separate variable to store the num_bytes of the qgroup_reserve.
Delete the comment for the qgroup_reserved that does not exist and add a
comment about use_global_rsv.
Fixes: c4c129db5d ("btrfs: drop unused parameter qgroup_reserved")
CC: stable@vger.kernel.org # 4.18+
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we deduplicate extents between two different files we can end up
corrupting data if the source range ends at the size of the source file,
the source file's size is not aligned to the filesystem's block size
and the destination range does not go past the size of the destination
file size.
Example:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ xfs_io -f -c "pwrite -S 0x6b 0 2518890" /mnt/foo
# The first byte with a value of 0xae starts at an offset (2518890)
# which is not a multiple of the sector size.
$ xfs_io -c "pwrite -S 0xae 2518890 102398" /mnt/foo
# Confirm the file content is full of bytes with values 0x6b and 0xae.
$ od -t x1 /mnt/foo
0000000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
*
11467540 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ae ae ae ae ae ae
11467560 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae
*
11777540 ae ae ae ae ae ae ae ae
11777550
# Create a second file with a length not aligned to the sector size,
# whose bytes all have the value 0x6b, so that its extent(s) can be
# deduplicated with the first file.
$ xfs_io -f -c "pwrite -S 0x6b 0 557771" /mnt/bar
# Now deduplicate the entire second file into a range of the first file
# that also has all bytes with the value 0x6b. The destination range's
# end offset must not be aligned to the sector size and must be less
# then the offset of the first byte with the value 0xae (byte at offset
# 2518890).
$ xfs_io -c "dedupe /mnt/bar 0 1957888 557771" /mnt/foo
# The bytes in the range starting at offset 2515659 (end of the
# deduplication range) and ending at offset 2519040 (start offset
# rounded up to the block size) must all have the value 0xae (and not
# replaced with 0x00 values). In other words, we should have exactly
# the same data we had before we asked for deduplication.
$ od -t x1 /mnt/foo
0000000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
*
11467540 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ae ae ae ae ae ae
11467560 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae
*
11777540 ae ae ae ae ae ae ae ae
11777550
# Unmount the filesystem and mount it again. This guarantees any file
# data in the page cache is dropped.
$ umount /dev/sdb
$ mount /dev/sdb /mnt
$ od -t x1 /mnt/foo
0000000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
*
11461300 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 00
11461320 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
11470000 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae
*
11777540 ae ae ae ae ae ae ae ae
11777550
# The bytes in range 2515659 to 2519040 have a value of 0x00 and not a
# value of 0xae, data corruption happened due to the deduplication
# operation.
So fix this by rounding down, to the sector size, the length used for the
deduplication when the following conditions are met:
1) Source file's range ends at its i_size;
2) Source file's i_size is not aligned to the sector size;
3) Destination range does not cross the i_size of the destination file.
Fixes: e1d227a42e ("btrfs: Handle unaligned length in extent_same")
CC: stable@vger.kernel.org # 4.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we add a new name for an inode which was logged in the current
transaction, we update the inode in the log so that its new name and
ancestors are added to the log. However when we do this we do not persist
the log, so the changes remain in memory only, and as a consequence, any
ancestors that were created in the current transaction are updated such
that future calls to btrfs_inode_in_log() return true. This leads to a
subsequent fsync against such new ancestor directories returning
immediately, without persisting the log, therefore after a power failure
the new ancestor directories do not exist, despite fsync being called
against them explicitly.
Example:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir /mnt/A
$ mkdir /mnt/B
$ mkdir /mnt/A/C
$ touch /mnt/B/foo
$ xfs_io -c "fsync" /mnt/B/foo
$ ln /mnt/B/foo /mnt/A/C/foo
$ xfs_io -c "fsync" /mnt/A
<power failure>
After the power failure, directory "A" does not exist, despite the explicit
fsync on it.
Instead of fixing this by changing the behaviour of the explicit fsync on
directory "A" to persist the log instead of doing nothing, make the logging
of the new file name (which happens when creating a hard link or renaming)
persist the log. This approach not only is simpler, not requiring addition
of new fields to the inode in memory structure, but also gives us the same
behaviour as ext4, xfs and f2fs (possibly other filesystems too).
A test case for fstests follows soon.
Fixes: 12fcfd22fe ("Btrfs: tree logging unlink/rename fixes")
Reported-by: Vijay Chidambaram <vvijay03@gmail.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Clean up: The global callback_cred is no longer used, so it can be
removed.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
I've had trouble when operating a multi-homed Linux NFS server with
Kerberos using NFSv4.0. Lately, I've seen my clients reporting
this (and then hanging):
May 9 11:43:26 manet kernel: NFS: NFSv4 callback contains invalid cred
The client-side commit f11b2a1cfb ("nfs4: copy acceptor name from
context to nfs_client") appears to be related, but I suspect this
problem has been going on for some time before that.
RFC 7530 Section 3.3.3 says:
> For Kerberos V5, nfs/hostname would be a server principal in the
> Kerberos Key Distribution Center database. This is the same
> principal the client acquired a GSS-API context for when it issued
> the SETCLIENTID operation ...
In other words, an NFSv4.0 client expects that the server will use
the same GSS principal for callback that the client used to
establish its lease. For example, if the client used the service
principal "nfs@server.domain" to establish its lease, the server
is required to use "nfs@server.domain" when performing NFSv4.0
callback operations.
The Linux NFS server currently does not. It uses a common service
principal for all callback connections. Sometimes this works as
expected, and other times -- for example, when the server is
accessible via multiple hostnames -- it won't work at all.
This patch scrapes the target name from the client credential,
and uses that for the NFSv4.0 callback credential. That should
be correct much more often.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
NFSv4.0 callback needs to know the GSS target name the client used
when it established its lease. That information is available from
the GSS context created by gssproxy. Make it available in each
svc_cred.
Note this will also give us access to the real target service
principal name (which is typically "nfs", but spec does not require
that).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
In this round, we've tuned f2fs to improve general performance by serializing
block allocation and enhancing discard flows like fstrim which avoids user IO
contention. And we've added fsync_mode=nobarrier which gives an option to user
where it skips issuing cache_flush commands to underlying flash storage. And
there are many bug fixes related to fuzzed images, revoked atomic writes, quota
ops, and minor direct IO.
Enhancement:
- add fsync_mode=nobarrier which bypasses cache_flush command
- enhance the discarding flow which avoids user IOs and issues in LBA order
- readahead some encrypted blocks during GC
- enable in-memory inode checksum to verify the blocks if F2FS_CHECK_FS is set
- enhance nat_bits behavior
- set -o discard by default
- set REQ_RAHEAD to bio in ->readpages
Bug fixes:
- fix a corner case to corrupt atomic_writes revoking flow
- revisit i_gc_rwsem to fix race conditions
- fix some dio behaviors captured by xfstests
- correct handling errors given by quota-related failures
- add many sanity check flows to avoid fuzz test failures
- add more error number propagation to their callers
- fix several corner cases to continue fault injection w/ shutdown loop
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAlt82U4ACgkQQBSofoJI
UNJTLQ/+PhewnNa5tDfUgWdFnUFz3h9/NcC677l0OplOOUNxA8iSa1xamlKf/nf9
sB5ey0I7oBF8zQGxfndhHQfi6fpfUcNMr14hm+TS/3+d54xLJmiVShD5fjNSV2vB
Ur0xoozuQDwYF1e3QKdBQjFqaCf78VheTr3aWxyv22/Sg+PYylZJ2K8rHTB7mGPU
UG0aRnKrP3FPRjL7Q3m0Vm6b6eZ5uNdNrFfjgn/8yuQQ9V197K8vwSbPAsR5/pOh
miCQXyM708NgEYJRWkWmC/rDSQdU0/h/mGnJWrBrbceW62QefGOgd2jcVfmthHJa
ZXpj+BEG5bYpCCxGxF6N+u0e28OKonCwO/uvL8YAd5icN7yXtsKzoF1CCuXxOYf1
9K5SMylCTSyrs/+LV8CJoT2ya8w0l0w+R/txUYn8UT+4AgqU+chS2kJeXqw9tcHB
WLFs/rnAyofWCI/8frVBmJY+zA1ZZvTqs/lmVYrtJUkiOcMTq34WICBUAEFKV452
BM5dcu21bSIkapYispEt4Rr7o4P4HHMQ+N1i2yUZMFCz5T0RyzdybeS5THk2yVzd
L0kxfYU+zHigNX51ez8+Z7DyDLDBp6jkD0e66x73bUK9TGPH+ZbnAL6gwLikmD3M
+VxYl5nyW/3bxx1HdfK1Xwd4/wYMNBmdtn5NC50oZ+jpB0h8YCE=
=zgbL
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this round, we've tuned f2fs to improve general performance by
serializing block allocation and enhancing discard flows like fstrim
which avoids user IO contention. And we've added fsync_mode=nobarrier
which gives an option to user where it skips issuing cache_flush
commands to underlying flash storage. And there are many bug fixes
related to fuzzed images, revoked atomic writes, quota ops, and minor
direct IO.
Enhancements:
- add fsync_mode=nobarrier which bypasses cache_flush command
- enhance the discarding flow which avoids user IOs and issues in
LBA order
- readahead some encrypted blocks during GC
- enable in-memory inode checksum to verify the blocks if
F2FS_CHECK_FS is set
- enhance nat_bits behavior
- set -o discard by default
- set REQ_RAHEAD to bio in ->readpages
Bug fixes:
- fix a corner case to corrupt atomic_writes revoking flow
- revisit i_gc_rwsem to fix race conditions
- fix some dio behaviors captured by xfstests
- correct handling errors given by quota-related failures
- add many sanity check flows to avoid fuzz test failures
- add more error number propagation to their callers
- fix several corner cases to continue fault injection w/ shutdown
loop"
* tag 'f2fs-for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (89 commits)
f2fs: readahead encrypted block during GC
f2fs: avoid fi->i_gc_rwsem[WRITE] lock in f2fs_gc
f2fs: fix performance issue observed with multi-thread sequential read
f2fs: fix to skip verifying block address for non-regular inode
f2fs: rework fault injection handling to avoid a warning
f2fs: support fault_type mount option
f2fs: fix to return success when trimming meta area
f2fs: fix use-after-free of dicard command entry
f2fs: support discard submission error injection
f2fs: split discard command in prior to block layer
f2fs: wake up gc thread immediately when gc_urgent is set
f2fs: fix incorrect range->len in f2fs_trim_fs()
f2fs: refresh recent accessed nat entry in lru list
f2fs: fix avoid race between truncate and background GC
f2fs: avoid race between zero_range and background GC
f2fs: fix to do sanity check with block address in main area v2
f2fs: fix to do sanity check with inline flags
f2fs: fix to reset i_gc_failures correctly
f2fs: fix invalid memory access
f2fs: fix to avoid broken of dnode block list
...
...otherwise there will be list corruption due to inode_sb_list_add() being
called for inode already on the sb list.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: e950564b97 ("vfs: don't evict uninitialized inode")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge more updates from Andrew Morton:
- the rest of MM
- procfs updates
- various misc things
- more y2038 fixes
- get_maintainer updates
- lib/ updates
- checkpatch updates
- various epoll updates
- autofs updates
- hfsplus
- some reiserfs work
- fatfs updates
- signal.c cleanups
- ipc/ updates
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (166 commits)
ipc/util.c: update return value of ipc_getref from int to bool
ipc/util.c: further variable name cleanups
ipc: simplify ipc initialization
ipc: get rid of ids->tables_initialized hack
lib/rhashtable: guarantee initial hashtable allocation
lib/rhashtable: simplify bucket_table_alloc()
ipc: drop ipc_lock()
ipc/util.c: correct comment in ipc_obtain_object_check
ipc: rename ipcctl_pre_down_nolock()
ipc/util.c: use ipc_rcu_putref() for failues in ipc_addid()
ipc: reorganize initialization of kern_ipc_perm.seq
ipc: compute kern_ipc_perm.id under the ipc lock
init/Kconfig: remove EXPERT from CHECKPOINT_RESTORE
fs/sysv/inode.c: use ktime_get_real_seconds() for superblock stamp
adfs: use timespec64 for time conversion
kernel/sysctl.c: fix typos in comments
drivers/rapidio/devices/rio_mport_cdev.c: remove redundant pointer md
fork: don't copy inconsistent signal handler state to child
signal: make get_signal() return bool
signal: make sigkill_pending() return bool
...
get_seconds() is deprecated in favor of ktime_get_real_seconds(), which
returns a 64-bit timestamp.
In the SYSV file system, the superblock timestamp is only 32 bits wide,
and it is used to check whether a file system is clean, so the best
solution seems to be to force a wraparound and explicitly convert it to an
unsigned 32-bit value.
This is independent of the inode timestamps that are also 32-bit wide on
disk and that come from current_time().
Link: http://lkml.kernel.org/r/20180713145236.3152513-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We just truncate the seconds to 32-bit in one place now, so this can
trivially be converted over to using timespec64 consistently.
Link: http://lkml.kernel.org/r/20180620100133.4035614-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that we pass down 64-bit timestamps from VFS, we just need to convert
that correctly into on-disk timestamps. To make that work correctly, this
changes the last use of time_to_tm() in the kernel to time64_to_tm(),
which also lets use remove that deprecated interfaces.
Similarly, the time_t use in fat_time_fat2unix() truncates the timestamp
on the way in, which can be avoided by using types that are wide enough to
hold the intermediate values during the conversion.
[hirofumi@mail.parknet.co.jp: remove useless temporary variable, needless long long]
Link: http://lkml.kernel.org/r/20180619153646.3637529-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On corrupted FATfs may have invalid ->i_start. To handle it, this checks
->i_start before using, and return proper error code.
Link: http://lkml.kernel.org/r/87o9f8y1t5.fsf_-_@mail.parknet.co.jp
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Tested-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This fixes the following issues:
- When a buffer size is supplied to reiserfs_listxattr() such that each
individual name fits, but the concatenation of all names doesn't fit,
reiserfs_listxattr() overflows the supplied buffer. This leads to a
kernel heap overflow (verified using KASAN) followed by an out-of-bounds
usercopy and is therefore a security bug.
- When a buffer size is supplied to reiserfs_listxattr() such that a
name doesn't fit, -ERANGE should be returned. But reiserfs instead just
truncates the list of names; I have verified that if the only xattr on a
file has a longer name than the supplied buffer length, listxattr()
incorrectly returns zero.
With my patch applied, -ERANGE is returned in both cases and the memory
corruption doesn't happen anymore.
Credit for making me clean this code up a bit goes to Al Viro, who pointed
out that the ->actor calling convention is suboptimal and should be
changed.
Link: http://lkml.kernel.org/r/20180802151539.5373-1-jannh@google.com
Fixes: 48b32a3553 ("reiserfs: use generic xattr handlers")
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Jeff Mahoney <jeffm@suse.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This uses the deprecated time_t type but is write-only, and could be
removed, but as Jeff explains, having a timestamp can be usefule for
post-mortem analysis in crash dumps.
In order to remove one of the last instances of time_t, this changes the
type to time64_t, same as j_trans_start_time.
Link: http://lkml.kernel.org/r/20180622133315.221210-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Before linux-2.4.6, print_time() was used to pretty-print an inode time
when running reiserfs in user space, after that it has become obsolete and
is still a bit incorrect: It behaves differently on 32-bit and 64-bit
machines, and uses a static buffer to hold a string, which could lead to
undefined behavior if we ever called this from multiple places
simultaneously.
Since we always want to treat the timestamps as 'unsigned' anyway, simply
printing them as an integer is both simpler and safer while avoiding the
deprecated time_t type.
Link: http://lkml.kernel.org/r/20180620142522.27639-3-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Using CLOCK_REALTIME time_t timestamps breaks on 32-bit systems in 2038,
and gives surprising results with a concurrent settimeofday().
This changes the reiserfs journal timestamps to use ktime_get_seconds()
instead, which makes it use a 64-bit CLOCK_MONOTONIC stamp.
In the procfs output, the monotonic timestamp needs to be converted back
to CLOCK_REALTIME to keep the existing ABI.
Link: http://lkml.kernel.org/r/20180620142522.27639-2-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The HFS+ Access Control Lists have not worked at all for the past five
years, and nobody seems to have noticed. Besides, POSIX draft ACLs are
not compatible with MacOS. Drop the feature entirely.
Link: http://lkml.kernel.org/r/20180714190608.wtnmmtjqeyladkut@eaf
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Viacheslav Dubeyko <slava@dubeyko.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Files created under macOS cannot be opened under linux if their names
contain Korean characters, and vice versa.
The Korean alphabet is special because its normalization is done without a
table. The module deals with it correctly when composing, but forgets
about it for the decomposition.
Fix this using the Hangul decomposition function provided in the Unicode
Standard. The code fits a bit awkwardly because it requires a buffer,
while all the other normalizations are returned as pointers to the
decomposition table. This is actually also a bug because reordering may
still be needed, but for now leave it as it is.
The patch will cause trouble for Hangul filenames already created by the
module in the past. This shouldn't really be concern because its main
purpose was always sharing with macOS. If a user actually needs to access
such a file the nodecompose mount option should be enough.
Link: http://lkml.kernel.org/r/20180717220951.p6qqrgautc4pxvzu@eaf
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Reported-by: Ting-Chang Hou <tchou@synology.com>
Tested-by: Ting-Chang Hou <tchou@synology.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After an extent is removed from the extent tree, the corresponding bits
are also cleared from the block allocation file. This is currently done
without releasing the tree lock.
The problem is that the allocation file has extents of its own; if it is
fragmented enough, some of them may be in the extent tree as well, and
hfsplus_get_block() will try to take the lock again.
To avoid deadlock, only hold the extent tree lock during the actual tree
operations.
Link: http://lkml.kernel.org/r/20180709202549.auxwkb6memlegb4a@eaf
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Cc: Viacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
syzbot is reporting NULL pointer dereference at mount_fs() [1]. This is
because hfsplus_fill_super() is by error returning 0 when
hfsplus_fill_super() detected invalid filesystem image, and mount_bdev()
is returning NULL because dget(s->s_root) == NULL if s->s_root == NULL,
and mount_fs() is accessing root->d_sb because IS_ERR(root) == false if
root == NULL. Fix this by returning -EINVAL when hfsplus_fill_super()
detected invalid filesystem image.
[1] https://syzkaller.appspot.com/bug?id=21acb6850cecbc960c927229e597158cf35f33d0
Link: http://lkml.kernel.org/r/d83ce31a-874c-dd5b-f790-41405983a5be@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: syzbot <syzbot+01ffaf5d9568dd1609f7@syzkaller.appspotmail.com>
Reviewed-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use new return type vm_fault_t for page_mkwrite handler.
Link: http://lkml.kernel.org/r/1529555928-2411-1-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The mount time field in the superblock uses a 64-bit timestamp, but
calling get_seconds() may truncate the current time to 32 bits.
This changes it to ktime_get_real_seconds() to avoid the potential
overflow.
Link: http://lkml.kernel.org/r/20180620075041.4154396-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: David Howells <dhowells@redhat.com>
Cc: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The userspace automount(8) daemon is meant to perform a forced expire when
sent a SIGUSR2.
But since the expiration is routed through the kernel and the kernel
doesn't send an expire request if the mount is busy this hasn't worked at
least since autofs version 5.
Add an AUTOFS_EXP_FORCED flag to allow implemention of the feature and
bump the protocol version so user space can check if it's implemented if
needed.
Link: http://lkml.kernel.org/r/152937734715.21213.6594007182776598970.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make the usage of the expire flags consistent by naming the expire flags
the same as it is named in the version 5 miscelaneous ioctl parameters and
only check the bit flags when needed.
Link: http://lkml.kernel.org/r/152937734046.21213.9454131988766280028.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The expire flag AUTOFS_EXP_LEAVES is cleared before the second call to
should_expire() in autofs_expire_indirect() but the parameter passed in
the second call is incorrect.
Fortunately AUTOFS_EXP_LEAVES expire flag has not been used for a long
time but might be needed in the future so fix it rather than remove the
expire leaves functionality.
Link: http://lkml.kernel.org/r/152937732410.21213.7447294898147765076.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The global variable "now" in fs/autofs/expire.c is used in an inconsistent
way, sometimes using jiffies directly, and sometimes using the "now"
variable, and setting it isn't done consistently either.
But the autofs dentry info last_used field is only updated during path
walks or during expire so jiffies can be used directly and the global
variable "now" removed.
Link: http://lkml.kernel.org/r/152937731702.21213.7371321165189170865.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Depending on how it is configured the autofs user space daemon can leave
in use mounts mounted at exit and re-connect to them at start up. But for
this to work best the state of the autofs file system needs to be left
intact over the restart.
Also, at system shutdown, mounts in an autofs file system might be
umounted exposing a mount point trigger for which subsequent access can
lead to a hang. So recent versions of automount(8) now does its best to
set autofs file system mounts catatonic at shutdown.
When autofs file system mounts are catatonic it's currently possible to
create and remove directories and symlinks which can be a problem at
restart, as described above.
So return EACCES in the directory, symlink and unlink methods if the
autofs file system is catatonic.
Link: http://lkml.kernel.org/r/152902119090.4144.9561910674530214291.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Instead of having each caller pass the rdllink explicitly, just have
ep_is_linked() pass it while the callers just need the epi pointer. This
helper is all about the rdllink, and this change, furthermore, improves
the function's self documentation.
Link: http://lkml.kernel.org/r/20180727053432.16679-3-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Similar to other calls, ep_poll() is not called with interrupts disabled,
and we can therefore avoid the irq save/restore dance and just disable
local irqs. In fact, the call should never be called in irq context at
all, considering that the only path is
epoll_wait(2) -> do_epoll_wait() -> ep_poll().
When running on a 2 socket 40-core (ht) IvyBridge a common pipe based
epoll_wait(2) microbenchmark, the following performance improvements are
seen:
# threads vanilla dirty
1 1805587 2106412
2 1854064 2090762
4 1805484 2017436
8 1751222 1974475
16 1725299 1962104
32 1378463 1571233
64 787368 900784
Which is a pretty constantly near 15%.
Also add a lockdep check such that we detect any mischief before
deadlocking.
Link: http://lkml.kernel.org/r/20180727053432.16679-2-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jason Baron <jbaron@akamai.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
... 'tis easier on the eye.
[akpm@linux-foundation.org: use inlines rather than macros]
Link: http://lkml.kernel.org/r/20180725185620.11020-1-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sprinkle lockdep_assert_irqs_enabled() checks in the functions that do not
save and restore interrupts when dealing with the ep->wq.lock. These are
ep_scan_ready_list() and those called by epoll_ctl(): ep_insert, ep_modify
and ep_remove.
[akpm@linux-foundation.org: remove too-obvious comments]
Link: http://lkml.kernel.org/r/20180721183127.3busfa335zlcjeox@linux-r8p5
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Both functions are similar to the context of ep_modify(), called via
epoll_ctl(2). Just like ep_modify(), saving and restoring interrupts is
an overkill in these calls as it will never be called with irqs disabled.
While ep_remove() can be called directly from EPOLL_CTL_DEL, it can also
be called when releasing the file, but this also complies with the above.
Link: http://lkml.kernel.org/r/20180720172956.2883-3-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "fs/epoll: loosen irq safety when possible".
Both patches replace saving+restoring interrupts when taking the ep->lock
(now the waitqueue lock), with just disabling local irqs. This shows
immediate performance benefits in patch 1 for an epoll workload running on
Xen. The main concern we need to have with this sort of changes in epoll
is the ep_poll_callback() which is passed to the wait queue wakeup and is
done very often under irq context, this patch does not touch this call.
Patches have been tested pretty heavily with the customer workload,
microbenchmarks, ltp testcases and two high level workloads that use epoll
under the hood: nginx and libevent benchmarks.
This patch (of 2):
Saving and restoring interrupts in ep_scan_ready_list() is an
overkill as it is never called with interrupts disabled. Loosen
this to simply disabling local irqs such that archs where managing
irqs is expensive or virtual environments. This patch yields
some throughput improvements on a workload that is epoll intensive
running on a single Xen DomU.
1 Job 7500 --> 8800 enq/s (+17%)
2 Jobs 14000 --> 15200 enq/s (+8%)
3 Jobs 20500 --> 22300 enq/s (+8%)
4 Jobs 25000 --> 28000 enq/s (+8-12)%
On bare metal:
For a 2-socket 40-core (ht) IvyBridge on a few workloads, unfortunately I
don't have a xen environment and the results for Xen I do have (which
numbers are in patch 1) I don't have the actual workload, so cannot
compare them directly.
1) Different configurations were used for a epoll_wait (pipes io)
microbench (http://linux-scalability.org/epoll/epoll-test.c) and shows
around a 7-10% improvement in overall total number of times the
epoll_wait() loops when using both regular and nested epolls, so very
raw numbers, but measurable nonetheless.
# threads vanilla dirty
1 1677717 1805587
2 1660510 1854064
4 1610184 1805484
8 1577696 1751222
16 1568837 1725299
32 1291532 1378463
64 752584 787368
Note that stddev is pretty small.
2) Another pipe test, which shows no real measurable improvement.
(http://www.xmailserver.org/linux-patches/pipetest.c)
Link: http://lkml.kernel.org/r/20180720172956.2883-2-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The userfaultfd code currently uses the unlocked waitqueue helpers for
managing fault_wqh, but instead of holding the waitqueue lock for this
waitqueue around these calls, it the waitqueue lock of
fault_pending_wq, which is a different waitqueue instance. Given that
the waitqueue is not exposed to the rest of the kernel this actually
works ok at the moment, but prevents the userfaultfd locking rules from
being enforced using lockdep.
Switch to the internally locked waitqueue helpers instead. This means
that the lock inside fault_wqh now nests inside the fault_pending_wqh
lock, but that's not a problem since it was entirely unused before.
[hch@lst.de: slight changelog updates]
[rppt@linux.vnet.ibm.com: spotted changelog spellos]
Link: http://lkml.kernel.org/r/20171214152344.6880-3-hch@lst.de
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "waitqueue lockdep annotation", v3.
This series adds a strategic lockdep_assert_held to __wake_up_common to
ensure callers really do hold the wait_queue_head lock when calling the
unlocked wake_up variants. It turns out epoll did not do this for a
fairly common path (hit all the time by systemd during bootup), so the
second patch fixed this instance as well.
This patch (of 3):
The epoll code currently uses the unlocked waitqueue helpers for managing
ep->wq, but instead of holding the waitqueue lock around these calls, it
uses its own ep->lock spinlock. Given that the waitqueue is not exposed
to the rest of the kernel this actually works ok at the moment, but
prevents the epoll locking rules from being enforced using lockdep.
Remove ep->lock and use the waitqueue lock to not only reduce the size of
struct eventpoll but also to make sure we can assert locking invariants in
the waitqueue code.
Link: http://lkml.kernel.org/r/20171214152344.6880-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Baron <jbaron@akamai.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The vmcoreinfo information is useful for runtime debugging tools, not just
for crash dumps. A lot of this information can be determined by other
means, but this is much more convenient, and it only adds a page at most
to the file.
Link: http://lkml.kernel.org/r/fddbcd08eed76344863303878b12de1c1e2a04b6.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current code does a full search of the segment list every time for
every page. This is wasteful, since it's almost certain that the next
page will be in the same segment. Instead, check if the previous segment
covers the current page before doing the list search.
Link: http://lkml.kernel.org/r/fd346c11090cf93d867e01b8d73a6567c5ac6361.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, the ELF file header, program headers, and note segment are
allocated all at once, in some icky code dating back to 2.3. Programs
tend to read the file header, then the program headers, then the note
segment, all separately, so this is a waste of effort. It's cleaner and
more efficient to handle the three separately.
Link: http://lkml.kernel.org/r/19c92cbad0e11f6103ff3274b2e7a7e51a1eb74b.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that we're using an rwsem, we can hold it during the entirety of
read_kcore() and have a common return path. This is preparation for the
next change.
[akpm@linux-foundation.org: fix locking bug reported by Tetsuo Handa]
Link: http://lkml.kernel.org/r/d7cfbc1e8a76616f3b699eaff9df0a2730380534.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There's a theoretical race condition that will cause /proc/kcore to miss
a memory hotplug event:
CPU0 CPU1
// hotplug event 1
kcore_need_update = 1
open_kcore() open_kcore()
kcore_update_ram() kcore_update_ram()
// Walk RAM // Walk RAM
__kcore_update_ram() __kcore_update_ram()
kcore_need_update = 0
// hotplug event 2
kcore_need_update = 1
kcore_need_update = 0
Note that CPU1 set up the RAM kcore entries with the state after hotplug
event 1 but cleared the flag for hotplug event 2. The RAM entries will
therefore be stale until there is another hotplug event.
This is an extremely unlikely sequence of events, but the fix makes the
synchronization saner, anyways: we serialize the entire update sequence,
which means that whoever clears the flag will always succeed in replacing
the kcore list.
Link: http://lkml.kernel.org/r/6106c509998779730c12400c1b996425df7d7089.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now we only need kclist_lock from user context and at fs init time, and
the following changes need to sleep while holding the kclist_lock.
Link: http://lkml.kernel.org/r/521ba449ebe921d905177410fee9222d07882f0d.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The memory hotplug notifier kcore_callback() only needs kclist_lock to
prevent races with __kcore_update_ram(), but we can easily eliminate that
race by using an atomic xchg() in __kcore_update_ram(). This is
preparation for converting kclist_lock to an rwsem.
Link: http://lkml.kernel.org/r/0a4bc89f4dbde8b5b2ea309f7b4fb6a85fe29df2.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "/proc/kcore improvements", v4.
This series makes a few improvements to /proc/kcore. It fixes a couple of
small issues in v3 but is otherwise the same. Patches 1, 2, and 3 are
prep patches. Patch 4 is a fix/cleanup. Patch 5 is another prep patch.
Patches 6 and 7 are optimizations to ->read(). Patch 8 makes it possible
to enable CRASH_CORE on any architecture, which is needed for patch 9.
Patch 9 adds vmcoreinfo to /proc/kcore.
This patch (of 9):
kclist_add() is only called at init time, so there's no point in grabbing
any locks. We're also going to replace the rwlock with a rwsem, which we
don't want to try grabbing during early boot.
While we're here, mark kclist_add() with __init so that we'll get a
warning if it's called from non-init code.
Link: http://lkml.kernel.org/r/98208db1faf167aa8b08eebfa968d95c70527739.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Bhupesh Sharma <bhsharma@redhat.com>
Tested-by: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
elf_kcore_store_hdr() uses __pa() to find the physical address of
KCORE_RAM or KCORE_TEXT entries exported as program headers.
This trips CONFIG_DEBUG_VIRTUAL's checks, as the KCORE_TEXT entries are
not in the linear map.
Handle these two cases separately, using __pa_symbol() for the KCORE_TEXT
entries.
Link: http://lkml.kernel.org/r/20180711131944.15252-1-james.morse@arm.com
Signed-off-by: James Morse <james.morse@arm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use new return type vm_fault_t for fault handler in struct
vm_operations_struct. For now, this is just documenting that the function
returns a VM_FAULT value rather than an errno. Once all instances are
converted, vm_fault_t will become a distinct type.
See 1c8f422059 ("mm: change return type to vm_fault_t") for reference.
Link: http://lkml.kernel.org/r/20180702153325.GA3875@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ganesh Goudar <ganeshgr@chelsio.com>
Cc: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Number of CPUs is never high enough to force 64-bit arithmetic.
Save couple of bytes on x86_64.
Link: http://lkml.kernel.org/r/20180627200710.GC18434@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
->latency_record is defined as
struct latency_record[LT_SAVECOUNT];
so use the same macro whie iterating.
Link: http://lkml.kernel.org/r/20180627200534.GA18434@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Code checks if write is done by current to its own attributes.
For that get/put pair is unnecessary as it can be done under RCU.
Note: rcu_read_unlock() can be done even earlier since pointer to a task
is not dereferenced. It depends if /proc code should look scary or not:
rcu_read_lock();
task = pid_task(...);
rcu_read_unlock();
if (!task)
return -ESRCH;
if (task != current)
return -EACCESS:
P.S.: rename "length" variable. Code like this
length = -EINVAL;
should not exist.
Link: http://lkml.kernel.org/r/20180627200218.GF18113@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Readdir context is thread local, so ->pos is thread local,
move it out of readlock.
Link: http://lkml.kernel.org/r/20180627195339.GD18113@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
get_monotonic_boottime() is deprecated and uses the old timespec type.
Let's convert /proc/uptime to use ktime_get_boottime_ts64().
Link: http://lkml.kernel.org/r/20180620081746.282742-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
24074a35c5 ("proc: Make inline name size calculation automatic")
started to put PDE allocations into kmalloc-256 which is unnecessary as
~40 character names are very rare.
Put allocation back into kmalloc-192 cache for 64-bit non-debug builds.
Put BUILD_BUG_ON to know when PDE size has gotten out of control.
[adobriyan@gmail.com: fix BUILD_BUG_ON breakage on powerpc64]
Link: http://lkml.kernel.org/r/20180703191602.GA25521@avx2
Link: http://lkml.kernel.org/r/20180617215732.GA24688@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, percpu memory only exposes allocation and utilization
information via debugfs. This more or less is only really useful for
understanding the fragmentation and allocation information at a per-chunk
level with a few global counters. This is also gated behind a config.
BPF and cgroup, for example, have seen an increase in use causing
increased use of percpu memory. Let's make it easier for someone to
identify how much memory is being used.
This patch adds the "Percpu" stat to meminfo to more easily look up how
much percpu memory is in use. This number includes the cost for all
allocated backing pages and not just insight at the per a unit, per chunk
level. Metadata is excluded. I think excluding metadata is fair because
the backing memory scales with the numbere of cpus and can quickly
outweigh the metadata. It also makes this calculation light.
Link: http://lkml.kernel.org/r/20180807184723.74919-1-dennisszhou@gmail.com
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rather than in vm_area_alloc(). To ensure that the various oddball
stack-based vmas are in a good state. Some of the callers were zeroing
them out, others were not.
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The /proc/pid/smaps_rollup file is currently implemented via the
m_start/m_next/m_stop seq_file iterators shared with the other maps files,
that iterate over vma's. However, the rollup file doesn't print anything
for each vma, only accumulate the stats.
There are some issues with the current code as reported in [1] - the
accumulated stats can get skewed if seq_file start()/stop() op is called
multiple times, if show() is called multiple times, and after seeks to
non-zero position.
Patch [1] fixed those within existing design, but I believe it is
fundamentally wrong to expose the vma iterators to the seq_file mechanism
when smaps_rollup shows logically a single set of values for the whole
address space.
This patch thus refactors the code to provide a single "value" at offset
0, with vma iteration to gather the stats done internally. This fixes the
situations where results are skewed, and simplifies the code, especially
in show_smap(), at the expense of somewhat less code reuse.
[1] https://marc.info/?l=linux-mm&m=151927723128134&w=2
[vbabka@suse.c: use seq_file infrastructure]
Link: http://lkml.kernel.org/r/bf4525b0-fd5b-4c4c-2cb3-adee3dd95a48@suse.cz
Link: http://lkml.kernel.org/r/20180723111933.15443-5-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Daniel Colascione <dancol@google.com>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To prepare for handling /proc/pid/smaps_rollup differently from
/proc/pid/smaps factor out from show_smap() printing the parts of output
that are common for both variants, which is the bulk of the gathered
memory stats.
[vbabka@suse.cz: add const, per Alexey]
Link: http://lkml.kernel.org/r/b45f319f-cd04-337b-37f8-77f99786aa8a@suse.cz
Link: http://lkml.kernel.org/r/20180723111933.15443-4-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Daniel Colascione <dancol@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To prepare for handling /proc/pid/smaps_rollup differently from
/proc/pid/smaps factor out vma mem stats gathering from show_smap() - it
will be used by both.
Link: http://lkml.kernel.org/r/20180723111933.15443-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Daniel Colascione <dancol@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "cleanups and refactor of /proc/pid/smaps*".
The recent regression in /proc/pid/smaps made me look more into the code.
Especially the issues with smaps_rollup reported in [1] as explained in
Patch 4, which fixes them by refactoring the code. Patches 2 and 3 are
preparations for that. Patch 1 is me realizing that there's a lot of
boilerplate left from times where we tried (unsuccessfuly) to mark thread
stacks in the output.
Originally I had also plans to rework the translation from
/proc/pid/*maps* file offsets to the internal structures. Now the offset
means "vma number", which is not really stable (vma's can come and go
between read() calls) and there's an extra caching of last vma's address.
My idea was that offsets would be interpreted directly as addresses, which
would also allow meaningful seeks (see the ugly seek_to_smaps_entry() in
tools/testing/selftests/vm/mlock2.h). However loff_t is (signed) long
long so that might be insufficient somewhere for the unsigned long
addresses.
So the result is fixed issues with skewed /proc/pid/smaps_rollup results,
simpler smaps code, and a lot of unused code removed.
[1] https://marc.info/?l=linux-mm&m=151927723128134&w=2
This patch (of 4):
Commit b76437579d ("procfs: mark thread stack correctly in
proc/<pid>/maps") introduced differences between /proc/PID/maps and
/proc/PID/task/TID/maps to mark thread stacks properly, and this was
also done for smaps and numa_maps. However it didn't work properly and
was ultimately removed by commit b18cb64ead ("fs/proc: Stop trying to
report thread stacks").
Now the is_pid parameter for the related show_*() functions is unused
and we can remove it together with wrapper functions and ops structures
that differ for PID and TID cases only in this parameter.
Link: http://lkml.kernel.org/r/20180723111933.15443-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Daniel Colascione <dancol@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
autofs_sbi() does not check the superblock magic number to verify it has
been given an autofs super block.
Link: http://lkml.kernel.org/r/153475422934.17131.7563724552005298277.stgit@pluto.themaw.net
Reported-by: <syzbot+87c3c541582e56943277@syzkaller.appspotmail.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
XQM_MAXQUOTAS and MAXQUOTAS are, it appears, equivalent. Replace all
usage of XQM_MAXQUOTAS and remove it along with the unused XQM_*QUOTA
definitions.
Signed-off-by: Jeremy Cline <jcline@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
ida_alloc_max() matches what this driver wants to do. Also removes a
call to ida_pre_get(). We no longer need the protection of the mutex,
so convert pty_count to an atomic_t and remove the mutex entirely.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
This contains various bug fixes and cleanups.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCW3xvGwAKCRDh3BK/laaZ
PKECAP9qUpdtQ5RaIL/y9OGZzJLSZbBZuK3LGNY2u2B3EfrSjgEAvhkhXyOQgvVi
kgYLNszbg/C+w8U4Xc5GWB6cjNm6rwE=
=GJI7
-----END PGP SIGNATURE-----
Merge tag 'fuse-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse update from Miklos Szeredi:
"Various bug fixes and cleanups"
* tag 'fuse-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: reduce allocation size for splice_write
fuse: use kvmalloc to allocate array of pipe_buffer structs.
fuse: convert last timespec use to timespec64
fs: fuse: Adding new return type vm_fault_t
fuse: simplify fuse_abort_conn()
fuse: Add missed unlock_page() to fuse_readpages_fill()
fuse: Don't access pipe->buffers without pipe_lock()
fuse: fix initial parallel dirops
fuse: Fix oops at process_init_reply()
fuse: umount should wait for all requests
fuse: fix unlocked access to processing queue
fuse: fix double request_end()
This contains two new features:
1) Stack file operations: this allows removal of several hacks from the
VFS, proper interaction of read-only open files with copy-up,
possibility to implement fs modifying ioctls properly, and others.
2) Metadata only copy-up: when file is on lower layer and only metadata is
modified (except size) then only copy up the metadata and continue to
use the data from the lower file.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCW3srhAAKCRDh3BK/laaZ
PC6tAQCP+KklcN+TvNp502f+O/kATahSpgnun4NY1/p4I8JV+AEAzdlkTN3+MiAO
fn9brN6mBK7h59DO3hqedPLJy2vrgwg=
=QDXH
-----END PGP SIGNATURE-----
Merge tag 'ovl-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs updates from Miklos Szeredi:
"This contains two new features:
- Stack file operations: this allows removal of several hacks from
the VFS, proper interaction of read-only open files with copy-up,
possibility to implement fs modifying ioctls properly, and others.
- Metadata only copy-up: when file is on lower layer and only
metadata is modified (except size) then only copy up the metadata
and continue to use the data from the lower file"
* tag 'ovl-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (66 commits)
ovl: Enable metadata only feature
ovl: Do not do metacopy only for ioctl modifying file attr
ovl: Do not do metadata only copy-up for truncate operation
ovl: add helper to force data copy-up
ovl: Check redirect on index as well
ovl: Set redirect on upper inode when it is linked
ovl: Set redirect on metacopy files upon rename
ovl: Do not set dentry type ORIGIN for broken hardlinks
ovl: Add an inode flag OVL_CONST_INO
ovl: Treat metacopy dentries as type OVL_PATH_MERGE
ovl: Check redirects for metacopy files
ovl: Move some dir related ovl_lookup_single() code in else block
ovl: Do not expose metacopy only dentry from d_real()
ovl: Open file with data except for the case of fsync
ovl: Add helper ovl_inode_realdata()
ovl: Store lower data inode in ovl_inode
ovl: Fix ovl_getattr() to get number of blocks from lower
ovl: Add helper ovl_dentry_lowerdata() to get lower data dentry
ovl: Copy up meta inode data from lowest data inode
ovl: Modify ovl_lookup() and friends to lookup metacopy dentry
...
- Fix an uninitialized variable
- Don't use obviously garbage AG header counters to calculate
transaction reservations
- Trigger icount recalculation on bad icount when monting.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAlty8tsACgkQ+H93GTRK
tOtoKw/+OeCaY6jZc2JoztBwLSUJsMYQ0R8Wsj5GRb4bVp9b0zes7RJMFU03nCtj
XuE4Rhdsx+6+QZQKxTq/Z6lrKHEjF0kL1EVGHtL46Inr+Z+Rr4bLBG6NV1o0dg7B
CR1IqW5vYcZ7Vrk9ko/RXVXtuCIxBS5jSW/S/uFT95Y4lVMAf/2asR/OoYt5ZVE3
17CUfWRifiSGoBQpjtfZd63F23XlEEusiErC5iS9rUbE2qC9FxP9EuvoUP5M/n01
nLS34Fjw7X739AiwHbf10fQPOvBr7atTazCXskjy4gbwqIWTmuhbF4ieTU1OfTI8
ozhvYomBYLiZbsEYBhVCs09VEnIfHmf2HoLh//efGE8VEvoQllxdn/g2PQekoPAn
M7VnRUXCTvaLI8IE2d3Ed1VWm0OTea09xqEiNpB0XGjegim9pXuf6t/zbe4R0vJy
YLBgQT8XRPw5ZgCnBbxvZOXXxQtAqKnqZzYSWGxlHJhhduKVeKMqerhP0nn0ui8g
wAOmOe3XEoyLfSY8WY0ACEEEA00pAwErerwVEFLCpaKTh5GOY4i3OBdqcZOtXacn
f5oIeG9HZKAXKkOTGwpq1zGHTOYhz4mxAYhodRFiEE8rXHDa9odUWQ/iG0zgZaO6
19xznXjXkVWVg0QJqQJi6SbEkkrAEFtFRYH+VPTgWM/1tg47a14=
=+0Eq
-----END PGP SIGNATURE-----
Merge tag 'xfs-4.19-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
- Fix an uninitialized variable
- Don't use obviously garbage AG header counters to calculate
transaction reservations
- Trigger icount recalculation on bad icount when mounting
* tag 'xfs-4.19-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
iomap: fix WARN_ON_ONCE on uninitialized variable
xfs: sanity check ag header values in xrep_calc_ag_resblks
xfs: recalculate summary counters at mount time if icount is bad
Pull core signal handling updates from Eric Biederman:
"It was observed that a periodic timer in combination with a
sufficiently expensive fork could prevent fork from every completing.
This contains the changes to remove the need for that restart.
This set of changes is split into several parts:
- The first part makes PIDTYPE_TGID a proper pid type instead
something only for very special cases. The part starts using
PIDTYPE_TGID enough so that in __send_signal where signals are
actually delivered we know if the signal is being sent to a a group
of processes or just a single process.
- With that prep work out of the way the logic in fork is modified so
that fork logically makes signals received while it is running
appear to be received after the fork completes"
* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (22 commits)
signal: Don't send signals to tasks that don't exist
signal: Don't restart fork when signals come in.
fork: Have new threads join on-going signal group stops
fork: Skip setting TIF_SIGPENDING in ptrace_init_task
signal: Add calculate_sigpending()
fork: Unconditionally exit if a fatal signal is pending
fork: Move and describe why the code examines PIDNS_ADDING
signal: Push pid type down into complete_signal.
signal: Push pid type down into __send_signal
signal: Push pid type down into send_signal
signal: Pass pid type into do_send_sig_info
signal: Pass pid type into send_sigio_to_task & send_sigurg_to_task
signal: Pass pid type into group_send_sig_info
signal: Pass pid and pid type into send_sigqueue
posix-timers: Noralize good_sigevent
signal: Use PIDTYPE_TGID to clearly store where file signals will be sent
pid: Implement PIDTYPE_TGID
pids: Move the pgrp and session pid pointers from task_struct to signal_struct
kvm: Don't open code task_pid in kvm_vcpu_ioctl
pids: Compute task_tgid using signal->leader_pid
...
If we knew that the file was empty, we wouldn't be asking for a layout.
Any optimisation here is already done before calling pnfs_update_layout().
As it stands, we sometimes end up doing an unnecessary inband read to
the MDS even when holding a layout.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If we get an error while retrieving the layout, then we should
report it rather than falling back to I/O through the MDS.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The largest block size supported by isofs is ISOFS_BLOCK_SIZE (2048), but
isofs_fill_super calls sb_min_blocksize and sets the blocksize to the
device's logical block size if it's larger than what we ended up with after
option parsing.
If for some reason we try to mount a hard 4k device as an isofs filesystem,
we'll set opt.blocksize to 4096, and when we try to read the superblock
we found via:
block = iso_blknum << (ISOFS_BLOCK_BITS - s->s_blocksize_bits)
with s_blocksize_bits greater than ISOFS_BLOCK_BITS, we'll have a negative
shift and the bread will fail somewhat cryptically:
isofs_fill_super: bread failed, dev=sda, iso_blknum=17, block=-2147483648
It seems best to just catch and clearly reject mounts of such a device.
Reported-by: Bryan Gurney <bgurney@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
The f2fs_gc() called by f2fs_balance_fs() requires to be called outside of
fi->i_gc_rwsem[WRITE], since f2fs_gc() can try to grab it in a loop.
If it hits the miximum retrials in GC, let's give a chance to release
gc_mutex for a short time in order not to go into live lock in the worst
case.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This reverts the commit - "b93f771 - f2fs: remove writepages lock"
to fix the drop in sequential read throughput.
Test: ./tiotest -t 32 -d /data/tio_tmp -f 32 -b 524288 -k 1 -k 3 -L
device: UFS
Before -
read throughput: 185 MB/s
total read requests: 85177 (of these ~80000 are 4KB size requests).
total write requests: 2546 (of these ~2208 requests are written in 512KB).
After -
read throughput: 758 MB/s
total read requests: 2417 (of these ~2042 are 512KB reads).
total write requests: 2701 (of these ~2034 requests are written in 512KB).
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
- Restructure of lockdep and latency tracers
This is the biggest change. Joel Fernandes restructured the hooks
from irqs and preemption disabling and enabling. He got rid of
a lot of the preprocessor #ifdef mess that they caused.
He turned both lockdep and the latency tracers to use trace events
inserted in the preempt/irqs disabling paths. But unfortunately,
these started to cause issues in corner cases. Thus, parts of the
code was reverted back to where lockde and the latency tracers
just get called directly (without using the trace events).
But because the original change cleaned up the code very nicely
we kept that, as well as the trace events for preempt and irqs
disabling, but they are limited to not being called in NMIs.
- Have trace events use SRCU for "rcu idle" calls. This was required
for the preempt/irqs off trace events. But it also had to not
allow them to be called in NMI context. Waiting till Paul makes
an NMI safe SRCU API.
- New notrace SRCU API to allow trace events to use SRCU.
- Addition of mcount-nop option support
- SPDX headers replacing GPL templates.
- Various other fixes and clean ups.
- Some fixes are marked for stable, but were not fully tested
before the merge window opened.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCW3ruhRQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qiM7AP47NhYdSnCFCRUJfrt6PovXmQtuCHt3
c3QMoGGdvzh9YAEAqcSXwh7uLhpHUp1LjMAPkXdZVwNddf4zJQ1zyxQ+EAU=
=vgEr
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
- Restructure of lockdep and latency tracers
This is the biggest change. Joel Fernandes restructured the hooks
from irqs and preemption disabling and enabling. He got rid of a lot
of the preprocessor #ifdef mess that they caused.
He turned both lockdep and the latency tracers to use trace events
inserted in the preempt/irqs disabling paths. But unfortunately,
these started to cause issues in corner cases. Thus, parts of the
code was reverted back to where lockdep and the latency tracers just
get called directly (without using the trace events). But because the
original change cleaned up the code very nicely we kept that, as well
as the trace events for preempt and irqs disabling, but they are
limited to not being called in NMIs.
- Have trace events use SRCU for "rcu idle" calls. This was required
for the preempt/irqs off trace events. But it also had to not allow
them to be called in NMI context. Waiting till Paul makes an NMI safe
SRCU API.
- New notrace SRCU API to allow trace events to use SRCU.
- Addition of mcount-nop option support
- SPDX headers replacing GPL templates.
- Various other fixes and clean ups.
- Some fixes are marked for stable, but were not fully tested before
the merge window opened.
* tag 'trace-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (44 commits)
tracing: Fix SPDX format headers to use C++ style comments
tracing: Add SPDX License format tags to tracing files
tracing: Add SPDX License format to bpf_trace.c
blktrace: Add SPDX License format header
s390/ftrace: Add -mfentry and -mnop-mcount support
tracing: Add -mcount-nop option support
tracing: Avoid calling cc-option -mrecord-mcount for every Makefile
tracing: Handle CC_FLAGS_FTRACE more accurately
Uprobe: Additional argument arch_uprobe to uprobe_write_opcode()
Uprobes: Simplify uprobe_register() body
tracepoints: Free early tracepoints after RCU is initialized
uprobes: Use synchronize_rcu() not synchronize_sched()
tracing: Fix synchronizing to event changes with tracepoint_synchronize_unregister()
ftrace: Remove unused pointer ftrace_swapper_pid
tracing: More reverting of "tracing: Centralize preemptirq tracepoints and unify their usage"
tracing/irqsoff: Handle preempt_count for different configs
tracing: Partial revert of "tracing: Centralize preemptirq tracepoints and unify their usage"
tracing: irqsoff: Account for additional preempt_disable
trace: Use rcu_dereference_raw for hooks from trace-event subsystem
tracing/kprobes: Fix within_notrace_func() to check only notrace functions
...
basic support for rbd images within namespaces (myself). Also included
y2038 conversion patches from Arnd, a pile of miscellaneous fixes from
Chengguang and Zheng's feature bit infrastructure for the filesystem.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAlt62CkTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzizfhB/0c/rz6frunc6EcZMWuBNzlOIOktJ/m
MEbPGjCxMAsmidO1rqHHYF4iEN5hr+3AWTbtIL2m6wkqYVdg3FjmNaAYB27AdQMG
kH9bLfrKIew72/NZqXfm25yjY/86kIt8t91kay4Lchc97tSYhnFSnku7iAX2HTND
TMhq/1O/GvEyw/RmqnenJEQqFJvKnfgPPQm6W8sM2bH0T5j+EXmDT/Rv+90LogFR
J4+pZkHqDfvyMb1WJ5MkumohytbRVzRNKcMpOvjquJSqUgtgZa2JdrIsypDqSNKY
nUT6jGGlxoSbHCqRwDJoFEJOlh5A9RwKqYxNuM2a/vs9u7HpvdCK/Iah
=AtgY
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.19-rc1' of git://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"The main things are support for cephx v2 authentication protocol and
basic support for rbd images within namespaces (myself).
Also included are y2038 conversion patches from Arnd, a pile of
miscellaneous fixes from Chengguang and Zheng's feature bit
infrastructure for the filesystem"
* tag 'ceph-for-4.19-rc1' of git://github.com/ceph/ceph-client: (40 commits)
ceph: don't drop message if it contains more data than expected
ceph: support cephfs' own feature bits
crush: fix using plain integer as NULL warning
libceph: remove unnecessary non NULL check for request_key
ceph: refactor error handling code in ceph_reserve_caps()
ceph: refactor ceph_unreserve_caps()
ceph: change to void return type for __do_request()
ceph: compare fsc->max_file_size and inode->i_size for max file size limit
ceph: add additional size check in ceph_setattr()
ceph: add additional offset check in ceph_write_iter()
ceph: add additional range check in ceph_fallocate()
ceph: add new field max_file_size in ceph_fs_client
libceph: weaken sizeof check in ceph_x_verify_authorizer_reply()
libceph: check authorizer reply/challenge length before reading
libceph: implement CEPHX_V2 calculation mode
libceph: add authorizer challenge
libceph: factor out encrypt_authorizer()
libceph: factor out __ceph_x_decrypt()
libceph: factor out __prepare_write_connect()
libceph: store ceph_auth_handshake pointer in ceph_connection
...
When inode is getting deleted and someone else holds reference to a mark
attached to the inode, we just detach the connector from the inode. In
that case fsnotify_put_mark() called from fsnotify_destroy_marks() will
decide to recalculate mask for the inode and __fsnotify_recalc_mask()
will WARN about invalid connector type:
WARNING: CPU: 1 PID: 12015 at fs/notify/mark.c:139
__fsnotify_recalc_mask+0x2d7/0x350 fs/notify/mark.c:139
Actually there's no reason to warn about detached connector in
__fsnotify_recalc_mask() so just silently skip updating the mask in such
case.
Reported-by: syzbot+c34692a51b9a6ca93540@syzkaller.appspotmail.com
Fixes: 3ac70bfcde ("fsnotify: add helper to get mask from connector")
Signed-off-by: Jan Kara <jack@suse.cz>
Here are all of the driver core and related patches for 4.19-rc1.
Nothing huge here, just a number of small cleanups and the ability to
now stop the deferred probing after init happens.
All of these have been in linux-next for a while with only a merge issue
reported. That merge issue is in fs/sysfs/group.c and Stephen has
posted the diff of what it should be to resolve this. I'll follow up
with that diff to this pull request.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCW3g86Q8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ynyXQCePaZSW8wft4b7nLN8RdZ98ATBru0Ani10lrJa
HQeQJRNbWU1AZ0ym7695
=tOaH
-----END PGP SIGNATURE-----
Merge tag 'driver-core-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here are all of the driver core and related patches for 4.19-rc1.
Nothing huge here, just a number of small cleanups and the ability to
now stop the deferred probing after init happens.
All of these have been in linux-next for a while with only a merge
issue reported"
* tag 'driver-core-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (21 commits)
base: core: Remove WARN_ON from link dependencies check
drivers/base: stop new probing during shutdown
drivers: core: Remove glue dirs from sysfs earlier
driver core: remove unnecessary function extern declare
sysfs.h: fix non-kernel-doc comment
PM / Domains: Stop deferring probe at the end of initcall
iommu: Remove IOMMU_OF_DECLARE
iommu: Stop deferring probe at end of initcalls
pinctrl: Support stopping deferred probe after initcalls
dt-bindings: pinctrl: add a 'pinctrl-use-default' property
driver core: allow stopping deferred probe after init
driver core: add a debugfs entry to show deferred devices
sysfs: Fix internal_create_group() for named group updates
base: fix order of OF initialization
linux/device.h: fix kernel-doc notation warning
Documentation: update firmware loader fallback reference
kobject: Replace strncpy with memcpy
drivers: base: cacheinfo: use OF property_read_u32 instead of get_property,read_number
kernfs: Replace strncpy with memcpy
device: Add #define dev_fmt similar to #define pr_fmt
...
This tag is the same as 9p-for-4.19 without the two MAINTAINERS patches
Contains mostly fixes (6 to be backported to stable) and a few changes,
here is the breakdown:
* Rework how fids are attributed by replacing some custom tracking in a
list by an idr (f28cdf0430)
* For packet-based transports (virtio/rdma) validate that the packet
length matches what the header says (f984579a01)
* A few race condition fixes found by syzkaller (9f476d7c54,
430ac66eb4)
* Missing argument check when NULL device is passed in sys_mount
(10aa14527f)
* A few virtio fixes (23cba9cbde, 31934da810, d28c756cae)
* Some spelling and style fixes
----------------------------------------------------------------
Chirantan Ekbote (1):
9p/net: Fix zero-copy path in the 9p virtio transport
Colin Ian King (1):
fs/9p/v9fs.c: fix spelling mistake "Uknown" -> "Unknown"
Jean-Philippe Brucker (1):
net/9p: fix error path of p9_virtio_probe
Matthew Wilcox (4):
9p: Fix comment on smp_wmb
9p: Change p9_fid_create calling convention
9p: Replace the fidlist with an IDR
9p: Embed wait_queue_head into p9_req_t
Souptick Joarder (1):
fs/9p/vfs_file.c: use new return type vm_fault_t
Stephen Hemminger (1):
9p: fix whitespace issues
Tomas Bortoli (5):
net/9p/client.c: version pointer uninitialized
net/9p/trans_fd.c: fix race-condition by flushing workqueue before the kfree()
net/9p/trans_fd.c: fix race by holding the lock
9p: validate PDU length
9p: fix multiple NULL-pointer-dereferences
jiangyiwen (2):
net/9p/virtio: Fix hard lockup in req_done
9p/virtio: fix off-by-one error in sg list bounds check
piaojun (5):
net/9p/client.c: add missing '\n' at the end of p9_debug()
9p/net/protocol.c: return -ENOMEM when kmalloc() failed
net/9p/trans_virtio.c: fix some spell mistakes in comments
fs/9p/xattr.c: catch the error of p9_client_clunk when setting xattr failed
net/9p/trans_virtio.c: add null terminal for mount tag
fs/9p/v9fs.c | 2 +-
fs/9p/vfs_file.c | 2 +-
fs/9p/xattr.c | 6 ++++--
include/net/9p/client.h | 11 ++++-------
net/9p/client.c | 119 +++++++++++++++++++++++++++++++++++++++++++++++++-------------------------------------------------------------------
net/9p/protocol.c | 2 +-
net/9p/trans_fd.c | 22 +++++++++++++++-------
net/9p/trans_rdma.c | 4 ++++
net/9p/trans_virtio.c | 66 +++++++++++++++++++++++++++++++++++++---------------------------
net/9p/trans_xen.c | 3 +++
net/9p/util.c | 1 -
12 files changed, 122 insertions(+), 116 deletions(-)
-----BEGIN PGP SIGNATURE-----
iF0EABECAB0WIQQ8idm2ZSicIMLgzKqoqIItDqvwPAUCW3ElNwAKCRCoqIItDqvw
PMzfAKCkCYFyNC89vcpxcCNsK7rFQ1qKlwCgoaBpZDdegOu0jMB7cyKwAWrB0LM=
=h3T0
-----END PGP SIGNATURE-----
Merge tag '9p-for-4.19-2' of git://github.com/martinetd/linux
Pull 9p updates from Dominique Martinet:
"This contains mostly fixes (6 to be backported to stable) and a few
changes, here is the breakdown:
- rework how fids are attributed by replacing some custom tracking in
a list by an idr
- for packet-based transports (virtio/rdma) validate that the packet
length matches what the header says
- a few race condition fixes found by syzkaller
- missing argument check when NULL device is passed in sys_mount
- a few virtio fixes
- some spelling and style fixes"
* tag '9p-for-4.19-2' of git://github.com/martinetd/linux: (21 commits)
net/9p/trans_virtio.c: add null terminal for mount tag
9p/virtio: fix off-by-one error in sg list bounds check
9p: fix whitespace issues
9p: fix multiple NULL-pointer-dereferences
fs/9p/xattr.c: catch the error of p9_client_clunk when setting xattr failed
9p: validate PDU length
net/9p/trans_fd.c: fix race by holding the lock
net/9p/trans_fd.c: fix race-condition by flushing workqueue before the kfree()
net/9p/virtio: Fix hard lockup in req_done
net/9p/trans_virtio.c: fix some spell mistakes in comments
9p/net: Fix zero-copy path in the 9p virtio transport
9p: Embed wait_queue_head into p9_req_t
9p: Replace the fidlist with an IDR
9p: Change p9_fid_create calling convention
9p: Fix comment on smp_wmb
net/9p/client.c: version pointer uninitialized
fs/9p/v9fs.c: fix spelling mistake "Uknown" -> "Unknown"
net/9p: fix error path of p9_virtio_probe
9p/net/protocol.c: return -ENOMEM when kmalloc() failed
net/9p/client.c: add missing '\n' at the end of p9_debug()
...
Merge updates from Andrew Morton:
- a few misc things
- a few Y2038 fixes
- ntfs fixes
- arch/sh tweaks
- ocfs2 updates
- most of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (111 commits)
mm/hmm.c: remove unused variables align_start and align_end
fs/userfaultfd.c: remove redundant pointer uwq
mm, vmacache: hash addresses based on pmd
mm/list_lru: introduce list_lru_shrink_walk_irq()
mm/list_lru.c: pass struct list_lru_node* as an argument to __list_lru_walk_one()
mm/list_lru.c: move locking from __list_lru_walk_one() to its caller
mm/list_lru.c: use list_lru_walk_one() in list_lru_walk_node()
mm, swap: make CONFIG_THP_SWAP depend on CONFIG_SWAP
mm/sparse: delete old sparse_init and enable new one
mm/sparse: add new sparse_init_nid() and sparse_init()
mm/sparse: move buffer init/fini to the common place
mm/sparse: use the new sparse buffer functions in non-vmemmap
mm/sparse: abstract sparse buffer allocations
mm/hugetlb.c: don't zero 1GiB bootmem pages
mm, page_alloc: double zone's batchsize
mm/oom_kill.c: document oom_lock
mm/hugetlb: remove gigantic page support for HIGHMEM
mm, oom: remove sleep from under oom_lock
kernel/dma: remove unsupported gfp_mask parameter from dma_alloc_from_contiguous()
mm/cma: remove unsupported gfp_mask parameter from cma_alloc()
...
Pointer uwq is being assigned but is never used hence it is redundant
and can be removed.
Cleans up clang warning:
warning: variable 'uwq' set but not used [-Wunused-but-set-variable]
Link: http://lkml.kernel.org/r/20180717090802.18357-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We need to distinguish the situations when shrinker has very small
amount of objects (see vfs_pressure_ratio() called from
super_cache_count()), and when it has no objects at all. Currently, in
the both of these cases, shrinker::count_objects() returns 0.
The patch introduces new SHRINK_EMPTY return value, which will be used
for "no objects at all" case. It's is a refactoring mostly, as
SHRINK_EMPTY is replaced by 0 by all callers of do_shrink_slab() in this
patch, and all the magic will happen in further.
Link: http://lkml.kernel.org/r/153063069574.1818.11037751256699341813.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add list_lru::shrinker_id field and populate it by registered shrinker
id.
This will be used to set correct bit in memcg shrinkers map by lru code
in next patches, after there appeared the first related to memcg element
in list_lru.
Link: http://lkml.kernel.org/r/153063059758.1818.14866596416857717800.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Do two list_lru_init_memcg() calls after prealloc_super().
destroy_unused_super() in fail path is OK with this. Next patch needs
such the order.
Link: http://lkml.kernel.org/r/153063058712.1818.3382490999719078571.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The buffer_head can consume a significant amount of system memory and is
directly related to the amount of page cache. In our production
environment we have observed that a lot of machines are spending a
significant amount of memory as buffer_head and can not be left as
system memory overhead.
Charging buffer_head is not as simple as adding __GFP_ACCOUNT to the
allocation. The buffer_heads can be allocated in a memcg different from
the memcg of the page for which buffer_heads are being allocated. One
concrete example is memory reclaim. The reclaim can trigger I/O of
pages of any memcg on the system. So, the right way to charge
buffer_head is to extract the memcg from the page for which buffer_heads
are being allocated and then use targeted memcg charging API.
[shakeelb@google.com: use __GFP_ACCOUNT for directed memcg charging]
Link: http://lkml.kernel.org/r/20180702220208.213380-1-shakeelb@google.com
Link: http://lkml.kernel.org/r/20180627191250.209150-3-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Directed kmem charging", v8.
The Linux kernel's memory cgroup allows limiting the memory usage of the
jobs running on the system to provide isolation between the jobs. All
the kernel memory allocated in the context of the job and marked with
__GFP_ACCOUNT will also be included in the memory usage and be limited
by the job's limit.
The kernel memory can only be charged to the memcg of the process in
whose context kernel memory was allocated. However there are cases
where the allocated kernel memory should be charged to the memcg
different from the current processes's memcg. This patch series
contains two such concrete use-cases i.e. fsnotify and buffer_head.
The fsnotify event objects can consume a lot of system memory for large
or unlimited queues if there is either no or slow listener. The events
are allocated in the context of the event producer. However they should
be charged to the event consumer. Similarly the buffer_head objects can
be allocated in a memcg different from the memcg of the page for which
buffer_head objects are being allocated.
To solve this issue, this patch series introduces mechanism to charge
kernel memory to a given memcg. In case of fsnotify events, the memcg
of the consumer can be used for charging and for buffer_head, the memcg
of the page can be charged. For directed charging, the caller can use
the scope API memalloc_[un]use_memcg() to specify the memcg to charge
for all the __GFP_ACCOUNT allocations within the scope.
This patch (of 2):
A lot of memory can be consumed by the events generated for the huge or
unlimited queues if there is either no or slow listener. This can cause
system level memory pressure or OOMs. So, it's better to account the
fsnotify kmem caches to the memcg of the listener.
However the listener can be in a different memcg than the memcg of the
producer and these allocations happen in the context of the event
producer. This patch introduces remote memcg charging API which the
producer can use to charge the allocations to the memcg of the listener.
There are seven fsnotify kmem caches and among them allocations from
dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
inotify_inode_mark_cachep happens in the context of syscall from the
listener. So, SLAB_ACCOUNT is enough for these caches.
The objects from fsnotify_mark_connector_cachep are not accounted as
they are small compared to the notification mark or events and it is
unclear whom to account connector to since it is shared by all events
attached to the inode.
The allocations from the event caches happen in the context of the event
producer. For such caches we will need to remote charge the allocations
to the listener's memcg. Thus we save the memcg reference in the
fsnotify_group structure of the listener.
This patch has also moved the members of fsnotify_group to keep the size
same, at least for 64 bit build, even with additional member by filling
the holes.
[shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it]
Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com
Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
a_ops->readpages() is only ever used for read-ahead. Ensure that we
pass this information down to the block layer.
Link: http://lkml.kernel.org/r/20180621010725.17813-5-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <clm@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
a_ops->readpages() is only ever used for read-ahead. Ensure that we
pass this information down to the block layer.
Link: http://lkml.kernel.org/r/20180621010725.17813-4-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <clm@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
a_ops->readpages() is only ever used for read-ahead, yet we don't flag
the IO being submitted as such. Fix that up. Any file system that uses
mpage_readpages() as its ->readpages() implementation will now get this
right.
Since we're passing in whether the IO is read-ahead or not, we don't
need to pass in the 'gfp' separately, as it is dependent on the IO being
read-ahead. Kill off that member.
Add some documentation notes on ->readpages() being purely for
read-ahead.
Link: http://lkml.kernel.org/r/20180621010725.17813-3-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <clm@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Submit ->readpages() IO as read-ahead", v4.
The only caller of ->readpages() is from read-ahead, yet we don't submit
IO flagged with REQ_RAHEAD. This means we don't see it in blktrace, for
instance, which is a shame. Additionally, it's preventing further
functional changes in the block layer for deadling with read-ahead more
intelligently. We already make assumptions about ->readpages() just
being for read-ahead in the mpage implementation, using
readahead_gfp_mask(mapping) as out GFP mask of choice.
This small series fixes up mpage_readpages() to submit with REQ_RAHEAD,
which takes care of file systems using mpage_readpages(). The first
patch is a prep patch, that makes do_mpage_readpage() take an argument
structure.
This patch (of 4):
We're currently passing 8 arguments to this function, clean it up a bit
by packing the arguments in an args structure we pass to it.
No intentional functional changes in this patch.
[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20180621010725.17813-2-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <clm@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The documentation for seq_file suggests that it is necessary to be able
to move the iterator to a given offset, however that is not the case.
If the iterator is stored in the private data and is stable from one
read() syscall to the next, it is only necessary to support first/next
interactions. Implementing this in a client is a little clumsy.
- if ->start() is given a pos of zero, it should go to start of
sequence.
- if ->start() is given the name pos that was given to the most recent
next() or start(), it should restore the iterator to state just
before that last call
- if ->start is given another number, it should set the iterator one
beyond the start just before the last ->start or ->next call.
Also, the documentation says that the implementation can interpret the
pos however it likes (other than zero meaning start), but seq_file
increments the pos sometimes which does impose on the implementation.
This patch simplifies the interface for first/next iteration and
simplifies the code, while maintaining complete backward compatability.
Now:
- if ->start() is given a pos of zero, it should return an iterator
placed at the start of the sequence
- if ->start() is given a non-zero pos, it should return the iterator
in the same state it was after the last ->start or ->next.
This is particularly useful for interators which walk the multiple
chains in a hash table, e.g. using rhashtable_walk*. See
fs/gfs2/glock.c and drivers/staging/lustre/lustre/llite/vvp_dev.c
A large part of achieving this is to *always* call ->next after ->show
has successfully stored all of an entry in the buffer. Never just
increment the index instead. Also:
- always pass &m->index to ->start() and ->next(), never a temp
variable
- don't clear ->from when ->count is zero, as ->from is dead when
->count is zero.
Some ->next functions do not increment *pos when they return NULL. To
maintain compatability with this, we still need to increment m->index in
one place, if ->next didn't increment it. Note that such ->next
functions are buggy and should be fixed. A simple demonstration is
dd if=/proc/swaps bs=1000 skip=1
Choose any block size larger than the size of /proc/swaps. This will
always show the whole last line of /proc/swaps.
This patch doesn't work around buggy next() functions for this case.
[neilb@suse.com: ensure ->from is valid]
Link: http://lkml.kernel.org/r/87601ryb8a.fsf@notabene.neil.brown.name
Signed-off-by: NeilBrown <neilb@suse.com>
Acked-by: Jonathan Corbet <corbet@lwn.net> [docs]
Tested-by: Jann Horn <jannh@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This flag was introduce in 2.1.37pre1 and the only place it was tested
was removed in 2.1.43pre1. The flag was never set.
Let's discard it properly.
Link: http://lkml.kernel.org/r/877en0hewz.fsf@notabene.neil.brown.name
Signed-off-by: NeilBrown <neilb@suse.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since only dentry->d_name.len + 1 bytes out of DNAME_INLINE_LEN bytes
are initialized at __d_alloc(), we can't copy the whole size
unconditionally.
WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (ffff8fa27465ac50)
636f6e66696766732e746d70000000000010000000000000020000000188ffff
i i i i i i i i i i i i i u u u u u u u u u u i i i i i u u u u
^
RIP: 0010:take_dentry_name_snapshot+0x28/0x50
RSP: 0018:ffffa83000f5bdf8 EFLAGS: 00010246
RAX: 0000000000000020 RBX: ffff8fa274b20550 RCX: 0000000000000002
RDX: ffffa83000f5be40 RSI: ffff8fa27465ac50 RDI: ffffa83000f5be60
RBP: ffffa83000f5bdf8 R08: ffffa83000f5be48 R09: 0000000000000001
R10: ffff8fa27465ac00 R11: ffff8fa27465acc0 R12: ffff8fa27465ac00
R13: ffff8fa27465acc0 R14: 0000000000000000 R15: 0000000000000000
FS: 00007f79737ac8c0(0000) GS:ffffffff8fc30000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff8fa274c0b000 CR3: 0000000134aa7002 CR4: 00000000000606f0
take_dentry_name_snapshot+0x28/0x50
vfs_rename+0x128/0x870
SyS_rename+0x3b2/0x3d0
entry_SYSCALL_64_fastpath+0x1a/0xa4
0xffffffffffffffff
Link: http://lkml.kernel.org/r/201709131912.GBG39012.QMJLOVFSFFOOtH@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are a variety of functions and variables that are local to the
source and do not need to be in global scope, so make them static. Also
make a couple of char arrays static const.
Cleans up sparse warnings:
symbol 'o2hb_heartbeat_mode_desc' was not declared. Should it be static?
symbol 'o2hb_heartbeat_mode' was not declared. Should it be static?
symbol 'o2hb_dependent_users' was not declared. Should it be static?
symbol 'o2hb_region_dec_user' was not declared. Should it be static?
symbol 'o2nm_fence_method_desc' was not declared. Should it be static?
symbol 'lockdep_keys' was not declared. Should it be static?
Link: http://lkml.kernel.org/r/20180628131659.12133-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Several functions have some unnecessary code, clean up these code.
Link: http://lkml.kernel.org/r/5B14DF72.5020800@huawei.com
Signed-off-by: Yan Wang <wangyan122@huawei.com>
Reviewed-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We should return -EROFS rather than other errno if filesystem becomes
read-only.
[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/5B191B26.9010501@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
Acked-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the quest to remove all stack VLA usage from the kernel[1], this
allocates the maximum size stack buffer. Existing checks already
require that blocksize >= NTFS_BLOCK_SIZE and mft_record_size <=
PAGE_SIZE, so max_bhs can be at most PAGE_SIZE / NTFS_BLOCK_SIZE.
Sanity checks are added for robustness.
[1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com
Link: http://lkml.kernel.org/r/20180626172909.41453-4-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the quest to remove all stack VLA usage from the kernel[1], this
moves the stack buffer used during decompression to be allocated
externally.
The existing "dest_max_index" used in the VLA is bounded by cb_max_page.
cb_max_page is bounded by max_page, and max_page is bounded by nr_pages.
Since nr_pages is used for the "pages" allocation, it can similarly be
used for the "completed_pages" allocation and passed into the
decompression function. The error paths are updated to free the new
allocation.
[1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com
Link: http://lkml.kernel.org/r/20180626172909.41453-3-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ntfs_end_buffer_async_read() disables interrupts around kmap_atomic().
This is a leftover from the old kmap_atomic() implementation which
relied on fixed mapping slots, so the caller had to make sure that the
same slot could not be reused from an interrupting context.
kmap_atomic() was changed to dynamic slots long ago and commit
1ec9c5ddc1 ("include/linux/highmem.h: remove the second argument of
k[un]map_atomic()") removed the slot assignements, but the callers were
not checked for now redundant interrupt disabling.
Remove the conditional interrupt disable.
Link: http://lkml.kernel.org/r/20180611144913.gln5mklhqcrfsoom@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Anton Altaparmakov <anton@tuxera.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The VFS timestamps are all 64-bit now, the only missing piece for hpfs
is the internal conversion function. One interesting bit about hpfs is
that it can already deal with moving the 136 year window of its
timestamps to support a much wider range than other file systems with
32-bit timestamps. It also treats the timestamps as 'unsigned' on
64-bit architectures (but signed on 32-bit, because time_t always around
to negative numbers in 2038).
Changing the conversion to use time64_t makes 32-bit architectures
behave the same way as 64-bit. For completeness, this also adds a
clamp_t call for each conversion, so we don't wrap the timestamps but
instead stay within the [0..U32_MAX] range of the on-disk timestamps.
Link: http://lkml.kernel.org/r/20180718115017.742609-3-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that the VFS has been converted from timespec to timespec64
timestamps, only the conversion to/from ntfs timestamps uses 32-bit
seconds.
This changes that last missing piece to get the ntfs implementation
y2038 safe on 32-bit architectures.
Link: http://lkml.kernel.org/r/20180718115017.742609-2-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Anton Altaparmakov <anton@tuxera.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>