linux

mirror of https://github.com/torvalds/linux.git synced 2024-12-18 09:02:17 +00:00

Author	SHA1	Message	Date
Miklos Szeredi	05acefb487	ovl: check permission to open real file Call inode_permission() on real inode before opening regular file on one of the underlying layers. In some cases ovl_permission() already checks access to an underlying file, but it misses the metacopy case, and possibly other ones as well. Removing the redundant permission check from ovl_permission() should be considered later. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-06-03 09:45:22 +02:00
Miklos Szeredi	292f902a40	ovl: call security hook in ovl_real_ioctl() Verify LSM permissions for underlying file, since vfs_ioctl() doesn't do it. [Stephen Rothwell] export security_file_ioctl Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-06-03 09:45:18 +02:00
Miklos Szeredi	56230d9567	ovl: verify permissions in ovl_path_open() Check permission before opening a real file. ovl_path_open() is used by readdir and copy-up routines. ovl_permission() theoretically already checked copy up permissions, but it doesn't hurt to re-do these checks during the actual copy-up. For directory reading ovl_permission() only checks access to topmost underlying layer. Readdir on a merged directory accesses layers below the topmost one as well. Permission wasn't checked for these layers. Note: modifying ovl_permission() to perform this check would be far more complex and hence more bug prone. The result is less precise permissions returned in access(2). If this turns out to be an issue, we can revisit this bug. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-06-02 22:20:26 +02:00
Miklos Szeredi	48bd024b8a	ovl: switch to mounter creds in readdir In preparation for more permission checking, override credentials for directory operations on the underlying filesystems. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-06-02 22:20:25 +02:00
Miklos Szeredi	130fdbc3d1	ovl: pass correct flags for opening real directory The three instances of ovl_path_open() in overlayfs/readdir.c do three different things: - pass f_flags from overlay file - pass O_RDONLY | O_DIRECTORY - pass just O_RDONLY The value of f_flags can be (other than O_RDONLY): O_WRONLY - not possible for a directory O_RDWR - not possible for a directory O_CREAT - masked out by dentry_open() O_EXCL - masked out by dentry_open() O_NOCTTY - masked out by dentry_open() O_TRUNC - masked out by dentry_open() O_APPEND - no effect on directory ops O_NDELAY - no effect on directory ops O_NONBLOCK - no effect on directory ops __O_SYNC - no effect on directory ops O_DSYNC - no effect on directory ops FASYNC - no effect on directory ops O_DIRECT - no effect on directory ops O_LARGEFILE - ? O_DIRECTORY - only affects lookup O_NOFOLLOW - only affects lookup O_NOATIME - overlay sets this unconditionally in ovl_path_open() O_CLOEXEC - only affects fd allocation O_PATH - no effect on directory ops __O_TMPFILE - not possible for a directory For non-merge directories we use the underlying filesystem's iterate; in this case honor O_LARGEFILE from the original file to make sure that open doesn't get rejected. For merge directories it's safe to pass O_LARGEFILE unconditionally since userspace will only see the artificial offsets created by overlayfs. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-06-02 22:20:25 +02:00
Vivek Goyal	21d8d66abf	ovl: fix redirect traversal on metacopy dentries Amir pointed me to metacopy test cases in unionmount-testsuite and I decided to run "./run --ov=10 --meta" and it failed while running test "rename-mass-5.py". Problem is w.r.t absolute redirect traversal on intermediate metacopy dentry. We do not store intermediate metacopy dentries and also skip current loop/layer and move onto lookup in next layer. But at the end of loop, we have logic to reset "poe" and layer index if currently looked up dentry has absolute redirect. We skip all that and that means lookup in next layer will fail. Following is simple test case to reproduce this. - mkdir -p lower upper work merged lower/a lower/b - touch lower/a/foo.txt - mount -t overlay -o lowerdir=lower,upperdir=upper,workdir=work,metacopy=on none merged # Following will create absolute redirect "/a/foo.txt" on upper/b/bar.txt. - mv merged/a/foo.txt merged/b/bar.txt # unmount overlay and use upper as lower layer (lower2) for next mount. - umount merged - mv upper lower2 - rm -rf work; mkdir -p upper work - mount -t overlay -o lowerdir=lower2:lower,upperdir=upper,workdir=work,metacopy=on none merged # Force a metacopy copy-up - chown bin:bin merged/b/bar.txt # unmount overlay and use upper as lower layer (lower3) for next mount. - umount merged - mv upper lower3 - rm -rf work; mkdir -p upper work - mount -t overlay -o lowerdir=lower3:lower2:lower,upperdir=upper,workdir=work,metacopy=on none merged # ls merged/b/bar.txt ls: cannot access 'bar.txt': Input/output error Intermediate lower layer (lower2) has metacopy dentry b/bar.txt with absolute redirect "/a/foo.txt". We skipped redirect processing at the end of loop which sets poe to roe and sets the appropriate next lower layer index. And that means lookup failed in next layer. Fix this by continuing the loop for any intermediate dentries. We still do not save these at lower stack. With this fix applied unionmount-testsuite, "./run --ov=10 --meta" now passes. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-06-02 22:20:25 +02:00
Vivek Goyal	28166ab3c8	ovl: initialize OVL_UPPERDATA in ovl_lookup() Currently ovl_get_inode() initializes OVL_UPPERDATA flag and for that it has to call ovl_check_metacopy_xattr() and check if metacopy xattr is present or not. yangerkun reported sometimes underlying filesystem might return -EIO and in that case error handling path does not cleanup properly leading to various warnings. Run generic/461 with ext4 upper/lower layer sometimes may trigger the bug as below (linux 4.19): [ 551.001349] overlayfs: failed to get metacopy (-5) [ 551.003464] overlayfs: failed to get inode (-5) [ 551.004243] overlayfs: cleanup of 'd44/fd51' failed (-5) [ 551.004941] overlayfs: failed to get origin (-5) [ 551.005199] ------------[ cut here ]------------ [ 551.006697] WARNING: CPU: 3 PID: 24674 at fs/inode.c:1528 iput+0x33b/0x400 ... [ 551.027219] Call Trace: [ 551.027623] ovl_create_object+0x13f/0x170 [ 551.028268] ovl_create+0x27/0x30 [ 551.028799] path_openat+0x1a35/0x1ea0 [ 551.029377] do_filp_open+0xad/0x160 [ 551.029944] ? vfs_writev+0xe9/0x170 [ 551.030499] ? page_counter_try_charge+0x77/0x120 [ 551.031245] ? __alloc_fd+0x160/0x2a0 [ 551.031832] ? do_sys_open+0x189/0x340 [ 551.032417] ? get_unused_fd_flags+0x34/0x40 [ 551.033081] do_sys_open+0x189/0x340 [ 551.033632] __x64_sys_creat+0x24/0x30 [ 551.034219] do_syscall_64+0xd5/0x430 [ 551.034800] entry_SYSCALL_64_after_hwframe+0x44/0xa9 One solution is to improve error handling and call iget_failed() if error is encountered. Amir thinks that this path is a little intricate and there is no real need to check and initialize OVL_UPPERDATA in ovl_get_inode(). Instead caller of ovl_get_inode() can initialize this state. And this will avoid double checking of metacopy xattr lookup in ovl_lookup() and ovl_get_inode(). OVL_UPPERDATA is an inode flag. So I was a little concerned that initializing it outside ovl_get_inode() might have some races. But this is a one-way transition. That is once a file has been fully copied up, it can't go back to metacopy file again. And that seems to help avoid races. So as of now I can't see any races w.r.t OVL_UPPERDATA being set wrongly. So move setting of OVL_UPPERDATA inside the callers of ovl_get_inode(). ovl_obtain_alias() already does it. So only two callers now left are ovl_lookup() and ovl_instantiate(). Reported-by: yangerkun <yangerkun@huawei.com> Suggested-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-06-02 22:20:25 +02:00
Vivek Goyal	6815f479ca	ovl: use only uppermetacopy state in ovl_lookup() Currently we use a variable "metacopy" which signifies that dentry could be either uppermetacopy or lowermetacopy. Amir suggested that we can move code around and use d.metacopy in such a way that we don't need lowermetacopy and can make do with just uppermetacopy. So this patch replaces "metacopy" with "uppermetacopy". It also moves some code a little higher to make reading a little simpler. Suggested-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-06-02 22:20:25 +02:00
Vivek Goyal	59fb20138a	ovl: simplify setting of origin for index lookup overlayfs can keep index of copied up files and directories and it seems to serve two primary purposes. For regular files, it avoids breaking lower hardlinks over copy up. For directories it seems to be used for various error checks. During ovl_lookup(), we lookup for index using lower dentry in many cases. That lower dentry is called "origin" and following is a summary of current logic. If there is no upperdentry, always lookup for index using lower dentry. For regular files it helps avoiding breaking hard links over copyup and for directories it seems to be just error checks. If there is an upperdentry, then there are 3 possible cases. - For directories, lower dentry is found using two ways. One is regular path based lookup in lower layers and second is using ORIGIN xattr on upper dentry. First verify that path based lookup lower dentry matches the one pointed by upper ORIGIN xattr. If yes, use this verified origin for index lookup. - For regular files (non-metacopy), there is no path based lookup in lower layers as lookup stops once we find upper dentry. So there is no origin verification. If there is ORIGIN xattr present on upper, use that to lookup index otherwise don't. - For regular metacopy files, again lower dentry is found using path based lookup as well as ORIGIN xattr on upper. Path based lookup is continued in this case to find lower data dentry for metacopy upper. So like directories we only use verified origin. If ORIGIN xattr is not present (Either because lower did not support file handles or because this is hardlink copied up with index=off), then don't use path lookup based lower dentry as origin. This is same as regular non-metacopy file case. Suggested-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-06-02 22:20:25 +02:00
Amir Goldstein	522f6e6cba	ovl: fix out of bounds access warning in ovl_check_fb_len() syzbot reported out of bounds memory access from open_by_handle_at() with a crafted file handle that looks like this: { .handle_bytes = 2, .handle_type = OVL_FILEID_V1 } handle_bytes gets rounded down to 0 and we end up calling: ovl_check_fh_len(fh, 0) => ovl_check_fb_len(fh + 3, -3) But fh buffer is only 2 bytes long, so accessing struct ovl_fb at fh + 3 is illegal. Fixes: `cbe7fba8ed` ("ovl: make sure that real fid is 32bit aligned in memory") Reported-and-tested-by: syzbot+61958888b1c60361a791@syzkaller.appspotmail.com Cc: <stable@vger.kernel.org> # v5.5 Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-06-02 22:20:25 +02:00
Linus Torvalds	b23c4771ff	Merge tag 'docs-5.8' of git://git.lwn.net/linux Pull documentation updates from Jonathan Corbet: "A fair amount of stuff this time around, dominated by yet another massive set from Mauro toward the completion of the RST conversion. I really hope we are getting close to the end of this. Meanwhile, those patches reach pretty far afield to update document references around the tree; there should be no actual code changes there. There will be, alas, more of the usual trivial merge conflicts. Beyond that we have more translations, improvements to the sphinx scripting, a number of additions to the sysctl documentation, and lots of fixes" * tag 'docs-5.8' of git://git.lwn.net/linux: (130 commits) Documentation: fixes to the maintainer-entry-profile template zswap: docs/vm: Fix typo accept_threshold_percent in zswap.rst tracing: Fix events.rst section numbering docs: acpi: fix old http link and improve document format docs: filesystems: add info about efivars content Documentation: LSM: Correct the basic LSM description mailmap: change email for Ricardo Ribalda docs: sysctl/kernel: document unaligned controls Documentation: admin-guide: update bug-hunting.rst docs: sysctl/kernel: document ngroups_max nvdimm: fixes to maintainter-entry-profile Documentation/features: Correct RISC-V kprobes support entry Documentation/features: Refresh the arch support status files Revert "docs: sysctl/kernel: document ngroups_max" docs: move locking-specific documents to locking/ docs: move digsig docs to the security book docs: move the kref doc into the core-api book docs: add IRQ documentation at the core-api book docs: debugging-via-ohci1394.txt: add it to the core-api book docs: fix references for ipmi.rst file ...	2020-06-01 15:45:27 -07:00
Lubos Dolezel	144da23bea	ovl: return required buffer size for file handles Overlayfs doesn't work well with the fanotify mechanism. Fanotify first probes for the required buffer size for the file handle, but overlayfs currently bails out without passing the size back. That results in errors in the kernel log, such as: [527944.485384] overlayfs: failed to encode file handle (/, err=-75, buflen=0, len=29, type=1) [527944.485386] fanotify: failed to encode fid (fsid=ae521e68.a434d95f, type=255, bytes=0, err=-2) Signed-off-by: Lubos Dolezel <lubos@dolezel.info> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-05-13 11:11:24 +02:00
Chengguang Xu	399c109d35	ovl: sync dirty data when remounting to ro mode sync_filesystem() does not sync dirty data for readonly filesystem during umount, so before changing to readonly filesystem we should sync dirty data for data integrity. Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-05-13 11:11:24 +02:00
Chengguang Xu	c21c839b84	ovl: whiteout inode sharing Share inode with different whiteout files for saving inode and speeding up delete operation. If EMLINK is encountered when linking a shared whiteout, create a new one. In case of any other error, disable sharing for this super block. Note: ofs->whiteout is protected by inode lock on workdir. Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-05-13 11:11:24 +02:00
Jeffle Xu	654255fa20	ovl: inherit SB_NOSEC flag from upperdir Since the stacking of regular file operations [1], the overlayfs edition of write_iter() is called when writing regular files. Since then, xattr lookup is needed on every write since file_remove_privs() is called from ovl_write_iter(), which would become the performance bottleneck when writing small chunks of data. In my test case, file_remove_privs() would consume ~15% CPU when running fstime of unixbench (the workload is repeatedly writing 1 KB to the same file) [2]. Inherit the SB_NOSEC flag from upperdir. Since then xattr lookup would be done only once on the first write. Unixbench fstime gets a ~20% performance gain with this patch. [1] https://lore.kernel.org/lkml/20180606150905.GC9426@magnolia/T/ [2] https://www.spinics.net/lists/linux-unionfs/msg07153.html Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-05-13 11:11:24 +02:00
Konstantin Khlebnikov	32b1924b21	ovl: skip overlayfs superblocks at global sync Stacked filesystems like overlayfs have no writeback of their own, but they have to forward syncfs() requests to the backend to keep data integrity. During global sync() each overlayfs instance calls method ->sync_fs() for the backend although the backend itself is in the global list of superblocks too. As a result one sync() syscall could write one superblock several times and send multiple disk barriers. This patch adds flag SB_I_SKIP_SYNC into sb->sb_iflags to avoid that. Reported-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-05-13 11:11:24 +02:00
Amir Goldstein	62a8a85be8	ovl: index dir act as work dir With index=on, let index dir act as the work dir for copy up and cleanups. This will help implementing whiteout inode sharing. We still create the "work" dir on mount regardless of index=on and it is used to test the features supported by upper fs. One reason is that before the feature tests, we do not know if index could be enabled or not. The reason we do not use "index" directory also as workdir with index=off is because the existence of the "index" directory acts as a simple persistent signal that index was enabled on this filesystem and tools may want to use that signal. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-05-13 11:11:24 +02:00
Amir Goldstein	773cb4c56b	ovl: prepare to copy up without workdir With index=on, we copy up lower hardlinks to work dir and move them into index dir. Fix locking to allow work dir and index dir to be the same directory. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-05-13 11:11:24 +02:00
Amir Goldstein	3011645b5b	ovl: cleanup non-empty directories in ovl_indexdir_cleanup() Teach ovl_indexdir_cleanup() to remove temp directories containing whiteouts to prepare for using index dir instead of work dir for removing merge directories. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-05-13 11:11:24 +02:00
Amir Goldstein	b0def88d80	ovl: resolve more conflicting mount options Similar to the way that a conflict between metacopy=on,redirect_dir=off is resolved, also resolve conflicts between nfs_export=on,index=off and nfs_export=on,metacopy=on. An explicit mount option wins over a default config value. Both explicit mount options result in an error. Without this change the xfstests group overlay/exportfs are skipped if metacopy is enabled by default. Reported-by: Chengguang Xu <cgxu519@mykernel.net> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-05-13 11:11:24 +02:00
Dan Carpenter	9aafc1b018	ovl: potential crash in ovl_fid_to_fh() The "buflen" value comes from the user and there is a potential that it could be zero. In do_handle_to_path() we know that "handle->handle_bytes" is non-zero and we do: handle_dwords = handle->handle_bytes >> 2; So values 1-3 become zero. Then in ovl_fh_to_dentry() we do: int len = fh_len << 2; So now len is in the "0,4-128" range and a multiple of 4. But if "buflen" is zero it will try to copy negative bytes when we do the memcpy in ovl_fid_to_fh(). memcpy(&fh->fb, fid, buflen - OVL_FH_WIRE_OFFSET); And that will lead to a crash. Thanks to Amir Goldstein for his help with this patch. Fixes: `cbe7fba8ed` ("ovl: make sure that real fid is 32bit aligned in memory") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Cc: <stable@vger.kernel.org> # v5.5 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-05-13 11:10:57 +02:00
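The length arithmetic described in the commit above can be sketched in plain C. This is only an illustration under stated assumptions, not the kernel code: `WIRE_OFFSET`, `rounded_len()` and `payload_len()` are hypothetical names, and the value 3 is assumed from the "fh + 3" arithmetic quoted in the related ovl_check_fb_len() commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for OVL_FH_WIRE_OFFSET (assumed to be 3 bytes). */
#define WIRE_OFFSET 3

/*
 * Round a byte count down to whole 32-bit words and back to bytes,
 * mirroring "handle_bytes >> 2" followed by "fh_len << 2".
 * Inputs 1..3 collapse to 0.
 */
static int rounded_len(unsigned int handle_bytes)
{
    unsigned int dwords = handle_bytes >> 2;

    return (int)(dwords << 2);
}

/*
 * Return the number of payload bytes that may be copied, or -1 when the
 * buffer is too short to hold anything past the wire header. Rejecting
 * short buffers up front avoids computing a negative memcpy() length.
 */
static long payload_len(int buflen)
{
    if (buflen <= WIRE_OFFSET)
        return -1;
    return (long)buflen - WIRE_OFFSET;
}
```

With this guard a caller would bail out on a -1 result before any copy happens, which is the general shape of fix the message describes.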
Vivek Goyal	15fd2ea9f4	ovl: clear ATTR_OPEN from attr->ia_valid As of now during open(), we don't pass a bunch of flags to underlying filesystem. O_TRUNC is one of these. Normally this is not a problem as VFS calls ->setattr() with zero size and underlying filesystem sets file size to 0. But when overlayfs is running on top of virtiofs, it has an optimization where it does not send setattr request to server if it detects that truncation is part of open(O_TRUNC). It assumes that server already zeroed file size as part of open(O_TRUNC). fuse_do_setattr() { if (attr->ia_valid & ATTR_OPEN) { /* * No need to send request to userspace, since actual * truncation has already been done by OPEN. But still * need to truncate page cache. */ } } IOW, fuse expects O_TRUNC to be passed to it as part of open flags. But currently overlayfs does not pass O_TRUNC to underlying filesystem hence fuse/virtiofs breaks. Setup overlayfs on top of virtiofs and following does not zero the file size of a file that is either upper only or has already been copied up. fd = open(foo.txt, O_TRUNC | O_WRONLY); There are two ways to fix this. Either pass O_TRUNC to underlying filesystem or clear ATTR_OPEN from attr->ia_valid so that fuse ends up sending a SETATTR request to server. Miklos is concerned that O_TRUNC might have side effects so it is better to clear ATTR_OPEN for now. Hence this patch clears ATTR_OPEN from attr->ia_valid. I found this problem while running unionmount-testsuite. With this patch, unionmount-testsuite passes with overlayfs on top of virtiofs. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Fixes: `bccece1ead` ("ovl: allow remote upper") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-04-30 11:52:07 +02:00
Vivek Goyal	e67f021693	ovl: clear ATTR_FILE from attr->ia_valid ovl_setattr() can be passed an attr which has ATTR_FILE set and attr->ia_file is a file pointer to overlay file. This is done in open(O_TRUNC) path. We should either replace attr->ia_file with the underlying file object or clear ATTR_FILE so that underlying filesystem does not end up using overlayfs file object pointer. There are no good use cases yet so for now clear ATTR_FILE. fuse seems to be one user which can use this. But it can work even without this. So it is not mandatory to pass ATTR_FILE to fuse. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Fixes: `bccece1ead` ("ovl: allow remote upper") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-04-30 11:52:07 +02:00
Mauro Carvalho Chehab	72ef5e52b3	docs: fix broken references to text files Several references got broken due to txt to ReST conversion. Several of them can be automatically fixed with: scripts/documentation-file-ref-check --fix Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro.org> # hwtracing/coresight/Kconfig Reviewed-by: Paul E. McKenney <paulmck@kernel.org> # memory-barrier.txt Acked-by: Alex Shi <alex.shi@linux.alibaba.com> # translations/zh_CN Acked-by: Federico Vaga <federico.vaga@vaga.pv.it> # translations/it_IT Acked-by: Marc Zyngier <maz@kernel.org> # kvm/arm64 Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Link: https://lore.kernel.org/r/6f919ddb83a33b5f2a63b6b5f0575737bb2b36aa.1586881715.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet <corbet@lwn.net>	2020-04-20 15:35:59 -06:00
Amir Goldstein	926e94d79b	ovl: enable xino automatically in more cases So far, with xino=auto, we only enable xino if we know that all underlying filesystem use 32bit inode numbers. When users configure overlay with xino=auto, they already declare that they are ready to handle 64bit inode number from overlay. It is a very common case, that underlying filesystem uses 64bit ino, but rarely or never uses the high inode number bits (e.g. tmpfs, xfs). Leaving it for the users to declare high ino bits are unused with xino=on is not a recipe for many users to enjoy the benefits of xino. There appears to be very little reason not to enable xino when users declare xino=auto even if we do not know how many bits underlying filesystem uses for inode numbers. In the worst case of xino bits overflow by real inode number, we already fall back to the non-xino behavior - real inode number with unique pseudo dev or to non persistent inode number and overlay st_dev (for directories). The only annoyance from auto enabling xino is that xino bits overflow emits a warning to kmsg. Suppress those warnings unless users explicitly asked for xino=on, suggesting that they expected high ino bits to be unused by underlying filesystem. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-03-27 16:51:02 +01:00
Amir Goldstein	dfe51d47b7	ovl: avoid possible inode number collisions with xino=on When xino feature is enabled and a real directory inode number overflows the lower xino bits, we cannot map this directory inode number to a unique and persistent inode number and we fall back to the real inode st_ino and overlay st_dev. The real inode st_ino with high bits may collide with a lower inode number on overlay st_dev that was mapped using xino. To avoid possible collision with legitimate xino values, map a non persistent inode number to a dedicated range in the xino address space. The dedicated range is created by adding one more bit to the number of reserved high xino bits. We could have added just one more fsid, but that would have had the undesired effect of changing persistent overlay inode numbers on kernel or require more complex xino mapping code. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-03-27 16:51:02 +01:00
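The overflow condition the xino commits refer to can be illustrated with a small sketch. It is not the kernel's mapping code: `xino_map()` is a hypothetical helper and the bit layout (layer id packed into the top `xinobits` bits of a 64-bit number) is an assumption made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: pack a layer id ("fsid") into the top `xinobits`
 * bits of a 64-bit inode number. Assumes 0 < xinobits < 64.
 * Returns 0 and stores the mapped number in *out, or -1 when the real
 * inode number (or the fsid) does not fit, in which case a caller would
 * have to fall back to some non-xino behavior.
 */
static int xino_map(uint64_t real_ino, uint64_t fsid,
                    unsigned int xinobits, uint64_t *out)
{
    unsigned int shift = 64 - xinobits;

    if (real_ino >> shift)      /* real ino already uses the reserved bits */
        return -1;
    if (fsid >> xinobits)       /* fsid needs more than xinobits bits */
        return -1;

    *out = real_ino | (fsid << shift);
    return 0;
}
```

Reserving one extra high bit, as the commit above describes, would in this sketch amount to calling the helper with `xinobits + 1`, leaving a dedicated range for numbers that could not be mapped persistently.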
Amir Goldstein	4d314f7859	ovl: use a private non-persistent ino pool There is no reason to deplete the system's global get_next_ino() pool for overlay non-persistent inode numbers and there is no reason at all to allocate non-persistent inode numbers for non-directories. For non-directories, it is much better to leave i_ino the same as real i_ino, to be consistent with st_ino/d_ino. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-03-27 16:51:02 +01:00
Miklos Szeredi	83552eacdf	ovl: fix WARN_ON nlink drop to zero Changes to underlying layers should not cause WARN_ON(), but this repro does: mkdir w l u mnt sudo mount -t overlay -o workdir=w,lowerdir=l,upperdir=u overlay mnt touch mnt/h ln u/h u/k rm -rf mnt/k rm -rf mnt/h dmesg ------------[ cut here ]------------ WARNING: CPU: 1 PID: 116244 at fs/inode.c:302 drop_nlink+0x28/0x40 After upper hardlinks were added while overlay is mounted, unlinking all overlay hardlinks drops overlay nlink to zero before all upper inodes are unlinked. After unlink/rename prevent i_nlink from going to zero if there are still hashed aliases (i.e. cached hard links to the victim) remaining. Reported-by: Phasip <phasip@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-03-27 16:51:02 +01:00
Chengguang Xu	a5a84682ec	ovl: fix a typo in comment Fix a typo in comment. (annonate -> annotate) Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-03-17 15:04:23 +01:00
Gustavo A. R. Silva	0efbe7c4f9	ovl: replace zero-length array with flexible-array member The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit `7649773293` ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Fixes: `cbe7fba8ed` ("ovl: make sure that real fid is 32bit aligned in memory") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-03-17 15:04:23 +01:00
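A minimal userspace sketch of the flexible-array-member pattern described above; `struct packet` and `packet_new()` are hypothetical names chosen for illustration and do not come from the overlayfs sources.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* A variable-length record: fixed header followed by `len` payload bytes. */
struct packet {
    size_t len;
    unsigned char data[];   /* C99 flexible array member, must be last */
};

/*
 * Allocate a packet with room for `len` payload bytes and copy them in.
 * sizeof(*p) covers only the fixed header, so the payload size is added
 * explicitly. Returns NULL if the allocation fails.
 */
static struct packet *packet_new(const void *src, size_t len)
{
    struct packet *p = malloc(sizeof(*p) + len);

    if (!p)
        return NULL;
    p->len = len;
    memcpy(p->data, src, len);
    return p;
}
```

Moving `data[]` above `len` in the struct would make the compiler reject the declaration, which is the safety property the commit message points to.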
Al Viro	504f38410a	ovl: ovl_obtain_alias(): don't call d_instantiate_anon() for old The situation is the same as for __d_obtain_alias() (which is what that thing is parallel to) - if we find a preexisting alias, we want to grab it, drop the inode and return the alias we'd found. The only thing d_instantiate_anon() does compared to that is spurious security_d_instantiate() that has already been done to that dentry with exact same arguments. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-03-17 15:04:23 +01:00
Amir Goldstein	d80172c2d8	ovl: strict upper fs requirements for remote upper fs Overlayfs works sub-optimally with upper fs that has no xattr/d_type/ RENAME_WHITEOUT support. We should basically deprecate support for those filesystems, but so far, we only issue a warning and don't fail the mount for the sake of backward compat. Some features are already being disabled with no xattr support. For newly supported remote upper fs, we do not need to worry about backward compatibility, so we can fail the mount if upper fs is a sub-optimal filesystem. This reduces the in-tree remote filesystems supported as upper to just FUSE, for which the remote upper fs support was added. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-03-17 15:04:23 +01:00
Amir Goldstein	cad218ab33	ovl: check if upper fs supports RENAME_WHITEOUT As with other required upper fs features, we only warn if support is missing to avoid breaking existing sub-optimal setups. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-03-17 15:04:22 +01:00
Miklos Szeredi	bccece1ead	ovl: allow remote upper No reason to prevent upper layer being a remote filesystem. Do the revalidation in that case, just as we already do for lower layers. This lets virtiofs be used as upper layer, which appears to be a real use case. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-03-17 15:04:22 +01:00
Miklos Szeredi	f428884456	ovl: decide if revalidate needed on a per-dentry basis Allow completely skipping ->revalidate() on a per-dentry basis, in case the underlying layers used for a dentry do not themselves have ->revalidate(). E.g. negative overlay dentry has no underlying layers, hence revalidate is unnecessary. Or if lower layer is remote but overlay dentry is pure-upper, then can skip revalidate. The following places need to update whether the dentry needs revalidate or not: - fill-super (root dentry) - lookup - create - fh_to_dentry Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-03-17 15:04:22 +01:00
Miklos Szeredi	7925dad839	ovl: separate detection of remote upper layer from stacked overlay Following patch will allow remote as upper layer, but not overlay stacked on upper layer. Separate the two concepts. This patch doesn't change behavior. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-03-17 15:04:22 +01:00
Miklos Szeredi	3bb7df928a	ovl: restructure dentry revalidation Use a common loop for plain and weak revalidation. This will aid doing revalidation on upper layer. This patch doesn't change behavior. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-03-17 15:04:22 +01:00
Miklos Szeredi	c61ca55725	ovl: ignore failure to copy up unknown xattrs This issue came up with NFSv4 as the lower layer, which generates "system.nfs4_acl" xattrs (even for plain old unix permissions). Prior to this patch this prevented copy-up from succeeding. The overlayfs permission model mandates that permissions are checked locally for the task and remotely for the mounter(*). NFS4 ACLs are not supported by the Linux kernel currently, hence they cannot be enforced locally. Which means it is indifferent whether this attribute is copied or not. Generalize this to any xattr that is not used in access checking (i.e. it's not a POSIX ACL and not in the "security." namespace). Incidentally, best effort copying of xattrs seems to also be the behavior of "cp -a", which is what overlayfs tries to mimic. (*) Documentation/filesystems/overlayfs.txt#Permission model Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-03-17 15:04:22 +01:00
Amir Goldstein	62c832ed4e	ovl: simplify i_ino initialization Move i_ino initialization to ovl_inode_init() to avoid the dance of setting i_ino in ovl_fill_inode() sometimes on the first call and sometimes on the second call. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-03-17 15:04:22 +01:00
Amir Goldstein	2effc5c25d	ovl: factor out helper ovl_get_root() Allocates and initializes the root dentry and inode. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-03-17 15:04:22 +01:00
Amir Goldstein	735c907d7b	ovl: fix out of date comment and unreachable code ovl_inode_update() is no longer called from create object code path. Fixes: `01b39dcc95` ("ovl: use inode_insert5() to hash a newly...") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-03-17 15:04:21 +01:00
Amir Goldstein	300b124fcf	ovl: fix value of i_ino for lower hardlink corner case Commit `6dde1e42f4` ("ovl: make i_ino consistent with st_ino in more cases"), relaxed the condition nfs_export=on in order to set the value of i_ino to xino map of real ino. Specifically, it also relaxed the pre-condition that index=on for consistent i_ino. This opened the corner case of lower hardlink in ovl_get_inode(), which calls ovl_fill_inode() with ino=0 and then ovl_init_inode() is called to set i_ino to lower real ino without the xino mapping. Pass the correct values of ino;fsid in this case to ovl_fill_inode(), so it can initialize i_ino correctly. Fixes: `6dde1e42f4` ("ovl: make i_ino consistent with st_ino in more ...") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-03-17 15:04:21 +01:00
Miklos Szeredi	c853680453	ovl: fix lockdep warning for async write Lockdep reports "WARNING: lock held when returning to user space!" due to async write holding freeze lock over the write. Apparently aio.c already deals with this by lying to lockdep about the state of the lock. Do the same here. No need to check for S_IFREG() here since these file ops are regular-only. Reported-by: syzbot+9331a354f4f624a52a55@syzkaller.appspotmail.com Fixes: `2406a307ac` ("ovl: implement async IO routines") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-03-13 15:53:06 +01:00
Amir Goldstein	53afcd310e	ovl: fix some xino configurations Fix up two bugs in the conversion to xino_mode: 1. xino=off does not always end up in disabled mode 2. xino=auto on 32bit arch should end up in disabled mode Take a proactive approach to disabling xino on 32bit kernel: 1. Disable XINO_AUTO config during build time 2. Disable xino with a warning on mount time As a by-product, xino=on on 32bit arch also ends up in disabled mode. We never intended to enable xino on 32bit arch and this will make the rest of the logic simpler. Fixes: `0f831ec85e` ("ovl: simplify ovl_same_sb() helper") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-03-13 15:53:06 +01:00
Amir Goldstein	531d3040bc	ovl: fix lock in ovl_llseek() ovl_inode_lock() is interruptible. When inode_lock() in ovl_llseek() was replaced with ovl_inode_lock(), we did not add a check for error. Fix this by making ovl_inode_lock() uninterruptible and change the existing call sites to use an _interruptible variant. Reported-by: syzbot+66a9752fa927f745385e@syzkaller.appspotmail.com Fixes: `b1f9d3858f` ("ovl: use ovl_inode_lock in ovl_llseek()") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-03-12 16:38:10 +01:00
Miklos Szeredi	a4ac9d45c0	ovl: fix lseek overflow on 32bit ovl_lseek() is using ssize_t to return the value from vfs_llseek(). On a 32-bit kernel ssize_t is a 32-bit signed int, which overflows above 2 GB. Assign the return value of vfs_llseek() to loff_t to fix this. Reported-by: Boris Gjenero <boris.gjenero@gmail.com> Fixes: `9e46b840c7` ("ovl: support stacked SEEK_HOLE/SEEK_DATA") Cc: <stable@vger.kernel.org> # v4.19 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-02-03 11:41:53 +01:00
Murphy Zhou	1a980b8cbf	ovl: add splice file read write helper Now overlayfs falls back to use default file splice read and write, which is not compatible with overlayfs, returning EFAULT. xfstests generic/591 can reproduce part of this. Tested this patch with xfstests auto group tests. Signed-off-by: Murphy Zhou <jencce.kernel@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-01-24 16:28:15 +01:00
Jiufei Xue	2406a307ac	ovl: implement async IO routines A performance regression was observed since linux v4.19 with aio test using fio with iodepth 128 on overlayfs. The queue depth of the device was always 1 which is unexpected. After investigation, it was found that commit `16914e6fc7` ("ovl: add ovl_read_iter()") and commit `2a92e07edc` ("ovl: add ovl_write_iter()") resulted in vfs_iter_{read,write} being called on underlying filesystem, which always results in synchronous IO. Implement async IO for stacked reading and writing. This resolves the performance regression. This is implemented by allocating a new kiocb for submitting the AIO request on the underlying filesystem. When the request is completed, the new kiocb is freed and the completion callback is called on the original iocb. Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-01-24 09:46:46 +01:00
Miklos Szeredi	1346416564	ovl: layer is const The ovl_layer struct is never modified except at initialization. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-01-24 09:46:45 +01:00
Amir Goldstein	b7bf9908e1	ovl: fix corner case of non-constant st_dev;st_ino On non-samefs overlay without xino, non pure upper inodes should use a pseudo_dev assigned to each unique lower fs, but if lower layer is on the same fs as upper layer, it has no pseudo_dev assigned. In this overlay layers setup: - two filesystems, A and B - upper layer is on A - lower layer 1 is also on A - lower layer 2 is on B Non pure upper overlay inode, whose origin is in layer 1 will have the st_dev;st_ino values of the real lower inode before copy up and the st_dev;st_ino values of the real upper inode after copy up. Fix this inconsistency by assigning a unique pseudo_dev also for upper fs, that will be used as st_dev value along with the lower inode st_dev for overlay inodes in the case above. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-01-24 09:46:45 +01:00
Amir Goldstein	1b81dddd35	ovl: fix corner case of conflicting lower layer uuid This fixes ovl_lower_uuid_ok() to correctly detect the corner case: - two filesystems, A and B, both have null uuid - upper layer is on A - lower layer 1 is also on A - lower layer 2 is on B In this case, bad_uuid would not have been set for B, because the check only involved the list of lower fs. Hence we'll try to decode a layer 2 origin on layer 1 and fail. We check for conflicting (and null) uuid among all lower layers, including those layers that are on the same fs as the upper layer. Reported-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-01-24 09:46:45 +01:00
Amir Goldstein	07f1e59637	ovl: generalize the lower_fs[] array Rename lower_fs[] array to fs[], extend its size by one and use index fsid (instead of fsid-1) to access the fs[] array. Initialize fs[0] with upper fs values. fsid 0 is reserved even with lower only overlay, so fs[0] remains null in this case. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-01-24 09:46:45 +01:00
Amir Goldstein	0f831ec85e	ovl: simplify ovl_same_sb() helper No code uses the sb returned from this helper, so make it return a boolean and rename it to ovl_same_fs(). The xino mode is irrelevant when all layers are on same fs, so instead of describing samefs with mode OVL_XINO_OFF, use a new xino_mode state, which is 0 in the case of samefs, -1 in the case of xino=off and > 0 with xino enabled. Create a new helper ovl_same_dev(), to use instead of the common check for (ovl_same_fs() \|\| xinobits). Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-01-24 09:46:45 +01:00
Amir Goldstein	94375f9d51	ovl: generalize the lower_layers[] array Rename lower_layers[] array to layers[], extend its size by one and initialize layers[0] with upper layer values. Lower layers are now addressed with index 1..numlower. layers[0] is reserved even with lower only overlay. [SzM: replace ofs->numlower with ofs->numlayer, the latter's value is incremented by one] Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-01-22 20:11:41 +01:00
Chengguang Xu	b504c6540d	ovl: improving copy-up efficiency for big sparse file Current copy-up is not efficient for big sparse files: it is not only slow but also wastes disk space when the target lower file has a huge hole inside. This patch tries to recognize file holes and skip them during copy-up. Detailed logic of hole detection is as below: when we detect that the next data position is larger than the current position we skip that hole, otherwise we copy data in chunks of OVL_COPY_UP_CHUNK_SIZE. It may not recognize all kinds of holes and sometimes skips only part of a hole area. However, it will be enough for most use cases. Additionally, this optimization relies on the lseek(2) SEEK_DATA implementation, so filesystems which do not support this feature will behave as before on copy-up. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-01-22 20:11:41 +01:00
Amir Goldstein	b1f9d3858f	ovl: use ovl_inode_lock in ovl_llseek() In ovl_llseek() we use the overlay inode rwsem to protect against concurrent modifications to real file f_pos, because we copy the overlay file f_pos to/from the real file f_pos. This caused a lockdep warning of locking order violation when the ovl_llseek() operation was called on a lower nested overlay layer while the upper layer fs sb_writers is held (with patch improving copy-up efficiency for big sparse file). Use the internal ovl_inode_lock() instead of the overlay inode rwsem in those cases. It is meant to be used for protecting against concurrent changes to overlay inode internal state changes. The locking order rules are documented to explain this case. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-01-22 20:11:41 +01:00
lijiazi	1bd0a3aea4	ovl: use pr_fmt auto generate prefix Use pr_fmt auto generate "overlayfs: " prefix. Signed-off-by: lijiazi <lijiazi@xiaomi.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-01-22 20:11:41 +01:00
Amir Goldstein	4c37e71b71	ovl: fix wrong WARN_ON() in ovl_cache_update_ino() The WARN_ON() that child entry is always on overlay st_dev became wrong when we allowed this function to update d_ino in non-samefs setup with xino enabled. It is not true in case of xino bits overflow on a non-dir inode. Leave the WARN_ON() only for directories, where assertion is still true. Fixes: `adbf4f7ea8` ("ovl: consistent d_ino for non-samefs with xino") Cc: <stable@vger.kernel.org> # v4.17+ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-01-22 20:11:41 +01:00
Linus Torvalds	81c64b0bd0	overlayfs fixes for 5.5-rc2 Merge tag 'ovl-fixes-5.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs fixes from Miklos Szeredi: "Fix some bugs and documentation" * tag 'ovl-fixes-5.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: docs: filesystems: overlayfs: Fix restview warnings docs: filesystems: overlayfs: Rename overlayfs.txt to .rst ovl: relax WARN_ON() on rename to self ovl: fix corner case of non-unique st_dev;st_ino ovl: don't use a temp buf for encoding real fh ovl: make sure that real fid is 32bit aligned in memory ovl: fix lookup failure on multi lower squashfs	2019-12-14 11:13:54 -08:00
Amir Goldstein	6889ee5a53	ovl: relax WARN_ON() on rename to self In ovl_rename(), if new upper is hardlinked to old upper underneath overlayfs before upper dirs are locked, user will get an ESTALE error and a WARN_ON will be printed. Changes to underlying layers while overlayfs is mounted may result in unexpected behavior, but it shouldn't crash the kernel and it shouldn't trigger WARN_ON() either, so relax this WARN_ON(). Reported-by: syzbot+bb1836a212e69f8e201a@syzkaller.appspotmail.com Fixes: `804032fabb` ("ovl: don't check rename to self") Cc: <stable@vger.kernel.org> # v4.9+ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2019-12-10 16:00:55 +01:00
Amir Goldstein	9c6d8f13e9	ovl: fix corner case of non-unique st_dev;st_ino On non-samefs overlay without xino, non pure upper inodes should use a pseudo_dev assigned to each unique lower fs and pure upper inodes use the real upper st_dev. It is fine for an overlay pure upper inode to use the same st_dev;st_ino values as the real upper inode, because the content of those two different filesystem objects is always the same. In this case, however: - two filesystems, A and B - upper layer is on A - lower layer 1 is also on A - lower layer 2 is on B Non pure upper overlay inode, whose origin is in layer 1 will have the same st_dev;st_ino values as the real lower inode. This may produce false positive results from 'diff' between the real lower and copied up overlay inode. Fix this by using the upper st_dev;st_ino values in this case. This breaks the property of constant st_dev;st_ino across copy up of this case. This breakage will be fixed by a later patch. Fixes: `5148626b80` ("ovl: allocate anon bdev per unique lower fs") Cc: stable@vger.kernel.org # v4.17+ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2019-12-10 16:00:55 +01:00
Amir Goldstein	ec7bbb53d3	ovl: don't use a temp buf for encoding real fh We can allocate maximum fh size and encode into it directly. Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2019-12-10 16:00:55 +01:00
Amir Goldstein	cbe7fba8ed	ovl: make sure that real fid is 32bit aligned in memory Separate on-disk encoding from in-memory and on-wire representation of overlay file handle. In-memory and on-wire we only ever pass around pointers to struct ovl_fh, which encapsulates at offset 3 the on-disk format struct ovl_fb. struct ovl_fb encapsulates at offset 21 the real file handle. That makes sure that the real file handle is always 32bit aligned in-memory when passed down to the underlying filesystem. On-disk format remains the same and store/load are done into correctly aligned buffer. New nfs exported file handles are exported with aligned real fid. Old nfs file handles are copied to an aligned buffer before being decoded. Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2019-12-10 16:00:55 +01:00
Amir Goldstein	7e63c87fc2	ovl: fix lookup failure on multi lower squashfs In the past, overlayfs required that lower fs have non null uuid in order to support nfs export and decode copy up origin file handles. Commit `9df085f3c9` ("ovl: relax requirement for non null uuid of lower fs") relaxed this requirement for nfs export support, as long as uuid (even if null) is unique among all lower fs. However, said commit unintentionally also relaxed the non null uuid requirement for decoding copy up origin file handles, regardless of the unique uuid requirement. Amend this mistake by disabling decoding of copy up origin file handle from lower fs with a conflicting uuid. We still encode copy up origin file handles from those fs, because file handles like those already exist in the wild and because they might provide useful information in the future. There is an unhandled corner case described by Miklos this way: - two filesystems, A and B, both have null uuid - upper layer is on A - lower layer 1 is also on A - lower layer 2 is on B In this case bad_uuid won't be set for B, because the check only involves the list of lower fs. Hence we'll try to decode a layer 2 origin on layer 1 and fail. We will deal with this corner case later. Reported-by: Colin Ian King <colin.king@canonical.com> Tested-by: Colin Ian King <colin.king@canonical.com> Link: https://lore.kernel.org/lkml/20191106234301.283006-1-colin.king@canonical.com/ Fixes: `9df085f3c9` ("ovl: relax requirement for non null uuid ...") Cc: stable@vger.kernel.org # v4.20+ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2019-12-10 16:00:55 +01:00
Al Viro	6c2d4798a8	new helper: lookup_positive_unlocked() Most of the callers of lookup_one_len_unlocked() treat negatives as ERR_PTR(-ENOENT). Provide a helper that would do just that. Note that a pinned positive dentry remains positive - its ->d_inode is stable, etc.; a pinned _negative_ dentry can become positive at any point as long as you are not holding its parent at least shared. So using lookup_one_len_unlocked() needs to be careful; lookup_positive_unlocked() is safer and that's what the callers end up open-coding anyway. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-11-15 13:49:04 -05:00
Mark Salyzyn	5c2e9f346b	ovl: filter of trusted xattr results in audit When filtering xattr list for reading, presence of trusted xattr results in a security audit log. However, if there is other content no errno will be set, and if there isn't, the errno will be -ENODATA and not -EPERM as is usually associated with a lack of capability. The check does not block the request to list the xattrs present. Switch to ns_capable_noaudit to reflect a more appropriate check. Signed-off-by: Mark Salyzyn <salyzyn@android.com> Cc: linux-security-module@vger.kernel.org Cc: kernel-team@android.com Cc: stable@vger.kernel.org # v3.18+ Fixes: `a082c6f680` ("ovl: filter trusted xattr for non-admin") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2019-09-11 16:11:45 +02:00
Ding Xiang	97f024b917	ovl: Fix dereferencing possible ERR_PTR() if ovl_encode_real_fh() fails, no memory was allocated and the error in the error-valued pointer should be returned. Fixes: `9b6faee074` ("ovl: check ERR_PTR() return value from ovl_encode_fh()") Signed-off-by: Ding Xiang <dingxiang@cmss.chinamobile.com> Cc: <stable@vger.kernel.org> # v4.16+ Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2019-09-11 16:11:45 +02:00
Amir Goldstein	0be0bfd2de	ovl: fix regression caused by overlapping layers detection Once upon a time, commit `2cac0c00a6` ("ovl: get exclusive ownership on upper/work dirs") in v4.13 added some sanity checks on overlayfs layers. This change caused a docker regression. The root cause was mount leaks by docker, which as far as I know, still exist. To mitigate the regression, commit `85fdee1eef` ("ovl: fix regression caused by exclusive upper/work dir protection") in v4.14 turned the mount errors into warnings for the default index=off configuration. Recently, commit `146d62e5a5` ("ovl: detect overlapping layers") in v5.2, re-introduced exclusive upper/work dir checks regardless of index=off configuration. This changes the status quo and mount leak related bug reports have started to re-surface. Restore the status quo to fix the regressions. To clarify, index=off does NOT relax overlapping layers check for this overlayfs mount. index=off only relaxes exclusive upper/work dir checks with another overlayfs mount. To cover the part of overlapping layers detection that used the exclusive upper/work dir checks to detect overlap with self upper/work dir, add a trap also on the work base dir. Link: https://github.com/moby/moby/issues/34672 Link: https://lore.kernel.org/linux-fsdevel/20171006121405.GA32700@veci.piliscsaba.szeredi.hu/ Link: https://github.com/containers/libpod/issues/3540 Fixes: `146d62e5a5` ("ovl: detect overlapping layers") Cc: <stable@vger.kernel.org> # v4.19+ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Tested-by: Colin Walters <walters@verbum.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2019-07-16 13:23:40 +02:00
Linus Torvalds	c884d8ac7f	SPDX update for 5.2-rc6 Another round of SPDX updates for 5.2-rc6 Here is what I am guessing is going to be the last "big" SPDX update for 5.2. It contains all of the remaining GPLv2 and GPLv2+ updates that were "easy" to determine by pattern matching. The ones after this are going to be a bit more difficult and the people on the spdx list will be discussing them on a case-by-case basis now. Another 5000+ files are fixed up, so our overall totals are: Files checked: 64545 Files with SPDX: 45529 Compared to the 5.1 kernel which was: Files checked: 63848 Files with SPDX: 22576 This is a huge improvement. Also, we deleted another 20000 lines of boilerplate license crud, always nice to see in a diffstat. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Merge tag 'spdx-5.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx Pull still more SPDX updates from Greg KH: "Another round of SPDX updates for 5.2-rc6 Here is what I am guessing is going to be the last "big" SPDX update for 5.2. It contains all of the remaining GPLv2 and GPLv2+ updates that were "easy" to determine by pattern matching. The ones after this are going to be a bit more difficult and the people on the spdx list will be discussing them on a case-by-case basis now. Another 5000+ files are fixed up, so our overall totals are: Files checked: 64545 Files with SPDX: 45529 Compared to the 5.1 kernel which was: Files checked: 63848 Files with SPDX: 22576 This is a huge improvement. 
Also, we deleted another 20000 lines of boilerplate license crud, always nice to see in a diffstat" * tag 'spdx-5.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx: (65 commits) treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 507 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 506 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 505 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 504 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 503 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 502 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 501 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 499 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 498 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 497 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 496 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 495 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 491 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 490 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 489 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 488 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 487 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 486 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 485 ...	2019-06-21 09:58:42 -07:00
Thomas Gleixner	d2912cb15b	treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500 Based on 2 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license version 2 as published by the free software foundation this program is free software you can redistribute it and or modify it under the terms of the gnu general public license version 2 as published by the free software foundation # extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 4122 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Enrico Weigelt <info@metux.net> Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-06-19 17:09:55 +02:00
Amir Goldstein	6dde1e42f4	ovl: make i_ino consistent with st_ino in more cases Relax the condition that overlayfs supports nfs export, to require that i_ino is consistent with st_ino/d_ino. It is enough to require that st_ino and d_ino are consistent. This fixes the failure of xfstest generic/504, due to mismatch of st_ino to inode number in the output of /proc/locks. Fixes: `12574a9f4c` ("ovl: consistent i_ino for non-samefs with xino") Cc: <stable@vger.kernel.org> # v4.19 Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2019-06-19 09:04:19 +02:00
Nicolas Schier	253e748339	ovl: fix typo in MODULE_PARM_DESC Change first argument to MODULE_PARM_DESC() calls, that each of them matched the actual module parameter name. The matching results in changing (the 'parm' section from) the output of `modinfo overlay` from: parm: ovl_check_copy_up:Obsolete; does nothing parm: redirect_max:ushort parm: ovl_redirect_max:Maximum length of absolute redirect xattr value parm: redirect_dir:bool parm: ovl_redirect_dir_def:Default to on or off for the redirect_dir feature parm: redirect_always_follow:bool parm: ovl_redirect_always_follow:Follow redirects even if redirect_dir feature is turned off parm: index:bool parm: ovl_index_def:Default to on or off for the inodes index feature parm: nfs_export:bool parm: ovl_nfs_export_def:Default to on or off for the NFS export feature parm: xino_auto:bool parm: ovl_xino_auto_def:Auto enable xino feature parm: metacopy:bool parm: ovl_metacopy_def:Default to on or off for the metadata only copy up feature into: parm: check_copy_up:Obsolete; does nothing parm: redirect_max:Maximum length of absolute redirect xattr value (ushort) parm: redirect_dir:Default to on or off for the redirect_dir feature (bool) parm: redirect_always_follow:Follow redirects even if redirect_dir feature is turned off (bool) parm: index:Default to on or off for the inodes index feature (bool) parm: nfs_export:Default to on or off for the NFS export feature (bool) parm: xino_auto:Auto enable xino feature (bool) parm: metacopy:Default to on or off for the metadata only copy up feature (bool) Signed-off-by: Nicolas Schier <n.schier@avm.de> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2019-06-18 15:06:16 +02:00
Arnd Bergmann	1dac6f5b0e	ovl: fix bogus -Wmaybe-unitialized warning gcc gets a bit confused by the logic in ovl_setup_trap() and can't figure out whether the local 'trap' variable in the caller was initialized or not: fs/overlayfs/super.c: In function 'ovl_fill_super': fs/overlayfs/super.c:1333:4: error: 'trap' may be used uninitialized in this function [-Werror=maybe-uninitialized] iput(trap); ^~~~~~~~~~ fs/overlayfs/super.c:1312:17: note: 'trap' was declared here Reword slightly to make it easier for the compiler to understand. Fixes: `146d62e5a5` ("ovl: detect overlapping layers") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2019-06-18 15:06:16 +02:00
Miklos Szeredi	9179c21dc6	ovl: don't fail with disconnected lower NFS NFS mounts can be disconnected from fs root. Don't fail the overlapping layer check because of this. The check is not authoritative anyway, since topology can change during or after the check. Reported-by: Antti Antinoja <antti@fennosys.fi> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Fixes: `146d62e5a5` ("ovl: detect overlapping layers")	2019-06-18 15:06:16 +02:00
Amir Goldstein	941d935ac7	ovl: fix wrong flags check in FS_IOC_FS[SG]ETXATTR ioctls The ioctl argument was parsed as the wrong type. Fixes: `b21d9c435f` ("ovl: support the FS_IOC_FS[SG]ETXATTR ioctls") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2019-06-11 17:17:41 +02:00
Amir Goldstein	146d62e5a5	ovl: detect overlapping layers Overlapping overlay layers are not supported and can cause unexpected behavior, but overlayfs does not currently check or warn about these configurations. User is not supposed to specify the same directory for upper and lower dirs or for different lower layers and user is not supposed to specify directories that are descendants of each other for overlay layers, but that is exactly what this syzbot repro did: https://syzkaller.appspot.com/x/repro.syz?x=12c7a94f400000 Moving layer root directories into other layers while overlayfs is mounted could also result in unexpected behavior. This commit places "traps" in the overlay inode hash table. Those traps are dummy overlay inodes that are hashed by the layers root inodes. On mount, the hash table trap entries are used to verify that overlay layers are not overlapping. While at it, we also verify that overlay layers are not overlapping with directories "in-use" by other overlay instances as upperdir/workdir. On lookup, the trap entries are used to verify that overlay layers root inodes have not been moved into other layers after mount. Some examples: $ ./run --ov --samefs -s ... ( mkdir -p base/upper/0/u base/upper/0/w base/lower lower upper mnt mount -o bind base/lower lower mount -o bind base/upper upper mount -t overlay none mnt ... -o lowerdir=lower,upperdir=upper/0/u,workdir=upper/0/w) $ umount mnt $ mount -t overlay none mnt ... -o lowerdir=base,upperdir=upper/0/u,workdir=upper/0/w [ 94.434900] overlayfs: overlapping upperdir path mount: mount overlay on mnt failed: Too many levels of symbolic links $ mount -t overlay none mnt ... -o lowerdir=upper/0/u,upperdir=upper/0/u,workdir=upper/0/w [ 151.350132] overlayfs: conflicting lowerdir path mount: none is already mounted or mnt busy $ mount -t overlay none mnt ... 
-o lowerdir=lower:lower/a,upperdir=upper/0/u,workdir=upper/0/w [ 201.205045] overlayfs: overlapping lowerdir path mount: mount overlay on mnt failed: Too many levels of symbolic links $ mount -t overlay none mnt ... -o lowerdir=lower,upperdir=upper/0/u,workdir=upper/0/w $ mv base/upper/0/ base/lower/ $ find mnt/0 mnt/0 mnt/0/w find: 'mnt/0/w/work': Too many levels of symbolic links find: 'mnt/0/u': Too many levels of symbolic links Reported-by: syzbot+9c69c282adc4edd2b540@syzkaller.appspotmail.com Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2019-05-29 13:03:37 +02:00
Amir Goldstein	b21d9c435f	ovl: support the FS_IOC_FS[SG]ETXATTR ioctls They are the extended version of FS_IOC_FS[SG]ETFLAGS ioctls. xfs_io -c "chattr <flags>" uses the new ioctls for setting flags. This used to work in kernel pre v4.19, before stacked file ops introduced the ovl_ioctl whitelist. Reported-by: Dave Chinner <david@fromorbit.com> Fixes: `d1d04ef857` ("ovl: stack file ops") Cc: <stable@vger.kernel.org> # v4.19 Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2019-05-27 10:03:10 +02:00
Thomas Gleixner	ec8f24b7fa	treewide: Add SPDX license identifier - Makefile/Kconfig Add SPDX license identifiers to all Make/Kconfig files which: - Have no license information of any form These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-05-21 10:50:46 +02:00
Linus Torvalds	7e9890a350	overlayfs update for 5.2 -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXNpu6gAKCRDh3BK/laaZ PNYnAQCLMJBZp9AVKU+5onOGKLmgUfnbKZhWJYICW6DVKobo6AEA4aXBIk5TIDiu +a3Ny0nAutdpHcRkbi8jJty91BeJgQg= =iLSD -----END PGP SIGNATURE----- Merge tag 'ovl-update-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs update from Miklos Szeredi: "Just bug fixes in this small update" * tag 'ovl-update-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: relax WARN_ON() for overlapping layers use case ovl: check the capability before cred overridden ovl: do not generate duplicate fsnotify events for "fake" path ovl: support stacked SEEK_HOLE/SEEK_DATA ovl: fix missing upper fs freeze protection on copy up for ioctl	2019-05-14 09:02:14 -07:00
Amir Goldstein	acf3062a7e	ovl: relax WARN_ON() for overlapping layers use case This nasty little syzbot repro: https://syzkaller.appspot.com/x/repro.syz?x=12c7a94f400000 Creates overlay mounts where the same directory is both in upper and lower layers. Simplified example: mkdir foo work mount -t overlay none foo -o"lowerdir=.,upperdir=foo,workdir=work" The repro runs several threads in parallel that attempt to chdir into foo and attempt to symlink/rename/exec/mkdir the file bar. The repro hits a WARN_ON() I placed in ovl_instantiate(), which suggests that an overlay inode already exists in cache and is hashed by the pointer of the real upper dentry that ovl_create_real() has just created. At the point of the WARN_ON(), both the overlay dir inode lock and the upper dir inode lock are held, so at first, I did not see how this was possible. On a closer look, I see that after ovl_create_real(), because of the overlapping upper and lower layers, a lookup by another thread can find the file foo/bar that was just created in upper layer, at overlay path foo/foo/bar and hash an overlay inode with the new real dentry as lower dentry. This is possible because the overlay directory foo/foo is not locked and the upper dentry foo/bar is in dcache, so ovl_lookup() can find it without taking upper dir inode shared lock. Overlapping layers is considered a wrong setup which would result in unexpected behavior, but it shouldn't crash the kernel and it shouldn't trigger WARN_ON() either, so relax this WARN_ON() and leave a pr_warn() instead to cover all cases of failure to get an overlay inode. The error returned from failure to insert new inode to cache with inode_insert5() was changed to -EEXIST, to distinguish from the error -ENOMEM returned on failure to get/allocate inode with iget5_locked(). 
Reported-by: syzbot+9c69c282adc4edd2b540@syzkaller.appspotmail.com Fixes: `01b39dcc95` ("ovl: use inode_insert5() to hash a newly...") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2019-05-08 13:25:53 +02:00
Linus Torvalds	d27fb65bc2	Merge branch 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc dcache updates from Al Viro: "Most of this pile is putting name length into struct name_snapshot and making use of it. The beginning of this series ("ovl_lookup_real_one(): don't bother with strlen()") ought to have been split in two (separate switch of name_snapshot to struct qstr from overlayfs reaping the trivial benefits of that), but I wanted to avoid a rebase - by the time I'd spotted that it was (a) in -next and (b) close to 5.1-final ;-/" * 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: audit_compare_dname_path(): switch to const struct qstr * audit_update_watch(): switch to const struct qstr * inotify_handle_event(): don't bother with strlen() fsnotify: switch send_to_group() and ->handle_event to const struct qstr * fsnotify(): switch to passing const struct qstr * for file_name switch fsnotify_move() to passing const struct qstr * for old_name ovl_lookup_real_one(): don't bother with strlen() sysv: bury the broken "quietly truncate the long filenames" logics nsfs: unobfuscate unexport d_alloc_pseudo()	2019-05-07 20:03:32 -07:00
Jiufei Xue	98487de318	ovl: check the capability before cred overridden We found that setting the IMMUTABLE_FL flag on a file in docker returns success even though the docker container doesn't have the capability CAP_LINUX_IMMUTABLE. The commit `d1d04ef857` ("ovl: stack file ops") and `dab5ca8fd9` ("ovl: add lsattr/chattr support") implemented chattr operations on a regular overlay file. ovl_real_ioctl() overrides the current process's subjective credentials with ofs->creator_cred, which have the capability CAP_LINUX_IMMUTABLE, so that it will return success in vfs_ioctl()->cap_capable(). Fix this by checking the capability before the creds are overridden. And here we only care about APPEND_FL and IMMUTABLE_FL, so get this information from the inode. [SzM: move check and call to underlying fs inside inode locked region to prevent two such calls from racing with each other] Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2019-05-06 14:00:37 +02:00
Amir Goldstein	d989903058	ovl: do not generate duplicate fsnotify events for "fake" path Overlayfs "fake" path is used for stacked file operations on underlying files. Operations on files with "fake" path must not generate fsnotify events with path data, because those events have already been generated at overlayfs layer and because the reported event->fd for fanotify marks on underlying inode/filesystem will have the wrong path (the overlayfs path). Link: https://lore.kernel.org/linux-fsdevel/20190423065024.12695-1-jencce.kernel@gmail.com/ Reported-by: Murphy Zhou <jencce.kernel@gmail.com> Fixes: `d1d04ef857` ("ovl: stack file ops") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2019-05-06 13:54:51 +02:00
Amir Goldstein	9e46b840c7	ovl: support stacked SEEK_HOLE/SEEK_DATA Overlay file f_pos is the master copy that is preserved through copy up and modified on read/write, but only real fs knows how to SEEK_HOLE/SEEK_DATA and real fs may impose limitations that are more strict than ->s_maxbytes for specific files, so we use the real file to perform seeks. We do not call real fs for SEEK_CUR:0 query and for SEEK_SET:0 requests. Fixes: `d1d04ef857` ("ovl: stack file ops") Reported-by: Eddie Horng <eddiehorng.tw@gmail.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2019-05-06 13:54:51 +02:00
Amir Goldstein	3428030da0	ovl: fix missing upper fs freeze protection on copy up for ioctl Generalize the helper ovl_open_maybe_copy_up() and use it to copy up file with data before FS_IOC_SETFLAGS ioctl. The FS_IOC_SETFLAGS ioctl is a bit of an odd ball in vfs, which probably caused the confusion. File may be open O_RDONLY, but ioctl modifies the file. VFS does not call mnt_want_write_file() nor lock inode mutex, but fs-specific code for FS_IOC_SETFLAGS does. So ovl_ioctl() calls mnt_want_write_file() for the overlay file, and fs-specific code calls mnt_want_write_file() for upper fs file, but there was no call for ovl_want_write() for copy up duration which prevents overlayfs from copying up on a frozen upper fs. Fixes: `dab5ca8fd9` ("ovl: add lsattr/chattr support") Cc: <stable@vger.kernel.org> # v4.19 Signed-off-by: Amir Goldstein <amir73il@gmail.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2019-05-06 13:54:50 +02:00
Al Viro	0b269ded4e	overlayfs: make use of ->free_inode() synchronous parts are left in ->destroy_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:27 -04:00
Al Viro	230c6402b1	ovl_lookup_real_one(): don't bother with strlen() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-04-26 13:13:33 -04:00
Vivek Goyal	993a0b2aec	ovl: Do not lose security.capability xattr over metadata file copy-up If a file has been copied up metadata only, and later data is copied up, upper loses any security.capability xattr it has (underlying filesystem clears it as upon file write). From a user's point of view, this is just a file copy-up and that should not result in losing security.capability xattr. Hence, before data copy up, save security.capability xattr (if any) and restore it on upper after data copy up is complete. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Fixes: `0c28887493` ("ovl: A new xattr OVL_XATTR_METACOPY for file on upper") Cc: <stable@vger.kernel.org> # v4.19+ Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2019-02-13 11:14:46 +01:00
Vivek Goyal	5f32879ea3	ovl: During copy up, first copy up data and then xattrs If a file with capability set (and hence security.capability xattr) is written, the kernel clears the security.capability xattr. For overlay, during file copy up, xattrs are copied up first and then data is copied up. This means data copy up will result in clearing of the security.capability xattr the file on lower has. And this can result in surprises. If a lower file has CAP_SETUID, then it should not be cleared over copy up (if nothing was actually written to file). This also creates problems with chown logic where it first copies up file and then tries to clear setuid bit. But by that time security.capability xattr is already gone (due to data copy up), and caller gets -ENODATA. This has been reported by Giuseppe here. https://github.com/containers/libpod/issues/2015#issuecomment-447824842 Fix this by copying up data first and then metadata. This is a regression which has been introduced by my commit as part of metadata only copy up patches. TODO: There will be some corner cases where a file is copied up metadata only and later data copy up happens and that will clear security.capability xattr. Something needs to be done about that too. Fixes: `bd64e57586` ("ovl: During copy up, first copy up metadata and then data") Cc: <stable@vger.kernel.org> # v4.19+ Reported-by: Giuseppe Scrivano <gscrivan@redhat.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2019-02-04 09:09:57 +01:00
Miklos Szeredi	ec7ba118b9	Revert "ovl: relax permission checking on underlying layers" This reverts commit `007ea44892`. The commit broke some selinux-testsuite cases, and it looks like there's no straightforward fix keeping the direction of this patch, so revert for now. The original patch was trying to fix the consistency of permission checks, and not an observed bug. So reverting should be safe. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-12-04 11:31:30 +01:00
Amir Goldstein	155b8a0492	ovl: fix decode of dir file handle with multi lower layers When decoding a lower file handle, we first call ovl_check_origin_fh() with connected=false to get any real lower dentry for overlay inode cache lookup. If the real dentry is a disconnected dir dentry, ovl_check_origin_fh() is called again with connected=true to get a connected real dentry and find the lower layer the real dentry belongs to. If the first call returned a connected real dentry, we use it to lookup an overlay connected dentry, but the first ovl_check_origin_fh() call with connected=false did not check that the found dentry is under the root of the layer (see ovl_acceptable()), it only checked that the found dentry super block matches the uuid of the lower file handle. In case there are multiple lower layers on the same fs and the found dentry is not from the top most lower layer, using the layer index returned from the first ovl_check_origin_fh() is wrong and we end up failing to decode the file handle. Fix this by always calling ovl_check_origin_fh() with connected=true if we got a directory dentry in the first call. Fixes: `8b58924ad5` ("ovl: lookup in inode cache first when decoding...") Cc: <stable@vger.kernel.org> # v4.17 Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-11-21 10:44:48 +01:00
Amir Goldstein	91ff20f34e	ovl: fix missing override creds in link of a metacopy upper Theodore Ts'o reported a v4.19 regression with docker-dropbox: https://marc.info/?l=linux-fsdevel&m=154070089431116&w=2 "I was rebuilding my dropbox Docker container, and it failed in 4.19 with the following error: ... dpkg: error: error creating new backup file \ '/var/lib/dpkg/status-old': Invalid cross-device link" The problem did not reproduce with metacopy feature disabled. The error was caused by insufficient credentials to set "trusted.overlay.redirect" xattr on link of a metacopy file. Reproducer: echo Y > /sys/module/overlay/parameters/redirect_dir echo Y > /sys/module/overlay/parameters/metacopy cd /tmp mkdir l u w m chmod 777 l u touch l/foo ln l/foo l/link chmod 666 l/foo mount -t overlay none -olowerdir=l,upperdir=u,workdir=w m su fsgqa ln m/foo m/bar [ 21.455823] overlayfs: failed to set redirect (-1) ln: failed to create hard link 'm/bar' => 'm/foo':\ Invalid cross-device link Reported-by: Theodore Y. Ts'o <tytso@mit.edu> Reported-by: Maciej Zięba <maciekz82@gmail.com> Fixes: `4120fe64dc` ("ovl: Set redirect on upper inode when it is linked") Cc: <stable@vger.kernel.org> # v4.19 Signed-off-by: Amir Goldstein <amir73il@gmail.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-11-19 16:21:29 +01:00
Linus Torvalds	c2aa1a444c	vfs: rework data cloning infrastructure Rework the vfs_clone_file_range and vfs_dedupe_file_range infrastructure to use a common .remap_file_range method and supply generic bounds and sanity checking functions that are shared with the data write path. The current VFS infrastructure has problems with rlimit, LFS file sizes, file time stamps, maximum filesystem file sizes, stripping setuid bits, etc and so they are addressed in these commits. We also introduce the ability for the ->remap_file_range methods to return short clones so that clones for vfs_copy_file_range() don't get rejected if the entire range can't be cloned. It also allows filesystems to sliently skip deduplication of partial EOF blocks if they are not capable of doing so without requiring errors to be thrown to userspace. All existing filesystems are converted to user the new .remap_file_range method, and both XFS and ocfs2 are modified to make use of the new generic checking infrastructure. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJb29gEAAoJEK3oKUf0dfodpOAQAL2VbHjvKXEwNMDTKscSRMmZ Z0xXo3gamFKQ+VGOqy2g2lmAYQs9SAnTuCGTJ7zIAp7u+q8gzUy5FzKAwLS4Id6L 8siaY6nzlicfO04d0MdXnWz0f3xykChgzfdQfVUlUi7WrDioBUECLPmx4a+USsp1 DQGjLOZfoOAmn2rijdnH9RTEaHqg+8mcTaLN9TRav4gGqrWxldFKXw2y6ouFC7uo /hxTRNXR9VI+EdbDelwBNXl9nU9gQA0WLOvRKwgUrtv6bSJohTPsmXt7EbBtNcVR cl3zDNc1sLD1bLaRLEUAszI/33wXaaQgom1iB51obIcHHef+JxRNG/j6rUMfzxZI VaauGv5EIvtaKN0LTAqVVLQ8t2MQFYfOr8TykmO+1UFog204aKRANdVMHDSjxD/0 dTGKJGcq+HnKQ+JHDbTdvuXEL8sUUl1FiLjOQbZPw63XmuddLKFUA2TOjXn6htbU 1h1MG5d9KjGLpabp2BQheczD08NuSmcrOBNt7IoeI3+nxr3HpMwprfB9TyaERy9X iEgyVXmjjc9bLLRW7A2wm77aW64NvPs51wKMnvuNgNwnCewrGS6cB8WVj2zbQjH1 h3f3nku44s9ctNPSBzb/sJLnpqmZQ5t0oSmrMSN+5+En6rNTacoJCzxHRJBA7z/h Z+C6y1GTZw0euY6Zjiwu =CE/A -----END PGP SIGNATURE----- Merge tag 'xfs-4.20-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull vfs dedup fixes from Dave Chinner: "This reworks the vfs data cloning infrastructure. 
We discovered many issues with these interfaces late in the 4.19 cycle - the worst of them (data corruption, setuid stripping) were fixed for XFS in 4.19-rc8, but a larger rework of the infrastructure fixing all the problems was needed. That rework is the contents of this pull request. Rework the vfs_clone_file_range and vfs_dedupe_file_range infrastructure to use a common .remap_file_range method and supply generic bounds and sanity checking functions that are shared with the data write path. The current VFS infrastructure has problems with rlimit, LFS file sizes, file time stamps, maximum filesystem file sizes, stripping setuid bits, etc and so they are addressed in these commits. We also introduce the ability for the ->remap_file_range methods to return short clones so that clones for vfs_copy_file_range() don't get rejected if the entire range can't be cloned. It also allows filesystems to silently skip deduplication of partial EOF blocks if they are not capable of doing so without requiring errors to be thrown to userspace. 
Existing filesystems are converted to use the new remap_file_range method, and both XFS and ocfs2 are modified to make use of the new generic checking infrastructure" * tag 'xfs-4.20-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (28 commits) xfs: remove [cm]time update from reflink calls xfs: remove xfs_reflink_remap_range xfs: remove redundant remap partial EOF block checks xfs: support returning partial reflink results xfs: clean up xfs_reflink_remap_blocks call site xfs: fix pagecache truncation prior to reflink ocfs2: remove ocfs2_reflink_remap_range ocfs2: support partial clone range and dedupe range ocfs2: fix pagecache truncation prior to reflink ocfs2: truncate page cache for clone destination file before remapping vfs: clean up generic_remap_file_range_prep return value vfs: hide file range comparison function vfs: enable remap callers that can handle short operations vfs: plumb remap flags through the vfs dedupe functions vfs: plumb remap flags through the vfs clone functions vfs: make remap_file_range functions take and return bytes completed vfs: remap helper should update destination inode metadata vfs: pass remap flags to generic_remap_checks vfs: pass remap flags to generic_remap_file_range_prep vfs: combine the clone and dedupe into a single remap_file_range ...	2018-11-02 09:33:08 -07:00
Miklos Szeredi	d47748e5ae	ovl: automatically enable redirect_dir on metacopy=on Current behavior is to automatically disable metacopy if redirect_dir is not enabled and proceed with the mount. If "metacopy=on" mount option was given, then this behavior can confuse the user: no mount failure, yet metacopy is disabled. This patch makes metacopy=on imply redirect_dir=on. The converse is also true: turning off full redirect with redirect_dir= {off\|follow\|nofollow} will disable metacopy. If both metacopy=on and redirect_dir={off\|follow\|nofollow} is specified, then mount will fail, since there's no way to correctly resolve the conflict. Reported-by: Daniel Walsh <dwalsh@redhat.com> Fixes: `d5791044d2` ("ovl: Provide a mount option metacopy=on/off...") Cc: <stable@vger.kernel.org> # v4.19 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-11-01 21:31:39 +01:00
Miklos Szeredi	5e12758086	ovl: check whiteout in ovl_create_over_whiteout() Kaixuxia reports that it's possible to crash overlayfs by removing the whiteout on the upper layer before creating a directory over it. This is a reproducer: mkdir lower upper work merge touch lower/file mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir=work merge rm merge/file ls -al merge/file rm upper/file ls -al merge/ mkdir merge/file Before commencing with a vfs_rename(..., RENAME_EXCHANGE) verify that the lookup of "upper" is positive and is a whiteout, and return ESTALE otherwise. Reported by: kaixuxia <xiakaixu1987@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Fixes: `e9be9d5e76` ("overlay filesystem") Cc: <stable@vger.kernel.org> # v3.18	2018-10-31 12:15:23 +01:00
Darrick J. Wong	df36583619	vfs: plumb remap flags through the vfs dedupe functions Plumb a remap_flags argument through the vfs_dedupe_file_range_one functions so that dedupe can take advantage of it. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2018-10-30 10:42:03 +11:00
Darrick J. Wong	452ce65951	vfs: plumb remap flags through the vfs clone functions Plumb a remap_flags argument through the {do,vfs}_clone_file_range functions so that clone can take advantage of it. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2018-10-30 10:41:56 +11:00
Darrick J. Wong	42ec3d4c02	vfs: make remap_file_range functions take and return bytes completed Change the remap_file_range functions to take a number of bytes to operate upon and return the number of bytes they operated on. This is a requirement for allowing fs implementations to return short clone/dedupe results to the user, which will enable us to obey resource limits in a graceful manner. A subsequent patch will enable copy_file_range to signal to the ->clone_file_range implementation that it can handle a short length, which will be returned in the function's return value. For now the short return is not implemented anywhere so the behavior won't change -- either copy_file_range manages to clone the entire range or it tries an alternative. Neither clone ioctl can take advantage of this, alas. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2018-10-30 10:41:49 +11:00
Darrick J. Wong	2e5dfc99f2	vfs: combine the clone and dedupe into a single remap_file_range Combine the clone_file_range and dedupe_file_range operations into a single remap_file_range file operation dispatch since they're fundamentally the same operation. The differences between the two can be made in the prep functions. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>	2018-10-30 10:41:21 +11:00
Chengguang Xu	14fa085640	ovl: using posix_acl_xattr_size() to get size instead of posix_acl_to_xattr() There is no functional change but it seems better to get size by calling posix_acl_xattr_size() instead of calling posix_acl_to_xattr() with NULL buffer argument. Additionally, remove unnecessary assignments. Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-10-26 23:34:40 +02:00
Amir Goldstein	1e92e3072c	ovl: abstract ovl_inode lock with a helper The abstraction improves code readability (to some). Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-10-26 23:34:40 +02:00
Amir Goldstein	0e32992f7f	ovl: remove the 'locked' argument of ovl_nlink_{start,end} It just makes the interface strange without adding any significant value. The only case where locked is false and return value is 0 is in ovl_rename() when new is negative, so handle that case explicitly in ovl_rename(). Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-10-26 23:34:40 +02:00
Amir Goldstein	9df085f3c9	ovl: relax requirement for non null uuid of lower fs We use uuid to associate an overlay lower file handle with a lower layer, so we can accept lower fs with null uuid as long as all lower layers with null uuid are on the same fs. This change allows enabling index and nfs_export features for the setup of single lower fs of type squashfs - squashfs supports file handles, but has a null uuid. This change also allows enabling index and nfs_export features for nested overlayfs, where the lower overlay has nfs_export enabled. Enabling the index feature with single lower squashfs fixes the unionmount-testsuite test: ./run --ov --squashfs --verify As a by-product, if, like the lower squashfs, upper fs also uses the generic export_encode_fh() implementation to export 32bit inode file handles (e.g. ext4), then the xino_auto config/module/mount option will enable unique overlay inode numbers. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-10-26 23:34:40 +02:00
Miklos Szeredi	6b52243f63	ovl: fold copy-up helpers into callers Now that the workdir and tmpfile copy up modes have been untangled, the functions become simple enough that the helpers can be folded into the callers. Add new helpers where there is any duplication remaining: preparing creds for creating the object. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-10-26 23:34:39 +02:00
Amir Goldstein	b10cdcdc20	ovl: untangle copy up call chain In an attempt to dedup ~100 LOC, we ended up creating a tangled call chain, whose branches merge and diverge in several points according to the immutable c->tmpfile copy up mode. This call chain was hard to analyse for locking correctness because the locking requirements for the c->tmpfile flow were very different from the locking requirements for the !c->tmpfile flow (i.e. directory vs. regular file copy up). Split the copy up helpers of the c->tmpfile flow from those of the !c->tmpfile (i.e. workdir) flow and remove the c->tmpfile mode from copy up context. Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-10-26 23:34:39 +02:00
Miklos Szeredi	007ea44892	ovl: relax permission checking on underlying layers Make permission checking more consistent: - special files don't need any access check on underlying fs - exec permission check doesn't need to be performed on underlying fs Reported-by: "J. Bruce Fields" <bfields@fieldses.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-10-26 23:34:39 +02:00
Amir Goldstein	6cd078702f	ovl: fix recursive oi->lock in ovl_link() linking a non-copied-up file into a non-copied-up parent results in a nested call to mutex_lock_interruptible(&oi->lock). Fix this by copying up target parent before ovl_nlink_start(), same as done in ovl_rename(). ~/unionmount-testsuite$ ./run --ov -s ~/unionmount-testsuite$ ln /mnt/a/foo100 /mnt/a/dir100/ WARNING: possible recursive locking detected -------------------------------------------- ln/1545 is trying to acquire lock: 00000000bcce7c4c (&ovl_i_lock_key[depth]){+.+.}, at: ovl_copy_up_start+0x28/0x7d but task is already holding lock: 0000000026d73d5b (&ovl_i_lock_key[depth]){+.+.}, at: ovl_nlink_start+0x3c/0xc1 [SzM: this seems to be a false positive, but doing the copy-up first is harmless and removes the lockdep splat] Reported-by: syzbot+3ef5c0d1a5cb0b21e6be@syzkaller.appspotmail.com Fixes: `5f8415d6b8` ("ovl: persistent overlay inode nlink for...") Cc: <stable@vger.kernel.org> # v4.13 Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-10-26 23:34:39 +02:00
Miklos Szeredi	1f244dc521	ovl: clean up error handling in ovl_get_tmpfile() If security_inode_copy_up() fails, it should not set new_creds, so no need for the cleanup (which would've Oops-ed anyway, due to old_creds being NULL). Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-10-26 23:34:39 +02:00
Amir Goldstein	babf4770be	ovl: fix error handling in ovl_verify_set_fh() We hit a BUG on kfree of an ERR_PTR()... Reported-by: syzbot+ff03fe05c717b82502d0@syzkaller.appspotmail.com Fixes: `8b88a2e640` ("ovl: verify upper root dir matches lower root dir") Cc: <stable@vger.kernel.org> # v4.13 Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-10-26 23:34:39 +02:00
Miklos Szeredi	1a8f8d2a44	ovl: fix format of setxattr debug Format has a typo: it was meant to be "%.s", not "%s". But at some point callers grew nonprintable values as well, so use "%*pE" instead with a maximized length. Reported-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Fixes: `3a1e819b4e` ("ovl: store file handle of lower inode on copy up") Cc: <stable@vger.kernel.org> # v4.12	2018-10-04 14:49:10 +02:00
Amir Goldstein	601350ff58	ovl: fix access beyond unterminated strings KASAN detected slab-out-of-bounds access in printk from overlayfs, because string format used %s instead of %.s. > BUG: KASAN: slab-out-of-bounds in string+0x298/0x2d0 lib/vsprintf.c:604 > Read of size 1 at addr ffff8801c36c66ba by task syz-executor2/27811 > > CPU: 0 PID: 27811 Comm: syz-executor2 Not tainted 4.19.0-rc5+ #36 ... > printk+0xa7/0xcf kernel/printk/printk.c:1996 > ovl_lookup_index.cold.15+0xe8/0x1f8 fs/overlayfs/namei.c:689 Reported-by: syzbot+376cea2b0ef340db3dd4@syzkaller.appspotmail.com Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Fixes: `359f392ca5` ("ovl: lookup index entry for copy up origin") Cc: <stable@vger.kernel.org> # v4.13	2018-10-04 14:49:10 +02:00
Wei Yongjun	69383c5913	ovl: make symbol 'ovl_aops' static Fixes the following sparse warning: fs/overlayfs/inode.c:507:39: warning: symbol 'ovl_aops' was not declared. Should it be static? Fixes: `5b910bd615` ("ovl: fix GPF in swapfile_activate of file from overlayfs over xfs") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-09-25 20:41:23 +02:00
Amir Goldstein	a725356b66	vfs: swap names of {do,vfs}_clone_file_range() Commit `031a072a0b` ("vfs: call vfs_clone_file_range() under freeze protection") created a wrapper do_clone_file_range() around vfs_clone_file_range() moving the freeze protection to the former, so overlayfs could call the latter. The more common vfs practice is to call do_xxx helpers from vfs_xxx helpers, where freeze protection is taken in the vfs_xxx helper, so this anomaly could be a source of confusion. It seems that commit `8ede205541` ("ovl: add reflink/copyfile/dedup support") may have fallen victim to this confusion - ovl_clone_file_range() calls the vfs_clone_file_range() helper in the hope of getting freeze protection on upper fs, but in fact results in overlayfs allowing to bypass upper fs freeze protection. Swap the names of the two helpers to conform to common vfs practice and call the correct helpers from overlayfs and nfsd. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-09-24 10:54:01 +02:00
Amir Goldstein	d9d150ae50	ovl: fix freeze protection bypass in ovl_clone_file_range() Tested by doing clone on overlayfs while upper xfs+reflink is frozen: xfs_io -f /ovl/y fsfreeze -f /xfs xfs_io> reflink /ovl/x Before the fix xfs_io enters xfs_reflink_remap_range() and blocks in xfs_trans_alloc(). After the fix, xfs_io blocks outside xfs code in ovl_clone_file_range(). Fixes: `8ede205541` ("ovl: add reflink/copyfile/dedup support") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-09-24 10:54:01 +02:00
Amir Goldstein	898cc19d8a	ovl: fix freeze protection bypass in ovl_write_iter() Tested by re-writing to an open overlayfs file while upper ext4 is frozen: xfs_io -f /ovl/x xfs_io> pwrite 0 4096 fsfreeze -f /ext4 xfs_io> pwrite 0 4096 WARNING: CPU: 0 PID: 1492 at fs/ext4/ext4_jbd2.c:53 \ ext4_journal_check_start+0x48/0x82 After the fix, the second write blocks in ovl_write_iter() and avoids hitting WARN_ON(sb->s_writers.frozen == SB_FREEZE_COMPLETE) in ext4_journal_check_start(). Fixes: `2a92e07edc` ("ovl: add ovl_write_iter()") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-09-24 10:54:01 +02:00
Amir Goldstein	63e1325280	ovl: fix memory leak on unlink of indexed file The memory leak was detected by kmemleak when running xfstests overlay/051,053 Fixes: `caf70cb2ba` ("ovl: cleanup orphan index entries") Cc: <stable@vger.kernel.org> # v4.13 Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-09-24 10:54:01 +02:00
Miklos Szeredi	8c25741aaa	ovl: fix oopses in ovl_fill_super() failure paths ovl_free_fs() dereferences ofs->workbasedir and ofs->upper_mnt in cases when those might not have been initialized yet. Fix the initialization order for these fields. Reported-by: syzbot+c75f181dc8429d2eb887@syzkaller.appspotmail.com Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Cc: <stable@vger.kernel.org> # v4.15 Fixes: `95e6d4177c` ("ovl: grab reference to workbasedir early") Fixes: `a9075cdb46` ("ovl: factor out ovl_free_fs() helper")	2018-09-10 12:55:49 +02:00
Amir Goldstein	b833a36603	ovl: add ovl_fadvise() Implement stacked fadvise to fix syscalls readahead(2) and fadvise64(2) on an overlayfs file. Suggested-by: Miklos Szeredi <mszeredi@redhat.com> Fixes: `d1d04ef857` ("ovl: stack file ops") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-09-03 09:43:10 +02:00
Amir Goldstein	5b910bd615	ovl: fix GPF in swapfile_activate of file from overlayfs over xfs Since overlayfs implements stacked file operations, the underlying filesystems are not supposed to be exposed to the overlayfs file, whose f_inode is an overlayfs inode. Assigning an overlayfs file to swap_file results in an attempt of xfs code to dereference an xfs_inode struct from an ovl_inode pointer: CPU: 0 PID: 2462 Comm: swapon Not tainted 4.18.0-xfstests-12721-g33e17876ea4e #3402 RIP: 0010:xfs_find_bdev_for_inode+0x23/0x2f Call Trace: xfs_iomap_swapfile_activate+0x1f/0x43 __se_sys_swapon+0xb1a/0xee9 Fix this by not assigning the real inode mapping to f_mapping, which will cause swapon() to return an error (-EINVAL). Although it makes sense not to allow setting a swapfile on an overlayfs file, some users may depend on it, so we may need to fix this up in the future. Keeping f_mapping pointing to overlay inode mapping will cause O_DIRECT open to fail. Fix this by installing ovl_aops with noop_direct_IO in overlay inode mapping. Keeping f_mapping pointing to overlay inode mapping will cause other a_ops related operations to fail (e.g. readahead()). Those will be fixed by follow up patches. Suggested-by: Miklos Szeredi <mszeredi@redhat.com> Fixes: `f7c72396d0` ("ovl: add O_DIRECT support") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-08-30 17:08:35 +02:00
Amir Goldstein	80d3481081	ovl: respect FIEMAP_FLAG_SYNC flag Stacked overlayfs fiemap operation broke xfstests that test delayed allocation (with "_test_generic_punch -d"), because ovl_fiemap() failed to write dirty pages when requested. Fixes: `9e142c4102` ("ovl: add ovl_fiemap()") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-08-30 17:08:35 +02:00
Miklos Szeredi	6faf05c2b2	ovl: set I_CREATING on inode being created ...otherwise there will be list corruption due to inode_sb_list_add() being called for inode already on the sb list. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Fixes: `e950564b97` ("vfs: don't evict uninitialized inode") Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-08-22 13:15:25 -07:00
Vivek Goyal	989974c804	ovl: Enable metadata only feature All the bits are in patches before this. So it is time to enable the metadata only copy up feature. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-20 09:56:17 +02:00
Vivek Goyal	935a074f48	ovl: Do not do metacopy only for ioctl modifying file attr ovl_copy_up() by default will only do metadata only copy up (if enabled). That means when ovl_real_ioctl() calls ovl_real_file(), it will still get the lower file (as ovl_real_file() opens data file and not metacopy). And that means "chattr +i" will end up modifying lower inode. There seem to be two ways to solve this. A. Open metacopy file in ovl_real_ioctl() and do operations on that B. Force full copy up when FS_IOC_SETFLAGS is called. I am resorting to option B for now as it feels little safer option. If there are performance issues due to this, we can revisit it. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-20 09:56:17 +02:00
Vivek Goyal	997336f2c3	ovl: Do not do metadata only copy-up for truncate operation truncate should copy up full file (and not do metacopy only), otherwise it will be broken. For example, use truncate to increase size of a file so that any read beyond existing size will return null bytes. If we don't copy up full file, then we end up opening lower file and a read from it only reads up to the old size (and not new size after truncate). Hence to avoid such situations, copy up data as well when file size changes. So far it was being done by d_real(O_WRONLY) call in truncate() path. Now that patch has been reverted. So force full copy up in ovl_setattr() if size of file is changing. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-20 09:56:17 +02:00
Vivek Goyal	d1e6f6a94d	ovl: add helper to force data copy-up Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-20 09:56:16 +02:00
Vivek Goyal	0a2d0d3f2f	ovl: Check redirect on index as well Right now we seem to check redirect only if upperdentry is found. But it is possible that there is no upperdentry but later we found an index. We need to check redirect on index as well and set it in ovl_inode->redirect. Otherwise link code can assume that dentry does not have redirect and place a new one which breaks things. In my testing overlay/033 test started failing in xfstests. Following are the details. For example do following. $ mkdir lower upper work merged - Make lower dir with 4 links. $ echo "foo" > lower/l0.txt $ ln lower/l0.txt lower/l1.txt $ ln lower/l0.txt lower/l2.txt $ ln lower/l0.txt lower/l3.txt - Mount with index on and metacopy on. $ mount -t overlay -o lowerdir=lower,upperdir=upper,workdir=work,\ index=on,metacopy=on none merged - Link lower $ ln merged/l0.txt merged/l4.txt (This will do a metadata copy up of l0.txt and put an absolute redirect /l0.txt) $ echo 2 > /proc/sys/vm/drop_caches $ ls merged/l1.txt (Now l1.txt will be looked up. There is no upper dentry but there is lower dentry and index will be found. We don't check for redirect on index, hence ovl_inode->redirect will be NULL.) - Link Upper $ ln merged/l4.txt merged/l5.txt (Lookup of l4.txt will use inode from l1.txt lookup which is still in cache. It has ovl_inode->redirect NULL, hence link will put a new redirect and replace /l0.txt with /l4.txt) - Drop caches. echo 2 > /proc/sys/vm/drop_caches - List l1.txt and it returns -ESTALE $ ls merged/l0.txt (It returns stale because, we found a metacopy of l0.txt in upper and it has redirect l4.txt but there is no file named l4.txt in lower layer. So lower data copy is not found and -ESTALE is returned.) So problem here is that we did not process redirect on index. Check redirect on index as well and then problem is fixed. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-20 09:56:16 +02:00
Vivek Goyal	4120fe64dc	ovl: Set redirect on upper inode when it is linked When we create a hardlink to a metacopy upper file, first set the redirect on that inode. Path based lookup will not work with newly created link and redirect will solve that issue. Also use absolute redirect as two hardlinks could be in different directories and relative redirect will not work. I have not put any additional locking around setting redirects while introducing redirects for non-dir files. For now it feels like existing locking is sufficient. If that's not the case, we will have to add more locking. Following is my rationale about why I think current locking seems ok. Basic problem for non-dir files is that more than one dentry could be pointing to same inode and in theory only relying on dentry based locks (d->d_lock) did not seem sufficient. We set redirect upon rename and upon link creation. In both the paths for non-dir file, VFS locks both source and target inodes (->i_rwsem). That means vfs rename and link operations on same source and target can't be happening in parallel (Even if there are multiple dentries pointing to same inode). So that probably means that at a time on an inode, only one call of ovl_set_redirect() could be working and we don't need additional locking in ovl_set_redirect(). ovl_inode->redirect is initialized only when inode is created new. That means it should not race with any other path and setting ovl_inode->redirect should be fine. Reading of ovl_inode->redirect happens in ovl_get_redirect() path. And this is called only in ovl_set_redirect(). And ovl_set_redirect() already seemed to be protected using ->i_rwsem. That means ovl_set_redirect() and ovl_get_redirect() on source/target inode should not make progress in parallel and are mutually exclusive. Hence no additional locking required. Now, only case where ovl_set_redirect() and ovl_get_redirect() could race seems to be case of absolute redirects where ovl_get_redirect() has to travel up the tree. In that case we already take d->d_lock and that should be sufficient as directories will not have multiple dentries pointing to same inode. So given VFS locking and current usage of redirect, current locking around redirect seems to be ok for non-dir as well. Once we have the logic to remove redirect when metacopy file gets copied up, then we probably will need additional locking. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-20 09:56:15 +02:00
Vivek Goyal	7bb083837d	ovl: Set redirect on metacopy files upon rename Set redirect on metacopy files upon rename. This will help find data dentry in lower dirs. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-20 09:56:15 +02:00
Vivek Goyal	60124877b9	ovl: Do not set dentry type ORIGIN for broken hardlinks If a dentry has copy up origin, we set flag OVL_PATH_ORIGIN. So far this decision was easy that we had to check only for oe->numlower and if it is non-zero, we knew there is copy up origin. (For non-dir we installed origin dentry in lowerstack[0]). But we don't create ORIGIN xattr for broken hardlinks (index=off). And with metacopy feature it is possible that we will install lowerstack[0] but ORIGIN xattr is not there. It is data dentry of upper metacopy dentry which has been found using regular name based lookup or using REDIRECT. So with addition of this new case, just presence of oe->numlower is not sufficient to guarantee that ORIGIN xattr is present. So to differentiate between two cases, look at OVL_CONST_INO flag. If this flag is set and upperdentry is there, that means it can be marked as type ORIGIN. OVL_CONST_INO is not set if lower hardlink is broken or will be broken over copy up. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-20 09:56:14 +02:00
Vivek Goyal	a00c2d59e9	ovl: Add an inode flag OVL_CONST_INO Add an ovl_inode flag OVL_CONST_INO. This flag signifies if inode number will remain constant over copy up or not. This flag does not get updated over copy up and remains unmodified after setting once. Next patch in the series will make use of this flag. It will basically figure out if dentry is of type ORIGIN or not. And this can be derived by this flag. ORIGIN = (upperdentry && ovl_test_flag(OVL_CONST_INO, inode)). Suggested-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-20 09:56:14 +02:00
Vivek Goyal	0b17c28af1	ovl: Treat metacopy dentries as type OVL_PATH_MERGE Right now OVL_PATH_MERGE is used only for merged directories. But conceptually, a metacopy dentry (backed by a lower data dentry) is a merged entity as well. So mark metacopy dentries as OVL_PATH_MERGE and ovl_rename() makes use of this property later to set redirect on a metacopy file. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-20 09:56:13 +02:00
Vivek Goyal	b8a8824ca0	ovl: Check redirects for metacopy files Right now we rely on path based lookup for data origin of metacopy upper. This will work only if upper has not been renamed. We solved this problem already for merged directories using redirect. Use same logic for metacopy files. This patch just goes on to check redirects for metacopy files. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-20 09:56:13 +02:00
Vivek Goyal	0618a816ed	ovl: Move some dir related ovl_lookup_single() code in else block Move some directory related code in else block. This is pure code reorganization and no functionality change. Next patch enables redirect processing on metacopy files and needs this change. By keeping non-functional changes in a separate patch, next patch looks much smaller and cleaner. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-20 09:56:12 +02:00
Vivek Goyal	2c3d73589a	ovl: Do not expose metacopy only dentry from d_real() Metacopy dentry/inode is internal to overlay and is never exposed outside of it. Exception is metacopy upper file used for fsync(). Modify d_real() to look for dentries/inode which have data, but also allow matching upper inode without data for the fsync case. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-20 09:56:12 +02:00
Vivek Goyal	8c444d2a97	ovl: Open file with data except for the case of fsync ovl_open() should open file which contains data and not open metacopy inode. With the introduction of metacopy inodes, with current implementation we will end up opening metacopy inode as well. But there can be certain circumstances like ovl_fsync() where we want to allow opening a metacopy inode instead. Hence, change ovl_open_realfile() and add extra parameter which specifies whether to allow opening metacopy inode or not. If this parameter is false, we look for data inode and open that. This should allow covering both the cases. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-20 09:56:12 +02:00
Vivek Goyal	4823d49c26	ovl: Add helper ovl_inode_realdata() Add an helper to retrieve real data inode associated with overlay inode. This helper will ignore all metacopy inodes and will return only the real inode which has data. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-20 09:56:11 +02:00
Vivek Goyal	2664bd0897	ovl: Store lower data inode in ovl_inode Right now ovl_inode stores inode pointer for lower inode. This helps with quickly getting lower inode given overlay inode (ovl_inode_lower()). Now with metadata only copy-up, we can have metacopy inode in middle layer as well and inode containing data can be different from ->lower. I need to be able to open the real file in ovl_open_realfile() and for that I need to quickly find the lower data inode. Hence store lower data inode also in ovl_inode. Also provide an helper ovl_inode_lowerdata() to access this field. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-20 09:56:11 +02:00
Vivek Goyal	67d756c27a	ovl: Fix ovl_getattr() to get number of blocks from lower If an inode has been copied up metadata only, then we need to query the number of blocks from lower and fill up the stat->st_blocks. We need to be careful about races where we are doing stat on one cpu and data copy up is taking place on other cpu. We want to return stat->st_blocks either from lower or stable upper and not something in between. Hence, ovl_has_upperdata() is called first to figure out whether block reporting will take place from lower or upper. We now support metacopy dentries in middle layer. That means number of blocks reporting needs to come from lowest data dentry and this could be different from lower dentry. Hence we end up making a separate vfs_getxattr() call for metacopy dentries to get number of blocks. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-20 09:56:10 +02:00
Vivek Goyal	647d253fcd	ovl: Add helper ovl_dentry_lowerdata() to get lower data dentry Now we have the notion of data dentry and metacopy dentry. ovl_dentry_lower() will return uppermost lower dentry, but it could be either data or metacopy dentry. Now we support metacopy dentries in lower layers so it is possible that lowerstack[0] is metacopy dentry while lowerstack[1] is actual data dentry. So add an helper which returns lowest most dentry which is supposed to be data dentry. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-20 09:56:10 +02:00
Vivek Goyal	4f93b426ab	ovl: Copy up meta inode data from lowest data inode So far lower could not be a meta inode. So whenever it was time to copy up data of a meta inode, we could copy it up from top most lower dentry. But now lower itself can be a metacopy inode. That means data copy up needs to take place from a data inode in metacopy inode chain. Find lower data inode in the chain and use that for data copy up. Introduced a helper called ovl_path_lowerdata() to find the lower data inode chain. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-20 09:56:09 +02:00
Vivek Goyal	9d3dfea3d3	ovl: Modify ovl_lookup() and friends to lookup metacopy dentry This patch modifies ovl_lookup() and friends to lookup metacopy dentries. It also allows for presence of metacopy dentries in lower layer. During lookup, check for presence of OVL_XATTR_METACOPY and if not present, set OVL_UPPERDATA bit in flags. We don't support metacopy feature with nfs_export. So in nfs_export code, we set the OVL_UPPERDATA flag unconditionally if upper inode exists. Do not follow metacopy origin if we find a metacopy only inode and metacopy feature is not enabled for that mount. Like redirect, this can have security implications where an attacker could hand craft upper and try to gain access to file on lower which it should not have to begin with. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-20 09:56:09 +02:00
Vivek Goyal	027065b726	ovl: Use out_err instead of out_nomem Right now we use goto out_nomem which assumes error code is -ENOMEM. But there are other errors returned like -ESTALE as well. So instead of out_nomem, use out_err which will do ERR_PTR(err). That way one can put error code in err and jump to out_err. This is just code reorganization and no change of functionality. I am about to add more code and this organization helps laying more code and error paths on top of it. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-20 09:56:08 +02:00
Vivek Goyal	0c28887493	ovl: A new xattr OVL_XATTR_METACOPY for file on upper Now we will have the capability to have upper inodes which might be only metadata copy up and data is still on lower inode. So add a new xattr OVL_XATTR_METACOPY to distinguish between two cases. Presence of OVL_XATTR_METACOPY reflects that file has been copied up metadata only and data will be copied up later from lower origin. So this xattr is set when a metadata copy takes place and cleared when data copy takes place. We also use a bit in ovl_inode->flags to cache OVL_UPPERDATA which reflects whether ovl inode has data or not (as opposed to metadata only copy up). If a file is copied up metadata only and later when same file is opened for WRITE, then data copy up takes place. We copy up data, remove METACOPY xattr and then set the UPPERDATA flag in ovl_inode->flags. While all these operations happen with oi->lock held, read side of oi->flags can be lockless. That is another thread on another cpu can check if UPPERDATA flag is set or not. So this gives us an ordering requirement w.r.t UPPERDATA flag. That is, if another cpu sees UPPERDATA flag set, then it should be guaranteed that effects of data copy up and remove xattr operations are also visible. For example. CPU1: ovl_open() -> ovl_open_maybe_copy_up() -> ovl_open_need_copy_up() -> ovl_already_copied_up() -> ovl_dentry_needs_data_copy_up() -> ovl_test_flag(OVL_UPPERDATA). CPU2: acquire(oi->lock), ovl_copy_up_data(), vfs_removexattr(), ovl_set_flag(OVL_UPPERDATA), release(oi->lock). Say CPU2 is copying up data and in the end sets UPPERDATA flag. But if CPU1 perceives the effects of setting UPPERDATA flag but not the effects of preceding operations (ex. upper that is not fully copied up), it will be a problem. Hence this patch introduces smp_wmb() on setting UPPERDATA flag operation and smp_rmb() on UPPERDATA flag test operation. May be some other lock or barrier is already covering it. But I am not sure what that is and is it obvious enough that we will not break it in future. So hence trying to be safe here and introducing barriers explicitly for UPPERDATA flag/bit. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-20 09:56:08 +02:00
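The ordering requirement described in commit 0c28887493 can be illustrated in userspace. The following is a minimal sketch, not the kernel code: it swaps the kernel's smp_wmb()/smp_rmb() pair for a C11 release store and acquire load, which give the same "flag visible implies earlier writes visible" guarantee. The struct and every demo_* name are hypothetical stand-ins invented for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for an overlay inode; NOT the kernel's struct ovl_inode. */
struct demo_inode {
    int upper_data;        /* stands in for the copied-up file contents */
    bool metacopy_xattr;   /* stands in for the METACOPY xattr being present */
    atomic_bool upperdata; /* stands in for the OVL_UPPERDATA flag bit */
};

static void demo_inode_init(struct demo_inode *oi)
{
    oi->upper_data = 0;
    oi->metacopy_xattr = true; /* starts as a metadata-only copy up */
    atomic_init(&oi->upperdata, false);
}

/*
 * Writer side (CPU2 in the commit message): copy the data, drop the xattr,
 * then publish the flag. The release store plays the role of smp_wmb()
 * followed by setting the flag: the two plain writes above it cannot be
 * observed after the flag by a reader that uses an acquire load.
 */
static void demo_copy_up_data(struct demo_inode *oi, int data)
{
    oi->upper_data = data;
    oi->metacopy_xattr = false;
    atomic_store_explicit(&oi->upperdata, true, memory_order_release);
}

/*
 * Lockless reader side (CPU1): test the flag, then rely on the data.
 * The acquire load plays the role of testing the flag followed by smp_rmb().
 * Returns false if the data has not been copied up yet.
 */
static bool demo_read_if_copied_up(struct demo_inode *oi, int *data)
{
    if (!atomic_load_explicit(&oi->upperdata, memory_order_acquire))
        return false;
    /* Having seen the flag, the writer's earlier stores are visible too. */
    assert(!oi->metacopy_xattr);
    *data = oi->upper_data;
    return true;
}
```

Without the release/acquire pairing, a reader on a weakly ordered CPU could in principle observe the flag as set while still seeing the old data or the stale xattr state, which is the situation the commit message wants to rule out.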
Vivek Goyal	2002df8536	ovl: Add helper ovl_already_copied_up() There are couple of places where we need to know if file is already copied up (in lockless manner). Right now its open coded and there are only two conditions to check. Soon this patch series will introduce another condition to check and Amir wants to introduce one more. So introduce a helper instead to check this so that code is easier to read. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-20 09:56:08 +02:00
Vivek Goyal	44d5bf109a	ovl: Copy up only metadata during copy up where it makes sense If it makes sense to copy up only metadata during copy up, do it. This is done for regular files which are not opened for WRITE. Right now ->metacopy is set to 0 always. Last patch in the series will remove the hard coded statement and enable metacopy feature. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-20 09:56:07 +02:00
Vivek Goyal	bd64e57586	ovl: During copy up, first copy up metadata and then data Just a little re-ordering of code. This helps with next patch where after copying up metadata, we skip data copying step, if needed. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-20 09:56:07 +02:00
Vivek Goyal	d5791044d2	ovl: Provide a mount option metacopy=on/off for metadata copyup By default metadata only copy up is disabled. Provide a mount option so that users can choose one way or other. Also provide a kernel config and module option to enable/disable metacopy feature. metacopy feature requires redirect_dir=on when upper is present. Otherwise, it requires redirect_dir=follow at least. As of now, metacopy does not work with nfs_export=on. So if both metacopy=on and nfs_export=on then nfs_export is disabled. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-20 09:56:06 +02:00
Vivek Goyal	d6eac03913	ovl: Move the copy up helpers to copy_up.c Right now two copy up helpers are in inode.c. Amir suggested it might be better to move these to copy_up.c. There will be one more related function which will come in a later patch. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-20 09:56:06 +02:00
Vivek Goyal	9cec54c83a	ovl: Initialize ovl_inode->redirect in ovl_get_inode() ovl_inode->redirect is an inode property and should be initialized in ovl_get_inode() only when we are adding a new inode to cache. If inode is already in cache, it is already initialized and we should not be touching ovl_inode->redirect field. As of now this is not a problem as redirects are used only for directories which don't share inode. But soon I want to use redirects for regular files also and there it can become an issue. Hence, move ->redirect initialization in ovl_get_inode(). Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-20 09:56:05 +02:00
Miklos Szeredi	670c23248e	ovl: obsolete "check_copy_up" module option This was provided for debugging the ro/rw inconsistency. The inconsistency is now gone so this option is obsolete. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-18 15:44:44 +02:00
Miklos Szeredi	fb16043b46	vfs: remove open_flags from d_real() Opening regular files on overlayfs is now handled via ovl_open(). Remove the now unused "open_flags" argument from d_op->d_real() and the d_real() helper. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-18 15:44:44 +02:00
Miklos Szeredi	de2a4a501e	Partially revert "locks: fix file locking on overlayfs" This partially reverts commit `c568d68341`. Overlayfs files will now automatically get the correct locks, no need to hack overlay support in VFS. It is a partial revert, because it leaves the locks_inode() calls in place and defines locks_inode() to file_inode(). We could revert those as well, but it would be unnecessary code churn and it makes sense to document that we are getting the inode for locking purposes. Don't revert MS_NOREMOTELOCK yet since that has been part of the userspace API for some time (though not in a useful way). Will try to remove internal flags later when the dust around the new mount API settles. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Acked-by: Jeff Layton <jlayton@kernel.org>	2018-07-18 15:44:43 +02:00
Miklos Szeredi	4ab30319fd	Revert "vfs: add flags to d_real()" This reverts commit `495e642939`. No user of "flags" argument of d_real() remain. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-18 15:44:43 +02:00
Miklos Szeredi	88059de155	Revert "ovl: fix relatime for directories" This reverts commit `cd91304e71`. Overlayfs no longer relies on the vfs correct atime handling. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-18 15:44:43 +02:00
Miklos Szeredi	8ede205541	ovl: add reflink/copyfile/dedup support Since set of arguments are so similar, handle in a common helper. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-18 15:44:42 +02:00
Miklos Szeredi	f7c72396d0	ovl: add O_DIRECT support Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-18 15:44:42 +02:00
Miklos Szeredi	9e142c4102	ovl: add ovl_fiemap() Implement stacked fiemap(). Need to split inode operations for regular file (which has fiemap) and special file (which doesn't have fiemap). Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-18 15:44:42 +02:00
Miklos Szeredi	dab5ca8fd9	ovl: add lsattr/chattr support Implement FS_IOC_GETFLAGS and FS_IOC_SETFLAGS. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-18 15:44:42 +02:00
Miklos Szeredi	aab8848cee	ovl: add ovl_fallocate() Implement stacked fallocate. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-18 15:44:42 +02:00
Miklos Szeredi	2f502839e8	ovl: add ovl_mmap() Implement stacked mmap. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-18 15:44:42 +02:00
Miklos Szeredi	de30dfd629	ovl: add ovl_fsync() Implement stacked fsync(). Don't sync if lower (noticed by Amir Goldstein). Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-18 15:44:42 +02:00
Miklos Szeredi	2a92e07edc	ovl: add ovl_write_iter() Implement stacked writes. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-18 15:44:41 +02:00
Miklos Szeredi	16914e6fc7	ovl: add ovl_read_iter() Implement stacked reading. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-18 15:44:41 +02:00
Miklos Szeredi	2ef66b8a03	ovl: add helper to return real file In the common case we can just use the real file cached in file->private_data. There are two exceptions: 1) File has been copied up since open: in this unlikely corner case just use a throwaway real file for the operation. If ever this becomes a performance problem (very unlikely, since overlayfs has been doing most fine without correctly handling this case at all), then we can deal with that by updating the cached real file. 2) File's f_flags have changed since open: no need to reopen the cached real file, we can just change the flags there as well. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-18 15:44:41 +02:00
Miklos Szeredi	d1d04ef857	ovl: stack file ops Implement file operations on a regular overlay file. The underlying file is opened separately and cached in ->private_data. It might be worth making an exception for such files when accounting in nr_file to conform to userspace expectations. We are only adding a small overhead (248 bytes for the struct file) since the real inode and dentry are pinned by overlayfs anyway. This patch doesn't have any effect, since the vfs will use d_real() to find the real underlying file to open. The patch at the end of the series will actually enable this functionality. AV: make it use open_with_fake_path(), don't mess with override_creds SzM: still need to mess with override_creds() until no fs uses current_cred() in their open method. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-07-18 15:44:41 +02:00
Miklos Szeredi	e8c985bace	ovl: deal with overlay files in ovl_d_real() Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-18 15:44:41 +02:00
Miklos Szeredi	46e5d0a390	ovl: copy up file size as well Copy i_size of the underlying inode to the overlay inode in ovl_copyattr(). This is in preparation for stacking I/O operations on overlay files. This patch shouldn't have any observable effect. Remove stale comment from ovl_setattr() [spotted by Vivek Goyal]. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-18 15:44:41 +02:00
Miklos Szeredi	5812160eb5	Revert "Revert "ovl: get_write_access() in truncate"" This reverts commit `31c3a70695`. Re-add functionality dealing with i_writecount on truncate to overlayfs. This patch shouldn't have any observable effects, since we just re-assert the writecount that vfs_truncate() already got for us. This is in preparation for moving overlay functionality out of the VFS. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-18 15:44:41 +02:00
Miklos Szeredi	4f3572954a	ovl: copy up inode flags On inode creation copy certain inode flags from the underlying real inode to the overlay inode. This is in preparation for moving overlay functionality out of the VFS. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-18 15:44:41 +02:00
Miklos Szeredi	d9854c87f0	ovl: copy up times Copy up mtime and ctime to overlay inode after times in real object are modified. Be careful not to dirty cachelines when not necessary. This is in preparation for moving overlay functionality out of the VFS. This patch shouldn't have any observable effect. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-18 15:44:40 +02:00
Amir Goldstein	6781069307	ovl: fix wrong use of impure dir cache in ovl_iterate() Only upper dir can be impure, but if we are in the middle of iterating a lower real dir, dir could be copied up and marked impure. We only want the impure cache if we started iterating a real upper dir to begin with. Aditya Kali reported that the following reproducer hits the WARN_ON(!cache->refcount) in ovl_get_cache(): docker run --rm drupal:8.5.4-fpm-alpine \ sh -c 'cd /var/www/html/vendor/symfony && \ chown -R www-data:www-data . && ls -l .' Reported-by: Aditya Kali <adityakali@google.com> Tested-by: Aditya Kali <adityakali@google.com> Fixes: `4edb83bb10` ('ovl: constant d_ino for non-merge dirs') Cc: <stable@vger.kernel.org> # v4.14 Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-17 16:04:34 +02:00
Linus Torvalds	7a932516f5	vfs/y2038: inode timestamps conversion to timespec64 This is a late set of changes from Deepa Dinamani doing an automated treewide conversion of the inode and iattr structures from 'timespec' to 'timespec64', to push the conversion from the VFS layer into the individual file systems. There were no conflicts between this and the contents of linux-next until just before the merge window, when we saw multiple problems: - A minor conflict with my own y2038 fixes, which I could address by adding another patch on top here. - One semantic conflict with late changes to the NFS tree. I addressed this by merging Deepa's original branch on top of the changes that now got merged into mainline and making sure the merge commit includes the necessary changes as produced by coccinelle. - A trivial conflict against the removal of staging/lustre. - Multiple conflicts against the VFS changes in the overlayfs tree. These are still part of linux-next, but apparently this is no longer intended for 4.18 [1], so I am ignoring that part. As Deepa writes: The series aims to switch vfs timestamps to use struct timespec64. Currently vfs uses struct timespec, which is not y2038 safe. The series involves the following: 1. Add vfs helper functions for supporting struct timespec64 timestamps. 2. Cast prints of vfs timestamps to avoid warnings after the switch. 3. Simplify code using vfs timestamps so that the actual replacement becomes easy. 4. Convert vfs timestamps to use struct timespec64 using a script. This is a flag day patch. Next steps: 1. Convert APIs that can handle timespec64, instead of converting timestamps at the boundaries. 2. Update internal data structures to avoid timestamp conversions. Thomas Gleixner adds: I think there is no point to drag that out for the next merge window. The whole thing needs to be done in one go for the core changes which means that you're going to play that catchup game forever. Let's get over with it towards the end of the merge window. [1] https://www.spinics.net/lists/linux-fsdevel/msg128294.html Merge tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground Pull inode timestamps conversion to timespec64 from Arnd Bergmann. * tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground: pstore: Remove bogus format string definition vfs: change inode times to use struct timespec64 pstore: Convert internal records to timespec64 udf: Simplify calls to udf_disk_stamp_to_time fs: nfs: get rid of memcpys for inode times ceph: make inode time prints to be long long lustre: Use long long type to print inode time fs: add timespec64_truncate()	2018-06-15 07:31:07 +09:00
Kees Cook	6396bb2215	treewide: kzalloc() -> kcalloc() The kzalloc() function has a 2-factor argument form, kcalloc(). This patch replaces cases of: kzalloc(a * b, gfp) with: kcalloc(a, b, gfp) as well as handling cases of: kzalloc(a * b * c, gfp) with: kzalloc(array3_size(a, b, c), gfp) as it's slightly less ugly than: kcalloc(array_size(a, b), c, gfp) This does, however, attempt to ignore constant size factors like: kzalloc(4 * 1024, gfp) though any constants defined via macros get caught up in the conversion. Any factors with a sizeof() of "unsigned char", "char", and "u8" were dropped, since they're redundant. The Coccinelle script used for this was: // Fix redundant parens around sizeof(). @@ type TYPE; expression THING, E; @@ ( kzalloc( - (sizeof(TYPE)) * E + sizeof(TYPE) * E , ...) \| kzalloc( - (sizeof(THING)) * E + sizeof(THING) * E , ...) ) // Drop single-byte sizes and redundant parens. @@ expression COUNT; typedef u8; typedef __u8; @@ ( kzalloc( - sizeof(u8) * (COUNT) + COUNT , ...) \| kzalloc( - sizeof(__u8) * (COUNT) + COUNT , ...) \| kzalloc( - sizeof(char) * (COUNT) + COUNT , ...) \| kzalloc( - sizeof(unsigned char) * (COUNT) + COUNT , ...) \| kzalloc( - sizeof(u8) * COUNT + COUNT , ...) \| kzalloc( - sizeof(__u8) * COUNT + COUNT , ...) \| kzalloc( - sizeof(char) * COUNT + COUNT , ...) \| kzalloc( - sizeof(unsigned char) * COUNT + COUNT , ...) ) // 2-factor product with sizeof(type/expression) and identifier or constant. @@ type TYPE; expression THING; identifier COUNT_ID; constant COUNT_CONST; @@ ( - kzalloc + kcalloc ( - sizeof(TYPE) * (COUNT_ID) + COUNT_ID, sizeof(TYPE) , ...) \| - kzalloc + kcalloc ( - sizeof(TYPE) * COUNT_ID + COUNT_ID, sizeof(TYPE) , ...) \| - kzalloc + kcalloc ( - sizeof(TYPE) * (COUNT_CONST) + COUNT_CONST, sizeof(TYPE) , ...) \| - kzalloc + kcalloc ( - sizeof(TYPE) * COUNT_CONST + COUNT_CONST, sizeof(TYPE) , ...) \| - kzalloc + kcalloc ( - sizeof(THING) * (COUNT_ID) + COUNT_ID, sizeof(THING) , ...) 
\| - kzalloc + kcalloc ( - sizeof(THING) * COUNT_ID + COUNT_ID, sizeof(THING) , ...) \| - kzalloc + kcalloc ( - sizeof(THING) * (COUNT_CONST) + COUNT_CONST, sizeof(THING) , ...) \| - kzalloc + kcalloc ( - sizeof(THING) * COUNT_CONST + COUNT_CONST, sizeof(THING) , ...) ) // 2-factor product, only identifiers. @@ identifier SIZE, COUNT; @@ - kzalloc + kcalloc ( - SIZE * COUNT + COUNT, SIZE , ...) // 3-factor product with 1 sizeof(type) or sizeof(expression), with // redundant parens removed. @@ expression THING; identifier STRIDE, COUNT; type TYPE; @@ ( kzalloc( - sizeof(TYPE) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) \| kzalloc( - sizeof(TYPE) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) \| kzalloc( - sizeof(TYPE) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) \| kzalloc( - sizeof(TYPE) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) \| kzalloc( - sizeof(THING) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) \| kzalloc( - sizeof(THING) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) \| kzalloc( - sizeof(THING) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) \| kzalloc( - sizeof(THING) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) ) // 3-factor product with 2 sizeof(variable), with redundant parens removed. @@ expression THING1, THING2; identifier COUNT; type TYPE1, TYPE2; @@ ( kzalloc( - sizeof(TYPE1) * sizeof(TYPE2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) \| kzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) \| kzalloc( - sizeof(THING1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) \| kzalloc( - sizeof(THING1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) 
\| kzalloc( - sizeof(TYPE1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) \| kzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) ) // 3-factor product, only identifiers, with redundant parens removed. @@ identifier STRIDE, SIZE, COUNT; @@ ( kzalloc( - (COUNT) * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) \| kzalloc( - COUNT * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) \| kzalloc( - COUNT * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) \| kzalloc( - (COUNT) * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) \| kzalloc( - COUNT * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) \| kzalloc( - (COUNT) * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) \| kzalloc( - (COUNT) * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) \| kzalloc( - COUNT * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) ) // Any remaining multi-factor products, first at least 3-factor products, // when they're not all constants... @@ expression E1, E2, E3; constant C1, C2, C3; @@ ( kzalloc(C1 * C2 * C3, ...) \| kzalloc( - (E1) * E2 * E3 + array3_size(E1, E2, E3) , ...) \| kzalloc( - (E1) * (E2) * E3 + array3_size(E1, E2, E3) , ...) \| kzalloc( - (E1) * (E2) * (E3) + array3_size(E1, E2, E3) , ...) \| kzalloc( - E1 * E2 * E3 + array3_size(E1, E2, E3) , ...) ) // And then all remaining 2 factors products when they're not all constants, // keeping sizeof() as the second factor argument. @@ expression THING, E1, E2; type TYPE; constant C1, C2, C3; @@ ( kzalloc(sizeof(THING) * C2, ...) \| kzalloc(sizeof(TYPE) * C2, ...) \| kzalloc(C1 * C2 * C3, ...) \| kzalloc(C1 * C2, ...) \| - kzalloc + kcalloc ( - sizeof(TYPE) * (E2) + E2, sizeof(TYPE) , ...) \| - kzalloc + kcalloc ( - sizeof(TYPE) * E2 + E2, sizeof(TYPE) , ...) \| - kzalloc + kcalloc ( - sizeof(THING) * (E2) + E2, sizeof(THING) , ...) 
\| - kzalloc + kcalloc ( - sizeof(THING) * E2 + E2, sizeof(THING) , ...) \| - kzalloc + kcalloc ( - (E1) * E2 + E1, E2 , ...) \| - kzalloc + kcalloc ( - (E1) * (E2) + E1, E2 , ...) \| - kzalloc + kcalloc ( - E1 * E2 + E1, E2 , ...) ) Signed-off-by: Kees Cook <keescook@chromium.org>	2018-06-12 16:19:22 -07:00
Deepa Dinamani	95582b0083	vfs: change inode times to use struct timespec64 struct timespec is not y2038 safe. Transition vfs to use y2038 safe struct timespec64 instead. The change was made with the help of the following Coccinelle script. This catches about 80% of the changes. All the header file and logic changes are included in the first 5 rules. The rest are trivial substitutions. I avoid changing any of the function signatures or any other filesystem specific data structures to keep the patch simple for review. The script can be a little shorter by combining different cases. But, this version was sufficient for my use case. virtual patch @ depends on patch @ identifier now; @@ - struct timespec + struct timespec64 current_time ( ... ) { - struct timespec now = current_kernel_time(); + struct timespec64 now = current_kernel_time64(); ... - return timespec_trunc( + return timespec64_trunc( ... ); } @ depends on patch @ identifier xtime; @@ struct $ iattr \\| inode \\| kstat $ { ... - struct timespec xtime; + struct timespec64 xtime; ... } @ depends on patch @ identifier t; @@ struct inode_operations { ... int (update_time) (..., - struct timespec t, + struct timespec64 t, ...); ... } @ depends on patch @ identifier t; identifier fn_update_time =~ "update_time$"; @@ fn_update_time (..., - struct timespec t, + struct timespec64 t, ...) { ... } @ depends on patch @ identifier t; @@ lease_get_mtime( ... , - struct timespec t + struct timespec64 t ) { ... 
} @te depends on patch forall@ identifier ts; local idexpression struct inode inode_node; identifier i_xtime =~ "^i_[acm]time$"; identifier ia_xtime =~ "^ia_[acm]time$"; identifier fn_update_time =~ "update_time$"; identifier fn; expression e, E3; local idexpression struct inode node1; local idexpression struct inode node2; local idexpression struct iattr attr1; local idexpression struct iattr attr2; local idexpression struct iattr attr; identifier i_xtime1 =~ "^i_[acm]time$"; identifier i_xtime2 =~ "^i_[acm]time$"; identifier ia_xtime1 =~ "^ia_[acm]time$"; identifier ia_xtime2 =~ "^ia_[acm]time$"; @@ ( ( - struct timespec ts; + struct timespec64 ts; \| - struct timespec ts = current_time(inode_node); + struct timespec64 ts = current_time(inode_node); ) <+... when != ts ( - timespec_equal(&inode_node->i_xtime, &ts) + timespec64_equal(&inode_node->i_xtime, &ts) \| - timespec_equal(&ts, &inode_node->i_xtime) + timespec64_equal(&ts, &inode_node->i_xtime) \| - timespec_compare(&inode_node->i_xtime, &ts) + timespec64_compare(&inode_node->i_xtime, &ts) \| - timespec_compare(&ts, &inode_node->i_xtime) + timespec64_compare(&ts, &inode_node->i_xtime) \| ts = current_time(e) \| fn_update_time(..., &ts,...) \| inode_node->i_xtime = ts \| node1->i_xtime = ts \| ts = inode_node->i_xtime \| <+... attr1->ia_xtime ...+> = ts \| ts = attr1->ia_xtime \| ts.tv_sec \| ts.tv_nsec \| btrfs_set_stack_timespec_sec(..., ts.tv_sec) \| btrfs_set_stack_timespec_nsec(..., ts.tv_nsec) \| - ts = timespec64_to_timespec( + ts = ... -) \| - ts = ktime_to_timespec( + ts = ktime_to_timespec64( ...) \| - ts = E3 + ts = timespec_to_timespec64(E3) \| - ktime_get_real_ts(&ts) + ktime_get_real_ts64(&ts) \| fn(..., - ts + timespec64_to_timespec(ts) ,...) ) ...+> ( <... 
when != ts - return ts; + return timespec64_to_timespec(ts); ...> ) \| - timespec_equal(&node1->i_xtime1, &node2->i_xtime2) + timespec64_equal(&node1->i_xtime2, &node2->i_xtime2) \| - timespec_equal(&node1->i_xtime1, &attr2->ia_xtime2) + timespec64_equal(&node1->i_xtime2, &attr2->ia_xtime2) \| - timespec_compare(&node1->i_xtime1, &node2->i_xtime2) + timespec64_compare(&node1->i_xtime1, &node2->i_xtime2) \| node1->i_xtime1 = - timespec_trunc(attr1->ia_xtime1, + timespec64_trunc(attr1->ia_xtime1, ...) \| - attr1->ia_xtime1 = timespec_trunc(attr2->ia_xtime2, + attr1->ia_xtime1 = timespec64_trunc(attr2->ia_xtime2, ...) \| - ktime_get_real_ts(&attr1->ia_xtime1) + ktime_get_real_ts64(&attr1->ia_xtime1) \| - ktime_get_real_ts(&attr.ia_xtime1) + ktime_get_real_ts64(&attr.ia_xtime1) ) @ depends on patch @ struct inode node; struct iattr attr; identifier fn; identifier i_xtime =~ "^i_[acm]time$"; identifier ia_xtime =~ "^ia_[acm]time$"; expression e; @@ ( - fn(node->i_xtime); + fn(timespec64_to_timespec(node->i_xtime)); \| fn(..., - node->i_xtime); + timespec64_to_timespec(node->i_xtime)); \| - e = fn(attr->ia_xtime); + e = fn(timespec64_to_timespec(attr->ia_xtime)); ) @ depends on patch forall @ struct inode node; struct iattr attr; identifier i_xtime =~ "^i_[acm]time$"; identifier ia_xtime =~ "^ia_[acm]time$"; identifier fn; @@ { + struct timespec ts; <+... ( + ts = timespec64_to_timespec(node->i_xtime); fn (..., - &node->i_xtime, + &ts, ...); \| + ts = timespec64_to_timespec(attr->ia_xtime); fn (..., - &attr->ia_xtime, + &ts, ...); ) ...+> } @ depends on patch forall @ struct inode node; struct iattr attr; struct kstat stat; identifier ia_xtime =~ "^ia_[acm]time$"; identifier i_xtime =~ "^i_[acm]time$"; identifier xtime =~ "^[acm]time$"; identifier fn, ret; @@ { + struct timespec ts; <+... 
( + ts = timespec64_to_timespec(node->i_xtime); ret = fn (..., - &node->i_xtime, + &ts, ...); \| + ts = timespec64_to_timespec(node->i_xtime); ret = fn (..., - &node->i_xtime); + &ts); \| + ts = timespec64_to_timespec(attr->ia_xtime); ret = fn (..., - &attr->ia_xtime, + &ts, ...); \| + ts = timespec64_to_timespec(attr->ia_xtime); ret = fn (..., - &attr->ia_xtime); + &ts); \| + ts = timespec64_to_timespec(stat->xtime); ret = fn (..., - &stat->xtime); + &ts); ) ...+> } @ depends on patch @ struct inode node; struct inode node2; identifier i_xtime1 =~ "^i_[acm]time$"; identifier i_xtime2 =~ "^i_[acm]time$"; identifier i_xtime3 =~ "^i_[acm]time$"; struct iattr attrp; struct iattr attrp2; struct iattr attr ; identifier ia_xtime1 =~ "^ia_[acm]time$"; identifier ia_xtime2 =~ "^ia_[acm]time$"; struct kstat stat; struct kstat stat1; struct timespec64 ts; identifier xtime =~ "^[acmb]time$"; expression e; @@ ( ( node->i_xtime2 \\| attrp->ia_xtime2 \\| attr.ia_xtime2 \) = node->i_xtime1 ; \| node->i_xtime2 = $ node2->i_xtime1 \\| timespec64_trunc(...) $; \| node->i_xtime2 = node->i_xtime1 = node->i_xtime3 = $ts \\| current_time(...) $; \| node->i_xtime1 = node->i_xtime3 = $ts \\| current_time(...) 
$; \| stat->xtime = node2->i_xtime1; \| stat1.xtime = node2->i_xtime1; \| ( node->i_xtime2 \\| attrp->ia_xtime2 \) = attrp->ia_xtime1 ; \| ( attrp->ia_xtime1 \\| attr.ia_xtime1 \) = attrp2->ia_xtime2; \| - e = node->i_xtime1; + e = timespec64_to_timespec( node->i_xtime1 ); \| - e = attrp->ia_xtime1; + e = timespec64_to_timespec( attrp->ia_xtime1 ); \| node->i_xtime1 = current_time(...); \| node->i_xtime2 = node->i_xtime1 = node->i_xtime3 = - e; + timespec_to_timespec64(e); \| node->i_xtime1 = node->i_xtime3 = - e; + timespec_to_timespec64(e); \| - node->i_xtime1 = e; + node->i_xtime1 = timespec_to_timespec64(e); ) Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Cc: <anton@tuxera.com> Cc: <balbi@kernel.org> Cc: <bfields@fieldses.org> Cc: <darrick.wong@oracle.com> Cc: <dhowells@redhat.com> Cc: <dsterba@suse.com> Cc: <dwmw2@infradead.org> Cc: <hch@lst.de> Cc: <hirofumi@mail.parknet.co.jp> Cc: <hubcap@omnibond.com> Cc: <jack@suse.com> Cc: <jaegeuk@kernel.org> Cc: <jaharkes@cs.cmu.edu> Cc: <jslaby@suse.com> Cc: <keescook@chromium.org> Cc: <mark@fasheh.com> Cc: <miklos@szeredi.hu> Cc: <nico@linaro.org> Cc: <reiserfs-devel@vger.kernel.org> Cc: <richard@nod.at> Cc: <sage@redhat.com> Cc: <sfrench@samba.org> Cc: <swhiteho@redhat.com> Cc: <tj@kernel.org> Cc: <trond.myklebust@primarydata.com> Cc: <tytso@mit.edu> Cc: <viro@zeniv.linux.org.uk>	2018-06-05 16:57:31 -07:00
Amir Goldstein	01b39dcc95	ovl: use inode_insert5() to hash a newly created inode Currently, there is a small window where ovl_obtain_alias() can race with ovl_instantiate() and create two different overlay inodes with the same underlying real non-dir non-hardlink inode. The race requires an adversary to guess the file handle of the yet to be created upper inode and decode the guessed file handle after ovl_create_real(), but before ovl_instantiate(). This race does not affect overlay directory inodes, because those are decoded via ovl_lookup_real() and not with ovl_obtain_alias(). This patch fixes the race, by using inode_insert5() to add a newly created inode to cache. If the newly created inode appears to already exist in cache (hashed by the same real upper inode), we instantiate the dentry with the old inode and drop the new inode, instead of silently not hashing the new inode. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-05-31 11:06:12 +02:00
Vivek Goyal	ac6a52eb65	ovl: Pass argument to ovl_get_inode() in a structure ovl_get_inode() right now has 5 parameters. Soon this patch series will add 2 more and suddenly the argument list starts looking too long. Hence pass arguments to ovl_get_inode() in a structure, which looks a little cleaner. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-05-31 11:06:12 +02:00
Miklos Szeredi	b148cba403	ovl: clean up copy-up error paths Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-05-31 11:06:11 +02:00
Miklos Szeredi	dd8ac699ed	ovl: return EIO on internal error EIO better represents an internal error than ENOENT. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-05-31 11:06:11 +02:00
Al Viro	f73cc77c3a	ovl: make ovl_create_real() cope with vfs_mkdir() safely vfs_mkdir() may succeed and leave the dentry passed to it unhashed and negative. ovl_create_real() is the last caller breaking when that happens. [amir: split re-factoring of ovl_create_temp() to prep patch add comment about unhashed dir after mkdir add pr_warn() if mkdir succeeds and lookup fails] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-05-31 11:06:11 +02:00
Amir Goldstein	137ec526a2	ovl: create helper ovl_create_temp() Also used ovl_create_temp() in ovl_create_index() instead of calling ovl_do_mkdir() directly, so now all callers of ovl_do_mkdir() are routed through ovl_create_real(), which paves the way for Al's fix for non-hashed result from vfs_mkdir(). Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-05-31 11:06:11 +02:00
Miklos Szeredi	95a1c8153a	ovl: return dentry from ovl_create_real() Al Viro suggested to simplify callers of ovl_create_real() by returning the created dentry (or ERR_PTR) from ovl_create_real(). Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-05-31 11:06:11 +02:00
Amir Goldstein	471ec5dcf4	ovl: struct cattr cleanups * Rename to ovl_cattr * Fold ovl_create_real() hardlink argument into struct ovl_cattr * Create macro OVL_CATTR() to initialize struct ovl_cattr from mode Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-05-31 11:06:10 +02:00
Amir Goldstein	6cf00764b0	ovl: strip debug argument from ovl_do_ helpers It did not prove to be useful. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-05-31 11:06:10 +02:00
Amir Goldstein	a8b9e0ceed	ovl: remove WARN_ON() real inode attributes mismatch Overlayfs should cope with online changes to underlying layer without crashing the kernel, which is what xfstest overlay/019 checks. This test may sometimes trigger WARN_ON() in ovl_create_or_link() when linking an overlay inode that has been changed on underlying layer. Remove those WARN_ON() to prevent the stress test from failing. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-05-31 11:06:10 +02:00
Miklos Szeredi	4280f74a57	ovl: Kconfig documentation fixes Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-05-31 11:06:10 +02:00
Amir Goldstein	795939a93e	ovl: add support for "xino" mount and config options With mount option "xino=on", mounter declares that there are enough free high bits in underlying fs to hold the layer fsid. If overlayfs does encounter underlying inodes using the high xino bits reserved for layer fsid, a warning will be emitted and the original inode number will be used. The mount option name "xino" goes after a similar meaning mount option of aufs, but in overlayfs case, the mapping is stateless. An example for a use case of "xino=on" is when upper/lower is on an xfs filesystem. xfs uses 64bit inode numbers, but it currently never uses the upper 8 bits for inode numbers exposed via stat(2) and that is not likely to change in the future without user opting-in for a new xfs feature. The actual number of unused upper bits is much larger and determined by the xfs filesystem geometry (64 - agno_log - agblklog - inopblog). That means that for all practical purposes, there are enough unused bits in xfs inode numbers for more than OVL_MAX_STACK unique fsid's. Another use case of "xino=on" is when upper/lower is on tmpfs. tmpfs inode numbers are allocated sequentially since boot, so they will practically never use the high inode number bits. For compatibility with applications that expect 32bit inodes, the feature can be disabled with "xino=off". The option "xino=auto" automatically detects underlying filesystem that use 32bit inodes and enables the feature. The Kconfig option OVERLAY_FS_XINO_AUTO and module parameter of the same name, determine if the default mode for overlayfs mount is "xino=auto" or "xino=off". Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-04-12 12:04:50 +02:00
Amir Goldstein	adbf4f7ea8	ovl: consistent d_ino for non-samefs with xino When overlay layers are not all on the same fs, but all inode numbers of underlying fs do not use the high 'xino' bits, overlay st_ino values are constant and persistent. In that case, relax non-samefs constraint for consistent d_ino and always iterate non-merge dir using ovl_fill_real() actor so we can remap lower inode numbers to unique lower fs range. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-04-12 12:04:50 +02:00
Amir Goldstein	12574a9f4c	ovl: consistent i_ino for non-samefs with xino When overlay layers are not all on the same fs, but all inode numbers of underlying fs do not use the high 'xino' bits, overlay st_ino values are constant and persistent. In that case, set i_ino value to the same value as st_ino for nfsd readdirplus validator. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-04-12 12:04:50 +02:00
Amir Goldstein	e487d889b7	ovl: constant st_ino for non-samefs with xino On 64bit systems, when overlay layers are not all on the same fs, but all inode numbers of underlying fs are not using the high bits, use the high bits to partition the overlay st_ino address space. The high bits hold the fsid (upper fsid is 0). This way overlay inode numbers are unique and all inodes use overlay st_dev. Inode numbers are also persistent for a given layer configuration. Currently, our only indication for available high ino bits is from a filesystem that supports file handles and uses the default encode_fh() operation, which encodes a 32bit inode number. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-04-12 12:04:50 +02:00
Amir Goldstein	5148626b80	ovl: allocate anon bdev per unique lower fs Instead of allocating an anonymous bdev per lower layer, allocate one anonymous bdev per every unique lower fs that is different than upper fs. Every unique lower fs is assigned an fsid > 0 and the number of unique lower fs are stored in ofs->numlowerfs. The assigned fsid is stored in the lower layer struct and will be used also for inode number multiplexing. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-04-12 12:04:50 +02:00
Amir Goldstein	da309e8c05	ovl: factor out ovl_map_dev_ino() helper A helper for ovl_getattr() to map the values of st_dev and st_ino according to constant st_ino rules. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-04-12 12:04:50 +02:00
Miklos Szeredi	8f35cf51cd	ovl: cleanup ovl_update_time() No need to mess with an alias, the upperdentry can be retrieved directly from the overlay inode. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-04-12 12:04:50 +02:00
Miklos Szeredi	3a291774d1	ovl: add WARN_ON() for non-dir redirect cases Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-04-12 12:04:49 +02:00
Vivek Goyal	0471a9cdb0	ovl: cleanup setting OVL_INDEX Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-04-12 12:04:49 +02:00
Vivek Goyal	102b0d11cb	ovl: set d->is_dir and d->opaque for last path element Certain properties in ovl_lookup_data should be set only for the last element of the path. IOW, if we are calling ovl_lookup_single() for an absolute redirect, then d->is_dir and d->opaque do not make much sense for intermediate path elements. Instead set them only if the dentry being looked up is the last path element. As of now we do not seem to be making use of d->opaque if it is set for a path/dentry in lower. But just define the semantics so that future code can make use of this assumption. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-04-12 12:04:49 +02:00
Vivek Goyal	e9b77f90cc	ovl: Do not check for redirect if this is last layer If we are looking in last layer, then there should not be any need to process redirect. redirect information is used only for lookup in next lower layer and there is no more lower layer to look into. So no need to process redirects. IOW, ignore redirects on lowest layer. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-04-12 12:04:49 +02:00
Amir Goldstein	8b58924ad5	ovl: lookup in inode cache first when decoding lower file handle When decoding a lower file handle, we need to check if lower file was copied up and indexed and if it has a whiteout index, we need to check if this is an unlinked but open non-dir before returning -ESTALE. To find out if this is an unlinked but open non-dir we need to lookup an overlay inode in inode cache by lower inode and that requires decoding the lower file handle before looking in inode cache. Before this change, if the lower inode turned out to be a directory, we may have paid an expensive cost to reconnect that lower directory for nothing. After this change, we start by decoding a disconnected lower dentry and using the lower inode for looking up an overlay inode in inode cache. If we find overlay inode and dentry in cache, we avoid the index lookup overhead. If we don't find an overlay inode and dentry in cache, then we only need to decode a connected lower dentry in case the lower dentry is a non-indexed directory. The xfstests group overlay/exportfs tests decoding overlayfs file handles after drop_caches with different states of the file at encode and decode time. Overall the tests in the group call ovl_lower_fh_to_d() 89 times to decode a lower file handle. Before this change, the tests called ovl_get_index_fh() 75 times and reconnect_one() 61 times. After this change, the tests call ovl_get_index_fh() 70 times and reconnect_one() 59 times. The 2 cases where reconnect_one() was avoided are cases where a non-upper directory file handle was encoded, then the directory removed and then file handle was decoded. To demonstrate the effect on decoding file handles with hot inode/dentry cache, the drop_caches call in the tests was disabled. Without drop_caches, there are no reconnect_one() calls at all before or after the change. Before the change, there are 75 calls to ovl_get_index_fh(), exactly as the case with drop_caches. 
After the change, there are only 10 calls to ovl_get_index_fh(). Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-04-12 12:04:49 +02:00
Amir Goldstein	8a22efa15b	ovl: do not try to reconnect a disconnected origin dentry On lookup of non directory, we try to decode the origin file handle stored in upper inode. The origin file handle is supposed to be decoded to a disconnected non-dir dentry, which is fine, because we only need the lower inode of a copy up origin. However, if the origin file handle somehow turns out to be a directory we pay the expensive cost of reconnecting the directory dentry, only to get a mismatch file type and drop the dentry. Optimize this case by explicitly opting out of reconnecting the dentry. Opting-out of reconnect is done by passing a NULL acceptable callback to exportfs_decode_fh(). While the case described above is a strange corner case that does not really need to be optimized, the API added for this optimization will be used by a following patch to optimize a more common case of decoding an overlayfs file handle. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-04-12 12:04:49 +02:00
Amir Goldstein	5b2cccd32c	ovl: disambiguate ovl_encode_fh() Rename ovl_encode_fh() to ovl_encode_real_fh() to differentiate from the exportfs function ovl_encode_inode_fh() and change the latter to ovl_encode_fh() to match the exportfs method name. Rename ovl_decode_fh() to ovl_decode_real_fh() for consistency. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-04-12 12:04:49 +02:00
Amir Goldstein	9f99e50d46	ovl: set lower layer st_dev only if setting lower st_ino For broken hardlinks, we do not return lower st_ino, so we should also not return lower pseudo st_dev. Fixes: `a0c5ad307a` ("ovl: relax same fs constraint for constant st_ino") Cc: <stable@vger.kernel.org> #v4.15 Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-04-12 12:04:49 +02:00

766 Commits