linux

Author	SHA1	Message	Date
Ming Lei	85a8ce62c2	block: add bio_truncate to fix guard_bio_eod Some filesystem, such as vfat, may send bio which crosses device boundary, and the worse thing is that the IO request starting within device boundaries can contain more than one segment past EOD. Commit `dce30ca9e3` ("fs: fix guard_bio_eod to check for real EOD errors") tries to fix this issue by returning -EIO for this situation. However, this way lets fs user code lose chance to handle -EIO, then sync_inodes_sb() may hang for ever. Also the current truncating on last segment is dangerous by updating the last bvec, given bvec table becomes not immutable any more, and fs bio users may not retrieve the truncated pages via bio_for_each_segment_all() in its .end_io callback. Fixes this issue by supporting multi-segment truncating. And the approach is simpler: - just update bio size since block layer can make correct bvec with the updated bio size. Then bvec table becomes really immutable. - zero all truncated segments for read bio Cc: Carlos Maiolino <cmaiolino@redhat.com> Cc: linux-fsdevel@vger.kernel.org Fixed-by: `dce30ca9e3` ("fs: fix guard_bio_eod to check for real EOD errors") Reported-by: syzbot+2b9e54155c8c25d8d165@syzkaller.appspotmail.com Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-28 09:44:56 -07:00
Linus Torvalds	534121d289	io_uring-5.5-20191226 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl4E+aYQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgprKmEAC6tcPlb2BB+7fOuj44uAdE+RInqMxbfD3w Tj9KpF47e02DUvBTtwDJDHJ9QT4PlJhd66M1xrp3IMUV13PKQt9OFfc6TH38Jz/9 mhHDGNj1s+GVLRH22PQtFjyMgzHA6+UF4NxHLDJ62c2CtrCVswFRUiWrSR8LgvDp EkVELGEpi080ffton9nhyy3ylOCcpCu1xX1mOCg5EhcqzFQnZMlFaj9PDFrNhQzT e8fdl/nGoKtxZ/x6V8Oso02r/K1XievV4dfrAtOZg4jiqp/3G2eiqoGGcYnShSDU qulKLGsuHK51Lay8AGEaw3haeMn1PKCNe+xv0uCubHdf2iMyBdpjCLsLpTlhmtF/ DkfP13H8k3/nUP9Y8FHt9+Ld56qpdqi/77ngCF84Ed4MFXKYkwyFFyHLMaBCw5zk Z07qISAbj3UeRPug+8iBKpzNBUXvXqqOGHp2h0faXz+C0yG0l7HOkhZ3m+dDD6vN 6ABrMrS/ZuWdiW4PiJUejW81rlRKJaCgmTXMjjQCpgFeUqj6flB4sALp3amY9v7r CZVL67wBZ4u4YeKW0q8j4Lh3DrT5M7IGPP2uT9tw0FiamgYByC2rV/SAecxbh8f5 NQJbL4uyJwcusOorRvaKOWU9KQaz5Q6dx9auHnmLC3A0WsEFxjgtVwG3iqw8zOeF 8E7Lk3kPcA== =L2kR -----END PGP SIGNATURE----- Merge tag 'io_uring-5.5-20191226' of git://git.kernel.dk/linux-block Pull io_uring fixes from Jens Axboe: - Removal of now unused busy wqe list (Hillf) - Add cond_resched() to io-wq work processing (Hillf) - And then the series that I hinted at from last week, which removes the sqe from the io_kiocb and keeps all sqe handling on the prep side. This guarantees that an opcode can't do the wrong thing and read the sqe more than once. This is unchanged from last week, no issues have been observed with this in testing. Hence I really think we should fold this into 5.5. * tag 'io_uring-5.5-20191226' of git://git.kernel.dk/linux-block: io-wq: add cond_resched() to worker thread io-wq: remove unused busy list from io_sqe io_uring: pass in 'sqe' to the prep handlers io_uring: standardize the prep methods io_uring: read 'count' for IORING_OP_TIMEOUT in prep handler io_uring: move all prep state for IORING_OP_{SEND,RECV}_MGS to prep handler io_uring: move all prep state for IORING_OP_CONNECT to prep handler io_uring: add and use struct io_rw for read/writes io_uring: use u64_to_user_ptr() consistently	2019-12-27 11:17:08 -08:00
Hillf Danton	fd1c4bc6e9	io-wq: add cond_resched() to worker thread Reschedule the current IO worker to cut the risk that it is becoming a cpu hog. Signed-off-by: Hillf Danton <hdanton@sina.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-24 09:14:29 -07:00
Hillf Danton	1f424e8bd1	io-wq: remove unused busy list from io_sqe Commit `e61df66c69` ("io-wq: ensure free/busy list browsing see all items") added a list for io workers in addition to the free and busy lists, not only making worker walk cleaner, but leaving the busy list unused. Let's remove it. Signed-off-by: Hillf Danton <hdanton@sina.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-23 08:23:54 -07:00
Paulo Alcantara (SUSE)	046aca3c25	cifs: Optimize readdir on reparse points When listing a directory with thounsands of files and most of them are reparse points, we simply marked all those dentries for revalidation and then sending additional (compounded) create/getinfo/close requests for each of them. Instead, upon receiving a response from an SMB2_QUERY_DIRECTORY (FileIdFullDirectoryInformation) command, the directory entries that have a file attribute of FILE_ATTRIBUTE_REPARSE_POINT will contain an EaSize field with a reparse tag in it, so we parse it and mark the dentry for revalidation only if it is a DFS or a symlink. Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2019-12-23 09:04:44 -06:00
Nathan Chancellor	7935799e04	cifs: Adjust indentation in smb2_open_file Clang warns: ../fs/cifs/smb2file.c:70:3: warning: misleading indentation; statement is not part of the previous 'if' [-Wmisleading-indentation] if (oparms->tcon->use_resilient) { ^ ../fs/cifs/smb2file.c:66:2: note: previous statement is here if (rc) ^ 1 warning generated. This warning occurs because there is a space after the tab on this line. Remove it so that the indentation is consistent with the Linux kernel coding style and clang no longer warns. Fixes: `592fafe644` ("Add resilienthandles mount parm") Link: https://github.com/ClangBuiltLinux/linux/issues/826 Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2019-12-23 09:04:44 -06:00
Linus Torvalds	9efa3ed504	Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro: "Eric's s_inodes softlockup fixes + Jan's fix for recent regression from pipe rework" * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: call fsnotify_sb_delete after evict_inodes fs: avoid softlockups in s_inodes iterators pipe: Fix bogus dereference in iov_iter_alignment()	2019-12-22 17:00:04 -08:00
Linus Torvalds	c601747175	Fixes for 5.5: - Minor documentation fixes - Fix a file corruption due to read racing with an insert range operation. - Fix log reservation overflows when allocating large rt extents - Fix a buffer log item flags check - Don't allow administrators to mount with sunit= options that will cause later xfs_repair complaints about the root directory being suspicious because the fs geometry appeared inconsistent - Fix a non-static helper that should have been static -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl3890IACgkQ+H93GTRK tOuNkw//e7lT3Xys+dd60Regn+EkYkuI6myN4TsRNYHI7fQ0C9FqkGyYltJmEvqr QhUvQD1BU9Czb5Ghba+DpYz2dpqLYrbVTdmK4jNGjt9K0xNEU7zL297/PyE5Y54t il5nAxZVZ9x0aadKS0yhIt+Q3+dN29O2ablcRcErPi6H5EM3csjmPnrHKD+irG5j MhY5NNWvU1//qU4w2q8ikRKGhMrDYLWo57iJoIX2y17Sw+HXrDsEGoavOpyaoy0v T5m4OfBxU9FD8UhqI86Pua9HG8AlZK+IPT9pZjYGYWT8mkuTppSWjSHJU6HBGqF2 fNrfMpQK/2H5KrqBTVvAzbhYcby8L1tZXUg+4w5iJuvAlHqb/IuBd+Y+nbSbduL/ O9k3Ao0PL6Yt78knNf2F1943ioAI0zbhjDhmKtX17qfAojWQz6CAJmP5OWPWPprh FHA9WT0OzArXF77E+srfYyChclQzllBTOmYNKU//sXgKnqe33fgRIN6Il3T6V3w1 5ifI/0N+FV2Z+yRqE0gSaqLNdPATMNzuGorQsv7P+TRPtD70aB8dhRzcVzqzfbTm C7owl3FGQFTCS/PIwPTRsLfqt3vt9mvc7pUMZOIu7uP63T2daPZ2amTbav/poXgb 5Zih0pknWS8iQM4bwaPMLEr51Wp3Yo8gDPuW1jKJ13FCOuXU70E= =JXs9 -----END PGP SIGNATURE----- Merge tag 'xfs-5.5-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull xfs fixes from Darrick Wong: "Fix a few bugs that could lead to corrupt files, fsck complaints, and filesystem crashes: - Minor documentation fixes - Fix a file corruption due to read racing with an insert range operation. - Fix log reservation overflows when allocating large rt extents - Fix a buffer log item flags check - Don't allow administrators to mount with sunit= options that will cause later xfs_repair complaints about the root directory being suspicious because the fs geometry appeared inconsistent - Fix a non-static helper that should have been static" * tag 'xfs-5.5-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: Make the symbol 'xfs_rtalloc_log_count' static xfs: don't commit sunit/swidth updates to disk if that would cause repair failures xfs: split the sunit parameter update into two parts xfs: refactor agfl length computation function libxfs: resync with the userspace libxfs xfs: use bitops interface for buf log item AIL flag check xfs: fix log reservation overflows when allocating large rt extents xfs: stabilize insert range start boundary to avoid COW writeback race xfs: fix Sphinx documentation warning	2019-12-22 10:59:06 -08:00
Linus Torvalds	a396560706	Ext4 bug fixes (including a regression fix) for 5.5 -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAl3/fDEACgkQ8vlZVpUN gaMZ6Qf/f973waBpA1E9GgAvB4AymRvGbqPJhW2lDDhEl36oXVpUw6EgIKWgNQPS HP6NhYXZakrpEak6Uk2MtiTmcm+6lqDJ+bCslCMylNh9/Y1yUrED2r8l7S3nGv4g hVB7Eah7E+sutDyrDQhYhcQo3GJjt8CbwRLgo8fbhSVrZ7qdfb0lWQmVnruc+72b 3VAeMzPJb0wRY6myxLN4Pw6oEMR1WKVsXm3I9gNXboE2XvgVvnNn2tJxP+xml8rW uGxzWTo7QQNN2bUyjZBa6Mm44lMpHr7JT0nMwkIGV5v3eAYuBgeSwIXUskfw29q7 sP9xNP2voU3M6TyWuT0+cHpoeZasPg== =K63f -----END PGP SIGNATURE----- Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 bug fixes from Ted Ts'o: "Ext4 bug fixes, including a regression fix" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: clarify impact of 'commit' mount option ext4: fix unused-but-set-variable warning in ext4_add_entry() jbd2: fix kernel-doc notation warning ext4: use RCU API in debug_print_tree ext4: validate the debug_want_extra_isize mount option at parse time ext4: reserve revoke credits in __ext4_new_inode ext4: unlock on error in ext4_expand_extra_isize() ext4: optimize __ext4_check_dir_entry() ext4: check for directory entries too close to block end ext4: fix ext4_empty_dir() for directories with holes	2019-12-22 10:41:48 -08:00
Jan Stancek	0dd1e3773a	pipe: fix empty pipe check in pipe_write() LTP pipeio_1 test is hanging with v5.5-rc2-385-gb8e382a185eb, with read side observing empty pipe and sleeping and write side running out of space and then sleeping as well. In this scenario there are 5 writers and 1 reader. Problem is that after pipe_write() reacquires pipe lock, it re-checks for empty pipe with potentially stale 'head' and doesn't wake up read side anymore. pipe->tail can advance beyond 'head', because there are multiple writers. Use pipe->head for empty pipe check after reacquiring lock to observe current state. Testing: With patch, LTP pipeio_1 ran successfully in loop for 1 hour. Without patch it hanged within a minute. Fixes: `1b6b26ae70` ("pipe: fix and clarify pipe write wakeup logic") Reported-by: Rachel Sibley <rasibley@redhat.com> Signed-off-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-12-22 09:47:47 -08:00
Yunfeng Ye	68d7b2d838	ext4: fix unused-but-set-variable warning in ext4_add_entry() Warning is found when compile with "-Wunused-but-set-variable": fs/ext4/namei.c: In function ‘ext4_add_entry’: fs/ext4/namei.c:2167:23: warning: variable ‘sbi’ set but not used [-Wunused-but-set-variable] struct ext4_sb_info *sbi; ^~~ Fix this by moving the variable @sbi under CONFIG_UNICODE. Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/cb5eb904-224a-9701-c38f-cb23514b1fff@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2019-12-21 21:00:53 -05:00
Linus Torvalds	f8f04d0859	io_uring-5.5-20191220 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl39CowQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpufZD/4t5p6e5S1GO915Y35+q8ooOjd7Ci4a+QJh mYV1KVlrvt+uPXSMHZoEj2JQhI3quEb2IHmUE5IydFBfIwJl2soD7mAsky2iQNaC ULMQFCW33vVnfz7WyuHwkEHmdgEuKg8OeGWVSMEsjrFqygHYSWR94wmqJiYKpkgm Klw4guyGzVfjHutxEDRM3QzHbHmy9xwSNDpJR9Vyr/s0GPOLCavpE71/1ztoc3mq UbolvEirXwUgGNArC/YyHhJAMM+lWNYplWBdGM3YrKzmV2oqKQY9+148IOWeV3Yl vmHLX0/s2WsbKnZPqE5DeuDc8X1fspJHcrQDn+BeM8w8TdGaSUHxgMIuyFulDzWr +3cDjVGaKo3J41xnX7u7v3ph8qnDjMz6k6o6IaQtWz7MCwzJKpCEyF/dJcDIJfxU 7gaOnP5ltDf5wJWzfsOCYFA3/CTL2mshdHD4lg0siEp7CksfX6BFGoLNqdk5qhuv 0md3X0nMkTSsXd09tRjYBQUaKOJ9NnvD63bGIcSshUAMZ2JQUOcFdPNfkSwb3jq+ OMFnz/t6C8VOnyLwBYleYr8r5bum80lVzwvDa4LZNGivyeD/ne1HO+22WA5xDwod 8yNm/hBhy5FGtoucQU2Vo2P9SiAil586PLz9HxAtD9eUOBTLQxVOOrv9x8vVhAW3 Jln/sNGwYg== =6pRI -----END PGP SIGNATURE----- Merge tag 'io_uring-5.5-20191220' of git://git.kernel.dk/linux-block Pull io_uring fixes from Jens Axboe: "Here's a set of fixes that should go into 5.5-rc3 for io_uring. This is bigger than I'd like it to be, mainly because we're fixing the case where an application reuses sqe data right after issue. This really must work, or it's confusing. With 5.5 we're flagging us as submit stable for the actual data, this must also be the case for SQEs. Honestly, I'd really like to add another series on top of this, since it cleans it up considerable and prevents any SQE reuse by design. I posted that here: https://lore.kernel.org/io-uring/20191220174742.7449-1-axboe@kernel.dk/T/#u and may still send it your way early next week once it's been looked at and had some more soak time (does pass all regression tests). With that series, we've unified the prep+issue handling, and only the prep phase even has access to the SQE. Anyway, outside of that, fixes in here for a few other issues that have been hit in testing or production" * tag 'io_uring-5.5-20191220' of git://git.kernel.dk/linux-block: io_uring: io_wq_submit_work() should not touch req->rw io_uring: don't wait when under-submitting io_uring: warn about unhandled opcode io_uring: read opcode and user_data from SQE exactly once io_uring: make IORING_OP_TIMEOUT_REMOVE deferrable io_uring: make IORING_OP_CANCEL_ASYNC deferrable io_uring: make IORING_POLL_ADD and IORING_POLL_REMOVE deferrable io_uring: make HARDLINK imply LINK io_uring: any deferred command must have stable sqe data io_uring: remove 'sqe' parameter to the OP helpers that take it io_uring: fix pre-prepped issue with force_nonblock == true io-wq: re-add io_wq_current_is_worker() io_uring: fix sporadic -EFAULT from IORING_OP_RECVMSG io_uring: fix stale comment and a few typos	2019-12-20 13:30:49 -08:00
Jens Axboe	3529d8c2b3	io_uring: pass in 'sqe' to the prep handlers This moves the prep handlers outside of the opcode handlers, and allows us to pass in the sqe directly. If the sqe is non-NULL, it means that the request should be prepared for the first time. With the opcode handlers not having access to the sqe at all, we are guaranteed that the prep handler has setup the request fully by the time we get there. As before, for opcodes that need to copy in more data then the io_kiocb allows for, the io_async_ctx holds that info. If a prep handler is invoked with req->io set, it must use that to retain information for later. Finally, we can remove io_kiocb->sqe as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-20 10:04:50 -07:00
Jens Axboe	06b76d44ba	io_uring: standardize the prep methods We currently have a mix of use cases. Most of the newer ones are pretty uniform, but we have some older ones that use different calling calling conventions. This is confusing. For the opcodes that currently rely on the req->io->sqe copy saving them from reuse, add a request type struct in the io_kiocb command union to store the data they need. Prepare for all opcodes having a standard prep method, so we can call it in a uniform fashion and outside of the opcode handler. This is in preparation for passing in the 'sqe' pointer, rather than storing it in the io_kiocb. Once we have uniform prep handlers, we can leave all the prep work to that part, and not even pass in the sqe to the opcode handler. This ensures that we don't reuse sqe data inadvertently. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-20 10:04:22 -07:00
Jens Axboe	26a61679f1	io_uring: read 'count' for IORING_OP_TIMEOUT in prep handler Add the count field to struct io_timeout, and ensure the prep handler has read it. Timeout also needs an async context always, set it up in the prep handler if we don't have one. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-20 09:55:33 -07:00
Jens Axboe	e47293fdf9	io_uring: move all prep state for IORING_OP_{SEND,RECV}_MGS to prep handler Add struct io_sr_msg in our io_kiocb per-command union, and ensure that the send/recvmsg prep handlers have grabbed what they need from the SQE by the time prep is done. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-20 09:55:23 -07:00
Jens Axboe	3fbb51c18f	io_uring: move all prep state for IORING_OP_CONNECT to prep handler Add struct io_connect in our io_kiocb per-command union, and ensure that io_connect_prep() has grabbed what it needs from the SQE. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-20 09:52:48 -07:00
Jens Axboe	9adbd45d6d	io_uring: add and use struct io_rw for read/writes Put the kiocb in struct io_rw, and add the addr/len for the request as well. Use the kiocb->private field for the buffer index for fixed reads and writes. Any use of kiocb->ki_filp is flipped to req->file. It's the same thing, and less confusing. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-20 09:52:45 -07:00
Chen Wandun	5084bf6b20	xfs: Make the symbol 'xfs_rtalloc_log_count' static Fix the following sparse warning: fs/xfs/libxfs/xfs_trans_resv.c:206:1: warning: symbol 'xfs_rtalloc_log_count' was not declared. Should it be static? Fixes: `b1de6fc752` ("xfs: fix log reservation overflows when allocating large rt extents") Signed-off-by: Chen Wandun <chenwandun@huawei.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2019-12-20 08:07:31 -08:00
Jens Axboe	d55e5f5b70	io_uring: use u64_to_user_ptr() consistently We use it in some spots, but not consistently. Convert the rest over, makes it easier to read as well. No functional changes in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-20 08:36:50 -07:00
Darrick J. Wong	13eaec4b2a	xfs: don't commit sunit/swidth updates to disk if that would cause repair failures Alex Lyakas reported[1] that mounting an xfs filesystem with new sunit and swidth values could cause xfs_repair to fail loudly. The problem here is that repair calculates the where mkfs should have allocated the root inode, based on the superblock geometry. The allocation decisions depend on sunit, which means that we really can't go updating sunit if it would lead to a subsequent repair failure on an otherwise correct filesystem. Port from xfs_repair some code that computes the location of the root inode and teach mount to skip the ondisk update if it would cause problems for repair. Along the way we'll update the documentation, provide a function for computing the minimum AGFL size instead of open-coding it, and cut down some indenting in the mount code. Note that we allow the mount to proceed (and new allocations will reflect this new geometry) because we've never screened this kind of thing before. We'll have to wait for a new future incompat feature to enforce correct behavior, alas. Note that the geometry reporting always uses the superblock values, not the incore ones, so that is what xfs_info and xfs_growfs will report. [1] https://lore.kernel.org/linux-xfs/20191125130744.GA44777@bfoster/T/#m00f9594b511e076e2fcdd489d78bc30216d72a7d Reported-by: Alex Lyakas <alex@zadara.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2019-12-19 07:53:48 -08:00
Darrick J. Wong	4f5b1b3a8f	xfs: split the sunit parameter update into two parts If the administrator provided a sunit= mount option, we need to validate the raw parameter, convert the mount option units (512b blocks) into the internal unit (fs blocks), and then validate that the (now cooked) parameter doesn't screw anything up on disk. The incore inode geometry computation can depend on the new sunit option, but a subsequent patch will make validating the cooked value depends on the computed inode geometry, so break the sunit update into two steps. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2019-12-19 07:53:48 -08:00
Darrick J. Wong	1cac233cfe	xfs: refactor agfl length computation function Refactor xfs_alloc_min_freelist to accept a NULL @pag argument, in which case it returns the largest possible minimum length. This will be used in an upcoming patch to compute the length of the AGFL at mkfs time. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2019-12-19 07:53:48 -08:00
Darrick J. Wong	af952aeb4a	libxfs: resync with the userspace libxfs Prepare to resync the userspace libxfs with the kernel libxfs. There were a few things I missed -- a couple of static inline directory functions that have to be exported for xfs_repair; a couple of directory naming functions that make porting much easier if they're /not/ static inline; and a u16 usage that should have been uint16_t. None of these things are bugs in their own right; this just makes porting xfsprogs easier. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>	2019-12-19 07:53:47 -08:00
Brian Foster	826f7e3413	xfs: use bitops interface for buf log item AIL flag check The xfs_log_item flags were converted to atomic bitops as of commit `22525c17ed` ("xfs: log item flags are racy"). The assert check for AIL presence in xfs_buf_item_relse() still uses the old value based check. This likely went unnoticed as XFS_LI_IN_AIL evaluates to 0 and causes the assert to unconditionally pass. Fix up the check. Signed-off-by: Brian Foster <bfoster@redhat.com> Fixes: `22525c17ed` ("xfs: log item flags are racy") Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2019-12-19 07:53:47 -08:00
Jens Axboe	fd6c2e4c06	io_uring: io_wq_submit_work() should not touch req->rw I've been chasing a weird and obscure crash that was userspace stack corruption, and finally narrowed it down to a bit flip that made a stack address invalid. io_wq_submit_work() unconditionally flips the req->rw.ki_flags IOCB_NOWAIT bit, but since it's a generic work handler, this isn't valid. Normal read/write operations own that part of the request, on other types it could be something else. Move the IOCB_NOWAIT clear to the read/write handlers where it belongs. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-18 12:19:41 -07:00
Pavel Begunkov	7c504e6520	io_uring: don't wait when under-submitting There is no reliable way to submit and wait in a single syscall, as io_submit_sqes() may under-consume sqes (in case of an early error). Then it will wait for not-yet-submitted requests, deadlocking the user in most cases. Don't wait/poll if can't submit all sqes Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-18 10:01:49 -07:00
Eric Sandeen	1edc8eb2e9	fs: call fsnotify_sb_delete after evict_inodes When a filesystem is unmounted, we currently call fsnotify_sb_delete() before evict_inodes(), which means that fsnotify_unmount_inodes() must iterate over all inodes on the superblock looking for any inodes with watches. This is inefficient and can lead to livelocks as it iterates over many unwatched inodes. At this point, SB_ACTIVE is gone and dropping refcount to zero kicks the inode out out immediately, so anything processed by fsnotify_sb_delete / fsnotify_unmount_inodes gets evicted in that loop. After that, the call to evict_inodes will evict everything else with a zero refcount. This should speed things up overall, and avoid livelocks in fsnotify_unmount_inodes(). Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-12-18 00:03:01 -05:00
Eric Sandeen	04646aebd3	fs: avoid softlockups in s_inodes iterators Anything that walks all inodes on sb->s_inodes list without rescheduling risks softlockups. Previous efforts were made in 2 functions, see: `c27d82f` fs/drop_caches.c: avoid softlockups in drop_pagecache_sb() `ac05fbb` inode: don't softlockup when evicting inodes but there hasn't been an audit of all walkers, so do that now. This also consistently moves the cond_resched() calls to the bottom of each loop in cases where it already exists. One loop remains: remove_dquot_ref(), because I'm not quite sure how to deal with that one w/o taking the i_lock. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-12-18 00:03:01 -05:00
Jens Axboe	e781573e2f	io_uring: warn about unhandled opcode Now that we have all the opcodes handled in terms of command prep and SQE reuse, add a printk_once() to warn about any potentially new and unhandled ones. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-17 19:57:27 -07:00
Jens Axboe	d625c6ee49	io_uring: read opcode and user_data from SQE exactly once If we defer a request, we can't be reading the opcode again. Ensure that the user_data and opcode fields are stable. For the user_data we already have a place for it, for the opcode we can fill a one byte hold and store that as well. For both of them, assign them when we originally read the SQE in io_get_sqring(). Any code that uses sqe->opcode or sqe->user_data is switched to req->opcode and req->user_data. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-17 19:57:27 -07:00
Jens Axboe	b29472ee7b	io_uring: make IORING_OP_TIMEOUT_REMOVE deferrable If we defer this command as part of a link, we have to make sure that the SQE data has been read upfront. Integrate the timeout remove op into the prep handling to make it safe for SQE reuse. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-17 19:57:27 -07:00
Jens Axboe	fbf23849b1	io_uring: make IORING_OP_CANCEL_ASYNC deferrable If we defer this command as part of a link, we have to make sure that the SQE data has been read upfront. Integrate the async cancel op into the prep handling to make it safe for SQE reuse. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-17 19:57:27 -07:00
Jens Axboe	0969e783e3	io_uring: make IORING_POLL_ADD and IORING_POLL_REMOVE deferrable If we defer these commands as part of a link, we have to make sure that the SQE data has been read upfront. Integrate the poll add/remove into the prep handling to make it safe for SQE reuse. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-17 19:57:27 -07:00
Pavel Begunkov	ffbb8d6b76	io_uring: make HARDLINK imply LINK The rules are as follows, if IOSQE_IO_HARDLINK is specified, then it's a link and there is no need to set IOSQE_IO_LINK separately, though it could be there. Add proper check and ensure that IOSQE_IO_HARDLINK implies IOSQE_IO_LINK. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-17 19:57:27 -07:00
Jens Axboe	8ed8d3c3bc	io_uring: any deferred command must have stable sqe data We're currently not retaining sqe data for accept, fsync, and sync_file_range. None of these commands need data outside of what is directly provided, hence it can't go stale when the request is deferred. However, it can get reused, if an application reuses SQE entries. Ensure that we retain the information we need and only read the sqe contents once, off the submission path. Most of this is just moving code into a prep and finish function. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-17 19:57:20 -07:00
Jens Axboe	fc4df999e2	io_uring: remove 'sqe' parameter to the OP helpers that take it We pass in req->sqe for all of them, no need to pass it in as the request is always passed in. This is a necessary prep patch to be able to cleanup/fix the request prep path. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-17 19:57:20 -07:00
Jens Axboe	b7bb4f7da0	io_uring: fix pre-prepped issue with force_nonblock == true Some of these code paths assume that any force_nonblock == true issue is not prepped, but that's not true if we did prep as part of link setup earlier. Check if we already have an async context allocate before setting up a new one. Cleanup the async context setup in general, we have a lot of duplicated code there. Fixes: `03b1230ca1` ("io_uring: ensure async punted sendmsg/recvmsg requests copy data") Fixes: `f67676d160` ("io_uring: ensure async punted read/write requests copy iovec") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-17 19:57:20 -07:00
Jens Axboe	525b305d61	io-wq: re-add io_wq_current_is_worker() This reverts commit `8cdda87a44`, we now have several use csaes for this helper. Reinstate it. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-17 19:57:20 -07:00
Linus Torvalds	2187f215eb	for-5.5-rc2-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl35CAYACgkQxWXV+ddt WDur/w//S98RvSZYMW5y2u+bPGe8sCpXwu5Sr87hTd14We8cBWj8684npUmSk7Dz rTRSjcf9EQe5dGoiHOzpKU0HcsLKy9DVTPigvVbmsWZfT9mqS6Y8wAKMw/7UUvyy n7aZk/yQGRow3gZ/Z/aF23JypRoDJK7DPbSMKUW164BnD5rCCyr+VdA8V+CwHgVh UN6UG0KMDbDKS4501DsX8418pcJN+a+Jo4oBGwN/guKRjK1oNcrhj34DNhvXlaOV Rlu7HcVtfHNDS/xD3DZS9mDIiycJ6qHkvC3hUsEmlKRoPEm1leVxTDLDf78oEy9H TrvH71hbvYjxaOU4YQbJG8ky+VwFfiV0Vrj73GgdEeRRDuMbYwUyFI5gYQOji8fS DuYdJGyslOqQovpii+jrPiT1TPG+97R4+qKH2DfOW1xUChYsbQHt7FOfzUbLe0JE dev9zV6MRqZ1qf70+Wt2LuWYFefpg9KVnsn8mcjoBwz9s9uImzLgpI90+DMPKOaU TizwJK3W5K3YLhqPHwPLvqVxKwVOzu00v01xl/bjTuyp982oPSCj3fj+FprGV1la OkqOYbKe2ZqEkQpINDu8I58oydTKywZGVsUl4ldJlcSY1hEDFCyeoAFmixaJbRbQ IdBcQnjD7qgvu9E4cA0kL8Ma1op2+1zw8sUOdXKFIDiNEqL5FPs= =AnHL -----END PGP SIGNATURE----- Merge tag 'for-5.5-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A mix of regression fixes and regular fixes for stable trees: - fix swapped error messages for qgroup enable/rescan - fixes for NO_HOLES feature with clone range - fix deadlock between iget/srcu lock/synchronize srcu while freeing an inode - fix double lock on subvolume cross-rename - tree log fixes * fix missing data checksums after replaying a log tree * also teach tree-checker about this problem * skip log replay on orphaned roots - fix maximum devices constraints for RAID1C -3 and -4 - send: don't print warning on read-only mount regarding orphan cleanup - error handling fixes" * tag 'for-5.5-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: send: remove WARN_ON for readonly mount btrfs: do not leak reloc root if we fail to read the fs root btrfs: skip log replay on orphaned roots btrfs: handle ENOENT in btrfs_uuid_tree_iterate btrfs: abort transaction after failed inode updates in create_subvol Btrfs: fix hole extent items with a zero size after range cloning Btrfs: fix removal logic of the tree mod log that leads to use-after-free issues Btrfs: make tree checker detect checksum items with overlapping ranges Btrfs: fix missing data checksums after replaying a log tree btrfs: return error pointer from alloc_test_extent_buffer btrfs: fix devs_max constraints for raid1c3 and raid1c4 btrfs: tree-checker: Fix error format string for size_t btrfs: don't double lock the subvol_sem for rename exchange btrfs: handle error in btrfs_cache_block_group btrfs: do not call synchronize_srcu() in inode_tree_del Btrfs: fix cloning range with a hole when using the NO_HOLES feature btrfs: Fix error messages in qgroup_rescan_init	2019-12-17 13:27:02 -08:00
Darrick J. Wong	b1de6fc752	xfs: fix log reservation overflows when allocating large rt extents Omar Sandoval reported that a 4G fallocate on the realtime device causes filesystem shutdowns due to a log reservation overflow that happens when we log the rtbitmap updates. Factor rtbitmap/rtsummary updates into the the tr_write and tr_itruncate log reservation calculation. "The following reproducer results in a transaction log overrun warning for me: mkfs.xfs -f -r rtdev=/dev/vdc -d rtinherit=1 -m reflink=0 /dev/vdb mount -o rtdev=/dev/vdc /dev/vdb /mnt fallocate -l 4G /mnt/foo Reported-by: Omar Sandoval <osandov@osandov.com> Tested-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2019-12-17 11:19:28 -08:00
Linus Torvalds	4340ebd19f	Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "Fix the guest-nice cpustat values in /proc" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/cputime, proc/stat: Fix incorrect guest nice cpustat value	2019-12-17 11:09:05 -08:00
Linus Torvalds	6afa873170	linux-kselftest-5.5-rc2 This Kselftest fixes update for Linux 5.5-rc2 consists of -- ftrace and safesetid test fixes from Masami Hiramatsu -- Kunit fixes from Brendan Higgins, Iurii Zaikin, and Heidi Fahim -- Kselftest framework fixes from SeongJae Park and Michael Ellerman -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEPZKym/RZuOCGeA/kCwJExA0NQxwFAl3zxBUACgkQCwJExA0N QxxeyhAAgCPilGbQEjr3mJk9rHLpBlDcHF783zrKS538ymVWDcMqxWgW9WOY7RKb LKli4Q3SDhWPzxiH4dcNklkIld6WaNaehIwhYCykAxrWnOKmQQ1i8/4+D6KPwGhp W7do/g8ZITYJYJgYieoABC5W4rThFyIR+uAVCDyf5nP+nQrJlgPfsq2ClBvRxzep QhanBPlweQSHVLBMATijUETFHqoIvx6bL8emolY9x6qbCPrTcvVKqW+Va2K24TqP dJGPm5OctSHD2RP4clKMfx3dbwabQR0JuDKdh3F/jO89h+1/Pku5YboZCDezbNp9 2oKXjDXniZHKmuWVgzh7ix/5y6FfpGpck4+9PhaNpCd/pIJ2ZtrZNd+ct72JA2yr zGIWWtj5y6Ggw7NkWloRsVTmQFAYsWBIJS8CqC+2aypFfWZpRFaDWSBcBifRsvVc 3F1L/uQyrgeJx5XNTe028i7eLmvQ1a4RqHUxQIt795lnQmygeLHffx4R+K/uw8XD 0eKtjV3HYR/FuRXEB1A6WH3eLQ4b1mmcx2aV5e6mbUk+QezPRMnJr2E6+dE6XH63 2ipJHfDQmKakrieidt5LCYTy9+VzlFj2TOrKiLLwUPmjPJv2AASfXAYwlTYpsbMi bymTZcJkVsGnCi3jNfK/1SBBNRUVUYTeN8SWq/6ublgIs4BMYe0= =MJZj -----END PGP SIGNATURE----- Merge tag 'linux-kselftest-5.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest Pull kselftest fixes from Shuah Khan: - ftrace and safesetid test fixes from Masami Hiramatsu - Kunit fixes from Brendan Higgins, Iurii Zaikin, and Heidi Fahim - Kselftest framework fixes from SeongJae Park and Michael Ellerman * tag 'linux-kselftest-5.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: kselftest: Support old perl versions kselftest/runner: Print new line in print of timeout log selftests: Fix dangling documentation references to kselftest_module.sh Documentation: kunit: add documentation for kunit_tool Documentation: kunit: fix typos and gramatical errors kunit: testing kunit: Bug fix in test_run_timeout function fs/ext4/inode-test: Fix inode test on 32 bit platforms. selftests: safesetid: Fix Makefile to set correct test program selftests: safesetid: Check the return value of setuid/setgid selftests: safesetid: Move link library to LDLIBS selftests/ftrace: Fix multiple kprobe testcase selftests/ftrace: Do not to use absolute debugfs path selftests/ftrace: Fix ftrace test cases to check unsupported selftests/ftrace: Fix to check the existence of set_ftrace_filter	2019-12-16 10:06:04 -08:00
Jens Axboe	0b416c3e13	io_uring: fix sporadic -EFAULT from IORING_OP_RECVMSG If we have to punt the recvmsg to async context, we copy all the context. But since the iovec used can be either on-stack (if small) or dynamically allocated, if it's on-stack, then we need to ensure we reset the iov pointer. If we don't, then we're reusing old stack data, and that can lead to -EFAULTs if things get overwritten. Ensure we retain the right pointers for the iov, and free it as well if we end up having to go beyond UIO_FASTIOV number of vectors. Fixes: `03b1230ca1` ("io_uring: ensure async punted sendmsg/recvmsg requests copy data") Reported-by: 李通洲 <carter.li@eoitek.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-15 22:12:47 -07:00
Phong Tran	69000d82ee	ext4: use RCU API in debug_print_tree struct ext4_sb_info.system_blks was marked __rcu. But access the pointer without using RCU lock and dereference. Sparse warning with __rcu notation: block_validity.c:139:29: warning: incorrect type in argument 1 (different address spaces) block_validity.c:139:29: expected struct rb_root const * block_validity.c:139:29: got struct rb_root [noderef] <asn:4> * Link: https://lore.kernel.org/r/20191213153306.30744-1-tranmanphong@gmail.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Phong Tran <tranmanphong@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2019-12-15 21:41:04 -05:00
Theodore Ts'o	9803387c55	ext4: validate the debug_want_extra_isize mount option at parse time Instead of setting s_want_extra_size and then making sure that it is a valid value afterwards, validate the field before we set it. This avoids races and other problems when remounting the file system. Link: https://lore.kernel.org/r/20191215063020.GA11512@mit.edu Cc: stable@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reported-and-tested-by: syzbot+4a39a025912b265cacef@syzkaller.appspotmail.com	2019-12-15 18:05:20 -05:00
Brian Gianforcaro	d195a66e36	io_uring: fix stale comment and a few typos - Fix a few typos found while reading the code. - Fix stale io_get_sqring comment referencing s->sqe, the 's' parameter was renamed to 'req', but the comment still holds. Signed-off-by: Brian Gianforcaro <b.gianfo@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-15 14:49:30 -07:00
Linus Torvalds	2e6d304515	Merge branch 'remove-ksys-mount-dup' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux Pull ksys_mount() and ksys_dup() removal from Dominik Brodowski: "This small series replaces all in-kernel calls to the userspace-focused ksys_mount() and ksys_dup() with calls to kernel-centric functions: For each replacement of ksys_mount() with do_mount(), one needs to verify that the first and third parameter (char dev_name, char type) are strings allocated in kernelspace and that the fifth parameter (void data) is either NULL or refers to a full page (only occurence in init/do_mounts.c::do_mount_root()). The second and fourth parameters (char dir_name, unsigned long flags) are passed by ksys_mount() to do_mount() unchanged, and therefore do not require particular care. Moreover, instead of pretending to be userspace, the opening of /dev/console as stdin/stdout/stderr can be implemented using in-kernel functions as well. Thereby, ksys_dup() can be removed for good" [ This doesn't get rid of the special "kernel init runs with KERNEL_DS" case, but it at least removes _some_ of the users of "treat kernel pointers as user pointers for our magical init sequence". One day we'll hopefully be rid of it all, and can initialize our init_thread addr_limit to USER_DS. - Linus ] * 'remove-ksys-mount-dup' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux: fs: remove ksys_dup() init: unify opening /dev/console as stdin/stdout/stderr init: use do_mount() instead of ksys_mount() initrd: use do_mount() instead of ksys_mount() devtmpfs: use do_mount() instead of ksys_mount()	2019-12-15 11:36:12 -08:00
yangerkun	a70fd5ac2e	ext4: reserve revoke credits in __ext4_new_inode It's possible that __ext4_new_inode will release the xattr block, so it will trigger a warning since there is revoke credits will be 0 if the handle == NULL. The below scripts can reproduce it easily. ------------[ cut here ]------------ WARNING: CPU: 0 PID: 3861 at fs/jbd2/revoke.c:374 jbd2_journal_revoke+0x30e/0x540 fs/jbd2/revoke.c:374 ... __ext4_forget+0x1d7/0x800 fs/ext4/ext4_jbd2.c:248 ext4_free_blocks+0x213/0x1d60 fs/ext4/mballoc.c:4743 ext4_xattr_release_block+0x55b/0x780 fs/ext4/xattr.c:1254 ext4_xattr_block_set+0x1c2c/0x2c40 fs/ext4/xattr.c:2112 ext4_xattr_set_handle+0xa7e/0x1090 fs/ext4/xattr.c:2384 __ext4_set_acl+0x54d/0x6c0 fs/ext4/acl.c:214 ext4_init_acl+0x218/0x2e0 fs/ext4/acl.c:293 __ext4_new_inode+0x352a/0x42b0 fs/ext4/ialloc.c:1151 ext4_mkdir+0x2e9/0xbd0 fs/ext4/namei.c:2774 vfs_mkdir+0x386/0x5f0 fs/namei.c:3811 do_mkdirat+0x11c/0x210 fs/namei.c:3834 do_syscall_64+0xa1/0x530 arch/x86/entry/common.c:294 ... ------------------------------------- scripts: mkfs.ext4 /dev/vdb mount /dev/vdb /mnt cd /mnt && mkdir dir && for i in {1..8}; do setfacl -dm "u:user_"$i":rx" dir; done mkdir dir/dir1 && mv dir/dir1 ./ sh repro.sh && add some user [root@localhost ~]# cat repro.sh while [ 1 -eq 1 ]; do rm -rf dir rm -rf dir1/dir1 mkdir dir for i in {1..8}; do setfacl -dm "u:test"$i":rx" dir; done setfacl -m "u:user_9:rx" dir & mkdir dir1/dir1 & done Before exec repro.sh, dir1 has inherit the default acl from dir, and xattr block of dir1 dir is not the same, so the h_refcount of these two dir's xattr block will be 1. Then repro.sh can trigger the warning with the situation show as below. The last h_refcount can be clear with mkdir, and __ext4_new_inode has not reserved revoke credits, so the warning will happened, fix it by reserve revoke credits in __ext4_new_inode. Thread 1 Thread 2 mkdir dir set default acl(will create a xattr block blk1 and the refcount of ext4_xattr_header will be 1) ... mkdir dir1/dir1 ->....->ext4_init_acl ->__ext4_set_acl(set default acl, will reuse blk1, and h_refcount will be 2) setfacl->ext4_set_acl->... ->ext4_xattr_block_set(will create new block blk2 to store xattr) ->__ext4_set_acl(set access acl, since h_refcount of blk1 is 2, will create blk3 to store xattr) ->ext4_xattr_release_block(dec h_refcount of blk1 to 1) ->ext4_xattr_release_block(dec h_refcount and since it is 0, will release the block and trigger the warning) Link: https://lore.kernel.org/r/20191213014900.47228-1-yangerkun@huawei.com Reported-by: Hulk Robot <hulkci@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: yangerkun <yangerkun@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2019-12-14 17:47:13 -05:00
Dan Carpenter	7f420d64a0	ext4: unlock on error in ext4_expand_extra_isize() We need to unlock the xattr before returning on this error path. Cc: stable@kernel.org # 4.13 Fixes: `c03b45b853` ("ext4, project: expand inode extra size if possible") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Link: https://lore.kernel.org/r/20191213185010.6k7yl2tck3wlsdkt@kili.mountain Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2019-12-14 17:31:23 -05:00
Theodore Ts'o	707d1a2f60	ext4: optimize __ext4_check_dir_entry() Make __ext4_check_dir_entry() a bit easier to understand, and reduce the object size of the function by over 11%. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20191209004346.38526-1-tytso@mit.edu Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2019-12-14 17:23:14 -05:00
Jan Kara	109ba779d6	ext4: check for directory entries too close to block end ext4_check_dir_entry() currently does not catch a case when a directory entry ends so close to the block end that the header of the next directory entry would not fit in the remaining space. This can lead to directory iteration code trying to access address beyond end of current buffer head leading to oops. CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191202170213.4761-3-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2019-12-14 17:22:45 -05:00
Jan Kara	64d4ce8923	ext4: fix ext4_empty_dir() for directories with holes Function ext4_empty_dir() doesn't correctly handle directories with holes and crashes on bh->b_data dereference when bh is NULL. Reorganize the loop to use 'offset' variable all the times instead of comparing pointers to current direntry with bh->b_data pointer. Also add more strict checking of '.' and '..' directory entries to avoid entering loop in possibly invalid state on corrupted filesystems. References: CVE-2019-19037 CC: stable@vger.kernel.org Fixes: `4e19d6b65f` ("ext4: allow directory holes") Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191202170213.4761-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2019-12-14 17:22:45 -05:00
Linus Torvalds	103a022d6b	3 small smb3 fixes -----BEGIN PGP SIGNATURE----- iQGzBAABCAAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl30eFcACgkQiiy9cAdy T1FrCAv+Lwq/7pwyzoS0F1DWXJfs+3a2KCR2QSnZE9Mok8AobvRnNS9dZcXZdp0U RnSFfkv9ia5vXegO0899nM6n0jtS5/OCex9RITVBtvIk8HLKrpcmYP+gS3III5Qq oMF11JZCWDI2HHJA6xEV677rgFjOiEDjtjaZrXOS2TClnBCU3ZDRghul46DKACbA xQDr0ifgOcDdxKBSTERhGvKA6xOf+gPP73SKB5Kg6OaPWL9FjhGvN/1ic2LVmFHF cc7rbXhl1jTnKfw22qrS5XsElBSQdbM7X23CJ9ik8zXpF8gLBjGefx894xyFLELb efJcHC1vroBc11HfLaLmAoQRp7leDIP4Icyq2TqJC6+mWsxJ0G96ofn7tGR9z/FK SlcggWkrC8frJQQoYMYylQcknXK9+WahyyNswMXFiUYU7XSojyx4MoBtvRavOI5l JKJtoiZd3OPtXOjvnKzNqSFcHC3YNQDE9GmIOXHgRuhFcnO0UQOKvbwXik2HCXBp 7A1dxRob =65tY -----END PGP SIGNATURE----- Merge tag '5.5-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6 Pull cfis fixes from Steve French: "Three small smb3 fixes: this addresses two recent issues reported in additional testing during rc1, a refcount underflow and a problem with an intermittent crash in SMB2_open_init" * tag '5.5-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: CIFS: Close cached root handle only if it has a lease SMB3: Fix crash in SMB2_open_init due to uninitialized field in compounding path smb3: fix refcount underflow warning on unmount when no directory leases	2019-12-14 11:44:14 -08:00
Linus Torvalds	81c64b0bd0	overlayfs fixes for 5.5-rc2 -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXfNhGQAKCRDh3BK/laaZ PGSEAP9Nyv3XCN2wdqMLdrgn07B3Pk9w2Unf3Y5amKOxNXqyQwEAy2/E6DCiGjSa WRheJoTgDSeqUQNY6GFHsCIgLWOCHgs= =WH5O -----END PGP SIGNATURE----- Merge tag 'ovl-fixes-5.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs fixes from Miklos Szeredi: "Fix some bugs and documentation" * tag 'ovl-fixes-5.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: docs: filesystems: overlayfs: Fix restview warnings docs: filesystems: overlayfs: Rename overlayfs.txt to .rst ovl: relax WARN_ON() on rename to self ovl: fix corner case of non-unique st_dev;st_ino ovl: don't use a temp buf for encoding real fh ovl: make sure that real fid is 32bit aligned in memory ovl: fix lookup failure on multi lower squashfs	2019-12-14 11:13:54 -08:00
Linus Torvalds	5bd831a469	io_uring-5.5-20191212 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3y5kIQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpv0MEADY9LOC2JkG2aNP53gCXGu74rFQNstJfusr xPpkMs5I7hxoeVp/bkVAvR6BbpfPDsyGMhSMURELgMFzkuw+03z2IbxVuexdGOEu X1+6EkunAq/r331fXL+cXUKJM3ThlkUBbNTuV8z5oWWSM1EfoBAWYfh2u4L4oBcc yIHRH8by9qipx+fhBWPU/ZWeCcMVt5Ju2b6PHAE/GWwteZh00PsTwq0oqQcVcm/5 4Ennr0OELb7LaZWblAIIZe96KeJuCPzGfbRV/y12Ne/t6STH/sWA1tMUzgT22Eju 27uXtwAJtoLsep4V66a4VYObs/hXmEP281wIsXEnIkX0YCvbAiM+J6qVHGjYSMGf mRrzfF6nDC1vaMduE4waoO3VFiDFQU/qfQRa21ZP50dfQOAXTpEz8kJNG/PjN1VB gmz9xsujDU1QPD7IRiTreiPPHE5AocUzlUYuYOIMt7VSbFZIMHW5S/pML93avkJt nx6g3gOP0JKkjaCEvIKY4VLljDT8eDZ/WScDsSedSbPZikMzEo8DSx4U6nbGPqzy qSljgQniDcrH8GdRaJFDXgLkaB8pu83NH7zUH+xioUZjAHq/XEzKQFYqysf1DbtU SEuSnUnLXBvbwfb3Z2VpQ/oz8G3a0nD9M7oudt1sN19oTCJKxYSbOUgNUrj9JsyQ QlAVrPKRkQ== =fbsf -----END PGP SIGNATURE----- Merge tag 'io_uring-5.5-20191212' of git://git.kernel.dk/linux-block Pull io_uring fixes from Jens Axboe: - A tweak to IOSQE_IO_LINK (also marked for stable) to allow links that don't sever if the result is < 0. This is mostly for linked timeouts, where if we ask for a pure timeout we always get -ETIME. This makes links useless for that case, hence allow a case where it works. - Five minor optimizations to fix and improve cases that regressed since v5.4. - An SQTHREAD locking fix. - A sendmsg/recvmsg iov assignment fix. - Net fix where read_iter/write_iter don't honor IOCB_NOWAIT, and subsequently ensuring that works for io_uring. - Fix a case where for an invalid opcode we might return -EBADF instead of -EINVAL, if the ->fd of that sqe was set to an invalid fd value. * tag 'io_uring-5.5-20191212' of git://git.kernel.dk/linux-block: io_uring: ensure we return -EINVAL on unknown opcode io_uring: add sockets to list of files that support non-blocking issue net: make socket read/write_iter() honor IOCB_NOWAIT io_uring: only hash regular files for async work execution io_uring: run next sqe inline if possible io_uring: don't dynamically allocate poll data io_uring: deferred send/recvmsg should assign iov io_uring: sqthread should grab ctx->uring_lock for submissions io-wq: briefly spin for new work after finishing work io-wq: remove worker->wait waitqueue io_uring: allow unbreakable links	2019-12-13 14:24:54 -08:00
Linus Torvalds	22ff311af9	treewide conversion from FIELD_SIZEOF() to sizeof_field() -----BEGIN PGP SIGNATURE----- Comment: Kees Cook <kees@outflux.net> iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAl3umDgWHGtlZXNjb29r QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJlvsD/49R12HK7UzTxNTrcpvbadJ4t7j j/qJvjMerW7iVNAPOoNAOePUa21+y3rI1AZPvoPyzIqp1Bf2eOICf5SdisG2cG+O X0A8EKWvS0SSQWSKaT6udUKJ3nBJItwvOvQ5B58KQzcOj3S4X7B9iVBWgieMHrzz urkZm7pqowrZB3wuF8keRtli5IZaoiCwzApy48Qrn70G3OeXymknFbpHTDwIAiGw RiE5Xh0R4EzQdsYyCgjR8U56gBchadAmj8BUJU0ppMnOFMyIAG670hNLrs0L3roP 8TOIeyb993ZC5GZaMlnR8mz0jfibfkPa3Z85VAsVyQSPaOQldwc9j8TGBqD5Gfat 1PjOU5RVwma0pH5xTPOeevWPQpIK9KovQpQYqMMN9GMxOEx96IOUjwTrnNK2xWoN UGyOVlESFGoniClhCiKYzPSrYOjlIBk5ovf15PdTe+bwyUDMfyfy5CZV88OS2DHz ZBZvpLrH/EMW9zJ+FqMTp0C4s4wa2Ioid3bSh6XuNUTtltKSjp71eUja8ZEz+2sd 5AGstCC+hYqxaEk+6/851pfkQ9sbBjwuGtNrtX+pqreiLUvWLhQ0yUj6cLXlEQNH aucjCukCjI+4lMzofeaQ2LbNhtff4YsfO4b1Ye8maoDdHjzUVL57n3bTOxKhdzbt y6FM3lApOjk3OyaTJQ== =YU4A -----END PGP SIGNATURE----- Merge tag 'sizeof_field-v5.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull FIELD_SIZEOF conversion from Kees Cook: "A mostly mechanical treewide conversion from FIELD_SIZEOF() to sizeof_field(). This avoids the redundancy of having 2 macros (actually 3) doing the same thing, and consolidates on sizeof_field(). While "field" is not an accurate name, it is the common name used in the kernel, and doesn't result in any unintended innuendo. As there are still users of FIELD_SIZEOF() in -next, I will clean up those during this coming development cycle and send the final old macro removal patch at that time" * tag 'sizeof_field-v5.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: treewide: Use sizeof_field() macro MIPS: OCTEON: Replace SIZEOF_FIELD() macro	2019-12-13 14:02:12 -08:00
Anand Jain	fbd542971a	btrfs: send: remove WARN_ON for readonly mount We log warning if root::orphan_cleanup_state is not set to ORPHAN_CLEANUP_DONE in btrfs_ioctl_send(). However if the filesystem is mounted as readonly we skip the orphan item cleanup during the lookup and root::orphan_cleanup_state remains at the init state 0 instead of ORPHAN_CLEANUP_DONE (2). So during send in btrfs_ioctl_send() we hit the warning as below. WARN_ON(send_root->orphan_cleanup_state != ORPHAN_CLEANUP_DONE); WARNING: CPU: 0 PID: 2616 at /Volumes/ws/btrfs-devel/fs/btrfs/send.c:7090 btrfs_ioctl_send+0xb2f/0x18c0 [btrfs] :: RIP: 0010:btrfs_ioctl_send+0xb2f/0x18c0 [btrfs] :: Call Trace: :: _btrfs_ioctl_send+0x7b/0x110 [btrfs] btrfs_ioctl+0x150a/0x2b00 [btrfs] :: do_vfs_ioctl+0xa9/0x620 ? __fget+0xac/0xe0 ksys_ioctl+0x60/0x90 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x49/0x130 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reproducer: mkfs.btrfs -fq /dev/sdb mount /dev/sdb /btrfs btrfs subvolume create /btrfs/sv1 btrfs subvolume snapshot -r /btrfs/sv1 /btrfs/ss1 umount /btrfs mount -o ro /dev/sdb /btrfs btrfs send /btrfs/ss1 -f /tmp/f The warning exists because having orphan inodes could confuse send and cause it to fail or produce incorrect streams. The two cases that would cause such send failures, which are already fixed are: 1) Inodes that were unlinked - these are orphanized and remain with a link count of 0. These caused send operations to fail because it expected to always find at least one path for an inode. However this is no longer a problem since send is now able to deal with such inodes since commit `46b2f4590a` ("Btrfs: fix send failure when root has deleted files still open") and treats them as having been completely removed (the state after an orphan cleanup is performed). 2) Inodes that were in the process of being truncated. These resulted in send not knowing about the truncation and potentially issue write operations full of zeroes for the range from the new file size to the old file size. This is no longer a problem because we no longer create orphan items for truncation since commit `f7e9e8fc79` ("Btrfs: stop creating orphan items for truncate"). As such before these commits, the WARN_ON here provided a clue in case something went wrong. Instead of being a warning against the root::orphan_cleanup_state value, it could have been more accurate by checking if there were actually any orphan items, and then issue a warning only if any exists, but that would be more expensive to check. Since orphanized inodes no longer cause problems for send, just remove the warning. Reported-by: Christoph Anton Mitterer <calestyo@scientia.net> Link: https://lore.kernel.org/linux-btrfs/21cb5e8d059f6e1496a903fa7bfc0a297e2f5370.camel@scientia.net/ CC: stable@vger.kernel.org # 4.19+ Suggested-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-12-13 14:10:46 +01:00
Josef Bacik	ca1aa2818a	btrfs: do not leak reloc root if we fail to read the fs root If we fail to read the fs root corresponding with a reloc root we'll just break out and free the reloc roots. But we remove our current reloc_root from this list higher up, which means we'll leak this reloc_root. Fix this by adding ourselves back to the reloc_roots list so we are properly cleaned up. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-12-13 14:10:45 +01:00
Josef Bacik	9bc574de59	btrfs: skip log replay on orphaned roots My fsstress modifications coupled with generic/475 uncovered a failure to mount and replay the log if we hit a orphaned root. We do not want to replay the log for an orphan root, but it's completely legitimate to have an orphaned root with a log attached. Fix this by simply skipping replaying the log. We still need to pin it's root node so that we do not overwrite it while replaying other logs, as we re-read the log root at every stage of the replay. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-12-13 14:10:45 +01:00
Josef Bacik	714cd3e8cb	btrfs: handle ENOENT in btrfs_uuid_tree_iterate If we get an -ENOENT back from btrfs_uuid_iter_rem when iterating the uuid tree we'll just continue and do btrfs_next_item(). However we've done a btrfs_release_path() at this point and no longer have a valid path. So increment the key and go back and do a normal search. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-12-13 14:10:45 +01:00
Josef Bacik	c7e54b5102	btrfs: abort transaction after failed inode updates in create_subvol We can just abort the transaction here, and in fact do that for every other failure in this function except these two cases. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-12-13 14:10:44 +01:00
Filipe Manana	147271e35b	Btrfs: fix hole extent items with a zero size after range cloning Normally when cloning a file range if we find an implicit hole at the end of the range we assume it is because the NO_HOLES feature is enabled. However that is not always the case. One well known case [1] is when we have a power failure after mixing buffered and direct IO writes against the same file. In such cases we need to punch a hole in the destination file, and if the NO_HOLES feature is not enabled, we need to insert explicit file extent items to represent the hole. After commit `690a5dbfc5` ("Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning extents"), we started to insert file extent items representing the hole with an item size of 0, which is invalid and should be 53 bytes (the size of a btrfs_file_extent_item structure), resulting in all sorts of corruptions and invalid memory accesses. This is detected by the tree checker when we attempt to write a leaf to disk. The problem can be sporadically triggered by test case generic/561 from fstests. That test case does not exercise power failure and creates a new filesystem when it starts, so it does not use a filesystem created by any previous test that tests power failure. However the test does both buffered and direct IO writes (through fsstress) and it's precisely that which is creating the implicit holes in files. That happens even before the commit mentioned earlier. I need to investigate why we get those implicit holes to check if there is a real problem or not. For now this change fixes the regression of introducing file extent items with an item size of 0 bytes. Fix the issue by calling btrfs_punch_hole_range() without passing a btrfs_clone_extent_info structure, which ensures file extent items are inserted to represent the hole with a correct item size. We were passing a btrfs_clone_extent_info with a value of 0 for its 'item_size' field, which was causing the insertion of file extent items with an item size of 0. [1] https://www.spinics.net/lists/linux-btrfs/msg75350.html Reported-by: David Sterba <dsterba@suse.com> Fixes: `690a5dbfc5` ("Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning extents") Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-12-13 14:10:28 +01:00
Filipe Manana	6609fee889	Btrfs: fix removal logic of the tree mod log that leads to use-after-free issues When a tree mod log user no longer needs to use the tree it calls btrfs_put_tree_mod_seq() to remove itself from the list of users and delete all no longer used elements of the tree's red black tree, which should be all elements with a sequence number less then our equals to the caller's sequence number. However the logic is broken because it can delete and free elements from the red black tree that have a sequence number greater then the caller's sequence number: 1) At a point in time we have sequence numbers 1, 2, 3 and 4 in the tree mod log; 2) The task which got assigned the sequence number 1 calls btrfs_put_tree_mod_seq(); 3) Sequence number 1 is deleted from the list of sequence numbers; 4) The current minimum sequence number is computed to be the sequence number 2; 5) A task using sequence number 2 is at tree_mod_log_rewind() and gets a pointer to one of its elements from the red black tree through a call to tree_mod_log_search(); 6) The task with sequence number 1 iterates the red black tree of tree modification elements and deletes (and frees) all elements with a sequence number less then or equals to 2 (the computed minimum sequence number) - it ends up only leaving elements with sequence numbers of 3 and 4; 7) The task with sequence number 2 now uses the pointer to its element, already freed by the other task, at __tree_mod_log_rewind(), resulting in a use-after-free issue. When CONFIG_DEBUG_PAGEALLOC=y it produces a trace like the following: [16804.546854] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI [16804.547451] CPU: 0 PID: 28257 Comm: pool Tainted: G W 5.4.0-rc8-btrfs-next-51 #1 [16804.548059] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014 [16804.548666] RIP: 0010:rb_next+0x16/0x50 (...) [16804.550581] RSP: 0018:ffffb948418ef9b0 EFLAGS: 00010202 [16804.551227] RAX: 6b6b6b6b6b6b6b6b RBX: ffff90e0247f6600 RCX: 6b6b6b6b6b6b6b6b [16804.551873] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff90e0247f6600 [16804.552504] RBP: ffff90dffe0d4688 R08: 0000000000000001 R09: 0000000000000000 [16804.553136] R10: ffff90dffa4a0040 R11: 0000000000000000 R12: 000000000000002e [16804.553768] R13: ffff90e0247f6600 R14: 0000000000001663 R15: ffff90dff77862b8 [16804.554399] FS: 00007f4b197ae700(0000) GS:ffff90e036a00000(0000) knlGS:0000000000000000 [16804.555039] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [16804.555683] CR2: 00007f4b10022000 CR3: 00000002060e2004 CR4: 00000000003606f0 [16804.556336] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [16804.556968] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [16804.557583] Call Trace: [16804.558207] __tree_mod_log_rewind+0xbf/0x280 [btrfs] [16804.558835] btrfs_search_old_slot+0x105/0xd00 [btrfs] [16804.559468] resolve_indirect_refs+0x1eb/0xc70 [btrfs] [16804.560087] ? free_extent_buffer.part.19+0x5a/0xc0 [btrfs] [16804.560700] find_parent_nodes+0x388/0x1120 [btrfs] [16804.561310] btrfs_check_shared+0x115/0x1c0 [btrfs] [16804.561916] ? extent_fiemap+0x59d/0x6d0 [btrfs] [16804.562518] extent_fiemap+0x59d/0x6d0 [btrfs] [16804.563112] ? __might_fault+0x11/0x90 [16804.563706] do_vfs_ioctl+0x45a/0x700 [16804.564299] ksys_ioctl+0x70/0x80 [16804.564885] ? trace_hardirqs_off_thunk+0x1a/0x20 [16804.565461] __x64_sys_ioctl+0x16/0x20 [16804.566020] do_syscall_64+0x5c/0x250 [16804.566580] entry_SYSCALL_64_after_hwframe+0x49/0xbe [16804.567153] RIP: 0033:0x7f4b1ba2add7 (...) [16804.568907] RSP: 002b:00007f4b197adc88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [16804.569513] RAX: ffffffffffffffda RBX: 00007f4b100210d8 RCX: 00007f4b1ba2add7 [16804.570133] RDX: 00007f4b100210d8 RSI: 00000000c020660b RDI: 0000000000000003 [16804.570726] RBP: 000055de05a6cfe0 R08: 0000000000000000 R09: 00007f4b197add44 [16804.571314] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f4b197add48 [16804.571905] R13: 00007f4b197add40 R14: 00007f4b100210d0 R15: 00007f4b197add50 (...) [16804.575623] ---[ end trace 87317359aad4ba50 ]--- Fix this by making btrfs_put_tree_mod_seq() skip deletion of elements that have a sequence number equals to the computed minimum sequence number, and not just elements with a sequence number greater then that minimum. Fixes: `bd989ba359` ("Btrfs: add tree modification log functions") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-12-13 14:09:25 +01:00
Filipe Manana	ad1d8c4399	Btrfs: make tree checker detect checksum items with overlapping ranges Having checksum items, either on the checksums tree or in a log tree, that represent ranges that overlap each other is a sign of a corruption. Such case confuses the checksum lookup code and can result in not being able to find checksums or find stale checksums. So add a check for such case. This is motivated by a recent fix for a case where a log tree had checksum items covering ranges that overlap each other due to extent cloning, and resulted in missing checksums after replaying the log tree. It also helps detect past issues such as stale and outdated checksums due to overlapping, commit `27b9a8122f` ("Btrfs: fix csum tree corruption, duplicate and outdated checksums"). CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-12-13 14:09:25 +01:00
Filipe Manana	40e046acbd	Btrfs: fix missing data checksums after replaying a log tree When logging a file that has shared extents (reflinked with other files or with itself), we can end up logging multiple checksum items that cover overlapping ranges. This confuses the search for checksums at log replay time causing some checksums to never be added to the fs/subvolume tree. Consider the following example of a file that shares the same extent at offsets 0 and 256Kb: [ bytenr 13893632, offset 64Kb, len 64Kb ] 0 64Kb [ bytenr 13631488, offset 64Kb, len 192Kb ] 64Kb 256Kb [ bytenr 13893632, offset 0, len 256Kb ] 256Kb 512Kb When logging the inode, at tree-log.c:copy_items(), when processing the file extent item at offset 0, we log a checksum item covering the range 13959168 to 14024704, which corresponds to 13893632 + 64Kb and 13893632 + 64Kb + 64Kb, respectively. Later when processing the extent item at offset 256K, we log the checksums for the range from 13893632 to 14155776 (which corresponds to 13893632 + 256Kb). These checksums get merged with the checksum item for the range from 13631488 to 13893632 (13631488 + 256Kb), logged by a previous fsync. So after this we get the two following checksum items in the log tree: (...) item 6 key (EXTENT_CSUM EXTENT_CSUM 13631488) itemoff 3095 itemsize 512 range start 13631488 end 14155776 length 524288 item 7 key (EXTENT_CSUM EXTENT_CSUM 13959168) itemoff 3031 itemsize 64 range start 13959168 end 14024704 length 65536 The first one covers the range from the second one, they overlap. So far this does not cause a problem after replaying the log, because when replaying the file extent item for offset 256K, we copy all the checksums for the extent 13893632 from the log tree to the fs/subvolume tree, since searching for an checksum item for bytenr 13893632 leaves us at the first checksum item, which covers the whole range of the extent. However if we write 64Kb to file offset 256Kb for example, we will not be able to find and copy the checksums for the last 128Kb of the extent at bytenr 13893632, referenced by the file range 384Kb to 512Kb. After writing 64Kb into file offset 256Kb we get the following extent layout for our file: [ bytenr 13893632, offset 64K, len 64Kb ] 0 64Kb [ bytenr 13631488, offset 64Kb, len 192Kb ] 64Kb 256Kb [ bytenr 14155776, offset 0, len 64Kb ] 256Kb 320Kb [ bytenr 13893632, offset 64Kb, len 192Kb ] 320Kb 512Kb After fsync'ing the file, if we have a power failure and then mount the filesystem to replay the log, the following happens: 1) When replaying the file extent item for file offset 320Kb, we lookup for the checksums for the extent range from 13959168 (13893632 + 64Kb) to 14155776 (13893632 + 256Kb), through a call to btrfs_lookup_csums_range(); 2) btrfs_lookup_csums_range() finds the checksum item that starts precisely at offset 13959168 (item 7 in the log tree, shown before); 3) However that checksum item only covers 64Kb of data, and not 192Kb of data; 4) As a result only the checksums for the first 64Kb of data referenced by the file extent item are found and copied to the fs/subvolume tree. The remaining 128Kb of data, file range 384Kb to 512Kb, doesn't get the corresponding data checksums found and copied to the fs/subvolume tree. 5) After replaying the log userspace will not be able to read the file range from 384Kb to 512Kb, because the checksums are missing and resulting in an -EIO error. The following steps reproduce this scenario: $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt/sdc $ xfs_io -f -c "pwrite -S 0xa3 0 256K" /mnt/sdc/foobar $ xfs_io -c "fsync" /mnt/sdc/foobar $ xfs_io -c "pwrite -S 0xc7 256K 256K" /mnt/sdc/foobar $ xfs_io -c "reflink /mnt/sdc/foobar 320K 0 64K" /mnt/sdc/foobar $ xfs_io -c "fsync" /mnt/sdc/foobar $ xfs_io -c "pwrite -S 0xe5 256K 64K" /mnt/sdc/foobar $ xfs_io -c "fsync" /mnt/sdc/foobar <power failure> $ mount /dev/sdc /mnt/sdc $ md5sum /mnt/sdc/foobar md5sum: /mnt/sdc/foobar: Input/output error $ dmesg \| tail [165305.003464] BTRFS info (device sdc): no csum found for inode 257 start 401408 [165305.004014] BTRFS info (device sdc): no csum found for inode 257 start 405504 [165305.004559] BTRFS info (device sdc): no csum found for inode 257 start 409600 [165305.005101] BTRFS info (device sdc): no csum found for inode 257 start 413696 [165305.005627] BTRFS info (device sdc): no csum found for inode 257 start 417792 [165305.006134] BTRFS info (device sdc): no csum found for inode 257 start 421888 [165305.006625] BTRFS info (device sdc): no csum found for inode 257 start 425984 [165305.007278] BTRFS info (device sdc): no csum found for inode 257 start 430080 [165305.008248] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1 [165305.009550] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1 Fix this simply by deleting first any checksums, from the log tree, for the range of the extent we are logging at copy_items(). This ensures we do not get checksum items in the log tree that have overlapping ranges. This is a long time issue that has been present since we have the clone (and deduplication) ioctl, and can happen both when an extent is shared between different files and within the same file. A test case for fstests follows soon. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-12-13 14:09:24 +01:00
Dan Carpenter	b6293c821e	btrfs: return error pointer from alloc_test_extent_buffer Callers of alloc_test_extent_buffer have not correctly interpreted the return value as error pointer, as alloc_test_extent_buffer should behave as alloc_extent_buffer. The self-tests were unaffected but btrfs_find_create_tree_block could call both functions and that would cause problems up in the call chain. Fixes: `faa2dbf004` ("Btrfs: add sanity tests for new qgroup accounting code") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-12-13 14:09:24 +01:00
David Sterba	cf93e15eca	btrfs: fix devs_max constraints for raid1c3 and raid1c4 The value 0 for devs_max means to spread the allocated chunks over all available devices, eg. stripe for RAID0 or RAID5. This got mistakenly copied to the RAID1C3/4 profiles. The intention is to have exactly 3 and 4 copies respectively. Fixes: `47e6f7423b` ("btrfs: add support for 3-copy replication (raid1c3)") Fixes: `8d6fac0087` ("btrfs: add support for 4-copy replication (raid1c4)") Signed-off-by: David Sterba <dsterba@suse.com>	2019-12-13 14:09:23 +01:00
Andreas Färber	994bf9cd78	btrfs: tree-checker: Fix error format string for size_t Argument BTRFS_FILE_EXTENT_INLINE_DATA_START is defined as offsetof(), which returns type size_t, so we need %zu instead of %lu. This fixes a build warning on 32-bit ARM: ../fs/btrfs/tree-checker.c: In function 'check_extent_data_item': ../fs/btrfs/tree-checker.c:230:43: warning: format '%lu' expects argument of type 'long unsigned int', but argument 5 has type 'unsigned int' [-Wformat=] 230 \| "invalid item size, have %u expect [%lu, %u)", \| ~~^ \| long unsigned int \| %u Fixes: `153a6d2999` ("btrfs: tree-checker: Check item size before reading file extent type") Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andreas Färber <afaerber@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-12-13 14:09:23 +01:00
Josef Bacik	943eb3bf25	btrfs: don't double lock the subvol_sem for rename exchange If we're rename exchanging two subvols we'll try to lock this lock twice, which is bad. Just lock once if either of the ino's are subvols. Fixes: `cdd1fedf82` ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-12-13 14:09:23 +01:00
Josef Bacik	db8fe64f9c	btrfs: handle error in btrfs_cache_block_group We have a BUG_ON(ret < 0) in find_free_extent from btrfs_cache_block_group. If we fail to allocate our ctl we'll just panic, which is not good. Instead just go on to another block group. If we fail to find a block group we don't want to return ENOSPC, because really we got a ENOMEM and that's the root of the problem. Save our return from btrfs_cache_block_group(), and then if we still fail to make our allocation return that ret so we get the right error back. Tested with inject-error.py from bcc. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-12-13 14:09:22 +01:00
Josef Bacik	f72ff01df9	btrfs: do not call synchronize_srcu() in inode_tree_del Testing with the new fsstress uncovered a pretty nasty deadlock with lookup and snapshot deletion. Process A unlink -> final iput -> inode_tree_del -> synchronize_srcu(subvol_srcu) Process B btrfs_lookup <- srcu_read_lock() acquired here -> btrfs_iget -> find inode that has I_FREEING set -> __wait_on_freeing_inode() We're holding the srcu_read_lock() while doing the iget in order to make sure our fs root doesn't go away, and then we are waiting for the inode to finish freeing. However because the free'ing process is doing a synchronize_srcu() we deadlock. Fix this by dropping the synchronize_srcu() in inode_tree_del(). We don't need people to stop accessing the fs root at this point, we're only adding our empty root to the dead roots list. A larger much more invasive fix is forthcoming to address how we deal with fs roots, but this fixes the immediate problem. Fixes: `76dda93c6a` ("Btrfs: add snapshot/subvolume destroy ioctl") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-12-13 14:09:08 +01:00
Filipe Manana	fcb970581d	Btrfs: fix cloning range with a hole when using the NO_HOLES feature When using the NO_HOLES feature if we clone a range that contains a hole and a temporary ENOSPC happens while dropping extents from the target inode's range, we can end up failing and aborting the transaction with -EEXIST or with a corrupt file extent item, that has a length greater than it should and overlaps with other extents. For example when cloning the following range from inode A to inode B: Inode A: extent A1 extent A2 [ ----------- ] [ hole, implicit, 4MB length ] [ ------------- ] 0 1MB 5MB 6MB Range to clone: [1MB, 6MB) Inode B: extent B1 extent B2 extent B3 extent B4 [ ---------- ] [ --------- ] [ ---------- ] [ ---------- ] 0 1MB 1MB 2MB 2MB 5MB 5MB 6MB Target range: [1MB, 6MB) (same as source, to make it easier to explain) The following can happen: 1) btrfs_punch_hole_range() gets -ENOSPC from __btrfs_drop_extents(); 2) At that point, 'cur_offset' is set to 1MB and __btrfs_drop_extents() set 'drop_end' to 2MB, meaning it was able to drop only extent B2; 3) We then compute 'clone_len' as 'drop_end' - 'cur_offset' = 2MB - 1MB = 1MB; 4) We then attempt to insert a file extent item at inode B with a file offset of 5MB, which is the value of clone_info->file_offset. This fails with error -EEXIST because there's already an extent at that offset (extent B4); 5) We abort the current transaction with -EEXIST and return that error to user space as well. Another example, for extent corruption: Inode A: extent A1 extent A2 [ ----------- ] [ hole, implicit, 10MB length ] [ ------------- ] 0 1MB 11MB 12MB Inode B: extent B1 extent B2 [ ----------- ] [ --------- ] [ ----------------------------- ] 0 1MB 1MB 5MB 5MB 12MB Target range: [1MB, 12MB) (same as source, to make it easier to explain) 1) btrfs_punch_hole_range() gets -ENOSPC from __btrfs_drop_extents(); 2) At that point, 'cur_offset' is set to 1MB and __btrfs_drop_extents() set 'drop_end' to 5MB, meaning it was able to drop only extent B2; 3) We then compute 'clone_len' as 'drop_end' - 'cur_offset' = 5MB - 1MB = 4MB; 4) We then insert a file extent item at inode B with a file offset of 11MB which is the value of clone_info->file_offset, and a length of 4MB (the value of 'clone_len'). So we get 2 extents items with ranges that overlap and an extent length of 4MB, larger then the extent A2 from inode A (1MB length); 5) After that we end the transaction, balance the btree dirty pages and then start another or join the previous transaction. It might happen that the transaction which inserted the incorrect extent was committed by another task so we end up with extent corruption if a power failure happens. So fix this by making sure we attempt to insert the extent to clone at the destination inode only if we are past dropping the sub-range that corresponds to a hole. Fixes: `690a5dbfc5` ("Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning extents") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-12-13 13:29:22 +01:00
Nikolay Borisov	37d02592f1	btrfs: Fix error messages in qgroup_rescan_init The branch of qgroup_rescan_init which is executed from the mount path prints wrong errors messages. The textual print out in case BTRFS_QGROUP_STATUS_FLAG_RESCAN/BTRFS_QGROUP_STATUS_FLAG_ON are not set are transposed. Fix it by exchanging their place. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-12-13 13:29:12 +01:00
Pavel Shilovsky	d919131935	CIFS: Close cached root handle only if it has a lease SMB2_tdis() checks if a root handle is valid in order to decide whether it needs to close the handle or not. However if another thread has reference for the handle, it may end up with putting the reference twice. The extra reference that we want to put during the tree disconnect is the reference that has a directory lease. So, track the fact that we have a directory lease and close the handle only in that case. Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2019-12-13 00:49:57 -06:00
Steve French	e0fc5b1153	SMB3: Fix crash in SMB2_open_init due to uninitialized field in compounding path Ran into an intermittent crash in SMB2_open_init+0x2f6/0x970 due to oparms.cifs_sb not being initialized when called from: smb2_compound_op+0x45d/0x1690 Zero the whole oparms struct in the compounding path before setting up the oparms so we don't risk any uninitialized fields. Fixes: `fdef665ba4` ("smb3: fix mode passed in on create for modetosid mount option") Signed-off-by: Steve French <stfrench@microsoft.com> Acked-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>	2019-12-13 00:49:38 -06:00
Linus Torvalds	37d4e84f76	A fix to avoid a corner case when scheduling cap reclaim in batches from Xiubo, a patch to add some observability into cap waiters from Jeff and a couple of cleanups. -----BEGIN PGP SIGNATURE----- iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl3yiJMTHGlkcnlvbW92 QGdtYWlsLmNvbQAKCRBKf944AhHzi5k9CACmM3fJGrTUuOLgXAxxllCfiV6UQLoY nuTo/bx0DmG603n+Ze8+Z0iz7hDc1Gw2XUeLkJcAE/xSetgZXO/MvJ0Ionq5Ac/k CrqS6ucIa1bPxbE1QMTHswHjkajKwBpAZ5+khdLNLuXJxy3c9HDCGOT4VZav7Yc9 99W4kIdzOKdYLpZHAedMK97IJIrD5WhYTAFW4rNPY0GL6OPD1V0uiS9v7xUWIxnZ Uusnu+zY8miQlLVx/V9DyLh/6G5X7XyQO1nkSQcVXZOOG7+qnkq6jDhQW8adgOSZ wUFigTxxhSTIcntWg01TaCRNoi1N3/P8Z9/rD27zBHPbl93ANH+lUkCh =NicF -----END PGP SIGNATURE----- Merge tag 'ceph-for-5.5-rc2' of git://github.com/ceph/ceph-client Pull ceph fixes from Ilya Dryomov: "A fix to avoid a corner case when scheduling cap reclaim in batches from Xiubo, a patch to add some observability into cap waiters from Jeff and a couple of cleanups" * tag 'ceph-for-5.5-rc2' of git://github.com/ceph/ceph-client: ceph: add more debug info when decoding mdsmap ceph: switch to global cap helper ceph: trigger the reclaim work once there has enough pending caps ceph: show tasks waiting on caps in debugfs caps file ceph: convert int fields in ceph_mount_options to unsigned int	2019-12-12 10:56:37 -08:00
Dominik Brodowski	8243186f0c	fs: remove ksys_dup() ksys_dup() is used only at one place in the kernel, namely to duplicate fd 0 of /dev/console to stdout and stderr. The same functionality can be achieved by using functions already available within the kernel namespace. Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>	2019-12-12 19:00:36 +01:00
Dominik Brodowski	cccaa5e335	init: use do_mount() instead of ksys_mount() In prepare_namespace(), do_mount() can be used instead of ksys_mount() as the first and third argument are const strings in the kernel, the second and fourth argument are passed through anyway, and the fifth argument is NULL. In do_mount_root(), ksys_mount() is called with the first and third argument being already kernelspace strings, which do not need to be copied over from userspace to kernelspace (again). The second and fourth arguments are passed through to do_mount() anyway. The fifth argument, while already residing in kernelspace, needs to be put into a page of its own. Then, do_mount() can be used instead of ksys_mount(). Once this is done, there are no in-kernel users to ksys_mount() left, which can therefore be removed. Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>	2019-12-12 14:50:05 +01:00
Linus Torvalds	ae4b064e2a	AFS fixes -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl3xW+cACgkQ+7dXa6fL C2t+UQ/9ETwAYJ2XgfwatNhAHX0+Tf4XMlgEDybf5a+ERRLbkSQEqRjTImopk5m4 T7yKea0ctsVb1avdUVdayh2KuJhjmO627u3w48kd/uf2ut4IyOyaraatYKYOJ+XL upr8WxAMXUHgnwsb38avjH0dwq7+eDyX8FNLHiDY4Sk3mTAVHbafL6V0ujp0ms2o VmHLX6ihwECehUqrNEyy1pLX2nuFmd+MeLnoi/EWLa47/X0te21G8u8UdPWHGgHn cZp8kqWSEoaeChO7x6XOoJZCY6N/7o+hogsrU/N5YrB6FXPpFWGdwFpej3YcyFso QqffyXX5jVAyUg9I/v3WCrtDmTQ9xVG4kxhmBMVId6bBk3ZVpGaHuLE3ENLbr1w/ Z1k26BI41nJwrqV0C2bPoMalpDprP2WIuWsBIpYTqaICpc53KJ72c6mLhcDEt+oc YCSd9X0Bv5O7tZS9rLI8mt0JtCr9Uw+VfnK+uOuUTPv2YcqQwK1/xZ0/9hd2P8XS L5GKcEeuk3indgmefjwsz5YJcmzeCttaOQTlmaN/Hz3MbkK2CC+spMG4OZWzUXL4 axZmCIo/kKrv64bLyVZul6BjkhawyvfWzCs2NM6Ni63akW3Yb0S2WRdgJhvGILml 48YgqcG7R1y8LRPYiJzcuAPnbumuu6ZSxVuyZN0FU0qc4lt5mPM= =ldOp -----END PGP SIGNATURE----- Merge tag 'afs-fixes-20191211' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull AFS fixes from David Howells: "Fixes for AFS plus one patch to make debugging easier: - Fix how addresses are matched to server records. This is currently incorrect which means cache invalidation callbacks from the server don't necessarily get delivered correctly. This causes stale data and metadata to be seen under some circumstances. - Make the dynamic root superblock R/W so that rpm/dnf can reapply the SELinux label to it when upgrading the Fedora filesystem-afs package. If the filesystem is R/O, this fails and the upgrade fails. It might be better in future to allow setxattr from an LSM to bypass the R/O protections, if only for pseudo-filesystems. - Fix the parsing of mountpoint strings. The mountpoint object has to have a terminal dot, whereas the source/device string passed to mount should not. This confuses type-forcing suffix detection leading to the wrong volume variant being mounted. - Make lookups in the dynamic root superblock for creation events (such as mkdir) fail with EOPNOTSUPP rather than something like EEXIST. The dynamic root only allows implicit creation by the ->lookup() method - and only if the target cell exists. - Fix the looking up of an AFS superblock to include the cell in the matching key - otherwise all volumes with the same ID number are treated as the same thing, irrespective of which cell they're in. - Show the volume name of each volume in the volume records displayed in /proc/net/afs/<cell>/volumes. This proved useful in debugging as it provides a way to map the volume IDs to names, where the names are what appear in /proc/mounts" * tag 'afs-fixes-20191211' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: afs: Show volume name in /proc/net/afs/<cell>/volumes afs: Fix missing cell comparison in afs_test_super() afs: Fix creation calls in the dynamic root to fail with EOPNOTSUPP afs: Fix mountpoint parsing afs: Fix SELinux setting security label on /afs afs: Fix afs_find_server lookups for ipv4 peers	2019-12-11 18:10:40 -08:00
Jens Axboe	9e3aa61ae3	io_uring: ensure we return -EINVAL on unknown opcode If we submit an unknown opcode and have fd == -1, io_op_needs_file() will return true as we default to needing a file. Then when we go and assign the file, we find the 'fd' invalid and return -EBADF. We really should be returning -EINVAL for that case, as we normally do for unsupported opcodes. Change io_op_needs_file() to have the following return values: 0 - does not need a file 1 - does need a file < 0 - error value and use this to pass back the right value for this invalid case. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-11 16:02:32 -07:00
Brian Foster	d0c2204135	xfs: stabilize insert range start boundary to avoid COW writeback race generic/522 (fsx) occasionally fails with a file corruption due to an insert range operation. The primary characteristic of the corruption is a misplaced insert range operation that differs from the requested target offset. The reason for this behavior is a race between the extent shift sequence of an insert range and a COW writeback completion that causes a front merge with the first extent in the shift. The shift preparation function flushes and unmaps from the target offset of the operation to the end of the file to ensure no modifications can be made and page cache is invalidated before file data is shifted. An insert range operation then splits the extent at the target offset, if necessary, and begins to shift the start offset of each extent starting from the end of the file to the start offset. The shift sequence operates at extent level and so depends on the preparation sequence to guarantee no changes can be made to the target range during the shift. If the block immediately prior to the target offset was dirty and shared, however, it can undergo writeback and move from the COW fork to the data fork at any point during the shift. If the block is contiguous with the block at the start offset of the insert range, it can front merge and alter the start offset of the extent. Once the shift sequence reaches the target offset, it shifts based on the latest start offset and silently changes the target offset of the operation and corrupts the file. To address this problem, update the shift preparation code to stabilize the start boundary along with the full range of the insert. Also update the existing corruption check to fail if any extent is shifted with a start offset behind the target offset of the insert range. This prevents insert from racing with COW writeback completion and fails loudly in the event of an unexpected extent shift. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2019-12-11 13:18:42 -08:00
Linus Torvalds	687dec9b94	Changes since last update: - Fix improper return value of listxattr() with no xattr; - Keep up documentation with latest code. -----BEGIN PGP SIGNATURE----- iIwEABYIADQWIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCXfELlBYcZ2FveGlhbmcy NUBodWF3ZWkuY29tAAoJEDk3MdwfteYEtUABAN164UwGU9QKEsqgZQcmbz23qXSJ QDR8r/ch2LxzXKkVAQDXCNU+ol6jkiapLcTvsXEjBk8sUxsCEVnmZ36jru+TBA== =kRp9 -----END PGP SIGNATURE----- Merge tag 'erofs-for-5.5-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs fixes from Gao Xiang: "Mainly address a regression reported by David recently observed together with overlayfs due to the improper return value of listxattr() without xattr. Update outdated expressions in document as well. Summary: - Fix improper return value of listxattr() with no xattr - Keep up documentation with latest code" * tag 'erofs-for-5.5-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: update documentation erofs: zero out when listxattr is called with no xattr	2019-12-11 12:25:32 -08:00
Linus Torvalds	d1c6a2aa02	pipe: simplify signal handling in pipe_read() and add comments There's no need to separately check for signals while inside the locked region, since we're going to do "wait_event_interruptible()" right afterwards anyway, and the error handling is much simpler there. The check for whether we had already read anything was also redundant, since we no longer do the odd merging of reads when there are pending writers. But perhaps more importantly, this adds commentary about why we still need to wake up possible writers even though we didn't read any data, and why we can skip all the finishing touches now if we get a signal (or had a signal pending) while waiting for more data. [ This is a split-out cleanup from my "make pipe IO use exclusive wait queues" thing, which I can't apply because it triggers a nasty bug in the GNU make jobserver - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-12-11 11:46:19 -08:00
David Howells	50559800b7	afs: Show volume name in /proc/net/afs/<cell>/volumes Show the name of each volume in /proc/net/afs/<cell>/volumes to make it easier to work out the name corresponding to a volume ID. This makes it easier to work out which mounts in /proc/mounts correspond to which volume ID. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Marc Dionne <marc.dionne@auristor.com>	2019-12-11 17:48:20 +00:00
David Howells	106bc79843	afs: Fix missing cell comparison in afs_test_super() Fix missing cell comparison in afs_test_super(). Without this, any pair volumes that have the same volume ID will share a superblock, no matter the cell, unless they're in different network namespaces. Normally, most users will only deal with a single cell and so they won't see this. Even if they do look into a second cell, they won't see a problem unless they happen to hit a volume with the same ID as one they've already got mounted. Before the patch: # ls /afs/grand.central.org/archive linuxdev/ mailman/ moin/ mysql/ pipermail/ stage/ twiki/ # ls /afs/kth.se/ linuxdev/ mailman/ moin/ mysql/ pipermail/ stage/ twiki/ # cat /proc/mounts \| grep afs none /afs afs rw,relatime,dyn,autocell 0 0 #grand.central.org:root.cell /afs/grand.central.org afs ro,relatime 0 0 #grand.central.org:root.archive /afs/grand.central.org/archive afs ro,relatime 0 0 #grand.central.org:root.archive /afs/kth.se afs ro,relatime 0 0 After the patch: # ls /afs/grand.central.org/archive linuxdev/ mailman/ moin/ mysql/ pipermail/ stage/ twiki/ # ls /afs/kth.se/ admin/ common/ install/ OldFiles/ service/ system/ bakrestores/ home/ misc/ pkg/ src/ wsadmin/ # cat /proc/mounts \| grep afs none /afs afs rw,relatime,dyn,autocell 0 0 #grand.central.org:root.cell /afs/grand.central.org afs ro,relatime 0 0 #grand.central.org:root.archive /afs/grand.central.org/archive afs ro,relatime 0 0 #kth.se:root.cell /afs/kth.se afs ro,relatime 0 0 Fixes: ^1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: Carsten Jacobi <jacobi@de.ibm.com> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Marc Dionne <marc.dionne@auristor.com> Tested-by: Jonathan Billings <jsbillings@jsbillings.org> cc: Todd DeSantis <atd@us.ibm.com>	2019-12-11 17:47:51 +00:00
David Howells	1da4bd9f9d	afs: Fix creation calls in the dynamic root to fail with EOPNOTSUPP Fix the lookup method on the dynamic root directory such that creation calls, such as mkdir, open(O_CREAT), symlink, etc. fail with EOPNOTSUPP rather than failing with some odd error (such as EEXIST). lookup() itself tries to create automount directories when it is invoked. These are cached locally in RAM and not committed to storage. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Marc Dionne <marc.dionne@auristor.com> Tested-by: Jonathan Billings <jsbillings@jsbillings.org>	2019-12-11 17:47:51 +00:00
David Howells	158d583353	afs: Fix mountpoint parsing Each AFS mountpoint has strings that define the target to be mounted. This is required to end in a dot that is supposed to be stripped off. The string can include suffixes of ".readonly" or ".backup" - which are supposed to come before the terminal dot. To add to the confusion, the "fs lsmount" afs utility does not show the terminal dot when displaying the string. The kernel mount source string parser, however, assumes that the terminal dot marks the suffix and that the suffix is always "" and is thus ignored. In most cases, there is no suffix and this is not a problem - but if there is a suffix, it is lost and this affects the ability to mount the correct volume. The command line mount command, on the other hand, is expected not to include a terminal dot - so the problem doesn't arise there. Fix this by making sure that the dot exists and then stripping it when passing the string to the mount configuration. Fixes: `bec5eb6141` ("AFS: Implement an autocell mount capability [ver #2]") Reported-by: Jonathan Billings <jsbillings@jsbillings.org> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Marc Dionne <marc.dionne@auristor.com> Tested-by: Jonathan Billings <jsbillings@jsbillings.org>	2019-12-11 16:56:54 +00:00
Flavio Leitner	346da4d2c7	sched/cputime, proc/stat: Fix incorrect guest nice cpustat value The value being used for guest_nice should be CPUTIME_GUEST_NICE and not CPUTIME_USER. Fixes: `26dae145a7` ("procfs: Use all-in-one vtime aware kcpustat accessor") Signed-off-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191205020344.14940-1-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2019-12-11 07:09:58 +01:00
Jens Axboe	10d5934557	io_uring: add sockets to list of files that support non-blocking issue In chasing a performance issue between using IORING_OP_RECVMSG and IORING_OP_READV on sockets, tracing showed that we always punt the socket reads to async offload. This is due to io_file_supports_async() not checking for S_ISSOCK on the inode. Since sockets supports the O_NONBLOCK (or MSG_DONTWAIT) flag just fine, add sockets to the list of file types that we can do a non-blocking issue to. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-10 16:33:23 -07:00
Jens Axboe	53108d476a	io_uring: only hash regular files for async work execution We hash regular files to avoid having multiple threads hammer on the inode mutex, but it should not be needed on other types of files (like sockets). Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-10 16:33:23 -07:00
Jens Axboe	4a0a7a1874	io_uring: run next sqe inline if possible One major use case of linked commands is the ability to run the next link inline, if at all possible. This is done correctly for async offload, but somewhere along the line we lost the ability to do so when we were able to complete a request without having to punt it. Ensure that we do so correctly. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-10 16:33:23 -07:00
Jens Axboe	392edb45b2	io_uring: don't dynamically allocate poll data This essentially reverts commit `e944475e69`. For high poll ops workloads, like TAO, the dynamic allocation of the wait_queue entry for IORING_OP_POLL_ADD adds considerable extra overhead. Go back to embedding the wait_queue_entry, but keep the usage of wait->private for the pointer stashing. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-10 16:33:23 -07:00
Jens Axboe	d96885658d	io_uring: deferred send/recvmsg should assign iov Don't just assign it from the main call path, that can miss the case when we're called from issue deferral. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-10 16:33:23 -07:00
Jens Axboe	8a4955ff1c	io_uring: sqthread should grab ctx->uring_lock for submissions We use the mutex to guard against registered file updates, for instance. Ensure we're safe in accessing that state against concurrent updates. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-10 16:33:23 -07:00
Jens Axboe	e995d5123e	io-wq: briefly spin for new work after finishing work To avoid going to sleep only to get woken shortly thereafter, spin briefly for new work upon completion of work. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-10 16:33:22 -07:00
Jens Axboe	506d95ff5d	io-wq: remove worker->wait waitqueue We only have one cases of using the waitqueue to wake the worker, the rest are using wake_up_process(). Since we can save some cycles not fiddling with the waitqueue io_wqe_worker(), switch the work activation to task wakeup and get rid of the now unused wait_queue_head_t in struct io_worker. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-10 16:33:22 -07:00
Jens Axboe	4e88d6e779	io_uring: allow unbreakable links Some commands will invariably end in a failure in the sense that the completion result will be less than zero. One such example is timeouts that don't have a completion count set, they will always complete with -ETIME unless cancelled. For linked commands, we sever links and fail the rest of the chain if the result is less than zero. Since we have commands where we know that will happen, add IOSQE_IO_HARDLINK as a stronger link that doesn't sever regardless of the completion result. Note that the link will still sever if we fail submitting the parent request, hard links are only resilient in the presence of completion results for requests that did submit correctly. Cc: stable@vger.kernel.org # v5.4 Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Reported-by: 李通洲 <carter.li@eoitek.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-10 16:33:06 -07:00
Amir Goldstein	6889ee5a53	ovl: relax WARN_ON() on rename to self In ovl_rename(), if new upper is hardlinked to old upper underneath overlayfs before upper dirs are locked, user will get an ESTALE error and a WARN_ON will be printed. Changes to underlying layers while overlayfs is mounted may result in unexpected behavior, but it shouldn't crash the kernel and it shouldn't trigger WARN_ON() either, so relax this WARN_ON(). Reported-by: syzbot+bb1836a212e69f8e201a@syzkaller.appspotmail.com Fixes: `804032fabb` ("ovl: don't check rename to self") Cc: <stable@vger.kernel.org> # v4.9+ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2019-12-10 16:00:55 +01:00
Amir Goldstein	9c6d8f13e9	ovl: fix corner case of non-unique st_dev;st_ino On non-samefs overlay without xino, non pure upper inodes should use a pseudo_dev assigned to each unique lower fs and pure upper inodes use the real upper st_dev. It is fine for an overlay pure upper inode to use the same st_dev;st_ino values as the real upper inode, because the content of those two different filesystem objects is always the same. In this case, however: - two filesystems, A and B - upper layer is on A - lower layer 1 is also on A - lower layer 2 is on B Non pure upper overlay inode, whose origin is in layer 1 will have the same st_dev;st_ino values as the real lower inode. This may result with a false positive results of 'diff' between the real lower and copied up overlay inode. Fix this by using the upper st_dev;st_ino values in this case. This breaks the property of constant st_dev;st_ino across copy up of this case. This breakage will be fixed by a later patch. Fixes: `5148626b80` ("ovl: allocate anon bdev per unique lower fs") Cc: stable@vger.kernel.org # v4.17+ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2019-12-10 16:00:55 +01:00

1 2 3 4 5 ...

61957 Commits