Commit Graph

65313 Commits

Author SHA1 Message Date
Zhang Xiaoxu
95a3d8f3af cifs: Fix double add page to memcg when cifs_readpages
When xfstests generic/451, there is an BUG at mm/memcontrol.c:
  page:ffffea000560f2c0 refcount:2 mapcount:0 mapping:000000008544e0ea
       index:0xf
  mapping->aops:cifs_addr_ops dentry name:"tst-aio-dio-cycle-write.451"
  flags: 0x2fffff80000001(locked)
  raw: 002fffff80000001 ffffc90002023c50 ffffea0005280088 ffff88815cda0210
  raw: 000000000000000f 0000000000000000 00000002ffffffff ffff88817287d000
  page dumped because: VM_BUG_ON_PAGE(page->mem_cgroup)
  page->mem_cgroup:ffff88817287d000
  ------------[ cut here ]------------
  kernel BUG at mm/memcontrol.c:2659!
  invalid opcode: 0000 [#1] SMP
  CPU: 2 PID: 2038 Comm: xfs_io Not tainted 5.8.0-rc1 #44
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_
    073836-buildvm-ppc64le-16.ppc.4
  RIP: 0010:commit_charge+0x35/0x50
  Code: 0d 48 83 05 54 b2 02 05 01 48 89 77 38 c3 48 c7
        c6 78 4a ea ba 48 83 05 38 b2 02 05 01 e8 63 0d9
  RSP: 0018:ffffc90002023a50 EFLAGS: 00010202
  RAX: 0000000000000000 RBX: ffff88817287d000 RCX: 0000000000000000
  RDX: 0000000000000000 RSI: ffff88817ac97ea0 RDI: ffff88817ac97ea0
  RBP: ffffea000560f2c0 R08: 0000000000000203 R09: 0000000000000005
  R10: 0000000000000030 R11: ffffc900020237a8 R12: 0000000000000000
  R13: 0000000000000001 R14: 0000000000000001 R15: ffff88815a1272c0
  FS:  00007f5071ab0800(0000) GS:ffff88817ac80000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 000055efcd5ca000 CR3: 000000015d312000 CR4: 00000000000006e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   mem_cgroup_charge+0x166/0x4f0
   __add_to_page_cache_locked+0x4a9/0x710
   add_to_page_cache_locked+0x15/0x20
   cifs_readpages+0x217/0x1270
   read_pages+0x29a/0x670
   page_cache_readahead_unbounded+0x24f/0x390
   __do_page_cache_readahead+0x3f/0x60
   ondemand_readahead+0x1f1/0x470
   page_cache_async_readahead+0x14c/0x170
   generic_file_buffered_read+0x5df/0x1100
   generic_file_read_iter+0x10c/0x1d0
   cifs_strict_readv+0x139/0x170
   new_sync_read+0x164/0x250
   __vfs_read+0x39/0x60
   vfs_read+0xb5/0x1e0
   ksys_pread64+0x85/0xf0
   __x64_sys_pread64+0x22/0x30
   do_syscall_64+0x69/0x150
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f5071fcb1af
  Code: Bad RIP value.
  RSP: 002b:00007ffde2cdb8e0 EFLAGS: 00000293 ORIG_RAX: 0000000000000011
  RAX: ffffffffffffffda RBX: 00007ffde2cdb990 RCX: 00007f5071fcb1af
  RDX: 0000000000001000 RSI: 000055efcd5ca000 RDI: 0000000000000003
  RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000001000 R11: 0000000000000293 R12: 0000000000000001
  R13: 000000000009f000 R14: 0000000000000000 R15: 0000000000001000
  Modules linked in:
  ---[ end trace 725fa14a3e1af65c ]---

Since commit 3fea5a499d ("mm: memcontrol: convert page cache to a new
mem_cgroup_charge() API") not cancel the page charge, the pages maybe
double add to pagecache:
thread1                       | thread2
cifs_readpages
readpages_get_pages
 add_to_page_cache_locked(head,index=n)=0
                              | readpages_get_pages
                              | add_to_page_cache_locked(head,index=n+1)=0
 add_to_page_cache_locked(head, index=n+1)=-EEXIST
 then, will next loop with list head page's
 index=n+1 and the page->mapping not NULL
readpages_get_pages
add_to_page_cache_locked(head, index=n+1)
 commit_charge
  VM_BUG_ON_PAGE

So, we should not do the next loop when any page add to page cache
failed.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
2020-06-23 12:04:52 -05:00
Linus Torvalds
3e08a95294 for-5.8-rc2-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl7yABEACgkQxWXV+ddt
 WDtGoQ//cBWRRWLlLTRgpaKnY6t8JgVUqNvPJISHHf45cNbOJh0yo8hUuKMW+440
 8ovYqtFoZD+JHcHDE2sMueHBFe38rG5eT/zh8j/ruhBzeJcTb3lSYz53d7sfl5kD
 cIVngPEVlGziDqW2PsWLlyh8ulBGzY3YmS6kAEkyP/6/uhE/B1dq6qn3GUibkbKI
 dfNjHTLwZVmwnqoxLu8ZE2/hHFbzhl0sm09snsXYSVu13g36+edp0Z+pF0MlKGVk
 G6YrnZcts8TWwneZ4nogD9f2CMvzMhYDDLyEjsX0Ouhb+Cu2WNxdfrJ2ZbPNU82w
 EGbo451mIt6Ht8wicEjh27LWLI7YMraF/Ig/ODMdvFBYDbhl4voX2t+4n+p5Czbg
 AW6Wtg/q5EaaNFqrTsqAAiUn0+R3sMiDWrE0AewcE7syPGqQ2XMwP4la5pZ36rz8
 8Vo5KIGo44PIJ1dMwcX+bg3HTtUnBJSxE5fUi0rJ3ZfHKGjLS79VonEeQjh3QD6W
 0UlK+jCjo6KZoe33XdVV2hVkHd63ZIlliXWv0LOR+gpmqqgW2b3wf181zTvo/5sI
 v0fDjstA9caqf68ChPE9jJi7rZPp/AL1yAQGEiNzjKm4U431TeZJl2cpREicMJDg
 FCDU51t9425h8BFkM4scErX2/53F1SNNNSlAsFBGvgJkx6rTENs=
 =/eCR
 -----END PGP SIGNATURE-----

Merge tag 'for-5.8-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A number of fixes, located in two areas, one performance fix and one
  fixup for better integration with another patchset.

   - bug fixes in nowait aio:
       - fix snapshot creation hang after nowait-aio was used
       - fix failure to write to prealloc extent past EOF
       - don't block when extent range is locked

   - block group fixes:
       - relocation failure when scrub runs in parallel
       - refcount fix when removing fails
       - fix race between removal and creation
       - space accounting fixes

   - reinstante fast path check for log tree at unlink time, fixes
     performance drop up to 30% in REAIM

   - kzfree/kfree fixup to ease treewide patchset renaming kzfree"

* tag 'for-5.8-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: use kfree() in btrfs_ioctl_get_subvol_info()
  btrfs: fix RWF_NOWAIT writes blocking on extent locks and waiting for IO
  btrfs: fix RWF_NOWAIT write not failling when we need to cow
  btrfs: fix failure of RWF_NOWAIT write into prealloc extent beyond eof
  btrfs: fix hang on snapshot creation after RWF_NOWAIT write
  btrfs: check if a log root exists before locking the log_mutex on unlink
  btrfs: fix bytes_may_use underflow when running balance and scrub in parallel
  btrfs: fix data block group relocation failure due to concurrent scrub
  btrfs: fix race between block group removal and block group creation
  btrfs: fix a block group ref counter leak after failure to remove block group
2020-06-23 09:20:11 -07:00
Dave Chinner
c7f87f3984 xfs: fix use-after-free on CIL context on shutdown
xlog_wait() on the CIL context can reference a freed context if the
waiter doesn't get scheduled before the CIL context is freed. This
can happen when a task is on the hard throttle and the CIL push
aborts due to a shutdown. This was detected by generic/019:

thread 1			thread 2

__xfs_trans_commit
 xfs_log_commit_cil
  <CIL size over hard throttle limit>
  xlog_wait
   schedule
				xlog_cil_push_work
				wake_up_all
				<shutdown aborts commit>
				xlog_cil_committed
				kmem_free

   remove_wait_queue
    spin_lock_irqsave --> UAF

Fix it by moving the wait queue to the CIL rather than keeping it in
in the CIL context that gets freed on push completion. Because the
wait queue is now independent of the CIL context and we might have
multiple contexts in flight at once, only wake the waiters on the
push throttle when the context we are pushing is over the hard
throttle size threshold.

Fixes: 0e7ab7efe7 ("xfs: Throttle commits on delayed background CIL push")
Reported-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-06-22 19:22:57 -07:00
Xiyu Yang
77577de641 cifs: Fix cached_fid refcnt leak in open_shroot
open_shroot() invokes kref_get(), which increases the refcount of the
"tcon->crfid" object. When open_shroot() returns not zero, it means the
open operation failed and close_shroot() will not be called to decrement
the refcount of the "tcon->crfid".

The reference counting issue happens in one normal path of
open_shroot(). When the cached root have been opened successfully in a
concurrent process, the function increases the refcount and jump to
"oshr_free" to return. However the current return value "rc" may not
equal to 0, thus the increased refcount will not be balanced outside the
function, causing a refcnt leak.

Fix this issue by setting the value of "rc" to 0 before jumping to
"oshr_free" label.

Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
2020-06-21 22:34:50 -05:00
Linus Torvalds
8b6ddd10d6 A few fixes and small cleanups for tracing:
- Have recordmcount work with > 64K sections (to support LTO)
  - kprobe RCU fixes
  - Correct a kprobe critical section with missing mutex
  - Remove redundant arch_disarm_kprobe() call
  - Fix lockup when kretprobe triggers within kprobe_flush_task()
  - Fix memory leak in fetch_op_data operations
  - Fix sleep in atomic in ftrace trace array sample code
  - Free up memory on failure in sample trace array code
  - Fix incorrect reporting of function_graph fields in format file
  - Fix quote within quote parsing in bootconfig
  - Fix return value of bootconfig tool
  - Add testcases for bootconfig tool
  - Fix maybe uninitialized warning in ftrace pid file code
  - Remove unused variable in tracing_iter_reset()
  - Fix some typos
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXu1jrRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qoCMAP91nOccE3X+Nvc3zET3isDWnl1tWJxk
 icsBgN/JwBRuTAD/dnWTHIWM2/5lTiagvyVsmINdJHP6JLr8T7dpN9tlxAQ=
 =Cuo7
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Have recordmcount work with > 64K sections (to support LTO)

 - kprobe RCU fixes

 - Correct a kprobe critical section with missing mutex

 - Remove redundant arch_disarm_kprobe() call

 - Fix lockup when kretprobe triggers within kprobe_flush_task()

 - Fix memory leak in fetch_op_data operations

 - Fix sleep in atomic in ftrace trace array sample code

 - Free up memory on failure in sample trace array code

 - Fix incorrect reporting of function_graph fields in format file

 - Fix quote within quote parsing in bootconfig

 - Fix return value of bootconfig tool

 - Add testcases for bootconfig tool

 - Fix maybe uninitialized warning in ftrace pid file code

 - Remove unused variable in tracing_iter_reset()

 - Fix some typos

* tag 'trace-v5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ftrace: Fix maybe-uninitialized compiler warning
  tools/bootconfig: Add testcase for show-command and quotes test
  tools/bootconfig: Fix to return 0 if succeeded to show the bootconfig
  tools/bootconfig: Fix to use correct quotes for value
  proc/bootconfig: Fix to use correct quotes for value
  tracing: Remove unused event variable in tracing_iter_reset
  tracing/probe: Fix memleak in fetch_op_data operations
  trace: Fix typo in allocate_ftrace_ops()'s comment
  tracing: Make ftrace packed events have align of 1
  sample-trace-array: Remove trace_array 'sample-instance'
  sample-trace-array: Fix sleeping function called from invalid context
  kretprobe: Prevent triggering kretprobe from within kprobe_flush_task
  kprobes: Remove redundant arch_disarm_kprobe() call
  kprobes: Fix to protect kick_kprobe_optimizer() by kprobe_mutex
  kprobes: Use non RCU traversal APIs on kprobe_tables if possible
  kprobes: Suppress the suspicious RCU warning on kprobes
  recordmcount: support >64k sections
2020-06-20 13:17:47 -07:00
David Howells
5481fc6eb8 afs: Fix hang on rmmod due to outstanding timer
The fileserver probe timer, net->fs_probe_timer, isn't cancelled when
the kafs module is being removed and so the count it holds on
net->servers_outstanding doesn't get dropped..

This causes rmmod to wait forever.  The hung process shows a stack like:

	afs_purge_servers+0x1b5/0x23c [kafs]
	afs_net_exit+0x44/0x6e [kafs]
	ops_exit_list+0x72/0x93
	unregister_pernet_operations+0x14c/0x1ba
	unregister_pernet_subsys+0x1d/0x2a
	afs_exit+0x29/0x6f [kafs]
	__do_sys_delete_module.isra.0+0x1a2/0x24b
	do_syscall_64+0x51/0x95
	entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fix this by:

 (1) Attempting to cancel the probe timer and, if successful, drop the
     count that the timer was holding.

 (2) Make the timer function just drop the count and not schedule the
     prober if the afs portion of net namespace is being destroyed.

Also, whilst we're at it, make the following changes:

 (3) Initialise net->servers_outstanding to 1 and decrement it before
     waiting on it so that it doesn't generate wake up events by being
     decremented to 0 until we're cleaning up.

 (4) Switch the atomic_dec() on ->servers_outstanding for ->fs_timer in
     afs_purge_servers() to use the helper function for that.

Fixes: f6cbb368bc ("afs: Actively poll fileservers to maintain NAT or firewall openings")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-20 12:01:58 -07:00
David Howells
f8ea5c7bce afs: Fix afs_do_lookup() to call correct fetch-status op variant
Fix afs_do_lookup()'s fallback case for when FS.InlineBulkStatus isn't
supported by the server.

In the fallback, it calls FS.FetchStatus for the specific vnode it's
meant to be looking up.  Commit b6489a49f7 broke this by renaming one
of the two identically-named afs_fetch_status_operation descriptors to
something else so that one of them could be made non-static.  The site
that used the renamed one, however, wasn't renamed and didn't produce
any warning because the other was declared in a header.

Fix this by making afs_do_lookup() use the renamed variant.

Note that there are two variants of the success method because one is
called from ->lookup() where we may or may not have an inode, but can't
call iget until after we've talked to the server - whereas the other is
called from within iget where we have an inode, but it may or may not be
initialised.

The latter variant expects there to be an inode, but because it's being
called from there former case, there might not be - resulting in an oops
like the following:

  BUG: kernel NULL pointer dereference, address: 00000000000000b0
  ...
  RIP: 0010:afs_fetch_status_success+0x27/0x7e
  ...
  Call Trace:
    afs_wait_for_operation+0xda/0x234
    afs_do_lookup+0x2fe/0x3c1
    afs_lookup+0x3c5/0x4bd
    __lookup_slow+0xcd/0x10f
    walk_component+0xa2/0x10c
    path_lookupat.isra.0+0x80/0x110
    filename_lookup+0x81/0x104
    vfs_statx+0x76/0x109
    __do_sys_newlstat+0x39/0x6b
    do_syscall_64+0x4c/0x78
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: b6489a49f7 ("afs: Fix silly rename")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-20 12:01:58 -07:00
Linus Torvalds
4333a9b0b6 io_uring-5.8-2020-06-19
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl7s0e0QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgppsQD/9ZD5Rrr1fRqZw29UMvjoKu53Plvc14GA7R
 Tv44p/qF8KD7mkSY+YB3OIRh+NP7fHPXIJLotjZU9CFpgQtTjkiYbVCxPH5TpwZJ
 YuEODrLuwuy3o5MU+a4t5uCBoGqDq5Fnz0Kh5kFfOl1D8qBzqczNzL0Ygn1FyoLd
 LcCchC+kjROX6F+Oo+0onQeOFipSjxkG6LOThSiFsLJAL8huVLDem7ihon/HvqYw
 68lAj2X0QlaIMzk0yKJ2LFovQRhk+nlWGtW6XzVCPetbEkFdmGOcEl13orwh/say
 tkzKGN8O0JLThIxWqEQn1MHK9MeaKlnS9j2tFI3suR65xvjlxE8+cxsmlg/wNzGo
 UyQgh1M8QvPNvDAXCL4q1k2QmCH0YwTY+pHqCIFDp37LRE6ZPboNWEV55YVB8VpL
 axLPf89Any8ta3YICFq/Zmm03A/GUmLsxWspgbtOZMT40loNgZ3YoDR7cPfE76jU
 N2XEZEVOQrvodXW6fjqfx6AAYraxhDo2gh4SZhF0ydXDgGTmK6BcHHu7a3lSp7+e
 eKgYDgkMsa7bP5Cm3+VaNuv7db84kPEBeXLO5zJ86N/nKk8JE97Tl6uiVC8maBC2
 r9ftsQd3fXkwDYmUk81EOjp2+YXY2zEt3vs8cP3euGt39qwjy4mdN05UQa5Xm5BS
 XLWmpO2teA==
 =tmwG
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.8-2020-06-19' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:

 - Catch a case where io_sq_thread() didn't do proper mm acquire

 - Ensure poll completions are reaped on shutdown

 - Async cancelation and run fixes (Pavel)

 - io-poll race fixes (Xiaoguang)

 - Request cleanup race fix (Xiaoguang)

* tag 'io_uring-5.8-2020-06-19' of git://git.kernel.dk/linux-block:
  io_uring: fix possible race condition against REQ_F_NEED_CLEANUP
  io_uring: reap poll completions while waiting for refs to drop on exit
  io_uring: acquire 'mm' for task_work for SQPOLL
  io_uring: add memory barrier to synchronize io_kiocb's result and iopoll_completed
  io_uring: don't fail links for EAGAIN error in IOPOLL mode
  io_uring: cancel by ->task not pid
  io_uring: lazy get task
  io_uring: batch cancel in io_uring_cancel_files()
  io_uring: cancel all task's requests on exit
  io-wq: add an option to cancel all matched reqs
  io-wq: reorder cancellation pending -> running
  io_uring: fix lazy work init
2020-06-19 13:16:58 -07:00
Linus Torvalds
d2b1c81f5f block-5.8-2020-06-19
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl7s0SAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpp+YEACVqFvsfzxKCqa61IzyuOaPfnj9awyP+MY2
 7V6y9sDDHL8sp6aPDbHvqFnqz0O7E+7nHVZD2rf2qc6tKKMvJYNO/BFZSXPvWTZV
 KQ4cBChf/LDwqAKOnI4ZhmF5UcSyyob1yMy4uJ+U0gQiXXrRMbwJ3N1K24a9dr4c
 epkzGavR0Q+PJ9BbUgjACjbRdT+vrP4bOu0cuyCGkIpD9eCerKJ6mFaUAj0FDthD
 bg4BJj+c8Ij6LO0V++Wga6OxccmL43KeP0ky8B3x07PfAl+tDWqsbHSlU2YPtdcq
 5nKgMMTW16mVnZeO2/W0JB7tn89VubsmyvIFcm2KNeeRqSnEZyW9HI8n4kq994Ju
 xMH24lgbsU4trNeYkgOmzPoJJZ+LShkn+rnldyI1U/fhpEYub7DqfVySuT7ti9in
 uFpQdeRUmPsdw92F3+o6h8OYAflpcQQ7CblkzxPEeV4OyzOZasb+S9tMNPe59KBh
 0MtHv9IfzgtDihR6HuXifitXaP+GtH4x3D2z0dzEdooHKHC/+P3WycS5daG+3WKQ
 xV5lJruvpTuxhXKLFAH0wRrxnVlB0VUvhQ21T3WgHrwF0btbdmQMHFc83XOxBIB4
 jHWJMHGc4xp1ZdpWFBC8Cj79OmJh1w/ao8+/cf8SUoTB0LzFce1B8LvwnxgpcpUk
 VjIOrl7zhQ==
 =LeLd
 -----END PGP SIGNATURE-----

Merge tag 'block-5.8-2020-06-19' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - Use import_uuid() where appropriate (Andy)

 - bcache fixes (Coly, Mauricio, Zhiqiang)

 - blktrace sparse warnings fix (Jan)

 - blktrace concurrent setup fix (Luis)

 - blkdev_get use-after-free fix (Jason)

 - Ensure all blk-mq maps are updated (Weiping)

 - Loop invalidate bdev fix (Zheng)

* tag 'block-5.8-2020-06-19' of git://git.kernel.dk/linux-block:
  block: make function 'kill_bdev' static
  loop: replace kill_bdev with invalidate_bdev
  partitions/ldm: Replace uuid_copy() with import_uuid() where it makes sense
  block: update hctx map when use multiple maps
  blktrace: Avoid sparse warnings when assigning q->blk_trace
  blktrace: break out of blktrace setup on concurrent calls
  block: Fix use-after-free in blkdev_get()
  trace/events/block.h: drop kernel-doc for dropped function parameter
  blk-mq: Remove redundant 'return' statement
  bcache: pr_info() format clean up in bcache_device_init()
  bcache: use delayed kworker fo asynchronous devices registration
  bcache: check and adjust logical block size for backing devices
  bcache: fix potential deadlock problem in btree_gc_coalesce
2020-06-19 13:11:26 -07:00
Linus Torvalds
5e857ce6ea Merge branch 'hch' (maccess patches from Christoph Hellwig)
Merge non-faulting memory access cleanups from Christoph Hellwig:
 "Andrew and I decided to drop the patches implementing your suggested
  rename of the probe_kernel_* and probe_user_* helpers from -mm as
  there were way to many conflicts.

  After -rc1 might be a good time for this as all the conflicts are
  resolved now"

This also adds a type safety checking patch on top of the renaming
series to make the subtle behavioral difference between 'get_user()' and
'get_kernel_nofault()' less potentially dangerous and surprising.

* emailed patches from Christoph Hellwig <hch@lst.de>:
  maccess: make get_kernel_nofault() check for minimal type compatibility
  maccess: rename probe_kernel_address to get_kernel_nofault
  maccess: rename probe_user_{read,write} to copy_{from,to}_user_nofault
  maccess: rename probe_kernel_{read,write} to copy_{from,to}_kernel_nofault
2020-06-18 12:35:51 -07:00
Zheng Bin
3373a3461a block: make function 'kill_bdev' static
kill_bdev does not have any external user, so make it static.

Signed-off-by: Zheng Bin <zhengbin13@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-18 09:24:35 -06:00
Xiaoguang Wang
6f2cc1664d io_uring: fix possible race condition against REQ_F_NEED_CLEANUP
In io_read() or io_write(), when io request is submitted successfully,
it'll go through the below sequence:

    kfree(iovec);
    req->flags &= ~REQ_F_NEED_CLEANUP;
    return ret;

But clearing REQ_F_NEED_CLEANUP might be unsafe. The io request may
already have been completed, and then io_complete_rw_iopoll()
and io_complete_rw() will be called, both of which will also modify
req->flags if needed. This causes a race condition, with concurrent
non-atomic modification of req->flags.

To eliminate this race, in io_read() or io_write(), if io request is
submitted successfully, we don't remove REQ_F_NEED_CLEANUP flag. If
REQ_F_NEED_CLEANUP is set, we'll leave __io_req_aux_free() to the
iovec cleanup work correspondingly.

Cc: stable@vger.kernel.org
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-18 08:32:44 -06:00
Jens Axboe
56952e91ac io_uring: reap poll completions while waiting for refs to drop on exit
If we're doing polled IO and end up having requests being submitted
async, then completions can come in while we're waiting for refs to
drop. We need to reap these manually, as nobody else will be looking
for them.

Break the wait into 1/20th of a second time waits, and check for done
poll completions if we time out. Otherwise we can have done poll
completions sitting in ctx->poll_list, which needs us to reap them but
we're just waiting for them.

Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-17 15:05:08 -06:00
Jens Axboe
9d8426a091 io_uring: acquire 'mm' for task_work for SQPOLL
If we're unlucky with timing, we could be running task_work after
having dropped the memory context in the sq thread. Since dropping
the context requires a runnable task state, we cannot reliably drop
it as part of our check-for-work loop in io_sq_thread(). Instead,
abstract out the mm acquire for the sq thread into a helper, and call
it from the async task work handler.

Cc: stable@vger.kernel.org # v5.7
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-17 12:49:16 -06:00
Xiaoguang Wang
bbde017a32 io_uring: add memory barrier to synchronize io_kiocb's result and iopoll_completed
In io_complete_rw_iopoll(), stores to io_kiocb's result and iopoll
completed are two independent store operations, to ensure that once
iopoll_completed is ture and then req->result must been perceived by
the cpu executing io_do_iopoll(), proper memory barrier should be used.

And in io_do_iopoll(), we check whether req->result is EAGAIN, if it is,
we'll need to issue this io request using io-wq again. In order to just
issue a single smp_rmb() on the completion side, move the re-submit work
to io_iopoll_complete().

Cc: stable@vger.kernel.org
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
[axboe: don't set ->iopoll_completed for -EAGAIN retry]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-17 12:49:09 -06:00
Xiaoguang Wang
2d7d67920e io_uring: don't fail links for EAGAIN error in IOPOLL mode
In IOPOLL mode, for EAGAIN error, we'll try to submit io request
again using io-wq, so don't fail rest of links if this io request
has links.

Cc: stable@vger.kernel.org
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-17 12:49:01 -06:00
Christoph Hellwig
fe557319aa maccess: rename probe_kernel_{read,write} to copy_{from,to}_kernel_nofault
Better describe what these functions do.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-17 10:57:41 -07:00
J. Bruce Fields
22cf8419f1 nfsd: apply umask on fs without ACL support
The server is failing to apply the umask when creating new objects on
filesystems without ACL support.

To reproduce this, you need to use NFSv4.2 and a client and server
recent enough to support umask, and you need to export a filesystem that
lacks ACL support (for example, ext4 with the "noacl" mount option).

Filesystems with ACL support are expected to take care of the umask
themselves (usually by calling posix_acl_create).

For filesystems without ACL support, this is up to the caller of
vfs_create(), vfs_mknod(), or vfs_mkdir().

Reported-by: Elliott Mitchell <ehem+debian@m5p.com>
Reported-by: Salvatore Bonaccorso <carnil@debian.org>
Tested-by: Salvatore Bonaccorso <carnil@debian.org>
Fixes: 47057abde5 ("nfsd: add support for the umask attribute")
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-06-17 10:48:58 -04:00
Masami Hiramatsu
4e264ffd95 proc/bootconfig: Fix to use correct quotes for value
Fix /proc/bootconfig to select double or single quotes
corrctly according to the value.

If a bootconfig value includes a double quote character,
we must use single-quotes to quote that value.

This modifies if() condition and blocks for avoiding
double-quote in value check in 2 places. Anyway, since
xbc_array_for_each_value() can handle the array which
has a single node correctly.
Thus,

if (vnode && xbc_node_is_array(vnode)) {
	xbc_array_for_each_value(vnode)	/* vnode->next != NULL */
		...
} else {
	snprintf(val); /* val is an empty string if !vnode */
}

is equivalent to

if (vnode) {
	xbc_array_for_each_value(vnode)	/* vnode->next can be NULL */
		...
} else {
	snprintf("");	/* value is always empty */
}

Link: http://lkml.kernel.org/r/159230244786.65555.3763894451251622488.stgit@devnote2

Cc: stable@vger.kernel.org
Fixes: c1a3c36017 ("proc: bootconfig: Add /proc/bootconfig to show boot config list")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-16 21:21:03 -04:00
Linus Torvalds
26c20ffcb5 AFS fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl7pMu8ACgkQ+7dXa6fL
 C2sgNRAAnOCq281ojebwVSIkRDVGlxBODNeJtcgOOC4ib3jZM++vhdnnJgJIr8kc
 UOQ+LF4E5hNgwELubCrLOx/AjIzVuzfrreFNOPh3P3TSjyxW/7AU+tFGkdnLkYun
 NyadOXxI9Dk84UBN1LrmRm3ccAbF6nDf/KcPykS0oAEh12LVm6sDpVJz9+1uclnK
 Xq0rgl+zrR0+SPplPYz4P/OEPTgNfpLV9DHVYfkvsvEhwb/TaUmiLj9SEgndp+fg
 L3CT66QXoG9zds9hYFVODQM3devaXOpGNU0vsc9+Xg57BWuYvVed24eH5oBrcBQo
 F5kon+mcZlHtmTG87UJ6vFUwfHGeYqKKRb9XTbKbATtIWvkB3XM4Jz/XUlaAIE+R
 y0njNYEoIn4wHkleL/KeHmFPFSYG7pZpAN3wqhXZ9wVptXRDSB10OK3vpgLD/2rM
 V68FmBin6eStE5qZ8Mu9qMQxXb1buknoef37FIXUozjc+VMPrg5dbG6GjcW/CqIC
 LynaNUvrQOvF0ZFVzMt7ffZPrdDYlqqzyN0bReMdibys4BPKo24gSr5aVMLt7YXf
 ZaJeApcSdsphs4uUmtHKlHYgUQrSEl9pSGmc4hcq9bNIKHo9S618LG9uuUplOjdP
 j0L8N6uWBHQCjAvu6kDm8Wp5pRPPUnTgaXDsok7yP2GLRqBEm3Q=
 =bYOZ
 -----END PGP SIGNATURE-----

Merge tag 'afs-fixes-20200616' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull AFS fixes from David Howells:
 "I've managed to get xfstests kind of working with afs. Here are a set
  of patches that fix most of the bugs found.

  There are a number of primary issues:

   - Incorrect handling of mtime and non-handling of ctime. It might be
     argued, that the latter isn't a bug since the AFS protocol doesn't
     support ctime, but I should probably still update it locally.

   - Shared-write mmap, truncate and writeback bugs. This includes not
     changing i_size under the callback lock, overwriting local i_size
     with the reply from the server after a partial writeback, not
     limiting the writeback from an mmapped page to EOF.

   - Checks for an abort code indicating that the primary vnode in an
     operation was deleted by a third-party are done in the wrong place.

   - Silly rename bugs. This includes an incomplete conversion to the
     new operation handling, duplicate nlink handling, nlink changing
     not being done inside the callback lock and insufficient handling
     of third-party conflicting directory changes.

  And some secondary ones:

   - The UAEOVERFLOW abort code should map to EOVERFLOW not EREMOTEIO.

   - Remove a couple of unused or incompletely used bits.

   - Remove a couple of redundant success checks.

  These seem to fix all the data-corruption bugs found by

	./check -afs -g quick

  along with the obvious silly rename bugs and time bugs.

  There are still some test failures, but they seem to fall into two
  classes: firstly, the authentication/security model is different to
  the standard UNIX model and permission is arbitrated by the server and
  cached locally; and secondly, there are a number of features that AFS
  does not support (such as mknod). But in these cases, the tests
  themselves need to be adapted or skipped.

  Using the in-kernel afs client with xfstests also found a bug in the
  AuriStor AFS server that has been fixed for a future release"

* tag 'afs-fixes-20200616' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  afs: Fix silly rename
  afs: afs_vnode_commit_status() doesn't need to check the RPC error
  afs: Fix use of afs_check_for_remote_deletion()
  afs: Remove afs_operation::abort_code
  afs: Fix yfs_fs_fetch_status() to honour vnode selector
  afs: Remove yfs_fs_fetch_file_status() as it's not used
  afs: Fix the mapping of the UAEOVERFLOW abort code
  afs: Fix truncation issues and mmap writeback size
  afs: Concoct ctimes
  afs: Fix EOF corruption
  afs: afs_write_end() should change i_size under the right lock
  afs: Fix non-setting of mtime when writing into mmap
2020-06-16 17:40:51 -07:00
Linus Torvalds
ffbc93768e flexible-array member conversion patches for 5.8-rc2
Hi Linus,
 
 Please, pull the following patches that replace zero-length arrays with
 flexible-array members.
 
 Notice that all of these patches have been baking in linux-next for
 two development cycles now.
 
 There is a regular need in the kernel to provide a way to declare having a
 dynamically sized set of trailing elements in a structure. Kernel code should
 always use “flexible array members”[1] for these cases. The older style of
 one-element or zero-length arrays should no longer be used[2].
 
 C99 introduced “flexible array members”, which lacks a numeric size for the
 array declaration entirely:
 
 struct something {
         size_t count;
         struct foo items[];
 };
 
 This is the way the kernel expects dynamically sized trailing elements to be
 declared. It allows the compiler to generate errors when the flexible array
 does not occur last in the structure, which helps to prevent some kind of
 undefined behavior[3] bugs from being inadvertently introduced to the codebase.
 It also allows the compiler to correctly analyze array sizes (via sizeof(),
 CONFIG_FORTIFY_SOURCE, and CONFIG_UBSAN_BOUNDS). For instance, there is no
 mechanism that warns us that the following application of the sizeof() operator
 to a zero-length array always results in zero:
 
 struct something {
         size_t count;
         struct foo items[0];
 };
 
 struct something *instance;
 
 instance = kmalloc(struct_size(instance, items, count), GFP_KERNEL);
 instance->count = count;
 
 size = sizeof(instance->items) * instance->count;
 memcpy(instance->items, source, size);
 
 At the last line of code above, size turns out to be zero, when one might have
 thought it represents the total size in bytes of the dynamic memory recently
 allocated for the trailing array items. Here are a couple examples of this
 issue[4][5]. Instead, flexible array members have incomplete type, and so the
 sizeof() operator may not be applied[6], so any misuse of such operators will
 be immediately noticed at build time.
 
 The cleanest and least error-prone way to implement this is through the use of
 a flexible array member:
 
 struct something {
         size_t count;
         struct foo items[];
 };
 
 struct something *instance;
 
 instance = kmalloc(struct_size(instance, items, count), GFP_KERNEL);
 instance->count = count;
 
 size = sizeof(instance->items[0]) * instance->count;
 memcpy(instance->items, source, size);
 
 Thanks
 --
 Gustavo
 
 [1] https://en.wikipedia.org/wiki/Flexible_array_member
 [2] https://github.com/KSPP/linux/issues/21
 [3] https://git.kernel.org/linus/76497732932f15e7323dc805e8ea8dc11bb587cf
 [4] https://git.kernel.org/linus/f2cd32a443da694ac4e28fbf4ac6f9d5cc63a539
 [5] https://git.kernel.org/linus/ab91c2a89f86be2898cee208d492816ec238b2cf
 [6] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEkmRahXBSurMIg1YvRwW0y0cG2zEFAl7oSmYACgkQRwW0y0cG
 2zGEiw/9FiH3MBwMlPVJPcneY1wCH/N6ZSf+kr7SJiVwV/YbBe9EWuaKZ0D4vAWm
 kTACkOfsZ1me1OKz9wNrOxn0zezTMFQK2PLPgzKIPuK0Hg8MW1EU63RIRsnr0bPc
 b90wZwyBQtLbGRC3/9yAACKwFZe/SeYoV5rr8uylffA35HZW3SZbTex6XnGCF9Q5
 UYwnz7vNg+9VH1GRQeB5jlqL7mAoRzJ49I/TL3zJr04Mn+xC+vVBS7XwipDd03p+
 foC6/KmGhlCO9HMPASReGrOYNPydDAMKLNPdIfUlcTKHWsoTjGOcW/dzfT4rUu6n
 nKr5rIqJ4FdlIvXZL5P5w7Uhkwbd3mus5G0HBk+V/cUScckCpBou+yuGzjxXSitQ
 o0qPsGjWr3v+gxRWHj8YO/9MhKKKW0Iy+QmAC9+uLnbfJdbUwYbLIXbsOKnokCA8
 jkDEr64F5hFTKtajIK4VToJK1CsM3D9dwTub27lwZysHn3RYSQdcyN+9OiZgdzpc
 GlI6QoaqKR9AT4b/eBmqlQAKgA07zSQ5RsIjRm6hN3d7u/77x2kyrreo+trJyVY2
 F17uEOzfTqZyxtkPayE8DVjTtbByoCuBR0Vm1oMAFxjyqZQY5daalB0DKd1mdYqi
 khIXqNAuYqHOb898fEuzidjV38hxZ9y8SAym3P7WnYl+Hxz+8Jo=
 =8HUQ
 -----END PGP SIGNATURE-----

Merge tag 'flex-array-conversions-5.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux

Pull flexible-array member conversions from Gustavo A. R. Silva:
 "Replace zero-length arrays with flexible-array members.

  Notice that all of these patches have been baking in linux-next for
  two development cycles now.

  There is a regular need in the kernel to provide a way to declare
  having a dynamically sized set of trailing elements in a structure.
  Kernel code should always use “flexible array members”[1] for these
  cases. The older style of one-element or zero-length arrays should no
  longer be used[2].

  C99 introduced “flexible array members”, which lacks a numeric size
  for the array declaration entirely:

        struct something {
                size_t count;
                struct foo items[];
        };

  This is the way the kernel expects dynamically sized trailing elements
  to be declared. It allows the compiler to generate errors when the
  flexible array does not occur last in the structure, which helps to
  prevent some kind of undefined behavior[3] bugs from being
  inadvertently introduced to the codebase.

  It also allows the compiler to correctly analyze array sizes (via
  sizeof(), CONFIG_FORTIFY_SOURCE, and CONFIG_UBSAN_BOUNDS). For
  instance, there is no mechanism that warns us that the following
  application of the sizeof() operator to a zero-length array always
  results in zero:

        struct something {
                size_t count;
                struct foo items[0];
        };

        struct something *instance;

        instance = kmalloc(struct_size(instance, items, count), GFP_KERNEL);
        instance->count = count;

        size = sizeof(instance->items) * instance->count;
        memcpy(instance->items, source, size);

  At the last line of code above, size turns out to be zero, when one
  might have thought it represents the total size in bytes of the
  dynamic memory recently allocated for the trailing array items. Here
  are a couple examples of this issue[4][5].

  Instead, flexible array members have incomplete type, and so the
  sizeof() operator may not be applied[6], so any misuse of such
  operators will be immediately noticed at build time.

  The cleanest and least error-prone way to implement this is through
  the use of a flexible array member:

        struct something {
                size_t count;
                struct foo items[];
        };

        struct something *instance;

        instance = kmalloc(struct_size(instance, items, count), GFP_KERNEL);
        instance->count = count;

        size = sizeof(instance->items[0]) * instance->count;
        memcpy(instance->items, source, size);

  instead"

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
[4] commit f2cd32a443 ("rndis_wlan: Remove logically dead code")
[5] commit ab91c2a89f ("tpm: eventlog: Replace zero-length array with flexible-array member")
[6] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html

* tag 'flex-array-conversions-5.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux: (41 commits)
  w1: Replace zero-length array with flexible-array
  tracing/probe: Replace zero-length array with flexible-array
  soc: ti: Replace zero-length array with flexible-array
  tifm: Replace zero-length array with flexible-array
  dmaengine: tegra-apb: Replace zero-length array with flexible-array
  stm class: Replace zero-length array with flexible-array
  Squashfs: Replace zero-length array with flexible-array
  ASoC: SOF: Replace zero-length array with flexible-array
  ima: Replace zero-length array with flexible-array
  sctp: Replace zero-length array with flexible-array
  phy: samsung: Replace zero-length array with flexible-array
  RxRPC: Replace zero-length array with flexible-array
  rapidio: Replace zero-length array with flexible-array
  media: pwc: Replace zero-length array with flexible-array
  firmware: pcdp: Replace zero-length array with flexible-array
  oprofile: Replace zero-length array with flexible-array
  block: Replace zero-length array with flexible-array
  tools/testing/nvdimm: Replace zero-length array with flexible-array
  libata: Replace zero-length array with flexible-array
  kprobes: Replace zero-length array with flexible-array
  ...
2020-06-16 17:23:57 -07:00
David Howells
b6489a49f7 afs: Fix silly rename
Fix AFS's silly rename by the following means:

 (1) Set the destination directory in afs_do_silly_rename() so as to avoid
     misbehaviour and indicate that the directory data version will
     increment by 1 so as to avoid warnings about unexpected changes in the
     DV.  Also indicate that the ctime should be updated to avoid xfstest
     grumbling.

 (2) Note when the server indicates that a directory changed more than we
     expected (AFS_OPERATION_DIR_CONFLICT), indicating a conflict with a
     third party change, checking on successful completion of unlink and
     rename.

     The problem is that the FS.RemoveFile RPC op doesn't report the status
     of the unlinked file, though YFS.RemoveFile2 does.  This can be
     mitigated by the assumption that if the directory DV cranked by
     exactly 1, we can be sure we removed one link from the file; further,
     ordinarily in AFS, files cannot be hardlinked across directories, so
     if we reduce nlink to 0, the file is deleted.

     However, if the directory DV jumps by more than 1, we cannot know if a
     third party intervened by adding or removing a link on the file we
     just removed a link from.

     The same also goes for any vnode that is at the destination of the
     FS.Rename RPC op.

 (3) Make afs_vnode_commit_status() apply the nlink drop inside the cb_lock
     section along with the other attribute updates if ->op_unlinked is set
     on the descriptor for the appropriate vnode.

 (4) Issue a follow up status fetch to the unlinked file in the event of a
     third party conflict that makes it impossible for us to know if we
     actually deleted the file or not.

 (5) Provide a flag, AFS_VNODE_SILLY_DELETED, to make afs_getattr() lie to
     the user about the nlink of a silly deleted file so that it appears as
     0, not 1.

Found with the generic/035 and generic/084 xfstests.

Fixes: e49c7b2f6d ("afs: Build an abstraction around an "operation" concept")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-16 22:00:28 +01:00
Waiman Long
b091f7fede btrfs: use kfree() in btrfs_ioctl_get_subvol_info()
In btrfs_ioctl_get_subvol_info(), there is a classic case where kzalloc()
was incorrectly paired with kzfree(). According to David Sterba, there
isn't any sensitive information in the subvol_info that needs to be
cleared before freeing. So kzfree() isn't really needed, use kfree()
instead.

Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-06-16 19:24:03 +02:00
Filipe Manana
5dbb75ed69 btrfs: fix RWF_NOWAIT writes blocking on extent locks and waiting for IO
A RWF_NOWAIT write is not supposed to wait on filesystem locks that can be
held for a long time or for ongoing IO to complete.

However when calling check_can_nocow(), if the inode has prealloc extents
or has the NOCOW flag set, we can block on extent (file range) locks
through the call to btrfs_lock_and_flush_ordered_range(). Such lock can
take a significant amount of time to be available. For example, a fiemap
task may be running, and iterating through the entire file range checking
all extents and doing backref walking to determine if they are shared,
or a readpage operation may be in progress.

Also at btrfs_lock_and_flush_ordered_range(), called by check_can_nocow(),
after locking the file range we wait for any existing ordered extent that
is in progress to complete. Another operation that can take a significant
amount of time and defeat the purpose of RWF_NOWAIT.

So fix this by trying to lock the file range and if it's currently locked
return -EAGAIN to user space. If we are able to lock the file range without
waiting and there is an ordered extent in the range, return -EAGAIN as
well, instead of waiting for it to complete. Finally, don't bother trying
to lock the snapshot lock of the root when attempting a RWF_NOWAIT write,
as that is only important for buffered writes.

Fixes: edf064e7c6 ("btrfs: nowait aio support")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-06-16 19:22:45 +02:00
Filipe Manana
260a63395f btrfs: fix RWF_NOWAIT write not failling when we need to cow
If we attempt to do a RWF_NOWAIT write against a file range for which we
can only do NOCOW for a part of it, due to the existence of holes or
shared extents for example, we proceed with the write as if it were
possible to NOCOW the whole range.

Example:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ touch /mnt/sdj/bar
  $ chattr +C /mnt/sdj/bar

  $ xfs_io -d -c "pwrite -S 0xab -b 256K 0 256K" /mnt/bar
  wrote 262144/262144 bytes at offset 0
  256 KiB, 1 ops; 0.0003 sec (694.444 MiB/sec and 2777.7778 ops/sec)

  $ xfs_io -c "fpunch 64K 64K" /mnt/bar
  $ sync

  $ xfs_io -d -c "pwrite -N -V 1 -b 128K -S 0xfe 0 128K" /mnt/bar
  wrote 131072/131072 bytes at offset 0
  128 KiB, 1 ops; 0.0007 sec (160.051 MiB/sec and 1280.4097 ops/sec)

This last write should fail with -EAGAIN since the file range from 64K to
128K is a hole. On xfs it fails, as expected, but on ext4 it currently
succeeds because apparently it is expensive to check if there are extents
allocated for the whole range, but I'll check with the ext4 people.

Fix the issue by checking if check_can_nocow() returns a number of
NOCOW'able bytes smaller then the requested number of bytes, and if it
does return -EAGAIN.

Fixes: edf064e7c6 ("btrfs: nowait aio support")
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-06-16 19:22:37 +02:00
Filipe Manana
4b1946284d btrfs: fix failure of RWF_NOWAIT write into prealloc extent beyond eof
If we attempt to write to prealloc extent located after eof using a
RWF_NOWAIT write, we always fail with -EAGAIN.

We do actually check if we have an allocated extent for the write at
the start of btrfs_file_write_iter() through a call to check_can_nocow(),
but later when we go into the actual direct IO write path we simply
return -EAGAIN if the write starts at or beyond EOF.

Trivial to reproduce:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ touch /mnt/foo
  $ chattr +C /mnt/foo

  $ xfs_io -d -c "pwrite -S 0xab 0 64K" /mnt/foo
  wrote 65536/65536 bytes at offset 0
  64 KiB, 16 ops; 0.0004 sec (135.575 MiB/sec and 34707.1584 ops/sec)

  $ xfs_io -c "falloc -k 64K 1M" /mnt/foo

  $ xfs_io -d -c "pwrite -N -V 1 -S 0xfe -b 64K 64K 64K" /mnt/foo
  pwrite: Resource temporarily unavailable

On xfs and ext4 the write succeeds, as expected.

Fix this by removing the wrong check at btrfs_direct_IO().

Fixes: edf064e7c6 ("btrfs: nowait aio support")
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-06-16 19:22:31 +02:00
Filipe Manana
f2cb2f39cc btrfs: fix hang on snapshot creation after RWF_NOWAIT write
If we do a successful RWF_NOWAIT write we end up locking the snapshot lock
of the inode, through a call to check_can_nocow(), but we never unlock it.

This means the next attempt to create a snapshot on the subvolume will
hang forever.

Trivial reproducer:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ touch /mnt/foobar
  $ chattr +C /mnt/foobar
  $ xfs_io -d -c "pwrite -S 0xab 0 64K" /mnt/foobar
  $ xfs_io -d -c "pwrite -N -V 1 -S 0xfe 0 64K" /mnt/foobar

  $ btrfs subvolume snapshot -r /mnt /mnt/snap
    --> hangs

Fix this by unlocking the snapshot lock if check_can_nocow() returned
success.

Fixes: edf064e7c6 ("btrfs: nowait aio support")
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-06-16 19:22:27 +02:00
Filipe Manana
e7a79811d0 btrfs: check if a log root exists before locking the log_mutex on unlink
This brings back an optimization that commit e678934cbe ("btrfs:
Remove unnecessary check from join_running_log_trans") removed, but in
a different form. So it's almost equivalent to a revert.

That commit removed an optimization where we avoid locking a root's
log_mutex when there is no log tree created in the current transaction.
The affected code path is triggered through unlink operations.

That commit was based on the assumption that the optimization was not
necessary because we used to have the following checks when the patch
was authored:

  int btrfs_del_dir_entries_in_log(...)
  {
        (...)
        if (dir->logged_trans < trans->transid)
            return 0;

        ret = join_running_log_trans(root);
        (...)
   }

   int btrfs_del_inode_ref_in_log(...)
   {
        (...)
        if (inode->logged_trans < trans->transid)
            return 0;

        ret = join_running_log_trans(root);
        (...)
   }

However before that patch was merged, another patch was merged first which
replaced those checks because they were buggy.

That other patch corresponds to commit 803f0f64d1 ("Btrfs: fix fsync
not persisting dentry deletions due to inode evictions"). The assumption
that if the logged_trans field of an inode had a smaller value then the
current transaction's generation (transid) meant that the inode was not
logged in the current transaction was only correct if the inode was not
evicted and reloaded in the current transaction. So the corresponding bug
fix changed those checks and replaced them with the following helper
function:

  static bool inode_logged(struct btrfs_trans_handle *trans,
                           struct btrfs_inode *inode)
  {
        if (inode->logged_trans == trans->transid)
                return true;

        if (inode->last_trans == trans->transid &&
            test_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &inode->runtime_flags) &&
            !test_bit(BTRFS_FS_LOG_RECOVERING, &trans->fs_info->flags))
                return true;

        return false;
  }

So if we have a subvolume without a log tree in the current transaction
(because we had no fsyncs), every time we unlink an inode we can end up
trying to lock the log_mutex of the root through join_running_log_trans()
twice, once for the inode being unlinked (by btrfs_del_inode_ref_in_log())
and once for the parent directory (with btrfs_del_dir_entries_in_log()).

This means if we have several unlink operations happening in parallel for
inodes in the same subvolume, and the those inodes and/or their parent
inode were changed in the current transaction, we end up having a lot of
contention on the log_mutex.

The test robots from intel reported a -30.7% performance regression for
a REAIM test after commit e678934cbe ("btrfs: Remove unnecessary check
from join_running_log_trans").

So just bring back the optimization to join_running_log_trans() where we
check first if a log root exists before trying to lock the log_mutex. This
is done by checking for a bit that is set on the root when a log tree is
created and removed when a log tree is freed (at transaction commit time).

Commit e678934cbe ("btrfs: Remove unnecessary check from
join_running_log_trans") was merged in the 5.4 merge window while commit
803f0f64d1 ("Btrfs: fix fsync not persisting dentry deletions due to
inode evictions") was merged in the 5.3 merge window. But the first
commit was actually authored before the second commit (May 23 2019 vs
June 19 2019).

Reported-by: kernel test robot <rong.a.chen@intel.com>
Link: https://lore.kernel.org/lkml/20200611090233.GL12456@shao2-debian/
Fixes: e678934cbe ("btrfs: Remove unnecessary check from join_running_log_trans")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-06-16 19:22:23 +02:00
Filipe Manana
6bd335b469 btrfs: fix bytes_may_use underflow when running balance and scrub in parallel
When balance and scrub are running in parallel it is possible to end up
with an underflow of the bytes_may_use counter of the data space_info
object, which triggers a warning like the following:

   [134243.793196] BTRFS info (device sdc): relocating block group 1104150528 flags data
   [134243.806891] ------------[ cut here ]------------
   [134243.807561] WARNING: CPU: 1 PID: 26884 at fs/btrfs/space-info.h:125 btrfs_add_reserved_bytes+0x1da/0x280 [btrfs]
   [134243.808819] Modules linked in: btrfs blake2b_generic xor (...)
   [134243.815779] CPU: 1 PID: 26884 Comm: kworker/u8:8 Tainted: G        W         5.6.0-rc7-btrfs-next-58 #5
   [134243.816944] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
   [134243.818389] Workqueue: writeback wb_workfn (flush-btrfs-108483)
   [134243.819186] RIP: 0010:btrfs_add_reserved_bytes+0x1da/0x280 [btrfs]
   [134243.819963] Code: 0b f2 85 (...)
   [134243.822271] RSP: 0018:ffffa4160aae7510 EFLAGS: 00010287
   [134243.822929] RAX: 000000000000c000 RBX: ffff96159a8c1000 RCX: 0000000000000000
   [134243.823816] RDX: 0000000000008000 RSI: 0000000000000000 RDI: ffff96158067a810
   [134243.824742] RBP: ffff96158067a800 R08: 0000000000000001 R09: 0000000000000000
   [134243.825636] R10: ffff961501432a40 R11: 0000000000000000 R12: 000000000000c000
   [134243.826532] R13: 0000000000000001 R14: ffffffffffff4000 R15: ffff96158067a810
   [134243.827432] FS:  0000000000000000(0000) GS:ffff9615baa00000(0000) knlGS:0000000000000000
   [134243.828451] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   [134243.829184] CR2: 000055bd7e414000 CR3: 00000001077be004 CR4: 00000000003606e0
   [134243.830083] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
   [134243.830975] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
   [134243.831867] Call Trace:
   [134243.832211]  find_free_extent+0x4a0/0x16c0 [btrfs]
   [134243.832846]  btrfs_reserve_extent+0x91/0x180 [btrfs]
   [134243.833487]  cow_file_range+0x12d/0x490 [btrfs]
   [134243.834080]  fallback_to_cow+0x82/0x1b0 [btrfs]
   [134243.834689]  ? release_extent_buffer+0x121/0x170 [btrfs]
   [134243.835370]  run_delalloc_nocow+0x33f/0xa30 [btrfs]
   [134243.836032]  btrfs_run_delalloc_range+0x1ea/0x6d0 [btrfs]
   [134243.836725]  ? find_lock_delalloc_range+0x221/0x250 [btrfs]
   [134243.837450]  writepage_delalloc+0xe8/0x150 [btrfs]
   [134243.838059]  __extent_writepage+0xe8/0x4c0 [btrfs]
   [134243.838674]  extent_write_cache_pages+0x237/0x530 [btrfs]
   [134243.839364]  extent_writepages+0x44/0xa0 [btrfs]
   [134243.839946]  do_writepages+0x23/0x80
   [134243.840401]  __writeback_single_inode+0x59/0x700
   [134243.841006]  writeback_sb_inodes+0x267/0x5f0
   [134243.841548]  __writeback_inodes_wb+0x87/0xe0
   [134243.842091]  wb_writeback+0x382/0x590
   [134243.842574]  ? wb_workfn+0x4a2/0x6c0
   [134243.843030]  wb_workfn+0x4a2/0x6c0
   [134243.843468]  process_one_work+0x26d/0x6a0
   [134243.843978]  worker_thread+0x4f/0x3e0
   [134243.844452]  ? process_one_work+0x6a0/0x6a0
   [134243.844981]  kthread+0x103/0x140
   [134243.845400]  ? kthread_create_worker_on_cpu+0x70/0x70
   [134243.846030]  ret_from_fork+0x3a/0x50
   [134243.846494] irq event stamp: 0
   [134243.846892] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
   [134243.847682] hardirqs last disabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020
   [134243.848687] softirqs last  enabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020
   [134243.849913] softirqs last disabled at (0): [<0000000000000000>] 0x0
   [134243.850698] ---[ end trace bd7c03622e0b0a96 ]---
   [134243.851335] ------------[ cut here ]------------

When relocating a data block group, for each extent allocated in the
block group we preallocate another extent with the same size for the
data relocation inode (we do it at prealloc_file_extent_cluster()).
We reserve space by calling btrfs_check_data_free_space(), which ends
up incrementing the data space_info's bytes_may_use counter, and
then call btrfs_prealloc_file_range() to allocate the extent, which
always decrements the bytes_may_use counter by the same amount.

The expectation is that writeback of the data relocation inode always
follows a NOCOW path, by writing into the preallocated extents. However,
when starting writeback we might end up falling back into the COW path,
because the block group that contains the preallocated extent was turned
into RO mode by a scrub running in parallel. The COW path then calls the
extent allocator which ends up calling btrfs_add_reserved_bytes(), and
this function decrements the bytes_may_use counter of the data space_info
object by an amount corresponding to the size of the allocated extent,
despite we haven't previously incremented it. When the counter currently
has a value smaller then the allocated extent we reset the counter to 0
and emit a warning, otherwise we just decrement it and slowly mess up
with this counter which is crucial for space reservation, the end result
can be granting reserved space to tasks when there isn't really enough
free space, and having the tasks fail later in critical places where
error handling consists of a transaction abort or hitting a BUG_ON().

Fix this by making sure that if we fallback to the COW path for a data
relocation inode, we increment the bytes_may_use counter of the data
space_info object. The COW path will then decrement it at
btrfs_add_reserved_bytes() on success or through its error handling part
by a call to extent_clear_unlock_delalloc() (which ends up calling
btrfs_clear_delalloc_extent() that does the decrement operation) in case
of an error.

Test case btrfs/061 from fstests could sporadically trigger this.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-06-16 19:21:31 +02:00
Filipe Manana
432cd2a10f btrfs: fix data block group relocation failure due to concurrent scrub
When running relocation of a data block group while scrub is running in
parallel, it is possible that the relocation will fail and abort the
current transaction with an -EINVAL error:

   [134243.988595] BTRFS info (device sdc): found 14 extents, stage: move data extents
   [134243.999871] ------------[ cut here ]------------
   [134244.000741] BTRFS: Transaction aborted (error -22)
   [134244.001692] WARNING: CPU: 0 PID: 26954 at fs/btrfs/ctree.c:1071 __btrfs_cow_block+0x6a7/0x790 [btrfs]
   [134244.003380] Modules linked in: btrfs blake2b_generic xor raid6_pq (...)
   [134244.012577] CPU: 0 PID: 26954 Comm: btrfs Tainted: G        W         5.6.0-rc7-btrfs-next-58 #5
   [134244.014162] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
   [134244.016184] RIP: 0010:__btrfs_cow_block+0x6a7/0x790 [btrfs]
   [134244.017151] Code: 48 c7 c7 (...)
   [134244.020549] RSP: 0018:ffffa41607863888 EFLAGS: 00010286
   [134244.021515] RAX: 0000000000000000 RBX: ffff9614bdfe09c8 RCX: 0000000000000000
   [134244.022822] RDX: 0000000000000001 RSI: ffffffffb3d63980 RDI: 0000000000000001
   [134244.024124] RBP: ffff961589e8c000 R08: 0000000000000000 R09: 0000000000000001
   [134244.025424] R10: ffffffffc0ae5955 R11: 0000000000000000 R12: ffff9614bd530d08
   [134244.026725] R13: ffff9614ced41b88 R14: ffff9614bdfe2a48 R15: 0000000000000000
   [134244.028024] FS:  00007f29b63c08c0(0000) GS:ffff9615ba600000(0000) knlGS:0000000000000000
   [134244.029491] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   [134244.030560] CR2: 00007f4eb339b000 CR3: 0000000130d6e006 CR4: 00000000003606f0
   [134244.031997] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
   [134244.033153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
   [134244.034484] Call Trace:
   [134244.034984]  btrfs_cow_block+0x12b/0x2b0 [btrfs]
   [134244.035859]  do_relocation+0x30b/0x790 [btrfs]
   [134244.036681]  ? do_raw_spin_unlock+0x49/0xc0
   [134244.037460]  ? _raw_spin_unlock+0x29/0x40
   [134244.038235]  relocate_tree_blocks+0x37b/0x730 [btrfs]
   [134244.039245]  relocate_block_group+0x388/0x770 [btrfs]
   [134244.040228]  btrfs_relocate_block_group+0x161/0x2e0 [btrfs]
   [134244.041323]  btrfs_relocate_chunk+0x36/0x110 [btrfs]
   [134244.041345]  btrfs_balance+0xc06/0x1860 [btrfs]
   [134244.043382]  ? btrfs_ioctl_balance+0x27c/0x310 [btrfs]
   [134244.045586]  btrfs_ioctl_balance+0x1ed/0x310 [btrfs]
   [134244.045611]  btrfs_ioctl+0x1880/0x3760 [btrfs]
   [134244.049043]  ? do_raw_spin_unlock+0x49/0xc0
   [134244.049838]  ? _raw_spin_unlock+0x29/0x40
   [134244.050587]  ? __handle_mm_fault+0x11b3/0x14b0
   [134244.051417]  ? ksys_ioctl+0x92/0xb0
   [134244.052070]  ksys_ioctl+0x92/0xb0
   [134244.052701]  ? trace_hardirqs_off_thunk+0x1a/0x1c
   [134244.053511]  __x64_sys_ioctl+0x16/0x20
   [134244.054206]  do_syscall_64+0x5c/0x280
   [134244.054891]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
   [134244.055819] RIP: 0033:0x7f29b51c9dd7
   [134244.056491] Code: 00 00 00 (...)
   [134244.059767] RSP: 002b:00007ffcccc1dd08 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
   [134244.061168] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f29b51c9dd7
   [134244.062474] RDX: 00007ffcccc1dda0 RSI: 00000000c4009420 RDI: 0000000000000003
   [134244.063771] RBP: 0000000000000003 R08: 00005565cea4b000 R09: 0000000000000000
   [134244.065032] R10: 0000000000000541 R11: 0000000000000202 R12: 00007ffcccc2060a
   [134244.066327] R13: 00007ffcccc1dda0 R14: 0000000000000002 R15: 00007ffcccc1dec0
   [134244.067626] irq event stamp: 0
   [134244.068202] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
   [134244.069351] hardirqs last disabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020
   [134244.070909] softirqs last  enabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020
   [134244.072392] softirqs last disabled at (0): [<0000000000000000>] 0x0
   [134244.073432] ---[ end trace bd7c03622e0b0a99 ]---

The -EINVAL error comes from the following chain of function calls:

  __btrfs_cow_block() <-- aborts the transaction
    btrfs_reloc_cow_block()
      replace_file_extents()
        get_new_location() <-- returns -EINVAL

When relocating a data block group, for each allocated extent of the block
group, we preallocate another extent (at prealloc_file_extent_cluster()),
associated with the data relocation inode, and then dirty all its pages.
These preallocated extents have, and must have, the same size that extents
from the data block group being relocated have.

Later before we start the relocation stage that updates pointers (bytenr
field of file extent items) to point to the the new extents, we trigger
writeback for the data relocation inode. The expectation is that writeback
will write the pages to the previously preallocated extents, that it
follows the NOCOW path. That is generally the case, however, if a scrub
is running it may have turned the block group that contains those extents
into RO mode, in which case writeback falls back to the COW path.

However in the COW path instead of allocating exactly one extent with the
expected size, the allocator may end up allocating several smaller extents
due to free space fragmentation - because we tell it at cow_file_range()
that the minimum allocation size can match the filesystem's sector size.
This later breaks the relocation's expectation that an extent associated
to a file extent item in the data relocation inode has the same size as
the respective extent pointed by a file extent item in another tree - in
this case the extent to which the relocation inode poins to is smaller,
causing relocation.c:get_new_location() to return -EINVAL.

For example, if we are relocating a data block group X that has a logical
address of X and the block group has an extent allocated at the logical
address X + 128KiB with a size of 64KiB:

1) At prealloc_file_extent_cluster() we allocate an extent for the data
   relocation inode with a size of 64KiB and associate it to the file
   offset 128KiB (X + 128KiB - X) of the data relocation inode. This
   preallocated extent was allocated at block group Z;

2) A scrub running in parallel turns block group Z into RO mode and
   starts scrubing its extents;

3) Relocation triggers writeback for the data relocation inode;

4) When running delalloc (btrfs_run_delalloc_range()), we try first the
   NOCOW path because the data relocation inode has BTRFS_INODE_PREALLOC
   set in its flags. However, because block group Z is in RO mode, the
   NOCOW path (run_delalloc_nocow()) falls back into the COW path, by
   calling cow_file_range();

5) At cow_file_range(), in the first iteration of the while loop we call
   btrfs_reserve_extent() to allocate a 64KiB extent and pass it a minimum
   allocation size of 4KiB (fs_info->sectorsize). Due to free space
   fragmentation, btrfs_reserve_extent() ends up allocating two extents
   of 32KiB each, each one on a different iteration of that while loop;

6) Writeback of the data relocation inode completes;

7) Relocation proceeds and ends up at relocation.c:replace_file_extents(),
   with a leaf which has a file extent item that points to the data extent
   from block group X, that has a logical address (bytenr) of X + 128KiB
   and a size of 64KiB. Then it calls get_new_location(), which does a
   lookup in the data relocation tree for a file extent item starting at
   offset 128KiB (X + 128KiB - X) and belonging to the data relocation
   inode. It finds a corresponding file extent item, however that item
   points to an extent that has a size of 32KiB, which doesn't match the
   expected size of 64KiB, resuling in -EINVAL being returned from this
   function and propagated up to __btrfs_cow_block(), which aborts the
   current transaction.

To fix this make sure that at cow_file_range() when we call the allocator
we pass it a minimum allocation size corresponding the desired extent size
if the inode belongs to the data relocation tree, otherwise pass it the
filesystem's sector size as the minimum allocation size.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-06-16 19:21:25 +02:00
Filipe Manana
ffcb9d4457 btrfs: fix race between block group removal and block group creation
There is a race between block group removal and block group creation
when the removal is completed by a task running fitrim or scrub. When
this happens we end up failing the block group creation with an error
-EEXIST since we attempt to insert a duplicate block group item key
in the extent tree. That results in a transaction abort.

The race happens like this:

1) Task A is doing a fitrim, and at btrfs_trim_block_group() it freezes
   block group X with btrfs_freeze_block_group() (until very recently
   that was named btrfs_get_block_group_trimming());

2) Task B starts removing block group X, either because it's now unused
   or due to relocation for example. So at btrfs_remove_block_group(),
   while holding the chunk mutex and the block group's lock, it sets
   the 'removed' flag of the block group and it sets the local variable
   'remove_em' to false, because the block group is currently frozen
   (its 'frozen' counter is > 0, until very recently this counter was
   named 'trimming');

3) Task B unlocks the block group and the chunk mutex;

4) Task A is done trimming the block group and unfreezes the block group
   by calling btrfs_unfreeze_block_group() (until very recently this was
   named btrfs_put_block_group_trimming()). In this function we lock the
   block group and set the local variable 'cleanup' to true because we
   were able to decrement the block group's 'frozen' counter down to 0 and
   the flag 'removed' is set in the block group.

   Since 'cleanup' is set to true, it locks the chunk mutex and removes
   the extent mapping representing the block group from the mapping tree;

5) Task C allocates a new block group Y and it picks up the logical address
   that block group X had as the logical address for Y, because X was the
   block group with the highest logical address and now the second block
   group with the highest logical address, the last in the fs mapping tree,
   ends at an offset corresponding to block group X's logical address (this
   logical address selection is done at volumes.c:find_next_chunk()).

   At this point the new block group Y does not have yet its item added
   to the extent tree (nor the corresponding device extent items and
   chunk item in the device and chunk trees). The new group Y is added to
   the list of pending block groups in the transaction handle;

6) Before task B proceeds to removing the block group item for block
   group X from the extent tree, which has a key matching:

   (X logical offset, BTRFS_BLOCK_GROUP_ITEM_KEY, length)

   task C while ending its transaction handle calls
   btrfs_create_pending_block_groups(), which finds block group Y and
   tries to insert the block group item for Y into the exten tree, which
   fails with -EEXIST since logical offset is the same that X had and
   task B hasn't yet deleted the key from the extent tree.
   This failure results in a transaction abort, producing a stack like
   the following:

------------[ cut here ]------------
 BTRFS: Transaction aborted (error -17)
 WARNING: CPU: 2 PID: 19736 at fs/btrfs/block-group.c:2074 btrfs_create_pending_block_groups+0x1eb/0x260 [btrfs]
 Modules linked in: btrfs blake2b_generic xor raid6_pq (...)
 CPU: 2 PID: 19736 Comm: fsstress Tainted: G        W         5.6.0-rc7-btrfs-next-58 #5
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
 RIP: 0010:btrfs_create_pending_block_groups+0x1eb/0x260 [btrfs]
 Code: ff ff ff 48 8b 55 50 f0 48 (...)
 RSP: 0018:ffffa4160a1c7d58 EFLAGS: 00010286
 RAX: 0000000000000000 RBX: ffff961581909d98 RCX: 0000000000000000
 RDX: 0000000000000001 RSI: ffffffffb3d63990 RDI: 0000000000000001
 RBP: ffff9614f3356a58 R08: 0000000000000000 R09: 0000000000000001
 R10: ffff9615b65b0040 R11: 0000000000000000 R12: ffff961581909c10
 R13: ffff9615b0c32000 R14: ffff9614f3356ab0 R15: ffff9614be779000
 FS:  00007f2ce2841e80(0000) GS:ffff9615bae00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000555f18780000 CR3: 0000000131d34005 CR4: 00000000003606e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  btrfs_start_dirty_block_groups+0x398/0x4e0 [btrfs]
  btrfs_commit_transaction+0xd0/0xc50 [btrfs]
  ? btrfs_attach_transaction_barrier+0x1e/0x50 [btrfs]
  ? __ia32_sys_fdatasync+0x20/0x20
  iterate_supers+0xdb/0x180
  ksys_sync+0x60/0xb0
  __ia32_sys_sync+0xa/0x10
  do_syscall_64+0x5c/0x280
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
 RIP: 0033:0x7f2ce1d4d5b7
 Code: 83 c4 08 48 3d 01 (...)
 RSP: 002b:00007ffd8b558c58 EFLAGS: 00000202 ORIG_RAX: 00000000000000a2
 RAX: ffffffffffffffda RBX: 000000000000002c RCX: 00007f2ce1d4d5b7
 RDX: 00000000ffffffff RSI: 00000000186ba07b RDI: 000000000000002c
 RBP: 0000555f17b9e520 R08: 0000000000000012 R09: 000000000000ce00
 R10: 0000000000000078 R11: 0000000000000202 R12: 0000000000000032
 R13: 0000000051eb851f R14: 00007ffd8b558cd0 R15: 0000555f1798ec20
 irq event stamp: 0
 hardirqs last  enabled at (0): [<0000000000000000>] 0x0
 hardirqs last disabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020
 softirqs last  enabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020
 softirqs last disabled at (0): [<0000000000000000>] 0x0
 ---[ end trace bd7c03622e0b0a9c ]---

Fix this simply by making btrfs_remove_block_group() remove the block
group's item from the extent tree before it flags the block group as
removed. Also make the free space deletion from the free space tree
before flagging the block group as removed, to avoid a similar race
with adding and removing free space entries for the free space tree.

Fixes: 04216820fe ("Btrfs: fix race between fs trimming and block group remove/allocation")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-06-16 19:20:58 +02:00
Filipe Manana
9fecd13202 btrfs: fix a block group ref counter leak after failure to remove block group
When removing a block group, if we fail to delete the block group's item
from the extent tree, we jump to the 'out' label and end up decrementing
the block group's reference count once only (by 1), resulting in a counter
leak because the block group at that point was already removed from the
block group cache rbtree - so we have to decrement the reference count
twice, once for the rbtree and once for our lookup at the start of the
function.

There is a second bug where if removing the free space tree entries (the
call to remove_block_group_free_space()) fails we end up jumping to the
'out_put_group' label but end up decrementing the reference count only
once, when we should have done it twice, since we have already removed
the block group from the block group cache rbtree. This happens because
the reference count decrement for the rbtree reference happens after
attempting to remove the free space tree entries, which is far away from
the place where we remove the block group from the rbtree.

To make things less error prone, decrement the reference count for the
rbtree immediately after removing the block group from it. This also
eleminates the need for two different exit labels on error, renaming
'out_put_label' to just 'out' and removing the old 'out'.

Fixes: f6033c5e33 ("btrfs: fix block group leak when removing fails")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-06-16 19:20:51 +02:00
Jason Yan
2d3a8e2ded block: Fix use-after-free in blkdev_get()
In blkdev_get() we call __blkdev_get() to do some internal jobs and if
there is some errors in __blkdev_get(), the bdput() is called which
means we have released the refcount of the bdev (actually the refcount of
the bdev inode). This means we cannot access bdev after that point. But
acctually bdev is still accessed in blkdev_get() after calling
__blkdev_get(). This results in use-after-free if the refcount is the
last one we released in __blkdev_get(). Let's take a look at the
following scenerio:

  CPU0            CPU1                    CPU2
blkdev_open     blkdev_open           Remove disk
                  bd_acquire
		  blkdev_get
		    __blkdev_get      del_gendisk
					bdev_unhash_inode
  bd_acquire          bdev_get_gendisk
    bd_forget           failed because of unhashed
	  bdput
	              bdput (the last one)
		        bdev_evict_inode

	  	    access bdev => use after free

[  459.350216] BUG: KASAN: use-after-free in __lock_acquire+0x24c1/0x31b0
[  459.351190] Read of size 8 at addr ffff88806c815a80 by task syz-executor.0/20132
[  459.352347]
[  459.352594] CPU: 0 PID: 20132 Comm: syz-executor.0 Not tainted 4.19.90 #2
[  459.353628] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[  459.354947] Call Trace:
[  459.355337]  dump_stack+0x111/0x19e
[  459.355879]  ? __lock_acquire+0x24c1/0x31b0
[  459.356523]  print_address_description+0x60/0x223
[  459.357248]  ? __lock_acquire+0x24c1/0x31b0
[  459.357887]  kasan_report.cold+0xae/0x2d8
[  459.358503]  __lock_acquire+0x24c1/0x31b0
[  459.359120]  ? _raw_spin_unlock_irq+0x24/0x40
[  459.359784]  ? lockdep_hardirqs_on+0x37b/0x580
[  459.360465]  ? _raw_spin_unlock_irq+0x24/0x40
[  459.361123]  ? finish_task_switch+0x125/0x600
[  459.361812]  ? finish_task_switch+0xee/0x600
[  459.362471]  ? mark_held_locks+0xf0/0xf0
[  459.363108]  ? __schedule+0x96f/0x21d0
[  459.363716]  lock_acquire+0x111/0x320
[  459.364285]  ? blkdev_get+0xce/0xbe0
[  459.364846]  ? blkdev_get+0xce/0xbe0
[  459.365390]  __mutex_lock+0xf9/0x12a0
[  459.365948]  ? blkdev_get+0xce/0xbe0
[  459.366493]  ? bdev_evict_inode+0x1f0/0x1f0
[  459.367130]  ? blkdev_get+0xce/0xbe0
[  459.367678]  ? destroy_inode+0xbc/0x110
[  459.368261]  ? mutex_trylock+0x1a0/0x1a0
[  459.368867]  ? __blkdev_get+0x3e6/0x1280
[  459.369463]  ? bdev_disk_changed+0x1d0/0x1d0
[  459.370114]  ? blkdev_get+0xce/0xbe0
[  459.370656]  blkdev_get+0xce/0xbe0
[  459.371178]  ? find_held_lock+0x2c/0x110
[  459.371774]  ? __blkdev_get+0x1280/0x1280
[  459.372383]  ? lock_downgrade+0x680/0x680
[  459.373002]  ? lock_acquire+0x111/0x320
[  459.373587]  ? bd_acquire+0x21/0x2c0
[  459.374134]  ? do_raw_spin_unlock+0x4f/0x250
[  459.374780]  blkdev_open+0x202/0x290
[  459.375325]  do_dentry_open+0x49e/0x1050
[  459.375924]  ? blkdev_get_by_dev+0x70/0x70
[  459.376543]  ? __x64_sys_fchdir+0x1f0/0x1f0
[  459.377192]  ? inode_permission+0xbe/0x3a0
[  459.377818]  path_openat+0x148c/0x3f50
[  459.378392]  ? kmem_cache_alloc+0xd5/0x280
[  459.379016]  ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  459.379802]  ? path_lookupat.isra.0+0x900/0x900
[  459.380489]  ? __lock_is_held+0xad/0x140
[  459.381093]  do_filp_open+0x1a1/0x280
[  459.381654]  ? may_open_dev+0xf0/0xf0
[  459.382214]  ? find_held_lock+0x2c/0x110
[  459.382816]  ? lock_downgrade+0x680/0x680
[  459.383425]  ? __lock_is_held+0xad/0x140
[  459.384024]  ? do_raw_spin_unlock+0x4f/0x250
[  459.384668]  ? _raw_spin_unlock+0x1f/0x30
[  459.385280]  ? __alloc_fd+0x448/0x560
[  459.385841]  do_sys_open+0x3c3/0x500
[  459.386386]  ? filp_open+0x70/0x70
[  459.386911]  ? trace_hardirqs_on_thunk+0x1a/0x1c
[  459.387610]  ? trace_hardirqs_off_caller+0x55/0x1c0
[  459.388342]  ? do_syscall_64+0x1a/0x520
[  459.388930]  do_syscall_64+0xc3/0x520
[  459.389490]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  459.390248] RIP: 0033:0x416211
[  459.390720] Code: 75 14 b8 02 00 00 00 0f 05 48 3d 01 f0 ff ff 0f 83
04 19 00 00 c3 48 83 ec 08 e8 0a fa ff ff 48 89 04 24 b8 02 00 00 00 0f
   05 <48> 8b 3c 24 48 89 c2 e8 53 fa ff ff 48 89 d0 48 83 c4 08 48 3d
      01
[  459.393483] RSP: 002b:00007fe45dfe9a60 EFLAGS: 00000293 ORIG_RAX: 0000000000000002
[  459.394610] RAX: ffffffffffffffda RBX: 00007fe45dfea6d4 RCX: 0000000000416211
[  459.395678] RDX: 00007fe45dfe9b0a RSI: 0000000000000002 RDI: 00007fe45dfe9b00
[  459.396758] RBP: 000000000076bf20 R08: 0000000000000000 R09: 000000000000000a
[  459.397930] R10: 0000000000000075 R11: 0000000000000293 R12: 00000000ffffffff
[  459.399022] R13: 0000000000000bd9 R14: 00000000004cdb80 R15: 000000000076bf2c
[  459.400168]
[  459.400430] Allocated by task 20132:
[  459.401038]  kasan_kmalloc+0xbf/0xe0
[  459.401652]  kmem_cache_alloc+0xd5/0x280
[  459.402330]  bdev_alloc_inode+0x18/0x40
[  459.402970]  alloc_inode+0x5f/0x180
[  459.403510]  iget5_locked+0x57/0xd0
[  459.404095]  bdget+0x94/0x4e0
[  459.404607]  bd_acquire+0xfa/0x2c0
[  459.405113]  blkdev_open+0x110/0x290
[  459.405702]  do_dentry_open+0x49e/0x1050
[  459.406340]  path_openat+0x148c/0x3f50
[  459.406926]  do_filp_open+0x1a1/0x280
[  459.407471]  do_sys_open+0x3c3/0x500
[  459.408010]  do_syscall_64+0xc3/0x520
[  459.408572]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  459.409415]
[  459.409679] Freed by task 1262:
[  459.410212]  __kasan_slab_free+0x129/0x170
[  459.410919]  kmem_cache_free+0xb2/0x2a0
[  459.411564]  rcu_process_callbacks+0xbb2/0x2320
[  459.412318]  __do_softirq+0x225/0x8ac

Fix this by delaying bdput() to the end of blkdev_get() which means we
have finished accessing bdev.

Fixes: 77ea887e43 ("implement in-kernel gendisk events handling")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-16 10:33:12 -06:00
David Howells
7c295eec1e afs: afs_vnode_commit_status() doesn't need to check the RPC error
afs_vnode_commit_status() is only ever called if op->error is 0, so remove
the op->error checks from the function.

Fixes: e49c7b2f6d ("afs: Build an abstraction around an "operation" concept")
Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-16 16:26:57 +01:00
David Howells
728279a5a1 afs: Fix use of afs_check_for_remote_deletion()
afs_check_for_remote_deletion() checks to see if error ENOENT is returned
by the server in response to an operation and, if so, marks the primary
vnode as having been deleted as the FID is no longer valid.

However, it's being called from the operation success functions, where no
abort has happened - and if an inline abort is recorded, it's handled by
afs_vnode_commit_status().

Fix this by actually calling the operation aborted method if provided and
having that point to afs_check_for_remote_deletion().

Fixes: e49c7b2f6d ("afs: Build an abstraction around an "operation" concept")
Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-16 16:26:57 +01:00
David Howells
44767c3531 afs: Remove afs_operation::abort_code
Remove afs_operation::abort_code as it's read but never set.  Use
ac.abort_code instead.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-16 16:26:57 +01:00
David Howells
9bd87ec631 afs: Fix yfs_fs_fetch_status() to honour vnode selector
Fix yfs_fs_fetch_status() to honour the vnode selector in
op->fetch_status.which as does afs_fs_fetch_status() that allows
afs_do_lookup() to use this as an alternative to the InlineBulkStatus RPC
call if not implemented by the server.

This doesn't matter in the current code as YFS servers always implement
InlineBulkStatus, but a subsequent will call it on YFS servers too in some
circumstances.

Fixes: e49c7b2f6d ("afs: Build an abstraction around an "operation" concept")
Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-16 16:26:57 +01:00
David Howells
6c85cacc8c afs: Remove yfs_fs_fetch_file_status() as it's not used
Remove yfs_fs_fetch_file_status() as it's no longer used.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-16 16:26:57 +01:00
Mel Gorman
e9c15badbb fs: Do not check if there is a fsnotify watcher on pseudo inodes
The kernel uses internal mounts created by kern_mount() and populated
with files with no lookup path by alloc_file_pseudo() for a variety of
reasons. An example of such a mount is for anonymous pipes. For pipes,
every vfs_write() regardless of filesystem, calls fsnotify_modify()
to notify of any changes which incurs a small amount of overhead in
fsnotify even when there are no watchers. It can also trigger for reads
and readv and writev, it was simply vfs_write() that was noticed first.

A patch is pending that reduces, but does not eliminate, the overhead of
fsnotify but for files that cannot be looked up via a path, even that
small overhead is unnecessary. The user API for all notification
subsystems (inotify, fanotify, ...) is based on the pathname and a dirfd
and proc entries appear to be the only visible representation of the
files. Proc does not have the same pathname as the internal entry and
the proc inode is not the same as the internal inode so even if fanotify
is used on a file under /proc/XX/fd, no useful events are notified.

This patch changes alloc_file_pseudo() to always opt out of fsnotify by
setting FMODE_NONOTIFY flag so that no check is made for fsnotify
watchers on pseudo files. This should be safe as the underlying helper
for the dentry is d_alloc_pseudo() which explicitly states that no
lookups are ever performed meaning that fanotify should have nothing
useful to attach to.

The test motivating this was "perf bench sched messaging --pipe". On
a single-socket machine using threads the difference of the patch was
as follows.

                              5.7.0                  5.7.0
                            vanilla        nofsnotify-v1r1
Amean     1       1.3837 (   0.00%)      1.3547 (   2.10%)
Amean     3       3.7360 (   0.00%)      3.6543 (   2.19%)
Amean     5       5.8130 (   0.00%)      5.7233 *   1.54%*
Amean     7       8.1490 (   0.00%)      7.9730 *   2.16%*
Amean     12     14.6843 (   0.00%)     14.1820 (   3.42%)
Amean     18     21.8840 (   0.00%)     21.7460 (   0.63%)
Amean     24     28.8697 (   0.00%)     29.1680 (  -1.03%)
Amean     30     36.0787 (   0.00%)     35.2640 *   2.26%*
Amean     32     38.0527 (   0.00%)     38.1223 (  -0.18%)

The difference is small but in some cases it's outside the noise so
while marginal, there is still some small benefit to ignoring fsnotify
for files allocated via alloc_file_pseudo() in some cases.

Link: https://lore.kernel.org/r/20200615121358.GF3183@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-06-16 09:40:45 +02:00
Gustavo A. R. Silva
b2b32e3aa0 Squashfs: Replace zero-length array with flexible-array
There is a regular need in the kernel to provide a way to declare having a
dynamically sized set of trailing elements in a structure. Kernel code should
always use “flexible array members”[1] for these cases. The older style of
one-element or zero-length arrays should no longer be used[2].

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://github.com/KSPP/linux/issues/21

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-06-15 23:08:32 -05:00
Gustavo A. R. Silva
6112bad79f jffs2: Replace zero-length array with flexible-array
There is a regular need in the kernel to provide a way to declare having a
dynamically sized set of trailing elements in a structure. Kernel code should
always use “flexible array members”[1] for these cases. The older style of
one-element or zero-length arrays should no longer be used[2].

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://github.com/KSPP/linux/issues/21

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-06-15 23:08:31 -05:00
Gustavo A. R. Silva
241cb28e38 aio: Replace zero-length array with flexible-array
There is a regular need in the kernel to provide a way to declare having a
dynamically sized set of trailing elements in a structure. Kernel code should
always use “flexible array members”[1] for these cases. The older style of
one-element or zero-length arrays should no longer be used[2].

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://github.com/KSPP/linux/issues/21

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-06-15 23:08:25 -05:00
Linus Torvalds
3be20b6fc1 This is the second round of ext4 commits for 5.8 merge window. It
includes the per-inode DAX support, which was dependant on the DAX
 infrastructure which came in via the XFS tree, and a number of
 regression and bug fixes; most notably the "BUG: using
 smp_processor_id() in preemptible code in ext4_mb_new_blocks" reported
 by syzkaller.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAl7mgCcACgkQ8vlZVpUN
 gaPftwf8C4w/7SG+CYLdwg0d2u9TKk77yDuWaioFHOcMSjZvG4TCSgtMhZxQnyty
 9t4yqacILx12pCj/mZnrZp5BOSn9O2ZbuDoXNKNrFXU0BF+CsbnhvJvrrh1j/MUa
 PPtcqyGFdOLSDvHSD9xPVT76juwh79aR8vB7qnQXaEO5wcLodZWoqBEFSKCl6Bo8
 hjXs1EXidusKsoarQxW6mEITmnhU2S2fuCVDgVcoM/LmKwzbgqvlWrentq9u8qLH
 W+XbjWgUtCM1byeDZWqe5FYyyJ8x+dTv7H5an3KR92EN6hKo5AOvzA0I41pZscq/
 bJ9p2THDxJQX4rJBevGAS5mZ6hTkRw==
 =z6eO
 -----END PGP SIGNATURE-----

Merge tag 'ext4-for-linus-5.8-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull more ext4 updates from Ted Ts'o:
 "This is the second round of ext4 commits for 5.8 merge window [1].

  It includes the per-inode DAX support, which was dependant on the DAX
  infrastructure which came in via the XFS tree, and a number of
  regression and bug fixes; most notably the "BUG: using
  smp_processor_id() in preemptible code in ext4_mb_new_blocks" reported
  by syzkaller"

[1] The pull request actually came in 15 minutes after I had tagged the
    rc1 release. Tssk, tssk, late..   - Linus

* tag 'ext4-for-linus-5.8-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4, jbd2: ensure panic by fix a race between jbd2 abort and ext4 error handlers
  ext4: support xattr gnu.* namespace for the Hurd
  ext4: mballoc: Use this_cpu_read instead of this_cpu_ptr
  ext4: avoid utf8_strncasecmp() with unstable name
  ext4: stop overwrite the errcode in ext4_setup_super
  ext4: fix partial cluster initialization when splitting extent
  ext4: avoid race conditions when remounting with options that change dax
  Documentation/dax: Update DAX enablement for ext4
  fs/ext4: Introduce DAX inode flag
  fs/ext4: Remove jflag variable
  fs/ext4: Make DAX mount option a tri-state
  fs/ext4: Only change S_DAX on inode load
  fs/ext4: Update ext4_should_use_dax()
  fs/ext4: Change EXT4_MOUNT_DAX to EXT4_MOUNT_DAX_ALWAYS
  fs/ext4: Disallow verity if inode is DAX
  fs/ext4: Narrow scope of DAX check in setflags
2020-06-15 09:32:10 -07:00
Pavel Begunkov
801dd57bd1 io_uring: cancel by ->task not pid
For an exiting process it tries to cancel all its inflight requests. Use
req->task to match such instead of work.pid. We always have req->task
set, and it will be valid because we're matching only current exiting
task.

Also, remove work.pid and everything related, it's useless now.

Reported-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-15 08:51:38 -06:00
Pavel Begunkov
4dd2824d6d io_uring: lazy get task
There will be multiple places where req->task is used, so refcount-pin
it lazily with introduced *io_{get,put}_req_task(). We need to always
have valid ->task for cancellation reasons, but don't care about pinning
it in some cases. That's why it sets req->task in io_req_init() and
implements get/put laziness with a flag.

This also removes using @current from polling io_arm_poll_handler(),
etc., but doesn't change observable behaviour.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-15 08:51:35 -06:00
Pavel Begunkov
67c4d9e693 io_uring: batch cancel in io_uring_cancel_files()
Instead of waiting for each request one by one, first try to cancel all
of them in a batched manner, and then go over inflight_list/etc to reap
leftovers.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-15 08:51:34 -06:00
Pavel Begunkov
44e728b8aa io_uring: cancel all task's requests on exit
If a process is going away, io_uring_flush() will cancel only 1
request with a matching pid. Cancel all of them

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-15 08:51:34 -06:00
Pavel Begunkov
4f26bda152 io-wq: add an option to cancel all matched reqs
This adds support for cancelling all io-wq works matching a predicate.
It isn't used yet, so no change in observable behaviour.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-15 08:51:34 -06:00
Pavel Begunkov
f4c2665e33 io-wq: reorder cancellation pending -> running
Go all over all pending lists and cancel works there, and only then
try to match running requests. No functional changes here, just a
preparation for bulk cancellation.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-15 08:51:33 -06:00
David Howells
4ec89596d0 afs: Fix the mapping of the UAEOVERFLOW abort code
Abort code UAEOVERFLOW is returned when we try and set a time that's out of
range, but it's currently mapped to EREMOTEIO by the default case.

Fix UAEOVERFLOW to map instead to EOVERFLOW.

Found with the generic/258 xfstest.  Note that the test is wrong as it
assumes that the filesystem will support a pre-UNIX-epoch date.

Fixes: 1eda8bab70 ("afs: Add support for the UAE error table")
Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-15 15:41:03 +01:00
David Howells
793fe82ee3 afs: Fix truncation issues and mmap writeback size
Fix the following issues:

 (1) Fix writeback to reduce the size of a store operation to i_size,
     effectively discarding the extra data.

     The problem comes when afs_page_mkwrite() records that a page is about
     to be modified by mmap().  It doesn't know what bits of the page are
     going to be modified, so it records the whole page as being dirty
     (this is stored in page->private as start and end offsets).

     Without this, the marshalling for the store to the server extends the
     size of the file to the end of the page (in afs_fs_store_data() and
     yfs_fs_store_data()).

 (2) Fix setattr to actually truncate the pagecache, thereby clearing
     the discarded part of a file.

 (3) Fix setattr to check that the new size is okay and to disable
     ATTR_SIZE if i_size wouldn't change.

 (4) Force i_size to be updated as the result of a truncate.

 (5) Don't truncate if ATTR_SIZE is not set.

 (6) Call pagecache_isize_extended() if the file was enlarged.

Note that truncate_set_size() isn't used because the setting of i_size is
done inside afs_vnode_commit_status() under the vnode->cb_lock.

Found with the generic/029 and generic/393 xfstests.

Fixes: 31143d5d51 ("AFS: implement basic file write support")
Fixes: 4343d00872 ("afs: Get rid of the afs_writeback record")
Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-15 15:41:02 +01:00
David Howells
da8d075512 afs: Concoct ctimes
The in-kernel afs filesystem ignores ctime because the AFS fileserver
protocol doesn't support ctimes.  This, however, causes various xfstests to
fail.

Work around this by:

 (1) Setting ctime to attr->ia_ctime in afs_setattr().

 (2) Not ignoring ATTR_MTIME_SET, ATTR_TIMES_SET and ATTR_TOUCH settings.

 (3) Setting the ctime from the server mtime when on the target file when
     creating a hard link to it.

 (4) Setting the ctime on directories from their revised mtimes when
     renaming/moving a file.

Found by the generic/221 and generic/309 xfstests.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-15 15:41:02 +01:00
David Howells
3f4aa98181 afs: Fix EOF corruption
When doing a partial writeback, afs_write_back_from_locked_page() may
generate an FS.StoreData RPC request that writes out part of a file when a
file has been constructed from pieces by doing seek, write, seek, write,
... as is done by ld.

The FS.StoreData RPC is given the current i_size as the file length, but
the server basically ignores it unless the data length is 0 (in which case
it's just a truncate operation).  The revised file length returned in the
result of the RPC may then not reflect what we suggested - and this leads
to i_size getting moved backwards - which causes issues later.

Fix the client to take account of this by ignoring the returned file size
unless the data version number jumped unexpectedly - in which case we're
going to have to clear the pagecache and reload anyway.

This can be observed when doing a kernel build on an AFS mount.  The
following pair of commands produce the issue:

  ld -m elf_x86_64 -z max-page-size=0x200000 --emit-relocs \
      -T arch/x86/realmode/rm/realmode.lds \
      arch/x86/realmode/rm/header.o \
      arch/x86/realmode/rm/trampoline_64.o \
      arch/x86/realmode/rm/stack.o \
      arch/x86/realmode/rm/reboot.o \
      -o arch/x86/realmode/rm/realmode.elf
  arch/x86/tools/relocs --realmode \
      arch/x86/realmode/rm/realmode.elf \
      >arch/x86/realmode/rm/realmode.relocs

This results in the latter giving:

	Cannot read ELF section headers 0/18: Success

as the realmode.elf file got corrupted.

The sequence of events can also be driven with:

	xfs_io -t -f \
		-c "pwrite -S 0x58 0 0x58" \
		-c "pwrite -S 0x59 10000 1000" \
		-c "close" \
		/afs/example.com/scratch/a

Fixes: 31143d5d51 ("AFS: implement basic file write support")
Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-15 15:41:02 +01:00
David Howells
1f32ef7989 afs: afs_write_end() should change i_size under the right lock
Fix afs_write_end() to change i_size under vnode->cb_lock rather than
->wb_lock so that it doesn't race with afs_vnode_commit_status() and
afs_getattr().

The ->wb_lock is only meant to guard access to ->wb_keys which isn't
accessed by that piece of code.

Fixes: 4343d00872 ("afs: Get rid of the afs_writeback record")
Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-15 15:41:02 +01:00
David Howells
bb41348928 afs: Fix non-setting of mtime when writing into mmap
The mtime on an inode needs to be updated when a write is made into an
mmap'ed section.  There are three ways in which this could be done: update
it when page_mkwrite is called, update it when a page is changed from dirty
to writeback or leave it to the server and fix the mtime up from the reply
to the StoreData RPC.

Found with the generic/215 xfstest.

Fixes: 1cf7a1518a ("afs: Implement shared-writeable mmap")
Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-15 15:41:02 +01:00
Pavel Begunkov
59960b9deb io_uring: fix lazy work init
Don't leave garbage in req.work before punting async on -EAGAIN
in io_iopoll_queue().

[  140.922099] general protection fault, probably for non-canonical
     address 0xdead000000000100: 0000 [#1] PREEMPT SMP PTI
...
[  140.922105] RIP: 0010:io_worker_handle_work+0x1db/0x480
...
[  140.922114] Call Trace:
[  140.922118]  ? __next_timer_interrupt+0xe0/0xe0
[  140.922119]  io_wqe_worker+0x2a9/0x360
[  140.922121]  ? _raw_spin_unlock_irqrestore+0x24/0x40
[  140.922124]  kthread+0x12c/0x170
[  140.922125]  ? io_worker_handle_work+0x480/0x480
[  140.922126]  ? kthread_park+0x90/0x90
[  140.922127]  ret_from_fork+0x22/0x30

Fixes: 7cdaf587de ("io_uring: avoid whole io_wq_work copy for requests completed inline")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-15 08:37:55 -06:00
Tony Luck
4353f03317 efivarfs: Don't return -EINTR when rate-limiting reads
Applications that read EFI variables may see a return
value of -EINTR if they exceed the rate limit and a
signal delivery is attempted while the process is sleeping.

This is quite surprising to the application, which probably
doesn't have code to handle it.

Change the interruptible sleep to a non-interruptible one.

Reported-by: Lennart Poettering <mzxreary@0pointer.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20200528194905.690-3-tony.luck@intel.com
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2020-06-15 14:38:56 +02:00
Tony Luck
2096721f15 efivarfs: Update inode modification time for successful writes
Some applications want to be able to see when EFI variables
have been updated.

Update the modification time for successful writes.

Reported-by: Lennart Poettering <mzxreary@0pointer.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20200528194905.690-2-tony.luck@intel.com
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2020-06-15 14:38:56 +02:00
Linus Torvalds
9d645db853 for-5.8-part2-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl7lZwgACgkQxWXV+ddt
 WDuj6g/9E2JtqeO8zRMLb+Do/n5YX0dFHt+dM1AGY+nw8hb3U9Vlgc8KJa7UpZFX
 opl1i9QL+cJLoZMZL5xZhDouMQlum5cGVV3hLwqEPYetRF/ytw/kunWAg5o8OW1R
 sJxGcjyiiKpZLVx6nMjGnYjsrbOJv0HlaWfY3NCon4oQ8yQTzTPMPBevPWRM7Iqw
 Ssi8pA8zXCc2QoLgyk6Pe/IGeox8+z9RA2akHkJIdMWiPHm43RDF4Yx3Yl9NHHZA
 M+pLVKjZoejqwVaai8osBqWVw4Ypax1+CJit6iHGwJDkQyFPcMXMsOc5ZYBnT5or
 k/ceVMCs+ejvCK1+L30u7FQRiDqf5Fwhf/SGfq7+y83KbEjMfWOya3Lyk47fbDD4
 776rSaS6ejqVklWppbaPhntSrBtPR1NaDOfi55bc9TOe+yW7Du+AsQMlEE0bTJaW
 eHl+A4AP/nDlo8Etn1jTWd023bzzO+iySMn3YZfK0vw3vkj3JfrCGXx6DEYipOou
 uEUj0jDo/rdiB5S3GdUCujjaPgm/f0wkPudTRB9lpxJas2qFU+qo2TLJhEleELwj
 m4laz7W7S+nUFP0LRl8O82AzBfjm+oHjWTpfdloT6JW9Da8/iuZ/x9VBWQ8mFJwX
 U0cR3zVqUuWcK78fZa/FFgGPBxlwUv2j+OhRGsS0/orDRlrwcXo=
 =5S0s
 -----END PGP SIGNATURE-----

Merge tag 'for-5.8-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "This reverts the direct io port to iomap infrastructure of btrfs
  merged in the first pull request. We found problems in invalidate page
  that don't seem to be fixable as regressions or without changing iomap
  code that would not affect other filesystems.

  There are four reverts in total, but three of them are followup
  cleanups needed to revert a43a67a2d7 cleanly. The result is the
  buffer head based implementation of direct io.

  Reverts are not great, but under current circumstances I don't see
  better options"

* tag 'for-5.8-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Revert "btrfs: switch to iomap_dio_rw() for dio"
  Revert "fs: remove dio_end_io()"
  Revert "btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK"
  Revert "btrfs: split btrfs_direct_IO to read and write part"
2020-06-14 09:47:25 -07:00
David Sterba
55e20bd12a Revert "btrfs: switch to iomap_dio_rw() for dio"
This reverts commit a43a67a2d7.

This patch reverts the main part of switching direct io implementation
to iomap infrastructure. There's a problem in invalidate page that
couldn't be solved as regression in this development cycle.

The problem occurs when buffered and direct io are mixed, and the ranges
overlap. Although this is not recommended, filesystems implement
measures or fallbacks to make it somehow work. In this case, fallback to
buffered IO would be an option for btrfs (this already happens when
direct io is done on compressed data), but the change would be needed in
the iomap code, bringing new semantics to other filesystems.

Another problem arises when again the buffered and direct ios are mixed,
invalidation fails, then -EIO is set on the mapping and fsync will fail,
though there's no real error.

There have been discussions how to fix that, but revert seems to be the
least intrusive option.

Link: https://lore.kernel.org/linux-btrfs/20200528192103.xm45qoxqmkw7i5yl@fiona/
Signed-off-by: David Sterba <dsterba@suse.com>
2020-06-14 01:19:02 +02:00
Linus Torvalds
f82e7b57b5 12 cifs/smb3 fixes, 2 for stable. Adds support for idsfromsid on create and chgrp/chown. Improves query info (getattr) when posix extensions negotiated.
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl7kX6EACgkQiiy9cAdy
 T1GdiQwAqMaDRVLZWeV5Uc0EM9AGkrWVu6F5n9nBzuKDTXCAf8aKCyiyYdMz/P20
 belQA3bPG4jkLa/4Or1XfTY2OSSV4eGBlTfjHNeW2ZJ5pJWGInqCHuVco/M98om8
 57JMTMZDTxN6884U+v3bBl4jDE6MqK3QS0WfA63ufd0T8ZnFOGDBn1DieJKbViyy
 ZckpDH0etaAxO171SV5VwzbFe9U7OeTXupD8LYEHngR7vfaFCkX6ZftYYN0aWsvs
 uL3p6K1kiNNxTXm0M3Hw6Gpk1nEAM9/6nOR6+TUppor+rQVJCH5F7NKQVrR92MDq
 Qgwldt16DP1NjOb0q5L37HIg+9kD2kshKs9CErneen6eWtcfiN0HYT35hBxVi7RT
 XT/dMt17wq3waoq92+RY3U4vb47QVWS6asH4/sqsTqUMWrlEYNGkEuCfeniZzJfO
 bxglNPVafQ5qy2DWBzsAUX/isaR06FihEKqODK+K78KGTptim/+ip9+yXGjM6ne2
 lhdWspC5
 =Iwqj
 -----END PGP SIGNATURE-----

Merge tag '5.8-rc-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6

Pull more cifs updates from Steve French:
 "12 cifs/smb3 fixes, 2 for stable.

   - add support for idsfromsid on create and chgrp/chown allowing
     ability to save owner information more naturally for some workloads

   - improve query info (getattr) when SMB3.1.1 posix extensions are
     negotiated by using new query info level"

* tag '5.8-rc-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6:
  smb3: Add debug message for new file creation with idsfromsid mount option
  cifs: fix chown and chgrp when idsfromsid mount option enabled
  smb3: allow uid and gid owners to be set on create with idsfromsid mount option
  smb311: Add tracepoints for new compound posix query info
  smb311: add support for using info level for posix extensions query
  smb311: Add support for lookup with posix extensions query info
  smb311: Add support for SMB311 query info (non-compounded)
  SMB311: Add support for query info using posix extensions (level 100)
  smb3: add indatalen that can be a non-zero value to calculation of credit charge in smb2 ioctl
  smb3: fix typo in mount options displayed in /proc/mounts
  cifs: Add get_security_type_str function to return sec type.
  smb3: extend fscache mount volume coherency check
2020-06-13 13:43:56 -07:00
Linus Torvalds
6adc19fd13 Kbuild updates for v5.8 (2nd)
- fix build rules in binderfs sample
 
  - fix build errors when Kbuild recurses to the top Makefile
 
  - covert '---help---' in Kconfig to 'help'
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAl7lBuYVHG1hc2FoaXJv
 eUBrZXJuZWwub3JnAAoJED2LAQed4NsGHvIP/3iErjPshpg/phwH8NTCS4SFkiti
 BZRM+2lupSn7Qs53BTpVzIkXoHBJQZlJxlQ5HY8ScO+fiz28rKZr+b40us+je1Q+
 SkvSPfwZzxjEg7lAZutznG4KgItJLWJKmDyh9T8Y8TAuG4f8WO0hKnXoAp3YorS2
 zppEIxso8O5spZPjp+fF/fPbxPjIsabGK7Jp2LpSVFR5pVDHI/ycTlKQS+MFpMEx
 6JIpdFRw7TkvKew1dr5uAWT5btWHatEqjSR3JeyVHv3EICTGQwHmcHK67cJzGInK
 T51+DT7/CpKtmRgGMiTEu/INfMzzoQAKl6Fcu+vMaShTN97Hk9DpdtQyvA6P/h3L
 8GA4UBct05J7fjjIB7iUD+GYQ0EZbaFujzRXLYk+dQqEJRbhcCwvdzggGp0WvGRs
 1f8/AIpgnQv8JSL/bOMgGMS5uL2dSLsgbzTdr6RzWf1jlYdI1i4u7AZ/nBrwWP+Z
 iOBkKsVceEoJrTbaynl3eoYqFLtWyDau+//oBc2gUvmhn8ioM5dfqBRiJjxJnPG9
 /giRj6xRIqMMEw8Gg8PCG7WebfWxWyaIQwlWBbPok7DwISURK5mvOyakZL+Q25/y
 6MBr2H8NEJsf35q0GTINpfZnot7NX4JXrrndJH8NIRC7HEhwd29S041xlQJdP0rs
 E76xsOr3hrAmBu4P
 =1NIT
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-v5.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull more Kbuild updates from Masahiro Yamada:

 - fix build rules in binderfs sample

 - fix build errors when Kbuild recurses to the top Makefile

 - covert '---help---' in Kconfig to 'help'

* tag 'kbuild-v5.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  treewide: replace '---help---' in Kconfig files with 'help'
  kbuild: fix broken builds because of GZIP,BZIP2,LZOP variables
  samples: binderfs: really compile this sample and fix build issues
2020-06-13 13:29:16 -07:00
Linus Torvalds
593bd5e5d3 New code for 5.8:
- Fix an integer overflow problem in the unshare actor.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl7fCTEACgkQ+H93GTRK
 tOuQxQ//Ya/xLx9UPoZepTzjHQKl2MlYVYRfKCL60NrH6kNpvq9jyGiPg6xOXc3g
 KGTe23YDiuP80L3hpIZ9yj/SbJAItI8LsqHHrvVDbAdVSQdK56ajZqq3xwyvOC9u
 RqCkGkVzRE+nmToJQbYCSmPA446aqMWuCpmlsTbuGmjvkRKAMgBBG/66nbcplQnC
 eeflcVW7IdbbQ45K8QpyP4AeNMobc26B7zmWqXYeZuMxHcFsrnvld3pgke39i8Hk
 k0SzMenGddYfb6/FknnxHASMnqnhE7lA1YyWe7F3uDM8OwmpNIseBysqm+6tETkn
 DBlcpVeENNJB7ygPhqOJXmmDGnap5Y7vwhAc8jX84yuXRkd0gx5aTRIyH8cNp9lQ
 TRwoVY9DTUkUlMkSLpgeCFIOR5SyOW3H4xZV4PC0sJxAWtM0J3B8A5zvAjQ5kVRP
 79gVRpl2OUj648nbrPRwhDBwnNZAhflRVvBh9kasteA7SAtuGJFJKZZ162Smltz2
 1E9i/2CvUUartNOjKkT3qPzAF6B1Je3AGTMwuDPhcYX9bdW+9pCD09yi1CiGOn7S
 QuuwyHTAcLRtZiShNCG6zQhqq++zQCZ58J1IBHYajE73YM1+8r/5wCfTIhB+CPuf
 J0rjqS+d151d2qMBnK6oag0t2u5Hj+xlcJw9QnQGqPKs6yIktA0=
 =s+Pr
 -----END PGP SIGNATURE-----

Merge tag 'iomap-5.8-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull iomap fix from Darrick Wong:
 "A single iomap bug fix for a variable type mistake on 32-bit
  architectures, fixing an integer overflow problem in the unshare
  actor"

* tag 'iomap-5.8-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  iomap: Fix unsharing of an extent >2GB on a 32-bit machine
2020-06-13 12:44:30 -07:00
Linus Torvalds
c555722768 Fixes for 5.8:
- Fix a resource leak on an error bailout.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl7fCpAACgkQ+H93GTRK
 tOtlog//ZkKRzp72HXCTgGpQj0IjkCjuZlz0F8FpdVhl9lOANaZPoXDbCIAax8q1
 67wfDG7p8wl109KZnMuaPPXSC5KlynaWphSs7XMXqgLFXViha31c6U7PSMyxZmBB
 674hE9eKnVNjhkMk98MtVV3ShWge9T5yGVXYhQbXMWDx8GCdNd9NEP3qnMcBEaLt
 EPl6yoOfdNnKo37ptrt1Qb2NgORDBDDHYPr6SX/xEYDsppDLp8u+k/YGhuoJVtdc
 HGR08ryIn6lctvkLbqDxtFzFxIL8Za7AHrBXilgioJYRJ78v7VyCnj1u8eT/axsa
 ZUis/sQXjgvSvlsGZQZkyPdtnfhFbzXCeulyQvrMnEheMuz691dljMid3fEBkfmq
 SubqE+HDP8aC6Zs9EkV/lEtdTH+EQ2ojZHH9s5oi6qbvilfFxyoPUfIxog+bhqPO
 fwl1sL2nb/eQuBF+DeHg4UxP9WzA06Z1q9nZpDjrY224aMOWnrN8TBOKv4FZiRDt
 M1l3VXcVsaDbCmbOsCTXdLh0Ap3przjk4hFPOjPJxlTzTNO9rPLhopvuLd+J3quA
 fzNNBA4bMSq3IFSg3VEC2U3YgF3anGrt8PuopIwCH8muc9agCs//fI3Y/eI4k9oT
 VOUPSxKckZ6SAEhIr7uTyKFzS+yNFBaYN/Y0FqDGnzbf5Bqr9NM=
 =C0rZ
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.8-merge-9' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fix from Darrick Wong:
 "We've settled down into the bugfix phase; this one fixes a resource
  leak on an error bailout path"

* tag 'xfs-5.8-merge-9' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: Add the missed xfs_perag_put() for xfs_ifree_cluster()
2020-06-13 12:40:24 -07:00
Masahiro Yamada
a7f7f6248d treewide: replace '---help---' in Kconfig files with 'help'
Since commit 84af7a6194 ("checkpatch: kconfig: prefer 'help' over
'---help---'"), the number of '---help---' has been gradually
decreasing, but there are still more than 2400 instances.

This commit finishes the conversion. While I touched the lines,
I also fixed the indentation.

There are a variety of indentation styles found.

  a) 4 spaces + '---help---'
  b) 7 spaces + '---help---'
  c) 8 spaces + '---help---'
  d) 1 space + 1 tab + '---help---'
  e) 1 tab + '---help---'    (correct indentation)
  f) 1 tab + 1 space + '---help---'
  g) 1 tab + 2 spaces + '---help---'

In order to convert all of them to 1 tab + 'help', I ran the
following commend:

  $ find . -name 'Kconfig*' | xargs sed -i 's/^[[:space:]]*---help---/\thelp/'

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2020-06-14 01:57:21 +09:00
Linus Torvalds
6c32978414 Notifications over pipes + Keyring notifications
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl7U/i8ACgkQ+7dXa6fL
 C2u2eg/+Oy6ybq0hPovYVkFI9WIG7ZCz7w9Q6BEnfYMqqn3dnfJxKQ3l4pnQEOWw
 f4QfvpvevsYfMtOJkYcG6s66rQgbFdqc5TEyBBy0QNp3acRolN7IXkcopvv9xOpQ
 JxedpbFG1PTFLWjvBpyjlrUPouwLzq2FXAf1Ox0ZIMw6165mYOMWoli1VL8dh0A0
 Ai7JUB0WrvTNbrwhV413obIzXT/rPCdcrgbQcgrrLPex8lQ47ZAE9bq6k4q5HiwK
 KRzEqkQgnzId6cCNTFBfkTWsx89zZunz7jkfM5yx30MvdAtPSxvvpfIPdZRZkXsP
 E2K9Fk1/6OQZTC0Op3Pi/bt+hVG/mD1p0sQUDgo2MO3qlSS+5mMkR8h3mJEgwK12
 72P4YfOJkuAy2z3v4lL0GYdUDAZY6i6G8TMxERKu/a9O3VjTWICDOyBUS6F8YEAK
 C7HlbZxAEOKTVK0BTDTeEUBwSeDrBbvH6MnRlZCG5g1Fos2aWP0udhjiX8IfZLO7
 GN6nWBvK1fYzfsUczdhgnoCzQs3suoDo04HnsTPGJ8De52T4x2RsjV+gPx0nrNAq
 eWChl1JvMWsY2B3GLnl9XQz4NNN+EreKEkk+PULDGllrArrPsp5Vnhb9FJO1PVCU
 hMDJHohPiXnKbc8f4Bd78OhIvnuoGfJPdM5MtNe2flUKy2a2ops=
 =YTGf
 -----END PGP SIGNATURE-----

Merge tag 'notifications-20200601' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull notification queue from David Howells:
 "This adds a general notification queue concept and adds an event
  source for keys/keyrings, such as linking and unlinking keys and
  changing their attributes.

  Thanks to Debarshi Ray, we do have a pull request to use this to fix a
  problem with gnome-online-accounts - as mentioned last time:

     https://gitlab.gnome.org/GNOME/gnome-online-accounts/merge_requests/47

  Without this, g-o-a has to constantly poll a keyring-based kerberos
  cache to find out if kinit has changed anything.

  [ There are other notification pending: mount/sb fsinfo notifications
    for libmount that Karel Zak and Ian Kent have been working on, and
    Christian Brauner would like to use them in lxc, but let's see how
    this one works first ]

  LSM hooks are included:

   - A set of hooks are provided that allow an LSM to rule on whether or
     not a watch may be set. Each of these hooks takes a different
     "watched object" parameter, so they're not really shareable. The
     LSM should use current's credentials. [Wanted by SELinux & Smack]

   - A hook is provided to allow an LSM to rule on whether or not a
     particular message may be posted to a particular queue. This is
     given the credentials from the event generator (which may be the
     system) and the watch setter. [Wanted by Smack]

  I've provided SELinux and Smack with implementations of some of these
  hooks.

  WHY
  ===

  Key/keyring notifications are desirable because if you have your
  kerberos tickets in a file/directory, your Gnome desktop will monitor
  that using something like fanotify and tell you if your credentials
  cache changes.

  However, we also have the ability to cache your kerberos tickets in
  the session, user or persistent keyring so that it isn't left around
  on disk across a reboot or logout. Keyrings, however, cannot currently
  be monitored asynchronously, so the desktop has to poll for it - not
  so good on a laptop. This facility will allow the desktop to avoid the
  need to poll.

  DESIGN DECISIONS
  ================

   - The notification queue is built on top of a standard pipe. Messages
     are effectively spliced in. The pipe is opened with a special flag:

        pipe2(fds, O_NOTIFICATION_PIPE);

     The special flag has the same value as O_EXCL (which doesn't seem
     like it will ever be applicable in this context)[?]. It is given up
     front to make it a lot easier to prohibit splice&co from accessing
     the pipe.

     [?] Should this be done some other way?  I'd rather not use up a new
         O_* flag if I can avoid it - should I add a pipe3() system call
         instead?

     The pipe is then configured::

        ioctl(fds[1], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
        ioctl(fds[1], IOC_WATCH_QUEUE_SET_FILTER, &filter);

     Messages are then read out of the pipe using read().

   - It should be possible to allow write() to insert data into the
     notification pipes too, but this is currently disabled as the
     kernel has to be able to insert messages into the pipe *without*
     holding pipe->mutex and the code to make this work needs careful
     auditing.

   - sendfile(), splice() and vmsplice() are disabled on notification
     pipes because of the pipe->mutex issue and also because they
     sometimes want to revert what they just did - but one or more
     notification messages might've been interleaved in the ring.

   - The kernel inserts messages with the wait queue spinlock held. This
     means that pipe_read() and pipe_write() have to take the spinlock
     to update the queue pointers.

   - Records in the buffer are binary, typed and have a length so that
     they can be of varying size.

     This allows multiple heterogeneous sources to share a common
     buffer; there are 16 million types available, of which I've used
     just a few, so there is scope for others to be used. Tags may be
     specified when a watchpoint is created to help distinguish the
     sources.

   - Records are filterable as types have up to 256 subtypes that can be
     individually filtered. Other filtration is also available.

   - Notification pipes don't interfere with each other; each may be
     bound to a different set of watches. Any particular notification
     will be copied to all the queues that are currently watching for it
     - and only those that are watching for it.

   - When recording a notification, the kernel will not sleep, but will
     rather mark a queue as having lost a message if there's
     insufficient space. read() will fabricate a loss notification
     message at an appropriate point later.

   - The notification pipe is created and then watchpoints are attached
     to it, using one of:

        keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fds[1], 0x01);
        watch_mount(AT_FDCWD, "/", 0, fd, 0x02);
        watch_sb(AT_FDCWD, "/mnt", 0, fd, 0x03);

     where in both cases, fd indicates the queue and the number after is
     a tag between 0 and 255.

   - Watches are removed if either the notification pipe is destroyed or
     the watched object is destroyed. In the latter case, a message will
     be generated indicating the enforced watch removal.

  Things I want to avoid:

   - Introducing features that make the core VFS dependent on the
     network stack or networking namespaces (ie. usage of netlink).

   - Dumping all this stuff into dmesg and having a daemon that sits
     there parsing the output and distributing it as this then puts the
     responsibility for security into userspace and makes handling
     namespaces tricky. Further, dmesg might not exist or might be
     inaccessible inside a container.

   - Letting users see events they shouldn't be able to see.

  TESTING AND MANPAGES
  ====================

   - The keyutils tree has a pipe-watch branch that has keyctl commands
     for making use of notifications. Proposed manual pages can also be
     found on this branch, though a couple of them really need to go to
     the main manpages repository instead.

     If the kernel supports the watching of keys, then running "make
     test" on that branch will cause the testing infrastructure to spawn
     a monitoring process on the side that monitors a notifications pipe
     for all the key/keyring changes induced by the tests and they'll
     all be checked off to make sure they happened.

        https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/keyutils.git/log/?h=pipe-watch

   - A test program is provided (samples/watch_queue/watch_test) that
     can be used to monitor for keyrings, mount and superblock events.
     Information on the notifications is simply logged to stdout"

* tag 'notifications-20200601' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  smack: Implement the watch_key and post_notification hooks
  selinux: Implement the watch_key security hook
  keys: Make the KEY_NEED_* perms an enum rather than a mask
  pipe: Add notification lossage handling
  pipe: Allow buffers to be marked read-whole-or-error for notifications
  Add sample notification program
  watch_queue: Add a key/keyring notification facility
  security: Add hooks to rule on setting a watch
  pipe: Add general notification queue support
  pipe: Add O_NOTIFICATION_PIPE
  security: Add a hook for the point of notification insertion
  uapi: General notification queue definitions
2020-06-13 09:56:21 -07:00
Steve French
a7a519a492 smb3: Add debug message for new file creation with idsfromsid mount option
Pavel noticed that a debug message (disabled by default) in creating the security
descriptor context could be useful for new file creation owner fields
(as we already have for the mode) when using mount parm idsfromsid.

[38120.392272] CIFS: FYI: owner S-1-5-88-1-0, group S-1-5-88-2-0
[38125.792637] CIFS: FYI: owner S-1-5-88-1-1000, group S-1-5-88-2-1000

Also cleans up a typo in a comment

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2020-06-12 16:31:06 -05:00
Linus Torvalds
44ebe016df Merge branch 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull proc fix from Eric Biederman:
 "Much to my surprise syzbot found a very old bug in proc that the
  recent changes made easier to reproce. This bug is subtle enough it
  looks like it fooled everyone who should know better"

* 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  proc: Use new_inode not new_inode_pseudo
2020-06-12 12:38:18 -07:00
Eric W. Biederman
ef1548adad proc: Use new_inode not new_inode_pseudo
Recently syzbot reported that unmounting proc when there is an ongoing
inotify watch on the root directory of proc could result in a use
after free when the watch is removed after the unmount of proc
when the watcher exits.

Commit 69879c01a0 ("proc: Remove the now unnecessary internal mount
of proc") made it easier to unmount proc and allowed syzbot to see the
problem, but looking at the code it has been around for a long time.

Looking at the code the fsnotify watch should have been removed by
fsnotify_sb_delete in generic_shutdown_super.  Unfortunately the inode
was allocated with new_inode_pseudo instead of new_inode so the inode
was not on the sb->s_inodes list.  Which prevented
fsnotify_unmount_inodes from finding the inode and removing the watch
as well as made it so the "VFS: Busy inodes after unmount" warning
could not find the inodes to warn about them.

Make all of the inodes in proc visible to generic_shutdown_super,
and fsnotify_sb_delete by using new_inode instead of new_inode_pseudo.
The only functional difference is that new_inode places the inodes
on the sb->s_inodes list.

I wrote a small test program and I can verify that without changes it
can trigger this issue, and by replacing new_inode_pseudo with
new_inode the issues goes away.

Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/000000000000d788c905a7dfa3f4@google.com
Reported-by: syzbot+7d2debdcdb3cb93c1e5e@syzkaller.appspotmail.com
Fixes: 0097875bd4 ("proc: Implement /proc/thread-self to point at the directory of the current thread")
Fixes: 021ada7dff ("procfs: switch /proc/self away from proc_dir_entry")
Fixes: 51f0885e54 ("vfs,proc: guarantee unique inodes in /proc")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-06-12 14:13:33 -05:00
zhangyi (F)
7b97d868b7 ext4, jbd2: ensure panic by fix a race between jbd2 abort and ext4 error handlers
In the ext4 filesystem with errors=panic, if one process is recording
errno in the superblock when invoking jbd2_journal_abort() due to some
error cases, it could be raced by another __ext4_abort() which is
setting the SB_RDONLY flag but missing panic because errno has not been
recorded.

jbd2_journal_commit_transaction()
 jbd2_journal_abort()
  journal->j_flags |= JBD2_ABORT;
  jbd2_journal_update_sb_errno()
                                    | ext4_journal_check_start()
                                    |  __ext4_abort()
                                    |   sb->s_flags |= SB_RDONLY;
                                    |   if (!JBD2_REC_ERR)
                                    |        return;
  journal->j_flags |= JBD2_REC_ERR;

Finally, it will no longer trigger panic because the filesystem has
already been set read-only. Fix this by introduce j_abort_mutex to make
sure journal abort is completed before panic, and remove JBD2_REC_ERR
flag.

Fixes: 4327ba52af ("ext4, jbd2: ensure entering into panic after recording an error in superblock")
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20200609073540.3810702-1-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-12 14:51:41 -04:00
Steve French
a660339827 cifs: fix chown and chgrp when idsfromsid mount option enabled
idsfromsid was ignored in chown and chgrp causing it to fail
when upcalls were not configured for lookup.  idsfromsid allows
mapping users when setting user or group ownership using
"special SID" (reserved for this).  Add support for chmod and chgrp
when idsfromsid mount option is enabled.

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2020-06-12 13:21:32 -05:00
Steve French
975221eca5 smb3: allow uid and gid owners to be set on create with idsfromsid mount option
Currently idsfromsid mount option allows querying owner information from the
special sids used to represent POSIX uids and gids but needed changes to
populate the security descriptor context with the owner information when
idsfromsid mount option was used.

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2020-06-12 13:21:15 -05:00
Jan (janneke) Nieuwenhuizen
88ee9d571b ext4: support xattr gnu.* namespace for the Hurd
The Hurd gained[0] support for moving the translator and author
fields out of the inode and into the "gnu.*" xattr namespace.

In anticipation of that, an xattr INDEX was reserved[1].  The Hurd has
now been brought into compliance[2] with that.

This patch adds support for reading and writing such attributes from
Linux; you can now do something like

    mkdir -p hurd-root/servers/socket
    touch hurd-root/servers/socket/1
    setfattr --name=gnu.translator --value='"/hurd/pflocal\0"' \
        hurd-root/servers/socket/1
    getfattr --name=gnu.translator hurd-root/servers/socket/1
    # file: 1
    gnu.translator="/hurd/pflocal"

to setup a pipe translator, which is being used to create[3] a
vm-image for the Hurd from GNU Guix.

[0] https://summerofcode.withgoogle.com/projects/#5869799859027968
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3980bd3b406addb327d858aebd19e229ea340b9a
[2] https://git.savannah.gnu.org/cgit/hurd/hurd.git/commit/?id=a04c7bf83172faa7cb080fbe3b6c04a8415ca645
[3] https://git.savannah.gnu.org/cgit/guix.git/log/?h=wip-hurd-vm

Signed-off-by: Jan Nieuwenhuizen <janneke@gnu.org>
Link: https://lore.kernel.org/r/20200525193940.878-1-janneke@gnu.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-12 13:23:34 -04:00
Steve French
e4bd7c4a8d smb311: Add tracepoints for new compound posix query info
Add dynamic tracepoints for new SMB3.1.1. posix extensions query info level (100)

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2020-06-12 08:55:18 -05:00
Steve French
d313852d7a smb311: add support for using info level for posix extensions query
Adds calls to the newer info level for query info using SMB3.1.1 posix extensions.
The remaining two places that call the older query info (non-SMB3.1.1 POSIX)
require passing in the fid and can be updated in a later patch.

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2020-06-12 08:54:12 -05:00
Steve French
790434ff98 smb311: Add support for lookup with posix extensions query info
Improve support for lookup when using SMB3.1.1 posix mounts.
Use new info level 100 (posix query info)

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2020-06-12 06:21:19 -05:00
Steve French
b1bc1874b8 smb311: Add support for SMB311 query info (non-compounded)
Add worker function for non-compounded SMB3.1.1 POSIX Extensions query info.
This is needed for revalidate of root (cached) directory for example.

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2020-06-12 06:21:06 -05:00
Steve French
6a5f6592a0 SMB311: Add support for query info using posix extensions (level 100)
Adds support for better query info on dentry revalidation (using
the SMB3.1.1 POSIX extensions level 100).  Followon patch will
add support for translating the UID/GID from the SID and also
will add support for using the posix query info on lookup.

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2020-06-12 06:20:38 -05:00
Namjae Jeon
ebf57440ec smb3: add indatalen that can be a non-zero value to calculation of credit charge in smb2 ioctl
Some of tests in xfstests failed with cifsd kernel server since commit
e80ddeb2f7. cifsd kernel server validates credit charge from client
by calculating it base on max((InputCount + OutputCount) and
(MaxInputResponse + MaxOutputResponse)) according to specification.

MS-SMB2 specification describe credit charge calculation of smb2 ioctl :

If Connection.SupportsMultiCredit is TRUE, the server MUST validate
CreditCharge based on the maximum of (InputCount + OutputCount) and
(MaxInputResponse + MaxOutputResponse), as specified in section 3.3.5.2.5.
If the validation fails, it MUST fail the IOCTL request with
STATUS_INVALID_PARAMETER.

This patch add indatalen that can be a non-zero value to calculation of
credit charge in SMB2_ioctl_init().

Fixes: e80ddeb2f7 ("smb3: fix incorrect number of credits when ioctl
MaxOutputResponse > 64K")
Cc: Stable <stable@vger.kernel.org>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Cc: Steve French <smfrench@gmail.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-06-12 06:20:17 -05:00
Linus Torvalds
b1a6274994 Merge branch 'akpm' (patches from Andrew)
Pull updates from Andrew Morton:
 "A few fixes and stragglers.

  Subsystems affected by this patch series: mm/memory-failure, ocfs2,
  lib/lzo, misc"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  amdgpu: a NULL ->mm does not mean a thread is a kthread
  lib/lzo: fix ambiguous encoding bug in lzo-rle
  ocfs2: fix build failure when TCP/IP is disabled
  mm/memory-failure: send SIGBUS(BUS_MCEERR_AR) only to current thread
  mm/memory-failure: prioritize prctl(PR_MCE_KILL) over vm.memory_failure_early_kill
2020-06-11 18:18:50 -07:00
Tom Seewald
fce1affe4e ocfs2: fix build failure when TCP/IP is disabled
After commit 12abc5ee78 ("tcp: add tcp_sock_set_nodelay") and commit
c488aeadcb ("tcp: add tcp_sock_set_user_timeout"), building the kernel
with OCFS2_FS=y but without INET=y causes it to fail with:

  ld: fs/ocfs2/cluster/tcp.o: in function `o2net_accept_many':
  tcp.c:(.text+0x21b1): undefined reference to `tcp_sock_set_nodelay'
  ld: tcp.c:(.text+0x21c1): undefined reference to `tcp_sock_set_user_timeout'
  ld: fs/ocfs2/cluster/tcp.o: in function `o2net_start_connect':
  tcp.c:(.text+0x2633): undefined reference to `tcp_sock_set_nodelay'
  ld: tcp.c:(.text+0x2643): undefined reference to `tcp_sock_set_user_timeout'

This is due to tcp_sock_set_nodelay() and tcp_sock_set_user_timeout()
being declared in linux/tcp.h and defined in net/ipv4/tcp.c, which
depend on TCP/IP being enabled.

To fix this, make OCFS2_FS depend on INET=y which already requires
NET=y.

Fixes: 12abc5ee78 ("tcp: add tcp_sock_set_nodelay")
Fixes: c488aeadcb ("tcp: add tcp_sock_set_user_timeout")
Signed-off-by: Tom Seewald <tseewald@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Link: http://lkml.kernel.org/r/20200606190827.23954-1-tseewald@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-11 18:17:47 -07:00
Linus Torvalds
b961f8dc89 io_uring-5.8-2020-06-11
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl7iocEQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpj96EACRUW8F6Y9qibPIIYGOAdpW5vf6hdW88oan
 hkxOr2+y+9Odyn3WAnQtuMvmIAyOnIpVB1PiGtiXY1mmESWwbFZuxo6m1u4PiqZF
 rmvThcrx/o7T1hPzPJt2dUZmR6qBY2rbkGaruD14bcn36DW6fkAicZmsl7UluKTm
 pKE2wsxKsjGkcvElYsLYZbVm/xGe+UldaSpNFSp8b+yCAaH6eJLfhjeVC4Db7Yzn
 v3Liz012Xed3nmHktgXrihK8vQ1P7zOFaISJlaJ9yRK4z3VAF7wTgvZUjeYGP5FS
 GnUW/2p7UOsi5QkX9w2ZwPf/d0aSLZ/Va/5PjZRzAjNORMY5sjPtsfzqdKCohOhq
 q8qanyU1pOXRKf1cOEzU40hS81ZDRmoQRTCym6vgwHZrmVtcNnL/Af9soGrWIA8m
 +U6S2fpfuxeNP017HSzLHWtCGEOGYvhEc1D70mNBSIB8lElNvNVI6hWZOmxWkbKn
 w3O2JIfh9bl9Pk2espwZykJmzehYECP/H8wyhTlF3vBWieFF4uRucBgsmFgQmhvg
 NWQ7Iea49zOBt3IV3+LIRS2ulpXe7uu4WJYMa6da5o0a11ayNkngrh5QnBSSJ2rR
 HRUKZ9RA99A5edqyxEujDW2QABycNiYdo8ua2gYEFBvRNc9ff1l2CqWAk0n66uxE
 4vj4jmVJHg==
 =evRQ
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.8-2020-06-11' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "A few late stragglers in here. In particular:

   - Validate full range for provided buffers (Bijan)

   - Fix bad use of kfree() in buffer registration failure (Denis)

   - Don't allow close of ring itself, it's not fully safe. Making it
     fully safe would require making the system call more expensive,
     which isn't worth it.

   - Buffer selection fix

   - Regression fix for O_NONBLOCK retry

   - Make IORING_OP_ACCEPT honor O_NONBLOCK (Jiufei)

   - Restrict opcode handling for SQ/IOPOLL (Pavel)

   - io-wq work handling cleanups and improvements (Pavel, Xiaoguang)

   - IOPOLL race fix (Xiaoguang)"

* tag 'io_uring-5.8-2020-06-11' of git://git.kernel.dk/linux-block:
  io_uring: fix io_kiocb.flags modification race in IOPOLL mode
  io_uring: check file O_NONBLOCK state for accept
  io_uring: avoid unnecessary io_wq_work copy for fast poll feature
  io_uring: avoid whole io_wq_work copy for requests completed inline
  io_uring: allow O_NONBLOCK async retry
  io_wq: add per-wq work handler instead of per work
  io_uring: don't arm a timeout through work.func
  io_uring: remove custom ->func handlers
  io_uring: don't derive close state from ->func
  io_uring: use kvfree() in io_sqe_buffer_register()
  io_uring: validate the full range of provided buffers for access
  io_uring: re-set iov base/len for buffer select retry
  io_uring: move send/recv IOPOLL check into prep
  io_uring: deduplicate io_openat{,2}_prep()
  io_uring: do build_open_how() only once
  io_uring: fix {SQ,IO}POLL with unsupported opcodes
  io_uring: disallow close of ring itself
2020-06-11 16:10:08 -07:00
David Howells
b3597945c8 afs: Fix afs_store_data() to set mtime in new operation descriptor
Fix afs_store_data() so that it sets the mtime in the new operation
descriptor otherwise the mtime on the server gets set to 0 when a write is
stored to the server.

Fixes: e49c7b2f6d ("afs: Build an abstraction around an "operation" concept")
Reported-by: Dave Botsch <botsch@cnf.cornell.edu>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-11 16:04:30 -07:00
Linus Torvalds
623f6dc593 Merge branch 'akpm' (patches from Andrew)
Merge some more updates from Andrew Morton:

 - various hotfixes and minor things

 - hch's use_mm/unuse_mm clearnups

Subsystems affected by this patch series: mm/hugetlb, scripts, kcov,
lib, nilfs, checkpatch, lib, mm/debug, ocfs2, lib, misc.

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  kernel: set USER_DS in kthread_use_mm
  kernel: better document the use_mm/unuse_mm API contract
  kernel: move use_mm/unuse_mm to kthread.c
  kernel: move use_mm/unuse_mm to kthread.c
  stacktrace: cleanup inconsistent variable type
  lib: test get_count_order/long in test_bitops.c
  mm: add comments on pglist_data zones
  ocfs2: fix spelling mistake and grammar
  mm/debug_vm_pgtable: fix kernel crash by checking for THP support
  lib: fix bitmap_parse() on 64-bit big endian archs
  checkpatch: correct check for kernel parameters doc
  nilfs2: fix null pointer dereference at nilfs_segctor_do_construct()
  lib/lz4/lz4_decompress.c: document deliberate use of `&'
  kcov: check kcov_softirq in kcov_remote_stop()
  scripts/spelling: add a few more typos
  khugepaged: selftests: fix timeout condition in wait_for_scan()
2020-06-11 13:25:53 -07:00
Linus Torvalds
a539568299 NFS Client Updates for Linux 5.8
New features and improvements:
 - Sunrpc receive buffer sizes only change when establishing a GSS credentials
 - Add more sunrpc tracepoints
 - Improve on tracepoints to capture internal NFS I/O errors
 
 Other bugfixes and cleanups:
 - Move a dprintk() to after a call to nfs_alloc_fattr()
 - Fix off-by-one issues in rpc_ntop6
 - Fix a few coccicheck warnings
 - Use the correct SPDX license identifiers
 - Fix rpc_call_done assignment for BIND_CONN_TO_SESSION
 - Replace zero-length array with flexible array
 - Remove duplicate headers
 - Set invalid blocks after NFSv4 writes to update space_used attribute
 - Fix direct WRITE throughput regression
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAl7ibyIACgkQ18tUv7Cl
 QOsOHBAA1A1stYld0gOhKZtMqxRJi3fnJ5mgroLGtyVQe8uAjpD8Ib1oRleC4MJq
 ifpYPozIhMZQCvDiGTAKJ8629OYiXGrN8D5nV6Y2tEGpu5wYv98MyZlU9Y8rVzCP
 5vsIMUp5XH8y2wYO8k7fDPPxWNH9Ax89wz5OI16mZxgY/LDm4ojZq+pGbYnWZa4w
 oK6Efa66z7yQkPV8oIWuvLe1zZYWGAPibBEwJbrvUWyfygB3owI36sc6nuiEQM+4
 hD3h5UtVn8BnudUqvLLa21rnQROMFpgYf4Q/2A1UaNfyRAPoPXMztECBSEYXO0L4
 saiMc5o/yTTBCC0ZjV1F+xuGQzMgSQ83KOdbr+a+upvBeFpBynJxccdvMTDEam+q
 rl7Ypdc42CsTZ1aVWG/AoIk6GENzR0tXqNR6BcDjYG/yRWvnt/RIZlp6G67IbtRH
 b9we+3MbI/lTBoCFGahkkBYO3elTNwilxH3pWcRi8ehNn0GPjlLqHePR17Tmq1tL
 QycDlm7QB1m5xNsOOLaBoB4SyguPV0SBprZJ4yYU1B3KC3bGurZVK3+TSLXQrO9V
 12RLDt4AOGr0TlctBIhNbkGp8xHY6Dg7HgbdjdrVq8Y9YCfg0C37789BnZA5nVxF
 4L101lsTI0puymh+MwmhiyOvCldn30f+MjuWJSm17Id+eRIxYj4=
 =a84h
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.8-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
 "New features and improvements:
   - Sunrpc receive buffer sizes only change when establishing a GSS credentials
   - Add more sunrpc tracepoints
   - Improve on tracepoints to capture internal NFS I/O errors

  Other bugfixes and cleanups:
   - Move a dprintk() to after a call to nfs_alloc_fattr()
   - Fix off-by-one issues in rpc_ntop6
   - Fix a few coccicheck warnings
   - Use the correct SPDX license identifiers
   - Fix rpc_call_done assignment for BIND_CONN_TO_SESSION
   - Replace zero-length array with flexible array
   - Remove duplicate headers
   - Set invalid blocks after NFSv4 writes to update space_used attribute
   - Fix direct WRITE throughput regression"

* tag 'nfs-for-5.8-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (27 commits)
  NFS: Fix direct WRITE throughput regression
  SUNRPC: rpc_xprt lifetime events should record xprt->state
  xprtrdma: Make xprt_rdma_slot_table_entries static
  nfs: set invalid blocks after NFSv4 writes
  NFS: remove redundant initialization of variable result
  sunrpc: add missing newline when printing parameter 'auth_hashtable_size' by sysfs
  NFS: Add a tracepoint in nfs_set_pgio_error()
  NFS: Trace short NFS READs
  NFS: nfs_xdr_status should record the procedure name
  SUNRPC: Set SOFTCONN when destroying GSS contexts
  SUNRPC: rpc_call_null_helper() should set RPC_TASK_SOFT
  SUNRPC: rpc_call_null_helper() already sets RPC_TASK_NULLCREDS
  SUNRPC: trace RPC client lifetime events
  SUNRPC: Trace transport lifetime events
  SUNRPC: Split the xdr_buf event class
  SUNRPC: Add tracepoint to rpc_call_rpcerror()
  SUNRPC: Update the RPC_SHOW_SOCKET() macro
  SUNRPC: Update the rpc_show_task_flags() macro
  SUNRPC: Trace GSS context lifetimes
  SUNRPC: receive buffer size estimation values almost never change
  ...
2020-06-11 12:22:41 -07:00
Linus Torvalds
7cf035cc83 Third part of new DAX code for 5.8:
- Teach XFS to ask the VFS to drop an inode if the administrator changes
   the FS_XFLAG_DAX inode flag such that the S_DAX state would change.
   This can result in files changing access modes without requiring an
   unmount cycle.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl7WezgACgkQ+H93GTRK
 tOvGCg//QCKK9yEBE0sjR0hjeJwmvoxAtdCGyrq+h1wZ1OI+iZCy3K4D9+/NSl/Z
 SajLB2bK2mJBBZmERJlRITnTwSFbHX5roMHK4NDTqRAusHdm92JxMWds5ZMXGuTD
 HiJ3Luu4VWHBuraRB5JKxHWyDc57xSE0kC1KwPySQCNAuDv+YS42uZMAM6284T/J
 hkE4odUztykN+tOZj/nY+FiRjhzLF93cghMmRlDlgmibyHGLMhrfEmaMj99eCA5O
 PVepVA6Vk0IHWviAWS8vqX2LADQaP1U2RP4racBAdmk+Z6kqUV+KSotJU1O+6ey/
 Zfnr4VytKxmxngXhR4wnQcu3sIZqDdZRF4mcngE/G1yJXKim5iwVX+D04BsvDOpo
 OpND9+dOuh4ecbayfOGxp9lmBsPQBzI/qPpUcpbtsRd5xN5IyTL6MyNsSPJg2ZrG
 5BaSUO+Hw6hmBcB5MLF36XBuj25QEITl/hxy66Ym2BiQT5im9a3JxRlj8OvoWEVI
 1323WehvPSA1bl7m1mNQyE7/h7TjeGA6LiTplSxPqencKzXn93wcTDV09GOZ11No
 UEb+yC97hzRepEq4hSTr7tu18RN04ryA28/Vtcm/YeeM2bQ5WpI6jTd3b7g55BE2
 EuUGAUFLZRKsvSnLhg/GbuaW442osAKuNttIL1NEyPs0Iw76Ero=
 =mKZf
 -----END PGP SIGNATURE-----

Merge tag 'vfs-5.8-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull DAX updates part three from Darrick Wong:
 "Now that the xfs changes have landed, this third piece changes the
  FS_XFLAG_DAX ioctl code in xfs to request that the inode be reloaded
  after the last program closes the file, if doing so would make a S_DAX
  change happen. The goal here is to make dax access mode switching
  quicker when possible.

  Summary:

   - Teach XFS to ask the VFS to drop an inode if the administrator
     changes the FS_XFLAG_DAX inode flag such that the S_DAX state would
     change. This can result in files changing access modes without
     requiring an unmount cycle"

* tag 'vfs-5.8-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  fs/xfs: Update xfs_ioctl_setattr_dax_invalidate()
  fs/xfs: Combine xfs_diflags_to_linux() and xfs_diflags_to_iflags()
  fs/xfs: Create function xfs_inode_should_enable_dax()
  fs/xfs: Make DAX mount option a tri-state
  fs/xfs: Change XFS_MOUNT_DAX to XFS_MOUNT_DAX_ALWAYS
  fs/xfs: Remove unnecessary initialization of i_rwsem
2020-06-11 10:48:12 -07:00
Chuck Lever
ba838a75e7 NFS: Fix direct WRITE throughput regression
I measured a 50% throughput regression for large direct writes.

The observed on-the-wire behavior is that the client sends every
NFS WRITE twice: once as an UNSTABLE WRITE plus a COMMIT, and once
as a FILE_SYNC WRITE.

This is because the nfs_write_match_verf() check in
nfs_direct_commit_complete() fails for every WRITE.

Buffered writes use nfs_write_completion(), which sets req->wb_verf
correctly. Direct writes use nfs_direct_write_completion(), which
does not set req->wb_verf at all. This leaves req->wb_verf set to
all zeroes for every direct WRITE, and thus
nfs_direct_commit_completion() always sets NFS_ODIRECT_RESCHED_WRITES.

This fix appears to restore nearly all of the lost performance.

Fixes: 1f28476dcb ("NFS: Fix O_DIRECT commit verifier handling")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-06-11 13:33:48 -04:00
Zheng Bin
3a39e77869 nfs: set invalid blocks after NFSv4 writes
Use the following command to test nfsv4(size of file1M is 1MB):
mount -t nfs -o vers=4.0,actimeo=60 127.0.0.1/dir1 /mnt
cp file1M /mnt
du -h /mnt/file1M  -->0 within 60s, then 1M

When write is done(cp file1M /mnt), will call this:
nfs_writeback_done
  nfs4_write_done
    nfs4_write_done_cb
      nfs_writeback_update_inode
        nfs_post_op_update_inode_force_wcc_locked(change, ctime, mtime
nfs_post_op_update_inode_force_wcc_locked
   nfs_set_cache_invalid
   nfs_refresh_inode_locked
     nfs_update_inode

nfsd write response contains change, ctime, mtime, the flag will be
clear after nfs_update_inode. Howerver, write response does not contain
space_used, previous open response contains space_used whose value is 0,
so inode->i_blocks is still 0.

nfs_getattr  -->called by "du -h"
  do_update |= force_sync || nfs_attribute_cache_expired -->false in 60s
  cache_validity = READ_ONCE(NFS_I(inode)->cache_validity)
  do_update |= cache_validity & (NFS_INO_INVALID_ATTR    -->false
  if (do_update) {
        __nfs_revalidate_inode
  }

Within 60s, does not send getattr request to nfsd, thus "du -h /mnt/file1M"
is 0.

Add a NFS_INO_INVALID_BLOCKS flag, set it when nfsv4 write is done.

Fixes: 16e1437517 ("NFS: More fine grained attribute tracking")
Signed-off-by: Zheng Bin <zhengbin13@huawei.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-06-11 13:33:48 -04:00
Colin Ian King
86b936672e NFS: remove redundant initialization of variable result
The variable result is being initialized with a value that is never read
and it is being updated later with a new value.  The initialization is
redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-06-11 13:33:48 -04:00
Chuck Lever
cd2ed9bdc0 NFS: Add a tracepoint in nfs_set_pgio_error()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-06-11 13:33:48 -04:00
Chuck Lever
fd2b612141 NFS: Trace short NFS READs
A short read can generate an -EIO error without there being an error
on the wire. This tracepoint acts as an eyecatcher when there is no
obvious I/O error.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-06-11 13:33:48 -04:00
Chuck Lever
5be5945864 NFS: nfs_xdr_status should record the procedure name
When sunrpc trace points are not enabled, the recorded task ID
information alone is not helpful.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-06-11 13:33:48 -04:00
Linus Torvalds
c742b63473 Highlights:
- Keep nfsd clients from unnecessarily breaking their own delegations:
   Note this requires a small kthreadd addition, discussed at:
   https://lore.kernel.org/r/1588348912-24781-1-git-send-email-bfields@redhat.com
   The result is Tejun Heo's suggestion, and he was OK with this going
   through my tree.
 - Patch nfsd/clients/ to display filenames, and to fix byte-order when
   displaying stateid's.
 - fix a module loading/unloading bug, from Neil Brown.
 - A big series from Chuck Lever with RPC/RDMA and tracing improvements,
   and lay some groundwork for RPC-over-TLS.
 
 Note Stephen Rothwell spotted two conflicts in linux-next.  Both should
 be straightforward:
 	include/trace/events/sunrpc.h
 		https://lore.kernel.org/r/20200529105917.50dfc40f@canb.auug.org.au
 	net/sunrpc/svcsock.c
 		https://lore.kernel.org/r/20200529131955.26c421db@canb.auug.org.au
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAl7iRYwVHGJmaWVsZHNA
 ZmllbGRzZXMub3JnAAoJECebzXlCjuG+yx8QALIfyz/ziPgjGBnNJGCW8BjWHz7+
 rGI+1SP2EUpgJ0fGJc9MpGyYTa5T3pTgsENnIRtegyZDISg2OQ5GfifpkTz4U7vg
 QbWRihs/W9EhltVYhKvtLASAuSAJ8ETbDfLXVb2ncY7iO6JNvb22xwsgKZILmzm1
 uG4qSszmBZzpMUUy51kKJYJZ3ysP+v14qOnyOXEoeEMuJYNK9FkQ9bSPZ6wTJNOn
 hvZBMbU7LzRyVIvp358mFHY+vwq5qBNkJfVrZBkURGn4OxWPbWDXzqOi0Zs1oBjA
 L+QODIbTLGkopu/rD0r1b872PDtket7p5zsD8MreeI1vJOlt3xwqdCGlicIeNATI
 b0RG7sqh+pNv0mvwLxSNTf3rO0EKW6tUySqCnQZUAXFGRH0nYM2TWze4HUr2zfWT
 EgRMwxHY/AZUStZBuCIHPJ6inWnKuxSUELMf2a9JHO1BJc/yClRgmwJGdthVwb9u
 GP6F3/maFu+9YOO6iROMsqtxDA+q5vch5IBzevNOOBDEQDKqENmogR/knl9DmAhF
 sr+FOa3O0u6S4tgXw/TU97JS/h1L2Hu6QVEwU2iVzWtlUUOFVMZQODJTB6Lts4Ka
 gKzYXWvCHN+LyETsN6q7uHFg9mtO7xO5vrrIgo72SuVCscDw/8iHkoOOFLief+GE
 O0fR0IYjW8U1Rkn2
 =YEf0
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.8' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "Highlights:

   - Keep nfsd clients from unnecessarily breaking their own
     delegations.

     Note this requires a small kthreadd addition. The result is Tejun
     Heo's suggestion (see link), and he was OK with this going through
     my tree.

   - Patch nfsd/clients/ to display filenames, and to fix byte-order
     when displaying stateid's.

   - fix a module loading/unloading bug, from Neil Brown.

   - A big series from Chuck Lever with RPC/RDMA and tracing
     improvements, and lay some groundwork for RPC-over-TLS"

Link: https://lore.kernel.org/r/1588348912-24781-1-git-send-email-bfields@redhat.com

* tag 'nfsd-5.8' of git://linux-nfs.org/~bfields/linux: (49 commits)
  sunrpc: use kmemdup_nul() in gssp_stringify()
  nfsd: safer handling of corrupted c_type
  nfsd4: make drc_slab global, not per-net
  SUNRPC: Remove unreachable error condition in rpcb_getport_async()
  nfsd: Fix svc_xprt refcnt leak when setup callback client failed
  sunrpc: clean up properly in gss_mech_unregister()
  sunrpc: svcauth_gss_register_pseudoflavor must reject duplicate registrations.
  sunrpc: check that domain table is empty at module unload.
  NFSD: Fix improperly-formatted Doxygen comments
  NFSD: Squash an annoying compiler warning
  SUNRPC: Clean up request deferral tracepoints
  NFSD: Add tracepoints for monitoring NFSD callbacks
  NFSD: Add tracepoints to the NFSD state management code
  NFSD: Add tracepoints to NFSD's duplicate reply cache
  SUNRPC: svc_show_status() macro should have enum definitions
  SUNRPC: Restructure svc_udp_recvfrom()
  SUNRPC: Refactor svc_recvfrom()
  SUNRPC: Clean up svc_release_skb() functions
  SUNRPC: Refactor recvfrom path dealing with incomplete TCP receives
  SUNRPC: Replace dprintk() call sites in TCP receive path
  ...
2020-06-11 10:33:13 -07:00
Xiaoguang Wang
65a6543da3 io_uring: fix io_kiocb.flags modification race in IOPOLL mode
While testing io_uring in arm, we found sometimes io_sq_thread() keeps
polling io requests even though there are not inflight io requests in
block layer. After some investigations, found a possible race about
io_kiocb.flags, see below race codes:
  1) in the end of io_write() or io_read()
    req->flags &= ~REQ_F_NEED_CLEANUP;
    kfree(iovec);
    return ret;

  2) in io_complete_rw_iopoll()
    if (res != -EAGAIN)
        req->flags |= REQ_F_IOPOLL_COMPLETED;

In IOPOLL mode, io requests still maybe completed by interrupt, then
above codes are not safe, concurrent modifications to req->flags, which
is not protected by lock or is not atomic modifications. I also had
disassemble io_complete_rw_iopoll() in arm:
   req->flags |= REQ_F_IOPOLL_COMPLETED;
   0xffff000008387b18 <+76>:    ldr     w0, [x19,#104]
   0xffff000008387b1c <+80>:    orr     w0, w0, #0x1000
   0xffff000008387b20 <+84>:    str     w0, [x19,#104]

Seems that the "req->flags |= REQ_F_IOPOLL_COMPLETED;" is  load and
modification, two instructions, which obviously is not atomic.

To fix this issue, add a new iopoll_completed in io_kiocb to indicate
whether io request is completed.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-11 09:45:21 -06:00
Ritesh Harjani
8119853653 ext4: mballoc: Use this_cpu_read instead of this_cpu_ptr
Simplify reading a seq variable by directly using this_cpu_read API
instead of doing this_cpu_ptr and then dereferencing it.

This also avoid the below kernel BUG: which happens when
CONFIG_DEBUG_PREEMPT is enabled

BUG: using smp_processor_id() in preemptible [00000000] code: syz-fuzzer/6927
caller is ext4_mb_new_blocks+0xa4d/0x3b70 fs/ext4/mballoc.c:4711
CPU: 1 PID: 6927 Comm: syz-fuzzer Not tainted 5.7.0-next-20200602-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x18f/0x20d lib/dump_stack.c:118
 check_preemption_disabled+0x20d/0x220 lib/smp_processor_id.c:48
 ext4_mb_new_blocks+0xa4d/0x3b70 fs/ext4/mballoc.c:4711
 ext4_ext_map_blocks+0x201b/0x33e0 fs/ext4/extents.c:4244
 ext4_map_blocks+0x4cb/0x1640 fs/ext4/inode.c:626
 ext4_getblk+0xad/0x520 fs/ext4/inode.c:833
 ext4_bread+0x7c/0x380 fs/ext4/inode.c:883
 ext4_append+0x153/0x360 fs/ext4/namei.c:67
 ext4_init_new_dir fs/ext4/namei.c:2757 [inline]
 ext4_mkdir+0x5e0/0xdf0 fs/ext4/namei.c:2802
 vfs_mkdir+0x419/0x690 fs/namei.c:3632
 do_mkdirat+0x21e/0x280 fs/namei.c:3655
 do_syscall_64+0x60/0xe0 arch/x86/entry/common.c:359
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 42f56b7a4a7d ("ext4: mballoc: introduce pcpu seqcnt for freeing PA
to improve ENOSPC handling")
Suggested-by: Borislav Petkov <bp@alien8.de>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reported-by: syzbot+82f324bb69744c5f6969@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/534f275016296996f54ecf65168bb3392b6f653d.1591699601.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-11 11:03:26 -04:00
Eric Biggers
2ce3ee931a ext4: avoid utf8_strncasecmp() with unstable name
If the dentry name passed to ->d_compare() fits in dentry::d_iname, then
it may be concurrently modified by a rename.  This can cause undefined
behavior (possibly out-of-bounds memory accesses or crashes) in
utf8_strncasecmp(), since fs/unicode/ isn't written to handle strings
that may be concurrently modified.

Fix this by first copying the filename to a stack buffer if needed.
This way we get a stable snapshot of the filename.

Fixes: b886ee3e77 ("ext4: Support case-insensitive file name lookups")
Cc: <stable@vger.kernel.org> # v5.2+
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Daniel Rosenberg <drosen@google.com>
Cc: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20200601200543.59417-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-11 11:01:33 -04:00
yangerkun
5adaccac46 ext4: stop overwrite the errcode in ext4_setup_super
Now the errcode from ext4_commit_super will overwrite EROFS exists in
ext4_setup_super. Actually, no need to call ext4_commit_super since we
will return EROFS. Fix it by goto done directly.

Fixes: c89128a008 ("ext4: handle errors on ext4_commit_super")
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200601073404.3712492-1-yangerkun@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-11 10:59:38 -04:00
Jeffle Xu
cfb3c85a60 ext4: fix partial cluster initialization when splitting extent
Fix the bug when calculating the physical block number of the first
block in the split extent.

This bug will cause xfstests shared/298 failure on ext4 with bigalloc
enabled occasionally. Ext4 error messages indicate that previously freed
blocks are being freed again, and the following fsck will fail due to
the inconsistency of block bitmap and bg descriptor.

The following is an example case:

1. First, Initialize a ext4 filesystem with cluster size '16K', block size
'4K', in which case, one cluster contains four blocks.

2. Create one file (e.g., xxx.img) on this ext4 filesystem. Now the extent
tree of this file is like:

...
36864:[0]4:220160
36868:[0]14332:145408
51200:[0]2:231424
...

3. Then execute PUNCH_HOLE fallocate on this file. The hole range is
like:

..
ext4_ext_remove_space: dev 254,16 ino 12 since 49506 end 49506 depth 1
ext4_ext_remove_space: dev 254,16 ino 12 since 49544 end 49546 depth 1
ext4_ext_remove_space: dev 254,16 ino 12 since 49605 end 49607 depth 1
...

4. Then the extent tree of this file after punching is like

...
49507:[0]37:158047
49547:[0]58:158087
...

5. Detailed procedure of punching hole [49544, 49546]

5.1. The block address space:
```
lblk        ~49505  49506   49507~49543     49544~49546    49547~
	  ---------+------+-------------+----------------+--------
	    extent | hole |   extent	|	hole	 | extent
	  ---------+------+-------------+----------------+--------
pblk       ~158045  158046  158047~158083  158084~158086   158087~
```

5.2. The detailed layout of cluster 39521:
```
		cluster 39521
	<------------------------------->

		hole		  extent
	<----------------------><--------

lblk      49544   49545   49546   49547
	+-------+-------+-------+-------+
	|	|	|	|	|
	+-------+-------+-------+-------+
pblk     158084  1580845  158086  158087
```

5.3. The ftrace output when punching hole [49544, 49546]:
- ext4_ext_remove_space (start 49544, end 49546)
  - ext4_ext_rm_leaf (start 49544, end 49546, last_extent [49507(158047), 40], partial [pclu 39522 lblk 0 state 2])
    - ext4_remove_blocks (extent [49507(158047), 40], from 49544 to 49546, partial [pclu 39522 lblk 0 state 2]
      - ext4_free_blocks: (block 158084 count 4)
        - ext4_mballoc_free (extent 1/6753/1)

5.4. Ext4 error message in dmesg:
EXT4-fs error (device vdb): mb_free_blocks:1457: group 1, block 158084:freeing already freed block (bit 6753); block bitmap corrupt.
EXT4-fs error (device vdb): ext4_mb_generate_buddy:747: group 1, block bitmap and bg descriptor inconsistent: 19550 vs 19551 free clusters

In this case, the whole cluster 39521 is freed mistakenly when freeing
pblock 158084~158086 (i.e., the first three blocks of this cluster),
although pblock 158087 (the last remaining block of this cluster) has
not been freed yet.

The root cause of this isuue is that, the pclu of the partial cluster is
calculated mistakenly in ext4_ext_remove_space(). The correct
partial_cluster.pclu (i.e., the cluster number of the first block in the
next extent, that is, lblock 49597 (pblock 158086)) should be 39521 rather
than 39522.

Fixes: f4226d9ea4 ("ext4: fix partial cluster initialization")
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Eric Whitney <enwlinux@gmail.com>
Cc: stable@kernel.org # v3.19+
Link: https://lore.kernel.org/r/1590121124-37096-1-git-send-email-jefflexu@linux.alibaba.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-11 10:57:40 -04:00
Theodore Ts'o
829b37b8cd ext4: avoid race conditions when remounting with options that change dax
Trying to change dax mount options when remounting could allow mount
options to be enabled for a small amount of time, and then the mount
option change would be reverted.

In the case of "mount -o remount,dax", this can cause a race where
files would temporarily treated as DAX --- and then not.

Cc: stable@kernel.org
Reported-by: syzbot+bca9799bf129256190da@syzkaller.appspotmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-11 10:54:07 -04:00
Theodore Ts'o
68cd44920d Enable ext4 support for per-file/directory dax operations
This adds the same per-file/per-directory DAX support for ext4 as was
done for xfs, now that we finally have consensus over what the
interface should be.
2020-06-11 10:51:44 -04:00
Christoph Hellwig
37c54f9bd4 kernel: set USER_DS in kthread_use_mm
Some architectures like arm64 and s390 require USER_DS to be set for
kernel threads to access user address space, which is the whole purpose of
kthread_use_mm, but other like x86 don't.  That has lead to a huge mess
where some callers are fixed up once they are tested on said
architectures, while others linger around and yet other like io_uring try
to do "clever" optimizations for what usually is just a trivial asignment
to a member in the thread_struct for most architectures.

Make kthread_use_mm set USER_DS, and kthread_unuse_mm restore to the
previous value instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Felipe Balbi <balbi@kernel.org>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Zhi Wang <zhi.a.wang@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: http://lkml.kernel.org/r/20200404094101.672954-7-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-10 19:14:18 -07:00
Christoph Hellwig
f5678e7f2a kernel: better document the use_mm/unuse_mm API contract
Switch the function documentation to kerneldoc comments, and add
WARN_ON_ONCE asserts that the calling thread is a kernel thread and does
not have ->mm set (or has ->mm set in the case of unuse_mm).

Also give the functions a kthread_ prefix to better document the use case.

[hch@lst.de: fix a comment typo, cover the newly merged use_mm/unuse_mm caller in vfio]
  Link: http://lkml.kernel.org/r/20200416053158.586887-3-hch@lst.de
[sfr@canb.auug.org.au: powerpc/vas: fix up for {un}use_mm() rename]
  Link: http://lkml.kernel.org/r/20200422163935.5aa93ba5@canb.auug.org.au

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [usb]
Acked-by: Haren Myneni <haren@linux.ibm.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Felipe Balbi <balbi@kernel.org>
Cc: Jason Wang <jasowang@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Zhi Wang <zhi.a.wang@intel.com>
Link: http://lkml.kernel.org/r/20200404094101.672954-6-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-10 19:14:18 -07:00
Christoph Hellwig
9bf5b9eb23 kernel: move use_mm/unuse_mm to kthread.c
Patch series "improve use_mm / unuse_mm", v2.

This series improves the use_mm / unuse_mm interface by better documenting
the assumptions, and my taking the set_fs manipulations spread over the
callers into the core API.

This patch (of 3):

Use the proper API instead.

Link: http://lkml.kernel.org/r/20200404094101.672954-1-hch@lst.de

These helpers are only for use with kernel threads, and I will tie them
more into the kthread infrastructure going forward.  Also move the
prototypes to kthread.h - mmu_context.h was a little weird to start with
as it otherwise contains very low-level MM bits.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Felipe Balbi <balbi@kernel.org>
Cc: Jason Wang <jasowang@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Zhi Wang <zhi.a.wang@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: http://lkml.kernel.org/r/20200404094101.672954-1-hch@lst.de
Link: http://lkml.kernel.org/r/20200416053158.586887-1-hch@lst.de
Link: http://lkml.kernel.org/r/20200404094101.672954-5-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-10 19:14:18 -07:00
Keyur Patel
cc989e7847 ocfs2: fix spelling mistake and grammar
./ocfs2/mmap.c:65: bebongs ==> belonging

Signed-off-by: Keyur Patel <iamkeyur96@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Link: http://lkml.kernel.org/r/20200608014818.102358-1-iamkeyur96@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-10 19:14:18 -07:00
Ryusuke Konishi
8301c719a2 nilfs2: fix null pointer dereference at nilfs_segctor_do_construct()
After commit c3aab9a0bd ("mm/filemap.c: don't initiate writeback if
mapping has no dirty pages"), the following null pointer dereference has
been reported on nilfs2:

  BUG: kernel NULL pointer dereference, address: 00000000000000a8
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: 0000 [#1] SMP PTI
  ...
  RIP: 0010:percpu_counter_add_batch+0xa/0x60
  ...
  Call Trace:
    __test_set_page_writeback+0x2d3/0x330
    nilfs_segctor_do_construct+0x10d3/0x2110 [nilfs2]
    nilfs_segctor_construct+0x168/0x260 [nilfs2]
    nilfs_segctor_thread+0x127/0x3b0 [nilfs2]
    kthread+0xf8/0x130
    ...

This crash turned out to be caused by set_page_writeback() call for
segment summary buffers at nilfs_segctor_prepare_write().

set_page_writeback() can call inc_wb_stat(inode_to_wb(inode),
WB_WRITEBACK) where inode_to_wb(inode) is NULL if the inode of
underlying block device does not have an associated wb.

This fixes the issue by calling inode_attach_wb() in advance to ensure
to associate the bdev inode with its wb.

Fixes: c3aab9a0bd ("mm/filemap.c: don't initiate writeback if mapping has no dirty pages")
Reported-by: Walton Hoops <me@waltonhoops.com>
Reported-by: Tomas Hlavaty <tom@logand.com>
Reported-by: ARAI Shun-ichi <hermes@ceres.dti.ne.jp>
Reported-by: Hideki EIRAKU <hdk1983@gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>	[5.4+]
Link: http://lkml.kernel.org/r/20200608.011819.1399059588922299158.konishi.ryusuke@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-10 19:14:17 -07:00
Linus Torvalds
b29482fde6 Merge branch 'work.epoll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull epoll update from Al Viro:
 "epoll conversion to read_iter from Jens; I thought there might be more
  epoll stuff this cycle, but uaccess took too much time"

* 'work.epoll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  eventfd: convert to f_op->read_iter()
2020-06-10 18:09:13 -07:00
Jiufei Xue
e697deed83 io_uring: check file O_NONBLOCK state for accept
If the socket is O_NONBLOCK, we should complete the accept request
with -EAGAIN when data is not ready.

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-10 18:06:16 -06:00
Xiaoguang Wang
405a5d2b27 io_uring: avoid unnecessary io_wq_work copy for fast poll feature
Basically IORING_OP_POLL_ADD command and async armed poll handlers
for regular commands don't touch io_wq_work, so only REQ_F_WORK_INITIALIZED
is set, can we do io_wq_work copy and restore.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-10 17:58:46 -06:00
Xiaoguang Wang
7cdaf587de io_uring: avoid whole io_wq_work copy for requests completed inline
If requests can be submitted and completed inline, we don't need to
initialize whole io_wq_work in io_init_req(), which is an expensive
operation, add a new 'REQ_F_WORK_INITIALIZED' to determine whether
io_wq_work is initialized and add a helper io_req_init_async(), users
must call io_req_init_async() for the first time touching any members
of io_wq_work.

I use /dev/nullb0 to evaluate performance improvement in my physical
machine:
  modprobe null_blk nr_devices=1 completion_nsec=0
  sudo taskset -c 60 fio  -name=fiotest -filename=/dev/nullb0 -iodepth=128
  -thread -rw=read -ioengine=io_uring -direct=1 -bs=4k -size=100G -numjobs=1
  -time_based -runtime=120

before this patch:
Run status group 0 (all jobs):
   READ: bw=724MiB/s (759MB/s), 724MiB/s-724MiB/s (759MB/s-759MB/s),
   io=84.8GiB (91.1GB), run=120001-120001msec

With this patch:
Run status group 0 (all jobs):
   READ: bw=761MiB/s (798MB/s), 761MiB/s-761MiB/s (798MB/s-798MB/s),
   io=89.2GiB (95.8GB), run=120001-120001msec

About 5% improvement.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-10 17:58:46 -06:00
Linus Torvalds
4dbb29fe9d Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "A couple of trivial patches that fell through the cracks last cycle"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: fix indentation in deactivate_super()
  vfs: Remove duplicated d_mountpoint check in __is_local_mountpoint
2020-06-10 16:09:11 -07:00
Linus Torvalds
1c38372662 Merge branch 'work.sysctl' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull sysctl fixes from Al Viro:
 "Fixups to regressions in sysctl series"

* 'work.sysctl' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  sysctl: reject gigantic reads/write to sysctl files
  cdrom: fix an incorrect __user annotation on cdrom_sysctl_info
  trace: fix an incorrect __user annotation on stack_trace_sysctl
  random: fix an incorrect __user annotation on proc_do_entropy
  net/sysctl: remove leftover __user annotations on neigh_proc_dointvec*
  net/sysctl: use cpumask_parse in flow_limit_cpu_sysctl
2020-06-10 16:05:54 -07:00
Linus Torvalds
4382a79b27 Merge branch 'uaccess.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc uaccess updates from Al Viro:
 "Assorted uaccess patches for this cycle - the stuff that didn't fit
  into thematic series"

* 'uaccess.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  bpf: make bpf_check_uarg_tail_zero() use check_zeroed_user()
  x86: kvm_hv_set_msr(): use __put_user() instead of 32bit __clear_user()
  user_regset_copyout_zero(): use clear_user()
  TEST_ACCESS_OK _never_ had been checked anywhere
  x86: switch cp_stat64() to unsafe_put_user()
  binfmt_flat: don't use __put_user()
  binfmt_elf_fdpic: don't use __... uaccess primitives
  binfmt_elf: don't bother with __{put,copy_to}_user()
  pselect6() and friends: take handling the combined 6th/7th args into helper
2020-06-10 16:02:54 -07:00
Linus Torvalds
79ca035d2d Merge branch 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull proc fix from Eric Biederman:
 "Syzbot found a NULL pointer dereference if kzalloc of s_fs_info fails"

* 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  proc: s_fs_info may be NULL when proc_kill_sb is called
2020-06-10 15:00:11 -07:00
Alexey Gladkov
058f2e4da7 proc: s_fs_info may be NULL when proc_kill_sb is called
syzbot found that proc_fill_super() fails before filling up sb->s_fs_info,
deactivate_locked_super() will be called and sb->s_fs_info will be NULL.
The proc_kill_sb() does not expect fs_info to be NULL which is wrong.

Link: https://lore.kernel.org/lkml/0000000000002d7ca605a7b8b1c5@google.com
Reported-by: syzbot+4abac52934a48af5ff19@syzkaller.appspotmail.com
Fixes: fa10fed30f ("proc: allow to mount many instances of proc in one pid namespace")
Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-06-10 14:54:54 -05:00
Christoph Hellwig
ef9d965bc8 sysctl: reject gigantic reads/write to sysctl files
Instead of triggering a WARN_ON deep down in the page allocator just
give up early on allocations that are way larger than the usual sysctl
values.

Fixes: 32927393dc ("sysctl: pass kernel pointers to ->proc_handler")
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-06-10 14:11:33 -04:00
Steve French
7866c177a0 smb3: fix typo in mount options displayed in /proc/mounts
Missing the final 's' in "max_channels" mount option when displayed in
/proc/mounts (or by mount command)

CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Shyam Prasad N <nspmangalore@gmail.com>
2020-06-10 12:05:15 -05:00
Jens Axboe
c5b856255c io_uring: allow O_NONBLOCK async retry
We can assume that O_NONBLOCK is always honored, even if we don't
have a ->read/write_iter() for the file type. Also unify the read/write
checking for allowing async punt, having the write side factoring in the
REQ_F_NOWAIT flag as well.

Cc: stable@vger.kernel.org
Fixes: 490e89676a ("io_uring: only force async punt if poll based retry can't handle it")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-09 19:38:24 -06:00
Linus Torvalds
5b14671be5 fuse update for 5.8
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXt/0GAAKCRDh3BK/laaZ
 PIJjAP48TurDqomsQMBLiOsSUy0YIhd5QC/G5MYLKSBojXoR+gD+KfqXhVIDz0En
 OI+K4674cNhf4CXNzUedU3qSOaJLfAU=
 =PqbB
 -----END PGP SIGNATURE-----

Merge tag 'fuse-update-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

Pull fuse updates from Miklos Szeredi:

 - Fix a rare deadlock in virtiofs

 - Fix st_blocks in writeback cache mode

 - Fix wrong checks in splice move causing spurious warnings

 - Fix a race between a GETATTR request and a FUSE_NOTIFY_INVAL_INODE
   notification

 - Use rb-tree instead of linear search for pages currently under
   writeout by userspace

 - Fix copy_file_range() inconsistencies

* tag 'fuse-update-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: copy_file_range should truncate cache
  fuse: fix copy_file_range cache issues
  fuse: optimize writepages search
  fuse: update attr_version counter on fuse_notify_inval_inode()
  fuse: don't check refcount after stealing page
  fuse: fix weird page warning
  fuse: use dump_page
  virtiofs: do not use fuse_fill_super_common() for device installation
  fuse: always allow query of st_dev
  fuse: always flush dirty data on close(2)
  fuse: invalidate inode attr in writeback cache mode
  fuse: Update stale comment in queue_interrupt()
  fuse: BUG_ON correction in fuse_dev_splice_write()
  virtiofs: Add mount option and atime behavior to the doc
  virtiofs: schedule blocking async replies in separate worker
2020-06-09 15:48:24 -07:00
Linus Torvalds
52435c86bf overlayfs update for 5.8
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXt9klAAKCRDh3BK/laaZ
 PBeeAP9GRI0yajPzBzz2ZK9KkDc6A7wPiaAec+86Q+c02VncVwEAvq5Pi4um5RTZ
 7SVv56ggKO3Cqx779zVyZTRYDs3+YA4=
 =bpKI
 -----END PGP SIGNATURE-----

Merge tag 'ovl-update-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs

Pull overlayfs updates from Miklos Szeredi:
 "Fixes:

   - Resolve mount option conflicts consistently

   - Sync before remount R/O

   - Fix file handle encoding corner cases

   - Fix metacopy related issues

   - Fix an unintialized return value

   - Add missing permission checks for underlying layers

  Optimizations:

   - Allow multipe whiteouts to share an inode

   - Optimize small writes by inheriting SB_NOSEC from upper layer

   - Do not call ->syncfs() multiple times for sync(2)

   - Do not cache negative lookups on upper layer

   - Make private internal mounts longterm"

* tag 'ovl-update-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (27 commits)
  ovl: remove unnecessary lock check
  ovl: make oip->index bool
  ovl: only pass ->ki_flags to ovl_iocb_to_rwf()
  ovl: make private mounts longterm
  ovl: get rid of redundant members in struct ovl_fs
  ovl: add accessor for ofs->upper_mnt
  ovl: initialize error in ovl_copy_xattr
  ovl: drop negative dentry in upper layer
  ovl: check permission to open real file
  ovl: call secutiry hook in ovl_real_ioctl()
  ovl: verify permissions in ovl_path_open()
  ovl: switch to mounter creds in readdir
  ovl: pass correct flags for opening real directory
  ovl: fix redirect traversal on metacopy dentries
  ovl: initialize OVL_UPPERDATA in ovl_lookup()
  ovl: use only uppermetacopy state in ovl_lookup()
  ovl: simplify setting of origin for index lookup
  ovl: fix out of bounds access warning in ovl_check_fb_len()
  ovl: return required buffer size for file handles
  ovl: sync dirty data when remounting to ro mode
  ...
2020-06-09 15:40:50 -07:00
Linus Torvalds
4964dd2914 AFS fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl7fxCMACgkQ+7dXa6fL
 C2tUNw//VdGTqw3SstdpCqlIQptfXjxIJqFBp5QsQ95b2yHEFcmaeqLP9SWOMPZ9
 588xBMj4K9iN9WZgJdJwTGj5D1XmRwiISYnDh1pzNUPH2IrnlcTXfpsZ+BSWst/P
 XMJ1N3ZtzNymYTi4wzxcV/SjMJd4eX75jBiJn9rgp2PzVnaCUl+a21sLvryON2Eu
 YOpg08IlpRYLMsuHiISrqqgy7/iUzo34RXWbhgJV69zG07xoHtBwGf/iwetbh56G
 lmKp3xix1Fq181HyQLOn4/QkXTIFR1kyKFDdI3HxdRXsBybSZUk9IGaf74o09c7M
 PTS1FsF0lnGz+61Y5mb1UzPRlJSx8CAeCBSN6KWlbm/g9WPXHuxJFwT/KVPV5T13
 qEYn1U1RL8yMlbjezMDM5TXSM4IhBRxn9hQ3qbhamiDI0AT8kr55lJKbxqTO8Cvv
 6ZjYfPtX0D5Z9kce4E4nWFiRWb8xRsANqWdkVsIcxp1yyZvGkHot5LaCedagF+0b
 71Zs0A5YMA4VaaL7sn1xMt/tjZ2ei+5a+cLbfEOoCniE6XqpLm5ZR6WwdbzwwUbV
 hzUn3ByQ5eqC5tkwehNufDLFt1HOzMbnTG9eCxbpUXAH1Q8CDwuROUMf4FGjR97K
 eeYHcEqLhEKgKRCmGYT00K/q1Ce39rcgGZLygdbbwaigEUCjLvU=
 =rH8G
 -----END PGP SIGNATURE-----

Merge tag 'afs-fixes-20200609' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull AFS fixes from David Howells:
 "A set of small patches to fix some things, most of them minor.

   - Fix a memory leak in afs_put_sysnames()

   - Fix an oops in AFS file locking

   - Fix new use of BUG()

   - Fix debugging statements containing %px

   - Remove afs_zero_fid as it's unused

   - Make afs_zap_data() static"

* tag 'afs-fixes-20200609' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  afs: Make afs_zap_data() static
  afs: Remove afs_zero_fid as it's not used
  afs: Fix debugging statements with %px to be %p
  afs: Fix use of BUG()
  afs: Fix file locking
  afs: Fix memory leak in afs_put_sysnames()
2020-06-09 15:38:46 -07:00
Linus Torvalds
42612e7763 f2fs-for-5.8-rc1
In this round, we've added some knobs to enhance compression feature and harden
 testing environment. In addition, we've fixed several bugs reported from Android
 devices such as long discarding latency, device hanging during quota_sync, etc.
 
 Enhancement:
 - support lzo-rle algorithm
 - add two ioctls to release and reserve blocks for compression
 - support partial truncation/fiemap on compressed file
 - introduce sysfs entries to attach IO flags explicitly
 - add iostat trace point along with read io stat
 
 Bug fix:
 - fix long discard latency
 - flush quota data by f2fs_quota_sync correctly
 - fix to recover parent inode number for power-cut recovery
 - fix lz4/zstd output buffer budget
 - parse checkpoint mount option correctly
 - avoid inifinite loop to wait for flushing node/meta pages
 - manage discard space correctly
 
 And some refactoring and clean up patches were added.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAl7fojgACgkQQBSofoJI
 UNKaZQ//Rd6r7Z25SJkAoy+y/m6QDaKg4Ap1wR6+QmirR7HNxtpr3dXSVvmj4Xhu
 ZDJ3LHmerFiwR/X4zFPud+PAoBe3gJa2k7GT8q0g4YkgLy0hfX9PXt0t3I9F8vlk
 8m34j+hQaL9/3FBK4/PSG541vR/UUnwvu6t2pJMnz7rgnLej5I6yOIaoaihz7m+i
 k0ofK5ckuTNcZReAZ2tCIehQku7tDOBLdS5KxvBZBgRh0i5iSXXIa4ddvaMJdT/M
 WcjTZ6N8bFu0hCZ5hz9dyGGYo1XchQosLdLGhcEugsyxNp9Yuftyf5/Ie1wJNiEl
 ZsoRc15X7wfRPKKMMyDFljzPBPFiHr78p30uJ34bcYCu0j0CYi+gbKQztmEMZ2dy
 9M+sDG3jd5R7ACXrwS2ElSEDyLBnTaxbeSdCpErGjn/U19TLllbzhnMA9KR9elDI
 pEWgRc7DPmPbRZaStXMxIamf7pbmUSm0akAYbzGFvMHcSx4MXuQFICGK9t/mhSDm
 sO2b1Ir39yk65sVNdjFsnqDsi6jTPgrLSe3FY4eMhkn15OSiVGhcz7ddQMD7Fbuq
 WLpHFqER650I28i0EXh8bxzjkrj+aJQKhGcVbmwVS33MtKVfBdh4GfQMvS6MbeOM
 MsZ10E7Dr9ildKxqHP5SgLlggkl512lpj3+d6j0mUSSSUP2jtUw=
 =MiEC
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-for-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "In this round, we've added some knobs to enhance compression feature
  and harden testing environment. In addition, we've fixed several bugs
  reported from Android devices such as long discarding latency, device
  hanging during quota_sync, etc.

  Enhancements:
   - support lzo-rle algorithm
   - add two ioctls to release and reserve blocks for compression
   - support partial truncation/fiemap on compressed file
   - introduce sysfs entries to attach IO flags explicitly
   - add iostat trace point along with read io stat

  Bug fixes:
   - fix long discard latency
   - flush quota data by f2fs_quota_sync correctly
   - fix to recover parent inode number for power-cut recovery
   - fix lz4/zstd output buffer budget
   - parse checkpoint mount option correctly
   - avoid inifinite loop to wait for flushing node/meta pages
   - manage discard space correctly

  And some refactoring and clean up patches were added"

* tag 'f2fs-for-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (51 commits)
  f2fs: attach IO flags to the missing cases
  f2fs: add node_io_flag for bio flags likewise data_io_flag
  f2fs: remove unused parameter of f2fs_put_rpages_mapping()
  f2fs: handle readonly filesystem in f2fs_ioc_shutdown()
  f2fs: avoid utf8_strncasecmp() with unstable name
  f2fs: don't return vmalloc() memory from f2fs_kmalloc()
  f2fs: fix retry logic in f2fs_write_cache_pages()
  f2fs: fix wrong discard space
  f2fs: compress: don't compress any datas after cp stop
  f2fs: remove unneeded return value of __insert_discard_tree()
  f2fs: fix wrong value of tracepoint parameter
  f2fs: protect new segment allocation in expand_inode_data
  f2fs: code cleanup by removing ifdef macro surrounding
  f2fs: avoid inifinite loop to wait for flushing node pages at cp_error
  f2fs: flush dirty meta pages when flushing them
  f2fs: fix checkpoint=disable:%u%%
  f2fs: compress: fix zstd data corruption
  f2fs: add compressed/gc data read IO stat
  f2fs: fix potential use-after-free issue
  f2fs: compress: don't handle non-compressed data in workqueue
  ...
2020-06-09 11:28:59 -07:00
Linus Torvalds
ad57a1022f Description for this pull request:
* Bug fixes
   - Fix memory leak on mount failure with iocharset= option.
   - Fix Incorrect update of stream entry.
   - Fix cluster range validation error.
 
 * Clean-up codes
   - Remove unused code and unneeded assignment.
   - Rename variables in exfat structure as specification.
   - Reorganize boot sector analysis code.
   - Simplify exfat_utf8_d_hash and exfat_utf8_d_cmp().
   - Optimize exfat entry cache functions.
   - Improve wording of EXFAT_DEFAULT_IOCHARSET config option.
 
 * New Feature
   - Add boot region verification.
 -----BEGIN PGP SIGNATURE-----
 
 iQJMBAABCgA2FiEE6NzKS6Uv/XAAGHgyZwv7A1FEIQgFAl7fQBwYHG5hbWphZS5q
 ZW9uQHNhbXN1bmcuY29tAAoJEGcL+wNRRCEIyWMP/2CqlryPilKiXj/C2n9r2s5O
 7NNABC7xhyILk9fGz/mUOGohqBQXNNbZUDS17m2xbygw3vkXYN72ejDb/1DLVU8E
 LsYd85Pj8l7kMnOmjXKNLetoql1S3nm19PgIB7GYNI/BfeBFXcyxQdOTOlwq28w7
 PkfnWhnvnIxTfbTJj6EFB5tPYDycpm32LiUSQqsAmy2i0pC9WY6w4PnJz/c8wiqe
 +LZkLtZ1blGSKLY6C1FotVi7OmjiRWm0e+sdPE/Rsaxb/nnL/S7Nt03GPHZMkGxm
 eVq5MBUadQAr61duIWKcF7dFUmqqVTAO/bgYrxB4ljd/1j1lwWwZjD7iLnbsOfOy
 +Go5NsDoLEySKp7JSkLJ8S6mdKsAyAf4TK8diZlIGGfF7jV6puo3h9yDk0e6U0/G
 E613f60O5bymQWe9STLiJwMo65M7rjzuT3WUcTFuf58LqS6UR+ngq089V4lV720N
 USxZu7wtO5m0j5feXY72x6E/xaL1wqbMuHr0defQZ9CN8JZKCRtthletjI8TVDOZ
 hxIASZacQdWkWBL4mCs3lmaflSaD32J7RxPSqnQHMxrB6UVh9lT97rQBGGnbyRyL
 2Hqwe8cUk/ki6fOmpNvyIUh01S+wtgVGuAAEoKPEIKGmDw1KeAGXOpVX1NPcbZWT
 s7HTy7H3SfAnNAED8+Ct
 =Dgtx
 -----END PGP SIGNATURE-----

Merge tag 'exfat-for-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat

Pull exfat update from Namjae Jeon:
 "Bug fixes:
   - Fix memory leak on mount failure with iocharset= option
   - Fix incorrect update of stream entry
   - Fix cluster range validation error

  Clean-ups:
   - Remove unused code and unneeded assignment
   - Rename variables in exfat structure as specification
   - Reorganize boot sector analysis code
   - Simplify exfat_utf8_d_hash and exfat_utf8_d_cmp()
   - Optimize exfat entry cache functions
   - Improve wording of EXFAT_DEFAULT_IOCHARSET config option

 New Feature:
   - Add boot region verification"

* tag 'exfat-for-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat:
  exfat: Fix potential use after free in exfat_load_upcase_table()
  exfat: fix range validation error in alloc and free cluster
  exfat: fix incorrect update of stream entry in __exfat_truncate()
  exfat: fix memory leak in exfat_parse_param()
  exfat: remove unnecessary reassignment of p_uniname->name_len
  exfat: standardize checksum calculation
  exfat: add boot region verification
  exfat: separate the boot sector analysis
  exfat: redefine PBR as boot_sector
  exfat: optimize dir-cache
  exfat: replace 'time_ms' with 'time_cs'
  exfat: remove the assignment of 0 to bool variable
  exfat: Remove unused functions exfat_high_surrogate() and exfat_low_surrogate()
  exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF
  exfat: Improve wording of EXFAT_DEFAULT_IOCHARSET config option
  exfat: Use a more common logging style
  exfat: Simplify exfat_utf8_d_cmp() for code points above U+FFFF
2020-06-09 11:24:59 -07:00
David Sterba
f1084bc60a Revert "fs: remove dio_end_io()"
This reverts commit b75b7ca7c2.

The patch restores a helper that was not necessary after direct IO port
to iomap infrastructure, which gets reverted.

Signed-off-by: David Sterba <dsterba@suse.com>
2020-06-09 19:23:18 +02:00
David Sterba
8e0fa5d7b3 Revert "btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK"
This reverts commit 5f008163a5.

The patch is a simplification after direct IO port to iomap
infrastructure, which gets reverted.

Signed-off-by: David Sterba <dsterba@suse.com>
2020-06-09 19:21:48 +02:00
David Sterba
f4c48b4408 Revert "btrfs: split btrfs_direct_IO to read and write part"
This reverts commit d8f3e73587.

The patch is a cleanup of direct IO port to iomap infrastructure,
which gets reverted.

Signed-off-by: David Sterba <dsterba@suse.com>
2020-06-09 19:19:27 +02:00
David Howells
c68421bbad afs: Make afs_zap_data() static
Make afs_zap_data() static as it's only used in the file in which it is
defined.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-09 18:17:14 +01:00
David Howells
4a06fa5403 afs: Remove afs_zero_fid as it's not used
Remove afs_zero_fid as it's not used.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-09 18:17:14 +01:00
David Howells
fed79fd783 afs: Fix debugging statements with %px to be %p
Fix a couple of %px to be %p in debugging statements.

Fixes: e49c7b2f6d ("afs: Build an abstraction around an "operation" concept")
Fixes: 8a070a9648 ("afs: Detect cell aliases 1 - Cells with root volumes")
Reported-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
2020-06-09 18:17:14 +01:00
Linus Torvalds
595a56ac1b linux-kselftest-kunit-5.8-rc1
This Kunit update for Linux 5.8-rc1 consists of:
 
 - Several config fragment fixes from Anders Roxell to improve
   test coverage.
 - Improvements to kunit run script to use defconfig as default and
   restructure the code for config/build/exec/parse from Vitor Massaru Iha
   and David Gow.
 - Miscellaneous documentation warn fix.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEPZKym/RZuOCGeA/kCwJExA0NQxwFAl7etrcACgkQCwJExA0N
 QxzGYg/+KHpPhB31IAjNFKCRqwDooftst3dohhzguxJLpDHdEmVJ4moQhLr4gL+/
 qpi3T9hr4Rx++n/A5NoxDvyJvGr+FAL40U+Of7F2UyHpqQmfKPj37I+yvyeR1JEL
 z4+yXEpfQLZaQkmZ7f3GWHyqN3+xwvyTEy7NYUad7xMxLF/99No+I6RMD6yp3srS
 wUUeuBIesSFT0LXYrgI+wgsNGUESlj/McjiP5eMj6UtlMgKpzmfzH56Fia8uw1pw
 6QtpntxDHjtxVfp8YKM4qExI54YI2t6sgHTIoOUsMWD5Q2kHd8kNf1L+lb1sKYUF
 j7lzol5nuqqchAVQYjHzNHa8XKndvexGyWMsPz1gAnkpgVrvBTSFcavdDpDuDZ0T
 HoJZnk9XPsguBQjDxapzPYfAQ81Un/rEmZQ8/X2TaNjdSIH1hHljhaP2OZ6eND/Q
 iobq9x8nC9D95TIqjDbRw3Sp2na/pZLN8Gp27hmKlc+L1XzV8NuZe/WGOUe3lsrq
 fG1ZSLo/iRau8gHuF6fRSrGIzQSCEMGKl3jlQ28OT9HGMAgTlncEwVzQId48/AsS
 UOY+bIAnRZuK+B5F/vw6L3o1e3c17z5bruVlb0M0alP5b7P9/3WLNHsHA3r8haZF
 F6PwIWu41wdRjJf2HI7zD5LaQe/7oU3jfwvuA7n2z8Py+zGx7m4=
 =S+HY
 -----END PGP SIGNATURE-----

Merge tag 'linux-kselftest-kunit-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest

Pull Kunit updates from Shuah Khan:
 "This consists of:

   - Several config fragment fixes from Anders Roxell to improve test
     coverage.

   - Improvements to kunit run script to use defconfig as default and
     restructure the code for config/build/exec/parse from Vitor Massaru
     Iha and David Gow.

   - Miscellaneous documentation warn fix"

* tag 'linux-kselftest-kunit-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
  security: apparmor: default KUNIT_* fragments to KUNIT_ALL_TESTS
  fs: ext4: default KUNIT_* fragments to KUNIT_ALL_TESTS
  drivers: base: default KUNIT_* fragments to KUNIT_ALL_TESTS
  lib: Kconfig.debug: default KUNIT_* fragments to KUNIT_ALL_TESTS
  kunit: default KUNIT_* fragments to KUNIT_ALL_TESTS
  kunit: Kconfig: enable a KUNIT_ALL_TESTS fragment
  kunit: Fix TabError, remove defconfig code and handle when there is no kunitconfig
  kunit: use KUnit defconfig by default
  kunit: use --build_dir=.kunit as default
  Documentation: test.h - fix warnings
  kunit: kunit_tool: Separate out config/build/exec/parse
2020-06-09 10:04:47 -07:00
Michel Lespinasse
c1e8d7c6a7 mmap locking API: convert mmap_sem comments
Convert comments that reference mmap_sem to reference mmap_lock instead.

[akpm@linux-foundation.org: fix up linux-next leftovers]
[akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
[akpm@linux-foundation.org: more linux-next fixups, per Michel]

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:14 -07:00
Michel Lespinasse
3e4e28c5a8 mmap locking API: convert mmap_sem API comments
Convert comments that reference old mmap_sem APIs to reference
corresponding new mmap locking APIs instead.

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-12-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:14 -07:00
Michel Lespinasse
42fc541404 mmap locking API: add mmap_assert_locked() and mmap_assert_write_locked()
Add new APIs to assert that mmap_sem is held.

Using this instead of rwsem_is_locked and lockdep_assert_held[_write]
makes the assertions more tolerant of future changes to the lock type.

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-10-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:14 -07:00
Michel Lespinasse
89154dd531 mmap locking API: convert mmap_sem call sites missed by coccinelle
Convert the last few remaining mmap_sem rwsem calls to use the new mmap
locking API.  These were missed by coccinelle for some reason (I think
coccinelle does not support some of the preprocessor constructs in these
files ?)

[akpm@linux-foundation.org: convert linux-next leftovers]
[akpm@linux-foundation.org: more linux-next leftovers]
[akpm@linux-foundation.org: more linux-next leftovers]

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-6-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:14 -07:00
Michel Lespinasse
d8ed45c5dc mmap locking API: use coccinelle to convert mmap_sem rwsem call sites
This change converts the existing mmap_sem rwsem calls to use the new mmap
locking API instead.

The change is generated using coccinelle with the following rule:

// spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

@@
expression mm;
@@
(
-init_rwsem
+mmap_init_lock
|
-down_write
+mmap_write_lock
|
-down_write_killable
+mmap_write_lock_killable
|
-down_write_trylock
+mmap_write_trylock
|
-up_write
+mmap_write_unlock
|
-downgrade_write
+mmap_write_downgrade
|
-down_read
+mmap_read_lock
|
-down_read_killable
+mmap_read_lock_killable
|
-down_read_trylock
+mmap_read_trylock
|
-up_read
+mmap_read_unlock
)
-(&mm->mmap_sem)
+(mm)

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:14 -07:00
Mike Rapoport
e31cf2f4ca mm: don't include asm/pgtable.h if linux/mm.h is already included
Patch series "mm: consolidate definitions of page table accessors", v2.

The low level page table accessors (pXY_index(), pXY_offset()) are
duplicated across all architectures and sometimes more than once.  For
instance, we have 31 definition of pgd_offset() for 25 supported
architectures.

Most of these definitions are actually identical and typically it boils
down to, e.g.

static inline unsigned long pmd_index(unsigned long address)
{
        return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
}

static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
{
        return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
}

These definitions can be shared among 90% of the arches provided
XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.

For architectures that really need a custom version there is always
possibility to override the generic version with the usual ifdefs magic.

These patches introduce include/linux/pgtable.h that replaces
include/asm-generic/pgtable.h and add the definitions of the page table
accessors to the new header.

This patch (of 12):

The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
functions involving page table manipulations, e.g.  pte_alloc() and
pmd_alloc().  So, there is no point to explicitly include <asm/pgtable.h>
in the files that include <linux/mm.h>.

The include statements in such cases are remove with a simple loop:

	for f in $(git grep -l "include <linux/mm.h>") ; do
		sed -i -e '/include <asm\/pgtable.h>/ d' $f
	done

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:13 -07:00
David Howells
9ca0652596 afs: Fix use of BUG()
Fix afs_compare_addrs() to use WARN_ON(1) instead of BUG() and return 1
(ie. srx_a > srx_b).

There's no point trying to put actual error handling in as this should not
occur unless a new transport address type is allowed by AFS.  And even if
it does, in this particular case, it'll just never match unknown types of
addresses.  This BUG() was more of a 'you need to add a case here'
indicator.

Reported-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
2020-06-09 17:21:03 +01:00
David Howells
5749ce92c4 afs: Fix file locking
Fix AFS file locking to use the correct vnode pointer and remove a member
of the afs_operation struct that is never set, but it is read and followed,
causing an oops.

This can be triggered by:

	flock -s /afs/example.com/foo sleep 1

when it calls the kernel to get a file lock.

Fixes: e49c7b2f6d ("afs: Build an abstraction around an "operation" concept")
Reported-by: Dave Botsch <botsch@cnf.cornell.edu>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Dave Botsch <botsch@cnf.cornell.edu>
2020-06-09 15:22:06 +01:00
Zhihao Cheng
2ca068be09 afs: Fix memory leak in afs_put_sysnames()
Fix afs_put_sysnames() to actually free the specified afs_sysnames
object after its reference count has been decreased to zero and
its contents have been released.

Fixes: 6f8880d8e6 ("afs: Implement @sys substitution handling")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-09 15:22:06 +01:00
Dan Carpenter
fc961522dd exfat: Fix potential use after free in exfat_load_upcase_table()
This code calls brelse(bh) and then dereferences "bh" on the next line
resulting in a possible use after free.  The brelse() should just be
moved down a line.

Fixes: b676fdbcf4c8 ("exfat: standardize checksum calculation")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
2020-06-09 16:50:18 +09:00
hyeongseok.kim
a949824f01 exfat: fix range validation error in alloc and free cluster
There is check error in range condition that can never be entered
even with invalid input.
Replace incorrent checking code with already existing valid checker.

Signed-off-by: hyeongseok.kim <hyeongseok@gmail.com>
Acked-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
2020-06-09 16:50:12 +09:00
Namjae Jeon
29bbb14bfc exfat: fix incorrect update of stream entry in __exfat_truncate()
At truncate, there is a problem of incorrect updating in the file entry
pointer instead of stream entry. This will cause the problem of
overwriting the time field of the file entry to new_size. Fix it to
update stream entry.

Fixes: 98d917047e ("exfat: add file operations")
Cc: stable@vger.kernel.org # v5.7
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
2020-06-09 16:50:07 +09:00
Al Viro
f341a7d8dc exfat: fix memory leak in exfat_parse_param()
butt3rflyh4ck reported memory leak found by syzkaller.

A param->string held by exfat_mount_options.

BUG: memory leak

unreferenced object 0xffff88801972e090 (size 8):
  comm "syz-executor.2", pid 16298, jiffies 4295172466 (age 14.060s)
  hex dump (first 8 bytes):
    6b 6f 69 38 2d 75 00 00                          koi8-u..
  backtrace:
    [<000000005bfe35d6>] kstrdup+0x36/0x70 mm/util.c:60
    [<0000000018ed3277>] exfat_parse_param+0x160/0x5e0
fs/exfat/super.c:276
    [<000000007680462b>] vfs_parse_fs_param+0x2b4/0x610
fs/fs_context.c:147
    [<0000000097c027f2>] vfs_parse_fs_string+0xe6/0x150
fs/fs_context.c:191
    [<00000000371bf78f>] generic_parse_monolithic+0x16f/0x1f0
fs/fs_context.c:231
    [<000000005ce5eb1b>] do_new_mount fs/namespace.c:2812 [inline]
    [<000000005ce5eb1b>] do_mount+0x12bb/0x1b30 fs/namespace.c:3141
    [<00000000b642040c>] __do_sys_mount fs/namespace.c:3350 [inline]
    [<00000000b642040c>] __se_sys_mount fs/namespace.c:3327 [inline]
    [<00000000b642040c>] __x64_sys_mount+0x18f/0x230 fs/namespace.c:3327
    [<000000003b024e98>] do_syscall_64+0xf6/0x7d0
arch/x86/entry/common.c:295
    [<00000000ce2b698c>] entry_SYSCALL_64_after_hwframe+0x49/0xb3

exfat_free() should call exfat_free_iocharset(), to prevent a leak
in case we fail after parsing iocharset= but before calling
get_tree_bdev().

Additionally, there's no point copying param->string in
exfat_parse_param() - just steal it, leaving NULL in param->string.
That's independent from the leak or fix thereof - it's simply
avoiding an extra copy.

Fixes: 719c1e1829 ("exfat: add super block operations")
Cc: stable@vger.kernel.org # v5.7
Reported-by: butt3rflyh4ck <butterflyhuangxx@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
2020-06-09 16:50:02 +09:00
Namjae Jeon
f78059805f exfat: remove unnecessary reassignment of p_uniname->name_len
kbuild test robot reported :

	fs/exfat/nls.c:531:22: warning: Variable 'p_uniname->name_len'
	is reassigned a value before the old one has been used.

The reassignment of p_uniname->name_len is not needed and remove it.

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
2020-06-09 16:49:32 +09:00
Tetsuhiro Kohada
5875bf287d exfat: standardize checksum calculation
To clarify that it is a 16-bit checksum, the parts related to the 16-bit
checksum are renamed and change type to u16.
Furthermore, replace checksum calculation in exfat_load_upcase_table()
with exfat_calc_checksum32().

Signed-off-by: Tetsuhiro Kohada <kohada.t2@gmail.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
2020-06-09 16:49:25 +09:00
Tetsuhiro Kohada
476189c0ef exfat: add boot region verification
Add Boot-Regions verification specified in exFAT specification.
Note that the checksum type is strongly related to the raw structure,
so the'u32 'type is used to clarify the number of bits.

Signed-off-by: Tetsuhiro Kohada <kohada.t2@gmail.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
2020-06-09 16:49:19 +09:00
Tetsuhiro Kohada
33404a1598 exfat: separate the boot sector analysis
Separate the boot sector analysis to read_boot_sector().
And add a check for the fs_name field.
Furthermore, add a strict consistency check, because overlapping areas
can cause serious corruption.

Signed-off-by: Tetsuhiro Kohada <kohada.t2@gmail.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
2020-06-09 16:49:14 +09:00
Tetsuhiro Kohada
181a9e8009 exfat: redefine PBR as boot_sector
Aggregate PBR related definitions and redefine as "boot_sector" to comply
with the exFAT specification.
And, rename variable names including 'pbr'.

Signed-off-by: Tetsuhiro Kohada <kohada.t2@gmail.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
2020-06-09 16:49:10 +09:00
Tetsuhiro Kohada
943af1fdac exfat: optimize dir-cache
Optimize directory access based on exfat_entry_set_cache.
 - Hold bh instead of copied d-entry.
 - Modify bh->data directly instead of the copied d-entry.
 - Write back the retained bh instead of rescanning the d-entry-set.
And
 - Remove unused cache related definitions.

Signed-off-by: Tetsuhiro Kohada <kohada.tetsuhiro@dc.mitsubishielectric.co.jp>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
2020-06-09 16:49:05 +09:00
Tetsuhiro Kohada
ed0f84d30b exfat: replace 'time_ms' with 'time_cs'
Replace time_ms  with time_cs in the file directory entry structure
and related functions.

The unit of create_time_ms/modify_time_ms in File Directory Entry are not
'milli-second', but 'centi-second'.
The exfat specification uses the term '10ms', but instead use 'cs' as in
msdos_fs.h.

Signed-off-by: Tetsuhiro Kohada <kohada.t2@gmail.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
2020-06-09 16:49:00 +09:00
Jason Yan
cdc06129a6 exfat: remove the assignment of 0 to bool variable
There is no need to init 'sync' in exfat_set_vol_flags().
This also fixes the following coccicheck warning:

fs/exfat/super.c:104:6-10: WARNING: Assignment of 0/1 to bool variable

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
2020-06-09 16:48:53 +09:00
Pali Rohár
6778337a7a exfat: Remove unused functions exfat_high_surrogate() and exfat_low_surrogate()
After applying previous two patches, these functions are not used anymore.

Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
2020-06-09 16:48:49 +09:00
Pali Rohár
dddf7da398 exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF
Function partial_name_hash() takes long type value into which can be stored
one Unicode code point. Therefore conversion from UTF-32 to UTF-16 is not
needed.

Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
2020-06-09 16:48:44 +09:00
Geert Uytterhoeven
31f5acc0aa exfat: Improve wording of EXFAT_DEFAULT_IOCHARSET config option
- Use consistent capitalization for "exFAT".
  - Fix grammar,
  - Split long sentence.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
2020-06-09 16:48:39 +09:00
Joe Perches
d1727d55c0 exfat: Use a more common logging style
Remove the direct use of KERN_<LEVEL> in functions by creating
separate exfat_<level> macros.

Miscellanea:

o Remove several unnecessary terminating newlines in formats
o Realign arguments and fit to 80 columns where appropriate

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
2020-06-09 16:48:34 +09:00
Pali Rohár
197298a649 exfat: Simplify exfat_utf8_d_cmp() for code points above U+FFFF
If two Unicode code points represented in UTF-16 are different then also
their UTF-32 representation must be different. Therefore conversion from
UTF-32 to UTF-16 is not needed.

Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
2020-06-09 16:48:28 +09:00
Kenneth D'souza
0b0430c6a1 cifs: Add get_security_type_str function to return sec type.
This code is more organized and robust.

Signed-off-by: Kenneth D'souza <kdsouza@redhat.com>
Signed-off-by: Roberto Bergantinos Corpas <rbergant@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-06-08 23:57:21 -05:00
Matthew Wilcox (Oracle)
d4ff3b2ef9 iomap: Fix unsharing of an extent >2GB on a 32-bit machine
Widen the type used for counting the number of bytes unshared.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-06-08 20:58:29 -07:00
Chuhong Yuan
8cc0072469 xfs: Add the missed xfs_perag_put() for xfs_ifree_cluster()
xfs_ifree_cluster() calls xfs_perag_get() at the beginning, but forgets to
call xfs_perag_put() in one failed path.
Add the missed function call to fix it.

Fixes: ce92464c18 ("xfs: make xfs_trans_get_buf return an error code")
Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-06-08 20:57:03 -07:00
Jaegeuk Kim
b7b911d59d f2fs: attach IO flags to the missing cases
This adds more IOs to attach flags.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-06-08 20:37:54 -07:00
Jaegeuk Kim
32b6aba85c f2fs: add node_io_flag for bio flags likewise data_io_flag
This patch adds another way to attach bio flags to node writes.

Description:   Give a way to attach REQ_META|FUA to node writes
               given temperature-based bits. Now the bits indicate:
               *      REQ_META     |      REQ_FUA      |
               *    5 |    4 |   3 |    2 |    1 |   0 |
               * Cold | Warm | Hot | Cold | Warm | Hot |

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-06-08 20:37:54 -07:00
Chao Yu
bc67c5d0ce f2fs: remove unused parameter of f2fs_put_rpages_mapping()
Just cleanup, no logic change.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-06-08 20:37:53 -07:00
Chao Yu
8626441f05 f2fs: handle readonly filesystem in f2fs_ioc_shutdown()
If mountpoint is readonly, we should allow shutdowning filesystem
successfully, this fixes issue found by generic/599 testcase of
xfstest.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-06-08 20:37:53 -07:00
Eric Biggers
fc3bb095ab f2fs: avoid utf8_strncasecmp() with unstable name
If the dentry name passed to ->d_compare() fits in dentry::d_iname, then
it may be concurrently modified by a rename.  This can cause undefined
behavior (possibly out-of-bounds memory accesses or crashes) in
utf8_strncasecmp(), since fs/unicode/ isn't written to handle strings
that may be concurrently modified.

Fix this by first copying the filename to a stack buffer if needed.
This way we get a stable snapshot of the filename.

Fixes: 2c2eb7a300 ("f2fs: Support case-insensitive file name lookups")
Cc: <stable@vger.kernel.org> # v5.4+
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Daniel Rosenberg <drosen@google.com>
Cc: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-06-08 20:37:53 -07:00
Eric Biggers
0b6d4ca04a f2fs: don't return vmalloc() memory from f2fs_kmalloc()
kmalloc() returns kmalloc'ed memory, and kvmalloc() returns either
kmalloc'ed or vmalloc'ed memory.  But the f2fs wrappers, f2fs_kmalloc()
and f2fs_kvmalloc(), both return both kinds of memory.

It's redundant to have two functions that do the same thing, and also
breaking the standard naming convention is causing bugs since people
assume it's safe to kfree() memory allocated by f2fs_kmalloc().  See
e.g. the various allocations in fs/f2fs/compress.c.

Fix this by making f2fs_kmalloc() just use kmalloc().  And to avoid
re-introducing the allocation failures that the vmalloc fallback was
intended to fix, convert the largest allocations to use f2fs_kvmalloc().

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-06-08 20:34:58 -07:00
Linus Torvalds
95288a9b3b The highlights are:
- OSD/MDS latency and caps cache metrics infrastructure for the
   filesytem (Xiubo Li).  Currently available through debugfs and
   will be periodically sent to the MDS in the future.
 
 - support for replica reads (balanced and localized reads) for
   rbd and the filesystem (myself).  The default remains to always
   read from primary, users can opt-in with the new crush_location
   and read_from_replica options.  Note that reading from replica
   is safe for general use only since Octopus.
 
 - support for RADOS allocation hint flags (myself).  Currently
   used by rbd to propagate the compressible/incompressible hint
   given with the new compression_hint map option and ready for
   passing on more advanced hints, e.g. based on fadvise() from
   the filesystem.
 
 - support for efficient cross-quota-realm renames (Luis Henriques)
 
 - assorted cap handling improvements and cleanups, particularly
   untangling some of the locking (Jeff Layton)
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl7eZP0THGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHziwJDB/98bH+dsJidUkRctVerX933DvgmRGva
 sIxR0otqCK2zlucKSy8R8awbhVQ2lz4DQm9vrlwFQHBjZqXnrMzDG4rd/PukmKap
 l8DjHRgEsH698zjwDlyyz7/1ZqOOUcCKr5fly3Erqr92yWGoy2ve76LtTKgB5jnv
 wdwMk5v/NBWoxZ3Q1cvexbCtc60l0FCSH4FnH7NtT8eR9zCmL9vlpZWdjKi+U5em
 6tTONuSq+0F4a9eXEv6QHEjRjkRo1WlttGdK3bX7mXD4O22TslgKg9hYsVoQVTiW
 Cc9n6Pggv2tbUnPgn/x342W26QyMgcoHCzrYPR7w0JrU61TzBewxqfpg
 =4fqQ
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.8-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "The highlights are:

   - OSD/MDS latency and caps cache metrics infrastructure for the
     filesytem (Xiubo Li). Currently available through debugfs and will
     be periodically sent to the MDS in the future.

   - support for replica reads (balanced and localized reads) for rbd
     and the filesystem (myself). The default remains to always read
     from primary, users can opt-in with the new crush_location and
     read_from_replica options. Note that reading from replica is safe
     for general use only since Octopus.

   - support for RADOS allocation hint flags (myself). Currently used by
     rbd to propagate the compressible/incompressible hint given with
     the new compression_hint map option and ready for passing on more
     advanced hints, e.g. based on fadvise() from the filesystem.

   - support for efficient cross-quota-realm renames (Luis Henriques)

   - assorted cap handling improvements and cleanups, particularly
     untangling some of the locking (Jeff Layton)"

* tag 'ceph-for-5.8-rc1' of git://github.com/ceph/ceph-client: (29 commits)
  rbd: compression_hint option
  libceph: support for alloc hint flags
  libceph: read_from_replica option
  libceph: support for balanced and localized reads
  libceph: crush_location infrastructure
  libceph: decode CRUSH device/bucket types and names
  libceph: add non-asserting rbtree insertion helper
  ceph: skip checking caps when session reconnecting and releasing reqs
  ceph: make sure mdsc->mutex is nested in s->s_mutex to fix dead lock
  ceph: don't return -ESTALE if there's still an open file
  libceph, rbd: replace zero-length array with flexible-array
  ceph: allow rename operation under different quota realms
  ceph: normalize 'delta' parameter usage in check_quota_exceeded
  ceph: ceph_kick_flushing_caps needs the s_mutex
  ceph: request expedited service on session's last cap flush
  ceph: convert mdsc->cap_dirty to a per-session list
  ceph: reset i_requested_max_size if file write is not wanted
  ceph: throw a warning if we destroy session with mutex still locked
  ceph: fix potential race in ceph_check_caps
  ceph: document what protects i_dirty_item and i_flushing_item
  ...
2020-06-08 12:49:18 -07:00
Pavel Begunkov
f5fa38c59c io_wq: add per-wq work handler instead of per work
io_uring is the only user of io-wq, and now it uses only io-wq callback
for all its requests, namely io_wq_submit_work(). Instead of storing
work->runner callback in each instance of io_wq_work, keep it in io-wq
itself.

pros:
- reduces io_wq_work size
- more robust -- ->func won't be invalidated with mem{cpy,set}(req)
- helps other work

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-08 13:47:37 -06:00
Pavel Begunkov
d4c81f3852 io_uring: don't arm a timeout through work.func
Remove io_link_work_cb() -- the last custom work.func.
Not the prettiest thing, but works. Instead of queueing a linked timeout
in io_link_work_cb() mark a request with REQ_F_QUEUE_TIMEOUT and do
enqueueing based on the flag in io_wq_submit_work().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-08 13:47:37 -06:00
Pavel Begunkov
ac45abc0e2 io_uring: remove custom ->func handlers
In preparation of getting rid of work.func, this removes almost all
custom instances of it, leaving only io_wq_submit_work() and
io_link_work_cb(). And the last one will be dealt later.

Nothing fancy, just routinely remove *_finish() function and inline
what's left. E.g. remove io_fsync_finish() + inline __io_fsync() into
io_fsync().

As no users of io_req_cancelled() are left, delete it as well. The patch
adds extra switch lookup on cold-ish path, but that's overweighted by
nice diffstat and other benefits of the following patches.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-08 13:47:37 -06:00
Pavel Begunkov
3af73b286c io_uring: don't derive close state from ->func
Relying on having a specific work.func is dangerous, even if an opcode
handler set it itself. E.g. io_wq_assign_next() can modify it.

io_close() sets a custom work.func to indicate that
__close_fd_get_file() was already called. Fortunately, there is no bugs
with io_wq_assign_next() and close yet.

Still, do it safe and always be prepared to be called through
io_wq_submit_work(). Zero req->close.put_file in prep, and call
__close_fd_get_file() IFF it's NULL.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-08 13:47:37 -06:00
Linus Torvalds
ca687877e0 Changes in gfs2:
- An iopen glock locking scheme rework that speeds up deletes of
   inodes accessed from multiple nodes.
 - Various bug fixes and debugging improvements.
 - Convert gfs2-glocks.txt to ReST.
 -----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAl7eYjMUHGFncnVlbmJh
 QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTr/9g//cJ6jgiD/+qzh0VzougVksVZIduAl
 RMB+kldOjBS2ORbyYM87Jm1tdyakgZuFO91XlSwChWRdC3Y2mqdaJIEE/kATqfY9
 7Frlw++SyFKLvIf04kDYGk2hXX+umXXYFfrIiKb0tzDSGkPRaARUb3RM4TRvlSDP
 /U0JlYA/4aXMUge+VpYsbpSGeqHNfEzmcmCyPXGmZYyh1MZ/RocMZFYEsP9NH82J
 l07fxowUd10LJPEmBajzjD2NmEvjdvF4gBCOfJVNIfOzCj0CwXL3vmtu1SUNOKr+
 em266EWZF89eOcvfdtE6xF0w81oCAK43wYRjIODSgI9JCLXGiOYmlWZVwZoqu5iy
 2GQDhj/taq3ObuVqjR5n6GYuqMoJ+D0LSD13qccDALq/Bdy4lq9TMLSdDbkrVIM/
 8BVn0nI5MzUlTV3mq6uxhU0HqtYDwUEiHWURWw6bYRug5OvQy3nbg/XZptYlnD87
 XQccE4ErjlgSHLiYx1YckFz/GG6ytrRAKl9bGMkZ0u2+XmDsH+iJJgLcaXCPUP9h
 /hhYagKI55UBDer7we4tppbu+gnJrg3PXlgImf53iMc7ia0KQHd+TfSIFkGPuydI
 aEKKhIQzje23JayMbPRnwPbNlw9zU1loPi7hPS3rCpDY+w8oawpFyIieEOcxJyEt
 pYkOt4BQi9LvpGc=
 =PsCY
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-for-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 updates from Andreas Gruenbacher:

 - An iopen glock locking scheme rework that speeds up deletes of inodes
   accessed from multiple nodes

 - Various bug fixes and debugging improvements

 - Convert gfs2-glocks.txt to ReST

* tag 'gfs2-for-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: fix use-after-free on transaction ail lists
  gfs2: new slab for transactions
  gfs2: initialize transaction tr_ailX_lists earlier
  gfs2: Smarter iopen glock waiting
  gfs2: Wake up when setting GLF_DEMOTE
  gfs2: Check inode generation number in delete_work_func
  gfs2: Move inode generation number check into gfs2_inode_lookup
  gfs2: Minor gfs2_lookup_by_inum cleanup
  gfs2: Try harder to delete inodes locally
  gfs2: Give up the iopen glock on contention
  gfs2: Turn gl_delete into a delayed work
  gfs2: Keep track of deleted inode generations in LVBs
  gfs2: Allow ASPACE glocks to also have an lvb
  gfs2: instrumentation wrt log_flush stuck
  gfs2: introduce new gfs2_glock_assert_withdraw
  gfs2: print mapping->nrpages in glock dump for address space glocks
  gfs2: Only do glock put in gfs2_create_inode for free inodes
  gfs2: Allow lock_nolock mount to specify jid=X
  gfs2: Don't ignore inode write errors during inode_go_sync
  docs: filesystems: convert gfs2-glocks.txt to ReST
2020-06-08 12:47:09 -07:00
Linus Torvalds
20b0d06722 Merge branch 'akpm' (patches from Andrew)
Merge still more updates from Andrew Morton:
 "Various trees. Mainly those parts of MM whose linux-next dependents
  are now merged. I'm still sitting on ~160 patches which await merges
  from -next.

  Subsystems affected by this patch series: mm/proc, ipc, dynamic-debug,
  panic, lib, sysctl, mm/gup, mm/pagemap"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (52 commits)
  doc: cgroup: update note about conditions when oom killer is invoked
  module: move the set_fs hack for flush_icache_range to m68k
  nommu: use flush_icache_user_range in brk and mmap
  binfmt_flat: use flush_icache_user_range
  exec: use flush_icache_user_range in read_code
  exec: only build read_code when needed
  m68k: implement flush_icache_user_range
  arm: rename flush_cache_user_range to flush_icache_user_range
  xtensa: implement flush_icache_user_range
  sh: implement flush_icache_user_range
  asm-generic: add a flush_icache_user_range stub
  mm: rename flush_icache_user_range to flush_icache_user_page
  arm,sparc,unicore32: remove flush_icache_user_range
  riscv: use asm-generic/cacheflush.h
  powerpc: use asm-generic/cacheflush.h
  openrisc: use asm-generic/cacheflush.h
  m68knommu: use asm-generic/cacheflush.h
  microblaze: use asm-generic/cacheflush.h
  ia64: use asm-generic/cacheflush.h
  hexagon: use asm-generic/cacheflush.h
  ...
2020-06-08 11:11:38 -07:00
Christoph Hellwig
79ef1e1fff binfmt_flat: use flush_icache_user_range
load_flat_file works on user addresses.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Greg Ungerer <gerg@linux-m68k.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Link: http://lkml.kernel.org/r/20200515143646.3857579-28-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-08 11:05:58 -07:00
Christoph Hellwig
bce2b68b89 exec: use flush_icache_user_range in read_code
read_code operates on user addresses.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Link: http://lkml.kernel.org/r/20200515143646.3857579-27-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-08 11:05:58 -07:00
Christoph Hellwig
48304f7994 exec: only build read_code when needed
Only build read_code when binary formats that use it are built into the
kernel.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Link: http://lkml.kernel.org/r/20200515143646.3857579-26-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-08 11:05:58 -07:00
Guilherme G. Piccoli
f117955a22 kernel/watchdog.c: convert {soft/hard}lockup boot parameters to sysctl aliases
After a recent change introduced by Vlastimil's series [0], kernel is
able now to handle sysctl parameters on kernel command line; also, the
series introduced a simple infrastructure to convert legacy boot
parameters (that duplicate sysctls) into sysctl aliases.

This patch converts the watchdog parameters softlockup_panic and
{hard,soft}lockup_all_cpu_backtrace to use the new alias infrastructure.
It fixes the documentation too, since the alias only accepts values 0 or
1, not the full range of integers.

We also took the opportunity here to improve the documentation of the
previously converted hung_task_panic (see the patch series [0]) and put
the alias table in alphabetical order.

[0] http://lkml.kernel.org/r/20200427180433.7029-1-vbabka@suse.cz

Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Link: http://lkml.kernel.org/r/20200507214624.21911-1-gpiccoli@canonical.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-08 11:05:56 -07:00
Vlastimil Babka
b467f3ef3c kernel/hung_task convert hung_task_panic boot parameter to sysctl
We can now handle sysctl parameters on kernel command line and have
infrastructure to convert legacy command line options that duplicate
sysctl to become a sysctl alias.

This patch converts the hung_task_panic parameter.  Note that the sysctl
handler is more strict and allows only 0 and 1, while the legacy
parameter allowed any non-zero value.  But there is little reason anyone
would not be using 1.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Guilherme G . Piccoli" <gpiccoli@canonical.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200427180433.7029-4-vbabka@suse.cz
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-08 11:05:56 -07:00
Vlastimil Babka
0a477e1ae2 kernel/sysctl: support handling command line aliases
We can now handle sysctl parameters on kernel command line, but
historically some parameters introduced their own command line
equivalent, which we don't want to remove for compatibility reasons.

We can, however, convert them to the generic infrastructure with a table
translating the legacy command line parameters to their sysctl names,
and removing the one-off param handlers.

This patch adds the support and makes the first conversion to
demonstrate it, on the (deprecated) numa_zonelist_order parameter.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Guilherme G . Piccoli" <gpiccoli@canonical.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200427180433.7029-3-vbabka@suse.cz
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-08 11:05:56 -07:00
Vlastimil Babka
3db978d480 kernel/sysctl: support setting sysctl parameters from kernel command line
Patch series "support setting sysctl parameters from kernel command line", v3.

This series adds support for something that seems like many people
always wanted but nobody added it yet, so here's the ability to set
sysctl parameters via kernel command line options in the form of
sysctl.vm.something=1

The important part is Patch 1.  The second, not so important part is an
attempt to clean up legacy one-off parameters that do the same thing as
a sysctl.  I don't want to remove them completely for compatibility
reasons, but with generic sysctl support the idea is to remove the
one-off param handlers and treat the parameters as aliases for the
sysctl variants.

I have identified several parameters that mention sysctl counterparts in
Documentation/admin-guide/kernel-parameters.txt but there might be more.
The conversion also has varying level of success:

 - numa_zonelist_order is converted in Patch 2 together with adding the
   necessary infrastructure. It's easy as it doesn't really do anything
   but warn on deprecated value these days.

 - hung_task_panic is converted in Patch 3, but there's a downside that
   now it only accepts 0 and 1, while previously it was any integer
   value

 - nmi_watchdog maps to two sysctls nmi_watchdog and hardlockup_panic,
   so there's no straighforward conversion possible

 - traceoff_on_warning is a flag without value and it would be required
   to handle that somehow in the conversion infractructure, which seems
   pointless for a single flag

This patch (of 5):

A recently proposed patch to add vm_swappiness command line parameter in
addition to existing sysctl [1] made me wonder why we don't have a
general support for passing sysctl parameters via command line.

Googling found only somebody else wondering the same [2], but I haven't
found any prior discussion with reasons why not to do this.

Settings the vm_swappiness issue aside (the underlying issue might be
solved in a different way), quick search of kernel-parameters.txt shows
there are already some that exist as both sysctl and kernel parameter -
hung_task_panic, nmi_watchdog, numa_zonelist_order, traceoff_on_warning.

A general mechanism would remove the need to add more of those one-offs
and might be handy in situations where configuration by e.g.
/etc/sysctl.d/ is impractical.

Hence, this patch adds a new parse_args() pass that looks for parameters
prefixed by 'sysctl.' and tries to interpret them as writes to the
corresponding sys/ files using an temporary in-kernel procfs mount.
This mechanism was suggested by Eric W.  Biederman [3], as it handles
all dynamically registered sysctl tables, even though we don't handle
modular sysctls.  Errors due to e.g.  invalid parameter name or value
are reported in the kernel log.

The processing is hooked right before the init process is loaded, as
some handlers might be more complicated than simple setters and might
need some subsystems to be initialized.  At the moment the init process
can be started and eventually execute a process writing to /proc/sys/
then it should be also fine to do that from the kernel.

Sysctls registered later on module load time are not set by this
mechanism - it's expected that in such scenarios, setting sysctl values
from userspace is practical enough.

[1] https://lore.kernel.org/r/BL0PR02MB560167492CA4094C91589930E9FC0@BL0PR02MB5601.namprd02.prod.outlook.com/
[2] https://unix.stackexchange.com/questions/558802/how-to-set-sysctl-using-kernel-command-line-parameter
[3] https://lore.kernel.org/r/87bloj2skm.fsf@x220.int.ebiederm.org/

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: "Guilherme G . Piccoli" <gpiccoli@canonical.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Link: http://lkml.kernel.org/r/20200427180433.7029-1-vbabka@suse.cz
Link: http://lkml.kernel.org/r/20200427180433.7029-2-vbabka@suse.cz
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-08 11:05:56 -07:00
Linus Torvalds
63d72b93f2 vfs: clean up posix_acl_permission() logic aroudn MAY_NOT_BLOCK
posix_acl_permission() does not care about MAY_NOT_BLOCK, and in fact
the permission logic internally must not check that bit (it's only for
upper layers to decide whether they can block to do IO to look up the
acl information or not).

But the way the code was written, it _looked_ like it cared, since the
function explicitly did not mask that bit off.

But it has exactly two callers: one for when that bit is set, which
first clears the bit before calling posix_acl_permission(), and the
other call site when that bit was clear.

So stop the silly games "saving" the MAY_NOT_BLOCK bit that must not be
used for the actual permission test, and that currently is pointlessly
cleared by the callers when the function itself should just not care.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-08 11:04:19 -07:00
Linus Torvalds
5fc475b749 vfs: do not do group lookup when not necessary
Rasmus Villemoes points out that the 'in_group_p()' tests can be a
noticeable expense, and often completely unnecessary.  A common
situation is that the 'group' bits are the same as the 'other' bits
wrt the permissions we want to test.

So rewrite 'acl_permission_check()' to not bother checking for group
ownership when the permission check doesn't care.

For example, if we're asking for read permissions, and both 'group' and
'other' allow reading, there's really no reason to check if we're part
of the group or not: either way, we'll allow it.

Rasmus says:
 "On a bog-standard Ubuntu 20.04 install, a workload consisting of
  compiling lots of userspace programs (i.e., calling lots of
  short-lived programs that all need to get their shared libs mapped in,
  and the compilers poking around looking for system headers - lots of
  /usr/lib, /usr/bin, /usr/include/ accesses) puts in_group_p around
  0.1% according to perf top.

  System-installed files are almost always 0755 (directories and
  binaries) or 0644, so in most cases, we can avoid the binary search
  and the cost of pulling the cred->groups array and in_group_p() .text
  into the cpu cache"

Reported-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-08 11:04:19 -07:00
Denis Efremov
a8c73c1a61 io_uring: use kvfree() in io_sqe_buffer_register()
Use kvfree() to free the pages and vmas, since they are allocated by
kvmalloc_array() in a loop.

Fixes: d4ef647510 ("io_uring: avoid page allocation warnings")
Signed-off-by: Denis Efremov <efremov@linux.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20200605093203.40087-1-efremov@linux.com
2020-06-08 09:39:13 -06:00
Bijan Mottahedeh
efe68c1ca8 io_uring: validate the full range of provided buffers for access
Account for the number of provided buffers when validating the address
range.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-08 09:39:13 -06:00
youngjun
2068cf7dfb ovl: remove unnecessary lock check
Directory is always locked until "out_unlock" label.  So lock check is not
needed.

Signed-off-by: youngjun <her0gyugyu@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-06-08 09:57:19 +02:00
Linus Torvalds
a2b447066c Tag summary
+ Features
   - Replace zero-length array with flexible-array
   - add a valid state flags check
   - add consistency check between state and dfa diff encode flags
   - add apparmor subdir to proc attr interface
   - fail unpack if profile mode is unknown
   - add outofband transition and use it in xattr match
   - ensure that dfa state tables have entries
 
 + Cleanups
   - Use true and false for bool variable
   - Remove semicolon
   - Clean code by removing redundant instructions
   - Replace two seq_printf() calls by seq_puts() in aa_label_seq_xprint()
   - remove duplicate check of xattrs on profile attachment
   - remove useless aafs_create_symlink
 
 + Bug fixes
   - Fix memory leak of profile proxy
   - fix introspection of of task mode for unconfined tasks
   - fix nnp subset test for unconfined
   - check/put label on apparmor_sk_clone_security()
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE7cSDD705q2rFEEf7BS82cBjVw9gFAl7dUf4ACgkQBS82cBjV
 w9j8rA//R3qbVeiN3SJtxLhiF3AAdP2cVbZ/mAhQLwYObI6flb1bliiahJHRf8Ey
 FaVb4srOH8NlmzNINZehXOvD3UDwX/sbpw8h0Y0JolO+v1m3UXkt/eRoMt6gRz7I
 jtaImY1/V+G4O5rV5fGA1HQI8Geg+W9Abt32d16vyKIIpnBS/Pfv8ppM0NcHCZ4G
 e8935T/dMNK5K0Y7HNb1nMjyzEr0LtEXvXznBOrGVpCtDQ45m0/NBvAqpfhuKsVm
 FE5Na8rgtiB9sU72LaoNXNr8Y5LVgkXPmBr/e1FqZtF01XEarKb7yJDGOLrLpp1o
 rGYpY9DQSBT/ZZrwMaLFqCd1XtnN1BAmhlM6TXfnm25ArEnQ49ReHFc7ZHZRSTZz
 LWVBD6atZbapvqckk1SU49eCLuGs5wmRj/CmwdoQUbZ+aOfR68zF+0PANbP5xDo4
 862MmeMsm8JHndeCelpZQRbhtXt0t9MDzwMBevKhxV9hbpt4g8DcnC5tNUc9AnJi
 qJDsMkytYhazIW+/4MsnLTo9wzhqzXq5kBeE++Xl7vDE/V+d5ocvQg73xtwQo9sx
 LzMlh3cPmBvOnlpYfnONZP8pJdjDAuESsi/H5+RKQL3cLz7NX31CLWR8dXLBHy80
 Dvxqvy84Cf7buigqwSzgAGKjDI5HmeOECAMjpLbEB2NS9xxQYuk=
 =U7d2
 -----END PGP SIGNATURE-----

Merge tag 'apparmor-pr-2020-06-07' of git://git.kernel.org/pub/scm/linux/kernel/git/jj/linux-apparmor

Pull apparmor updates from John Johansen:
 "Features:
   - Replace zero-length array with flexible-array
   - add a valid state flags check
   - add consistency check between state and dfa diff encode flags
   - add apparmor subdir to proc attr interface
   - fail unpack if profile mode is unknown
   - add outofband transition and use it in xattr match
   - ensure that dfa state tables have entries

  Cleanups:
   - Use true and false for bool variable
   - Remove semicolon
   - Clean code by removing redundant instructions
   - Replace two seq_printf() calls by seq_puts() in aa_label_seq_xprint()
   - remove duplicate check of xattrs on profile attachment
   - remove useless aafs_create_symlink

  Bug fixes:
   - Fix memory leak of profile proxy
   - fix introspection of of task mode for unconfined tasks
   - fix nnp subset test for unconfined
   - check/put label on apparmor_sk_clone_security()"

* tag 'apparmor-pr-2020-06-07' of git://git.kernel.org/pub/scm/linux/kernel/git/jj/linux-apparmor:
  apparmor: Fix memory leak of profile proxy
  apparmor: fix introspection of of task mode for unconfined tasks
  apparmor: check/put label on apparmor_sk_clone_security()
  apparmor: Use true and false for bool variable
  security/apparmor/label.c: Clean code by removing redundant instructions
  apparmor: Replace zero-length array with flexible-array
  apparmor: ensure that dfa state tables have entries
  apparmor: remove duplicate check of xattrs on profile attachment.
  apparmor: add outofband transition and use it in xattr match
  apparmor: fail unpack if profile mode is unknown
  apparmor: fix nnp subset test for unconfined
  apparmor: remove useless aafs_create_symlink
  apparmor: add proc subdir to attrs
  apparmor: add consistency check between state and dfa diff encode flags
  apparmor: add a valid state flags check
  AppArmor: Remove semicolon
  apparmor: Replace two seq_printf() calls by seq_puts() in aa_label_seq_xprint()
2020-06-07 16:04:49 -07:00
Linus Torvalds
f558b8364e Driver core patches for 5.8-rc1
Here is the set of driver core patches for 5.8-rc1.
 
 Not all that huge this release, just a number of small fixes and
 updates:
 	- software node fixes
 	- kobject now sends KOBJ_REMOVE when it is removed from sysfs,
 	  not when it is removed from memory (which could come much
 	  later)
 	- device link additions and fixes based on testing on more
 	  devices
 	- firmware core cleanups
 	- other minor changes, full details in the shortlog
 
 All have been in linux-next for a while with no reported issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXtzmXg8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ymaAQCfZZ9prH3AMLF7DIkG3vMw0njLXt0An2FxrKYU
 wetHRG4KL9vTkdz7+TqU
 =t5LE
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here is the set of driver core patches for 5.8-rc1.

  Not all that huge this release, just a number of small fixes and
  updates:

   - software node fixes

   - kobject now sends KOBJ_REMOVE when it is removed from sysfs, not
     when it is removed from memory (which could come much later)

   - device link additions and fixes based on testing on more devices

   - firmware core cleanups

   - other minor changes, full details in the shortlog

  All have been in linux-next for a while with no reported issues"

* tag 'driver-core-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (23 commits)
  driver core: Update device link status correctly for SYNC_STATE_ONLY links
  firmware_loader: change enum fw_opt to u32
  software node: implement software_node_unregister()
  kobject: send KOBJ_REMOVE uevent when the object is removed from sysfs
  driver core: Remove unnecessary is_fwnode_dev variable in device_add()
  drivers property: When no children in primary, try secondary
  driver core: platform: Fix spelling errors in platform.c
  driver core: Remove check in driver_deferred_probe_force_trigger()
  of: platform: Batch fwnode parsing when adding all top level devices
  driver core: fw_devlink: Add support for batching fwnode parsing
  driver core: Look for waiting consumers only for a fwnode's primary device
  driver core: Move code to the right part of the file
  Revert "Revert "driver core: Set fw_devlink to "permissive" behavior by default""
  drivers: base: Fix NULL pointer exception in __platform_driver_probe() if a driver developer is foolish
  firmware_loader: move fw_fallback_config to a private kernel symbol namespace
  driver core: Add missing '\n' in log messages
  driver/base/soc: Use kobj_to_dev() API
  Add documentation on meaning of -EPROBE_DEFER
  driver core: platform: remove redundant assignment to variable ret
  debugfs: Use the correct style for SPDX License Identifier
  ...
2020-06-07 10:53:36 -07:00
Linus Torvalds
3b69e8b457 Fix for arch/sh build regression with newer binutils, removal of SH5,
fixes for module exports, and misc cleanup.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQEcBAABAgAGBQJe28l0AAoJELcQ+SIFb8HaKFMH/0T7tHfWit4+efmeDLhfrewd
 Fq9lLnEGmLy82AZqmd730gvD2ckbjUCm0ikKC79sCd14r3bIB1RCDKfXbY6rB3uI
 EDijbkzsjfOYG9ZAiDYTIbyrM2u2/1PzFiYTxHVDtPLbCPGfacbcfrDL+u143IXP
 ez/RHGLE6uYDvKi0Y0/VDKgMCW9bNlcEkL2/tKFVg2cipDi2Lfmi3Jss/id+5uOI
 N8XeZoyHjyWr7GeRZwN/hNPLDvLY//Uf5q6RB9VrTsN4Vrja7kjWMZkgsGkmGbNo
 f6BbLenq+KMfOSJrIzS3MgTRinoqRF5S518pkbGtgRQn0rZKfd6h85DG15RlPGk=
 =Ktnp
 -----END PGP SIGNATURE-----

Merge tag 'sh-for-5.8' of git://git.libc.org/linux-sh

Pull arch/sh updates from Rich Felker:
 "Fix for arch/sh build regression with newer binutils, removal of SH5,
  fixes for module exports, and misc cleanup"

* tag 'sh-for-5.8' of git://git.libc.org/linux-sh:
  sh: remove sh5 support
  sh: add missing EXPORT_SYMBOL() for __delay
  sh: Convert ins[bwl]/outs[bwl] macros to inline functions
  sh: Convert iounmap() macros to inline functions
  sh: Add missing DECLARE_EXPORT() for __ashiftrt_r4_xx
  sh: configs: Cleanup old Kconfig IO scheduler options
  arch/sh: vmlinux.scr
  sh: Replace CONFIG_MTD_M25P80 with CONFIG_MTD_SPI_NOR in sh7757lcr_defconfig
  sh: sh4a: Bring back tmu3_device early device
2020-06-06 15:22:01 -07:00
Zou Wei
9fa88c5d3f hpfs: fix warning due to superfluous semicolon
Fixes coccicheck warning:

  fs/hpfs/buffer.c:56:2-3: Unneeded semicolon

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zou Wei <zou_wei@huawei.com>
Signed-off-by: Mikulas Patocka <mikulas@twibright.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-06 10:08:17 -07:00
Steve French
5865985416 smb3: extend fscache mount volume coherency check
It is better to check volume id and creation time, not just
the root inode number to verify if the volume has changed
when remounting.

Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-06-06 11:16:25 -05:00
Linus Torvalds
aaa2faab4e orangefs: a conversion and a cleanup...
Conversion: John Hubbard's conversion from get_user_pages() to pin_user_pages()
 
 cleanup: Colin Ian King's removal of an unneeded variable initialization.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEIGSFVdO6eop9nER2z0QOqevODb4FAl7ao4AACgkQz0QOqevO
 Db4MqRAAjb+nRMDA6aYfCY1Veh9BUnVeVPAqWdlSPbo7TdDfhCBJKgwZ8Y3GmOnR
 vYnVn6Yz/uhFwoZWmSd0AXgB55kGKyXNcHlyPO4FcaWGDNd/Dn8WWc7/lsqHnDFX
 cg0Ioy7VeYS+Y+iW3ZkEjkeGyVFProl/OsJrf9vfJiuyZrLp4th/ctlbV/sIA2R6
 XNk0ld21gEB5YbrTCQebtyhdJLp9+hhI0BB2Lxm/JUZyyK4J2rN8H+SF/5JE8wEj
 SJD7K5kukxED2Kh3pU1fvRVr0VvHZjLHQav6TgW6GPokmb6EWZwvIYfUa8go50Jz
 5fkpyRc8d3zibxDSdL6/Gr6mZQxceFnYvfPs27Vq3O08J/dhWHX0LlKctD4pEHNR
 NUsF8Piko+16JwPg+EeXTIsYrMiW8g5FhoTCl7FILQ06F2P6NDCyOUyHiLOU4KaN
 +bp6YmIyJBfIBgXz58Pqq2JwibkN6zYiRrb5jDDWv2Nw9ykpjNykrAXH6Fbz8JKV
 dPbKkxzx5HQuHBaxYCRKV4r3UNBYKAECrWcLQglG8D+EzvYrBxFZQoPu3qapJPJG
 8fQKCKqU8hM6BGYpI76FIK8o/BrOdYr0ttqDsambDWt/uRXRKXwvKJVz3wOB8Q7n
 om+XlIk4UE/7vgpTru/wzVX0pmq8WWskbDOKkGqgeUC0BLuZcJw=
 =3A+d
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-5.8-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux

Pull orangefs updates from Mike Marshall:

 - John Hubbard's conversion from get_user_pages() to pin_user_pages()

 - Colin Ian King's removal of an unneeded variable initialization

* tag 'for-linus-5.8-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
  orangefs: convert get_user_pages() --> pin_user_pages()
  orangefs: remove redundant assignment to variable ret
2020-06-05 16:44:36 -07:00
Linus Torvalds
e3cea0cad1 dlm for 5.8
This set includes a couple minor cleanups, and dropping the
 interruptible from a wait_event that waits for an event from
 the userspace cluster management.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJe2op+AAoJEDgbc8f8gGmqJKkP/jE0BCnE7VbpID0sDaBLpntk
 +v53jxR/bpw6g+LwlJGBDLCRrcn8MnhC72W1OPZyFJ4AiuN5BHdvTVVSPSCBSHnU
 WfJKqbVSRt+PRaATJd8iCLdoMj858vLWFo380XVMTRsAaxYJ4dwRUMBUOx/pL8Dy
 pcDXDgWTC06nuRmSY8eYJYJWX3c6SdpHHbg4IA127Y9oTjIAvYyeFC4ohRtRJqrj
 bQlpjnfJ9bNwd5N15turGiR5IzY7UXI9wa/qylngr5B5gVxJttFpaE4OFF80HOkY
 Xxqqqhi9dIcyAAzftJaRQpyAq6kRmFgFAjH5Rvx+7n0BEMe+AAhQC9Coei1Zysh1
 fssYrxRC3fqX1kg2alhpdBdCugFq7IUdFuwH044M+w8jcsoggGq8vGRMJEuKgnnm
 kLLBEi5HtAv0S9rAE9Acnqanc6lRvzJIc0vysG/0LQqzWr+F7HtRnS7kkOkO2+e1
 014k20FmooJiu5XGrCz+dvmnIACpr6Z+S119uXntYGDDA+YGcJQTMs6pzLh5UTqB
 40aIZPPRRj6K/Vvx2VJgEbzwmwWjjdtsYkcJEq+QPPsAkPa8EnbqICCyJDkUuY5K
 kltYHQ2IOa466KC8JHBS9jdVnHftk0jIl75eiP7vK+ZUEW+iwRgEXY11oiWeY23J
 EKfPgi0oikfmlj2Y3jme
 =VTRw
 -----END PGP SIGNATURE-----

Merge tag 'dlm-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm

Pull dlm updates from David Teigland:
 "This set includes a couple minor cleanups, and dropping the
  interruptible from a wait_event that waits for an event from the
  userspace cluster management"

* tag 'dlm-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
  dlm: remove BUG() before panic()
  dlm: Switch to using wait_event()
  fs:dlm:remove unneeded semicolon in rcom.c
  dlm: user: Replace zero-length array with flexible-array member
  dlm: dlm_internal: Replace zero-length array with flexible-array member
2020-06-05 16:43:16 -07:00
Linus Torvalds
3803d5e4d3 22 changesets, 2 for stable. Includes big performance improvement for large i/o when using multichannel, also includes DFS fixes
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl7aelsACgkQiiy9cAdy
 T1HDmwv9Fj6OaXXx+btNvbB6xTWvCwMVKHwTPURMx+IjBYjJC65yPGkInPPkfUVo
 7L9h55XCLwFohECleZJCkKOrJtnX1P8SsHtZck6QqjvUETJl/L3pAXpMMYACHLpg
 x4DE/NFkcW95J38s9Jtjhphq8ZGUhuDhaT+QeEd2Iq8HzAxk5ND47ZXkomMx1EEM
 ZsOrmJF+k2YQyDDpfhJeVF5iZDkbpASqA/TlLxxGH34IdAZIUB9qtGKADNLZ6YyT
 qpG601CSrEdl3tVY+SlRMHqwVTRhCViPD6Q3fMw8Xha436RIiWJJ4Rvn6bSP/ZQl
 PDPuSVRB2zmd70C/3ojXdku9+VfQLO52qkO3bf2IjgVJ3ARrxFxW7cb7bmYRqdyT
 WI5N1+8gETrIAK7aB3QKdmkcRFDtJD3wOTfBcgctuB8WrYrDvW2MNKkPbQdY5tnN
 xfp4f10Dg4d+/8knSytJrdKkDublU0kGbfLAa0oupjAzV6WB0qNt0TNGxE42L+ug
 iFSJOZxi
 =gCcP
 -----END PGP SIGNATURE-----

Merge tag '5.8-rc-smb3-fixes-part-1' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs updates from Steve French:
 "22 changesets, 2 for stable.

  Includes big performance improvement for large i/o when using
  multichannel, also includes DFS fixes"

* tag '5.8-rc-smb3-fixes-part-1' of git://git.samba.org/sfrench/cifs-2.6: (22 commits)
  cifs: update internal module version number
  cifs: multichannel: try to rebind when reconnecting a channel
  cifs: multichannel: use pointer for binding channel
  smb3: remove static checker warning
  cifs: multichannel: move channel selection above transport layer
  cifs: multichannel: always zero struct cifs_io_parms
  cifs: dump Security Type info in DebugData
  smb3: fix incorrect number of credits when ioctl MaxOutputResponse > 64K
  smb3: default to minimum of two channels when multichannel specified
  cifs: multichannel: move channel selection in function
  cifs: fix minor typos in comments and log messages
  smb3: minor update to compression header definitions
  cifs: minor fix to two debug messages
  cifs: Standardize logging output
  smb3: Add new parm "nodelete"
  cifs: move some variables off the stack in smb2_ioctl_query_info
  cifs: reduce stack use in smb2_compound_op
  cifs: get rid of unused parameter in reconn_setup_dfs_targets()
  cifs: handle hostnames that resolve to same ip in failover
  cifs: set up next DFS target before generic_ip_connect()
  ...
2020-06-05 16:40:53 -07:00
Linus Torvalds
9daa0a27a0 AFS Changes
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl7ZC5kACgkQ+7dXa6fL
 C2uv9A/+NKlTSXyv2ZuvtmXADelndcXJ+nC+3bwI7Jh43aa8uCCsAVYD0VE+dxor
 Ingj/LUJ2sjjp6RXCeeqqETXCoCVt0zK2g216+An7k84KJ+ms+MDa8dNN7l6280S
 1jw4hnT0+g9Ln6elgqBroV980MJC2NGL0Eaete8zFO8UqYZy5w1ge0HfGck2l45U
 2lr6egCWYSUPmtFKXJnLV8luwRvq7DzvTk9WrJu3kwOjaY1AQP1+1VpdhChJLrRc
 /4Ddy1On5IXiFrPi5OtHA422bfirUpIv2HbmI047W9uiZ05MiXwSvNS1qJLTa1AA
 T/SK88d3FCeSYw3olAne2kEl9uewvGByr98fDKFOcDHZj18abd9/VtUp33RXxYBy
 lN2wqlWP++LlZ4sMCbbvLXX8OB1tekQzWQC0vJ5rhRSgveOlhL9TLG2Y05xokFs+
 AwK8zTlDIZ6Pa/JIHfp2E0ZhXEazWTSmP+d7NkgzF0iiORukvsmxjOVUZC4+UCqK
 rYN6goJ5g8qpejRv5NhfP6/olb1NK33f/F2QSSFfxv9zda4HNlayvcoSnFrdUEnt
 IfBhSKPkeDVWs1yse7glDuw19tHp94B9UYwJ46qfHngQPArgy+gp23d0cSy41Pr5
 FRQ23eNvBWIP4srt1gSCBexSGA1h/ACji41CPTJbF2jg5uWFAUE=
 =YVwD
 -----END PGP SIGNATURE-----

Merge tag 'afs-next-20200604' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull AFS updates from David Howells:
 "There's some core VFS changes which affect a couple of filesystems:

   - Make the inode hash table RCU safe and providing some RCU-safe
     accessor functions. The search can then be done without taking the
     inode_hash_lock. Care must be taken because the object may be being
     deleted and no wait is made.

   - Allow iunique() to avoid taking the inode_hash_lock.

   - Allow AFS's callback processing to avoid taking the inode_hash_lock
     when using the inode table to find an inode to notify.

   - Improve Ext4's time updating. Konstantin Khlebnikov said "For now,
     I've plugged this issue with try-lock in ext4 lazy time update.
     This solution is much better."

  Then there's a set of changes to make a number of improvements to the
  AFS driver:

   - Improve callback (ie. third party change notification) processing
     by:

      (a) Relying more on the fact we're doing this under RCU and by
          using fewer locks. This makes use of the RCU-based inode
          searching outlined above.

      (b) Moving to keeping volumes in a tree indexed by volume ID
          rather than a flat list.

      (c) Making the server and volume records logically part of the
          cell. This means that a server record now points directly at
          the cell and the tree of volumes is there. This removes an N:M
          mapping table, simplifying things.

   - Improve keeping NAT or firewall channels open for the server
     callbacks to reach the client by actively polling the fileserver on
     a timed basis, instead of only doing it when we have an operation
     to process.

   - Improving detection of delayed or lost callbacks by including the
     parent directory in the list of file IDs to be queried when doing a
     bulk status fetch from lookup. We can then check to see if our copy
     of the directory has changed under us without us getting notified.

   - Determine aliasing of cells (such as a cell that is pointed to be a
     DNS alias). This allows us to avoid having ambiguity due to
     apparently different cells using the same volume and file servers.

   - Improve the fileserver rotation to do more probing when it detects
     that all of the addresses to a server are listed as non-responsive.
     It's possible that an address that previously stopped responding
     has become responsive again.

  Beyond that, lay some foundations for making some calls asynchronous:

   - Turn the fileserver cursor struct into a general operation struct
     and hang the parameters off of that rather than keeping them in
     local variables and hang results off of that rather than the call
     struct.

   - Implement some general operation handling code and simplify the
     callers of operations that affect a volume or a volume component
     (such as a file). Most of the operation is now done by core code.

   - Operations are supplied with a table of operations to issue
     different variants of RPCs and to manage the completion, where all
     the required data is held in the operation object, thereby allowing
     these to be called from a workqueue.

   - Put the standard "if (begin), while(select), call op, end" sequence
     into a canned function that just emulates the current behaviour for
     now.

  There are also some fixes interspersed:

   - Don't let the EACCES from ICMP6 mapping reach the user as such,
     since it's confusing as to whether it's a filesystem error. Convert
     it to EHOSTUNREACH.

   - Don't use the epoch value acquired through probing a server. If we
     have two servers with the same UUID but in different cells, it's
     hard to draw conclusions from them having different epoch values.

   - Don't interpret the argument to the CB.ProbeUuid RPC as a
     fileserver UUID and look up a fileserver from it.

   - Deal with servers in different cells having the same UUIDs. In the
     event that a CB.InitCallBackState3 RPC is received, we have to
     break the callback promises for every server record matching that
     UUID.

   - Don't let afs_statfs return values that go below 0.

   - Don't use running fileserver probe state to make server selection
     and address selection decisions on. Only make decisions on final
     state as the running state is cleared at the start of probing"

Acked-by: Al Viro <viro@zeniv.linux.org.uk> (fs/inode.c part)

* tag 'afs-next-20200604' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: (27 commits)
  afs: Adjust the fileserver rotation algorithm to reprobe/retry more quickly
  afs: Show more a bit more server state in /proc/net/afs/servers
  afs: Don't use probe running state to make decisions outside probe code
  afs: Fix afs_statfs() to not let the values go below zero
  afs: Fix the by-UUID server tree to allow servers with the same UUID
  afs: Reorganise volume and server trees to be rooted on the cell
  afs: Add a tracepoint to track the lifetime of the afs_volume struct
  afs: Detect cell aliases 3 - YFS Cells with a canonical cell name op
  afs: Detect cell aliases 2 - Cells with no root volumes
  afs: Detect cell aliases 1 - Cells with root volumes
  afs: Implement client support for the YFSVL.GetCellName RPC op
  afs: Retain more of the VLDB record for alias detection
  afs: Fix handling of CB.ProbeUuid cache manager op
  afs: Don't get epoch from a server because it may be ambiguous
  afs: Build an abstraction around an "operation" concept
  afs: Rename struct afs_fs_cursor to afs_operation
  afs: Remove the error argument from afs_protocol_error()
  afs: Set error flag rather than return error from file status decode
  afs: Make callback processing more efficient.
  afs: Show more information in /proc/net/afs/servers
  ...
2020-06-05 16:26:36 -07:00
Linus Torvalds
0b166a57e6 A lot of bug fixes and cleanups for ext4, including:
* Fix performance problems found in dioread_nolock now that it is the
   default, caused by transaction leaks.
 * Clean up fiemap handling in ext4
 * Clean up and refactor multiple block allocator (mballoc) code
 * Fix a problem with mballoc with a smaller file systems running out
   of blocks because they couldn't properly use blocks that had been
   reserved by inode preallocation.
 * Fixed a race in ext4_sync_parent() versus rename()
 * Simplify the error handling in the extent manipulation code
 * Make sure all metadata I/O errors are felected to ext4_ext_dirty()'s and
   ext4_make_inode_dirty()'s callers.
 * Avoid passing an error pointer to brelse in ext4_xattr_set()
 * Fix race which could result to freeing an inode on the dirty last
   in data=journal mode.
 * Fix refcount handling if ext4_iget() fails
 * Fix a crash in generic/019 caused by a corrupted extent node
 -----BEGIN PGP SIGNATURE-----
 
 iQEyBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAl7Ze8kACgkQ8vlZVpUN
 gaNChAf4xn0ytFSrweI/S2Sp05G/2L/ocZ2TZZk2ZdGeN1E+ABdSIv/zIF9zuFgZ
 /pY/C+fyEZWt4E3FlNO8gJzoEedkzMCMnUhSIfI+wZbcclyTOSNMJtnrnJKAEtVH
 HOvGZJmg357jy407RCGhZpJ773nwU2xhBTr5OFxvSf9mt/vzebxIOnw5D7HPlC1V
 Fgm6Du8q+tRrPsyjv1Yu4pUEVXMJ7qUcvt326AXVM3kCZO1Aa5GrURX0w3J4mzW1
 tc1tKmtbLcVVYTo9CwHXhk/edbxrhAydSP2iACand3tK6IJuI6j9x+bBJnxXitnr
 vsxsfTYMG18+2SxrJ9LwmagqmrRq
 =HMTs
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "A lot of bug fixes and cleanups for ext4, including:

   - Fix performance problems found in dioread_nolock now that it is the
     default, caused by transaction leaks.

   - Clean up fiemap handling in ext4

   - Clean up and refactor multiple block allocator (mballoc) code

   - Fix a problem with mballoc with a smaller file systems running out
     of blocks because they couldn't properly use blocks that had been
     reserved by inode preallocation.

   - Fixed a race in ext4_sync_parent() versus rename()

   - Simplify the error handling in the extent manipulation code

   - Make sure all metadata I/O errors are felected to
     ext4_ext_dirty()'s and ext4_make_inode_dirty()'s callers.

   - Avoid passing an error pointer to brelse in ext4_xattr_set()

   - Fix race which could result to freeing an inode on the dirty last
     in data=journal mode.

   - Fix refcount handling if ext4_iget() fails

   - Fix a crash in generic/019 caused by a corrupted extent node"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (58 commits)
  ext4: avoid unnecessary transaction starts during writeback
  ext4: don't block for O_DIRECT if IOCB_NOWAIT is set
  ext4: remove the access_ok() check in ext4_ioctl_get_es_cache
  fs: remove the access_ok() check in ioctl_fiemap
  fs: handle FIEMAP_FLAG_SYNC in fiemap_prep
  fs: move fiemap range validation into the file systems instances
  iomap: fix the iomap_fiemap prototype
  fs: move the fiemap definitions out of fs.h
  fs: mark __generic_block_fiemap static
  ext4: remove the call to fiemap_check_flags in ext4_fiemap
  ext4: split _ext4_fiemap
  ext4: fix fiemap size checks for bitmap files
  ext4: fix EXT4_MAX_LOGICAL_BLOCK macro
  add comment for ext4_dir_entry_2 file_type member
  jbd2: avoid leaking transaction credits when unreserving handle
  ext4: drop ext4_journal_free_reserved()
  ext4: mballoc: use lock for checking free blocks while retrying
  ext4: mballoc: refactor ext4_mb_good_group()
  ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling
  ext4: mballoc: refactor ext4_mb_discard_preallocations()
  ...
2020-06-05 16:19:28 -07:00
Linus Torvalds
242b233198 RDMA 5.8 merge window pull request
A few large, long discussed works this time. The RNBD block driver has
 been posted for nearly two years now, and the removal of FMR has been a
 recurring discussion theme for a long time. The usual smattering of
 features and bug fixes.
 
 - Various small driver bugs fixes in rxe, mlx5, hfi1, and efa
 
 - Continuing driver cleanups in bnxt_re, hns
 
 - Big cleanup of mlx5 QP creation flows
 
 - More consistent use of src port and flow label when LAG is used and a
   mlx5 implementation
 
 - Additional set of cleanups for IB CM
 
 - 'RNBD' network block driver and target. This is a network block RDMA
   device specific to ionos's cloud environment. It brings strong multipath
   and resiliency capabilities.
 
 - Accelerated IPoIB for HFI1
 
 - QP/WQ/SRQ ioctl migration for uverbs, and support for multiple async fds
 
 - Support for exchanging the new IBTA defiend ECE data during RDMA CM
   exchanges
 
 - Removal of the very old and insecure FMR interface from all ULPs and
   drivers. FRWR should be preferred for at least a decade now.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAl7X/IwACgkQOG33FX4g
 mxp2uw/+MI2S/aXqEBvZfTT8yrkAwqYezS0VeTDnwH/T6UlTMDhHVN/2Ji3tbbX3
 FEKT1i2mnAL5RqUAL1lr9g4sG/bVozrpN46Ws5Lu9dTbIPLKTNPWDuLFQDUShKY7
 OyMI/bRx6anGnsOy20iiBqnrQbrrZj5TECgnmrkAl62QFdcl7aBWe/yYjy4CT11N
 ub+aBXBREN1F1pc0HIjd2tI+8gnZc+mNm1LVVDRH9Capun/pI26qDNh7e6QwGyIo
 n8ItraC8znLwv/nsUoTE7/JRcsTEe6vJI26PQmczZfNJs/4O65G7fZg0eSBseZYi
 qKf7Uwtb3qW0R7jRUMEgFY4DKXVAA0G2ph40HXBuzOSsqlT6HqYMO2wgG8pJkrTc
 qAjoSJGzfAHIsjxzxKI8wKuufCddjCm30VWWU7EKeriI6h1J0uPVqKkQMfYBTkik
 696eZSBycAVgwayOng3XaehiTxOL7qGMTjUpDjUR6UscbiPG919vP+QsbIUuBXdb
 YoddBQJdyGJiaCXv32ciJjo9bjPRRi/bII7Q5qzCNI2mi4ZVbudF4ffzyQvdHtNJ
 nGnpRXoPi7kMvUrKTMPWkFjj0R5/UsPszsA51zbxPydfgBe0Dlc2PrrIG8dlzYAp
 wbV0Lec+iJucKlt7EZtrjz1xOiOOaQt/5/cW1bWqL+wk2t6gAuY=
 =9zTe
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "A more active cycle than most of the recent past, with a few large,
  long discussed works this time.

  The RNBD block driver has been posted for nearly two years now, and
  flowing through RDMA due to it also introducing a new ULP.

  The removal of FMR has been a recurring discussion theme for a long
  time.

  And the usual smattering of features and bug fixes.

  Summary:

   - Various small driver bugs fixes in rxe, mlx5, hfi1, and efa

   - Continuing driver cleanups in bnxt_re, hns

   - Big cleanup of mlx5 QP creation flows

   - More consistent use of src port and flow label when LAG is used and
     a mlx5 implementation

   - Additional set of cleanups for IB CM

   - 'RNBD' network block driver and target. This is a network block
     RDMA device specific to ionos's cloud environment. It brings strong
     multipath and resiliency capabilities.

   - Accelerated IPoIB for HFI1

   - QP/WQ/SRQ ioctl migration for uverbs, and support for multiple
     async fds

   - Support for exchanging the new IBTA defiend ECE data during RDMA CM
     exchanges

   - Removal of the very old and insecure FMR interface from all ULPs
     and drivers. FRWR should be preferred for at least a decade now"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (247 commits)
  RDMA/cm: Spurious WARNING triggered in cm_destroy_id()
  RDMA/mlx5: Return ECE DC support
  RDMA/mlx5: Don't rely on FW to set zeros in ECE response
  RDMA/mlx5: Return an error if copy_to_user fails
  IB/hfi1: Use free_netdev() in hfi1_netdev_free()
  RDMA/hns: Uninitialized variable in modify_qp_init_to_rtr()
  RDMA/core: Move and rename trace_cm_id_create()
  IB/hfi1: Fix hfi1_netdev_rx_init() error handling
  RDMA: Remove 'max_map_per_fmr'
  RDMA: Remove 'max_fmr'
  RDMA/core: Remove FMR device ops
  RDMA/rdmavt: Remove FMR memory registration
  RDMA/mthca: Remove FMR support for memory registration
  RDMA/mlx4: Remove FMR support for memory registration
  RDMA/i40iw: Remove FMR leftovers
  RDMA/bnxt_re: Remove FMR leftovers
  RDMA/mlx5: Remove FMR leftovers
  RDMA/core: Remove FMR pool API
  RDMA/rds: Remove FMR support for memory registration
  RDMA/srp: Remove support for FMR memory registration
  ...
2020-06-05 14:05:57 -07:00
Linus Torvalds
ac7b34218a Split the old READ_IMPLIES_EXEC workaround from executable PT_GNU_STACK
now that toolchains long support PT_GNU_STACK marking and there's no
 need anymore to force modern programs into having all its user mappings
 executable instead of only the stack and the PROT_EXEC ones. Disable
 that automatic READ_IMPLIES_EXEC forcing on x86-64 and arm64. Add tables
 documenting how READ_IMPLIES_EXEC is handled on x86-64, arm and arm64.
 By Kees Cook.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl7YFDIACgkQEsHwGGHe
 VUpnzxAAmXdODNOb1gGQvt+KJthkfkWh2A2R+tWxCRmFtjFTcS/eRxFfvGu2KmFY
 2b2AcJzuJeGjs7WIvQU0pkR2p6STyzuSBBLj5J/OJR9FonQ4pPah38df4A0fOgI6
 GJyJV9Ie7O2Ph1w2iLOeWBdmR90CnYuabxsfipgOL+sjHlEI0RqLSDgARRQsxTEj
 KM+JVAFD472KcUJnQKBVBOD1I1DOVBGu12r3y6chgsOtwshLNW/cO15cDgYrgnJZ
 OlR3EIUukCEEc1KQzUCihsypLuGfrmdq1MyPN8CME8gLfmOBsJyGRDhvmdbS+Wxh
 kAMYQ9BuNP/jMVtN950qV0qUtnZCeIPlj1sDb9STWz5fInLsXDSCS0eYi32yBFi+
 7yviVU95ml6Mda1Qd5axItTHFAjKIn0qfMZszkLOtUszIzNinCgH7t3ThoXeV223
 BqrpntRwiGZVpXDdcp0QFYBsWSMchR47yuhL8pB4SWxQzgNzXqAEg2KFQU0XMDKp
 pdia9IzUozg/BrjG5cnRfZhq2lBra7fy3Dn6fw5+NR5vqhka0Wr8L6dyM1Rj74EU
 HPk5bRXgt0OIiIFPi4139ApY7k+8j2nbf12qUchue1ZVVKzbvK996FDXbrGgW3zD
 Wis1wglxB9urSUTmC1bMOeyOd+gebo3i/ACAjgSo+EbDN7qW0Qw=
 =2L7y
 -----END PGP SIGNATURE-----

Merge tag 'core_core_updates_for_5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull READ_IMPLIES_EXEC changes from Borislav Petkov:
 "Split the old READ_IMPLIES_EXEC workaround from executable
  PT_GNU_STACK now that toolchains long support PT_GNU_STACK marking and
  there's no need anymore to force modern programs into having all its
  user mappings executable instead of only the stack and the PROT_EXEC
  ones.

  Disable that automatic READ_IMPLIES_EXEC forcing on x86-64 and
  arm64.

  Add tables documenting how READ_IMPLIES_EXEC is handled on x86-64, arm
  and arm64.

  By Kees Cook"

* tag 'core_core_updates_for_5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  arm64/elf: Disable automatic READ_IMPLIES_EXEC for 64-bit address spaces
  arm32/64/elf: Split READ_IMPLIES_EXEC from executable PT_GNU_STACK
  arm32/64/elf: Add tables to document READ_IMPLIES_EXEC
  x86/elf: Disable automatic READ_IMPLIES_EXEC on 64-bit
  x86/elf: Split READ_IMPLIES_EXEC from executable PT_GNU_STACK
  x86/elf: Add table to document READ_IMPLIES_EXEC
2020-06-05 13:45:21 -07:00
Andreas Gruenbacher
300e549b6e Merge branch 'gfs2-iopen' into for-next 2020-06-05 21:25:36 +02:00
Bob Peterson
83d060ca8d gfs2: fix use-after-free on transaction ail lists
Before this patch, transactions could be merged into the system
transaction by function gfs2_merge_trans(), but the transaction ail
lists were never merged. Because the ail flushing mechanism can run
separately, bd elements can be attached to the transaction's buffer
list during the transaction (trans_add_meta, etc) but quickly moved
to its ail lists. Later, in function gfs2_trans_end, the transaction
can be freed (by gfs2_trans_end) while it still has bd elements
queued to its ail lists, which can cause it to either lose track of
the bd elements altogether (memory leak) or worse, reference the bd
elements after the parent transaction has been freed.

Although I've not seen any serious consequences, the problem becomes
apparent with the previous patch's addition of:

	gfs2_assert_warn(sdp, list_empty(&tr->tr_ail1_list));

to function gfs2_trans_free().

This patch adds logic into gfs2_merge_trans() to move the merged
transaction's ail lists to the sdp transaction. This prevents the
use-after-free. To do this properly, we need to hold the ail lock,
so we pass sdp into the function instead of the transaction itself.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-06-05 21:24:25 +02:00
Bob Peterson
b839dadae8 gfs2: new slab for transactions
This patch adds a new slab for gfs2 transactions. That allows us to
reduce kernel memory fragmentation, have better organization of data
for analysis of vmcore dumps. A new centralized function is added to
free the slab objects, and it exposes use-after-free by giving
warnings if a transaction is freed while it still has bd elements
attached to its buffers or ail lists. We make sure to initialize
those transaction ail lists so we can check their integrity when freeing.

At a later time, we should add a slab initialization function to
make it more efficient, but for this initial patch I wanted to
minimize the impact.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-06-05 21:24:25 +02:00
Bob Peterson
cbcc89b630 gfs2: initialize transaction tr_ailX_lists earlier
Since transactions may be freed shortly after they're created, before
a log_flush occurs, we need to initialize their ail1 and ail2 lists
earlier. Before this patch, the ail1 list was initialized in gfs2_log_flush().
This moves the initialization to the point when the transaction is first
created.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-06-05 21:24:25 +02:00
Andreas Gruenbacher
9e8990dea9 gfs2: Smarter iopen glock waiting
When trying to upgrade the iopen glock from a shared to an exclusive lock in
gfs2_evict_inode, abort the wait if there is contention on the corresponding
inode glock: in that case, the inode must still be in active use on another
node, and we're not guaranteed to get the iopen glock anytime soon.

To make this work even better, when we notice contention on the iopen glock and
we can't evict the corresponsing inode and release the iopen glock immediately,
poke the inode glock.  The other node(s) trying to acquire the lock can then
abort instead of timing out.

Thanks to Heinz Mauelshagen for pointing out a locking bug in a previous
version of this patch.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-06-05 20:19:21 +02:00