Commit Graph

60814 Commits

Author SHA1 Message Date
Gustavo A. R. Silva
0a4c92657f fs: mark expected switch fall-throughs
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

This patch fixes the following warnings:

fs/affs/affs.h:124:38: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/configfs/dir.c:1692:11: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/configfs/dir.c:1694:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ceph/file.c:249:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ext4/hash.c:233:15: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ext4/hash.c:246:15: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ext2/inode.c:1237:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ext2/inode.c:1244:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ext4/indirect.c:1182:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ext4/indirect.c:1188:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ext4/indirect.c:1432:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ext4/indirect.c:1440:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/f2fs/node.c:618:8: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/f2fs/node.c:620:8: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/btrfs/ref-verify.c:522:15: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/gfs2/bmap.c:711:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/gfs2/bmap.c:722:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/jffs2/fs.c:339:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/nfsd/nfs4proc.c:429:12: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ufs/util.h:62:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ufs/util.h:43:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/fcntl.c:770:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/seq_file.c:319:10: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/libfs.c:148:11: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/libfs.c:150:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/signalfd.c:178:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/locks.c:1473:16: warning: this statement may fall through [-Wimplicit-fallthrough=]

Warning level 3 was used: -Wimplicit-fallthrough=3

This patch is part of the ongoing efforts to enabling
-Wimplicit-fallthrough.

Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
2019-04-08 18:21:02 -05:00
Jens Axboe
3ec482d15c io_uring: restrict IORING_SETUP_SQPOLL to root
This options spawns a kernel side thread that will poll for submissions
(and completions, if IORING_SETUP_IOPOLL is set). As this allows a user
to potentially use more cycles outside of the normal hierarchy,
restrict the use of this feature to root.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-08 10:51:01 -06:00
Trond Myklebust
e6abc8caa6 nfsd: Don't release the callback slot unless it was actually held
If there are multiple callbacks queued, waiting for the callback
slot when the callback gets shut down, then they all currently
end up acting as if they hold the slot, and call
nfsd4_cb_sequence_done() resulting in interesting side-effects.

In addition, the 'retry_nowait' path in nfsd4_cb_sequence_done()
causes a loop back to nfsd4_cb_prepare() without first freeing the
slot, which causes a deadlock when nfsd41_cb_get_slot() gets called
a second time.

This patch therefore adds a boolean to track whether or not the
callback did pick up the slot, so that it can do the right thing
in these 2 cases.

Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-04-08 12:43:15 -04:00
Linus Torvalds
429fba106e Merge tag 'for-linus-20190407' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:

 - Fixups for the pf/pcd queue handling (YueHaibing)

 - Revert of the three direct issue changes as they have been proven to
   cause an issue with dm-mpath (Bart)

 - Plug rq_count reset fix (Dongli)

 - io_uring double free in fileset registration error handling (me)

 - Make null_blk handle bad numa node passed in (John)

 - BFQ ifdef fix (Konstantin)

 - Flush queue leak fix (Shenghui)

 - Plug trace fix (Yufen)

* tag 'for-linus-20190407' of git://git.kernel.dk/linux-block:
  xsysace: Fix error handling in ace_setup
  null_blk: prevent crash from bad home_node value
  block: Revert v5.0 blk_mq_request_issue_directly() changes
  paride/pcd: Fix potential NULL pointer dereference and mem leak
  blk-mq: do not reset plug->rq_count before the list is sorted
  paride/pf: Fix potential NULL pointer dereference
  io_uring: fix double free in case of fileset regitration failure
  blk-mq: add trace block plug and unplug for multiple queues
  block: use blk_free_flush_queue() to free hctx->fq in blk_mq_init_hctx
  block/bfq: fix ifdef for CONFIG_BFQ_GROUP_IOSCHED=y
2019-04-07 13:28:36 -10:00
Arnd Bergmann
1e83bc8156 ext4: use BUG() instead of BUG_ON(1)
BUG_ON(1) leads to bogus warnings from clang when
CONFIG_PROFILE_ANNOTATED_BRANCHES is set:

 fs/ext4/inode.c:544:4: error: variable 'retval' is used uninitialized whenever 'if' condition is false
      [-Werror,-Wsometimes-uninitialized]
                        BUG_ON(1);
                        ^~~~~~~~~
 include/asm-generic/bug.h:61:36: note: expanded from macro 'BUG_ON'
                                   ^~~~~~~~~~~~~~~~~~~
 include/linux/compiler.h:48:23: note: expanded from macro 'unlikely'
                        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 fs/ext4/inode.c:591:6: note: uninitialized use occurs here
        if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
            ^~~~~~
 fs/ext4/inode.c:544:4: note: remove the 'if' if its condition is always true
                        BUG_ON(1);
                        ^
 include/asm-generic/bug.h:61:32: note: expanded from macro 'BUG_ON'
                               ^
 fs/ext4/inode.c:502:12: note: initialize the variable 'retval' to silence this warning

Change it to BUG() so clang can see that this code path can never
continue.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2019-04-07 12:24:43 -04:00
Liu Xiang
d454a27384 ext4: fix prefetchw of NULL page
In ext4_mpage_readpages(), if the parameter pages is not NULL, another
parameter page is NULL. At the first time prefetchw(&page->flags)
works on NULL. From second time, prefetchw(&page->flags) always works on
the last consumed page. This might do little improvment for handling
current page. So prefetchw() should be called while the page pointer
has just been updated.

Signed-off-by: Liu Xiang <liu.xiang6@zte.com.cn>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-04-07 11:54:27 -04:00
Jiufei Xue
742b06b562 jbd2: check superblock mapped prior to committing
We hit a BUG at fs/buffer.c:3057 if we detached the nbd device
before unmounting ext4 filesystem.

The typical chain of events leading to the BUG:
jbd2_write_superblock
  submit_bh
    submit_bh_wbc
      BUG_ON(!buffer_mapped(bh));

The block device is removed and all the pages are invalidated. JBD2
was trying to write journal superblock to the block device which is
no longer present.

Fix this by checking the journal superblock's buffer head prior to
submitting.

Reported-by: Eric Ren <renzhen@linux.alibaba.com>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@kernel.org
2019-04-06 18:57:40 -04:00
Eric Biggers
fe53cbc5a3 ext4: remove incorrect comment for NEXT_ORPHAN()
The comment above NEXT_ORPHAN() was meant for ext4_encrypted_inode(),
which was moved by commit a7550b30ab ("ext4 crypto: migrate into vfs's
crypto engine") but the comment was accidentally left in place.  Since
ext4_encrypted_inode() has now been removed, just remove the comment.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2019-04-06 18:53:05 -04:00
Jan Kara
31562b954b ext4: make sanity check in mballoc more strict
The sanity check in mb_find_extent() only checked that returned extent
does not extend past blocksize * 8, however it should not extend past
EXT4_CLUSTERS_PER_GROUP(sb). This can happen when clusters_per_group <
blocksize * 8 and the tail of the bitmap is not properly filled by 1s
which happened e.g. when ancient kernels have grown the filesystem.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2019-04-06 18:33:06 -04:00
Liu Song
fb20375109 jbd2: remove repeated assignments in __jbd2_log_wait_for_space()
At the beginning, nblocks has been assigned. There is no need
to repeat the assignment in the while loop, and remove it.

Signed-off-by: Liu Song <liu.song11@zte.com.cn>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2019-04-06 18:14:17 -04:00
Kirill Smelkov
10dce8af34 fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock
Commit 9c225f2655 ("vfs: atomic f_pos accesses as per POSIX") added
locking for file.f_pos access and in particular made concurrent read and
write not possible - now both those functions take f_pos lock for the
whole run, and so if e.g. a read is blocked waiting for data, write will
deadlock waiting for that read to complete.

This caused regression for stream-like files where previously read and
write could run simultaneously, but after that patch could not do so
anymore. See e.g. commit 581d21a2d0 ("xenbus: fix deadlock on writes
to /proc/xen/xenbus") which fixes such regression for particular case of
/proc/xen/xenbus.

The patch that added f_pos lock in 2014 did so to guarantee POSIX thread
safety for read/write/lseek and added the locking to file descriptors of
all regular files. In 2014 that thread-safety problem was not new as it
was already discussed earlier in 2006.

However even though 2006'th version of Linus's patch was adding f_pos
locking "only for files that are marked seekable with FMODE_LSEEK (thus
avoiding the stream-like objects like pipes and sockets)", the 2014
version - the one that actually made it into the tree as 9c225f2655 -
is doing so irregardless of whether a file is seekable or not.

See

    https://lore.kernel.org/lkml/53022DB1.4070805@gmail.com/
    https://lwn.net/Articles/180387
    https://lwn.net/Articles/180396

for historic context.

The reason that it did so is, probably, that there are many files that
are marked non-seekable, but e.g. their read implementation actually
depends on knowing current position to correctly handle the read. Some
examples:

	kernel/power/user.c		snapshot_read
	fs/debugfs/file.c		u32_array_read
	fs/fuse/control.c		fuse_conn_waiting_read + ...
	drivers/hwmon/asus_atk0110.c	atk_debugfs_ggrp_read
	arch/s390/hypfs/inode.c		hypfs_read_iter
	...

Despite that, many nonseekable_open users implement read and write with
pure stream semantics - they don't depend on passed ppos at all. And for
those cases where read could wait for something inside, it creates a
situation similar to xenbus - the write could be never made to go until
read is done, and read is waiting for some, potentially external, event,
for potentially unbounded time -> deadlock.

Besides xenbus, there are 14 such places in the kernel that I've found
with semantic patch (see below):

	drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write()
	drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write()
	drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write()
	drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write()
	net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write()
	drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write()
	drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write()
	drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write()
	net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write()
	drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write()
	drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write()
	drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write()
	drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write()
	drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write()

In addition to the cases above another regression caused by f_pos
locking is that now FUSE filesystems that implement open with
FOPEN_NONSEEKABLE flag, can no longer implement bidirectional
stream-like files - for the same reason as above e.g. read can deadlock
write locking on file.f_pos in the kernel.

FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990f7 ("fuse:
implement nonseekable open") to support OSSPD. OSSPD implements /dev/dsp
in userspace with FOPEN_NONSEEKABLE flag, with corresponding read and
write routines not depending on current position at all, and with both
read and write being potentially blocking operations:

See

    https://github.com/libfuse/osspd
    https://lwn.net/Articles/308445

    https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1406
    https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1438-L1477
    https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1479-L1510

Corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as
"somewhat pipe-like files ..." with read handler not using offset.
However that test implements only read without write and cannot exercise
the deadlock scenario:

    https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L124-L131
    https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L146-L163
    https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L209-L216

I've actually hit the read vs write deadlock for real while implementing
my FUSE filesystem where there is /head/watch file, for which open
creates separate bidirectional socket-like stream in between filesystem
and its user with both read and write being later performed
simultaneously. And there it is semantically not easy to split the
stream into two separate read-only and write-only channels:

    https://lab.nexedi.com/kirr/wendelin.core/blob/f13aa600/wcfs/wcfs.go#L88-169

Let's fix this regression. The plan is:

1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS -
   doing so would break many in-kernel nonseekable_open users which
   actually use ppos in read/write handlers.

2. Add stream_open() to kernel to open stream-like non-seekable file
   descriptors. Read and write on such file descriptors would never use
   nor change ppos. And with that property on stream-like files read and
   write will be running without taking f_pos lock - i.e. read and write
   could be running simultaneously.

3. With semantic patch search and convert to stream_open all in-kernel
   nonseekable_open users for which read and write actually do not
   depend on ppos and where there is no other methods in file_operations
   which assume @offset access.

4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file-descriptors via
   steam_open if that bit is present in filesystem open reply.

   It was tempting to change fs/fuse/ open handler to use stream_open
   instead of nonseekable_open on just FOPEN_NONSEEKABLE flags, but
   grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE,
   and in particular GVFS which actually uses offset in its read and
   write handlers

	https://codesearch.debian.net/search?q=-%3Enonseekable+%3D
	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080
	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346
	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481

   so if we would do such a change it will break a real user.

5. Add stream_open and FOPEN_STREAM handling to stable kernels starting
   from v3.14+ (the kernel where 9c225f2655 first appeared).

   This will allow to patch OSSPD and other FUSE filesystems that
   provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE
   in their open handler and this way avoid the deadlock on all kernel
   versions. This should work because fs/fuse/ ignores unknown open
   flags returned from a filesystem and so passing FOPEN_STREAM to a
   kernel that is not aware of this flag cannot hurt. In turn the kernel
   that is not aware of FOPEN_STREAM will be < v3.14 where just
   FOPEN_NONSEEKABLE is sufficient to implement streams without read vs
   write deadlock.

This patch adds stream_open, converts /proc/xen/xenbus to it and adds
semantic patch to automatically locate in-kernel places that are either
required to be converted due to read vs write deadlock, or that are just
safe to be converted because read and write do not use ppos and there
are no other funky methods in file_operations.

Regarding semantic patch I've verified each generated change manually -
that it is correct to convert - and each other nonseekable_open instance
left - that it is either not correct to convert there, or that it is not
converted due to current stream_open.cocci limitations.

The script also does not convert files that should be valid to convert,
but that currently have .llseek = noop_llseek or generic_file_llseek for
unknown reason despite file being opened with nonseekable_open (e.g.
drivers/input/mousedev.c)

Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Yongzhi Pan <panyongzhi@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Tejun Heo <tj@kernel.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Julia Lawall <Julia.Lawall@lip6.fr>
Cc: Nikolaus Rath <Nikolaus@rath.org>
Cc: Han-Wen Nienhuys <hanwen@google.com>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-06 07:01:55 -10:00
Christoph Hellwig
72deb455b5 block: remove CONFIG_LBDAF
Currently support for 64-bit sector_t and blkcnt_t is optional on 32-bit
architectures.  These types are required to support block device and/or
file sizes larger than 2 TiB, and have generally defaulted to on for
a long time.  Enabling the option only increases the i386 tinyconfig
size by 145 bytes, and many data structures already always use
64-bit values for their in-core and on-disk data structures anyway,
so there should not be a large change in dynamic memory usage either.

Dropping this option removes a somewhat weird non-default config that
has cause various bugs or compiler warnings when actually used.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-06 10:48:35 -06:00
Linus Torvalds
f654f0fc0b Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "14 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  kernel/sysctl.c: fix out-of-bounds access when setting file-max
  mm/util.c: fix strndup_user() comment
  sh: fix multiple function definition build errors
  MAINTAINERS: add maintainer and replacing reviewer ARM/NUVOTON NPCM
  MAINTAINERS: fix bad pattern in ARM/NUVOTON NPCM
  mm: writeback: use exact memcg dirty counts
  psi: clarify the units used in pressure files
  mm/huge_memory.c: fix modifying of page protection by insert_pfn_pmd()
  hugetlbfs: fix memory leak for resv_map
  mm: fix vm_fault_t cast in VM_FAULT_GET_HINDEX()
  lib/lzo: fix bugs for very short or empty input
  include/linux/bitrev.h: fix constant bitrev
  kmemleak: powerpc: skip scanning holes in the .bss section
  lib/string.c: implement a basic bcmp
2019-04-05 17:08:55 -10:00
Mike Kravetz
58b6e5e8f1 hugetlbfs: fix memory leak for resv_map
When mknod is used to create a block special file in hugetlbfs, it will
allocate an inode and kmalloc a 'struct resv_map' via resv_map_alloc().
inode->i_mapping->private_data will point the newly allocated resv_map.
However, when the device special file is opened bd_acquire() will set
inode->i_mapping to bd_inode->i_mapping.  Thus the pointer to the
allocated resv_map is lost and the structure is leaked.

Programs to reproduce:
        mount -t hugetlbfs nodev hugetlbfs
        mknod hugetlbfs/dev b 0 0
        exec 30<> hugetlbfs/dev
        umount hugetlbfs/

resv_map structures are only needed for inodes which can have associated
page allocations.  To fix the leak, only allocate resv_map for those
inodes which could possibly be associated with page allocations.

Link: http://lkml.kernel.org/r/20190401213101.16476-1-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reported-by: Yufen Yu <yuyufen@huawei.com>
Suggested-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-05 16:02:31 -10:00
Murphy Zhou
3c86794ac0 nfsd/nfsd3_proc_readdir: fix buffer count and page pointers
After this commit
  f875a79 nfsd: allow nfsv3 readdir request to be larger.
nfsv3 readdir request size can be larger than PAGE_SIZE. So if the
directory been read is large enough, we can use multiple pages
in rq_respages. Update buffer count and page pointers like we do
in readdirplus to make this happen.

Now listing a directory within 3000 files will panic because we
are counting in a wrong way and would write on random page.

Fixes: f875a79 "nfsd: allow nfsv3 readdir request to be larger"
Signed-off-by: Murphy Zhou <jencce.kernel@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-04-05 19:57:24 -04:00
Linus Torvalds
970b766cfd Merge tag 'trace-5.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull syscall-get-arguments cleanup and fixes from Steven Rostedt:
 "Andy Lutomirski approached me to tell me that the
  syscall_get_arguments() implementation in x86 was horrible and gcc
  certainly gets it wrong.

  He said that since the tracepoints only pass in 0 and 6 for i and n
  repectively, it should be optimized for that case. Inspecting the
  kernel, I discovered that all users pass in 0 for i and only one file
  passing in something other than 6 for the number of arguments. That
  code happens to be my own code used for the special syscall tracing.

  That can easily be converted to just using 0 and 6 as well, and only
  copying what is needed. Which is probably the faster path anyway for
  that case.

  Along the way, a couple of real fixes came from this as the
  syscall_get_arguments() function was incorrect for csky and riscv.

  x86 has been optimized to for the new interface that removes the
  variable number of arguments, but the other architectures could still
  use some loving and take more advantage of the simpler interface"

* tag 'trace-5.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  syscalls: Remove start and number from syscall_set_arguments() args
  syscalls: Remove start and number from syscall_get_arguments() args
  csky: Fix syscall_get_arguments() and syscall_set_arguments()
  riscv: Fix syscall_get_arguments() and syscall_set_arguments()
  tracing/syscalls: Pass in hardcoded 6 into syscall_get_arguments()
  ptrace: Remove maxargs from task_current_syscall()
2019-04-05 13:15:57 -10:00
Chao Yu
e1074d4b1d f2fs: add comment for conditional compilation statement
Commit af033b2aa8 ("f2fs: guarantee journalled quota data by checkpoint")
added function is_journalled_quota() in f2fs.h, but it located outside of
_LINUX_F2FS_H macro coverage, it has been fixed with commit 0af725fcb7
("f2fs: fix wrong #endif").

But anyway, in order to avoid making same mistake latter, let's add single
line comment to notice which #if the last #endif is corresponding to.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: Remove unnecessary empty EOL]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-04-05 09:35:02 -07:00
Chao Yu
186857c5a1 f2fs: fix potential recursive call when enabling data_flush
As Hagbard Celine reported:

Hi, this is a long standing bug that I've hit before on older kernels,
but I was not able to get the syslog saved because of the nature of
the bug. This time I had booted form a pen-drive, and was able to save
the log to it's efi-partition.
What i did to trigger it was to create a partition and format it f2fs,
then mount it with options:
"rw,relatime,lazytime,background_gc=on,disable_ext_identify,discard,heap,user_xattr,inline_xattr,acl,inline_data,inline_dentry,flush_merge,data_flush,extent_cache,mode=adaptive,active_logs=6,whint_mode=fs-based,alloc_mode=default,fsync_mode=strict".
Then I unpacked a big .tar.xz to the partition (I used a
gentoo-stage3-tarball as I was in process of installing Gentoo).

Same options just without data_flush gives no problems.

Mar 20 20:54:01 usbgentoo kernel: FAT-fs (nvme0n1p4): Volume was not
properly unmounted. Some data may be corrupt. Please run fsck.
Mar 20 21:05:23 usbgentoo kernel: kworker/dying (1588) used greatest
stack depth: 12064 bytes left
Mar 20 21:06:40 usbgentoo kernel: BUG: stack guard page was hit at
00000000a4b0733c (stack is 0000000056016422..0000000096e7463f)
Mar 20 21:06:40 usbgentoo kernel: kernel stack overflow

......

Mar 20 21:06:40 usbgentoo kernel: Call Trace:
Mar 20 21:06:40 usbgentoo kernel:  read_node_page+0x71/0xf0
Mar 20 21:06:40 usbgentoo kernel:  ? xas_load+0x8/0x50
Mar 20 21:06:40 usbgentoo kernel:  __get_node_page+0x73/0x2a0
Mar 20 21:06:40 usbgentoo kernel:  f2fs_get_dnode_of_data+0x34e/0x580
Mar 20 21:06:40 usbgentoo kernel:  f2fs_write_inline_data+0x5e/0x2a0
Mar 20 21:06:40 usbgentoo kernel:  __write_data_page+0x421/0x690
Mar 20 21:06:40 usbgentoo kernel:  f2fs_write_cache_pages+0x1cf/0x460
Mar 20 21:06:40 usbgentoo kernel:  f2fs_write_data_pages+0x2b3/0x2e0
Mar 20 21:06:40 usbgentoo kernel:  ? f2fs_inode_chksum_verify+0x1d/0xc0
Mar 20 21:06:40 usbgentoo kernel:  ? read_node_page+0x71/0xf0
Mar 20 21:06:40 usbgentoo kernel:  do_writepages+0x3c/0xd0
Mar 20 21:06:40 usbgentoo kernel:  __filemap_fdatawrite_range+0x7c/0xb0
Mar 20 21:06:40 usbgentoo kernel:  f2fs_sync_dirty_inodes+0xf2/0x200
Mar 20 21:06:40 usbgentoo kernel:  f2fs_balance_fs_bg+0x2a3/0x2c0
Mar 20 21:06:40 usbgentoo kernel:  ? f2fs_inode_dirtied+0x21/0xc0
Mar 20 21:06:40 usbgentoo kernel:  f2fs_balance_fs+0xd6/0x2b0
Mar 20 21:06:40 usbgentoo kernel:  __write_data_page+0x4fb/0x690

......

Mar 20 21:06:40 usbgentoo kernel:  __writeback_single_inode+0x2a1/0x340
Mar 20 21:06:40 usbgentoo kernel:  ? soft_cursor+0x1b4/0x220
Mar 20 21:06:40 usbgentoo kernel:  writeback_sb_inodes+0x1d5/0x3e0
Mar 20 21:06:40 usbgentoo kernel:  __writeback_inodes_wb+0x58/0xa0
Mar 20 21:06:40 usbgentoo kernel:  wb_writeback+0x250/0x2e0
Mar 20 21:06:40 usbgentoo kernel:  ? 0xffffffff8c000000
Mar 20 21:06:40 usbgentoo kernel:  ? cpumask_next+0x16/0x20
Mar 20 21:06:40 usbgentoo kernel:  wb_workfn+0x2f6/0x3b0
Mar 20 21:06:40 usbgentoo kernel:  ? __switch_to_asm+0x40/0x70
Mar 20 21:06:40 usbgentoo kernel:  process_one_work+0x1f5/0x3f0
Mar 20 21:06:40 usbgentoo kernel:  worker_thread+0x28/0x3c0
Mar 20 21:06:40 usbgentoo kernel:  ? rescuer_thread+0x330/0x330
Mar 20 21:06:40 usbgentoo kernel:  kthread+0x10e/0x130
Mar 20 21:06:40 usbgentoo kernel:  ? kthread_create_on_node+0x60/0x60
Mar 20 21:06:40 usbgentoo kernel:  ret_from_fork+0x35/0x40

The root cause is that we run into an infinite recursive calling in
between f2fs_balance_fs_bg and writepage() as described below:

- f2fs_write_data_pages		--- A
 - __write_data_page
  - f2fs_balance_fs
   - f2fs_balance_fs_bg		--- B
    - f2fs_sync_dirty_inodes
     - filemap_fdatawrite
      - f2fs_write_data_pages	--- A
...
          - f2fs_balance_fs_bg	--- B
...

In order to fix this issue, let's detect such condition in __write_data_page()
and just skip calling f2fs_balance_fs() recursively.

Reported-by: Hagbard Celine <hagbardcelin@gmail.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-04-05 09:34:14 -07:00
Damien Le Moal
7f3d7719c1 f2fs: improve discard handling with multi-device volumes
f2fs_hw_support_discard() only tests if the super block device supports
discard. However, for a multi-device volume, not all disks used may
support discard. Improve the check performed to test all devices of
the volume and report discard as supported if at least one device of
the volume supports discard. To implement this, introduce the helper
function f2fs_bdev_support_discard(), which returns true for zoned block
devices (where discard is processed as a zone reset) and for regular
disks supporting the discard command.

f2fs_bdev_support_discard() is also used in __queue_discard_cmd() to
handle discard command issuing for a particular device of the volume.
That is, prevent issuing a discard command for block devices that do
not support it.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-04-05 09:33:55 -07:00
Damien Le Moal
95175dafc4 f2fs: Reduce zoned block device memory usage
For zoned block devices, an array of zone types for each device is
allocated and initialized in order to determine if a section is stored
on a sequential zone (zone reset needed) or a conventional zone (no
zone reset needed and regular discard applies). Considering this usage,
the zone types stored in memory can be replaced with a bitmap to
indicate an equivalent information, that is, if a zone is sequential or
not. This reduces the memory usage for each zoned device by roughly 8:
on a 14TB disk with zones of 256 MB, the zone type array consumes
13x4KB pages while the bitmap uses only 2x4KB pages.

This patch changes the f2fs_dev_info structure blkz_type field to the
bitmap blkz_seq. Access to this bitmap is done using the helper
function f2fs_blkz_is_seq(), which is a rewrite of the function
get_blkz_type().

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-04-05 09:33:55 -07:00
Damien Le Moal
0916878da3 f2fs: Fix use of number of devices
For a single device mount using a zoned block device, the zone
information for the device is stored in the sbi->devs single entry
array and sbi->s_ndevs is set to 1. This differs from a single device
mount using a regular block device which does not allocate sbi->devs
and sets sbi->s_ndevs to 0.

However, sbi->s_devs == 0 condition is used throughout the code to
differentiate a single device mount from a multi-device mount where
sbi->s_ndevs is always larger than 1. This results in problems with
single zoned block device volumes as these are treated as multi-device
mounts but do not have the start_blk and end_blk information set. One
of the problem observed is skipping of zone discard issuing resulting in
write commands being issued to full zones or unaligned to a zone write
pointer.

Fix this problem by simply treating the cases sbi->s_ndevs == 0 (single
regular block device mount) and sbi->s_ndevs == 1 (single zoned block
device mount) in the same manner. This is done by introducing the
helper function f2fs_is_multi_device() and using this helper in place
of direct tests of sbi->s_ndevs value, improving code readability.

Fixes: 7bb3a371d1 ("f2fs: Fix zoned block device support")
Cc: <stable@vger.kernel.org>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-04-05 09:33:54 -07:00
Al Viro
9419a3191d acct_on(): don't mess with freeze protection
What happens there is that we are replacing file->path.mnt of
a file we'd just opened with a clone and we need the write
count contribution to be transferred from original mount to
new one.  That's it.  We do *NOT* want any kind of freeze
protection for the duration of switchover.

IOW, we should just use __mnt_{want,drop}_write() for that
switchover; no need to bother with mnt_{want,drop}_write()
there.

Tested-by: Amir Goldstein <amir73il@gmail.com>
Reported-by: syzbot+2a73a6ea9507b7112141@syzkaller.appspotmail.com
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-04-04 21:04:13 -04:00
Wei Yongjun
6af1c849df aio: use kmem_cache_free() instead of kfree()
memory allocated by kmem_cache_alloc() should be freed using
kmem_cache_free(), not kfree().

Fixes: fa0ca2aee3 ("deal with get_reqs_available() in aio_get_req() itself")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-04-04 20:13:59 -04:00
Anand Jain
272e5326c7 btrfs: prop: fix vanished compression property after failed set
The compression property resets to NULL, instead of the old value if we
fail to set the new compression parameter.

  $ btrfs prop get /btrfs compression
    compression=lzo
  $ btrfs prop set /btrfs compression zli
    ERROR: failed to set compression for /btrfs: Invalid argument
  $ btrfs prop get /btrfs compression

This is because the compression property ->validate() is successful for
'zli' as the strncmp() used the length passed from the userspace.

Fix it by using the expected string length in strncmp().

Fixes: 63541927c8 ("Btrfs: add support for inode properties")
Fixes: 5c1aab1dd5 ("btrfs: Add zstd support")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-04 17:57:53 +02:00
Anand Jain
50398fde99 btrfs: prop: fix zstd compression parameter validation
We let pass zstd compression parameter even if it is not fully valid.
For example:

  $ btrfs prop set /btrfs compression zst
  $ btrfs prop get /btrfs compression
     compression=zst

zlib and lzo are fine.

Fix it by checking the correct prefix length.

Fixes: 5c1aab1dd5 ("btrfs: Add zstd support")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-04 17:56:12 +02:00
Steven Rostedt (Red Hat)
631b7abacd ptrace: Remove maxargs from task_current_syscall()
task_current_syscall() has a single user that passes in 6 for maxargs, which
is the maximum arguments that can be used to get system calls from
syscall_get_arguments(). Instead of passing in a number of arguments to
grab, just get 6 arguments. The args argument even specifies that it's an
array of 6 items.

This will also allow changing syscall_get_arguments() to not get a variable
number of arguments, but always grab 6.

Linus also suggested not passing in a bunch of arguments to
task_current_syscall() but to instead pass in a pointer to a structure, and
just fill the structure. struct seccomp_data has almost all the parameters
that is needed except for the stack pointer (sp). As seccomp_data is part of
uapi, and I'm afraid to change it, a new structure was created
"syscall_info", which includes seccomp_data and adds the "sp" field.

Link: http://lkml.kernel.org/r/20161107213233.466776454@goodmis.org

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-04-04 09:17:15 -04:00
Ondrej Mosnacek
1537ad15c9 kernfs: fix xattr name handling in LSM helpers
The implementation of kernfs_security_xattr_*() helpers reuses the
kernfs_node_xattr_*() functions, which take the suffix of the xattr name
and extract full xattr name from it using xattr_full_name(). However,
this function relies on the fact that the suffix passed to xattr
handlers from VFS is always constructed from the full name by just
incerementing the pointer. This doesn't necessarily hold for the callers
of kernfs_security_xattr_*(), so their usage will easily lead to
out-of-bounds access.

Fix this by moving the xattr name reconstruction to the VFS xattr
handlers and replacing the kernfs_security_xattr_*() helpers with more
general kernfs_xattr_*() helpers that take full xattr name and allow
accessing all kernfs node's xattrs.

Reported-by: kernel test robot <rong.a.chen@intel.com>
Fixes: b230d5aba2 ("LSM: add new hook for kernfs node initialization")
Fixes: ec882da5cd ("selinux: implement the kernfs_init_security hook")
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2019-04-04 09:00:27 -04:00
Dan Carpenter
18bfb9c6a8 aio: Fix an error code in __io_submit_one()
This accidentally returns the wrong variable.  The "req->ki_eventfd"
pointer is NULL so this return success.

Fixes: 7316b49c2a ("aio: move sanity checks and request allocation to io_submit_one()")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-04-03 12:47:36 -04:00
Jens Axboe
25adf50fe2 io_uring: fix double free in case of fileset regitration failure
Will Deacon reported the following KASAN complaint:

[  149.890370] ==================================================================
[  149.891266] BUG: KASAN: double-free or invalid-free in io_sqe_files_unregister+0xa8/0x140
[  149.892218]
[  149.892411] CPU: 113 PID: 3974 Comm: io_uring_regist Tainted: G    B             5.1.0-rc3-00012-g40b114779944 #3
[  149.893623] Hardware name: linux,dummy-virt (DT)
[  149.894169] Call trace:
[  149.894539]  dump_backtrace+0x0/0x228
[  149.895172]  show_stack+0x14/0x20
[  149.895747]  dump_stack+0xe8/0x124
[  149.896335]  print_address_description+0x60/0x258
[  149.897148]  kasan_report_invalid_free+0x78/0xb8
[  149.897936]  __kasan_slab_free+0x1fc/0x228
[  149.898641]  kasan_slab_free+0x10/0x18
[  149.899283]  kfree+0x70/0x1f8
[  149.899798]  io_sqe_files_unregister+0xa8/0x140
[  149.900574]  io_ring_ctx_wait_and_kill+0x190/0x3c0
[  149.901402]  io_uring_release+0x2c/0x48
[  149.902068]  __fput+0x18c/0x510
[  149.902612]  ____fput+0xc/0x18
[  149.903146]  task_work_run+0xf0/0x148
[  149.903778]  do_notify_resume+0x554/0x748
[  149.904467]  work_pending+0x8/0x10
[  149.905060]
[  149.905331] Allocated by task 3974:
[  149.905934]  __kasan_kmalloc.isra.0.part.1+0x48/0xf8
[  149.906786]  __kasan_kmalloc.isra.0+0xb8/0xd8
[  149.907531]  kasan_kmalloc+0xc/0x18
[  149.908134]  __kmalloc+0x168/0x248
[  149.908724]  __arm64_sys_io_uring_register+0x2b8/0x15a8
[  149.909622]  el0_svc_common+0x100/0x258
[  149.910281]  el0_svc_handler+0x48/0xc0
[  149.910928]  el0_svc+0x8/0xc
[  149.911425]
[  149.911696] Freed by task 3974:
[  149.912242]  __kasan_slab_free+0x114/0x228
[  149.912955]  kasan_slab_free+0x10/0x18
[  149.913602]  kfree+0x70/0x1f8
[  149.914118]  __arm64_sys_io_uring_register+0xc2c/0x15a8
[  149.915009]  el0_svc_common+0x100/0x258
[  149.915670]  el0_svc_handler+0x48/0xc0
[  149.916317]  el0_svc+0x8/0xc
[  149.916817]
[  149.917101] The buggy address belongs to the object at ffff8004ce07ed00
[  149.917101]  which belongs to the cache kmalloc-128 of size 128
[  149.919197] The buggy address is located 0 bytes inside of
[  149.919197]  128-byte region [ffff8004ce07ed00, ffff8004ce07ed80)
[  149.921142] The buggy address belongs to the page:
[  149.921953] page:ffff7e0013381f00 count:1 mapcount:0 mapping:ffff800503417c00 index:0x0 compound_mapcount: 0
[  149.923595] flags: 0x1ffff00000010200(slab|head)
[  149.924388] raw: 1ffff00000010200 dead000000000100 dead000000000200 ffff800503417c00
[  149.925706] raw: 0000000000000000 0000000080400040 00000001ffffffff 0000000000000000
[  149.927011] page dumped because: kasan: bad access detected
[  149.927956]
[  149.928224] Memory state around the buggy address:
[  149.929054]  ffff8004ce07ec00: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc
[  149.930274]  ffff8004ce07ec80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  149.931494] >ffff8004ce07ed00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  149.932712]                    ^
[  149.933281]  ffff8004ce07ed80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  149.934508]  ffff8004ce07ee00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  149.935725] ==================================================================

which is due to a failure in registrering a fileset. This frees the
ctx->user_files pointer, but doesn't clear it. When the io_uring
instance is later freed through the normal channels, we free this
pointer again. At this point it's invalid.

Ensure we clear the pointer when we free it for the error case.

Reported-by: Will Deacon <will.deacon@arm.com>
Tested-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-03 09:52:40 -06:00
Chengguang Xu
d358b1733f chardev: update comment based on the code
The function comment of __register_chrdev_region()
is out of date, so update it based on the code.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-02 17:49:58 +02:00
Chengguang Xu
4b0be57260 chardev: code cleanup for __register_chrdev_region()
It's just code cleanup, not functional change.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-02 17:49:58 +02:00
Chengguang Xu
4712d3796f chardev: add a check for given minor range
register_chrdev_region() carefully checks minor range
before calling __register_chrdev_region() but there is
another path from alloc_chrdev_region() which does not
check the range properly. So add a check for given minor
range in __register_chrdev_region().

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-02 17:49:58 +02:00
Chengguang Xu
de36e16d15 chardev: add additional check for minor range overlap
Current overlap checking cannot correctly handle
a case which is baseminor < existing baseminor &&
baseminor + minorct > existing baseminor + minorct.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-02 17:49:58 +02:00
Ronnie Sahlberg
4811e3096d cifs: a smb2_validate_and_copy_iov failure does not mean the handle is invalid.
It only means that we do not have a valid cached value for the
file_all_info structure.

CC: Stable <stable@vger.kernel.org>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2019-04-01 14:33:38 -05:00
Steve French
ca567eb2b3 SMB3: Allow persistent handle timeout to be configurable on mount
Reconnecting after server or network failure can be improved
(to maintain availability and protect data integrity) by allowing
the client to choose the default persistent (or resilient)
handle timeout in some use cases.  Today we default to 0 which lets
the server pick the default timeout (usually 120 seconds) but this
can be problematic for some workloads.  Add the new mount parameter
to cifs.ko for SMB3 mounts "handletimeout" which enables the user
to override the default handle timeout for persistent (mount
option "persistenthandles") or resilient handles (mount option
"resilienthandles").  Maximum allowed is 16 minutes (960000 ms).
Units for the timeout are expressed in milliseconds. See
section 2.2.14.2.12 and 2.2.31.3 of the MS-SMB2 protocol
specification for more information.

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
CC: Stable <stable@vger.kernel.org>
2019-04-01 14:33:36 -05:00
Steve French
153322f753 smb3: Fix enumerating snapshots to Azure
Some servers (see MS-SMB2 protocol specification
section 3.3.5.15.1) expect that the FSCTL enumerate snapshots
is done twice, with the first query having EXACTLY the minimum
size response buffer requested (16 bytes) which refreshes
the snapshot list (otherwise that and subsequent queries get
an empty list returned).  So had to add code to set
the maximum response size differently for the first snapshot
query (which gets the size needed for the second query which
contains the actual list of snapshots).

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org> # 4.19+
2019-04-01 14:33:34 -05:00
Ronnie Sahlberg
2f94a3125b cifs: fix kref underflow in close_shroot()
Fix a bug where we used to not initialize the cached fid structure at all
in open_shroot() if the open was successful but we did not get a lease.
This would leave the structure uninitialized and later when we close the handle
we would in close_shroot() try to kref_put() an uninitialized refcount.

Fix this by always initializing this structure if the open was successful
but only do the extra get() if we got a lease.
This extra get() is only used to hold the structure until we get a lease
break from the server at which point we will kref_put() it during lease
processing.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
2019-04-01 14:33:30 -05:00
Linus Torvalds
5e7a8ca319 Merge branch 'work.aio' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull aio race fixes and cleanups from Al Viro.

The aio code had more issues with error handling and races with the aio
completing at just the right (wrong) time along with freeing the file
descriptor when another thread closes the file.

Just a couple of these commits are the actual fixes: the others are
cleanups to either make the fixes simpler, or to make the code legible
and understandable enough that we hope there's no more fundamental races
hiding.

* 'work.aio' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  aio: move sanity checks and request allocation to io_submit_one()
  deal with get_reqs_available() in aio_get_req() itself
  aio: move dropping ->ki_eventfd into iocb_destroy()
  make aio_read()/aio_write() return int
  Fix aio_poll() races
  aio: store event at final iocb_put()
  aio: keep io_event in aio_kiocb
  aio: fold lookup_kiocb() into its sole caller
  pin iocb through aio.
2019-04-01 08:28:36 -07:00
Linus Torvalds
db5481e705 Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull symlink fixes from Al Viro:
 "The ceph fix is already in mainline, Daniel's bpf fix is in bpf tree
  (1da6c4d914 "bpf: fix use after free in bpf_evict_inode"), the rest
  is in here"

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  debugfs: fix use-after-free on symlink traversal
  ubifs: fix use-after-free on symlink traversal
  jffs2: fix use-after-free on symlink traversal
2019-04-01 07:51:48 -07:00
Al Viro
93b919da64 debugfs: fix use-after-free on symlink traversal
symlink body shouldn't be freed without an RCU delay.  Switch debugfs to
->destroy_inode() and use of call_rcu(); free both the inode and symlink
body in the callback.  Similar to solution for bpf, only here it's even
more obvious that ->evict_inode() can be dropped.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-04-01 00:31:02 -04:00
Al Viro
0cdc17ebd2 ubifs: fix use-after-free on symlink traversal
free the symlink body after the same RCU delay we have for freeing the
struct inode itself, so that traversal during RCU pathwalk wouldn't step
into freed memory.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-04-01 00:31:02 -04:00
Al Viro
4fdcfab5b5 jffs2: fix use-after-free on symlink traversal
free the symlink body after the same RCU delay we have for freeing the
struct inode itself, so that traversal during RCU pathwalk wouldn't step
into freed memory.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-04-01 00:31:02 -04:00
Linus Torvalds
922c010cf2 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "22 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (22 commits)
  fs/proc/proc_sysctl.c: fix NULL pointer dereference in put_links
  fs: fs_parser: fix printk format warning
  checkpatch: add %pt as a valid vsprintf extension
  mm/migrate.c: add missing flush_dcache_page for non-mapped page migrate
  drivers/block/zram/zram_drv.c: fix idle/writeback string compare
  mm/page_isolation.c: fix a wrong flag in set_migratetype_isolate()
  mm/memory_hotplug.c: fix notification in offline error path
  ptrace: take into account saved_sigmask in PTRACE{GET,SET}SIGMASK
  fs/proc/kcore.c: make kcore_modules static
  include/linux/list.h: fix list_is_first() kernel-doc
  mm/debug.c: fix __dump_page when mapping->host is not set
  mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified
  include/linux/hugetlb.h: convert to use vm_fault_t
  iommu/io-pgtable-arm-v7s: request DMA32 memory, and improve debugging
  mm: add support for kmem caches in DMA32 zone
  ocfs2: fix inode bh swapping mixup in ocfs2_reflink_inodes_lock
  mm/hotplug: fix offline undo_isolate_page_range()
  fs/open.c: allow opening only regular files during execve()
  mailmap: add Changbin Du
  mm/debug.c: add a cast to u64 for atomic64_read()
  ...
2019-03-29 16:02:28 -07:00
Linus Torvalds
ffb8e45cf3 Merge tag 'for-linus-20190329' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "Small set of fixes that should go into this series. This contains:

   - compat signal mask fix for io_uring (Arnd)

   - EAGAIN corner case for direct vs buffered writes for io_uring
     (Roman)

   - NVMe pull request from Christoph with various little fixes

   - sbitmap ws_active fix, which caused a perf regression for shared
     tags (me)

   - sbitmap bit ordering fix (Ming)

   - libata on-stack DMA fix (Raymond)"

* tag 'for-linus-20190329' of git://git.kernel.dk/linux-block:
  nvmet: fix error flow during ns enable
  nvmet: fix building bvec from sg list
  nvme-multipath: relax ANA state check
  nvme-tcp: fix an endianess miss-annotation
  libata: fix using DMA buffers on stack
  io_uring: offload write to async worker in case of -EAGAIN
  sbitmap: order READ/WRITE freed instance and setting clear bit
  blk-mq: fix sbitmap ws_active for shared tags
  io_uring: fix big-endian compat signal mask handling
  blk-mq: update comment for blk_mq_hctx_has_pending()
  blk-mq: use blk_mq_put_driver_tag() to put tag
2019-03-29 14:43:07 -07:00
Linus Torvalds
7376e39ad9 Merge tag 'ceph-for-5.1-rc3' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
 "A patch to avoid choking on multipage bvecs in the messenger and a
  small use-after-free fix"

* tag 'ceph-for-5.1-rc3' of git://github.com/ceph/ceph-client:
  ceph: fix use-after-free on symlink traversal
  libceph: fix breakage caused by multipage bvecs
2019-03-29 14:41:09 -07:00
Linus Torvalds
c6503f12d1 Merge tag 'xfs-5.1-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
 "Here are a few fixes for some corruption bugs and uninitialized
  variable problems. The few patches here have gone through a few days
  worth of fstest runs with no new problems observed.

  Changes since last update:

   - Fix a bunch of static checker complaints about uninitialized
     variables and insufficient range checks.

   - Avoid a crash when incore extent map data are corrupt.

   - Disallow FITRIM when we haven't recovered the log and know the
     metadata are stale.

   - Fix a data corruption when doing unaligned overlapping dio writes"

* tag 'xfs-5.1-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: serialize unaligned dio writes against all other dio writes
  xfs: prohibit fstrim in norecovery mode
  xfs: always init bma in xfs_bmapi_write
  xfs: fix btree scrub checking with regards to root-in-inode
  xfs: dabtree scrub needs to range-check level
  xfs: don't trip over uninitialized buffer on extent read of corrupted inode
2019-03-29 14:36:57 -07:00
YueHaibing
23da958803 fs/proc/proc_sysctl.c: fix NULL pointer dereference in put_links
Syzkaller reports:

kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN PTI
CPU: 1 PID: 5373 Comm: syz-executor.0 Not tainted 5.0.0-rc8+ #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
RIP: 0010:put_links+0x101/0x440 fs/proc/proc_sysctl.c:1599
Code: 00 0f 85 3a 03 00 00 48 8b 43 38 48 89 44 24 20 48 83 c0 38 48 89 c2 48 89 44 24 28 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <80> 3c 02 00 0f 85 fe 02 00 00 48 8b 74 24 20 48 c7 c7 60 2a 9d 91
RSP: 0018:ffff8881d828f238 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: ffff8881e01b1140 RCX: ffffffff8ee98267
RDX: 0000000000000007 RSI: ffffc90001479000 RDI: ffff8881e01b1178
RBP: dffffc0000000000 R08: ffffed103ee27259 R09: ffffed103ee27259
R10: 0000000000000001 R11: ffffed103ee27258 R12: fffffffffffffff4
R13: 0000000000000006 R14: ffff8881f59838c0 R15: dffffc0000000000
FS:  00007f072254f700(0000) GS:ffff8881f7100000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fff8b286668 CR3: 00000001f0542002 CR4: 00000000007606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 drop_sysctl_table+0x152/0x9f0 fs/proc/proc_sysctl.c:1629
 get_subdir fs/proc/proc_sysctl.c:1022 [inline]
 __register_sysctl_table+0xd65/0x1090 fs/proc/proc_sysctl.c:1335
 br_netfilter_init+0xbc/0x1000 [br_netfilter]
 do_one_initcall+0xfa/0x5ca init/main.c:887
 do_init_module+0x204/0x5f6 kernel/module.c:3460
 load_module+0x66b2/0x8570 kernel/module.c:3808
 __do_sys_finit_module+0x238/0x2a0 kernel/module.c:3902
 do_syscall_64+0x147/0x600 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x462e99
Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f072254ec58 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
RAX: ffffffffffffffda RBX: 000000000073bf00 RCX: 0000000000462e99
RDX: 0000000000000000 RSI: 0000000020000280 RDI: 0000000000000003
RBP: 00007f072254ec70 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f072254f6bc
R13: 00000000004bcefa R14: 00000000006f6fb0 R15: 0000000000000004
Modules linked in: br_netfilter(+) dvb_usb_dibusb_mc_common dib3000mc dibx000_common dvb_usb_dibusb_common dvb_usb_dw2102 dvb_usb classmate_laptop palmas_regulator cn videobuf2_v4l2 v4l2_common snd_soc_bd28623 mptbase snd_usb_usx2y snd_usbmidi_lib snd_rawmidi wmi libnvdimm lockd sunrpc grace rc_kworld_pc150u rc_core rtc_da9063 sha1_ssse3 i2c_cros_ec_tunnel adxl34x_spi adxl34x nfnetlink lib80211 i5500_temp dvb_as102 dvb_core videobuf2_common videodev media videobuf2_vmalloc videobuf2_memops udc_core lnbp22 leds_lp3952 hid_roccat_ryos s1d13xxxfb mtd vport_geneve openvswitch nf_conncount nf_nat_ipv6 nsh geneve udp_tunnel ip6_udp_tunnel snd_soc_mt6351 sis_agp phylink snd_soc_adau1761_spi snd_soc_adau1761 snd_soc_adau17x1 snd_soc_core snd_pcm_dmaengine ac97_bus snd_compress snd_soc_adau_utils snd_soc_sigmadsp_regmap snd_soc_sigmadsp raid_class hid_roccat_konepure hid_roccat_common hid_roccat c2port_duramar2150 core mdio_bcm_unimac iptable_security iptable_raw iptable_mangle
 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter bpfilter ip6_vti ip_vti ip_gre ipip sit tunnel4 ip_tunnel hsr veth netdevsim devlink vxcan batman_adv cfg80211 rfkill chnl_net caif nlmon dummy team bonding vcan bridge stp llc ip6_gre gre ip6_tunnel tunnel6 tun crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel joydev mousedev ide_pci_generic piix aesni_intel aes_x86_64 ide_core crypto_simd atkbd cryptd glue_helper serio_raw ata_generic pata_acpi i2c_piix4 floppy sch_fq_codel ip_tables x_tables ipv6 [last unloaded: lm73]
Dumping ftrace buffer:
   (ftrace buffer empty)
---[ end trace 770020de38961fd0 ]---

A new dir entry can be created in get_subdir and its 'header->parent' is
set to NULL.  Only after insert_header success, it will be set to 'dir',
otherwise 'header->parent' is set to NULL and drop_sysctl_table is called.
However in err handling path of get_subdir, drop_sysctl_table also be
called on 'new->header' regardless its value of parent pointer.  Then
put_links is called, which triggers NULL-ptr deref when access member of
header->parent.

In fact we have multiple error paths which call drop_sysctl_table() there,
upon failure on insert_links() we also call drop_sysctl_table().And even
in the successful case on __register_sysctl_table() we still always call
drop_sysctl_table().This patch fix it.

Link: http://lkml.kernel.org/r/20190314085527.13244-1-yuehaibing@huawei.com
Fixes: 0e47c99d7f ("sysctl: Replace root_list with links between sysctl_table_sets")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reported-by: Hulk Robot <hulkci@huawei.com>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>    [3.4+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-29 10:01:38 -07:00
Randy Dunlap
2620327852 fs: fs_parser: fix printk format warning
Fix printk format warning (seen on i386 builds) by using ptrdiff format
specifier (%t):

  fs/fs_parser.c:413:6: warning: format `%lu' expects argument of type `long unsigned int', but argument 3 has type `int' [-Wformat=]

Link: http://lkml.kernel.org/r/19432668-ffd3-fbb2-af4f-1c8e48f6cc81@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-29 10:01:38 -07:00
YueHaibing
eebf364806 fs/proc/kcore.c: make kcore_modules static
Fix sparse warning:

  fs/proc/kcore.c:591:19: warning:
   symbol 'kcore_modules' was not declared. Should it be static?

Link: http://lkml.kernel.org/r/20190320135417.13272-1-yuehaibing@huawei.com
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Mukesh Ojha <mojha@codeaurora.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: James Morse <james.morse@arm.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-29 10:01:37 -07:00
Darrick J. Wong
e6a9467ea1 ocfs2: fix inode bh swapping mixup in ocfs2_reflink_inodes_lock
ocfs2_reflink_inodes_lock() can swap the inode1/inode2 variables so that
we always grab cluster locks in order of increasing inode number.

Unfortunately, we forget to swap the inode record buffer head pointers
when we've done this, which leads to incorrect bookkeepping when we're
trying to make the two inodes have the same refcount tree.

This has the effect of causing filesystem shutdowns if you're trying to
reflink data from inode 100 into inode 97, where inode 100 already has a
refcount tree attached and inode 97 doesn't.  The reflink code decides
to copy the refcount tree pointer from 100 to 97, but uses inode 97's
inode record to open the tree root (which it doesn't have) and blows up.
This issue causes filesystem shutdowns and metadata corruption!

Link: http://lkml.kernel.org/r/20190312214910.GK20533@magnolia
Fixes: 29ac8e856c ("ocfs2: implement the VFS clone_range, copy_range, and dedupe_range features")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-29 10:01:37 -07:00