Pull vfs pile #2 from Al Viro:
"Next pile (and there'll be one or two more).
The large piece in this one is getting rid of /proc/*/ns/* weirdness;
among other things, it allows to (finally) make nameidata completely
opaque outside of fs/namei.c, making for easier further cleanups in
there"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
coda_venus_readdir(): use file_inode()
fs/namei.c: fold link_path_walk() call into path_init()
path_init(): don't bother with LOOKUP_PARENT in argument
fs/namei.c: new helper (path_cleanup())
path_init(): store the "base" pointer to file in nameidata itself
make default ->i_fop have ->open() fail with ENXIO
make nameidata completely opaque outside of fs/namei.c
kill proc_ns completely
take the targets of /proc/*/ns/* symlinks to separate fs
bury struct proc_ns in fs/proc
copy address of proc_ns_ops into ns_common
new helpers: ns_alloc_inum/ns_free_inum
make proc_ns_operations work with struct ns_common * instead of void *
switch the rest of proc_ns_operations to working with &...->ns
netns: switch ->get()/->put()/->install()/->inum() to working with &net->ns
make mntns ->get()/->put()/->install()/->inum() work with &mnt_ns->ns
common object embedded into various struct ....ns
Pull isofs and reiserfs fixes from Jan Kara:
"A reiserfs and an isofs fix. They arrived after I sent you my first
pull request and I don't want to delay them unnecessarily till rc2"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
isofs: Fix infinite looping over CE entries
reiserfs: destroy allocated commit workqueue
Pull nfsd updates from Bruce Fields:
"A comparatively quieter cycle for nfsd this time, but still with two
larger changes:
- RPC server scalability improvements from Jeff Layton (using RCU
instead of a spinlock to find idle threads).
- server-side NFSv4.2 ALLOCATE/DEALLOCATE support from Anna
Schumaker, enabling fallocate on new clients"
* 'for-3.19' of git://linux-nfs.org/~bfields/linux: (32 commits)
nfsd4: fix xdr4 count of server in fs_location4
nfsd4: fix xdr4 inclusion of escaped char
sunrpc/cache: convert to use string_escape_str()
sunrpc: only call test_bit once in svc_xprt_received
fs: nfsd: Fix signedness bug in compare_blob
sunrpc: add some tracepoints around enqueue and dequeue of svc_xprt
sunrpc: convert to lockless lookup of queued server threads
sunrpc: fix potential races in pool_stats collection
sunrpc: add a rcu_head to svc_rqst and use kfree_rcu to free it
sunrpc: require svc_create callers to pass in meaningful shutdown routine
sunrpc: have svc_wake_up only deal with pool 0
sunrpc: convert sp_task_pending flag to use atomic bitops
sunrpc: move rq_cachetype field to better optimize space
sunrpc: move rq_splice_ok flag into rq_flags
sunrpc: move rq_dropme flag into rq_flags
sunrpc: move rq_usedeferral flag to rq_flags
sunrpc: move rq_local field to rq_flags
sunrpc: add a generic rq_flags field to svc_rqst and move rq_secure to it
nfsd: minor off by one checks in __write_versions()
sunrpc: release svc_pool_map reference when serv allocation fails
...
Rock Ridge extensions define so called Continuation Entries (CE) which
define where is further space with Rock Ridge data. Corrupted isofs
image can contain arbitrarily long chain of these, including a one
containing loop and thus causing kernel to end in an infinite loop when
traversing these entries.
Limit the traversal to 32 entries which should be more than enough space
to store all the Rock Ridge data.
Reported-by: P J P <ppandit@redhat.com>
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Pull security layer updates from James Morris:
"In terms of changes, there's general maintenance to the Smack,
SELinux, and integrity code.
The IMA code adds a new kconfig option, IMA_APPRAISE_SIGNED_INIT,
which allows IMA appraisal to require signatures. Support for reading
keys from rootfs before init is call is also added"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (23 commits)
selinux: Remove security_ops extern
security: smack: fix out-of-bounds access in smk_parse_smack()
VFS: refactor vfs_read()
ima: require signature based appraisal
integrity: provide a hook to load keys when rootfs is ready
ima: load x509 certificate from the kernel
integrity: provide a function to load x509 certificate from the kernel
integrity: define a new function integrity_read_file()
Security: smack: replace kzalloc with kmem_cache for inode_smack
Smack: Lock mode for the floor and hat labels
ima: added support for new kernel cmdline parameter ima_template_fmt
ima: allocate field pointers array on demand in template_desc_init_fields()
ima: don't allocate a copy of template_fmt in template_desc_init_fields()
ima: display template format in meas. list if template name length is zero
ima: added error messages to template-related functions
ima: use atomic bit operations to protect policy update interface
ima: ignore empty and with whitespaces policy lines
ima: no need to allocate entry for comment
ima: report policy load status
ima: use path names cache
...
Here's the set of driver core patches for 3.19-rc1.
They are dominated by the removal of the .owner field in platform
drivers. They touch a lot of files, but they are "simple" changes, just
removing a line in a structure.
Other than that, a few minor driver core and debugfs changes. There are
some ath9k patches coming in through this tree that have been acked by
the wireless maintainers as they relied on the debugfs changes.
Everything has been in linux-next for a while.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iEYEABECAAYFAlSOD20ACgkQMUfUDdst+ylLPACg2QrW1oHhdTMT9WI8jihlHVRM
53kAoLeteByQ3iVwWurwwseRPiWa8+MI
=OVRS
-----END PGP SIGNATURE-----
Merge tag 'driver-core-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core update from Greg KH:
"Here's the set of driver core patches for 3.19-rc1.
They are dominated by the removal of the .owner field in platform
drivers. They touch a lot of files, but they are "simple" changes,
just removing a line in a structure.
Other than that, a few minor driver core and debugfs changes. There
are some ath9k patches coming in through this tree that have been
acked by the wireless maintainers as they relied on the debugfs
changes.
Everything has been in linux-next for a while"
* tag 'driver-core-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (324 commits)
Revert "ath: ath9k: use debugfs_create_devm_seqfile() helper for seq_file entries"
fs: debugfs: add forward declaration for struct device type
firmware class: Deletion of an unnecessary check before the function call "vunmap"
firmware loader: fix hung task warning dump
devcoredump: provide a one-way disable function
device: Add dev_<level>_once variants
ath: ath9k: use debugfs_create_devm_seqfile() helper for seq_file entries
ath: use seq_file api for ath9k debugfs files
debugfs: add helper function to create device related seq_file
drivers/base: cacheinfo: remove noisy error boot message
Revert "core: platform: add warning if driver has no owner"
drivers: base: support cpu cache information interface to userspace via sysfs
drivers: base: add cpu_device_create to support per-cpu devices
topology: replace custom attribute macros with standard DEVICE_ATTR*
cpumask: factor out show_cpumap into separate helper function
driver core: Fix unbalanced device reference in drivers_probe
driver core: fix race with userland in device_add()
sysfs/kernfs: make read requests on pre-alloc files use the buffer.
sysfs/kernfs: allow attributes to request write buffer be pre-allocated.
fs: sysfs: return EGBIG on write if offset is larger than file size
...
LZ4 is a lightweight compression algorithm which can be used
on embedded systems to reduce CPU and memory overhead (in comparison
to the standard zlib compression).
These patches add the wrapper code to allow Squashfs to use
the existing LZ4 decompression code, and the necessary configuration
option.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJUjgIJAAoJEJAch/D1fbHUArwP/iDiDSpqxdfTQHwUKxB57skO
0iBzg6bXPlqwmsnllegg0SV0vOvFjpWiXWpOOtAVCJlfbov8DgUndsigpvG3UhD/
qJpNXAJ+xSdOzBqj3bS7SOu2DwPY8Gz4rxGQcNN3PsuOVR/EUgAnNlv22ZHY10A5
XQVyPbkwZ73TrZ2uKA8leWArFtCbM4oYGpxP+ramEox8nVFEOtixn5IcX5WkbGEL
Yt0NRw8K8vDIIETWVariugUFE4C1olFk+YmqqAw7cmDGJ70cEg5jh9ocNkwDIZPj
I9BNtkggBRMaCPwGsH6IvahMFUyWLQUgGayfY/fgbRiB9ZuYIQ1lyPDhzbgWczoE
o34eXAIDdmfPrmYlDEBDkYnXXtwuYqdOVYOtcEnyFEYqpHfaeS2h2s9nTiM+rz21
v0UEaDRmPtlkK/ZdLKUsrOf+8y9ejkT0R67swFaguHshL6EHey7X5ghmOuwCoL9x
fzGWtPFR+Nbqga5T3dwf+apvyUVrPaw6gZu36NNim2779ZgpnIPzW6MEYUMhtXCn
ef2+NvS9AeGyo7kiqlNQrihQWZSN0W/AiVsEeulzk5h+adzSNQ5eipzAO9DAAp16
8muY4nq51bOGVaWzqJz/KacCmt7i0qUdmS1p4l2uqPp9gH/s/S91yrYn/iszf3AV
CpwU2i9g3nQu9ecDc1Os
=mr7J
-----END PGP SIGNATURE-----
Merge tag 'squashfs-updates' of git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-next
Pull squashfs update from Phillip Lougher:
"These patches optionally add LZ4 compression support to Squashfs.
LZ4 is a lightweight compression algorithm which can be used on
embedded systems to reduce CPU and memory overhead (in comparison to
the standard zlib compression).
These patches add the wrapper code to allow Squashfs to use the
existing LZ4 decompression code, and the necessary configuration
option"
* tag 'squashfs-updates' of git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-next:
Squashfs: Add LZ4 compression configuration option
Squashfs: add LZ4 compression support
Pull aio updates from Benjamin LaHaise.
* git://git.kvack.org/~bcrl/aio-next:
aio: Skip timer for io_getevents if timeout=0
aio: Make it possible to remap aio ring
Commit 2ae83bf938 ("[CIFS] Fix setting time before epoch (negative
time values)") changed "u64 t" to "s64 t", which makes do_div() complain
about a pointer signedness mismatch:
CC fs/cifs/netmisc.o
In file included from ./arch/mips/include/asm/div64.h:12:0,
from include/linux/kernel.h:124,
from include/linux/list.h:8,
from include/linux/wait.h:6,
from include/linux/net.h:23,
from fs/cifs/netmisc.c:25:
fs/cifs/netmisc.c: In function ‘cifs_NTtimeToUnix’:
include/asm-generic/div64.h:43:28: warning: comparison of distinct pointer types lacks a cast [enabled by default]
(void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
^
fs/cifs/netmisc.c:941:22: note: in expansion of macro ‘do_div’
ts.tv_nsec = (long)do_div(t, 10000000) * 100;
Introduce a temporary "u64 abs_t" variable to fix this.
Signed-off-by: Kevin Cernekee <cernekee@gmail.com>
Signed-off-by: Steve French <steve.french@primarydata.com>
We have encountered failures when When testing smb2 mounts on ppc64
machines when using both Samba as well as Windows 2012.
On poking around, the problem was determined to be caused by the
high endian MessageID passed in the header for smb2. On checking the
corresponding MID for smb1 is converted to LE before being sent on the
wire.
We have tested this patch successfully on a ppc64 machine.
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
In this case, it is basically a polling. Let's not involve timer at all
because that would hurt performance for application event loops.
In an arbitrary test I've done, io_getevents syscall elapsed time
reduces from 50000+ nanoseconds to a few hundereds.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
There are actually two issues this patch addresses. Let me start with
the one I tried to solve in the beginning.
So, in the checkpoint-restore project (criu) we try to dump tasks'
state and restore one back exactly as it was. One of the tasks' state
bits is rings set up with io_setup() call. There's (almost) no problems
in dumping them, there's a problem restoring them -- if I dump a task
with aio ring originally mapped at address A, I want to restore one
back at exactly the same address A. Unfortunately, the io_setup() does
not allow for that -- it mmaps the ring at whatever place mm finds
appropriate (it calls do_mmap_pgoff() with zero address and without
the MAP_FIXED flag).
To make restore possible I'm going to mremap() the freshly created ring
into the address A (under which it was seen before dump). The problem is
that the ring's virtual address is passed back to the user-space as the
context ID and this ID is then used as search key by all the other io_foo()
calls. Reworking this ID to be just some integer doesn't seem to work, as
this value is already used by libaio as a pointer using which this library
accesses memory for aio meta-data.
So, to make restore work we need to make sure that
a) ring is mapped at desired virtual address
b) kioctx->user_id matches this value
Having said that, the patch makes mremap() on aio region update the
kioctx's user_id and mmap_base values.
Here appears the 2nd issue I mentioned in the beginning of this mail.
If (regardless of the C/R dances I do) someone creates an io context
with io_setup(), then mremap()-s the ring and then destroys the context,
the kill_ioctx() routine will call munmap() on wrong (old) address.
This will result in a) aio ring remaining in memory and b) some other
vma get unexpectedly unmapped.
What do you think?
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Pull block driver core update from Jens Axboe:
"This is the pull request for the core block IO changes for 3.19. Not
a huge round this time, mostly lots of little good fixes:
- Fix a bug in sysfs blktrace interface causing a NULL pointer
dereference, when enabled/disabled through that API. From Arianna
Avanzini.
- Various updates/fixes/improvements for blk-mq:
- A set of updates from Bart, mostly fixing buts in the tag
handling.
- Cleanup/code consolidation from Christoph.
- Extend queue_rq API to be able to handle batching issues of IO
requests. NVMe will utilize this shortly. From me.
- A few tag and request handling updates from me.
- Cleanup of the preempt handling for running queues from Paolo.
- Prevent running of unmapped hardware queues from Ming Lei.
- Move the kdump memory limiting check to be in the correct
location, from Shaohua.
- Initialize all software queues at init time from Takashi. This
prevents a kobject warning when CPUs are brought online that
weren't online when a queue was registered.
- Single writeback fix for I_DIRTY clearing from Tejun. Queued with
the core IO changes, since it's just a single fix.
- Version X of the __bio_add_page() segment addition retry from
Maurizio. Hope the Xth time is the charm.
- Documentation fixup for IO scheduler merging from Jan.
- Introduce (and use) generic IO stat accounting helpers for non-rq
drivers, from Gu Zheng.
- Kill off artificial limiting of max sectors in a request from
Christoph"
* 'for-3.19/core' of git://git.kernel.dk/linux-block: (26 commits)
bio: modify __bio_add_page() to accept pages that don't start a new segment
blk-mq: Fix uninitialized kobject at CPU hotplugging
blktrace: don't let the sysfs interface remove trace from running list
blk-mq: Use all available hardware queues
blk-mq: Micro-optimize bt_get()
blk-mq: Fix a race between bt_clear_tag() and bt_get()
blk-mq: Avoid that __bt_get_word() wraps multiple times
blk-mq: Fix a use-after-free
blk-mq: prevent unmapped hw queue from being scheduled
blk-mq: re-check for available tags after running the hardware queue
blk-mq: fix hang in bt_get()
blk-mq: move the kdump check to blk_mq_alloc_tag_set
blk-mq: cleanup tag free handling
blk-mq: use 'nr_cpu_ids' as highest CPU ID count for hwq <-> cpu map
blk: introduce generic io stat accounting help function
blk-mq: handle the single queue case in blk_mq_hctx_next_cpu
genhd: check for int overflow in disk_expand_part_tbl()
blk-mq: add blk_mq_free_hctx_request()
blk-mq: export blk_mq_free_request()
blk-mq: use get_cpu/put_cpu instead of preempt_disable/preempt_enable
...
destroy_list is used to track marks which still need waiting for srcu
period end before they can be freed. However by the time mark is added to
destroy_list it isn't in group's list of marks anymore and thus we can
reuse fsnotify_mark->g_list for queueing into destroy_list. This saves
two pointers for each fsnotify_mark.
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Eric Paris <eparis@redhat.com>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There's a lot of common code in inode and mount marks handling. Factor it
out to a common helper function.
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Eric Paris <eparis@redhat.com>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The fanotify and the inotify API can be used to monitor changes of the
file system. System call fallocate() modifies files. Hence it should
trigger the corresponding fanotify (FAN_MODIFY) and inotify (IN_MODIFY)
events. The most interesting case is FALLOC_FL_COLLAPSE_RANGE because
this value allows to create arbitrary file content from random data.
This patch adds the missing call to fsnotify_modify().
The FAN_MODIFY and IN_MODIFY event will be created when fallocate()
succeeds. It will even be created if the file length remains unchanged,
e.g. when calling fanotify with flag FALLOC_FL_KEEP_SIZE.
This logic was primarily chosen to keep the coding simple.
It resembles the logic of the write() system call.
When we call write() we always create a FAN_MODIFY event, even in the case
of overwriting with identical data.
Events FAN_MODIFY and IN_MODIFY do not provide any guarantee that data was
actually changed.
Furthermore even if if the filesize remains unchanged, fallocate() may
influence whether a subsequent write() will succeed and hence the
fallocate() call may be considered a modification.
The fallocate(2) man page teaches: After a successful call, subsequent
writes into the range specified by offset and len are guaranteed not to
fail because of lack of disk space.
So calling fallocate(fd, FALLOC_FL_KEEP_SIZE, offset, len) may result in
different outcomes of a subsequent write depending on the values of offset
and len.
Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@parisplace.org>
Cc: John McCutchan <john@johnmccutchan.com>
Cc: Robert Love <rlove@rlove.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Based on ext2_direct_IO
Tested with O_DIRECT file open and sysbench/mariadb with 1% written
queries improvement (update_non_index test) on a volume created with
mkaffs.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-Remove ErrorBuffer and use %pV
-Add __printf to enable argument mistmatch warnings
Original patch by Joe Perches.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patchset adds execveat(2) for x86, and is derived from Meredydd
Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
The primary aim of adding an execveat syscall is to allow an
implementation of fexecve(3) that does not rely on the /proc filesystem,
at least for executables (rather than scripts). The current glibc version
of fexecve(3) is implemented via /proc, which causes problems in sandboxed
or otherwise restricted environments.
Given the desire for a /proc-free fexecve() implementation, HPA suggested
(https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
an appropriate generalization.
Also, having a new syscall means that it can take a flags argument without
back-compatibility concerns. The current implementation just defines the
AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
added in future -- for example, flags for new namespaces (as suggested at
https://lkml.org/lkml/2006/7/11/474).
Related history:
- https://lkml.org/lkml/2006/12/27/123 is an example of someone
realizing that fexecve() is likely to fail in a chroot environment.
- http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
documenting the /proc requirement of fexecve(3) in its manpage, to
"prevent other people from wasting their time".
- https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
problem where a process that did setuid() could not fexecve()
because it no longer had access to /proc/self/fd; this has since
been fixed.
This patch (of 4):
Add a new execveat(2) system call. execveat() is to execve() as openat()
is to open(): it takes a file descriptor that refers to a directory, and
resolves the filename relative to that.
In addition, if the filename is empty and AT_EMPTY_PATH is specified,
execveat() executes the file to which the file descriptor refers. This
replicates the functionality of fexecve(), which is a system call in other
UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and
so relies on /proc being mounted).
The filename fed to the executed program as argv[0] (or the name of the
script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
(for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
reflecting how the executable was found. This does however mean that
execution of a script in a /proc-less environment won't work; also, script
execution via an O_CLOEXEC file descriptor fails (as the file will not be
accessible after exec).
Based on patches by Meredydd Luff.
Signed-off-by: David Drysdale <drysdale@google.com>
Cc: Meredydd Luff <meredydd@senatehouse.org>
Cc: Shuah Khan <shuah.kh@samsung.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Rich Felker <dalias@aerifal.cx>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When running FSX with direct I/O mode, fsx resulted in DATA past EOF issues.
fsx ./file2 -Z -r 4096 -w 4096
...
..
truncating to largest ever: 0x907c
fallocating to largest ever: 0x11137
truncating to largest ever: 0x2c6fe
truncating to largest ever: 0x2cfdf
fallocating to largest ever: 0x40000
Mapped Read: non-zero data past EOF (0x18628) page offset 0x629 is 0x2a4e
...
..
The reason being, it is doing a truncate down, but the zeroing does not
happen on the last block boundary when offset is not aligned. Even though
it calls truncate_setsize()->truncate_inode_pages()->
truncate_inode_pages_range() and considers the partial zeroout but it
retrieves the page using find_lock_page() - which only looks the page in
the cache. So, zeroing out does not happen in case of direct IO.
Make a truncate page based around block_truncate_page for FAT filesystem
and invoke that helper to zerout in case the offset is not aligned with
the blocksize.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Coverity id: 1042674
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit 058504edd0 ("fs/seq_file: fallback to vmalloc allocation"),
seq_buf_alloc() falls back to vmalloc() when the kmalloc() for contiguous
memory fails. This was done to address order-4 slab allocations for
reading /proc/stat on large machines and noticed because
PAGE_ALLOC_COSTLY_ORDER < 4, so there is no infinite loop in the page
allocator when allocating new slab for such high-order allocations.
Contiguous memory isn't necessary for caller of seq_buf_alloc(), however.
Other GFP_KERNEL high-order allocations that are <=
PAGE_ALLOC_COSTLY_ORDER will simply loop forever in the page allocator and
oom kill processes as a result.
We don't want to kill processes so that we can allocate contiguous memory
in situations when contiguous memory isn't necessary.
This patch does the kmalloc() allocation with __GFP_NORETRY for high-order
allocations. This still utilizes memory compaction and direct reclaim in
the allocation path, the only difference is that it will fail immediately
instead of oom kill processes when out of memory.
[akpm@linux-foundation.org: add comment]
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The slab shrinkers are currently invoked from the zonelist walkers in
kswapd, direct reclaim, and zone reclaim, all of which roughly gauge the
eligible LRU pages and assemble a nodemask to pass to NUMA-aware
shrinkers, which then again have to walk over the nodemask. This is
redundant code, extra runtime work, and fairly inaccurate when it comes to
the estimation of actually scannable LRU pages. The code duplication will
only get worse when making the shrinkers cgroup-aware and requiring them
to have out-of-band cgroup hierarchy walks as well.
Instead, invoke the shrinkers from shrink_zone(), which is where all
reclaimers end up, to avoid this duplication.
Take the count for eligible LRU pages out of get_scan_count(), which
considers many more factors than just the availability of swap space, like
zone_reclaimable_pages() currently does. Accumulate the number over all
visited lruvecs to get the per-zone value.
Some nodes have multiple zones due to memory addressing restrictions. To
avoid putting too much pressure on the shrinkers, only invoke them once
for each such node, using the class zone of the allocation as the pivot
zone.
For now, this integrates the slab shrinking better into the reclaim logic
and gets rid of duplicative invocations from kswapd, direct reclaim, and
zone reclaim. It also prepares for cgroup-awareness, allowing
memcg-capable shrinkers to be added at the lruvec level without much
duplication of both code and runtime work.
This changes kswapd behavior, which used to invoke the shrinkers for each
zone, but with scan ratios gathered from the entire node, resulting in
meaningless pressure quantities on multi-zone nodes.
Zone reclaim behavior also changes. It used to shrink slabs until the
same amount of pages were shrunk as were reclaimed from the LRUs. Now it
merely invokes the shrinkers once with the zone's scan ratio, which makes
the shrinkers go easier on caches that implement aging and would prefer
feeding back pressure from recently used slab objects to unused LRU pages.
[vdavydov@parallels.com: assure class zone is populated]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The i_mmap_mutex is a close cousin of the anon vma lock, both protecting
similar data, one for file backed pages and the other for anon memory. To
this end, this lock can also be a rwsem. In addition, there are some
important opportunities to share the lock when there are no tree
modifications.
This conversion is straightforward. For now, all users take the write
lock.
[sfr@canb.auug.org.au: update fremap.c]
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since the rework of the sparse interrupt code to actually free the
unused interrupt descriptors there exists a race between the /proc
interfaces to the irq subsystem and the code which frees the interrupt
descriptor.
CPU0 CPU1
show_interrupts()
desc = irq_to_desc(X);
free_desc(desc)
remove_from_radix_tree();
kfree(desc);
raw_spinlock_irq(&desc->lock);
/proc/interrupts is the only interface which can actively corrupt
kernel memory via the lock access. /proc/stat can only read from freed
memory. Extremly hard to trigger, but possible.
The interfaces in /proc/irq/N/ are not affected by this because the
removal of the proc file is serialized in procfs against concurrent
readers/writers. The removal happens before the descriptor is freed.
For architectures which have CONFIG_SPARSE_IRQ=n this is a non issue
as the descriptor is never freed. It's merely cleared out with the irq
descriptor lock held. So any concurrent proc access will either see
the old correct value or the cleared out ones.
Protect the lookup and access to the irq descriptor in
show_interrupts() with the sparse_irq_lock.
Provide kstat_irqs_usr() which is protecting the lookup and access
with sparse_irq_lock and switch /proc/stat to use it.
Document the existing kstat_irqs interfaces so it's clear that the
caller needs to take care about protection. The users of these
interfaces are either not affected due to SPARSE_IRQ=n or already
protected against removal.
Fixes: 1f5a5b87f7 "genirq: Implement a sane sparse_irq allocator"
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
This patch adds two macros:
OVL_XATTR_PRE_NAME and OVL_XATTR_PRE_LEN
to present ovl_xattr name prefix and its length. Also, a
new macro OVL_XATTR_OPAQUE is introduced to replace old
*ovl_opaque_xattr*.
Fix the length of "trusted.overlay." to *16*.
Signed-off-by: hujianyang <hujianyang@huawei.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Allow "lowerdir=" option to contain multiple lower directories separated by
a colon (e.g. "lowerdir=/bin:/usr/bin"). Colon characters in filenames can
be escaped with a backslash.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Make "upperdir=" mount option optional. If "upperdir=" is not given, then
the "workdir=" option is also optional (and ignored if given).
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Move allocation of root entry above to where it's needed.
Move initializations related to upperdir and workdir near each other.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
"Suppose you have in one of the lower layers a filesystem with
->lookup()-enforced upper limit on name length. Pretty much every local fs
has one, but... they are not all equal. 255 characters is the common upper
limit, but e.g. jffs2 stops at 254, minixfs upper limit is somewhere from
14 to 60, depending upon version, etc. You are doing a lookup for
something that is present in upper layer, but happens to be too long for
one of the lower layers. Too bad - ENAMETOOLONG for you..."
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Not checking whiteouts on lowest layer was an optimization (there's nothing
to white out there), but it could result in inconsitent behavior when a
layer previously used as upper/middle is later used as lowest.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
If multiple lower layers exist, merge them as well in readdir according to
the same rules as merging upper with lower. I.e. take whiteouts and opaque
directories into account on all but the lowers layer.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Add helper to iterate through all the layers, starting from the upper layer
(if exists) and continuing down through the lower layers.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Add multiple lower layers to 'struct ovl_fs' and 'struct ovl_entry'.
ovl_entry will have an array of paths, instead of just the dentry. This
allows a compact array containing just the layers which exist at current
point in the tree (which is expected to be a small number for the majority
of dentries).
The number of layers is not limited by this infrastructure.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
When removing an empty opaque directory, then it makes no sense to replace
it with an exact replica of itself before removal.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Don't make a separate pass for checking whiteouts, since we can do it while
reading the upper directory.
This will make it easier to handle multiple layers.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
When resirefs is trying to mount a partition, it creates a commit
workqueue (sbi->commit_wq). But when mount fails later, the workqueue
is not freed.
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Reported-by: auxsvr@gmail.com
Reported-by: Benoît Monin <benoit.monin@gmx.fr>
Cc: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org # >= 3.16
Cc: reiserfs-devel@vger.kernel.org
Fixes: 797d9016ce
Signed-off-by: Jan Kara <jack@suse.cz>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUi0B6AAoJEKurIx+X31iByH8P/jfMgzyUO+KpJMA1DbgCAG7x
WPJgbMUyPwB63DH09RyMEmiwf61Rl1klXTPVNY0Dnj7qRJOmpB9U3vGIfO4HpD84
5IZMBlc+Jl+kJCxSAJYbTJTZLsIMjFGOfuVTvlY+HnMBitQVBumKptmC0DoBBqgz
yYy5MHRMaVoHcogyMyBiknmxdxu6/ruUKY+6yyvdUESt0SCcJG8V6Qik7TMmnx47
NvIIPzfibvvLLnd8IOEj2fwh8XMtJdfcCxPpAEvEaNq0jZEDF9K22jttTQvl9r92
NQf7JKQQrNfzloRZ3flKax5ZMGi9RkcirTLLdJ4I2xMGVHOA4XUAjsSCYR6INuuJ
Ox00FnuiIrADNw37m52Y+ujPTF1C2PQUNK69gwsLd84MSjy+95F2dlC5cC3Yt4N5
rpstXxWELZTqjMGD8GTPOpv6zlg799IbFexr4H6KTc+47EX0MNayJiI6L597gYnq
gIiPmDnnz6WlWp4HHgBIwjNAH3Tbf/uU3MlgzqS3Ftd7YkYmLnxvClhrwgErviFn
Nfnz2LtGuMxMHSt0uSWxODVEaR4reKRVJBvhRSGWL1PufylEyt0YWayiqpohuKD9
6X/RufWK5qdCBHytoGyMUZ57oqxth9QSVG4RBkGPmaZgMq/5DdyOhBfW0yInjMuo
AuDMmqrU5yFTitLMGcsG
=kcmD
-----END PGP SIGNATURE-----
Merge tag 'please-pull-morepstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux
Pull pstore update #2 from Tony Luck:
"Couple of pstore-ram enhancements to allow use of different memory
attributes"
* tag 'please-pull-morepstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
pstore-ram: Allow optional mapping with pgprot_noncached
pstore-ram: Fix hangs by using write-combine mappings
Pull btrfs update from Chris Mason:
"From a feature point of view, most of the code here comes from Miao
Xie and others at Fujitsu to implement scrubbing and replacing devices
on raid56. This has been in development for a while, and it's a big
improvement.
Filipe and Josef have a great assortment of fixes, many of which solve
problems corruptions either after a crash or in error conditions. I
still have a round two from Filipe for next week that solves
corruptions with discard and block group removal"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (62 commits)
Btrfs: make get_caching_control unconditionally return the ctl
Btrfs: fix unprotected deletion from pending_chunks list
Btrfs: fix fs mapping extent map leak
Btrfs: fix memory leak after block remove + trimming
Btrfs: make btrfs_abort_transaction consider existence of new block groups
Btrfs: fix race between writing free space cache and trimming
Btrfs: fix race between fs trimming and block group remove/allocation
Btrfs, replace: enable dev-replace for raid56
Btrfs: fix freeing used extents after removing empty block group
Btrfs: fix crash caused by block group removal
Btrfs: fix invalid block group rbtree access after bg is removed
Btrfs, raid56: fix use-after-free problem in the final device replace procedure on raid56
Btrfs, replace: write raid56 parity into the replace target device
Btrfs, replace: write dirty pages into the replace target device
Btrfs, raid56: support parity scrub on raid56
Btrfs, raid56: use a variant to record the operation type
Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted
Btrfs, raid56: don't change bbio and raid_map
Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block
Btrfs: remove noused bbio_ret in __btrfs_map_block in condition
...
* UBI debug messages now include the UBI device number. This change
is responsible for the big diffstat since it touched every debugging
print statement.
* An Xattr bug-fix which fixes SELinux support
* Several error path fixes in UBI/UBIFS
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUixUjAAoJECmIfjd9wqK0n0oQAL5fAGFszUnmPa+NHi1IgDlv
dUBcN9GrXM8CN5LxQX2NH4WuxyY9gZpQsZtDXolutICbHT55De/plQyJUE5XHXnq
U2SHir1wsHnUeDJEqlAKE4zXWUEwY4C5mqDZh8fPUM+pyFNmlt4L/mi4hjkFmpqt
1gPqJ9boa3fwrT3jdaClJTXN5d+8Y1JahQwuSINsX6rInB/cfh2FFZ2fxWWogxvf
BoN1iQdbWJrmkd2KLLbQqOeI5LwBT5jdf0Z0hkwHEsDCA0ZiKBCoQRZKjkwlaTCZ
JSQ2Fv/RkUGg+YJJgC5xJnpR4VlGyn6X2z7/W5idhKzELlmHrKaw3bXZJJoTElPr
kRSpcq02eF3pJKLMpvFuV6rLpqbkpGDML3+VtZ1Fta3cqQ0E9TvrSJAII5j+EiNG
D03IkCVX5ozmeZr08DmSx8W7HJ4beMs5E8eDkEaAS7AhQU5pmCcai1vZJQZjrsV0
5yCmYsArN6yYS79mrH6eQuoKsJ48mOoU6zC8vmvu5uar6HzfK9eC6J3JndH1lGp1
iDXJ9TS1AX/jFdZWAdyJic29TQi1hPhZITdhLfT11MZtYLWT1CpvNgBa4MefpD0X
YBsgVjQXA96F4Aix3ILWGEyaKbHUOmqIBpKy95tRpGgMxwlpagsqm2jn1e8Sb4Kd
H9YCeVsracNeK1E2ua13
=sZ7o
-----END PGP SIGNATURE-----
Merge tag 'upstream-3.19-rc1' of git://git.infradead.org/linux-ubifs
Pull UBI/UBIFS updates from Artem Bityutskiy:
"This includes the following UBI/UBIFS changes:
- UBI debug messages now include the UBI device number. This change
is responsible for the big diffstat since it touched every
debugging print statement.
- An Xattr bug-fix which fixes SELinux support
- Several error path fixes in UBI/UBIFS"
* tag 'upstream-3.19-rc1' of git://git.infradead.org/linux-ubifs:
UBI: Fix invalid vfree()
UBI: Fix double free after do_sync_erase()
UBIFS: fix a couple bugs in UBIFS xattr length calculation
UBI: vtbl: Use ubi_eba_atomic_leb_change()
UBI: Extend UBI layer debug/messaging capabilities
UBIFS: fix budget leak in error path
This update contains:
o more on-disk format header consolidation
o move some structures shared with userspace to libxfs
o new per-mount workqueue to fix for deadlocks between nested loop
mounted filesystems
o various bug fixes for ENOSPC, stats, quota off and preallocation
o a bunch of compiler warning fixes for set-but-unused variables
o various code cleanups
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJUihOWAAoJEK3oKUf0dfodYbkP/iXuIYOhpmc1rUORMDl2JDBc
iTjXqz1Ydp6vJrq2+3qeAsCbJciNdZ72eNKdvgRbFAN4BW8tv1Wc9QR5m2ZIpCkf
7buCzbkI64j9HoNAiZJhrMp/eyJ0X1hRGk1ANUaBT9ouXWOBDaOD/sNj9cMptWOA
72BpTMN0FszAJxW6rNEk1M/i+W2ly0qmD0QJPQU18Z62NU5E+D/uMkg2xif4dhwK
CSNMgCIv0X1qmve2lMOgwHbgkmHRwbXKSb4Z5vV8pDUh49tkRtxJ2ky7mE7aglrq
pjChpEqDktkCL/RHAT3XJ77tRIyBXwvpC7ewHXiYBY83OcGfRFv0jMCJ+R+1b3KD
p1faOVwd/H0tStd+0rF+tMMn8TuujQ597upLGhWdy1BpY3nnkJ7iJ8lyJv+aiCzr
Oh3DvyX1XgxnEo7yVr+x64TFz/GPkyuvVPSfL3gspqEZErC4BN+AEP/3fF+5SGed
x9QplB+lcy7IpzB+HURPZL4TqWl4Ib29pArZY1mQ1rJz6IFFbDSzj6lo36YDBrP8
HRG2LDxgc1udPPMxdZ3PAV3nt4/ufaxSTmT5HGV0Aj+hjkSfLvBDFMuVz9t6vfn9
YN3ocKWxJr2QISc0fcQ/hsBDiHVyoFgDOikBAetaqpdoM7OM7FHtLXtwLDILldx9
DZAIS0msNrjc7gGCrbxj
=2SJP
-----END PGP SIGNATURE-----
Merge tag 'xfs-for-linus-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs
Pull xfs update from Dave Chinner:
"There's relatively little change in this update; it is mainly bug
fixes, cleanups and more of the on-going libxfs restructuring and
on-disk format header consolidation work.
Details:
- more on-disk format header consolidation
- move some structures shared with userspace to libxfs
- new per-mount workqueue to fix for deadlocks between nested loop
mounted filesystems
- various bug fixes for ENOSPC, stats, quota off and preallocation
- a bunch of compiler warning fixes for set-but-unused variables
- various code cleanups"
* tag 'xfs-for-linus-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (24 commits)
xfs: split metadata and log buffer completion to separate workqueues
xfs: fix set-but-unused warnings
xfs: move type conversion functions to xfs_dir.h
xfs: move ftype conversion functions to libxfs
xfs: lobotomise xfs_trans_read_buf_map()
xfs: active inodes stat is broken
xfs: cleanup xfs_bmse_merge returns
xfs: cleanup xfs_bmse_shift_one goto mess
xfs: fix premature enospc on inode allocation
xfs: overflow in xfs_iomap_eof_align_last_fsb
xfs: fix simple_return.cocci warning in xfs_bmse_shift_one
xfs: fix simple_return.cocci warning in xfs_file_readdir
libxfs: fix simple_return.cocci warnings
xfs: remove unnecessary null checks
xfs: merge xfs_inum.h into xfs_format.h
xfs: move most of xfs_sb.h to xfs_format.h
xfs: merge xfs_ag.h into xfs_format.h
xfs: move acl structures to xfs_format.h
xfs: merge xfs_dinode.h into xfs_format.h
xfs: catch invalid negative blknos in _xfs_buf_find()
...
fixes, which should improve CPU utilization and potential soft lockups
under heavy memory pressure, and Eric Whitney's bigalloc fixes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJUiRUwAAoJENNvdpvBGATwltQP/3sjHtFw+RUvKgQ8vX9M2THk
4b9j0ja0mrD3ObTXUxdDuOh1q09MsfSUiOYK6KZOav3nO/dRODqZnWgXz/zJt3LC
R97s4velgzZi3F2ijnLiCo5RVZahN9xs8bUHZ85orMIr5wogwGdaUpnoqZSg0Ehr
PIFnTNORyNXBwEm3XPjUmENTdyq9FZ8DsS6ACFzgFi79QTSyJFEM4LAl2XaqwMGV
fVhNwnOGIyT8lHZAtDcobkaC86NjakmpW2Ip3p9/UEQtynh16UeVXKEO3K7CcQ+L
YJRDNnSIlGpR1OJp+v6QJPUd8q4fc/8JW9AxxsLak0eqkszuB+MxoQXOCFV5AWaf
jrs4TV3y0hCuB4OwuYUpnfcU1o+O7p39MqXMv8SA1ZBPbijN/LQSMErFtXj2oih6
3gJHUWLwELGeR+d9JlI29zxhOeOIotX255UBgj2oasQ0X3BW3qAgQ4LmP3QY90Pm
BUmxiMoIWB9N3kU4XQGf+Kyy8JeMLJj0frHDxI3XLz+B+IlWCCkBH6y3AD/a13kS
HHMMLOwHGEs0lYEKsm89dkcij5GuKd8eKT8Q0+CvKD9Z6HPdYvQxoazmF87Q6j/7
ZmshaVxtWaLpNbDaXVg+IgZifJAN0+mVzVHRhY9TSjx8k9qLdSgSEqYWjkSjx9Ij
nNB2zVrHZDMvZ7MCZy85
=ZrTc
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Lots of bugs fixes, including Zheng and Jan's extent status shrinker
fixes, which should improve CPU utilization and potential soft lockups
under heavy memory pressure, and Eric Whitney's bigalloc fixes"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (26 commits)
ext4: ext4_da_convert_inline_data_to_extent drop locked page after error
ext4: fix suboptimal seek_{data,hole} extents traversial
ext4: ext4_inline_data_fiemap should respect callers argument
ext4: prevent fsreentrance deadlock for inline_data
ext4: forbid journal_async_commit in data=ordered mode
jbd2: remove unnecessary NULL check before iput()
ext4: Remove an unnecessary check for NULL before iput()
ext4: remove unneeded code in ext4_unlink
ext4: don't count external journal blocks as overhead
ext4: remove never taken branch from ext4_ext_shift_path_extents()
ext4: create nojournal_checksum mount option
ext4: update comments regarding ext4_delete_inode()
ext4: cleanup GFP flags inside resize path
ext4: introduce aging to extent status tree
ext4: cleanup flag definitions for extent status tree
ext4: limit number of scanned extents in status tree shrinker
ext4: move handling of list of shrinkable inodes into extent status code
ext4: change LRU to round-robin in extent status tree shrinker
ext4: cache extent hole in extent status tree for ext4_da_map_blocks()
ext4: fix block reservation for bigalloc filesystems
...
Make the extent buffer allocation interface consistent. Cloned eb will
set a valid fs_info. For dummy eb, we can drop the length parameter and
set it from fs_info.
The built-in sanity checks may pass a NULL fs_info that's queried for
nodesize, but we know it's 4096.
Signed-off-by: David Sterba <dsterba@suse.cz>
Finally it's clear that the requested blocksize is always equal to
nodesize, with one exception, the superblock.
Superblock has fixed size regardless of the metadata block size, but
uses the same helpers to initialize sys array/chunk tree and to work
with the chunk items. So it pretends to be an extent_buffer for a
moment, btrfs_read_sys_array is full of special cases, we're adding one
more.
Signed-off-by: David Sterba <dsterba@suse.cz>
The following pattern is repeated many times:
req = fuse_get_req_nopages(fc);
/* Initialize req->(in|out).args */
fuse_request_send(fc, req);
err = req->out.h.error;
fuse_put_request(req);
Create a new replacement helper:
/* Initialize args */
err = fuse_simple_request(fc, &args);
In addition to reducing the code size, this will ease moving from the
complex arg-based to a simpler page-based I/O on the fuse device.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
path_put() in release could trigger a DESTROY request in fuseblk. The
possible deadlock was worked around by doing the path_put() with
schedule_work().
This complexity isn't needed if we just hold the inode instead of the path.
Since we now flush all requests before destroying the super block we can be
sure that all held inodes will be dropped.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Use fuse_abort_conn() instead of fuse_conn_kill() in fuse_put_super().
This flushes and aborts requests still on any queues. But since we've
already reset fc->connected, those requests would not be useful anyway and
would be flushed when the fuse device is closed.
Next patches will rely on requests being flushed before the superblock is
destroyed.
Use fuse_abort_conn() in cuse_process_init_reply() too, since it makes no
difference there, and we can get rid of fuse_conn_kill().
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Waking up reserved_req_waitq from fuse_conn_kill() doesn't make sense since
we aren't chaging ff->reserved_req here, which is what this waitqueue
signals.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Pull MIPS updates from Ralf Baechle:
"This is an unusually large pull request for MIPS - in parts because
lots of patches missed the 3.18 deadline but primarily because some
folks opened the flood gates.
- Retire the MIPS-specific phys_t with the generic phys_addr_t.
- Improvments for the backtrace code used by oprofile.
- Better backtraces on SMP systems.
- Cleanups for the Octeon platform code.
- Cleanups and fixes for the Loongson platform code.
- Cleanups and fixes to the firmware library.
- Switch ATH79 platform to use the firmware library.
- Grand overhault to the SEAD3 and Malta interrupt code.
- Move the GIC interrupt code to drivers/irqchip
- Lots of GIC cleanups and updates to the GIC code to use modern IRQ
infrastructures and features of the kernel.
- OF documentation updates for the GIC bindings
- Move GIC clocksource driver to drivers/clocksource
- Merge GIC clocksource driver with clockevent driver.
- Further updates to bring the GIC clocksource driver up to date.
- R3000 TLB code cleanups
- Improvments to the Loongson 3 platform code.
- Convert pr_warning to pr_warn.
- Merge a bunch of small lantiq and ralink fixes that have been
staged/lingering inside the openwrt tree for a while.
- Update archhelp for IP22/IP32
- Fix a number of issues for Loongson 1B.
- New clocksource and clockevent driver for Loongson 1B.
- Further work on clk handling for Loongson 1B.
- Platform work for Broadcom BMIPS.
- Error handling cleanups for TurboChannel.
- Fixes and optimization to the microMIPS support.
- Option to disable the FTLB.
- Dump more relevant information on machine check exception
- Change binfmt to allow arch to examine PT_*PROC headers
- Support for new style FPU register model in O32
- VDSO randomization.
- BCM47xx cleanups
- BCM47xx reimplement the way the kernel accesses NVRAM information.
- Random cleanups
- Add support for ATH25 platforms
- Remove pointless locking code in some PCI platforms.
- Some improvments to EVA support
- Minor Alchemy cleanup"
* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus: (185 commits)
MIPS: Add MFHC0 and MTHC0 instructions to uasm.
MIPS: Cosmetic cleanups of page table headers.
MIPS: Add CP0 macros for extended EntryLo registers
MIPS: Remove now unused definition of phys_t.
MIPS: Replace use of phys_t with phys_addr_t.
MIPS: Replace MIPS-specific 64BIT_PHYS_ADDR with generic PHYS_ADDR_T_64BIT
PCMCIA: Alchemy Don't select 64BIT_PHYS_ADDR in Kconfig.
MIPS: lib: memset: Clean up some MIPS{EL,EB} ifdefery
MIPS: iomap: Use __mem_{read,write}{b,w,l} for MMIO
MIPS: <asm/types.h> fix indentation.
MAINTAINERS: Add entry for BMIPS multiplatform kernel
MIPS: Enable VDSO randomization
MIPS: Remove a temporary hack for debugging cache flushes in SMTC configuration
MIPS: Remove declaration of obsolete arch_init_clk_ops()
MIPS: atomic.h: Reformat to fit in 79 columns
MIPS: Apply `.insn' to fixup labels throughout
MIPS: Fix microMIPS LL/SC immediate offsets
MIPS: Kconfig: Only allow 32-bit microMIPS builds
MIPS: signal.c: Fix an invalid cast in ISA mode bit handling
MIPS: mm: Only build one microassembler that is suitable
...
- Expose the knob to user space through a proc file /proc/<pid>/setgroups
A value of "deny" means the setgroups system call is disabled in the
current processes user namespace and can not be enabled in the
future in this user namespace.
A value of "allow" means the segtoups system call is enabled.
- Descendant user namespaces inherit the value of setgroups from
their parents.
- A proc file is used (instead of a sysctl) as sysctls currently do
not allow checking the permissions at open time.
- Writing to the proc file is restricted to before the gid_map
for the user namespace is set.
This ensures that disabling setgroups at a user namespace
level will never remove the ability to call setgroups
from a process that already has that ability.
A process may opt in to the setgroups disable for itself by
creating, entering and configuring a user namespace or by calling
setns on an existing user namespace with setgroups disabled.
Processes without privileges already can not call setgroups so this
is a noop. Prodcess with privilege become processes without
privilege when entering a user namespace and as with any other path
to dropping privilege they would not have the ability to call
setgroups. So this remains within the bounds of what is possible
without a knob to disable setgroups permanently in a user namespace.
Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Pull networking updates from David Miller:
1) New offloading infrastructure and example 'rocker' driver for
offloading of switching and routing to hardware.
This work was done by a large group of dedicated individuals, not
limited to: Scott Feldman, Jiri Pirko, Thomas Graf, John Fastabend,
Jamal Hadi Salim, Andy Gospodarek, Florian Fainelli, Roopa Prabhu
2) Start making the networking operate on IOV iterators instead of
modifying iov objects in-situ during transfers. Thanks to Al Viro
and Herbert Xu.
3) A set of new netlink interfaces for the TIPC stack, from Richard
Alpe.
4) Remove unnecessary looping during ipv6 routing lookups, from Martin
KaFai Lau.
5) Add PAUSE frame generation support to gianfar driver, from Matei
Pavaluca.
6) Allow for larger reordering levels in TCP, which are easily
achievable in the real world right now, from Eric Dumazet.
7) Add a variable of napi_schedule that doesn't need to disable cpu
interrupts, from Eric Dumazet.
8) Use a doubly linked list to optimize neigh_parms_release(), from
Nicolas Dichtel.
9) Various enhancements to the kernel BPF verifier, and allow eBPF
programs to actually be attached to sockets. From Alexei
Starovoitov.
10) Support TSO/LSO in sunvnet driver, from David L Stevens.
11) Allow controlling ECN usage via routing metrics, from Florian
Westphal.
12) Remote checksum offload, from Tom Herbert.
13) Add split-header receive, BQL, and xmit_more support to amd-xgbe
driver, from Thomas Lendacky.
14) Add MPLS support to openvswitch, from Simon Horman.
15) Support wildcard tunnel endpoints in ipv6 tunnels, from Steffen
Klassert.
16) Do gro flushes on a per-device basis using a timer, from Eric
Dumazet. This tries to resolve the conflicting goals between the
desired handling of bulk vs. RPC-like traffic.
17) Allow userspace to ask for the CPU upon what a packet was
received/steered, via SO_INCOMING_CPU. From Eric Dumazet.
18) Limit GSO packets to half the current congestion window, from Eric
Dumazet.
19) Add a generic helper so that all drivers set their RSS keys in a
consistent way, from Eric Dumazet.
20) Add xmit_more support to enic driver, from Govindarajulu
Varadarajan.
21) Add VLAN packet scheduler action, from Jiri Pirko.
22) Support configurable RSS hash functions via ethtool, from Eyal
Perry.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1820 commits)
Fix race condition between vxlan_sock_add and vxlan_sock_release
net/macb: fix compilation warning for print_hex_dump() called with skb->mac_header
net/mlx4: Add support for A0 steering
net/mlx4: Refactor QUERY_PORT
net/mlx4_core: Add explicit error message when rule doesn't meet configuration
net/mlx4: Add A0 hybrid steering
net/mlx4: Add mlx4_bitmap zone allocator
net/mlx4: Add a check if there are too many reserved QPs
net/mlx4: Change QP allocation scheme
net/mlx4_core: Use tasklet for user-space CQ completion events
net/mlx4_core: Mask out host side virtualization features for guests
net/mlx4_en: Set csum level for encapsulated packets
be2net: Export tunnel offloads only when a VxLAN tunnel is created
gianfar: Fix dma check map error when DMA_API_DEBUG is enabled
cxgb4/csiostor: Don't use MASTER_MUST for fw_hello call
net: fec: only enable mdio interrupt before phy device link up
net: fec: clear all interrupt events to support i.MX6SX
net: fec: reset fep link status in suspend function
net: sock: fix access via invalid file descriptor
net: introduce helper macro for_each_cmsghdr
...
On some ARMs the memory can be mapped pgprot_noncached() and still
be working for atomic operations. As pointed out by Colin Cross
<ccross@android.com>, in some cases you do want to use
pgprot_noncached() if the SoC supports it to see a debug printk
just before a write hanging the system.
On ARMs, the atomic operations on strongly ordered memory are
implementation defined. So let's provide an optional kernel parameter
for configuring pgprot_noncached(), and use pgprot_writecombine() by
default.
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Rob Herring <robherring2@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Olof Johansson <olof@lixom.net>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: stable@vger.kernel.org
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Currently trying to use pstore on at least ARMs can hang as we're
mapping the peristent RAM with pgprot_noncached().
On ARMs, pgprot_noncached() will actually make the memory strongly
ordered, and as the atomic operations pstore uses are implementation
defined for strongly ordered memory, they may not work. So basically
atomic operations have undefined behavior on ARM for device or strongly
ordered memory types.
Let's fix the issue by using write-combine variants for mappings. This
corresponds to normal, non-cacheable memory on ARM. For many other
architectures, this change does not change the mapping type as by
default we have:
#define pgprot_writecombine pgprot_noncached
The reason why pgprot_noncached() was originaly used for pstore
is because Colin Cross <ccross@android.com> had observed lost
debug prints right before a device hanging write operation on some
systems. For the platforms supporting pgprot_noncached(), we can
add a an optional configuration option to support that. But let's
get pstore working first before adding new features.
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Anton Vorontsov <cbouatmailru@gmail.com>
Cc: Colin Cross <ccross@android.com>
Cc: Olof Johansson <olof@lixom.net>
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Rob Herring <rob.herring@calxeda.com>
[tony@atomide.com: updated description]
Signed-off-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
All callers of path_init() proceed to do the identical cleanup when
they are done with nameidata. Don't open-code it...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Merge first patchbomb from Andrew Morton:
- a few minor cifs fixes
- dma-debug upadtes
- ocfs2
- slab
- about half of MM
- procfs
- kernel/exit.c
- panic.c tweaks
- printk upates
- lib/ updates
- checkpatch updates
- fs/binfmt updates
- the drivers/rtc tree
- nilfs
- kmod fixes
- more kernel/exit.c
- various other misc tweaks and fixes
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits)
exit: pidns: fix/update the comments in zap_pid_ns_processes()
exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting
exit: exit_notify: re-use "dead" list to autoreap current
exit: reparent: call forget_original_parent() under tasklist_lock
exit: reparent: avoid find_new_reaper() if no children
exit: reparent: introduce find_alive_thread()
exit: reparent: introduce find_child_reaper()
exit: reparent: document the ->has_child_subreaper checks
exit: reparent: s/while_each_thread/for_each_thread/ in find_new_reaper()
exit: reparent: fix the cross-namespace PR_SET_CHILD_SUBREAPER reparenting
exit: reparent: fix the dead-parent PR_SET_CHILD_SUBREAPER reparenting
exit: proc: don't try to flush /proc/tgid/task/tgid
exit: release_task: fix the comment about group leader accounting
exit: wait: drop tasklist_lock before psig->c* accounting
exit: wait: don't use zombie->real_parent
exit: wait: cleanup the ptrace_reparented() checks
usermodehelper: kill the kmod_thread_locker logic
usermodehelper: don't use CLONE_VFORK for ____call_usermodehelper()
fs/hfs/catalog.c: fix comparison bug in hfs_cat_keycmp
nilfs2: fix the nilfs_iget() vs. nilfs_new_inode() races
...
As it is, default ->i_fop has NULL ->open() (along with all other methods).
The only case where it matters is reopening (via procfs symlink) a file that
didn't get its ->f_op from ->i_fop - anything else will have ->i_fop assigned
to something sane (default would fail on read/write/ioctl/etc.).
Unfortunately, such case exists - alloc_file() users, especially
anon_get_file() ones. There we have tons of opened files of very different
kinds sharing the same inode. As the result, attempt to reopen those via
procfs succeeds and you get a descriptor you can't do anything with.
Moreover, in case of sockets we set ->i_fop that will only be used
on such reopen attempts - and put a failing ->open() into it to make sure
those do not succeed.
It would be simpler to put such ->open() into default ->i_fop and leave
it unchanged both for anon inode (as we do anyway) and for socket ones. Result:
* everything going through do_dentry_open() works as it used to
* sock_no_open() kludge is gone
* attempts to reopen anon-inode files fail as they really ought to
* ditto for aio_private_file()
* ditto for perfmon - this one actually tried to imitate sock_no_open()
trick, but failed to set ->i_fop, so in the current tree reopens succeed and
yield completely useless descriptor. Intent clearly had been to fail with
-ENXIO on such reopens; now it actually does.
* everything else that used alloc_file() keeps working - it has ->i_fop
set for its inodes anyway
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
New pseudo-filesystem: nsfs. Targets of /proc/*/ns/* live there now.
It's not mountable (not even registered, so it's not in /proc/filesystems,
etc.). Files on it *are* bindable - we explicitly permit that in do_loopback().
This stuff lives in fs/nsfs.c now; proc_ns_fget() moved there as well.
get_proc_ns() is a macro now (it's simply returning ->i_private; would
have been an inline, if not for header ordering headache).
proc_ns_inode() is an ex-parrot. The interface used in procfs is
ns_get_path(path, task, ops) and ns_get_name(buf, size, task, ops).
Dentries and inodes are never hashed; a non-counting reference to dentry
is stashed in ns_common (removed by ->d_prune()) and reused by ns_get_path()
if present. See ns_get_path()/ns_prune_dentry/nsfs_evict() for details
of that mechanism.
As the result, proc_ns_follow_link() has stopped poking in nd->path.mnt;
it does nd_jump_link() on a consistent <vfsmount,dentry> pair it gets
from ns_get_path().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
proc_flush_task_mnt() always tries to flush task/pid, but this is
pointless if we reap the leader. d_invalidate() is recursive, and
if nothing else the next d_hash_and_lookup(tgid) should fail anyway.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sterling Alexander <stalexan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Relying on the sign (after casting to int) of the difference of two
quantities for comparison is usually wrong. For example, should a-b
turn out to be 2^31, the return value of cmp(a,b) is -2^31; but that
would also be the return value from cmp(b, a). So a compares less than
b and b compares less than a. One can also easily find three values
a,b,c such that a compares less than b, b compares less than c, but a
does not compare less than c.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Same story as in commit 41080b5a24 ("nfsd race fixes: ext2") (similar
ext2 fix) except that nilfs2 needs to use insert_inode_locked4() instead
of insert_inode_locked() and a bug of a check for dead inodes needs to
be fixed.
If nilfs_iget() is called from nfsd after nilfs_new_inode() calls
insert_inode_locked4(), nilfs_iget() will wait for unlock_new_inode() at
the end of nilfs_mkdir()/nilfs_create()/etc to unlock the inode.
If nilfs_iget() is called before nilfs_new_inode() calls
insert_inode_locked4(), it will create an in-core inode and read its
data from the on-disk inode. But, nilfs_iget() will find i_nlink equals
zero and fail at nilfs_read_inode_common(), which will lead it to call
iget_failed() and cleanly fail.
However, this sanity check doesn't work as expected for reused on-disk
inodes because they leave a non-zero value in i_mode field and it
hinders the test of i_nlink. This patch also fixes the issue by
removing the test on i_mode that nilfs2 doesn't need.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The iput() function tests whether its argument is NULL and then returns
immediately. Thus the test around the call is not needed.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch removes filemap_write_and_wait_range() from nilfs_sync_file(),
because it triggers a data segment construction by calling
nilfs_writepages() with WB_SYNC_ALL. A data segment construction does not
remove the inode from the i_dirty list and it does not clear the
NILFS_I_DIRTY flag. Therefore nilfs_inode_dirty() still returns true,
which leads to an unnecessary duplicate segment construction in
nilfs_sync_file().
A call to filemap_write_and_wait_range() is not needed, because NILFS2
does not rely on the generic writeback mechanisms. Instead it implements
its own mechanism to collect all dirty pages and write them into segments.
It is more efficient to initiate the segment construction directly in
nilfs_sync_file() without the detour over filemap_write_and_wait_range().
Additionally the lock of i_mutex is not needed, because all code blocks
that are protected by i_mutex are also protected by a NILFS transaction:
Function i_mutex nilfs_transaction
------------------------------------------------------
nilfs_ioctl_setflags: yes yes
nilfs_fiemap: yes no
nilfs_write_begin: yes yes
nilfs_write_end: yes yes
nilfs_lookup: yes no
nilfs_create: yes yes
nilfs_link: yes yes
nilfs_mknod: yes yes
nilfs_symlink: yes yes
nilfs_mkdir: yes yes
nilfs_unlink: yes yes
nilfs_rmdir: yes yes
nilfs_rename: yes yes
nilfs_setattr: yes yes
For nilfs_lookup() i_mutex is held for the parent directory, to protect it
from modification. The segment construction does not modify directory
inodes, so no lock is needed.
nilfs_fiemap() reads the block layout on the disk, by using
nilfs_bmap_lookup_contig(). This is already protected by bmap->b_sem.
Signed-off-by: Andreas Rohner <andreas.rohner@gmx.net>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If some error happens in NCP_IOC_SETROOT ioctl, the appropriate error
return value is then (in most cases) just overwritten before we return.
This can result in reporting success to userspace although error happened.
This bug was introduced by commit 2e54eb96e2 ("BKL: Remove BKL from
ncpfs"). Propagate the errors correctly.
Coverity id: 1226925.
Fixes: 2e54eb96e2 ("BKL: Remove BKL from ncpfs")
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Petr Vandrovec <petr@vandrovec.name>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
vma_dump_size() has been used several times on actual dumper and it is
supposed to return the same value for the same vma. But vma_dump_size()
could return different values for same vma.
The known problem case is concurrent shared memory removal. If a vma is
used for a shared memory and that shared memory is removed between
writing program header and dumping vma memory, this will result in a
dump file which is internally consistent.
To fix the problem, we set baseline to get dump size and store the size
into vma_filesz and always use the same vma dump size which is stored in
vma_filsz. The consistnecy with reality is not actually guranteed, but
it's tolerable since that is fully consistent with base line.
Signed-off-by: Jungseung Lee <js07.lee@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
GFP_USER means "honour cpuset nodes-allowed beancounting". These are
regular old kernel objects and there seems no reason to give them this
treatment.
Acked-by: Mike Frysinger <vapier@gentoo.org>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Clean up various coding style issues that checkpatch complains about.
No functional changes here.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When trying to develop a custom format handler, the errors returned all
effectively get bucketed as EINVAL with no kernel messages. The other
errors (ENOMEM/EFAULT) are internal/obvious and basic. Thus any time a
bad handler is rejected, the developer has to walk the dense code and
try to guess where it went wrong. Needing to dive into kernel code is
itself a fairly high barrier for a lot of people.
To improve this situation, let's deploy extensive pr_debug markers at
logical parse points, and add comments to the dense parsing logic. It
let's you see exactly where the parsing aborts, the string the kernel
received (useful when dealing with shell code), how it translated the
buffers to binary data, and how it will apply the mask at runtime.
Some example output:
$ echo ':qemu-foo:M::\x7fELF\xAD\xAD\x01\x00:\xff\xff\xff\xff\xff\x00\xff\x00:/usr/bin/qemu-foo:POC' > register
$ dmesg
binfmt_misc: register: received 92 bytes
binfmt_misc: register: delim: 0x3a {:}
binfmt_misc: register: name: {qemu-foo}
binfmt_misc: register: type: M (magic)
binfmt_misc: register: offset: 0x0
binfmt_misc: register: magic[raw]: 5c 78 37 66 45 4c 46 5c 78 41 44 5c 78 41 44 5c \x7fELF\xAD\xAD\
binfmt_misc: register: magic[raw]: 78 30 31 5c 78 30 30 00 x01\x00.
binfmt_misc: register: mask[raw]: 5c 78 66 66 5c 78 66 66 5c 78 66 66 5c 78 66 66 \xff\xff\xff\xff
binfmt_misc: register: mask[raw]: 5c 78 66 66 5c 78 30 30 5c 78 66 66 5c 78 30 30 \xff\x00\xff\x00
binfmt_misc: register: mask[raw]: 00 .
binfmt_misc: register: magic/mask length: 8
binfmt_misc: register: magic[decoded]: 7f 45 4c 46 ad ad 01 00 .ELF....
binfmt_misc: register: mask[decoded]: ff ff ff ff ff 00 ff 00 ........
binfmt_misc: register: magic[masked]: 7f 45 4c 46 ad 00 01 00 .ELF....
binfmt_misc: register: interpreter: {/usr/bin/qemu-foo}
binfmt_misc: register: flag: P (preserve argv0)
binfmt_misc: register: flag: O (open binary)
binfmt_misc: register: flag: C (preserve creds)
The [raw] lines show us exactly what was received from userspace. The
lines after that show us how the kernel has decoded things.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch replaces calls to get_unused_fd() with equivalent call to
get_unused_fd_flags(0) to preserve current behavor for existing code.
In a further patch, get_unused_fd() will be removed so that new code
start using get_unused_fd_flags(), with the hope O_CLOEXEC could be
used, either by default or choosen by userspace.
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch replaces calls to get_unused_fd() with equivalent call to
get_unused_fd_flags(0) to preserve current behavor for existing code.
In a further patch, get_unused_fd() will be removed so that new code start
using get_unused_fd_flags(), with the hope O_CLOEXEC could be used, either
by default or choosen by userspace.
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
p->ptrace != 0 means that release_task(p) was not called, so pid_alive()
buys nothing and we can remove this check. Other callers already use it
directly without additional checks.
Note: with or without this patch ptrace_parent() can return the pointer to
the freed task, this will be explained/fixed later.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>,
Cc: Sterling Alexander <stalexan@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roland McGrath <roland@hack.frob.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
task_state() does seq_printf() under rcu_read_lock(), but this is only
needed for task_tgid_nr_ns() and task_numa_group_id(). We can calculate
tgid/ngid and drop rcu lock.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>,
Cc: Sterling Alexander <stalexan@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roland McGrath <roland@hack.frob.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1. The usage of fdt looks very ugly, it can't be NULL if ->files is
not NULL. We can use "unsigned int max_fds" instead.
2. This also allows to move seq_printf(max_fds) outside of task_lock()
and join it with the previous seq_printf(). See also the next patch.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>,
Cc: Sterling Alexander <stalexan@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roland McGrath <roland@hack.frob.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
task_state() reads cred->group_info under task_lock() because a long ago
it was task_struct->group_info and it was actually protected by
task->alloc_lock. Today this task_unlock() after rcu_read_unlock() just
adds the confusion, move task_unlock() up.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>,
Cc: Sterling Alexander <stalexan@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roland McGrath <roland@hack.frob.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Better to use existing macro that rewriting them.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a lot of netdevices are created, one of the bottleneck is the
creation of proc entries. This serie aims to accelerate this part.
The current implementation for the directories in /proc is using a single
linked list. This is slow when handling directories with large numbers of
entries (eg netdevice-related entries when lots of tunnels are opened).
This patch replaces this linked list by a red-black tree.
Here are some numbers:
dummy30000.batch contains 30 000 times 'link add type dummy'.
Before the patch:
$ time ip -b dummy30000.batch
real 2m31.950s
user 0m0.440s
sys 2m21.440s
$ time rmmod dummy
real 1m35.764s
user 0m0.000s
sys 1m24.088s
After the patch:
$ time ip -b dummy30000.batch
real 2m0.874s
user 0m0.448s
sys 1m49.720s
$ time rmmod dummy
real 1m13.988s
user 0m0.000s
sys 1m1.008s
The idea of improving this part was suggested by Thierry Herbelot.
[akpm@linux-foundation.org: initialise proc_root.subdir at compile time]
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Thierry Herbelot <thierry.herbelot@6wind.com>.
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As a small zero page, huge zero page should not be accounted in smaps
report as normal page.
For small pages we rely on vm_normal_page() to filter out zero page, but
vm_normal_page() is not designed to handle pmds. We only get here due
hackish cast pmd to pte in smaps_pte_range() -- pte and pmd format is not
necessary compatible on each and every architecture.
Let's add separate codepath to handle pmds. follow_trans_huge_pmd() will
detect huge zero page for us.
We would need pmd_dirty() helper to do this properly. The patch adds it
to THP-enabled architectures which don't yet have one.
[akpm@linux-foundation.org: use do_div to fix 32-bit build]
Signed-off-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Tested-by: Fengwei Yin <yfw.kernel@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
At one place we assign major number we found to ret. That assignment is
then never used and actually doesn't make any sense given how the code is
currently structured (the assignment comes from pre-git times). Just
remove it.
Coverity id: 1226852.
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit 1faf289454 ("ocfs2_dlm: disallow a domain join if node maps
mismatch") we introduced a new earlier NULL check so this one is not
needed. Also static checkers complain because we dereference it first
and then check for NULL.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
"inode" isn't NULL here, and also we dereference it on the previous line
so static checkers get annoyed.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Do not set the filesystem readonly if the storage link is down. In this
case, metadata is not corrupted and only -EIO is returned. And if it is
indeed corrupted metadata, it has already called ocfs2_error() in
ocfs2_validate_inode_block().
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ocfs2_readpages() use nonblocking flag to avoid page lock inversion. It
will trigger cluster hang because that flag OCFS2_LOCK_UPCONVERT_FINISHING
is not cleared if nonblocking lock cannot be granted at once. The flag
would prevent dc thread from downconverting. So other nodes cannot
acheive this lockres for ever.
So we should not set OCFS2_LOCK_UPCONVERT_FINISHING when receiving ast if
nonblocking lock had already returned.
Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Error handling if creation of root of debugfs in ocfs2_init() fails is
broken. Although error code is set we fail to exit ocfs2_init() with
error and thus initialization ends with success. Later when mounting a
filesystem, ocfs2 debugfs entries end up being created in the root of
debugfs filesystem which is confusing.
Fix the error handling to bail out.
Coverity id: 1227009.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Filesize is not a good indication that the file needs to be synced.
An example where this breaks is:
1. Open the file in O_SYNC|O_RDWR
2. Read a small portion of the file (say 64 bytes)
3. Lseek to starting of the file
4. Write 64 bytes
If the node crashes, it is not written out to disk because this was not
committed in the journal and the other node which reads the file after
recovery reads stale data (even if the write on the other node was
successful)
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.de>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Set nn_persistent_error to -ENOTCONN will stop reconnect since the
"stop" condition in o2net_start_connect() will be true.
stop = (nn->nn_sc ||
(nn->nn_persistent_error &&
(nn->nn_persistent_error != -ENOTCONN || timeout == 0)));
This will make connection never be established if the first connection
request is lost.
Set nn_persistent_error to 0 when connect expired to fix this. With
this changes, dlm will not be waken up when connect expired, this is OK
since dlm depends on network, dlm can do nothing in this case if waken
up. Let it wait there for network recover and connect built again to
continue.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Node A sends master query request to node B which is the master. At this
time lockres happens to be on purgelist. dlm_master_request_handler gets
the dlm spinlock, finds the resource and releases the dlm spin lock.
Right at this dlm_thread on this node could purge the lockres.
dlm_master_request_handler can then acquire lockres spinlock and reply to
Node A that node B is the master even though lockres on node B is purged.
The above scenario will now make node A falsely think node B is the master
which is inconsistent. Further if another node C tries to master the same
resource, every node will respond they are not the master. Node C then
masters the resource and sends assert master to all nodes. This will now
make node A crash with the following message.
dlm_assert_master_handler:1831 ERROR: DIE! Mastery assert from 9, but current
owner is 10!
Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com>
Tested-by: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Report return value of o2hb_do_disk_heartbeat() as a part of ML_HEARTBEAT
message so that we know whether a heartbeat actually happened or not.
This also makes assigned but otherwise unused 'ret' variable used.
Coverity id: 1227053.
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
'args' are always set for ocfs2_read_locked_inode() and brelse() checks
whether bh is NULL. So the test (args && bh) is unnecessary (plus the
args part is really confusing anyway). Remove it.
Coverity id: 1128856.
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ocfs2_get_xattr_nolock() checks whether inode has any extended attributes
(OCFS2_HAS_XATTR_FL). If not, it just sets 'ret' to -ENODATA but
continues with checking inline and external attributes anyway (which is
pointless although it does not harm). Just return immediately when we
know there are no extended attributes in the inode.
Coverity id: 1226906.
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The ->si_slots[] array is allocated in ocfs2_init_slot_info() it has
"->max_slots" number of elements so this test should be >= instead of >.
Static checker work. Compile tested only.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Do not BUG() if GFP_ATOMIC allocation fails in dlm_dispatch_assert_master.
Instead, return -ENOMEM to the sender and then retry.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Alex Chen <alex.chen@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kcalloc manages count*sizeof overflow.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Steve French <sfrench@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kcalloc manages count*sizeof overflow.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Steve French <sfrench@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace all __constant_foo to foo() except in smb2status.h (1700 lines to
update).
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Steve French <sfrench@samba.org>
Cc: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull VFS changes from Al Viro:
"First pile out of several (there _definitely_ will be more). Stuff in
this one:
- unification of d_splice_alias()/d_materialize_unique()
- iov_iter rewrite
- killing a bunch of ->f_path.dentry users (and f_dentry macro).
Getting that completed will make life much simpler for
unionmount/overlayfs, since then we'll be able to limit the places
sensitive to file _dentry_ to reasonably few. Which allows to have
file_inode(file) pointing to inode in a covered layer, with dentry
pointing to (negative) dentry in union one.
Still not complete, but much closer now.
- crapectomy in lustre (dead code removal, mostly)
- "let's make seq_printf return nothing" preparations
- assorted cleanups and fixes
There _definitely_ will be more piles"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
copy_from_iter_nocache()
new helper: iov_iter_kvec()
csum_and_copy_..._iter()
iov_iter.c: handle ITER_KVEC directly
iov_iter.c: convert copy_to_iter() to iterate_and_advance
iov_iter.c: convert copy_from_iter() to iterate_and_advance
iov_iter.c: get rid of bvec_copy_page_{to,from}_iter()
iov_iter.c: convert iov_iter_zero() to iterate_and_advance
iov_iter.c: convert iov_iter_get_pages_alloc() to iterate_all_kinds
iov_iter.c: convert iov_iter_get_pages() to iterate_all_kinds
iov_iter.c: convert iov_iter_npages() to iterate_all_kinds
iov_iter.c: iterate_and_advance
iov_iter.c: macros for iterating over iov_iter
kill f_dentry macro
dcache: fix kmemcheck warning in switch_names
new helper: audit_file()
nfsd_vfs_write(): use file_inode()
ncpfs: use file_inode()
kill f_dentry uses
lockd: get rid of ->f_path.dentry->d_sb
...
This set includes one feature, which allows locks that
have been orphaned to be reacquired.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUiIOTAAoJEDgbc8f8gGmqIHcP/1JW/ji74+bdci533MH1pL9W
l6HhoiGsNi4bF7QJckKXUKZYDii4VBjf5VkvvA0aNnXoncBr2XW6b96TUPsdbzBd
St11wSkhlfMOMfg+aalBFd34fm4QwNt5hqmuPAJXbK24Jgy4JMzHsEHSi7yz5WDn
Pgl2s9+6fDU648Vd0iu1u3jMXY8MP1apGWJkV7tFmt2XE1DyOP+yqghYp2PkQeDk
hZPdLhO2JwynRliPg99qNLvBurzYwFk1RJ1fbi9WPnvgTp42i+JuxbMtZwh5ehjr
FLWskJIJmbsm8S95uw4lGFhPE+76Psq4etmoTl+lyT2pZQeNItWX6JQ6u8UcGevD
xJeokmdhvbF4NRIcgP7b3u3Mue78PdkqAy40nmkBdp4+9uJrXB/+Mts1FBJHgXIH
jdEGGdVCBSGr7TRkbJ5hMfI51Wyrl6u2JICCBlzwGWbRiXy76u78YAIj+w5s45yL
HxkRZll9UMNEDlO//Ldhwh0CV0yBW00bdeurwxm6i4xS9vAUEIXByJ62EwbPnMvC
vD6oufWkfNzAKZoF8gvPwQStt9pXWPNe314QvUVHx6B9VZpcV9VfEqmNN4qzhuBU
I5a5G03tnjtd1JdcsFfxduDIYVDYTmba/Bj/CLVMECWsBRAvCzr57a3JHL+cabA0
Lz/LqaTNterF8l4zAo8J
=u91p
-----END PGP SIGNATURE-----
Merge tag 'dlm-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm update from David Teigland:
"This set includes one feature, which allows locks that have been
orphaned to be reacquired"
* tag 'dlm-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
dlm: adopt orphan locks
Pull quota updates from Jan Kara:
"Quota improvements and some minor cleanups.
The main portion in the pull request are changes which move i_dquot
array from struct inode into fs-private part of an inode which saves
memory for filesystems which don't use VFS quotas"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
udf: One function call less in udf_fill_super() after error detection
udf: Deletion of unnecessary checks before the function call "iput"
jbd: Deletion of an unnecessary check before the function call "iput"
vfs: Remove i_dquot field from inode
jfs: Convert to private i_dquot field
reiserfs: Convert to private i_dquot field
ocfs2: Convert to private i_dquot field
ext4: Convert to private i_dquot field
ext3: Convert to private i_dquot field
ext2: Convert to private i_dquot field
quota: Use function to provide i_dquot pointers
xfs: Set allowed quota types
gfs2: Set allowed quota types
quota: Allow each filesystem to specify which quota types it supports
quota: Remove const from function declarations
quota: Add log level to printk
Pull f2fs updates from Jaegeuk Kim:
"This patch-set includes lots of bug fixes based on clean-ups and
refactored codes. And inline_dir was introduced and two minor mount
options were added. Details from signed tag:
This series includes the following enhancement with refactored flows.
- fix inmemory page operations
- fix wrong inline_data & inline_dir logics
- enhance memory and IO control under memory pressure
- consider preemption on radix_tree operation
- fix memory leaks and deadlocks
But also, there are a couple of new features:
- support inline_dir to store dentries inside inode page
- add -o fastboot to reduce booting time
- implement -o dirsync
And a lot of clean-ups and minor bug fixes as well"
* tag 'for-f2fs-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (88 commits)
f2fs: avoid to ra unneeded blocks in recover flow
f2fs: introduce is_valid_blkaddr to cleanup codes in ra_meta_pages
f2fs: fix to enable readahead for SSA/CP blocks
f2fs: use atomic for counting inode with inline_{dir,inode} flag
f2fs: cleanup path to need cp at fsync
f2fs: check if inode state is dirty at fsync
f2fs: count the number of inmemory pages
f2fs: release inmemory pages when the file was closed
f2fs: set page private for inmemory pages for truncation
f2fs: count inline_xx in do_read_inode
f2fs: do retry operations with cond_resched
f2fs: call radix_tree_preload before radix_tree_insert
f2fs: use rw_semaphore for nat entry lock
f2fs: fix missing kmem_cache_free
f2fs: more fast lookup for gc_inode list
f2fs: cleanup redundant macro
f2fs: fix to return correct error number in f2fs_write_begin
f2fs: cleanup if-statement of phase in gc_data_segment
f2fs: fix to recover converted inline_data
f2fs: make clean the page before writing
...
Pull cifs update from Steve French:
"Mostly cifs cleanup but also a few cifs fixes"
* 'for-linus' of git://git.samba.org/sfrench/cifs-2.6:
cifs: remove unneeded condition check
Set UID in sess_auth_rawntlmssp_authenticate too
cifs: convert printk(LEVEL...) to pr_<level>
cifs: convert to print_hex_dump() instead of custom implementation
cifs: call strtobool instead of custom implementation
Update MAINTAINERS entry
Update modinfo cifs version for cifs.ko
decode_negTokenInit had wrong calling sequence
Add missing defines for ACL query support
Add support for original fallocate
this time. There is a set of patches to improve performance in relation to
block reservations. Some correctness fixes for fallocate, and an update
to the freeze/thaw code which greatly simplyfies this code path. In
addition there is a set of clean ups from Al Viro too.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.15 (GNU/Linux)
iQIcBAABAgAGBQJUher6AAoJEMrg3m4a/8jSaJwP/Ai9cohCBohYgzgBIas0L8zy
H6BwYwLoUU0E7UlL7RBkjE9ZNL2meFcDM4NGpzXkOcJaJw5hkWHcwSmLBOU1V27N
v3wgaLd1J2BXwaYMrJ0XTqbdzU63Y27KkXOHPBr+UwEtd3azeugNX2sfgrKg8cqd
6AM8sbPifGs+2u1viTbtAhirIo/TE2kk60OuBeX6hCNjvN/PcOKKF+ISewtpqfFD
1vHwjVDX7USuUkjGQRCmM7A032b2YilMf+57Oe/a2Q+CyI7E41259nrwWC0/vcst
AuKb48WyL6Y6YLMXA2HlqxeYkyEAyr0pk0D4hRYYofebSn3d4mDaxvTU0y/vKuL1
bD9J3niPv44B9OtrjzbKf0Utsk9cUeYMOcb6ydMTcEYdMIEITG21N/yR1bU2MkYt
4KpnjcdEtoNteo0OsxtWq2poL0RxlKde8P7wUtwvnrK0wcVDdWbLU1iXf0t2r2RF
JO9ZSTYrKoFvTpg34zCcUlHBMarZSdP1Kou9hUkTXmZtmirwqR+9T6GtexD60jxz
TIRMHOf8HXz9wM4kUI442IBaHIW38AsXNEPVUp3vk04qLCqCPmE7ISBvAB4NHbIn
Yw/X9fJwK3hn+/R9+u09aJKLGDKWwlSOVdTb+yFgQcqz6BcaBoZMdamiKQcOGEk2
5qQ8J/F5f87BZOvuUUpI
=t1F/
-----END PGP SIGNATURE-----
Merge tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw
Pull GFS2 update from Steven Whitehouse:
"In contrast to recent merge windows, there are a number of interesting
features this time:
There is a set of patches to improve performance in relation to block
reservations. Some correctness fixes for fallocate, and an update to
the freeze/thaw code which greatly simplyfies this code path. In
addition there is a set of clean ups from Al Viro too"
* tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw:
GFS2: gfs2_atomic_open(): simplify the use of finish_no_open()
GFS2: gfs2_dir_get_hash_table(): avoiding deferred vfree() is easy here...
GFS2: use kvfree() instead of open-coding it
GFS2: gfs2_create_inode(): don't bother with d_splice_alias()
GFS2: bugger off early if O_CREAT open finds a directory
GFS2: Deletion of unnecessary checks before two function calls
GFS2: update freeze code to use freeze/thaw_super on all nodes
fs: add freeze_super/thaw_super fs hooks
GFS2: Update timestamps on fallocate
GFS2: Update i_size properly on fallocate
GFS2: Use inode_newsize_ok and get_write_access in fallocate
GFS2: If we use up our block reservation, request more next time
GFS2: Only increase rs_sizehint
GFS2: Set of distributed preferences for rgrps
GFS2: directly return gfs2_dir_check()
side-step that by reading copies that pstore saved.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUheJgAAoJEKurIx+X31iB5F0P/jdpAw6cI26icGiOcRvRYvce
jLq/WbGggxZlx3rtgGpekJmcJ1NBBTLdyx4b86q4q/zstQkoJ9lqGCn63YcIMJNB
pdctmbkGyoQQXBTAzSCFs6pybMUmtYKMDiT3OJddcCm4fUjd4RQHvNP+5ESsf0lQ
9YpIS+rZOtB2/5N6/i4+Lnaffc3s5gXw/dJMxOm/laWtRFRyhf22YP18cRp5LmuV
NHqu1NoeLnar/qL6plPl73lEyZVOPRC01T7OWmmCkcLieYPGkqQlkoXp95VBKf5u
CvD167oM71OccMa0gOTlCS8a6y5KO6y8I+YAR60iANTLDh+rHZiwNj1gY4v/Z29m
2ba1xAulQrpCxqml6eVxAKaF+4HXaXVXKqjQIivJcGyfYf6BXLMvC0M3Lsv7XQdz
HKl++o0JELDEJjVW0i9Wa5CjgcqXdvuRXOoKDaKTZWff2yfUxqIN5Xl7zIV2kgVy
ZqPDBHJSmHjuzmJ6inhPkmdS2uz94PVSE7ykeaa8iCBbpdsS+FchtF2sRMvUhU23
ekHsxk0Mk/pS5EBNc6rrrM9NtKrUQMa1e/oT5G7QowksDeNpsPjx92OeUImxgh3x
+hmObN9vx6SepwVSfjI1rwrMsAknphJfPmyi/XJgkVbfRMCv2we1npvYd6hqFUMV
daekMzGOi5eqoaWB8hje
=Ezg0
-----END PGP SIGNATURE-----
Merge tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux
Pull pstore fixes from Tony Luck:
"On a system that restricts access to dmesg, don't let people side-step
that by reading copies that pstore saved"
* tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
syslog: Provide stub check_syslog_permissions
pstore: Honor dmesg_restrict sysctl on dmesg dumps
pstore/ram: Strip ramoops header for correct decompression
Highlights include:
Features:
- NFSv4.2 client support for hole punching and preallocation.
- Further RPC/RDMA client improvements.
- Add more RPC transport debugging tracepoints.
- Add RPC debugging tools in debugfs.
Bugfixes:
- Stable fix for layoutget error handling
- Fix a change in COMMIT behaviour resulting from the recent io code updates
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUhRVTAAoJEGcL54qWCgDyfeUP/RoFo3ImTMbGxfcPJqoELjcO
lZbQ+27pOE/whFDkWgiOVTwlgGct5a0WRo7GCZmpYJA4q1kmSv4ngTb3nMTCUztt
xMJ0mBr0BqttVs+ouKiVPm3cejQXedEhttwWcloIXS8lNenlpL29Zlrx2NHdU8UU
13+souocj0dwIyTYYS/4Lm9KpuCYnpDBpP5ShvQjVaMe/GxJo6GyZu70c7FgwGNz
Nh9onzZV3mz1elhfizlV38aVA7KWVXtLWIqOFIKlT2fa4nWB8Hc07miR5UeOK0/h
r+icnF2qCQe83MbjOxYNxIKB6uiA/4xwVc90X4AQ7F0RX8XPWHIQWG5tlkC9jrCQ
3RGzYshWDc9Ud2mXtLMyVQxHVVYlFAe1WtdP8ZWb1oxDInmhrarnWeNyECz9xGKu
VzIDZzeq9G8slJXATWGRfPsYr+Ihpzcen4QQw58cakUBcqEJrYEhlEOfLovM71k3
/S/jSHBAbQqiw4LPMw87bA5A6+ZKcVSsNE0XCtNnhmqFpLc1kKRrl5vaN+QMk5tJ
v4/zR0fPqH7SGAJWYs4brdfahyejEo0TwgpDs7KHmu1W9zQ0LCVTaYnQuUmQjta6
WyYwIy3TTibdfR191O0E3NOW82Q/k/NBD6ySvabN9HqQ9eSk6+rzrWAslXCbYohb
BJfzcQfDdx+lsyhjeTx9
=wOP3
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-3.19-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Features:
- NFSv4.2 client support for hole punching and preallocation.
- Further RPC/RDMA client improvements.
- Add more RPC transport debugging tracepoints.
- Add RPC debugging tools in debugfs.
Bugfixes:
- Stable fix for layoutget error handling
- Fix a change in COMMIT behaviour resulting from the recent io code
updates"
* tag 'nfs-for-3.19-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (28 commits)
sunrpc: add a debugfs rpc_xprt directory with an info file in it
sunrpc: add debugfs file for displaying client rpc_task queue
nfs: Add DEALLOCATE support
nfs: Add ALLOCATE support
NFS: Clean up nfs4_init_callback()
NFS: SETCLIENTID XDR buffer sizes are incorrect
SUNRPC: serialize iostats updates
xprtrdma: Display async errors
xprtrdma: Enable pad optimization
xprtrdma: Re-write rpcrdma_flush_cqs()
xprtrdma: Refactor tasklet scheduling
xprtrdma: unmap all FMRs during transport disconnect
xprtrdma: Cap req_cqinit
xprtrdma: Return an errno from rpcrdma_register_external()
nfs: define nfs_inc_fscache_stats and using it as possible
nfs: replace nfs_add_stats with nfs_inc_stats when add one
NFS: Deletion of unnecessary checks before the function call "nfs_put_client"
sunrpc: eliminate RPC_TRACEPOINTS
sunrpc: eliminate RPC_DEBUG
lockd: eliminate LOCKD_DEBUG
...
Conflicts:
drivers/net/ethernet/amd/xgbe/xgbe-desc.c
drivers/net/ethernet/renesas/sh_eth.c
Overlapping changes in both conflict cases.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull EFI updates from Ingo Molnar:
"Changes in this cycle are:
- support module unload for efivarfs (Mathias Krause)
- another attempt at moving x86 to libstub taking advantage of the
__pure attribute (Ard Biesheuvel)
- add EFI runtime services section to ptdump (Mathias Krause)"
* 'x86-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, ptdump: Add section for EFI runtime services
efi/x86: Move x86 back to libstub
efivarfs: Allow unloading when build as module
It doesn't do anything special, it just calls btrfs_discard_extent(),
so just remove it.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
When we abort a transaction we iterate over all the ranges marked as dirty
in fs_info->freed_extents[0] and fs_info->freed_extents[1], clear them
from those trees, add them back (unpin) to the free space caches and, if
the fs was mounted with "-o discard", perform a discard on those regions.
Also, after adding the regions to the free space caches, a fitrim ioctl call
can see those ranges in a block group's free space cache and perform a discard
on the ranges, so the same issue can happen without "-o discard" as well.
This causes corruption, affecting one or multiple btree nodes (in the worst
case leaving the fs unmountable) because some of those ranges (the ones in
the fs_info->pinned_extents tree) correspond to btree nodes/leafs that are
referred by the last committed super block - breaking the rule that anything
that was committed by a transaction is untouched until the next transaction
commits successfully.
I ran into this while running in a loop (for several hours) the fstest that
I recently submitted:
[PATCH] fstests: add btrfs test to stress chunk allocation/removal and fstrim
The corruption always happened when a transaction aborted and then fsck complained
like this:
_check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
*** fsck.btrfs output ***
Check tree block failed, want=94945280, have=0
Check tree block failed, want=94945280, have=0
Check tree block failed, want=94945280, have=0
Check tree block failed, want=94945280, have=0
Check tree block failed, want=94945280, have=0
read block failed check_tree_block
Couldn't open file system
In this case 94945280 corresponded to the root of a tree.
Using frace what I observed was the following sequence of steps happened:
1) transaction N started, fs_info->pinned_extents pointed to
fs_info->freed_extents[0];
2) node/eb 94945280 is created;
3) eb is persisted to disk;
4) transaction N commit starts, fs_info->pinned_extents now points to
fs_info->freed_extents[1], and transaction N completes;
5) transaction N + 1 starts;
6) eb is COWed, and btrfs_free_tree_block() called for this eb;
7) eb range (94945280 to 94945280 + 16Kb) is added to
fs_info->pinned_extents (fs_info->freed_extents[1]);
8) Something goes wrong in transaction N + 1, like hitting ENOSPC
for example, and the transaction is aborted, turning the fs into
readonly mode. The stack trace I got for example:
[112065.253935] [<ffffffff8140c7b6>] dump_stack+0x4d/0x66
[112065.254271] [<ffffffff81042984>] warn_slowpath_common+0x7f/0x98
[112065.254567] [<ffffffffa0325990>] ? __btrfs_abort_transaction+0x50/0x10b [btrfs]
[112065.261674] [<ffffffff810429e5>] warn_slowpath_fmt+0x48/0x50
[112065.261922] [<ffffffffa032949e>] ? btrfs_free_path+0x26/0x29 [btrfs]
[112065.262211] [<ffffffffa0325990>] __btrfs_abort_transaction+0x50/0x10b [btrfs]
[112065.262545] [<ffffffffa036b1d6>] btrfs_remove_chunk+0x537/0x58b [btrfs]
[112065.262771] [<ffffffffa033840f>] btrfs_delete_unused_bgs+0x1de/0x21b [btrfs]
[112065.263105] [<ffffffffa0343106>] cleaner_kthread+0x100/0x12f [btrfs]
(...)
[112065.264493] ---[ end trace dd7903a975a31a08 ]---
[112065.264673] BTRFS: error (device sdc) in btrfs_remove_chunk:2625: errno=-28 No space left
[112065.264997] BTRFS info (device sdc): forced readonly
9) The clear kthread sees that the BTRFS_FS_STATE_ERROR bit is set in
fs_info->fs_state and calls btrfs_cleanup_transaction(), which in
turn calls btrfs_destroy_pinned_extent();
10) Then btrfs_destroy_pinned_extent() iterates over all the ranges
marked as dirty in fs_info->freed_extents[], and for each one
it calls discard, if the fs was mounted with "-o discard", and
adds the range to the free space cache of the respective block
group;
11) btrfs_trim_block_group(), invoked from the fitrim ioctl code path,
sees the free space entries and performs a discard;
12) After an umount and mount (or fsck), our eb's location on disk was full
of zeroes, and it should have been untouched, because it was marked as
dirty in the fs_info->pinned_extents tree, and therefore used by the
trees that the last committed superblock points to.
Fix this by not performing a discard and not adding the ranges to the free space
caches - it's useless from this point since the fs is now in readonly mode and
we won't write free space caches to disk anymore (otherwise we would leak space)
nor any new superblock. By not adding the ranges to the free space caches, it
prevents other code paths from allocating that space and write to it as well,
therefore being safer and simpler.
This isn't a new problem, as it's been present since 2011 (git commit
acce952b02).
Cc: stable@vger.kernel.org # any kernel released after 2011-01-06
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Always clear a block group's rbnode after removing it from the rbtree to
ensure that any tasks that might be holding a reference on the block group
don't end up accessing stale rbnode left and right child pointers through
next_block_group().
This is a leftover from the change titled:
"Btrfs: fix invalid block group rbtree access after bg is removed"
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
The call to remove_extent_mapping() actually deletes the extent map
from the list it's included in - fs_info->pinned_chunks - and that
list is protected by the chunk mutex. Therefore make that call
while holding the chunk mutex and remove the redundant list delete
call because it's a noop.
This fixes an overlook of the patch titled
"Btrfs: fix race between fs trimming and block group remove/allocation"
following the same obvervation from the patch titled
"Btrfs: fix unprotected deletion from pending_chunks list".
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
This patch effectively reverts commit 500f808726 ("net: ovs: use CRC32
accelerated flow hash if available"), and other remaining arch_fast_hash()
users such as from nfsd via commit 6282cd5655 ("NFSD: Don't hand out
delegations for 30 seconds after recalling them.") where it has been used
as a hash function for bloom filtering.
While we think that these users are actually not much of concern, it has
been requested to remove the arch_fast_hash() library bits that arose
from [1] entirely as per recent discussion [2]. The main argument is that
using it as a hash may introduce bias due to its linearity (see avalanche
criterion) and thus makes it less clear (though we tried to document that)
when this security/performance trade-off is actually acceptable for a
general purpose library function.
Lets therefore avoid any further confusion on this matter and remove it to
prevent any future accidental misuse of it. For the time being, this is
going to make hashing of flow keys a bit more expensive in the ovs case,
but future work could reevaluate a different hashing discipline.
[1] https://patchwork.ozlabs.org/patch/299369/
[2] https://patchwork.ozlabs.org/patch/418756/
Cc: Neil Brown <neilb@suse.de>
Cc: Francesco Fusco <fusco@ntop.org>
Cc: Jesse Gross <jesse@nicira.com>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull x86 MPX support from Thomas Gleixner:
"This enables support for x86 MPX.
MPX is a new debug feature for bound checking in user space. It
requires kernel support to handle the bound tables and decode the
bound violating instruction in the trap handler"
* 'x86-mpx-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
asm-generic: Remove asm-generic arch_bprm_mm_init()
mm: Make arch_unmap()/bprm_mm_init() available to all architectures
x86: Cleanly separate use of asm-generic/mm_hooks.h
x86 mpx: Change return type of get_reg_offset()
fs: Do not include mpx.h in exec.c
x86, mpx: Add documentation on Intel MPX
x86, mpx: Cleanup unused bound tables
x86, mpx: On-demand kernel allocation of bounds tables
x86, mpx: Decode MPX instruction to get bound violation information
x86, mpx: Add MPX-specific mmap interface
x86, mpx: Introduce VM_MPX to indicate that a VMA is MPX specific
x86, mpx: Add MPX to disabled features
ia64: Sync struct siginfo with general version
mips: Sync struct siginfo with general version
mpx: Extend siginfo structure to include bound violation information
x86, mpx: Rename cfg_reg_u and status_reg
x86: mpx: Give bndX registers actual names
x86: Remove arbitrary instruction size limit in instruction decoder
Pull scheduler updates from Ingo Molnar:
"The main changes in this cycle are:
- 'Nested Sleep Debugging', activated when CONFIG_DEBUG_ATOMIC_SLEEP=y.
This instruments might_sleep() checks to catch places that nest
blocking primitives - such as mutex usage in a wait loop. Such
bugs can result in hard to debug races/hangs.
Another category of invalid nesting that this facility will detect
is the calling of blocking functions from within schedule() ->
sched_submit_work() -> blk_schedule_flush_plug().
There's some potential for false positives (if secondary blocking
primitives themselves are not ready yet for this facility), but the
kernel will warn once about such bugs per bootup, so the warning
isn't much of a nuisance.
This feature comes with a number of fixes, for problems uncovered
with it, so no messages are expected normally.
- Another round of sched/numa optimizations and refinements, for
CONFIG_NUMA_BALANCING=y.
- Another round of sched/dl fixes and refinements.
Plus various smaller fixes and cleanups"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
sched: Add missing rcu protection to wake_up_all_idle_cpus
sched/deadline: Introduce start_hrtick_dl() for !CONFIG_SCHED_HRTICK
sched/numa: Init numa balancing fields of init_task
sched/deadline: Remove unnecessary definitions in cpudeadline.h
sched/cpupri: Remove unnecessary definitions in cpupri.h
sched/deadline: Fix rq->dl.pushable_tasks bug in push_dl_task()
sched/fair: Fix stale overloaded status in the busiest group finding logic
sched: Move p->nr_cpus_allowed check to select_task_rq()
sched/completion: Document when to use wait_for_completion_io_*()
sched: Update comments about CLONE_NEWUTS and CLONE_NEWIPC
sched/fair: Kill task_struct::numa_entry and numa_group::task_list
sched: Refactor task_struct to use numa_faults instead of numa_* pointers
sched/deadline: Don't check CONFIG_SMP in switched_from_dl()
sched/deadline: Reschedule from switched_from_dl() after a successful pull
sched/deadline: Push task away if the deadline is equal to curr during wakeup
sched/deadline: Add deadline rq status print
sched/deadline: Fix artificial overrun introduced by yield_task_dl()
sched/rt: Clean up check_preempt_equal_prio()
sched/core: Use dl_bw_of() under rcu_read_lock_sched()
sched: Check if we got a shallowest_idle_cpu before searching for least_loaded_cpu
...
Note that the code _using_ ->msg_iter at that point will be very
unhappy with anything other than unshifted iovec-backed iov_iter.
We still need to convert users to proper primitives.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Fix a bug where nfsd4_encode_components_esc() incorrectly calculates the
length of server array in fs_location4--note that it is a count of the
number of array elements, not a length in bytes.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Fixes: 082d4bd72a (nfsd4: "backfill" using write_bytes_to_xdr_buf)
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Fix a bug where nfsd4_encode_components_esc() includes the esc_end char as
an additional string encoding.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Cc: stable@vger.kernel.org
Fixes: e7a0444aef "nfsd: add IPv6 addr escaping to fs_location hosts"
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Bugs similar to the one in acbbe6fbb2 (kcmp: fix standard comparison
bug) are in rich supply.
In this variant, the problem is that struct xdr_netobj::len has type
unsigned int, so the expression o1->len - o2->len _also_ has type
unsigned int; it has completely well-defined semantics, and the result
is some non-negative integer, which is always representable in a long
long. But this means that if the conditional triggers, we are
guaranteed to return a positive value from compare_blob.
In this case it could be fixed by
- res = o1->len - o2->len;
+ res = (long long)o1->len - (long long)o2->len;
but I'd rather eliminate the usually broken 'return a - b;' idiom.
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Currently all svc_create callers pass in NULL for the shutdown parm,
which then gets fixed up to be svc_rpcb_cleanup if the service uses
rpcbind.
Simplify this by instead having the the only caller that requires it
(lockd) pass in svc_rpcb_cleanup and get rid of the special casing.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
In a later patch, we're going to need some atomic bit flags. Since that
field will need to be an unsigned long, we mitigate that space
consumption by migrating some other bitflags to the new field. Start
with the rq_secure flag.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Highlights include:
Features:
- NFSv4.2 client support for hole punching and preallocation.
- Further RPC/RDMA client improvements.
- Add more RPC transport debugging tracepoints.
- Add RPC debugging tools in debugfs.
Bugfixes:
- Stable fix for layoutget error handling
- Fix a change in COMMIT behaviour resulting from the recent io code updates
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUhRVTAAoJEGcL54qWCgDyfeUP/RoFo3ImTMbGxfcPJqoELjcO
lZbQ+27pOE/whFDkWgiOVTwlgGct5a0WRo7GCZmpYJA4q1kmSv4ngTb3nMTCUztt
xMJ0mBr0BqttVs+ouKiVPm3cejQXedEhttwWcloIXS8lNenlpL29Zlrx2NHdU8UU
13+souocj0dwIyTYYS/4Lm9KpuCYnpDBpP5ShvQjVaMe/GxJo6GyZu70c7FgwGNz
Nh9onzZV3mz1elhfizlV38aVA7KWVXtLWIqOFIKlT2fa4nWB8Hc07miR5UeOK0/h
r+icnF2qCQe83MbjOxYNxIKB6uiA/4xwVc90X4AQ7F0RX8XPWHIQWG5tlkC9jrCQ
3RGzYshWDc9Ud2mXtLMyVQxHVVYlFAe1WtdP8ZWb1oxDInmhrarnWeNyECz9xGKu
VzIDZzeq9G8slJXATWGRfPsYr+Ihpzcen4QQw58cakUBcqEJrYEhlEOfLovM71k3
/S/jSHBAbQqiw4LPMw87bA5A6+ZKcVSsNE0XCtNnhmqFpLc1kKRrl5vaN+QMk5tJ
v4/zR0fPqH7SGAJWYs4brdfahyejEo0TwgpDs7KHmu1W9zQ0LCVTaYnQuUmQjta6
WyYwIy3TTibdfR191O0E3NOW82Q/k/NBD6ySvabN9HqQ9eSk6+rzrWAslXCbYohb
BJfzcQfDdx+lsyhjeTx9
=wOP3
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-3.19-1' into nfsd for-3.19 branch
Mainly what I need is 860a0d9e51 "sunrpc: add some tracepoints in
svc_rqst handling functions", which subsequent server rpc patches from
jlayton depend on. I'm merging this later tag on the assumption that's
more likely to be a tested and stable point.
To improve recovery speed, f2fs try to readahead many contiguous blocks in warm
node segment, but for most time, abnormal power-off do not occur frequently, so
when mount a normal power-off f2fs image, by contrary ra so many blocks and then
invalid them will hurt the performance of mount.
It's better to just ra the first next-block for normal condition.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch does cleanup work, it introduces is_valid_blkaddr() to include
verification code for blkaddr with upper and down boundary value which were in
ra_meta_pages previous.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
1.We use zero as upper boundary value for ra SSA/CP blocks, we will skip
readahead as verification failure with max number, it causes low performance.
2.Low boundary value is not accurate for SSA/CP/POR region verification, so
these values need to be redefined.
This patch fixes above issues.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As inline_{dir,inode} stat is increased/decreased concurrently by multi threads,
so the value is not so accurate, let's use atomic type for counting accurately.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Added some commentaries for code readability and cleaned up if-statement
clearly.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If inode state is dirty, go straight to write.
Suggested-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The inmemory pages should be handled by invalidate_page since it needs to be
released int the truncation path.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In do_read_inode, if we failed __recover_inline_status, the inode has inline
flag without increasing its count.
Later, f2fs_evict_inode will decrease the count, which causes -1.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch revists retrial paths in f2fs.
The basic idea is to use cond_resched instead of retrying from the very early
stage.
Suggested-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
file->private_data can never be null after calling initiate_cifs_search.
So private null check condition is not needed.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: Steve French <steve.french@primarydata.com>
A user complained that they were unable to login to their cifs share
after a kernel update. From the wiretrace we can see that the server
returns different UIDs as response to NTLMSSP_NEGOTIATE and NTLMSSP_AUTH
phases.
With changes in the authentication code, we no longer set the
cifs_sess->Suid returned in response to the NTLM_AUTH phase and continue
to use the UID sent in response to the NTLMSSP_NEGOTIATE phase. This
results in the server denying access to the user when the user attempts
to do a tcon connect.
See https://bugzilla.redhat.com/show_bug.cgi?id=1163927
A test kernel containing patch was tested successfully by the user.
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Steve French <steve.french@primarydata.com>
The useful macros embed message level in the name. Thus, it cleans up the code
a bit. In cases when it was plain printk() the conversion was done to info
level.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Steve French <steve.french@primarydata.com>
This patch converts custom dumper to use native print_hex_dump() instead. The
cifs_dump_mem() will have an offsets per each line which differs it from the
original code.
In the dump_smb() we may use native print_hex_dump() as well. It will show
slightly different output in ASCII part when character is unprintable,
otherwise it keeps same structure.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Steve French <steve.french@primarydata.com>
Meanwhile it cleans up the code, the behaviour is slightly changed. In case of
providing non-boolean value it will fails with corresponding error. In the
original code the attempt of an update was just ignored in such case.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Jeff Layton <jlayton@poochiereds.net>
Reviewed-by: Alexander Bokovoy <ab@samba.org>
Signed-off-by: Steve French <steve.french@primarydata.com>
Add missing defines needed for ACL query support.
For definitions of these security info type additionalinfo flags
and also the EA Flags see MS-SMB2 (2.2.37) or MS-DTYP
Signed-of-by: Steven French <smfrench@gmail.com>
Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
In many cases the simple fallocate call is
a no op (since the file is already not sparse) or
can simply be converted from a sparse to a non-sparse
file if we are fallocating the whole file and keeping
the size.
Signed-off-by: Steven French <smfrench@gmail.com>
This patch tries to fix:
BUG: using smp_processor_id() in preemptible [00000000] code: f2fs_gc-254:0/384
(radix_tree_node_alloc+0x14/0x74) from [<c033d8a0>] (radix_tree_insert+0x110/0x200)
(radix_tree_insert+0x110/0x200) from [<c02e8264>] (gc_data_segment+0x340/0x52c)
(gc_data_segment+0x340/0x52c) from [<c02e8658>] (f2fs_gc+0x208/0x400)
(f2fs_gc+0x208/0x400) from [<c02e8a98>] (gc_thread_func+0x248/0x28c)
(gc_thread_func+0x248/0x28c) from [<c0139944>] (kthread+0xa0/0xac)
(kthread+0xa0/0xac) from [<c0105ef8>] (ret_from_fork+0x14/0x3c)
The reason is that f2fs calls radix_tree_insert under enabled preemption.
So, before calling it, we need to call radix_tree_preload.
Otherwise, we should use _GFP_WAIT for the radix tree, and use mutex or
semaphore to cover the radix tree operations.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
a) make get_proc_ns() return a pointer to struct ns_common
b) mirror ns_ops in dentry->d_fsdata of ns dentries, so that
is_mnt_ns_file() could get away with fewer dereferences.
That way struct proc_ns becomes invisible outside of fs/proc/*.c
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
for now - just move corresponding ->proc_inum instances over there
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Previoulsy, we used rwlock for nat_entry lock.
But, now we have a lot of complex operations in set_node_addr.
(e.g., allocating kernel memories, handling radix_trees, and so on)
So, this patches tries to change spinlock to rw_semaphore to give CPUs to other
threads.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
XFS traditionally sends all buffer I/O completion work to a single
workqueue. This includes metadata buffer completion and log buffer
completion. The log buffer completion requires a high priority queue to
prevent stalls due to log forces getting stuck behind other queued work.
Rather than continue to prioritize all buffer I/O completion due to the
needs of log completion, split log buffer completion off to
m_log_workqueue and move the high priority flag from m_buf_workqueue to
m_log_workqueue.
Add a b_ioend_wq wq pointer to xfs_buf to allow completion workqueue
customization on a per-buffer basis. Initialize b_ioend_wq to
m_buf_workqueue by default in the generic buffer I/O submission path.
Finally, override the default wq with the high priority m_log_workqueue
in the log buffer I/O submission path.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The kernel compile doesn't turn on these checks by default, so it's
only when I do a kernel-user sync that I find that there are lots of
compiler warnings waiting to be fixed. Fix up these set-but-unused
warnings.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
These are currently considered private to libxfs, but they are
widely used by the userspace code to decode, walk and check
directory structures. Hence they really form part of the external
API and as such need to bemoved to xfs_dir2.h.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
These functions are needed in userspace for repair and mkfs to
do the right thing. Move them to libxfs so they can be easily
shared.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
There's a case in that code where it checks for a buffer match in a
transaction where the buffer is not marked done. i.e. trying to
catch a buffer we have locked in the transaction but have not
completed IO on.
The only way we can find a buffer that has not had IO completed on
it is if it had readahead issued on it, but we never do readahead on
buffers that we have already joined into a transaction. Hence this
condition cannot occur, and buffers locked and joined into a
transaction should always be marked done and not under IO.
Remove this code and re-order xfs_trans_read_buf_map() to remove
duplicated IO dispatch and error handling code.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
vn_active only ever gets decremented, so it has a very large
negative number. Make it track the inode count we currently have
allocated properly so we can easily track the size of the inode
cache via tools like PCP.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
xfs_bmse_merge() has a jump label for return that just returns the
error value. Convert all the code to just return the error directly
and use XFS_WANT_CORRUPTED_RETURN. This also allows the final call
to xfs_bmbt_update() to return directly.
Noticed while reviewing coccinelle return cleanup patches and
wondering why the same return pattern as in xfs_bmse_shift_one()
wasn't picked up by the checker pattern...
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
xfs_bmse_shift_one() jumps around determining whether to shift or
merge, making the code flow difficult to follow. Clean it up and
use direct error returns (including XFS_WANT_CORRUPTED_RETURN) to
make the code flow better and be easier to read.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
After growing a filesystem, XFS can fail to allocate inodes even
though there is a large amount of space available in the filesystem
for inodes. The issue is caused by a nearly full allocation group
having enough free space in it to be considered for inode
allocation, but not enough contiguous free space to actually
allocation inodes. This situation results in successful selection
of the AG for allocation, then failure of the allocation resulting
in ENOSPC being reported to the caller.
It is caused by two possible issues. Firstly, we only consider the
lognest free extent and whether it would fit an inode chunk. If the
extent is not correctly aligned, then we can't allocate an inode
chunk in it regardless of the fact that it is large enough. This
tends to be a permanent error until space in the AG is freed.
The second issue is that we don't actually lock the AGI or AGF when
we are doing these checks, and so by the time we get to actually
allocating the inode chunk the space we thought we had in the AG may
have been allocated. This tends to be a spurious error as it
requires a race to trigger. Hence this case is ignored in this patch
as the reported problem is for permanent errors.
The first issue could be addressed by simply taking into account the
alignment when checking the longest extent. This, however, would
prevent allocation in AGs that have aligned, exact sized extents
free. However, this case should be fairly rare compared to the
number of allocations that occur near ENOSPC that would trigger this
condition.
Hence, when selecting the inode AG, take into account the inode
cluster alignment when checking the lognest free extent in the AG.
If we can't find any AGs with a contiguous free space large
enough to be aligned, drop the alignment addition and just try for
an AG that has enough contiguous free space available for an inode
chunk. This won't prevent issues from occurring, but should avoid
situations where other AGs have lots of free space but the selected
AG can't allocate due to alignment constraints.
Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
If extsize is set and new_last_fsb is larger than 32 bits, the
roundup to extsize will overflow the align variable. Instead,
combine alignments by rounding stripe size up to extsize.
Signed-off-by: Peter Watkins <treestem@gmail.com>
Reviewed-by: Nathaniel W. Turner <nate@houseofnate.net>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
a) don't bother with ->d_time for positives - we only check it for
negatives anyway.
b) make sure to set it at unlink and rmdir time - at *that* point
soon-to-be negative dentry matches then-current directory contents
c) don't go into renaming of old alias in vfat_lookup() unless it
has the same parent (which it will, unless we are seeing corrupted
image)
[hirofumi@mail.parknet.co.jp: make change minimum, don't call d_move() for dir]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: <stable@vger.kernel.org> [3.17.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This was written when we didn't do a caching control for the fast free space
cache loading. However we started doing that a long time ago, and there is
still a small window of time that we could be caching the block group the fast
way, so if there is a caching_ctl at all on the block group just return it, the
callers all wait properly for what they want. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
On block group remove if the corresponding extent map was on the
transaction->pending_chunks list, we were deleting the extent map
from that list, through remove_extent_mapping(), without any
synchronization with chunk allocation (which iterates that list
and adds new elements to it). Fix this by ensure that this is done
while the chunk mutex is held, since that's the mutex that protects
the list in the chunk allocation code path.
This applies on top (depends on) of my previous patch titled:
"Btrfs: fix race between fs trimming and block group remove/allocation"
But the issue in fact was already present before that change, it only
became easier to hit after Josef's 3.18 patch that added automatic
removal of empty block groups.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
On chunk allocation error (label "error_del_extent"), after adding the
extent map to the tree and to the pending chunks list, we would leave
decrementing the extent map's refcount by 2 instead of 3 (our allocation
+ tree reference + list reference).
Also, on chunk/block group removal, if the block group was on the list
pending_chunks we weren't decrementing the respective list reference.
Detected by 'rmmod btrfs':
[20770.105881] kmem_cache_destroy btrfs_extent_map: Slab cache still has objects
[20770.106127] CPU: 2 PID: 11093 Comm: rmmod Tainted: G W L 3.17.0-rc5-btrfs-next-1+ #1
[20770.106128] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[20770.106130] 0000000000000000 ffff8800ba867eb8 ffffffff813e7a13 ffff8800a2e11040
[20770.106132] ffff8800ba867ed0 ffffffff81105d0c 0000000000000000 ffff8800ba867ee0
[20770.106134] ffffffffa035d65e ffff8800ba867ef0 ffffffffa03b0654 ffff8800ba867f78
[20770.106136] Call Trace:
[20770.106142] [<ffffffff813e7a13>] dump_stack+0x45/0x56
[20770.106145] [<ffffffff81105d0c>] kmem_cache_destroy+0x4b/0x90
[20770.106164] [<ffffffffa035d65e>] extent_map_exit+0x1a/0x1c [btrfs]
[20770.106176] [<ffffffffa03b0654>] exit_btrfs_fs+0x27/0x9d3 [btrfs]
[20770.106179] [<ffffffff8109dc97>] SyS_delete_module+0x153/0x1c4
[20770.106182] [<ffffffff8121261b>] ? trace_hardirqs_on_thunk+0x3a/0x3c
[20770.106184] [<ffffffff813ebf52>] system_call_fastpath+0x16/0x1b
This applies on top (depends on) of my previous patch titled:
"Btrfs: fix race between fs trimming and block group remove/allocation"
But the issue in fact was already present before that change, it only
became easier to hit after Josef's 3.18 patch that added automatic
removal of empty block groups.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
There was a free space entry structure memeory leak if a block
group is remove while a free space entry is being trimmed, which
the following diagram explains:
CPU 1 CPU 2
btrfs_trim_block_group()
trim_no_bitmap()
remove free space entry from
block group cache's rbtree
do_trimming()
btrfs_remove_block_group()
btrfs_remove_free_space_cache()
add back free space entry to
block group's cache rbtree
btrfs_put_block_group()
(...)
btrfs_put_block_group()
kfree(bg->free_space_ctl)
kfree(bg)
The free space entry added after doing the discard of its respective
range ends up never being freed.
Detected after doing an "rmmod btrfs" after running the stress test
recently submitted for fstests:
[ 8234.642212] kmem_cache_destroy btrfs_free_space: Slab cache still has objects
[ 8234.642657] CPU: 1 PID: 32276 Comm: rmmod Tainted: G W L 3.17.0-rc5-btrfs-next-2+ #1
[ 8234.642660] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[ 8234.642664] 0000000000000000 ffff8801af1b3eb8 ffffffff8140c7b6 ffff8801dbedd0c0
[ 8234.642670] ffff8801af1b3ed0 ffffffff811149ce 0000000000000000 ffff8801af1b3ee0
[ 8234.642676] ffffffffa042dbe7 ffff8801af1b3ef0 ffffffffa0487422 ffff8801af1b3f78
[ 8234.642682] Call Trace:
[ 8234.642692] [<ffffffff8140c7b6>] dump_stack+0x4d/0x66
[ 8234.642699] [<ffffffff811149ce>] kmem_cache_destroy+0x4d/0x92
[ 8234.642731] [<ffffffffa042dbe7>] btrfs_destroy_cachep+0x63/0x76 [btrfs]
[ 8234.642757] [<ffffffffa0487422>] exit_btrfs_fs+0x9/0xbe7 [btrfs]
[ 8234.642762] [<ffffffff810a76a5>] SyS_delete_module+0x155/0x1c6
[ 8234.642768] [<ffffffff8122a7eb>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 8234.642773] [<ffffffff814122d2>] system_call_fastpath+0x16/0x1b
This applies on top (depends on) of my previous patch titled:
"Btrfs: fix race between fs trimming and block group remove/allocation"
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
If the transaction handle doesn't have used blocks but has created new block
groups make sure we turn the fs into readonly mode too. This is because the
new block groups didn't get all their metadata persisted into the chunk and
device trees, and therefore if a subsequent transaction starts, allocates
space from the new block groups, writes data or metadata into that space,
commits successfully and then after we unmount and mount the filesystem
again, the same space can be allocated again for a new block group,
resulting in file data or metadata corruption.
Example where we don't abort the transaction when we fail to finish the
chunk allocation (add items to the chunk and device trees) and later a
future transaction where the block group is removed fails because it can't
find the chunk item in the chunk tree:
[25230.404300] WARNING: CPU: 0 PID: 7721 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x50/0xfc [btrfs]()
[25230.404301] BTRFS: Transaction aborted (error -28)
[25230.404302] Modules linked in: btrfs dm_flakey nls_utf8 fuse xor raid6_pq ntfs vfat msdos fat xfs crc32c_generic libcrc32c ext3 jbd ext2 dm_mod nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse i2c_piix4 i2ccore parport_pc parport processor button pcspkr serio_raw thermal_sys evdev microcode ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy e1000 ata_piix libata virtio_pci virtio_ring scsi_mod virtio [last unloaded: btrfs]
[25230.404325] CPU: 0 PID: 7721 Comm: xfs_io Not tainted 3.17.0-rc5-btrfs-next-1+ #1
[25230.404326] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[25230.404328] 0000000000000000 ffff88004581bb08 ffffffff813e7a13 ffff88004581bb50
[25230.404330] ffff88004581bb40 ffffffff810423aa ffffffffa049386a 00000000ffffffe4
[25230.404332] ffffffffa05214c0 000000000000240c ffff88010fc8f800 ffff88004581bba8
[25230.404334] Call Trace:
[25230.404338] [<ffffffff813e7a13>] dump_stack+0x45/0x56
[25230.404342] [<ffffffff810423aa>] warn_slowpath_common+0x7f/0x98
[25230.404351] [<ffffffffa049386a>] ? __btrfs_abort_transaction+0x50/0xfc [btrfs]
[25230.404353] [<ffffffff8104240b>] warn_slowpath_fmt+0x48/0x50
[25230.404362] [<ffffffffa049386a>] __btrfs_abort_transaction+0x50/0xfc [btrfs]
[25230.404374] [<ffffffffa04a8c43>] btrfs_create_pending_block_groups+0x10c/0x135 [btrfs]
[25230.404387] [<ffffffffa04b77fd>] __btrfs_end_transaction+0x7e/0x2de [btrfs]
[25230.404398] [<ffffffffa04b7a6d>] btrfs_end_transaction+0x10/0x12 [btrfs]
[25230.404408] [<ffffffffa04a3d64>] btrfs_check_data_free_space+0x111/0x1f0 [btrfs]
[25230.404421] [<ffffffffa04c53bd>] __btrfs_buffered_write+0x160/0x48d [btrfs]
[25230.404425] [<ffffffff811a9268>] ? cap_inode_need_killpriv+0x2d/0x37
[25230.404429] [<ffffffff810f6501>] ? get_page+0x1a/0x2b
[25230.404441] [<ffffffffa04c7c95>] btrfs_file_write_iter+0x321/0x42f [btrfs]
[25230.404443] [<ffffffff8110f5d9>] ? handle_mm_fault+0x7f3/0x846
[25230.404446] [<ffffffff813e98c5>] ? mutex_unlock+0x16/0x18
[25230.404449] [<ffffffff81138d68>] new_sync_write+0x7c/0xa0
[25230.404450] [<ffffffff81139401>] vfs_write+0xb0/0x112
[25230.404452] [<ffffffff81139c9d>] SyS_pwrite64+0x66/0x84
[25230.404454] [<ffffffff813ebf52>] system_call_fastpath+0x16/0x1b
[25230.404455] ---[ end trace 5aa5684fdf47ab38 ]---
[25230.404458] BTRFS warning (device sdc): btrfs_create_pending_block_groups:9228: Aborting unused transaction(No space left).
[25288.084814] BTRFS: error (device sdc) in btrfs_free_chunk:2509: errno=-2 No such entry (Failed lookup while freeing chunk.)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Trimming is completely transactionless, and the way it operates consists
of hiding free space entries from a block group, perform the trim/discard
and then make the free space entries visible again.
Therefore while a free space entry is being trimmed, we can have free space
cache writing running in parallel (as part of a transaction commit) which
will miss the free space entry. This means that an unmount (or crash/reboot)
after that transaction commit and mount again before another transaction
starts/commits after the discard finishes, we will have some free space
that won't be used again unless the free space cache is rebuilt. After the
unmount, fsck (btrfsck, btrfs check) reports the issue like the following
example:
*** fsck.btrfs output ***
checking extents
checking free space cache
There is no free space entry for 521764864-521781248
There is no free space entry for 521764864-1103101952
cache appears valid but isnt 29360128
Checking filesystem on /dev/sdc
UUID: b4789e27-4774-4626-98e9-ae8dfbfb0fb5
found 1235681286 bytes used err is -22
(...)
Another issue caused by this race is a crash while writing bitmap entries
to the cache, because while the cache writeout task accesses the bitmaps,
the trim task can be concurrently modifying the bitmap or worse might
be freeing the bitmap. The later case results in the following crash:
[55650.804460] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
[55650.804835] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop parport_pc parport i2c_piix4 psmouse evdev pcspkr microcode processor i2ccore serio_raw thermal_sys button ext4 crc16 jbd2 mbcache sg sd_mod crc_t10dif sr_mod cdrom crct10dif_generic crct10dif_common ata_generic virtio_scsi floppy ata_piix libata virtio_pci virtio_ring virtio scsi_mod e1000 [last unloaded: btrfs]
[55650.806169] CPU: 1 PID: 31002 Comm: btrfs-transacti Tainted: G W 3.17.0-rc5-btrfs-next-1+ #1
[55650.806493] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[55650.806867] task: ffff8800b12f6410 ti: ffff880071538000 task.ti: ffff880071538000
[55650.807166] RIP: 0010:[<ffffffffa037cf45>] [<ffffffffa037cf45>] write_bitmap_entries+0x65/0xbb [btrfs]
[55650.807514] RSP: 0018:ffff88007153bc30 EFLAGS: 00010246
[55650.807687] RAX: 000000005d1ec000 RBX: ffff8800a665df08 RCX: 0000000000000400
[55650.807885] RDX: ffff88005d1ec000 RSI: 6b6b6b6b6b6b6b6b RDI: ffff88005d1ec000
[55650.808017] RBP: ffff88007153bc58 R08: 00000000ddd51536 R09: 00000000000001e0
[55650.808017] R10: 0000000000000000 R11: 0000000000000037 R12: 6b6b6b6b6b6b6b6b
[55650.808017] R13: ffff88007153bca8 R14: 6b6b6b6b6b6b6b6b R15: ffff88007153bc98
[55650.808017] FS: 0000000000000000(0000) GS:ffff88023ec80000(0000) knlGS:0000000000000000
[55650.808017] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[55650.808017] CR2: 0000000002273b88 CR3: 00000000b18f6000 CR4: 00000000000006e0
[55650.808017] Stack:
[55650.808017] ffff88020e834e00 ffff880172d68db0 0000000000000000 ffff88019257c800
[55650.808017] ffff8801d42ea720 ffff88007153bd10 ffffffffa037d2fa ffff880224e99180
[55650.808017] ffff8801469a6188 ffff880224e99140 ffff880172d68c50 00000003000000b7
[55650.808017] Call Trace:
[55650.808017] [<ffffffffa037d2fa>] __btrfs_write_out_cache+0x1ea/0x37f [btrfs]
[55650.808017] [<ffffffffa037d959>] btrfs_write_out_cache+0xa1/0xd8 [btrfs]
[55650.808017] [<ffffffffa033936b>] btrfs_write_dirty_block_groups+0x4b5/0x505 [btrfs]
[55650.808017] [<ffffffffa03aa98e>] commit_cowonly_roots+0x15e/0x1f7 [btrfs]
[55650.808017] [<ffffffff813eb9c7>] ? _raw_spin_lock+0xe/0x10
[55650.808017] [<ffffffffa0346e46>] btrfs_commit_transaction+0x411/0x882 [btrfs]
[55650.808017] [<ffffffffa03432a4>] transaction_kthread+0xf2/0x1a4 [btrfs]
[55650.808017] [<ffffffffa03431b2>] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
[55650.808017] [<ffffffff8105966b>] kthread+0xb7/0xbf
[55650.808017] [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
[55650.808017] [<ffffffff813ebeac>] ret_from_fork+0x7c/0xb0
[55650.808017] [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
[55650.808017] Code: 4c 89 ef 8d 70 ff e8 d4 fc ff ff 41 8b 45 34 41 39 45 30 7d 5c 31 f6 4c 89 ef e8 80 f6 ff ff 49 8b 7d 00 4c 89 f6 b9 00 04 00 00 <f3> a5 4c 89 ef 41 8b 45 30 8d 70 ff e8 a3 fc ff ff 41 8b 45 34
[55650.808017] RIP [<ffffffffa037cf45>] write_bitmap_entries+0x65/0xbb [btrfs]
[55650.808017] RSP <ffff88007153bc30>
[55650.815725] ---[ end trace 1c032e96b149ff86 ]---
Fix this by serializing both tasks in such a way that cache writeout
doesn't wait for the trim/discard of free space entries to finish and
doesn't miss any free space entry.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Our fs trim operation, which is completely transactionless (doesn't start
or joins an existing transaction) consists of visiting all block groups
and then for each one to iterate its free space entries and perform a
discard operation against the space range represented by the free space
entries. However before performing a discard, the corresponding free space
entry is removed from the free space rbtree, and when the discard completes
it is added back to the free space rbtree.
If a block group remove operation happens while the discard is ongoing (or
before it starts and after a free space entry is hidden), we end up not
waiting for the discard to complete, remove the extent map that maps
logical address to physical addresses and the corresponding chunk metadata
from the the chunk and device trees. After that and before the discard
completes, the current running transaction can finish and a new one start,
allowing for new block groups that map to the same physical addresses to
be allocated and written to.
So fix this by keeping the extent map in memory until the discard completes
so that the same physical addresses aren't reused before it completes.
If the physical locations that are under a discard operation end up being
used for a new metadata block group for example, and dirty metadata extents
are written before the discard finishes (the VM might call writepages() of
our btree inode's i_mapping for example, or an fsync log commit happens) we
end up overwriting metadata with zeroes, which leads to errors from fsck
like the following:
checking extents
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
read block failed check_tree_block
owner ref check failed [833912832 16384]
Errors found in extent allocation tree or chunk allocation
checking free space cache
checking fs roots
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
read block failed check_tree_block
root 5 root dir 256 error
root 5 inode 260 errors 2001, no inode item, link count wrong
unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref
root 5 inode 262 errors 2001, no inode item, link count wrong
unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref
root 5 inode 263 errors 2001, no inode item, link count wrong
(...)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
There's a race between adding a block group to the list of the unused
block groups and removing an unused block group (cleaner kthread) that
leads to freeing extents that are in use or a crash during transaction
commmit. Basically the cleaner kthread, when executing
btrfs_delete_unused_bgs(), might catch the newly added block group to
the list fs_info->unused_bgs and clear the range representing the whole
group from fs_info->freed_extents[] before the task that added the block
group to the list (running update_block_group()) marked the last freed
extent as dirty in fs_info->freed_extents (pinned_extents).
That is:
CPU 1 CPU 2
btrfs_delete_unused_bgs()
update_block_group()
add block group to
fs_info->unused_bgs
got block group from the list
clear_extent_bits for the whole
block group range in freed_extents[]
set_extent_dirty for the
range covering the freed
extent in freed_extents[]
(fs_info->pinned_extents)
block group deleted, and a new block
group with the same logical address is
created
reserve space from the new block group
for new data or metadata - the reserved
space overlaps the range specified by
CPU 1 for set_extent_dirty()
commit transaction
find all ranges marked as dirty in
fs_info->pinned_extents, clear them
and add them to the free space cache
Alternatively, if CPU 2 doesn't create a new block group with the same
logical address, we get a crash/BUG_ON at transaction commit when unpining
extent ranges because we can't find a block group for the range marked as
dirty by CPU 1. Sample trace:
[ 2163.426462] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 2163.426640] Modules linked in: btrfs xor raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio crc32c_generic libcrc32c dm_mod nfsd auth_rpc
gss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse parport_pc parport i2c_piix4 processor thermal_sys i2ccore evdev button pcspkr microcode serio_raw ext4 crc16 jbd2 mbcache
sg sr_mod cdrom sd_mod crc_t10dif crct10dif_generic crct10dif_common ata_generic virtio_scsi floppy ata_piix libata e1000 scsi_mod virtio_pci virtio_ring virtio
[ 2163.428209] CPU: 0 PID: 11858 Comm: btrfs-transacti Tainted: G W 3.17.0-rc5-btrfs-next-1+ #1
[ 2163.428519] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[ 2163.428875] task: ffff88009f2c0650 ti: ffff8801356bc000 task.ti: ffff8801356bc000
[ 2163.429157] RIP: 0010:[<ffffffffa037728e>] [<ffffffffa037728e>] unpin_extent_range.isra.58+0x62/0x192 [btrfs]
[ 2163.429562] RSP: 0018:ffff8801356bfda8 EFLAGS: 00010246
[ 2163.429802] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 2163.429990] RDX: 0000000041bfffff RSI: 0000000001c00000 RDI: ffff880024307080
[ 2163.430042] RBP: ffff8801356bfde8 R08: 0000000000000068 R09: ffff88003734f118
[ 2163.430042] R10: ffff8801356bfcb8 R11: fffffffffffffb69 R12: ffff8800243070d0
[ 2163.430042] R13: 0000000083c04000 R14: ffff8800751b0f00 R15: ffff880024307000
[ 2163.430042] FS: 0000000000000000(0000) GS:ffff88013f400000(0000) knlGS:0000000000000000
[ 2163.430042] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 2163.430042] CR2: 00007ff10eb43fc0 CR3: 0000000004cb8000 CR4: 00000000000006f0
[ 2163.430042] Stack:
[ 2163.430042] ffff8800243070d0 0000000083c08000 0000000083c07fff ffff88012d6bc800
[ 2163.430042] ffff8800243070d0 ffff8800751b0f18 ffff8800751b0f00 0000000000000000
[ 2163.430042] ffff8801356bfe18 ffffffffa037a481 0000000083c04000 0000000083c07fff
[ 2163.430042] Call Trace:
[ 2163.430042] [<ffffffffa037a481>] btrfs_finish_extent_commit+0xac/0xbf [btrfs]
[ 2163.430042] [<ffffffffa038c06d>] btrfs_commit_transaction+0x6ee/0x882 [btrfs]
[ 2163.430042] [<ffffffffa03881f1>] transaction_kthread+0xf2/0x1a4 [btrfs]
[ 2163.430042] [<ffffffffa03880ff>] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
[ 2163.430042] [<ffffffff8105966b>] kthread+0xb7/0xbf
[ 2163.430042] [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
[ 2163.430042] [<ffffffff813ebeac>] ret_from_fork+0x7c/0xb0
[ 2163.430042] [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
So fix this by making update_block_group() first set the range as dirty
in pinned_extents before adding the block group to the unused_bgs list.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
If we grab a block group, for example in btrfs_trim_fs(), we will be holding
a reference on it but the block group can be removed after we got it (via
btrfs_remove_block_group), which means it will no longer be part of the
rbtree.
However, btrfs_remove_block_group() was only calling rb_erase() which leaves
the block group's rb_node left and right child pointers with the same content
they had before calling rb_erase. This was dangerous because a call to
next_block_group() would access the node's left and right child pointers (via
rb_next), which can be no longer valid.
Fix this by clearing a block group's node after removing it from the tree,
and have next_block_group() do a tree search to get the next block group
instead of using rb_next() if our block group was removed.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
The commit c404e0dc (Btrfs: fix use-after-free in the finishing
procedure of the device replace) fixed a use-after-free problem
which happened when removing the source device at the end of device
replace, but at that time, btrfs didn't support device replace
on raid56, so we didn't fix the problem on the raid56 profile.
Currently, we implemented device replace for raid56, so we need
kick that problem out before we enable that function for raid56.
The fix method is very simple, we just increase the bio per-cpu
counter before we submit a raid56 io, and decrease the counter
when the raid56 io ends.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
This function reused the code of parity scrub, and we just write
the right parity or corrected parity into the target device before
the parity scrub end.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
The implementation is simple:
- In order to avoid changing the code logic of btrfs_map_bio and
RAID56, we add the stripes of the replace target devices at the
end of the stripe array in btrfs bio, and we sort those target
device stripes in the array. And we keep the number of the target
device stripes in the btrfs bio.
- Except write operation on RAID56, all the other operation don't
take the target device stripes into account.
- When we do write operation, we read the data from the common devices
and calculate the parity. Then write the dirty data and new parity
out, at this time, we will find the relative replace target stripes
and wirte the relative data into it.
Note: The function that copying old data on the source device to
the target device was implemented in the past, it is similar to
the other RAID type.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
The implementation is:
- Read and check all the data with checksum in the same stripe.
All the data which has checksum is COW data, and we are sure
that it is not changed though we don't lock the stripe. because
the space of that data just can be reclaimed after the current
transction is committed, and then the fs can use it to store the
other data, but when doing scrub, we hold the current transaction,
that is that data can not be recovered, it is safe that read and check
it out of the stripe lock.
- Lock the stripe
- Read out all the data without checksum and parity
The data without checksum and the parity may be changed if we don't
lock the stripe, so we need read it in the stripe lock context.
- Check the parity
- Re-calculate the new parity and write back it if the old parity
is not right
- Unlock the stripe
If we can not read out the data or the data we read is corrupted,
we will try to repair it. If the repair fails. we will mark the
horizontal sub-stripe(pages on the same horizontal) as corrupted
sub-stripe, and we will skip the parity check and repair of that
horizontal sub-stripe.
And in order to skip the horizontal sub-stripe that has no data, we
introduce a bitmap. If there is some data on the horizontal sub-stripe,
we will the relative bit to 1, and when we check and repair the
parity, we will skip those horizontal sub-stripes that the relative
bits is 0.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
We will introduce new operation type later, if we still use integer
variant as bool variant to record the operation type, we would add new
variant and increase the size of raid bio structure. It is not good,
by this patch, we define different number for different operation,
and we can just use a variant to record the operation type.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
This patch implement the RAID5/6 common data repair function, the
implementation is similar to the scrub on the other RAID such as
RAID1, the differentia is that we don't read the data from the
mirror, we use the data repair function of RAID5/6.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Because we will reuse bbio and raid_map during the scrub later, it is
better that we don't change any variant of bbio and don't free it at
the end of IO request. So we introduced similar variants into the raid
bio, and don't access those bbio's variants any more.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
stripe_index's value was set again in latter line:
stripe_index = 0;
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
bbio_ret in this condition is always !NULL because previous code
already have a check-and-skip:
4908 if (!bbio_ret)
4909 goto out;
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
It is ridiculous practice to scan inode block by block, this technique
applicable only for old indirect files. This takes significant amount
of time for really large files. Let's reuse ext4_fiemap which already
traverse inode-tree in most optimal meaner.
TESTCASE:
ftruncate64(fd, 0);
ftruncate64(fd, 1ULL << 40);
/* lseek will spin very long time */
lseek64(fd, 0, SEEK_DATA);
lseek64(fd, 0, SEEK_HOLE);
Original report: https://lkml.org/lkml/2014/10/16/620
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently ext4_inline_data_fiemap ignores requested arguments (start
and len) which may lead endless loop if start != 0. Also fix incorrect
extent length determination.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
ext4_da_convert_inline_data_to_extent() invokes
grab_cache_page_write_begin(). grab_cache_page_write_begin performs
memory allocation, so fs-reentrance should be prohibited because we
are inside journal transaction.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If there are many inodes that have data blocks in victim segment,
it takes long time to find a inode in gc_inode list.
Let's use radix_tree to reduce lookup time.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When inspecting the pivot_root and the current mount expiry logic I
realized that pivot_root fails to clear like mount move does.
Add the missing line in case someone does the interesting feat of
moving an expirable submount. This gives a strong guarantee that root
of the filesystem tree will never expire.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Clear MNT_LOCKED in the callers of copy_tree except copy_mnt_ns, and
collect_mounts. In copy_mnt_ns it is necessary to create an exact
copy of a mount tree, so not clearing MNT_LOCKED is important.
Similarly collect_mounts is used to take a snapshot of the mount tree
for audit logging purposes and auditing using a faithful copy of the
tree is important.
This becomes particularly significant when we start setting MNT_LOCKED
on rootfs to prevent it from being unmounted.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Forced unmount affects not just the mount namespace but the underlying
superblock as well. Restrict forced unmount to the global root user
for now. Otherwise it becomes possible a user in a less privileged
mount namespace to force the shutdown of a superblock of a filesystem
in a more privileged mount namespace, allowing a DOS attack on root.
Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Now that remount is properly enforcing the rule that you can't remove
nodev at least sandstorm.io is breaking when performing a remount.
It turns out that there is an easy intuitive solution implicitly
add nodev on remount when nodev was implicitly added on mount.
Tested-by: Cedric Bosdonnat <cbosdonnat@suse.com>
Tested-by: Richard Weinberger <richard@nod.at>
Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
When we're enabling journal features, we cannot use the predicate
jbd2_journal_has_csum_v2or3() because we haven't yet set the sb
feature flag fields! Moreover, we just finished loading the shash
driver, so the test is unnecessary; calculate the seed always.
Without this patch, we fail to initialize the checksum seed the first
time we turn on journal_checksum, which means that all journal blocks
written during that first mount are corrupt. Transactions written
after the second mount will be fine, since the feature flag will be
set in the journal superblock. xfstests generic/{034,321,322} are the
regression tests.
(This is important for 3.18.)
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.coM>
Reported-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
We've already made fi and sbi for inode. Let's avoid duplicated work.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Fix the wrong error number in error path of f2fs_write_begin.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
My static checker complains that if "len == remaining" then it means we
have truncated the last character off the version string.
The intent of the code is that we print as many versions as we can
without truncating a version. Then we put a newline at the end. If the
newline can't fit we return -EINVAL.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
fs/xfs/libxfs/xfs_bmap.c:5591:1-6: WARNING: end returns can be simpified
Simplify a trivial if-return sequence. Possibly combine with a
preceding function call.
Generated by: scripts/coccinelle/misc/simple_return.cocci
CC: Brian Foster <bfoster@redhat.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
fs/xfs/xfs_file.c:919:1-6: WARNING: end returns can be simpified and declaration on line 902 can be dropped
Simplify a trivial if-return sequence. Possibly combine with a
preceding function call.
Generated by: scripts/coccinelle/misc/simple_return.cocci
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
fs/xfs/libxfs/xfs_ialloc.c:1141:1-6: WARNING: end returns can be simpified
Simplify a trivial if-return sequence. Possibly combine with a
preceding function call.
Generated by: scripts/coccinelle/misc/simple_return.cocci
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The functions xfs_blkdev_put() and xfs_qm_dqrele() test whether
their argument is NULL and then return immediately. Thus the test
around the call is not needed.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Don Bailey noticed that our page zeroing for compression at end-io time
isn't complete. This reworks a patch from Linus to push the zeroing
into the zlib and lzo specific functions instead of trying to handle the
corners inside btrfs_decompress_buf2page
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reported-by: Don A. Bailey <donb@securitymouse.com>
cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Little cleanup to distinguish each phase easily
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
[Jaegeuk Kim: modify indentation for code readability]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
More on-disk format consolidation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Move the on-disk ACL format to xfs_format.h, so that repair can
use the common defintion.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
More consolidatation for the on-disk format defintions. Note that the
XFS_IS_REALTIME_INODE moves to xfs_linux.h instead as it is not related
to the on disk format, but depends on a CONFIG_ option.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Here blkno is a daddr_t, which is a __s64; it's possible to hold
a value which is negative, and thus pass the (blkno >= eofs)
test. Then we try to do a xfs_perag_get() for a ridiculous
agno via xfs_daddr_to_agno(), and bad things happen when that
fails, and returns a null pag which is dereferenced shortly
thereafter.
Found via a user-supplied fuzzed image...
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The expectation since the introduction the lazy superblock counters is
that the counters are synced and superblock logged appropriately as part
of the filesystem freeze sequence. This does not occur, however, due to
the logic in xfs_fs_writable() that prevents progress when the fs is in
any state other than SB_UNFROZEN.
While this is a bug, it has not been exposed to date because the last
thing XFS does during freeze is dirty the log. The log recovery process
recalculates the counters from AGI/AGF metadata to ensure everything is
correct. Therefore should a crash occur while an fs is frozen, the
subsequent log recovery puts everything back in order. See the following
commit for reference:
92821e2b [XFS] Lazy Superblock Counters
We might not always want to rely on dirtying the log on a frozen fs.
Modify xfs_log_sbcount() to proceed when the filesystem is freezing but
not once the freeze process has completed. Modify xfs_fs_writable() to
accept the minimum freeze level for which modifications should be
blocked to support various codepaths.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The error handling in xfs_qm_log_quotaoff() has a couple problems. If
xfs_trans_commit() fails, we fall through to the error block and call
xfs_trans_cancel(). This is incorrect on commit failure. If
xfs_trans_reserve() fails, we jump to the error block, cancel the tp and
restore the superblock qflags to oldsbqflag. However, oldsbqflag has
been initialized to zero and not yet updated from the original flags so
we set the flags to zero.
Fix up the error handling in xfs_qm_log_quotaoff() to not restore flags
if they haven't been modified and not cancel the tp on commit failure.
Remove the flag restore code altogether because commit error is the only
failure condition and we don't know whether the transaction made it to
disk.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
There's no need to store a full struct xfs_trans_res on the stack in
xfs_create() and copy the fields. Use a pointer to the appropriate
structures embedded in the xfs_mount.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The xfslogd workqueue is a global, single-job workqueue for buffer ioend
processing. This means we allow for a single work item at a time for all
possible XFS mounts on a system. fsstress testing in loopback XFS over
XFS configurations has reproduced xfslogd deadlocks due to the single
threaded nature of the queue and dependencies introduced between the
separate XFS instances by online discard (-o discard).
Discard over a loopback device converts the discard request to a hole
punch (fallocate) on the underlying file. Online discard requests are
issued synchronously and from xfslogd context in XFS, hence the xfslogd
workqueue is blocked in the upper fs waiting on a hole punch request to
be servied in the lower fs. If the lower fs issues I/O that depends on
xfslogd to complete, both filesystems end up hung indefinitely. This is
reproduced reliabily by generic/013 on XFS->loop->XFS test devices with
the '-o discard' mount option.
Further, docker implementations appear to use this kind of configuration
for container instance filesystems by default (container fs->dm->
loop->base fs) and therefore are subject to this deadlock when running
on XFS.
Replace the global xfslogd workqueue with a per-mount variant. This
guarantees each mount access to a single worker and prevents deadlocks
due to inter-fs dependencies introduced by discard. Since the queue is
only responsible for buffer iodone processing at this point in time,
rename xfslogd to xfs-buf.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Add support for reading file systems compressed with the
LZ4 compression algorithm.
This patch adds the LZ4 decompressor wrapper code.
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
This patch adds a helper function that simplifies adding a
so-called single_open sequence file for device drivers. The
calling device driver needs to provide a read function and
a device pointer. The field struct seq_file::private will
reference the device pointer upon call to the read function
so the driver can obtain his data from it and do its task
of providing the file content using seq_printf() calls and
alike. Using this helper function also gets rid of the need
to specify file operations per debugfs file.
Signed-off-by: Arend van Spriel <arend@broadcom.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
These patches fixes for iostats and SETCLIENTID in addition to cleaning
up the nfs4_init_callback() function.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJUdfhQAAoJENfLVL+wpUDr9x4QANWmG6xjEU7RuIBLalOoxit6
eXnEDNZlwYp6NCkYktHVWaTqXdwq7fdGX+3p4eiwNg3C3SrJHkpWvjeFd4KT/SyP
B/w3vYG3o5H01i3Mb5kCD2uW0gbS9soZXpIg+uHHOl43yZzneC0vPTLhQ/h+9zPc
B32XcgQhvLIHR3LWrC/+uolsa31lcyya0W45PX+iCpzHF7i9qRBrNLODkTlw1hNQ
eEnYIvVy5oW00zlHJUYiTHP3e+0EJn5PAngdYbqiboJ9mK7DbB0QDwqyvJbIT7ql
WAip6cNcJnSv1eiVYqDwlR1ok8drK5X7yQCT3lcLzAMDznLsSAL1Itu0h2Ay3z61
f8XCyTwI0izq0DbdrMcPoqPSitqyM8nkPElnOuitXwzEroPaG40OF67yss3+ixbl
JeQZ+u35pnpCkKUaZdCK3Pn83StxmUaBcFx8eg30NBc0SN13Eiz6aZcGEperClrR
RwMLDUhUtAMcMRunRRxiN9lHafPqqeDeJre7uky0p0sU9CsH+1n5qIKLmk+Ber2d
ZS29TobdR7ktfjQ52XazMAIFzI7r1v4Zn7ziH/WRbvKoKw9ICcyoUGNy9CS5LyWu
BxWCZAJmby5H1i7V7ituJK8TImb81L4aPW06hHX9k+0SoTrgFjas//4jZYfysJ9N
PvR0MRNfMFhuBlsTj7Gy
=PMdS
-----END PGP SIGNATURE-----
Merge tag 'nfs-cel-for-3.19' of git://git.linux-nfs.org/projects/anna/nfs-rdma into linux-next
Pull pull additional NFS client changes for 3.19 from Anna Schumaker:
"NFS: Generic client side changes from Chuck
These patches fixes for iostats and SETCLIENTID in addition to cleaning
up the nfs4_init_callback() function.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>"
* tag 'nfs-cel-for-3.19' of git://git.linux-nfs.org/projects/anna/nfs-rdma:
NFS: Clean up nfs4_init_callback()
NFS: SETCLIENTID XDR buffer sizes are incorrect
SUNRPC: serialize iostats updates
Dmitry Chernenkov used KASAN to discover that eCryptfs writes past the
end of the allocated buffer during encrypted filename decoding. This
fix corrects the issue by getting rid of the unnecessary 0 write when
the current bit offset is 2.
Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Reported-by: Dmitry Chernenkov <dmitryc@google.com>
Suggested-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org # v2.6.29+: 51ca58d eCryptfs: Filename Encryption: Encoding and encryption functions
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Pull nfsd bugfixes from Bruce Fields:
"These fix one mishandling of the case when security labels are
configured out, and two races in the 4.1 backchannel code"
* 'for-3.18' of git://linux-nfs.org/~bfields/linux:
nfsd: Fix slot wake up race in the nfsv4.1 callback code
SUNRPC: Fix locking around callback channel reply receive
nfsd: correctly define v4.2 support attributes
Pull aio fix from Ben LaHaise:
"Dirty page accounting fix for aio"
* git://git.kvack.org/~bcrl/aio-fixes:
aio: fix uncorrent dirty pages accouting when truncating AIO ring buffer
If an inode has converted inline_data which was written to the disk, we should
set its inode flag for further fsync so that this inline_data can be recovered
from sudden power off.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
After flushing dirty nat entries, it has to be no more dirty nat
entries.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
It's meaningless to check dirty_nat_cnt after re-dirtying nat entries in
journal. And although there are rooms for dirty nat entires if dirty_nat_cnt
is zero, it's also meaningless to check __has_cursum_space.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Option journal_async_commit breaks gurantees of data=ordered mode as it
sends only a single cache flush after writing a transaction commit
block. Thus even though the transaction including the commit block is
fully stored on persistent storage, file data may still linger in drives
caches and will be lost on power failure. Since all checksums match on
journal recovery, we replay the transaction thus possibly exposing stale
user data.
To fix this data exposure issue, remove the possibility to use
journal_async_commit in data=ordered mode.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The iput() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This patch adds support for using the NFS v4.2 operation DEALLOCATE to
punch holes in a file.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This patch adds support for using the NFS v4.2 operation ALLOCATE to
preallocate data in a file.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Setting retval to zero is not needed in ext4_unlink.
Remove unneeded code.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This was fixed for ext3 with:
e6d8fb3 ext3: Count internal journal as bsddf overhead in ext3_statfs
but was never fixed for ext4.
With a large external journal and no used disk blocks, df comes
out negative without this, as journal blocks are added to the
overhead & subtracted from used blocks unconditionally.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
path[depth].p_hdr can never be NULL for a path passed to us (and even if
it could, EXT_LAST_EXTENT() would make something != NULL from it). So
just remove the branch.
Coverity-id: 1196498
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
nfs4_init_callback() is never invoked for NFS versions other than 4.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Use the correct calculation of the maximum size of a clientaddr4
when encoding and decoding SETCLIENTID operations. clientaddr4 is
defined in section 2.2.10 of RFC3530bis-31.
The usage in encode_setclientid_maxsz is missing the 4-byte length
in both strings, but is otherwise correct. decode_setclientid_maxsz
simply asks for a page of receive buffer space, which is
unnecessarily large (more than 4KB).
Note that a SETCLIENTID reply is either clientid+verifier, or
clientaddr4, depending on the returned NFS status. It doesn't
hurt to allocate enough space for both.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Create a mount option to disable journal checksumming (because the
metadata_csum feature turns it on by default now), and fix remount not
to allow changing the journal checksumming option, since changing the
mount options has no effect on the journal.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
ext4_delete_inode() has been renamed for a long time, update
comments for this.
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
A deadlock can be occurred:
Thread 1] Thread 2]
- f2fs_write_data_pages - f2fs_write_begin
- lock_page(page #0)
- grab_cache_page(page #X)
- get_node_page(inode_page)
- grab_cache_page(page #0)
: to convert inline_data
- f2fs_write_data_page
- f2fs_write_inline_data
- get_node_page(inode_page)
In this case, trying to lock inode_page and page #0 causes deadlock.
In order to avoid this, this patch adds a rule for this locking policy,
which is that page #0 should be locked followed by inode_page lock.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Two jump labels were adjusted in the implementation of the
create_node_manager_caches() function because these identifiers
contained typos.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We must use GFP_NOFS instead GFP_KERNEL inside ext4_mb_add_groupinfo
and ext4_calculate_overhead() because they are called from inside a
journal transaction. Call trace:
ioctl
->ext4_group_add
->journal_start
->ext4_setup_new_descs
->ext4_mb_add_groupinfo -> GFP_KERNEL
->ext4_flex_group_add
->ext4_update_super
->ext4_calculate_overhead -> GFP_KERNEL
->journal_stop
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Introduce a simple aging to extent status tree. Each extent has a
REFERENCED bit which gets set when the extent is used. Shrinker then
skips entries with referenced bit set and clears the bit. Thus
frequently used extents have higher chances of staying in memory.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently flags for extent status tree are defined twice, once shifted
and once without a being shifted. Consolidate these definitions into one
place and make some computations automatic to make adding flags less
error prone. Compiler should be clever enough to figure out these are
constants and generate the same code.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently we scan extent status trees of inodes until we reclaim nr_to_scan
extents. This can however require a lot of scanning when there are lots
of delayed extents (as those cannot be reclaimed).
Change shrinker to work as shrinkers are supposed to and *scan* only
nr_to_scan extents regardless of how many extents did we actually
reclaim. We however need to be careful and avoid scanning each status
tree from the beginning - that could lead to a situation where we would
not be able to reclaim anything at all when first nr_to_scan extents in
the tree are always unreclaimable. We remember with each inode offset
where we stopped scanning and continue from there when we next come
across the inode.
Note that we also need to update places calling __es_shrink() manually
to pass reasonable nr_to_scan to have a chance of reclaiming anything and
not just 1.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently callers adding extents to extent status tree were responsible
for adding the inode to the list of inodes with freeable extents. This
is error prone and puts list handling in unnecessarily many places.
Just add inode to the list automatically when the first non-delay extent
is added to the tree and remove inode from the list when the last
non-delay extent is removed.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
In this commit we discard the lru algorithm for inodes with extent
status tree because it takes significant effort to maintain a lru list
in extent status tree shrinker and the shrinker can take a long time to
scan this lru list in order to reclaim some objects.
We replace the lru ordering with a simple round-robin. After that we
never need to keep a lru list. That means that the list needn't be
sorted if the shrinker can not reclaim any objects in the first round.
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently extent status tree doesn't cache extent hole when a write
looks up in extent tree to make sure whether a block has been allocated
or not. In this case, we don't put extent hole in extent cache because
later this extent might be removed and a new delayed extent might be
added back. But it will cause a defect when we do a lot of writes. If
we don't put extent hole in extent cache, the following writes also need
to access extent tree to look at whether or not a block has been
allocated. It brings a cache miss. This commit fixes this defect.
Also if the inode doesn't have any extent, this extent hole will be
cached as well.
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
For bigalloc filesystems we have to check whether newly requested inode
block isn't already part of a cluster for which we already have delayed
allocation reservation. This check happens in ext4_ext_map_blocks() and
that function sets EXT4_MAP_FROM_CLUSTER if that's the case. However if
ext4_da_map_blocks() finds in extent cache information about the block,
we don't call into ext4_ext_map_blocks() and thus we always end up
getting new reservation even if the space for cluster is already
reserved. This results in overreservation and premature ENOSPC reports.
Fix the problem by checking for existing cluster reservation already in
ext4_da_map_blocks(). That simplifies the logic and actually allows us
to get rid of the EXT4_MAP_FROM_CLUSTER flag completely.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If right after starting the snapshot creation ioctl we perform a write against a
file followed by a truncate, with both operations increasing the file's size, we
can get a snapshot tree that reflects a state of the source subvolume's tree where
the file truncation happened but the write operation didn't. This leaves a gap
between 2 file extent items of the inode, which makes btrfs' fsck complain about it.
For example, if we perform the following file operations:
$ mkfs.btrfs -f /dev/vdd
$ mount /dev/vdd /mnt
$ xfs_io -f \
-c "pwrite -S 0xaa -b 32K 0 32K" \
-c "fsync" \
-c "pwrite -S 0xbb -b 32770 16K 32770" \
-c "truncate 90123" \
/mnt/foobar
and the snapshot creation ioctl was just called before the second write, we often
can get the following inode items in the snapshot's btree:
item 120 key (257 INODE_ITEM 0) itemoff 7987 itemsize 160
inode generation 146 transid 7 size 90123 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 flags 0x0
item 121 key (257 INODE_REF 256) itemoff 7967 itemsize 20
inode ref index 282 namelen 10 name: foobar
item 122 key (257 EXTENT_DATA 0) itemoff 7914 itemsize 53
extent data disk byte 1104855040 nr 32768
extent data offset 0 nr 32768 ram 32768
extent compression 0
item 123 key (257 EXTENT_DATA 53248) itemoff 7861 itemsize 53
extent data disk byte 0 nr 0
extent data offset 0 nr 40960 ram 40960
extent compression 0
There's a file range, corresponding to the interval [32K; ALIGN(16K + 32770, 4096)[
for which there's no file extent item covering it. This is because the file write
and file truncate operations happened both right after the snapshot creation ioctl
called btrfs_start_delalloc_inodes(), which means we didn't start and wait for the
ordered extent that matches the write and, in btrfs_setsize(), we were able to call
btrfs_cont_expand() before being able to commit the current transaction in the
snapshot creation ioctl. So this made it possibe to insert the hole file extent
item in the source subvolume (which represents the region added by the truncate)
right before the transaction commit from the snapshot creation ioctl.
Btrfs' fsck tool complains about such cases with a message like the following:
"root 331 inode 257 errors 100, file extent discount"
>From a user perspective, the expectation when a snapshot is created while those
file operations are being performed is that the snapshot will have a file that
either:
1) is empty
2) only the first write was captured
3) only the 2 writes were captured
4) both writes and the truncation were captured
But never capture a state where only the first write and the truncation were
captured (since the second write was performed before the truncation).
A test case for xfstests follows.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Move the logic from the snapshot creation ioctl into send. This avoids
doing the transaction commit if send isn't used, and ensures that if
a crash/reboot happens after the transaction commit that created the
snapshot and before the transaction commit that switched the commit
root, send will not get a commit root that differs from the main root
(that has orphan items).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Due to ignoring errors returned by clear_extent_bits (at the moment only
-ENOMEM is possible), we can end up freeing an extent that is actually in
use (i.e. return the extent to the free space cache).
The sequence of steps that lead to this:
1) Cleaner thread starts execution and calls btrfs_delete_unused_bgs(), with
the goal of freeing empty block groups;
2) btrfs_delete_unused_bgs() finds an empty block group, joins the current
transaction (or starts a new one if none is running) and attempts to
clear the EXTENT_DIRTY bit for the block group's range from freed_extents[0]
and freed_extents[1] (of which one corresponds to fs_info->pinned_extents);
3) Clearing the EXTENT_DIRTY bit (via clear_extent_bits()) fails with
-ENOMEM, but such error is ignored and btrfs_delete_unused_bgs() proceeds
to delete the block group and the respective chunk, while pinned_extents
remains with that bit set for the whole (or a part of the) range covered
by the block group;
4) Later while the transaction is still running, the chunk ends up being reused
for a new block group (maybe for different purpose, data or metadata), and
extents belonging to the new block group are allocated for file data or btree
nodes/leafs;
5) The current transaction is committed, meaning that we unpinned one or more
extents from the new block group (through btrfs_finish_extent_commit() and
unpin_extent_range()) which are now being used for new file data or new
metadata (through btrfs_finish_extent_commit() and unpin_extent_range()).
And unpinning means we returned the extents to the free space cache of the
new block group, which implies those extents can be used for future allocations
while they're still in use.
Alternatively, we can hit a BUG_ON() when doing a lookup for a block group's cache
object in unpin_extent_range() if a new block group didn't end up being allocated for
the same chunk (step 4 above).
Fix this by not freeing the block group and chunk if we fail to clear the dirty bit.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Fengguang's build monster reported warnings on some arches because we
don't have vmalloc.h included
Signed-off-by: Chris Mason <clm@fb.com>
Reported-by: fengguang.wu@intel.com
The following lockdep warning is triggered during xfstests:
[ 1702.980872] =========================================================
[ 1702.981181] [ INFO: possible irq lock inversion dependency detected ]
[ 1702.981482] 3.18.0-rc1 #27 Not tainted
[ 1702.981781] ---------------------------------------------------------
[ 1702.982095] kswapd0/77 just changed the state of lock:
[ 1702.982415] (&delayed_node->mutex){+.+.-.}, at: [<ffffffffa03b0b51>] __btrfs_release_delayed_node+0x41/0x1f0 [btrfs]
[ 1702.982794] but this lock took another, RECLAIM_FS-unsafe lock in the past:
[ 1702.983160] (&fs_info->dev_replace.lock){+.+.+.}
and interrupts could create inverse lock ordering between them.
[ 1702.984675]
other info that might help us debug this:
[ 1702.985524] Chain exists of:
&delayed_node->mutex --> &found->groups_sem --> &fs_info->dev_replace.lock
[ 1702.986799] Possible interrupt unsafe locking scenario:
[ 1702.987681] CPU0 CPU1
[ 1702.988137] ---- ----
[ 1702.988598] lock(&fs_info->dev_replace.lock);
[ 1702.989069] local_irq_disable();
[ 1702.989534] lock(&delayed_node->mutex);
[ 1702.990038] lock(&found->groups_sem);
[ 1702.990494] <Interrupt>
[ 1702.990938] lock(&delayed_node->mutex);
[ 1702.991407]
*** DEADLOCK ***
It is because the btrfs_kobj_{add/rm}_device() will call memory
allocation with GFP_KERNEL,
which may flush fs page cache to free space, waiting for it self to do
the commit, causing the deadlock.
To solve the problem, move btrfs_kobj_{add/rm}_device() out of the
dev_replace lock range, also involing split the
btrfs_rm_dev_replace_srcdev() function into remove and free parts.
Now only btrfs_rm_dev_replace_remove_srcdev() is called in dev_replace
lock range, and kobj_{add/rm} and btrfs_rm_dev_replace_free_srcdev() are
called out of the lock range.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Define and use nfs_inc_fscache_stats when plus one, which can save to
pass one parameter.
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The nfs_put_client() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
LOCKD_DEBUG is always the same value as CONFIG_SUNRPC_DEBUG, so we can
just use it instead.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
nfs4_layoutget_release() drops layout hdr refcnt. Grab the refcnt
early so that it is safe to call .release in case nfs4_alloc_pages
fails.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Fixes: a47970ff78 ("NFSv4.1: Hold reference to layout hdr in layoutget")
Cc: stable@vger.kernel.org # 3.9+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Recent work in the pgio layer made it possible for there to be more than one
request per page. This caused a subtle change in commit behavior, because
write.c:nfs_commit_unstable_pages compares the number of *pages* waiting for
writeback against the number of requests on a commit list to choose when to
send a COMMIT in a non-blocking flush.
This is probably hard to hit in normal operation - you have to be using
rsize/wsize < PAGE_SIZE, or pnfs with lots of boundaries that are not page
aligned to have a noticeable change in behavior.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Use the number of pages in the pagecache mapping instead of the
number of pnfs requests which is only slightly related.
Reported-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This should make the code easier to maintain in the future.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
NFS4ERR_ACCESS has number 13 and thus is matched and returned
immediately at the beginning of nfs4_map_errors() and there's no point
in checking it later.
Coverity-id: 733891
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
MIPS is introducing new variants of its O32 ABI which differ in their
handling of floating point, in order to enable a gradual transition
towards a world where mips32 binaries can take advantage of new hardware
features only available when configured for certain FP modes. In order
to do this ELF binaries are being augmented with a new section that
indicates, amongst other things, the FP mode requirements of the binary.
The presence & location of such a section is indicated by a program
header in the PT_LOPROC ... PT_HIPROC range.
In order to allow the MIPS architecture code to examine the program
header & section in question, pass all program headers in this range
to an architecture-specific arch_elf_pt_proc function. This function
may return an error if the header is deemed invalid or unsuitable for
the system, in which case that error will be returned from
load_elf_binary and upwards through the execve syscall.
A means is required for the architecture code to make a decision once
it is known that all such headers have been seen, but before it is too
late to return from an execve syscall. For this purpose the
arch_check_elf function is added, and called once, after all PT_LOPROC
to PT_HIPROC headers have been passed to arch_elf_pt_proc but before
the code which invoked execve has been lost. This enables the
architecture code to make a decision based upon all the headers present
in an ELF binary and its interpreter, as is required to forbid
conflicting FP ABI requirements between an ELF & its interpreter.
In order to allow data to be stored throughout the calls to the above
functions, struct arch_elf_state is introduced.
Finally a variant of the SET_PERSONALITY macro is introduced which
accepts a pointer to the struct arch_elf_state, allowing it to act
based upon state observed from the architecture specific program
headers.
Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Cc: linux-mips@linux-mips.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/7679/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Load the program headers of an ELF interpreter early enough in
load_elf_binary that they can be examined before it's too late to return
an error from an exec syscall. This patch does not perform any such
checking, it merely lays the groundwork for a further patch to do so.
No functional change is intended.
Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Cc: linux-mips@linux-mips.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/7675/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
load_elf_binary & load_elf_interp both load program headers from an ELF
executable in the same way, duplicating the code. This patch introduces
a helper function (load_elf_phdrs) which performs this common task &
calls it from both load_elf_binary & load_elf_interp. In addition to
reducing code duplication, this is part of preparing to load the ELF
interpreter headers earlier such that they can be examined before it's
too late to return an error from an exec syscall.
Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Cc: linux-mips@linux-mips.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/7676/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
In f2fs_evict_inode,
commit_inmemory_pages
f2fs_gc
f2fs_iget
iget_locked
-> wait for inode free
Here, if the inode is same as the one to be evicted, f2fs should wait forever.
Actually, we should not call f2fs_balance_fs during f2fs_evict_inode to avoid
this.
But, the commit_inmem_pages calls f2fs_balance_fs by default, even if
f2fs_evict_inode wants to free inmemory pages only.
Hence, this patch adds to trigger f2fs_balance_fs only when there is something
to write.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
It used nat_entry_set when create slab for sit_entry_set.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Pull btrfs deadlock fix from Chris Mason:
"This has a fix for a long standing deadlock that we've been trying to
nail down for a while. It ended up being a bad interaction with the
fair reader/writer locks and the order btrfs reacquires locks in the
btree"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: fix lockups from btrfs_clear_path_blocking
ext4_ext_remove_space() can incorrectly free a partial_cluster if
EAGAIN is encountered while truncating or punching. Extent removal
should be retried in this case.
It also fails to free a partial cluster when the punched region begins
at the start of a file on that unaligned cluster and where the entire
file has not been punched. Remove the requirement that all blocks in
the file must have been freed in order to free the partial cluster.
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Add some casts and rearrange a few statements for improved readability.
Some code can also be simplified and made more readable if we set
partial_cluster to 0 rather than to a negative value when we can tell
we've hit the left edge of the punched region.
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The fix in commit ad6599ab3a ("ext4: fix premature freeing of
partial clusters split across leaf blocks"), intended to avoid
dereferencing an invalid extent pointer when determining whether a
partial cluster should be freed, wasn't quite good enough. Assure that
at least one extent remains at the start of the leaf once the hole has
been punched. Otherwise, the pointer to the extent to the right of the
hole will be invalid and a partial cluster will be incorrectly freed.
Set partial_cluster to 0 when we can tell we've hit the left edge of
the punched region within the leaf. This prevents incorrect freeing
of a partial cluster when ext4_ext_rm_leaf is called one last time
during extent tree traversal after the punched region has been removed.
Adjust comments to reflect code changes and a correction. Remove a bit
of dead code.
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The partial_cluster variable is not always initialized correctly when
hole punching on bigalloc file systems. Although commit c063449394
("ext4: fix partial cluster handling for bigalloc file systems")
addressed the case where the right edge of the punched region and the
next extent to its right were within the same leaf, it didn't handle
the case where the next extent to its right is in the next leaf. This
causes xfstest generic/300 to fail.
Fix this by replacing the code in c0634493922 with a more general
solution that can continue the search for the first cluster to the
right of the punched region into the next leaf if present. If found,
partial_cluster is initialized to this cluster's negative value.
There's no need to determine if that cluster is actually shared; we
simply record it so its blocks won't be freed in the event it does
happen to be shared.
Also, minimize the burden on non-bigalloc file systems with some minor
code simplification.
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When doing a fsync with a fast path we have a time window where we can miss
the fact that writeback of some file data failed, and therefore we endup
returning success (0) from fsync when we should return an error.
The steps that lead to this are the following:
1) We start all ordered extents by calling filemap_fdatawrite_range();
2) We do some other work like locking the inode's i_mutex, start a transaction,
start a log transaction, etc;
3) We enter btrfs_log_inode(), acquire the inode's log_mutex and collect all the
ordered extents from inode's ordered tree into a list;
4) But by the time we do ordered extent collection, some ordered extents we started
at step 1) might have already completed with an error, and therefore we didn't
found them in the ordered tree and had no idea they finished with an error. This
makes our fsync return success (0) to userspace, but has no bad effects on the log
like for example insertion of file extent items into the log that point to unwritten
extents, because the invalid extent maps were removed before the ordered extent
completed (in inode.c:btrfs_finish_ordered_io).
So after collecting the ordered extents just check if the inode's i_mapping has any
error flags set (AS_EIO or AS_ENOSPC) and leave with an error if it does. Whenever
writeback fails for a page of an ordered extent, we call mapping_set_error (done in
extent_io.c:end_extent_writepage, called by extent_io.c:end_bio_extent_writepage)
that sets one of those error flags in the inode's i_mapping flags.
This change also has the side effect of fixing the issue where for fast fsyncs we
never checked/cleared the error flags from the inode's i_mapping flags, which means
that a full fsync performed after a fast fsync could get such errors that belonged
to the fast fsync - because the full fsync calls btrfs_wait_ordered_range() which
calls filemap_fdatawait_range(), and the later checks for and clears those flags,
while for fast fsyncs we never call filemap_fdatawait_range() or anything else
that checks for and clears the error flags from the inode's i_mapping.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Instead of collecting all ordered extents from the inode's ordered tree
and then wait for all of them to complete, just collect the ones that
overlap the fsync range.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
If an error happens during writeback of log btree extents, make sure the
error is returned to the caller (fsync), so that it takes proper action
(commit current transaction) instead of writing a superblock that points
to log btrees with all or some nodes that weren't durably persisted.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
We use the modified list to keep track of which extents have been modified so we
know which ones are candidates for logging at fsync() time. Newly modified
extents are added to the list at modification time, around the same time the
ordered extent is created. We do this so that we don't have to wait for ordered
extents to complete before we know what we need to log. The problem is when
something like this happens
log extent 0-4k on inode 1
copy csum for 0-4k from ordered extent into log
sync log
commit transaction
log some other extent on inode 1
ordered extent for 0-4k completes and adds itself onto modified list again
log changed extents
see ordered extent for 0-4k has already been logged
at this point we assume the csum has been copied
sync log
crash
On replay we will see the extent 0-4k in the log, drop the original 0-4k extent
which is the same one that we are replaying which also drops the csum, and then
we won't find the csum in the log for that bytenr. This of course causes us to
have errors about not having csums for certain ranges of our inode. So remove
the modified list manipulation in unpin_extent_cache, any modified extents
should have been added well before now, and we don't want them re-logged. This
fixes my test that I could reliably reproduce this problem with. Thanks,
cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Liu Bo pointed out that my previous fix would lose the generation update in the
scenario I described. It is actually much worse than that, we could lose the
entire extent if we lose power right after the transaction commits. Consider
the following
write extent 0-4k
log extent in log tree
commit transaction
< power fail happens here
ordered extent completes
We would lose the 0-4k extent because it hasn't updated the actual fs tree, and
the transaction commit will reset the log so it isn't replayed. If we lose
power before the transaction commit we are save, otherwise we are not.
Fix this by keeping track of all extents we logged in this transaction. Then
when we go to commit the transaction make sure we wait for all of those ordered
extents to complete before proceeding. This will make sure that if we lose
power after the transaction commit we still have our data. This also fixes the
problem of the improperly updated extent generation. Thanks,
cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
"The biggest change is to rename the filesystem from "overlayfs" to "overlay".
This will allow legacy overlayfs to be easily carried by distros alongside the
new mainline one. Also fix a couple of copy-up races and allow escaping comma
character in filenames."
The last bit is about commas in pathname mount options...
If we have two fsync()'s race on different subvols one will do all of its work
to get into the log_tree, wait on it's outstanding IO, and then allow the
log_tree to finish it's commit. The problem is we were just free'ing that
subvols logged extents instead of waiting on them, so whoever lost the race
wouldn't really have their data on disk. Fix this by waiting properly instead
of freeing the logged extents. Thanks,
cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
The sizes that are obtained from space infos are in raw units and have
to be adjusted according to the raid factor. This was missing for
f_bavail and df reported doubled size for raid1.
Reported-by: Martin Steigerwald <Martin@lichtvoll.de>
Fixes: ba7b6e62f4 ("btrfs: adjust statfs calculations according to raid profiles")
CC: stable@vger.kernel.org
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
This can be reproduced by fstests: btrfs/070
The scenario is like the following:
replace worker thread defrag thread
--------------------- -------------
copy_nocow_pages_worker btrfs_defrag_file
copy_nocow_pages_for_inode ...
btrfs_writepages
|A| lock_extent_bits extent_write_cache_pages
|B| lock_page
__extent_writepage
... writepage_delalloc
find_lock_delalloc_range
|B| lock_extent_bits
find_or_create_page
pagecache_get_page
|A| lock_page
This leads to an ABBA pattern deadlock. To fix it,
o we just change it to an AABB pattern which means to @unlock_extent_bits()
before we @lock_page(), and in this way the @extent_read_full_page_nolock()
is no longer in an locked context, so change it back to @extent_read_full_page()
to regain protection.
o Since we @unlock_extent_bits() earlier, then before @write_page_nocow(),
the extent may not really point at the physical block we want, so we
have to check it before write.
Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Replacing a xattr consists of doing a lookup for its existing value, delete
the current value from the respective leaf, release the search path and then
finally insert the new value. This leaves a time window where readers (getxattr,
listxattrs) won't see any value for the xattr. Xattrs are used to store ACLs,
so this has security implications.
This change also fixes 2 other existing issues which were:
*) Deleting the old xattr value without verifying first if the new xattr will
fit in the existing leaf item (in case multiple xattrs are packed in the
same item due to name hash collision);
*) Returning -EEXIST when the flag XATTR_CREATE is given and the xattr doesn't
exist but we have have an existing item that packs muliple xattrs with
the same name hash as the input xattr. In this case we should return ENOSPC.
A test case for xfstests follows soon.
Thanks to Alexandre Oliva for reporting the non-atomicity of the xattr replace
implementation.
Reported-by: Alexandre Oliva <oliva@gnu.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
We try to allocate an extent state structure before acquiring the extent
state tree's spinlock as we might need a new one later and therefore avoid
doing later an atomic allocation while holding the tree's spinlock. However
we returned -ENOMEM if that initial non-atomic allocation failed, which is
a bit excessive since we might end up not needing the pre-allocated extent
state at all - for the case where the tree doesn't have any extent states
that cover the input range and cover too any other range. Therefore don't
return -ENOMEM if that pre-allocation fails.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Our gluster boxes get several thousand statfs() calls per second, which begins
to suck hardcore with all of the lock contention on the chunk mutex and dev list
mutex. We don't really need to hold these things, if we have transient
weirdness with statfs() because of the chunk allocator we don't care, so remove
this locking.
We still need the dev_list lock if you mount with -o alloc_start however, which
is a good argument for nuking that thing from orbit, but that's a patch for
another day. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Our gluster boxes were spending lots of time in statfs because our fs'es are
huge. The problem is statfs loops through all of the block groups looking for
read only block groups, and when you have several terabytes worth of data that
ends up being a lot of block groups. Move the read only block groups onto a
read only list and only proces that list in
btrfs_account_ro_block_groups_free_space to reduce the amount of churn. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Copy&paste errors in some messages and add few more missing macro
accessors.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
The xfstest btrfs/014 which tests the balance operation caused that the
check_int module complained that known blocks changed their physical
location. Since this is not an error in this case, only print such
message if the verbose mode was enabled.
Reported-by: Wang Shilong <wangshilong1991@gmail.com>
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Tested-by: Wang Shilong <wangshilong1991@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
The xfstest btrfs/014 which tests the balance operation caused issues with
the check_int module. The attempt was made to use btrfs_map_block() to
find the physical location for a written block. However, this was not
at all needed since the location of the written block was known since
a hook to submit_bio() was the reason for entering the check_int module.
Additionally, after a block relocation it happened that btrfs_map_block()
failed causing misleading error messages afterwards.
This patch changes the check_int module to use the known information of
the physical location from the bio.
Reported-by: Wang Shilong <wangshilong1991@gmail.com>
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Tested-by: Wang Shilong <wangshilong1991@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
We try to allocate an extent state before acquiring the tree's spinlock
just in case we end up needing to split an existing extent state into two.
If that allocation failed, we would return -ENOMEM.
However, our only single caller (transaction/log commit code), passes in
an extent state that was cached from a call to find_first_extent_bit() and
that has a very high chance to match exactly the input range (always true
for a transaction commit and very often, but not always, true for a log
commit) - in this case we end up not needing at all that initial extent
state used for an eventual split. Therefore just don't return -ENOMEM if
we can't allocate the temporary extent state, since we might not need it
at all, and if we end up needing one, we'll do it later anyway.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Right now the only caller of find_first_extent_bit() that is interested
in caching extent states (transaction or log commit), never gets an extent
state cached. This is because find_first_extent_bit() only caches states
that have at least one of the flags EXTENT_IOBITS or EXTENT_BOUNDARY, and
the transaction/log commit caller always passes a tree that doesn't have
ever extent states with any of those flags (they can only have one of the
following flags: EXTENT_DIRTY, EXTENT_NEW or EXTENT_NEED_WAIT).
This change together with the following one in the patch series (titled
"Btrfs: avoid returning -ENOMEM in convert_extent_bit() too early") will
help reduce significantly the chances of calls to convert_extent_bit()
fail with -ENOMEM when called from the transaction/log commit code.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
When committing a transaction or a log, we look for btree extents that
need to be durably persisted by searching for ranges in a io tree that
have some bits set (EXTENT_DIRTY or EXTENT_NEW). We then attempt to clear
those bits and set the EXTENT_NEED_WAIT bit, with calls to the function
convert_extent_bit, and then start writeback for the extents.
That function however can return an error (at the moment only -ENOMEM
is possible, specially when it does GFP_ATOMIC allocation requests
through alloc_extent_state_atomic) - that means the ranges didn't got
the EXTENT_NEED_WAIT bit set (or at least not for the whole range),
which in turn means a call to btrfs_wait_marked_extents() won't find
those ranges for which we started writeback, causing a transaction
commit or a log commit to persist a new superblock without waiting
for the writeback of extents in that range to finish first.
Therefore if a crash happens after persisting the new superblock and
before writeback finishes, we have a superblock pointing to roots that
weren't fully persisted or roots that point to nodes or leafs that weren't
fully persisted, causing all sorts of unexpected/bad behaviour as we endup
reading garbage from disk or the content of some node/leaf from a past
generation that got cowed or deleted and is no longer valid (for this later
case we end up getting error messages like "parent transid verify failed on
X wanted Y found Z" when reading btree nodes/leafs from disk).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
device replace could fail due to another running scrub process or any
other errors btrfs_scrub_dev() may hit, but this failure doesn't get
returned to userspace.
The following steps could reproduce this issue
mkfs -t btrfs -f /dev/sdb1 /dev/sdb2
mount /dev/sdb1 /mnt/btrfs
while true; do btrfs scrub start -B /mnt/btrfs >/dev/null 2>&1; done &
btrfs replace start -Bf /dev/sdb2 /dev/sdb3 /mnt/btrfs
# if this replace succeeded, do the following and repeat until
# you see this log in dmesg
# BTRFS: btrfs_scrub_dev(/dev/sdb2, 2, /dev/sdb3) failed -115
#btrfs replace start -Bf /dev/sdb3 /dev/sdb2 /mnt/btrfs
# once you see the error log in dmesg, check return value of
# replace
echo $?
Introduce a new dev replace result
BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS
to catch -EINPROGRESS explicitly and return other errors directly to
userspace.
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
size of @btrfsic_state needs more than 2M, it is very likely to
fail allocating memory using kzalloc(). see following mesage:
[91428.902148] Call Trace:
[<ffffffff816f6e0f>] dump_stack+0x4d/0x66
[<ffffffff811b1c7f>] warn_alloc_failed+0xff/0x170
[<ffffffff811b66e1>] __alloc_pages_nodemask+0x951/0xc30
[<ffffffff811fd9da>] alloc_pages_current+0x11a/0x1f0
[<ffffffff811b1e0b>] ? alloc_kmem_pages+0x3b/0xf0
[<ffffffff811b1e0b>] alloc_kmem_pages+0x3b/0xf0
[<ffffffff811d1018>] kmalloc_order+0x18/0x50
[<ffffffff811d1074>] kmalloc_order_trace+0x24/0x140
[<ffffffffa06c097b>] btrfsic_mount+0x8b/0xae0 [btrfs]
[<ffffffff810af555>] ? check_preempt_curr+0x85/0xa0
[<ffffffff810b2de3>] ? try_to_wake_up+0x103/0x430
[<ffffffffa063d200>] open_ctree+0x1bd0/0x2130 [btrfs]
[<ffffffffa060fdde>] btrfs_mount+0x62e/0x8b0 [btrfs]
[<ffffffff811fd9da>] ? alloc_pages_current+0x11a/0x1f0
[<ffffffff811b0a5e>] ? __get_free_pages+0xe/0x50
[<ffffffff81230429>] mount_fs+0x39/0x1b0
[<ffffffff812509fb>] vfs_kern_mount+0x6b/0x150
[<ffffffff812537fb>] do_mount+0x27b/0xc30
[<ffffffff811b0a5e>] ? __get_free_pages+0xe/0x50
[<ffffffff812544f6>] SyS_mount+0x96/0xf0
[<ffffffff81701970>] system_call_fastpath+0x16/0x1b
Since we are allocating memory for hash table array, so
it will be good if we could allocate continuous pages here.
Fix this problem by firstly trying kzalloc(), if we fail,
use vzalloc() instead.
Signed-off-by: Wang Shilong <wangshilong1991@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
If cow_file_range_inline() failed, when called from compress_file_range(),
we were tagging the locked page for writeback, end its writeback and unlock it,
but not marking it with an error nor setting AS_EIO in inode's mapping flags.
This made it impossible for a caller of filemap_fdatawrite_range (writepages)
or filemap_fdatawait_range() to know that an error happened. And the return
value of compress_file_range() is useless because it's returned to a workqueue
task and not to the task calling filemap_fdatawrite_range (writepages).
This change applies on top of the previous patchset starting at the patch
titled:
"[1/5] Btrfs: set page and mapping error on compressed write failure"
Which changed extent_clear_unlock_delalloc() to use SetPageError and
mapping_set_error().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
To avoid duplicating this double filemap_fdatawrite_range() call for
inodes with async extents (compressed writes) so often.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
For compressed writes, after doing the first filemap_fdatawrite_range() we
don't get the pages tagged for writeback immediately. Instead we create
a workqueue task, which is run by other kthread, and keep the pages locked.
That other kthread compresses data, creates the respective ordered extent/s,
tags the pages for writeback and unlocks them. Therefore we need a second
call to filemap_fdatawrite_range() if we have compressed writes, as this
second call will wait for the pages to become unlocked, then see they became
tagged for writeback and finally wait for the writeback to finish.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Its return value is useless, its single caller ignores it and can't do
anything with it anyway, since it's a workqueue task and not the task
calling filemap_fdatawrite_range (writepages) nor filemap_fdatawait_range().
Failure is communicated to such functions via start and end of writeback
with the respective pages tagged with an error and AS_EIO flag set in the
inode's imapping.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Steps to reproduce:
# mkfs.btrfs -f /dev/sdb
# mount -t btrfs /dev/sdb /mnt -o compress=lzo
# dd if=/dev/zero of=/mnt/data bs=$((33*4096)) count=1
after previous steps, inode will be detected as bad compression ratio,
and NOCOMPRESS flag will be set for that inode.
Reason is that compress have a max limit pages every time(128K), if a
132k write in, it will be splitted into two write(128k+4k), this bug
is a leftover for commit 68bb462d42a(Btrfs: don't compress for a small write)
Fix this problem by checking every time before compression, if it is a
small write(<=blocksize), we bail out and fall into nocompression directly.
Signed-off-by: Wang Shilong <wangshilong1991@gmail.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Our compressed bio write end callback was essentially ignoring the error
parameter. When a write error happens, it must pass a value of 0 to the
inode's write_page_end_io_hook callback, SetPageError on the respective
pages and set AS_EIO in the inode's mapping flags, so that a call to
filemap_fdatawait_range() / filemap_fdatawait() can find out that errors
happened (we surely don't want silent failures on fsync for example).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Its return value is completely ignored by its single caller and it's
useless anyway, since errors are indicated through SetPageError and
the bit AS_EIO set in the flags of the inode's mapping. The caller
can't do anything with the value, as it's invoked from a workqueue
task and not by the task calling filemap_fdatawrite_range (which calls
the writepages address space callback, which in turn calls the inode's
fill_delalloc callback).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
If we had an error when processing one of the async extents from our list,
we were not processing the remaining async extents, meaning we would leak
those async_extent structs, never release the pages with the compressed
data and never unlock and clear the dirty flag from the inode's pages (those
that correspond to the uncompressed content).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
In inode.c:submit_compressed_extents(), if we fail before calling
btrfs_submit_compressed_write(), or when that function fails, we
were freeing the async_extent structure without releasing its pages
and freeing the pages array.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
In inode.c:submit_compressed_extents(), before calling btrfs_submit_compressed_write()
we start writeback for all pages, clear their dirty flag, unlock them, etc, but if
btrfs_submit_compressed_write() fails (at the moment it can only fail with -ENOMEM),
we never end the writeback on the pages, so any filemap_fdatawait_range() call will
hang forever. We were also not calling the writepage end io hook, which means the
corresponding ordered extent will never complete and all its waiters will block
forever, such as a full fsync (via btrfs_wait_ordered_range()).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
If we fail in submit_compressed_extents() before calling btrfs_submit_compressed_write(),
we start and end the writeback for the pages (clear their dirty flag, unlock them, etc)
but we don't tag the pages, nor the inode's mapping, with an error. This makes it
impossible for a caller of filemap_fdatawait_range() (fsync, or transaction commit
for e.g.) know that there was an error.
Note that the return value of submit_compressed_extents() is useless, as that function
is executed by a workqueue task and not directly by the fill_delalloc callback. This
means the writepage/s callbacks of the inode's address space operations don't get that
return value.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Check against !OVL_PATH_LOWER instead of OVL_PATH_MERGE. For a copied up
directory the two are currently equivalent.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Pass dentry into ovl_dir_read_merged() insted of upperpath and lowerpath.
This cleans up callers and paves the way for multi-layer directory reads.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Xattr operations can race with copy up. This does not matter as long as
we consistently fiter out "trunsted.overlay.opaque" attribute on upper
directories.
Previously we checked parent against OVL_PATH_MERGE. This is too general,
and prone to race with copy-up. I.e. we found the parent to be on the
lower layer but ovl_dentry_real() would return the copied-up dentry,
possibly with the "opaque" attribute.
So instead use ovl_path_real() and decide to filter the attributes based on
the actual type of the dentry we'll use.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
ovl_remove_and_whiteout() needs to check if upper dentry exists or not
after having locked upper parent directory.
Previously we used a "type" value computed before locking the upper parent
directory, which is susceptible to racing with copy-up.
There's a similar check in ovl_check_empty_and_clear(). This one is not
actually racy, since copy-up doesn't change the "emptyness" property of a
directory. Add a comment to this effect, and check the existence of upper
dentry locally to make the code cleaner.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Some distributions carry an "old" format of overlayfs while mainline has a
"new" format.
The distros will possibly want to keep the old overlayfs alongside the new
for compatibility reasons.
To make it possible to differentiate the two versions change the name of
the new one from "overlayfs" to "overlay".
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reported-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Cc: Andy Whitcroft <apw@canonical.com>
This patch fix spelling typo in printk and Kconfig within
various part of kernel sources.
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
In ->atomic_open(inode, dentry, file, opened) calling finish_no_open(file, NULL)
is equivalent to dget(dentry); return finish_no_open(file, dentry);
No need to open-code that...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
vfree() is allowed under spinlock these days, but it's cheaper when
it doesn't step into deferred case and here it's very easy to avoid.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
dentry is always hashed and negative, inode - non-error, non-NULL and
non-directory. In such conditions d_splice_alias() is equivalent to
"d_instantiate(dentry, inode) and return NULL", which simplifies the
downstream code and is consistent with the "have to create a new object"
case.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
If a node page is request to be written during the reclaiming path, we should
submit the bio to avoid pending to recliam it.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now in f2fs, we have three inode cache: ORPHAN_INO, APPEND_INO, UPDATE_INO,
and we manage fields related to inode cache separately in struct f2fs_sb_info
for each inode cache type.
This makes codes a bit messy, so that this patch intorduce a new struct
inode_management to wrap inner fields as following which make codes more neat.
/* for inner inode cache management */
struct inode_management {
struct radix_tree_root ino_root; /* ino entry array */
spinlock_t ino_lock; /* for ino entry lock */
struct list_head ino_list; /* inode list head */
unsigned long ino_num; /* number of entries */
};
struct f2fs_sb_info {
...
struct inode_management im[MAX_INO_ENTRY]; /* manage inode cache */
...
}
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Because we have checked the contrary condition in case of "if" judgment, we do
not need to check the condition again in case of "else" judgment. Let's remove
it.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In f2fs_remount, we will stop gc thread and set need_restart_gc as true when new
option is set without BG_GC, then if any error occurred in the following
procedure, we can restore to start the gc thread.
But after that, We will fail to restore gc thread in start_gc_thread as BG_GC is
not set in new option, so we'd better move this condition judgment out of
start_gc_thread to fix this issue.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The iput() function was called in up to three cases by the udf_fill_super()
function during error handling even if the passed data structure element
contained still a null pointer. This implementation detail could be improved
by the introduction of another jump label.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Jan Kara <jack@suse.cz>
The iput() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Jan Kara <jack@suse.cz>
A process may exit, leaving an orphan lock in the lockspace.
This adds the capability for another process to acquire the
orphan lock. Acquiring the orphan just moves the lock from
the orphan list onto the acquiring process's list of locks.
An adopting process must specify the resource name and mode
of the lock it wants to adopt. If a matching lock is found,
the lock is moved to the caller's 's list of locks, and the
lkid of the lock is returned like the lkid of a new lock.
If an orphan with a different mode is found, then -EAGAIN is
returned. If no orphan lock is found on the resource, then
-ENOENT is returned. No async completion is used because
the result is immediately available.
Also, when orphans are purged, allow a zero nodeid to refer
to the local nodeid so the caller does not need to look up
the local nodeid.
Signed-off-by: David Teigland <teigland@redhat.com>
The currect code for nfsd41_cb_get_slot() and nfsd4_cb_done() has no
locking in order to guarantee atomicity, and so allows for races of
the form.
Task 1 Task 2
====== ======
if (test_and_set_bit(0) != 0) {
clear_bit(0)
rpc_wake_up_next(queue)
rpc_sleep_on(queue)
return false;
}
This patch breaks the race condition by adding a retest of the bit
after the call to rpc_sleep_on().
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The fair reader/writer locks mean that btrfs_clear_path_blocking needs
to strictly follow lock ordering rules even when we already have
blocking locks on a given path.
Before we can clear a blocking lock on the path, we need to make sure
all of the locks have been converted to blocking. This will remove lock
inversions against anyone spinning in write_lock() against the buffers
we're trying to get read locks on. These inversions didn't exist before
the fair read/writer locks, but now we need to be more careful.
We papered over this deadlock in the past by changing
btrfs_try_read_lock() to be a true trylock against both the spinlock and
the blocking lock. This was slower, and not sufficient to fix all the
deadlocks. This patch adds a btrfs_tree_read_lock_atomic(), which
basically means get the spinlock but trylock on the blocking lock.
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reported-by: Patrick Schmid <schmid@phys.ethz.ch>
cc: stable@vger.kernel.org #v3.15+
With the isofs_hash() function removed, isofs_hash_ms() is the only user
of isofs_hash_common(), but it's defined inside of an #ifdef, which triggers
this gcc warning in ARM axm55xx_defconfig starting with v3.18-rc3:
fs/isofs/inode.c:177:1: warning: 'isofs_hash_common' defined but not used [-Wunused-function]
This patch moves the function inside of the same #ifdef section to avoid that
warning, which seems the best compromise of a relatively harmless patch for
a late -rc.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: b0afd8e5db ("isofs: don't bother with ->d_op for normal case")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In "d_prune_alias(): just lock the parent and call __dentry_kill()" the old
dget + d_drop + dput has been replaced with lock_parent + __dentry_kill;
unfortunately, dput() does more than just killing dentry - it also drops the
reference to parent. New variant leaks that reference and needs dput(parent)
after killing the child off.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull the beginning of seq_file cleanup from Steven:
"I'm looking to clean up the seq_file code and to eventually merge the
trace_seq code with seq_file as well, since they basically do the same thing.
Part of this process is to remove the return code of seq_printf() and friends
as they are rather inconsistent. It is better to use the new function
seq_has_overflowed() if you want to stop processing when the buffer
is full. Note, if the buffer is full, the seq_file code will throw away
the contents, allocate a bigger buffer, and then call your code again
to fill in the data. The only thing that breaking out of the function
early does is to save a little time which is probably never noticed.
I started with patches from Joe Perches and modified them as well.
There's many more places that need to be updated before we can convert
seq_printf() and friends to return void. But this patch set introduces
the seq_has_overflowed() and does some initial updates."
This patch fixes kmemcheck warning in switch_names. The function
switch_names swaps inline names of two dentries. It swaps full arrays
d_iname, no matter how many bytes are really used by the strings. Reading
data beyond string ends results in kmemcheck warning.
We fix the bug by marking both arrays as fully initialized.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org # v3.15
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... for situations when we don't have any candidate in pathnames - basically,
in descriptor-based syscalls.
[Folded the build fix for !CONFIG_AUDITSYSCALL configs from Chen Gang]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In ->atomic_open(inode, dentry, file, opened) calling finish_no_open(file, NULL)
is equivalent to dget(dentry); return finish_no_open(file, dentry);
No need to open-code that...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
dentry is always hashed and negative, inode - non-error, non-NULL and
non-directory. In such conditions d_splice_alias() is equivalent to
"d_instantiate(dentry, inode) and return NULL", which simplifies the
downstream code and is consistent with the "have to create a new object"
case.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Even when security labels are disabled we support at least the same
attributes as v4.1.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: stable@kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The locked page should be released before returning the function.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The functions iput() and put_pid() test whether their argument is NULL
and then return immediately. Thus the test around the call is not needed.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The iput() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Jan Kara <jack@suse.cz>
integrity_kernel_read() duplicates the file read operations code
in vfs_read(). This patch refactors vfs_read() code creating a
helper function __vfs_read(). It is used by both vfs_read() and
integrity_kernel_read().
Signed-off-by: Dmitry Kasatkin <d.kasatkin@samsung.com>
Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
We no longer need mpx.h in exec.c. This will obviously also
break the build for non-x86 builds. We get the MPX includes that
we need from mmu_context.h now.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Hansen <dave@sr71.net>
Link: http://lkml.kernel.org/r/20141118003608.837015B3@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This is really the meat of the MPX patch set. If there is one patch to
review in the entire series, this is the one. There is a new ABI here
and this kernel code also interacts with userspace memory in a
relatively unusual manner. (small FAQ below).
Long Description:
This patch adds two prctl() commands to provide enable or disable the
management of bounds tables in kernel, including on-demand kernel
allocation (See the patch "on-demand kernel allocation of bounds tables")
and cleanup (See the patch "cleanup unused bound tables"). Applications
do not strictly need the kernel to manage bounds tables and we expect
some applications to use MPX without taking advantage of this kernel
support. This means the kernel can not simply infer whether an application
needs bounds table management from the MPX registers. The prctl() is an
explicit signal from userspace.
PR_MPX_ENABLE_MANAGEMENT is meant to be a signal from userspace to
require kernel's help in managing bounds tables.
PR_MPX_DISABLE_MANAGEMENT is the opposite, meaning that userspace don't
want kernel's help any more. With PR_MPX_DISABLE_MANAGEMENT, the kernel
won't allocate and free bounds tables even if the CPU supports MPX.
PR_MPX_ENABLE_MANAGEMENT will fetch the base address of the bounds
directory out of a userspace register (bndcfgu) and then cache it into
a new field (->bd_addr) in the 'mm_struct'. PR_MPX_DISABLE_MANAGEMENT
will set "bd_addr" to an invalid address. Using this scheme, we can
use "bd_addr" to determine whether the management of bounds tables in
kernel is enabled.
Also, the only way to access that bndcfgu register is via an xsaves,
which can be expensive. Caching "bd_addr" like this also helps reduce
the cost of those xsaves when doing table cleanup at munmap() time.
Unfortunately, we can not apply this optimization to #BR fault time
because we need an xsave to get the value of BNDSTATUS.
==== Why does the hardware even have these Bounds Tables? ====
MPX only has 4 hardware registers for storing bounds information.
If MPX-enabled code needs more than these 4 registers, it needs to
spill them somewhere. It has two special instructions for this
which allow the bounds to be moved between the bounds registers
and some new "bounds tables".
They are similar conceptually to a page fault and will be raised by
the MPX hardware during both bounds violations or when the tables
are not present. This patch handles those #BR exceptions for
not-present tables by carving the space out of the normal processes
address space (essentially calling the new mmap() interface indroduced
earlier in this patch set.) and then pointing the bounds-directory
over to it.
The tables *need* to be accessed and controlled by userspace because
the instructions for moving bounds in and out of them are extremely
frequent. They potentially happen every time a register pointing to
memory is dereferenced. Any direct kernel involvement (like a syscall)
to access the tables would obviously destroy performance.
==== Why not do this in userspace? ====
This patch is obviously doing this allocation in the kernel.
However, MPX does not strictly *require* anything in the kernel.
It can theoretically be done completely from userspace. Here are
a few ways this *could* be done. I don't think any of them are
practical in the real-world, but here they are.
Q: Can virtual space simply be reserved for the bounds tables so
that we never have to allocate them?
A: As noted earlier, these tables are *HUGE*. An X-GB virtual
area needs 4*X GB of virtual space, plus 2GB for the bounds
directory. If we were to preallocate them for the 128TB of
user virtual address space, we would need to reserve 512TB+2GB,
which is larger than the entire virtual address space today.
This means they can not be reserved ahead of time. Also, a
single process's pre-popualated bounds directory consumes 2GB
of virtual *AND* physical memory. IOW, it's completely
infeasible to prepopulate bounds directories.
Q: Can we preallocate bounds table space at the same time memory
is allocated which might contain pointers that might eventually
need bounds tables?
A: This would work if we could hook the site of each and every
memory allocation syscall. This can be done for small,
constrained applications. But, it isn't practical at a larger
scale since a given app has no way of controlling how all the
parts of the app might allocate memory (think libraries). The
kernel is really the only place to intercept these calls.
Q: Could a bounds fault be handed to userspace and the tables
allocated there in a signal handler instead of in the kernel?
A: (thanks to tglx) mmap() is not on the list of safe async
handler functions and even if mmap() would work it still
requires locking or nasty tricks to keep track of the
allocation state there.
Having ruled out all of the userspace-only approaches for managing
bounds tables that we could think of, we create them on demand in
the kernel.
Based-on-patch-by: Qiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-mm@kvack.org
Cc: linux-mips@linux-mips.org
Cc: Dave Hansen <dave@sr71.net>
Link: http://lkml.kernel.org/r/20141114151829.AD4310DE@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
MPX-enabled applications using large swaths of memory can
potentially have large numbers of bounds tables in process
address space to save bounds information. These tables can take
up huge swaths of memory (as much as 80% of the memory on the
system) even if we clean them up aggressively. In the worst-case
scenario, the tables can be 4x the size of the data structure
being tracked. IOW, a 1-page structure can require 4 bounds-table
pages.
Being this huge, our expectation is that folks using MPX are
going to be keen on figuring out how much memory is being
dedicated to it. So we need a way to track memory use for MPX.
If we want to specifically track MPX VMAs we need to be able to
distinguish them from normal VMAs, and keep them from getting
merged with normal VMAs. A new VM_ flag set only on MPX VMAs does
both of those things. With this flag, MPX bounds-table VMAs can
be distinguished from other VMAs, and userspace can also walk
/proc/$pid/smaps to get memory usage for MPX.
In addition to this flag, we also introduce a special ->vm_ops
specific to MPX VMAs (see the patch "add MPX specific mmap
interface"), but currently different ->vm_ops do not by
themselves prevent VMA merging, so we still need this flag.
We understand that VM_ flags are scarce and are open to other
options.
Signed-off-by: Qiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-mm@kvack.org
Cc: linux-mips@linux-mips.org
Cc: Dave Hansen <dave@sr71.net>
Link: http://lkml.kernel.org/r/20141114151825.565625B3@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The current gfs2 freezing code is considerably more complicated than it
should be because it doesn't use the vfs freezing code on any node except
the one that begins the freeze. This is because it needs to acquire a
cluster glock before calling the vfs code to prevent a deadlock, and
without the new freeze_super and thaw_super hooks, that was impossible. To
deal with the issue, gfs2 had to do some hacky locking tricks to make sure
that a frozen node couldn't be holding on a lock it needed to do the
unfreeze ioctl.
This patch makes use of the new hooks to simply the gfs2 locking code. Now,
all the nodes in the cluster freeze and thaw in exactly the same way. Every
node in the cluster caches the freeze glock in the shared state. The new
freeze_super hook allows the freezing node to grab this freeze glock in
the exclusive state without first calling the vfs freeze_super function.
All the nodes in the cluster see this lock change, and call the vfs
freeze_super function. The vfs locking code guarantees that the nodes can't
get stuck holding the glocks necessary to unfreeze the system. To
unfreeze, the freezing node uses the new thaw_super hook to drop the freeze
glock. Again, all the nodes notice this, reacquire the glock in shared mode
and call the vfs thaw_super function.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Currently, freezing a filesystem involves calling freeze_super, which locks
sb->s_umount and then calls the fs-specific freeze_fs hook. This makes it
hard for gfs2 (and potentially other cluster filesystems) to use the vfs
freezing code to do freezes on all the cluster nodes.
In order to communicate that a freeze has been requested, and to make sure
that only one node is trying to freeze at a time, gfs2 uses a glock
(sd_freeze_gl). The problem is that there is no hook for gfs2 to acquire
this lock before calling freeze_super. This means that two nodes can
attempt to freeze the filesystem by both calling freeze_super, acquiring
the sb->s_umount lock, and then attempting to grab the cluster glock
sd_freeze_gl. Only one will succeed, and the other will be stuck in
freeze_super, making it impossible to finish freezing the node.
To solve this problem, this patch adds the freeze_super and thaw_super
hooks. If a filesystem implements these hooks, they are called instead of
the vfs freeze_super and thaw_super functions. This means that every
filesystem that implements these hooks must call the vfs freeze_super and
thaw_super functions itself within the hook function to make use of the vfs
freezing code.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
* Another attempt at moving x86 to libstub taking advantage of the
__pure attribute - Ard Biesheuvel
* Add EFI runtime services section to ptdump - Mathias Krause
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUZnQGAAoJEC84WcCNIz1VfyUP/1MCVt4vepl7+0JzdUP/eVPs
CwPM6gBOvgx1PviWrtvSU8UyjtYDqZx7jnCvyvbmlgixAqqIoFm80x5sd9DfyJBj
vSrmavXaJgQomJN3N+fvaIpGJXp8NQmeNT87++UMb6VE5nYvx7suDcwfTqOaxcYt
yDwKatTXTvQxDLlGgtymp2UhgVKBICs9WVo8weevB5LPmpt4TFCi1GDSimJkfg+0
JTvkKF+QxPmVqgwY7bgdFFcfhsYCux5VgtbD4DJKS3LgfLJLAMKPOt83DbeOSmIa
8zqtlF3eMwHgecKrMLSfIZH3XIl2gsIdPdvT6iBQkwwGZuSzG93JgwJ90HYiKoDm
yNlffnhmdgI2RXO97UZJpzqGor+eNc0auuS4485PcE8NtZ1tbo20A/OCpfIzK8j8
Vk7sfZxaaHKF5PUtRe6vo3myRlUCofMIuSEWSF8d709R6AEEuia6RZ3Y45EJPROn
fKOiLsf7Og1Mk43Iy2lb7kFT766OsUnZZHU/xiIZj/v94HPWFWoFPtxPC0IURvPx
24oiJxCnyXWGtoyn+SSprl+NAPuPsxVFYriTwaq2RBuoY0NAdy7NIXKe2HTp/WI0
oSTtYkCRcwiHv0aSrg+yQHmwH7y7m39S3yIS4t5LXenn2G4ObUUjhcAQdF2Ft0Pr
MT/l+stTt390cpfpcQbE
=he2O
-----END PGP SIGNATURE-----
Merge tag 'efi-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi into x86/efi
Pull EFI updates for v3.19 from Matt Fleming:
- Support module unload for efivarfs - Mathias Krause
- Another attempt at moving x86 to libstub taking advantage of the
__pure attribute - Ard Biesheuvel
- Add EFI runtime services section to ptdump - Mathias Krause
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Highlights include:
- Stable patches to fix NFSv4.x delegation reclaim error paths
- Fix a bug whereby we were advertising NFSv4.1 but using NFSv4.2 features
- Fix a use-after-free problem with pNFS block layouts
- Fix a memory leak in the pNFS files O_DIRECT code
- Replace an intrusive and Oops-prone performance fix in the NFSv4 atomic
open code with a safer one-line version and revert the two original patches.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUZol9AAoJEGcL54qWCgDyNQQQALnngvpPR51BoO/iTz9ruXol
fGZy0SRIlTUKm1ArsQsQ+HGbV5K0hgP3Tg+z2AtEEZ8u/2Fi2Bqdl6+eNY12tKHd
uUctDdM5TXLrETAn1UULrnd2eX1cvPMBfOlXlAdNHHsGEgC7w7YQ+rzGwnls+HDy
LYXzY7Y3jYGdTMaRgZc5YRdtd8JBpCxciRvPEQLDIobwP0JnZC1afTLe1XInqB2I
TZ4NTHT+DEWA+Ou1P2deL7+RuJNEAeWWBvULJy76n4BqKvN/HNedOO5HyBYXrwSd
3UX3wbx9CWRxN1F0EqNKxjxZ/597JwqBeNoTDRcofLsqumUfAOtlbym1EahcD3Ls
pykopNfgUhGuhxolStmuHdS6CnyQPERpR5lFZcDp7XtcwSq4FcwD8DRzLJMZW5dg
N1lkfFlwQN3rqdk/NEHL+IxS41Hlk4HXjMoP6MNbRtqzIN6tW9tvC4MtAWd1aYxO
YuUW281pbWxXQ731s0kTIrMUdQ9vGSRBMcbnO9rL3o+xkh8y5SPVkx9lhdhJN0UD
VbQ5Ws/xZ54bD1PfyYb+Yx659lI8MSFOsDuMuLmDtfYnVicHwCA3H63StvQ3ihf/
q0gu8Iex9YbNNjf7IfYGuWPmPn3gwPBoURPC0bcZvMPdY6DXodU6Oj4BRTQ5VCie
9N0pt2wp2eRjaSzD7r5A
=8YN6
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-3.18-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:
- stable patches to fix NFSv4.x delegation reclaim error paths
- fix a bug whereby we were advertising NFSv4.1 but using NFSv4.2
features
- fix a use-after-free problem with pNFS block layouts
- fix a memory leak in the pNFS files O_DIRECT code
- replace an intrusive and Oops-prone performance fix in the NFSv4
atomic open code with a safer one-line version and revert the two
original patches"
* tag 'nfs-for-3.18-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
sunrpc: fix sleeping under rcu_read_lock in gss_stringify_acceptor
NFS: Don't try to reclaim delegation open state if recovery failed
NFSv4: Ensure that we call FREE_STATEID when NFSv4.x stateids are revoked
NFSv4: Fix races between nfs_remove_bad_delegation() and delegation return
NFSv4.1: nfs41_clear_delegation_stateid shouldn't trust NFS_DELEGATED_STATE
NFSv4: Ensure that we remove NFSv4.0 delegations when state has expired
NFS: SEEK is an NFS v4.2 feature
nfs: Fix use of uninitialized variable in nfs_getattr()
nfs: Remove bogus assignment
nfs: remove spurious WARN_ON_ONCE in write path
pnfs/blocklayout: serialize GETDEVICEINFO calls
nfs: fix pnfs direct write memory leak
Revert "NFS: nfs4_do_open should add negative results to the dcache."
Revert "NFS: remove BUG possibility in nfs4_open_and_get_state"
NFSv4: Ensure nfs_atomic_open set the dentry verifier on ENOENT
gfs2_fallocate() wasn't updating ctime and mtime when modifying the
inode. Add a call to file_update_time() to do that.
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This addresses an issue caught by fsx where the inode size was not being
updated to the expected value after fallocate(2) with mode 0.
The problem was caused by the offset and len parameters being converted
to multiples of the file system's block size, so i_size would be rounded
up to the nearest block size multiple instead of the requested size.
This replaces the per-chunk i_size updates with a single i_size_write on
successful completion of the operation. With this patch gfs2 gets
through a complete run of fsx.
For clarity, the check for (error == 0) following the loop is removed as
all failures before that point jump to out_* labels or return.
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
gfs2_fallocate wasn't checking inode_newsize_ok nor get_write_access.
Split out the context setup and inode locking pieces into a separate
function to make it more clear and add these missing calls.
inode_newsize_ok is called conditional on FALLOC_FL_KEEP_SIZE as there
is no need to enforce a file size limit if it isn't going to change.
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Merge misc fixes from Andrew Morton:
"15 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
MAINTAINERS: add IIO include files
kernel/panic.c: update comments for print_tainted
mem-hotplug: reset node present pages when hot-adding a new pgdat
mem-hotplug: reset node managed pages when hot-adding a new pgdat
mm/debug-pagealloc: correct freepage accounting and order resetting
fanotify: fix notification of groups with inode & mount marks
mm, compaction: prevent infinite loop in compact_zone
mm: alloc_contig_range: demote pages busy message from warn to info
mm/slab: fix unalignment problem on Malta with EVA due to slab merge
mm/page_alloc: restrict max order of merging on isolated pageblock
mm/page_alloc: move freepage counting logic to __free_one_page()
mm/page_alloc: add freepage on isolate pageblock to correct buddy list
mm/page_alloc: fix incorrect isolation behavior by rechecking migratetype
mm/compaction: skip the range until proper target pageblock is met
zram: avoid kunmap_atomic() of a NULL pointer
fsnotify() needs to merge inode and mount marks lists when notifying
groups about events so that ignore masks from inode marks are reflected
in mount mark notifications and groups are notified in proper order
(according to priorities).
Currently the sorting of the lists done by fsnotify_add_inode_mark() /
fsnotify_add_vfsmount_mark() and fsnotify() differed which resulted
ignore masks not being used in some cases.
Fix the problem by always using the same comparison function when
sorting / merging the mark lists.
Thanks to Heinrich Schuchardt for improvements of my patch.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=87721
Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Tested-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
TID of cap flush ack is 64 bits, but ceph_inode_info::flushing_cap_tid
is only 16 bits. 16 bits should be plenty to let the cap flush updates
pipeline appropriately, but we need to cast in the proper direction when
comparing these differently-sized versions. So downcast the 64-bits one
to 16 bits.
Reflects ceph.git commit a5184cf46a6e867287e24aeb731634828467cd98.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@redhat.com>
Any attempt to call nfs_remove_bad_delegation() while a delegation is being
returned is currently a no-op. This means that we can end up looping
forever in nfs_end_delegation_return() if something causes the delegation
to be revoked.
This patch adds a mechanism whereby the state recovery code can communicate
to the delegation return code that the delegation is no longer valid and
that it should not be used when reclaiming state.
It also changes the return value for nfs4_handle_delegation_recall_error()
to ensure that nfs_end_delegation_return() does not reattempt the lock
reclaim before state recovery is done.
http://lkml.kernel.org/r/CAN-5tyHwG=Cn2Q9KsHWadewjpTTy_K26ee+UnSvHvG4192p-Xw@mail.gmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This patch removes the assumption made previously, that we only need to
check the delegation stateid when it matches the stateid on a cached
open.
If we believe that we hold a delegation for this file, then we must assume
that its stateid may have been revoked or expired too. If we don't test it
then our state recovery process may end up caching open/lock state in a
situation where it should not.
We therefore rename the function nfs41_clear_delegation_stateid as
nfs41_check_delegation_stateid, and change it to always run through the
delegation stateid test and recovery process as outlined in RFC5661.
http://lkml.kernel.org/r/CAN-5tyHwG=Cn2Q9KsHWadewjpTTy_K26ee+UnSvHvG4192p-Xw@mail.gmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Somehow the nfs_v4_1_minor_ops had the NFS_CAP_SEEK flag set, enabling
SEEK over v4.1. This is wrong, and can make servers crash.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Variable 'err' needn't be initialized when nfs_getattr() uses it to
check whether it should call generic_fillattr() or not. That can result
in spurious error returns. Initialize 'err' properly.
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Commit 3a6fd1f004 (pnfs/blocklayout: remove read-modify-write handling
in bl_write_pagelist) introduced a bogus assignment pg_index = pg_index
in variable initialization. AFAICS it's just a typo so remove it.
Spotted by Coverity (id 1248711).
CC: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This WARN_ON_ONCE was supposed to catch reference counting bugs, but can
trigger in inappropriate situations.
This was reproducible using NFSv2 on an architecture with 64K pages -- we
verified that it was not a reference counting bug and the warning was
safe to ignore.
Reported-by: Will Deacon <will.deacon@arm.com>
Tested-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
For pNFS direct writes, layout driver may dynamically allocate ds_cinfo.buckets.
So we need to take care to free them when freeing dreq.
Ideally this needs to be done inside layout driver where ds_cinfo.buckets
are allocated. But buckets are attached to dreq and reused across LD IO iterations.
So I feel it's OK to free them in the generic layer.
Cc: stable@vger.kernel.org [v3.4+]
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
In some contexts, like in sysfs handlers, we don't want to trigger a
transaction commit. It's a heavy operation, we don't know what external
locks may be taken. Instead, make it possible to finish the operation
through sync syscall or SYNC_FS ioctl.
Signed-off-by: David Sterba <dsterba@suse.cz>
The pending mount option(s) now share namespace and bits with the normal
options, and the existing one for (inode_cache) is unset unconditionally
at each transaction commit.
Introduce a separate namespace for pending changes and enhance the
descriptions of the intended change to use separate bits for each
action.
Signed-off-by: David Sterba <dsterba@suse.cz>
If a pending change is requested, it's not processed unless there is a
transaction commit about to happen, not even after sync or SYNC_FS
ioctl. For example a remount that toggles the inode_cache option will
not take effect after sync on a quiescent filesystem.
Signed-off-by: David Sterba <dsterba@suse.cz>
There are some actions that modify global filesystem state but cannot be
performed at the time of request, but later at the transaction commit
time when the filesystem is in a known state.
For example enabling new incompat features on-the-fly or issuing
transaction commit from unsafe contexts (sysfs handlers).
Signed-off-by: David Sterba <dsterba@suse.cz>
There is no need to keep the module loaded when it serves no function in
case the EFI runtime services are disabled. Return an error in this case
so loading the module will fail.
Also supply a module_exit function to allow unloading the module.
Last, but not least, set the owner of the file_system_type struct.
Cc: Jeremy Kerr <jk@ozlabs.org>
Cc: Matthew Garrett <matthew.garrett@nebula.com>
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
If i_size becomes large outside of MAX_INLINE_DATA, we shoud convert the inode.
Otherwise, we can make some dirty pages during the truncation, and those pages
will be written through f2fs_write_data_page.
At that moment, the inode has still inline_data, so that it tries to write non-
zero pages into inline_data area.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The scenario is like this.
One trhead triggers:
f2fs_write_data_pages
lock_page
f2fs_write_data_page
f2fs_lock_op <- wait
The other thread triggers:
f2fs_truncate
truncate_blocks
f2fs_lock_op
truncate_partial_data_page
lock_page <- wait for locking the page
This patch resolves this bug by relocating truncate_partial_data_page.
This function is just to truncate user data page and not related to FS
consistency as well.
And, we don't need to call truncate_inline_data. Rather than that,
f2fs_write_data_page will finally update inline_data later.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The # of inline_data inode is decreased only when it has inline_data.
After clearing the flag, we can't decreased the number.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
It needs to write node pages if checkpoint is not doing in order to avoid
memory pressure.
Reviewed-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
All filesystems using VFS quotas are now converted to use their private
i_dquot fields. Remove the i_dquot field from generic inode structure.
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
CC: jfs-discussion@lists.sourceforge.net
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
CC: Mark Fasheh <mfasheh@suse.com>
CC: Joel Becker <jlbec@evilplan.org>
CC: ocfs2-devel@oss.oracle.com
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
CC: linux-ext4@vger.kernel.org
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
i_dquot array is used by relatively few filesystems (ext?, ocfs2, jfs,
reiserfs) so it is beneficial to move this array to fs-private part of
the inode. We cannot just pass quota pointers from filesystems to quota
functions because during quotaon and quotaoff we have to traverse list
of all inodes and manipulate i_dquot pointers for each inode. So we
provide a function which generic quota code can use to get pointer to
the i_dquot array from the filesystem.
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
We support user, group, and project quotas. Tell VFS about it.
CC: xfs@oss.sgi.com
CC: Dave Chinner <david@fromorbit.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
We support user and group quotas. Tell vfs about it.
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
CC: cluster-devel@redhat.com
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently all filesystems supporting VFS quota support user and group
quotas. With introduction of project quotas this is going to change so
make sure filesystem isn't called for quota type it doesn't support by
introduction of a bitmask determining which quota types each filesystem
supports.
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Pull btrfs fix from Chris Mason:
"It's a one liner for an error cleanup path that leads to crashes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix kfree on list_head in btrfs_lookup_csums_range error cleanup
This update fixes:
- incorrect warnings about i_mutex locking in
pagecache_isize_extended() and updates comments to match expected
locking
- another zero-range bug fix for stray file size updates
- a bunch of fixes for regression in the bulkstat code introduced in
3.17.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJUXTquAAoJEK3oKUf0dfodzPQP+wVWm3suRpvfpOeljlwcCB1w
r1MjJjEgKRmlRCwTe4IYSSS2ZBSz3f5qTunde1PEAcUyyf2gO5b/gV2WCaVDWIpV
0/1RDaXIRplTvY/i5UAOtqOSUpNwvWz7PCmCAR7RCHFfTyBBbFRlRdg1GPCrBAzv
rf/kRm9C6fRHHwLojwNCLEzA0MAdjKVG05A+Xv3MnWcd9fRxtNryQnhTIUJoDCVl
5keebmizquoc88WoQxDX29j2Ce+yjMPj27YhB9Z09mfmFvbLHT46UP2jI5Ty+DX5
rJikXA5Jv6wtig9wsbsenK7fsy1CyAAbmxS8vjyHsdCSAMpN98HkndsSadF8nH5U
sEh43OJjJS5LedVNqfV+LtI9ZD9+fGpAETQFVI8TpefFBX2aYw3fomWO0AUbRuMi
s4f6iz2sDvgDa+oE38XqKif6CqFsBX0QQSuCDiPaMi79vy2VLE/I5SU7QAj42+BW
sEyGVVcdJDsDpkUe6SicIpbNNwhXCR4GYc4jI4QYjDdcK6rrliCnsI8hHk5pPeKk
Qvt6ERiP5dLvp9f6KEzvPAkdxBmDKOtHKMSEMJzBBDtxFJXStJNuYZVtMYb8JwJq
nV0WKSoXR0xT/IeMw8336HF4GO04VCHjY3QnNh0qNdHYtfXJerZmpzVhae7dJmZZ
nLFimb3q2TlfgEvkPqmi
=8S0c
-----END PGP SIGNATURE-----
Merge tag 'xfs-for-linus-3.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs
Pull xfs fixes from Dave Chinner:
"This update fixes a warning in the new pagecache_isize_extended() and
updates some related comments, another fix for zero-range
misbehaviour, and an unforntuately large set of fixes for regressions
in the bulkstat code.
The bulkstat fixes are large but necessary. I wouldn't normally push
such a rework for a -rcX update, but right now xfsdump can silently
create incomplete dumps on 3.17 and it's possible that even xfsrestore
won't notice that the dumps were incomplete. Hence we need to get
this update into 3.17-stable kernels ASAP.
In more detail, the refactoring work I committed in 3.17 has exposed a
major hole in our QA coverage. With both xfsdump (the major user of
bulkstat) and xfsrestore silently ignoring missing files in the
dump/restore process, incomplete dumps were going unnoticed if they
were being triggered. Many of the dump/restore filesets were so small
that they didn't evenhave a chance of triggering the loop iteration
bugs we introduced in 3.17, so we didn't exercise the code
sufficiently, either.
We have already taken steps to improve QA coverage in xfstests to
avoid this happening again, and I've done a lot of manual verification
of dump/restore on very large data sets (tens of millions of inodes)
of the past week to verify this patch set results in bulkstat behaving
the same way as it does on 3.16.
Unfortunately, the fixes are not exactly simple - in tracking down the
problem historic API warts were discovered (e.g xfsdump has been
working around a 20 year old bug in the bulkstat API for the past 10
years) and so that complicated the process of diagnosing and fixing
the problems. i.e. we had to fix bugs in the code as well as
discover and re-introduce the userspace visible API bugs that we
unwittingly "fixed" in 3.17 that xfsdump relied on to work correctly.
Summary:
- incorrect warnings about i_mutex locking in pagecache_isize_extended()
and updates comments to match expected locking
- another zero-range bug fix for stray file size updates
- a bunch of fixes for regression in the bulkstat code introduced in
3.17"
* tag 'xfs-for-linus-3.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
xfs: track bulkstat progress by agino
xfs: bulkstat error handling is broken
xfs: bulkstat main loop logic is a mess
xfs: bulkstat chunk-formatter has issues
xfs: bulkstat chunk formatting cursor is broken
xfs: bulkstat btree walk doesn't terminate
mm: Fix comment before truncate_setsize()
xfs: rework zero range to prevent invalid i_size updates
mm: Remove false WARN_ON from pagecache_isize_extended()
xfs: Check error during inode btree iteration in xfs_bulkstat()
xfs: bulkstat doesn't release AGI buffer on error
The global state_lock protects the file_hashtbl, and that has the
potential to be a scalability bottleneck.
Address this by making the file_hashtbl use RCU. Add a rcu_head to the
nfs4_file and use that when freeing ones that have been hashed. In order
to conserve space, we union the fi_rcu field with the fi_delegations
list_head which must be clear by the time the last reference to the file
is dropped.
Convert find_file_locked to use RCU lookup primitives and not to require
that the state_lock be held, and convert find_file to do a lockless
lookup. Convert find_or_add_file to attempt a lockless lookup first, and
then fall back to doing a locked search and insert if that fails to find
anything.
Also, minimize the number of times we need to calculate the hash value
by passing it in as an argument to the search and insert functions, and
optimize the order of arguments in nfsd4_init_file.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
DEALLOCATE only returns a status value, meaning we can use the noop()
xdr encoder to reply to the client.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The ALLOCATE operation is used to preallocate space in a file. I can do
this by using vfs_fallocate() to do the actual preallocation.
ALLOCATE only returns a status indicator, so we don't need to write a
special encode() function.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This function needs to be exported so it can be used by the NFSD module
when responding to the new ALLOCATE and DEALLOCATE operations in NFS
v4.2. Christoph Hellwig suggested renaming the function to stay
consistent with how other vfs functions are named.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
To match the previous patch which used the pre-alloc buffer for
writes, this patch causes reads to use the same buffer.
This is not strictly necessary as the current seq_read() will allocate
on first read, so user-space can trigger the required pre-alloc. But
consistency is valuable.
The read function is somewhat simpler than seq_read() and, for example,
does not support reading from an offset into the file: reads must be
at the start of the file.
As seq_read() does not use the prealloc buffer, ->seq_show is
incompatible with ->prealloc and caused an EINVAL return from open().
sysfs code which calls into kernfs always chooses the correct function.
As the buffer is shared with writes and other reads, the mutex is
extended to cover the copy_to_user.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
md/raid allows metadata management to be performed in user-space.
A various times, particularly on device failure, the metadata needs
to be updated before further writes can be permitted.
This means that the user-space program which updates metadata much
not block on writeout, and so must not allocate memory.
mlockall(MCL_CURRENT|MCL_FUTURE) and pre-allocation can avoid all
memory allocation issues for user-memory, but that does not help
kernel memory.
Several kernel objects can be pre-allocated. e.g. files opened before
any writes to the array are permitted.
However some kernel allocation happens in places that cannot be
pre-allocated.
In particular, writes to sysfs files (to tell md that it can now
allow writes to the array) allocate a buffer using GFP_KERNEL.
This patch allows attributes to be marked as "PREALLOC". In that case
the maximal buffer is allocated when the file is opened, and then used
on each write instead of allocating a new buffer.
As the same buffer is now shared for all writes on the same file
description, the mutex is extended to cover full use of the buffer
including the copy_from_user().
The new __ATTR_PREALLOC() 'or's a new flag in to the 'mode', which is
inspected by sysfs_add_file_mode_ns() to determine if the file should be
marked as requiring prealloc.
Despite the comment, we *do* use ->seq_show together with ->prealloc
in this patch. The next patch fixes that.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
According to the user expectations common utilities like dd or sh
redirection operator > should work correctly over binary files from
sysfs. At the moment doing excessive write can not be completed:
write(1, "\0\0\0\0\0\0\0\0", 8) = 4
write(1, "\0\0\0\0", 4) = 0
write(1, "\0\0\0\0", 4) = 0
write(1, "\0\0\0\0", 4) = 0
...
Fix the problem by returning EFBIG described in man 2 write.
Signed-off-by: Vladimir Zapolskiy <vladimir_zapolskiy@mentor.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The journal update function did not work for extended attributes properly,
because extended attribute inodes carry the xattr data, and the size of this
data was not taken into account.
Artem: improved commit message, amended the patch a bit.
Signed-off-by: Subodh Nijsure <snijsure@grid-net.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Ben Shelton <ben.shelton@ni.com>
Acked-by: Brad Mouring <brad.mouring@ni.com>
Acked-by: Gratian Crisan <gratian.crisan@ni.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
We forgot to free the budget in 'write_begin_slow()' when 'do_readpage()'
fails. This patch fixes the issue.
Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
This patch adds to control the memory footprint used by ino entries.
This will conduct best effort, not strictly.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The bulkstat main loop progress is tracked by the "lastino"
variable, which is a full 64 bit inode. However, the loop actually
works on agno/agino pairs, and so there's a significant disconnect
between the rest of the loop and the main cursor. Convert this to
use the agino, and pass the agino into the chunk formatting function
and convert it too.
This gets rid of the inconsistency in the loop processing, and
finally makes it simple for us to skip inodes at any point in the
loop simply by incrementing the agino cursor.
cc: <stable@vger.kernel.org> # 3.17
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The error propagation is a horror - xfs_bulkstat() returns
a rval variable which is only set if there are formatter errors. Any
sort of btree walk error or corruption will cause the bulkstat walk
to terminate but will not pass an error back to userspace. Worse
is the fact that formatter errors will also be ignored if any inodes
were correctly formatted into the user buffer.
Hence bulkstat can fail badly yet still report success to userspace.
This causes significant issues with xfsdump not dumping everything
in the filesystem yet reporting success. It's not until a restore
fails that there is any indication that the dump was bad and tha
bulkstat failed. This patch now triggers xfsdump to fail with
bulkstat errors rather than silently missing files in the dump.
This now causes bulkstat to fail when the lastino cookie does not
fall inside an existing inode chunk. The pre-3.17 code tolerated
that error by allowing the code to move to the next inode chunk
as the agino target is guaranteed to fall into the next btree
record.
With the fixes up to this point in the series, xfsdump now passes on
the troublesome filesystem image that exposes all these bugs.
cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
There are a bunch of variables tha tare more wildy scoped than they
need to be, obfuscated user buffer checks and tortured "next inode"
tracking. This all needs cleaning up to expose the real issues that
need fixing.
cc: <stable@vger.kernel.org> # 3.17
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The loop construct has issues:
- clustidx is completely unused, so remove it.
- the loop tries to be smart by terminating when the
"freecount" tells it that all inodes are free. Just drop
it as in most cases we have to scan all inodes in the
chunk anyway.
- move the "user buffer left" condition check to the only
point where we consume space int eh user buffer.
- move the initialisation of agino out of the loop, leaving
just a simple loop control logic using the clusteridx.
Also, double handling of the user buffer variables leads to problems
tracking the current state - use the cursor variables directly
rather than keeping local copies and then having to update the
cursor before returning.
cc: <stable@vger.kernel.org> # 3.17
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The xfs_bulkstat_agichunk formatting cursor takes buffer values from
the main loop and passes them via the structure to the chunk
formatter, and the writes the changed values back into the main loop
local variables. Unfortunately, this complex dance is full of corner
cases that aren't handled correctly.
The biggest problem is that it is double handling the information in
both the main loop and the chunk formatting function, leading to
inconsistent updates and endless loops where progress is not made.
To fix this, push the struct xfs_bulkstat_agichunk outwards to be
the primary holder of user buffer information. this removes the
double handling in the main loop.
Also, pass the last inode processed by the chunk formatter as a
separate parameter as it purely an output variable and is not
related to the user buffer consumption cursor.
Finally, the chunk formatting code is not shared by anyone, so make
it local to xfs_itable.c.
cc: <stable@vger.kernel.org> # 3.17
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The bulkstat code has several different ways of detecting the end of
an AG when doing a walk. They are not consistently detected, and the
code that checks for the end of AG conditions is not consistently
coded. Hence the are conditions where the walk code can get stuck in
an endless loop making no progress and not triggering any
termination conditions.
Convert all the "tmp/i" status return codes from btree operations
to a common name (stat) and apply end-of-ag detection to these
operations consistently.
cc: <stable@vger.kernel.org> # 3.17
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
When lockd can't talk to a remote statd, it'll spew a warning message
to the ring buffer. If the application is really hammering on locks
however, it's possible for that message to spam the logs. Ratelimit it
to minimize the potential for harm.
Reported-by: Ian Collier <imc@cs.ox.ac.uk>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
https://bugzilla.kernel.org/show_bug.cgi?id=86831
Markus reported that when shutting down mysqld (with AIO support,
on a ext3 formatted Harddrive) leads to a negative number of dirty pages
(underrun to the counter). The negative number results in a drastic reduction
of the write performance because the page cache is not used, because the kernel
thinks it is still 2 ^ 32 dirty pages open.
Add a warn trace in __dec_zone_state will catch this easily:
static inline void __dec_zone_state(struct zone *zone, enum
zone_stat_item item)
{
atomic_long_dec(&zone->vm_stat[item]);
+ WARN_ON_ONCE(item == NR_FILE_DIRTY &&
atomic_long_read(&zone->vm_stat[item]) < 0);
atomic_long_dec(&vm_stat[item]);
}
[ 21.341632] ------------[ cut here ]------------
[ 21.346294] WARNING: CPU: 0 PID: 309 at include/linux/vmstat.h:242
cancel_dirty_page+0x164/0x224()
[ 21.355296] Modules linked in: wutbox_cp sata_mv
[ 21.359968] CPU: 0 PID: 309 Comm: kworker/0:1 Not tainted 3.14.21-WuT #80
[ 21.366793] Workqueue: events free_ioctx
[ 21.370760] [<c0016a64>] (unwind_backtrace) from [<c0012f88>]
(show_stack+0x20/0x24)
[ 21.378562] [<c0012f88>] (show_stack) from [<c03f8ccc>]
(dump_stack+0x24/0x28)
[ 21.385840] [<c03f8ccc>] (dump_stack) from [<c0023ae4>]
(warn_slowpath_common+0x84/0x9c)
[ 21.393976] [<c0023ae4>] (warn_slowpath_common) from [<c0023bb8>]
(warn_slowpath_null+0x2c/0x34)
[ 21.402800] [<c0023bb8>] (warn_slowpath_null) from [<c00c0688>]
(cancel_dirty_page+0x164/0x224)
[ 21.411524] [<c00c0688>] (cancel_dirty_page) from [<c00c080c>]
(truncate_inode_page+0x8c/0x158)
[ 21.420272] [<c00c080c>] (truncate_inode_page) from [<c00c0a94>]
(truncate_inode_pages_range+0x11c/0x53c)
[ 21.429890] [<c00c0a94>] (truncate_inode_pages_range) from
[<c00c0f6c>] (truncate_pagecache+0x88/0xac)
[ 21.439252] [<c00c0f6c>] (truncate_pagecache) from [<c00c0fec>]
(truncate_setsize+0x5c/0x74)
[ 21.447731] [<c00c0fec>] (truncate_setsize) from [<c013b3a8>]
(put_aio_ring_file.isra.14+0x34/0x90)
[ 21.456826] [<c013b3a8>] (put_aio_ring_file.isra.14) from
[<c013b424>] (aio_free_ring+0x20/0xcc)
[ 21.465660] [<c013b424>] (aio_free_ring) from [<c013b4f4>]
(free_ioctx+0x24/0x44)
[ 21.473190] [<c013b4f4>] (free_ioctx) from [<c003d8d8>]
(process_one_work+0x134/0x47c)
[ 21.481132] [<c003d8d8>] (process_one_work) from [<c003e988>]
(worker_thread+0x130/0x414)
[ 21.489350] [<c003e988>] (worker_thread) from [<c00448ac>]
(kthread+0xd4/0xec)
[ 21.496621] [<c00448ac>] (kthread) from [<c000ec18>]
(ret_from_fork+0x14/0x20)
[ 21.503884] ---[ end trace 79c4bf42c038c9a1 ]---
The cause is that we set the aio ring file pages as *DIRTY* via SetPageDirty
(bypasses the VFS dirty pages increment) when init, and aio fs uses
*default_backing_dev_info* as the backing dev, which does not disable
the dirty pages accounting capability.
So truncating aio ring file will contribute to accounting dirty pages (VFS
dirty pages decrement), then error occurs.
The original goal is keeping these pages in memory (can not be reclaimed
or swapped) in life-time via marking it dirty. But thinking more, we have
already pinned pages via elevating the page's refcount, which can already
achieve the goal, so the SetPageDirty seems unnecessary.
In order to fix the issue, using the __set_page_dirty_no_writeback instead
of the nop .set_page_dirty, and dropped the SetPageDirty (don't manually
set the dirty flags, don't disable set_page_dirty(), rely on default behaviour).
With the above change, the dirty pages accounting can work well. But as we
known, aio fs is an anonymous one, which should never cause any real write-back,
we can ignore the dirty pages (write back) accounting by disabling the dirty
pages (write back) accounting capability. So we introduce an aio private
backing dev info (disabled the ACCT_DIRTY/WRITEBACK/ACCT_WB capabilities) to
replace the default one.
Reported-by: Markus Königshaus <m.koenigshaus@wut.de>
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Cc: stable <stable@vger.kernel.org>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
uninitialized msghdr. Broken in "ocfs2: don't open-code kernel_recvmsg()"
by me ;-/
Cc: stable@vger.kernel.org # 3.15+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The seq_printf() will soon just return void, and seq_has_overflowed()
should be used instead to see if the seq can no longer accept input.
As the return value of debugfs_print_regs32() has no users and
the seq_file descriptor should be checked with seq_has_overflowed()
instead of return values of functions, it is better to just have
debugfs_print_regs32() also return void.
Link: http://lkml.kernel.org/p/2634b19eb1c04a9d31148c1fe6f1f3819be95349.1412031505.git.joe@perches.com
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Joe Perches <joe@perches.com>
[ original change only updated seq_printf() return, added return of
void to debugfs_print_regs32() as well ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
seq_printf functions shouldn't really check the return value.
Checking seq_has_overflowed() occasionally is used instead.
Update vfs documentation.
Link: http://lkml.kernel.org/p/e37e6e7b76acbdcc3bb4ab2a57c8f8ca1ae11b9a.1412031505.git.joe@perches.com
Cc: David S. Miller <davem@davemloft.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Joe Perches <joe@perches.com>
[ did a few clean ups ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When the kernel.dmesg_restrict restriction is in place, only users with
CAP_SYSLOG should be able to access crash dumps (like: attacker is
trying to exploit a bug, watchdog reboots, attacker can happily read
crash dumps and logs).
This puts the restriction on console-* types as well as sensitive
information could have been leaked there.
Other log types are unaffected.
Signed-off-by: Sebastian Schmidt <yath@yath.de>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
pstore compression/decompression was added during 3.12.
The ramoops driver prepends a "====timestamp.timestamp-C|D\n"
header to the compressed record before handing it over to pstore
driver which doesn't know about the header. In pstore_decompress(),
the pstore driver reads the first "==" as a zlib header, so the
decompression always fails. For example, this causes the driver
to write /dev/pstore/dmesg-ramoops-0.enc.z instead of
/dev/pstore/dmesg-ramoops-0.
This patch makes the ramoops driver remove the header before
pstore decompression.
Signed-off-by: Ben Zhang <benzh@chromium.org>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Xiaoguang Wang has reported sporadic EBUSY failures of ext4/302
Unfortunetly there is nothing we can do if some other task holds BH's
refenrence. So we must return EBUSY in this case. But we can try
kicking the journal to see if the other task releases the bh reference
after the commit is complete. Also decrease false positives by
properly checking for ENOSPC and retrying the allocation after kicking
the journal --- which is done by ext4_should_retry_alloc().
[ Modified by tytso to properly check for ENOSPC. ]
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
ovl_cache_put() can be called from ovl_dir_reset() if the cache needs to be
rebuilt. We did list_del() on the cursor, which results in an Oops on the
poisoned pointer in ovl_seek_cursor().
Reported-by: Jordi Pujol Palomer <jordipujolp@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Jordi Pujol Palomer <jordipujolp@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If the OPEN rpc call to the server fails with an ENOENT call, nfs_atomic_open
will create a negative dentry for that file, however it currently fails
to call nfs_set_verifier(), thus causing the dentry to be immediately
revalidated on the next call to nfs_lookup_revalidate() instead of following
the usual lookup caching rules.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If a system wants to reduce the booting time as a top priority, now we can
use a mount option, -o fastboot.
With this option, f2fs conducts a little bit slow write_checkpoint, but
it can avoid the node page reads during the next mount time.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
__submit_merged_bio f2fs_write_end_io f2fs_write_end_io
wait_io = X wait_io = x
complete(X) complete(X)
wait_io = NULL
wait_for_completion()
free(X)
spin_lock(X)
kernel panic
In order to avoid this, this patch removes the wait_io facility.
Instead, we can use wait_on_all_pages_writeback(sbi) to wait for end_ios.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If there is a chance to make a huge sized discard command, we don't need
to split it out, since each blkdev_issue_discard should wait one at a
time.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch simplifies the inline_data usage with the following rule.
1. inline_data is set during the file creation.
2. If new data is requested to be written ranges out of inline_data,
f2fs converts that inode permanently.
3. There is no cases which converts non-inline_data inode to inline_data.
4. The inline_data flag should be changed under inode page lock.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
After invoking ->dirty_inode(), __mark_inode_dirty() does smp_mb() and
tests inode->i_state locklessly to see whether it already has all the
necessary I_DIRTY bits set. The comment above the barrier doesn't
contain any useful information - memory barriers can't ensure "changes
are seen by all cpus" by itself.
And it sure enough was broken. Please consider the following
scenario.
CPU 0 CPU 1
-------------------------------------------------------------------------------
enters __writeback_single_inode()
grabs inode->i_lock
tests PAGECACHE_TAG_DIRTY which is clear
enters __set_page_dirty()
grabs mapping->tree_lock
sets PAGECACHE_TAG_DIRTY
releases mapping->tree_lock
leaves __set_page_dirty()
enters __mark_inode_dirty()
smp_mb()
sees I_DIRTY_PAGES set
leaves __mark_inode_dirty()
clears I_DIRTY_PAGES
releases inode->i_lock
Now @inode has dirty pages w/ I_DIRTY_PAGES clear. This doesn't seem
to lead to an immediately critical problem because requeue_inode()
later checks PAGECACHE_TAG_DIRTY instead of I_DIRTY_PAGES when
deciding whether the inode needs to be requeued for IO and there are
enough unintentional memory barriers inbetween, so while the inode
ends up with inconsistent I_DIRTY_PAGES flag, it doesn't fall off the
IO list.
The lack of explicit barrier may also theoretically affect the other
I_DIRTY bits which deal with metadata dirtiness. There is no
guarantee that a strong enough barrier exists between
I_DIRTY_[DATA]SYNC clearing and write_inode() writing out the dirtied
inode. Filesystem inode writeout path likely has enough stuff which
can behave as full barrier but it's theoretically possible that the
writeout may not see all the updates from ->dirty_inode().
Fix it by adding an explicit smp_mb() after I_DIRTY clearing. Note
that I_DIRTY_PAGES needs a special treatment as it always needs to be
cleared to be interlocked with the lockless test on
__mark_inode_dirty() side. It's cleared unconditionally and
reinstated after smp_mb() if the mapping still has dirty pages.
Also add comments explaining how and why the barriers are paired.
Lightly tested.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@vger.kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
If we hit any errors in btrfs_lookup_csums_range, we'll loop through all
the csums we allocate and free them. But the code was using list_entry
incorrectly, and ended up trying to free the on-stack list_head instead.
This bug came from commit 0678b6185
btrfs: Don't BUG_ON kzalloc error in btrfs_lookup_csums_range()
Signed-off-by: Chris Mason <clm@fb.com>
Reported-by: Erik Berg <btrfs@slipsprogrammoer.no>
cc: stable@vger.kernel.org # 3.3 or newer
JK: Added VFS: prefix to the message when changing it to make it more
standard.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Jan Kara <jack@suse.cz>
There's no point in using test_and_clear_bit_le() when we don't use the
return value of the function. Just use clear_bit_le() instead.
Coverity-id: 1016434
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
f2fs_write_begin() doesn't initialize the 'dn' variable if the inode has
inline data. However it uses its contents to decide whether it should
just zero out the page or load data to it. Thus if we are unlucky we can
zero out page contents instead of loading inline data into a page.
CC: stable@vger.kernel.org
CC: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Rename f2fs_set/clear_bit to f2fs_test_and_set/clear_bit, which mean
set/clear bit and return the old value, for better readability.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Set raw_super default to NULL to avoid the possibly used
uninitialized warning, though we may never hit it in fact.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Introduce f2fs_change_bit to simplify the change bit logic in
function set_to_next_nat{sit}.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Use clear_inode_flag to replace the redundant cond_clear_inode_flag.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Remove the unneeded argument 'type' from __get_victim, use
NO_CHECK_TYPE directly when calling v_ops->get_victim().
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If user specifies too low end sector for trimming, f2fs_trim_fs() will
use uninitialized value as a number of trimmed blocks and returns it to
userspace. Initialize number of trimmed blocks early to avoid the
problem.
Coverity-id: 1248809
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch declares f2fs_convert_inline_dir as a static function, which was
reported by kbuild test robot.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces f2fs_dentry_ptr structure for the use of a function
parameter in inline_dentry operations.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces a core function, f2fs_fill_dentries, to remove
redundant code in f2fs_readdir and f2fs_read_inline_dir.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, init_inode_metadata does not hold any parent directory's inode
page. So, f2fs_init_acl can grab its parent inode page without any problem.
But, when we use inline_dentry, that page is grabbed during f2fs_add_link,
so that we can fall into deadlock condition like below.
INFO: task mknod:11006 blocked for more than 120 seconds.
Tainted: G OE 3.17.0-rc1+ #13
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mknod D ffff88003fc94580 0 11006 11004 0x00000000
ffff880007717b10 0000000000000002 ffff88003c323220 ffff880007717fd8
0000000000014580 0000000000014580 ffff88003daecb30 ffff88003c323220
ffff88003fc94e80 ffff88003ffbb4e8 ffff880007717ba0 0000000000000002
Call Trace:
[<ffffffff8173dc40>] ? bit_wait+0x50/0x50
[<ffffffff8173d4cd>] io_schedule+0x9d/0x130
[<ffffffff8173dc6c>] bit_wait_io+0x2c/0x50
[<ffffffff8173da3b>] __wait_on_bit_lock+0x4b/0xb0
[<ffffffff811640a7>] __lock_page+0x67/0x70
[<ffffffff810acf50>] ? autoremove_wake_function+0x40/0x40
[<ffffffff811652cc>] pagecache_get_page+0x14c/0x1e0
[<ffffffffa029afa9>] get_node_page+0x59/0x130 [f2fs]
[<ffffffffa02a63ad>] read_all_xattrs+0x24d/0x430 [f2fs]
[<ffffffffa02a6ca2>] f2fs_getxattr+0x52/0xe0 [f2fs]
[<ffffffffa02a7481>] f2fs_get_acl+0x41/0x2d0 [f2fs]
[<ffffffff8122d847>] get_acl+0x47/0x70
[<ffffffff8122db5a>] posix_acl_create+0x5a/0x150
[<ffffffffa02a7759>] f2fs_init_acl+0x29/0xcb [f2fs]
[<ffffffffa0286a8d>] init_inode_metadata+0x5d/0x340 [f2fs]
[<ffffffffa029253a>] f2fs_add_inline_entry+0x12a/0x2e0 [f2fs]
[<ffffffffa0286ea5>] __f2fs_add_link+0x45/0x4a0 [f2fs]
[<ffffffffa028b5b6>] ? f2fs_new_inode+0x146/0x220 [f2fs]
[<ffffffffa028b816>] f2fs_mknod+0x86/0xf0 [f2fs]
[<ffffffff811e3ec1>] vfs_mknod+0xe1/0x160
[<ffffffff811e4b26>] SyS_mknod+0x1f6/0x200
[<ffffffff81741d7f>] tracesys+0xe1/0xe6
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Add inline dir functions into normal dir ops' function to handle inline ops.
Besides, we enable inline dir mode when a new dir inode is created if
inline_data option is on.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Adds Functions to implement inline dir init/lookup/insert/delete/convert ops.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: remove needless reserved area copy, pointed by Dan Carpenter]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch exports some dir operations for inline dir, additionally introduces
f2fs_drop_nlink from f2fs_delete_entry for reusing by inline dir function.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch defines macro/inline dentry structure, and adds some helpers for
inline dir infrastructure.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The sceanrio is like this.
inline_data i_size page write_begin/vm_page_mkwrite
X 30 dirty_page
X 30 write to #4096 position
X 30 get_dnode_of_data wait for get_dnode_of_data
O 30 write inline_data
O 30 get_dnode_of_data
O 30 reserve data block
..
In this case, we have #0 = NEW_ADDR and inline_data as well.
We should not allow this condition for further access.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Let's consider the following scenario.
blkaddr[0] inline_data i_size i_blocks writepage truncate
NEW X 4096 2 dirty page #0
NEW X 0 change i_size
NEW X 0 2 f2fs_write_inline_data
NEW X 0 2 get_dnode_of_data
NEW X 0 2 truncate_data_blocks_range
NULL O 0 1 memcpy(inline_data)
NULL O 0 1 f2fs_put_dnode
NULL O 0 1 f2fs_truncate
NULL O 0 1 get_dnode_of_data
NULL O 0 1 *invalid block addr*
This patch adds checking inline_data flag during f2fs_truncate not to refer
corrupted block indices.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When trying to write inline_data, we should truncate any data block allocated
and pointed by the inode block.
We should consider the data index is not 0.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
... by not hitting rename_retry for reasons other than rename having
happened. In other words, do _not_ restart when finding that
between unlocking the child and locking the parent the former got
into __dentry_kill(). Skip the killed siblings instead...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If we run out of blocks for a given multi-block allocation, we obviously
did not reserve enough. We should reserve more blocks for the next
reservation to reduce fragmentation. This patch increases the size hint
for reservations when they run out.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
If an application does a sequence of (1) big write, (2) little write
we don't necessarily want to reset the size hint based on the smaller
size. The fact that they did any big writes implies they may do more,
and therefore we should try to allocate bigger block reservations, even
if the last few were small writes. Therefore this patch changes function
gfs2_size_hint so that the size hint can only grow; it cannot shrink.
This is especially important where there are multiple writers.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch tries to use the journal numbers to evenly distribute
which node prefers which resource group for block allocations. This
is to help performance.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
No need to store gfs2_dir_check result and test it before returning.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Pull VFS fixes from Al Viro:
"A bunch of assorted fixes, most of them followups to overlayfs merge"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
ovl: initialize ->is_cursor
Return short read or 0 at end of a raw device, not EIO
isofs: don't bother with ->d_op for normal case
isofs_cmp(): we'll never see a dentry for . or ..
overlayfs: fix lockdep misannotation
ovl: fix check for cursor
overlayfs: barriers for opening upper-layer directory
rcu: Provide counterpart to rcu_dereference() for non-RCU situations
staging: android: logger: Fix log corruption regression
Pull btrfs fixes from Chris Mason:
"Filipe is nailing down some problems with our skinny extent variation,
and Dave's patch fixes endian problems in the new super block checks"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix race that makes btrfs_lookup_extent_info miss skinny extent items
Btrfs: properly clean up btrfs_end_io_wq_cache
Btrfs: fix invalid leaf slot access in btrfs_lookup_extent()
btrfs: use macro accessors in superblock validation checks
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJUVAF/AAoJENNvdpvBGATwEbAQALNiAIChEyJTnQDkAQc2wqqn
dv8NQmFr5aefc63A/+n/yJJGrQZtKs0ceh29ty5ksYLFXzUdc2ctFg6vBmllQfbz
PQawAk2gOkF8zfVuqiQU7X+wTBpGmGXTa8HY+WJTtk0pBfhl+p0PDCYsWXMwZJ1D
tAZpxJ4AmPc7A4hApWOvce6r7Xg24vZk/8UA93Tif9AkeY6VoN272Hx5b/UGmBHY
RCEgpowuiIY38bghtLh5+T0J98/EQNof46cEHgGI9nIDZeXRzgvDojE5bLI0/IS/
K07MjYlm/WFWsLFkgNJkTiqEXgnji9BNYRF1xxUjMMBAR4+fnFLw9kXXgcETrPCx
U7lHOhs8M2FK40cWhUDz/tukvL4S4lQwPEeqBPlRE8J5/twRyXHeZDp4F7LOobwq
mk6AajSJlP+05XwXOuCx7Hcf9uxjw/IpqhBS5IZxy8Nn3T2guPlY9wMhYU1RYFws
54FeE76SJ8EDgjVK/txj7rgh11GggWsjsdXvftSElM2DsKsqYEOKAvDzvwmbm7eV
dsFOlRB6B/X4UpiAC2MiPJynYg9TJ7LkVBzDZeZ/fbm7JhTqChSJDzapqdrmNPIY
SQqwLmFXnHqaw6HNitZ5Bs+fD6nfvKqy85NeImxE3lhLWDuiTt77Y3o80IW30TgN
5bnuXq8Rkukrxs/VDvPq
=kI6P
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 bugfixes from Ted Ts'o:
"A set of miscellaneous ext4 bug fixes for 3.18"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: make ext4_ext_convert_to_initialized() return proper number of blocks
ext4: bail early when clearing inode journal flag fails
ext4: bail out from make_indexed_dir() on first error
jbd2: use a better hash function for the revoke table
ext4: prevent bugon on race between write/fcntl
ext4: remove extent status procfs files if journal load fails
ext4: disallow changing journal_csum option during remount
ext4: enable journal checksum when metadata checksum feature enabled
ext4: fix oops when loading block bitmap failed
ext4: fix overflow when updating superblock backups after resize
Pull quota and ext3 fixes from Jan Kara.
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fs, jbd: use a more generic hash function
quota: Properly return errors from dquot_writeback_dquots()
ext3: Don't check quota format when there are no quota files
Author: David Jeffery <djeffery@redhat.com>
Changes to the basic direct I/O code have broken the raw driver when reading
to the end of a raw device. Instead of returning a short read for a read that
extends partially beyond the device's end or 0 when at the end of the device,
these reads now return EIO.
The raw driver needs the same end of device handling as was added for normal
block devices. Using blkdev_read_iter, which has the needed size checks,
prevents the EIO conditions at the end of the device.
Signed-off-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The man page for open(2) indicates that when O_CREAT is specified, the
'mode' argument applies only to future accesses to the file:
Note that this mode applies only to future accesses of the newly
created file; the open() call that creates a read-only file
may well return a read/write file descriptor.
The man page for open(2) implies that 'mode' is treated identically by
O_CREAT and O_TMPFILE.
O_TMPFILE, however, behaves differently:
int fd = open("/tmp", O_TMPFILE | O_RDWR, 0);
assert(fd == -1);
assert(errno == EACCES);
int fd = open("/tmp", O_TMPFILE | O_RDWR, 0600);
assert(fd > 0);
For O_CREAT, do_last() sets acc_mode to MAY_OPEN only:
if (*opened & FILE_CREATED) {
/* Don't check for write permission, don't truncate */
open_flag &= ~O_TRUNC;
will_truncate = false;
acc_mode = MAY_OPEN;
path_to_nameidata(path, nd);
goto finish_open_created;
}
But for O_TMPFILE, do_tmpfile() passes the full op->acc_mode to
may_open().
This patch lines up the behavior of O_TMPFILE with O_CREAT. After the
inode is created, may_open() is called with acc_mode = MAY_OPEN, in
do_tmpfile().
A different, but related glibc bug revealed the discrepancy:
https://sourceware.org/bugzilla/show_bug.cgi?id=17523
The glibc lazily loads the 'mode' argument of open() and openat() using
va_arg() only if O_CREAT is present in 'flags' (to support both the 2
argument and the 3 argument forms of open; same idea for openat()).
However, the glibc ignores the 'mode' argument if O_TMPFILE is in
'flags'.
On x86_64, for open(), it magically works anyway, as 'mode' is in
RDX when entering open(), and is still in RDX on SYSCALL, which is where
the kernel looks for the 3rd argument of a syscall.
But openat() is not quite so lucky: 'mode' is in RCX when entering the
glibc wrapper for openat(), while the kernel looks for the 4th argument
of a syscall in R10. Indeed, the syscall calling convention differs from
the regular calling convention in this respect on x86_64. So the kernel
sees mode = 0 when trying to use glibc openat() with O_TMPFILE, and
fails with EACCES.
Signed-off-by: Eric Rannaud <e@nanocritical.com>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ext4_ext_convert_to_initialized() can return more blocks than are
actually allocated from map->m_lblk in case where initial part of the
on-disk extent is zeroed out. Luckily this doesn't have serious
consequences because the caller currently uses the return value
only to unmap metadata buffers. Anyway this is a data
corruption/exposure problem waiting to happen so fix it.
Coverity-id: 1226848
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When clearing inode journal flag, we call jbd2_journal_flush() to force
all the journalled data to their final locations. Currently we ignore
when this fails and continue clearing inode journal flag. This isn't a
big problem because when jbd2_journal_flush() fails, journal is likely
aborted anyway. But it can still lead to somewhat confusing results so
rather bail out early.
Coverity-id: 989044
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When ext4_handle_dirty_dx_node() or ext4_handle_dirty_dirent_node()
fail, there's really something wrong with the fs and there's no point in
continuing further. Just return error from make_indexed_dir() in that
case. Also initialize frames array so that if we return early due to
error, dx_release() doesn't try to dereference uninitialized memory
(which could happen also due to error in do_split()).
Coverity-id: 741300
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
The old hash function didn't work well for 64-bit block numbers, and
used undefined (negative) shift right behavior. Use the generic
64-bit hash function instead.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: Andrey Ryabinin <a.ryabinin@samsung.com>
If we can't load the journal, remove the procfs files for the extent
status information file to avoid leaking resources.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
ext4 does not permit changing the metadata or journal checksum feature
flag while mounted. Until we decide to support that, don't allow a
remount to change the journal_csum flag (right now we silently fail to
change anything).
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If metadata checksumming is turned on for the FS, we need to tell the
journal to use checksumming too.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
When we fail to load block bitmap in __ext4_new_inode() we will
dereference NULL pointer in ext4_journal_get_write_access(). So check
for error from ext4_read_block_bitmap().
Coverity-id: 989065
Cc: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When there are no meta block groups update_backups() will compute the
backup block in 32-bit arithmetics thus possibly overflowing the block
number and corrupting the filesystem. OTOH filesystems without meta
block groups larger than 16 TB should be rare. Fix the problem by doing
the counting in 64-bit arithmetics.
Coverity-id: 741252
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
The return values of seq_printf/puts/putc are frequently misused.
Start down a path to remove all the return value uses of these
functions.
Move the seq_overflow() to a global inlined function called
seq_has_overflowed() that can be used by the users of seq_file() calls.
Update the documentation to not show return types for seq_printf
et al. Add a description of seq_has_overflowed().
Link: http://lkml.kernel.org/p/848ac7e3d1c31cddf638a8526fa3c59fa6fdeb8a.1412031505.git.joe@perches.com
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Joe Perches <joe@perches.com>
[ Reworked the original patch from Joe ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Merge misc fixes from Andrew Morton:
"21 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (21 commits)
mm/balloon_compaction: fix deflation when compaction is disabled
sh: fix sh770x SCIF memory regions
zram: avoid NULL pointer access in concurrent situation
mm/slab_common: don't check for duplicate cache names
ocfs2: fix d_splice_alias() return code checking
mm: rmap: split out page_remove_file_rmap()
mm: memcontrol: fix missed end-writeback page accounting
mm: page-writeback: inline account_page_dirtied() into single caller
lib/bitmap.c: fix undefined shift in __bitmap_shift_{left|right}()
drivers/rtc/rtc-bq32k.c: fix register value
memory-hotplug: clear pgdat which is allocated by bootmem in try_offline_node()
drivers/rtc/rtc-s3c.c: fix initialization failure without rtc source clock
kernel/kmod: fix use-after-free of the sub_info structure
drivers/rtc/rtc-pm8xxx.c: rework to support pm8941 rtc
mm, thp: fix collapsing of hugepages on madvise
drivers: of: add return value to of_reserved_mem_device_init()
mm: free compound page with correct order
gcov: add ARM64 to GCOV_PROFILE_ALL
fsnotify: next_i is freed during fsnotify_unmount_inodes.
mm/compaction.c: avoid premature range skip in isolate_migratepages_range
...
The zero range operation is analogous to fallocate with the exception of
converting the range to zeroes. E.g., it attempts to allocate zeroed
blocks over the range specified by the caller. The XFS implementation
kills all delalloc blocks currently over the aligned range, converts the
range to allocated zero blocks (unwritten extents) and handles the
partial pages at the ends of the range by sending writes through the
pagecache.
The current implementation suffers from several problems associated with
inode size. If the aligned range covers an extending I/O, said I/O is
discarded and an inode size update from a previous write never makes it
to disk. Further, if an unaligned zero range extends beyond eof, the
page write induced for the partial end page can itself increase the
inode size, even if the zero range request is not supposed to update
i_size (via KEEP_SIZE, similar to an fallocate beyond EOF).
The latter behavior not only incorrectly increases the inode size, but
can lead to stray delalloc blocks on the inode. Typically, post-eof
preallocation blocks are either truncated on release or inode eviction
or explicitly written to by xfs_zero_eof() on natural file size
extension. If the inode size increases due to zero range, however,
associated blocks leak into the address space having never been
converted or mapped to pagecache pages. A direct I/O to such an
uncovered range cannot convert the extent via writeback and will BUG().
For example:
$ xfs_io -fc "pwrite 0 128k" -c "fzero -k 1m 54321" <file>
...
$ xfs_io -d -c "pread 128k 128k" <file>
<BUG>
If the entire delalloc extent happens to not have page coverage
whatsoever (e.g., delalloc conversion couldn't find a large enough free
space extent), even a full file writeback won't convert what's left of
the extent and we'll assert on inode eviction.
Rework xfs_zero_file_space() to avoid buffered I/O for partial pages.
Use the existing hole punch and prealloc mechanisms as primitives for
zero range. This implementation is not efficient nor ideal as we
writeback dirty data over the range and remove existing extents rather
than convert to unwrittern. The former writeback, however, is currently
the only mechanism available to ensure consistency between pagecache and
extent state. Even a pagecache truncate/delalloc punch prior to hole
punch has lead to inconsistencies due to racing with writeback.
This provides a consistent, correct implementation of zero range that
survives fsstress/fsx testing without assert failures. The
implementation can be optimized from this point forward once the
fundamental issue of pagecache and delalloc extent state consistency is
addressed.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
xfs_bulkstat() doesn't check error return from xfs_btree_increment(). In
case of specific fs corruption that could result in xfs_bulkstat()
entering an infinite loop because we would be looping over the same
chunk over and over again. Fix the problem by checking the return value
and terminating the loop properly.
Coverity-id: 1231338
cc: <stable@vger.kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jie Liu <jeff.u.liu@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
d_splice_alias() can return a valid dentry, NULL or an ERR_PTR.
Currently the code checks not for ERR_PTR and will cuase an oops in
ocfs2_dentry_attach_lock(). Fix this by using IS_ERR_OR_NULL().
Signed-off-by: Richard Weinberger <richard@nod.at>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During file system stress testing on 3.10 and 3.12 based kernels, the
umount command occasionally hung in fsnotify_unmount_inodes in the
section of code:
spin_lock(&inode->i_lock);
if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
spin_unlock(&inode->i_lock);
continue;
}
As this section of code holds the global inode_sb_list_lock, eventually
the system hangs trying to acquire the lock.
Multiple crash dumps showed:
The inode->i_state == 0x60 and i_count == 0 and i_sb_list would point
back at itself. As this is not the value of list upon entry to the
function, the kernel never exits the loop.
To help narrow down problem, the call to list_del_init in
inode_sb_list_del was changed to list_del. This poisons the pointers in
the i_sb_list and causes a kernel to panic if it transverse a freed
inode.
Subsequent stress testing paniced in fsnotify_unmount_inodes at the
bottom of the list_for_each_entry_safe loop showing next_i had become
free.
We believe the root cause of the problem is that next_i is being freed
during the window of time that the list_for_each_entry_safe loop
temporarily releases inode_sb_list_lock to call fsnotify and
fsnotify_inode_delete.
The code in fsnotify_unmount_inodes attempts to prevent the freeing of
inode and next_i by calling __iget. However, the code doesn't do the
__iget call on next_i
if i_count == 0 or
if i_state & (I_FREEING | I_WILL_FREE)
The patch addresses this issue by advancing next_i in the above two cases
until we either find a next_i which we can __iget or we reach the end of
the list. This makes the handling of next_i more closely match the
handling of the variable "inode."
The time to reproduce the hang is highly variable (from hours to days.) We
ran the stress test on a 3.10 kernel with the proposed patch for a week
without failure.
During list_for_each_entry_safe, next_i is becoming free causing
the loop to never terminate. Advance next_i in those cases where
__iget is not done.
Signed-off-by: Jerry Hoemann <jerry.hoemann@hp.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Ken Helias <kenhelias@firemail.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The elements in the data array are already unsigned chars and do not
need to be casted.
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Pull block layer fixes from Jens Axboe:
"A small collection of fixes for the current kernel. This contains:
- Two error handling fixes from Jan Kara. One for null_blk on
failure to add a device, and the other for the block/scsi_ioctl
SCSI_IOCTL_SEND_COMMAND fixing up the error jump point.
- A commit added in the merge window for the bio integrity bits
unfortunately disabled merging for all requests if
CONFIG_BLK_DEV_INTEGRITY wasn't set. Reverse the logic, so that
integrity checking wont disallow merges when not enabled.
- A fix from Ming Lei for merging and generating too many segments.
This caused a BUG in virtio_blk.
- Two error handling printk() fixups from Robert Elliott, improving
the information given when we rate limit.
- Error handling fixup on elevator_init() failure from Sudip
Mukherjee.
- A fix from Tony Battersby, fixing up a memory leak in the
scatterlist handling with scsi-mq"
* 'for-linus' of git://git.kernel.dk/linux-block:
block: Fix merge logic when CONFIG_BLK_DEV_INTEGRITY is not defined
lib/scatterlist: fix memory leak with scsi-mq
block: fix wrong error return in elevator_init()
scsi: Fix error handling in SCSI_IOCTL_SEND_COMMAND
null_blk: Cleanup error recovery in null_add_dev()
blk-merge: recaculate segment if it isn't less than max segments
fs: clarify rate limit suppressed buffer I/O errors
fs: merge I/O error prints into one line
In an overlay directory that shadows an empty lower directory, say
/mnt/a/empty102, do:
touch /mnt/a/empty102/x
unlink /mnt/a/empty102/x
rmdir /mnt/a/empty102
It's actually harmless, but needs another level of nesting between
I_MUTEX_CHILD and I_MUTEX_NORMAL.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
ovl_cache_entry.name is now an array not a pointer, so it makes no sense
test for it being NULL.
Detected by coverity.
From: Miklos Szeredi <mszeredi@suse.cz>
Fixes: 68bf861107 ("overlayfs: make ovl_cache_entry->name an array instead of
+pointer")
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
make sure that
a) all stores done by opening struct file don't leak past storing
the reference in od->upperfile
b) the lockless side has read dependency barrier
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The recent refactoring of the bulkstat code left a small landmine in
the code. If a inobt read fails, then the tree walk is aborted and
returns without releasing the AGI buffer or freeing the cursor. This
can lead to a subsequent bulkstat call hanging trying to grab the
AGI buffer again.
cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
We have a race that can lead us to miss skinny extent items in the function
btrfs_lookup_extent_info() when the skinny metadata feature is enabled.
So basically the sequence of steps is:
1) We search in the extent tree for the skinny extent, which returns > 0
(not found);
2) We check the previous item in the returned leaf for a non-skinny extent,
and we don't find it;
3) Because we didn't find the non-skinny extent in step 2), we release our
path to search the extent tree again, but this time for a non-skinny
extent key;
4) Right after we released our path in step 3), a skinny extent was inserted
in the extent tree (delayed refs were run) - our second extent tree search
will miss it, because it's not looking for a skinny extent;
5) After the second search returned (with ret > 0), we look for any delayed
ref for our extent's bytenr (and we do it while holding a read lock on the
leaf), but we won't find any, as such delayed ref had just run and completed
after we released out path in step 3) before doing the second search.
Fix this by removing completely the path release and re-search logic. This is
safe, because if we seach for a metadata item and we don't find it, we have the
guarantee that the returned leaf is the one where the item would be inserted,
and so path->slots[0] > 0 and path->slots[0] - 1 must be the slot where the
non-skinny extent item is if it exists. The only case where path->slots[0] is
zero is when there are no smaller keys in the tree (i.e. no left siblings for
our leaf), in which case the re-search logic isn't needed as well.
This race has been present since the introduction of skinny metadata (change
3173a18f70).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Pull two nfsd fixes from Bruce Fields:
"One regression from the 3.16 xdr rewrite, one an older bug exposed by
a separate bug in the client's new SEEK code"
* 'for-3.18' of git://linux-nfs.org/~bfields/linux:
nfsd4: fix crash on unknown operation number
nfsd4: fix response size estimation for OP_SEQUENCE
inotify_read is a wait loop with sleeps in. Wait loops rely on
task_struct::state and sleeps do too, since that's the only means of
actually sleeping. Therefore the nested sleeps destroy the wait loop
state and the wait loop breaks the sleep functions that assume
TASK_RUNNING (mutex_lock).
Fix this by using the new woken_wake_function and wait_woken() stuff,
which registers wakeups in wait and thereby allows shrinking the
task_state::state changes to the actual sleep part.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: Robert Love <rlove@rlove.org>
Cc: Eric Paris <eparis@parisplace.org>
Cc: John McCutchan <john@johnmccutchan.com>
Cc: Robert Love <rlove@rlove.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Link: http://lkml.kernel.org/r/20140924082242.254858080@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In one of Dave's cleanup commits he forgot to call btrfs_end_io_wq_exit on
unload, which makes us unable to unload and then re-load the btrfs module. This
fixes the problem. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
If we couldn't find our extent item, we accessed the current slot
(path->slots[0]) to check if it corresponds to an equivalent skinny
metadata item. However this slot could be beyond our last item in the
leaf (i.e. path->slots[0] >= btrfs_header_nritems(leaf)), in which case
we shouldn't process it.
Since btrfs_lookup_extent() is only used to find extent items for data
extents, fix this by removing completely the logic that looks up for an
equivalent skinny metadata item, since it can not exist.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
The initial patch c926093ec5 (btrfs: add more superblock checks)
did not properly use the macro accessors that wrap endianness and the
code would not work correctly on big endian machines.
Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
no sense having it a pointer - all instances have it pointing to
local variable in the same stack frame
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
d_splice_alias() callers expect it to either stash the inode reference
into a new alias, or drop the inode reference. That makes it possible
to just return d_splice_alias() result from ->lookup() instance, without
any extra housekeeping required.
Unfortunately, that should include the failure exits. If d_splice_alias()
returns an error, it leaves the dentry it has been given negative and
thus it *must* drop the inode reference. Easily fixed, but it goes way
back and will need backporting.
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add a simple read-only counter to super_block that indicates how deep this
is in the stack of filesystems. Previously ecryptfs was the only stackable
filesystem and it explicitly disallowed multiple layers of itself.
Overlayfs, however, can be stacked recursively and also may be stacked
on top of ecryptfs or vice versa.
To limit the kernel stack usage we must limit the depth of the
filesystem stack. Initially the limit is set to 2.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
This is useful because of the stacking nature of overlayfs. Users like to
find out (via /proc/mounts) which lower/upper directory were used at mount
time.
AV: even failing ovl_parse_opt() could've done some kstrdup()
AV: failure of ovl_alloc_entry() should end up with ENOMEM, not EINVAL
Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Add support for statfs to the overlayfs filesystem. As the upper layer
is the target of all write operations assume that the space in that
filesystem is the space in the overlayfs. There will be some inaccuracy as
overwriting a file will copy it up and consume space we were not expecting,
but it is better than nothing.
Use the upper layer dentry and mount from the overlayfs root inode,
passing the statfs call to that filesystem.
Signed-off-by: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Overlayfs allows one, usually read-write, directory tree to be
overlaid onto another, read-only directory tree. All modifications
go to the upper, writable layer.
This type of mechanism is most often used for live CDs but there's a
wide variety of other uses.
The implementation differs from other "union filesystem"
implementations in that after a file is opened all operations go
directly to the underlying, lower or upper, filesystems. This
simplifies the implementation and allows native performance in these
cases.
The dentry tree is duplicated from the underlying filesystems, this
enables fast cached lookups without adding special support into the
VFS. This uses slightly more memory than union mounts, but dentries
are relatively small.
Currently inodes are duplicated as well, but it is a possible
optimization to share inodes for non-directories.
Opening non directories results in the open forwarded to the
underlying filesystem. This makes the behavior very similar to union
mounts (with the same limitations vs. fchmod/fchown on O_RDONLY file
descriptors).
Usage:
mount -t overlayfs overlayfs -olowerdir=/lower,upperdir=/upper/upper,workdir=/upper/work /overlay
The following cotributions have been folded into this patch:
Neil Brown <neilb@suse.de>:
- minimal remount support
- use correct seek function for directories
- initialise is_real before use
- rename ovl_fill_cache to ovl_dir_read
Felix Fietkau <nbd@openwrt.org>:
- fix a deadlock in ovl_dir_read_merged
- fix a deadlock in ovl_remove_whiteouts
Erez Zadok <ezk@fsl.cs.sunysb.edu>
- fix cleanup after WARN_ON
Sedat Dilek <sedat.dilek@googlemail.com>
- fix up permission to confirm to new API
Robin Dong <hao.bigrat@gmail.com>
- fix possible leak in ovl_new_inode
- create new inode in ovl_link
Andy Whitcroft <apw@canonical.com>
- switch to __inode_permission()
- copy up i_uid/i_gid from the underlying inode
AV:
- ovl_copy_up_locked() - dput(ERR_PTR(...)) on two failure exits
- ovl_clear_empty() - one failure exit forgetting to do unlock_rename(),
lack of check for udir being the parent of upper, dropping and regaining
the lock on udir (which would require _another_ check for parent being
right).
- bogus d_drop() in copyup and rename [fix from your mail]
- copyup/remove and copyup/rename races [fix from your mail]
- ovl_dir_fsync() leaving ERR_PTR() in ->realfile
- ovl_entry_free() is pointless - it's just a kfree_rcu()
- fold ovl_do_lookup() into ovl_lookup()
- manually assigning ->d_op is wrong. Just use ->s_d_op.
[patches picked from Miklos]:
* copyup/remove and copyup/rename races
* bogus d_drop() in copyup and rename
Also thanks to the following people for testing and reporting bugs:
Jordi Pujol <jordipujolp@gmail.com>
Andy Whitcroft <apw@canonical.com>
Michal Suchanek <hramrach@centrum.cz>
Felix Fietkau <nbd@openwrt.org>
Erez Zadok <ezk@fsl.cs.sunysb.edu>
Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Add whiteout support to ext4_rename(). A whiteout inode (chrdev/0,0) is
created before the rename takes place. The whiteout inode is added to the
old entry instead of deleting it.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
This adds a new RENAME_WHITEOUT flag. This flag makes rename() create a
whiteout of source. The whiteout creation is atomic relative to the
rename.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Whiteout isn't actually a new file type, but is represented as a char
device (Linus's idea) with 0/0 device number.
This has several advantages compared to introducing a new whiteout file
type:
- no userspace API changes (e.g. trivial to make backups of upper layer
filesystem, without losing whiteouts)
- no fs image format changes (you can boot an old kernel/fsck without
whiteout support and things won't break)
- implementation is trivial
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
It's already duplicated in btrfs and about to be used in overlayfs too.
Move the sticky bit check to an inline helper and call the out-of-line
helper only in the unlikly case of the sticky bit being set.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
We need to be able to check inode permissions (but not filesystem implied
permissions) for stackable filesystems. Expose this interface for overlayfs.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Add a new inode operation i_op->dentry_open(). This is for stacked filesystems
that want to return a struct file from a different filesystem.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
They're a bit outdated wrt to some recent changes.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
A client may not want to use the back channel on a transport it sent
CREATE_SESSION on, in which case it clears SESSION4_BACK_CHAN.
However, cl_cb_addr should be populated anyway, to be used if the
client binds other connections to this session. If cl_cb_addr is
not initialized, rpc_create() fails when the server attempts to
set up a back channel on such secondary transports.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The vfs_fsync_range() call during write processing got the end of the
range off by one. The range is inclusive, not exclusive. The error has
nfsd sync more data than requested -- it's correct but unnecessary
overhead.
The call during commit processing is correct so I copied that pattern in
write processing. Maybe a helper would be nice but I kept it trivial.
This is untested. I found it while reviewing code for something else
entirely.
Signed-off-by: Zach Brown <zab@zabbo.net>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Unknown operation numbers are caught in nfsd4_decode_compound() which
sets op->opnum to OP_ILLEGAL and op->status to nfserr_op_illegal. The
error causes the main loop in nfsd4_proc_compound() to skip most
processing. But nfsd4_proc_compound also peeks ahead at the next
operation in one case and doesn't take similar precautions there.
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The ecryptfs_encrypted_view mount option greatly changes the
functionality of an eCryptfs mount. Instead of encrypting and decrypting
lower files, it provides a unified view of the encrypted files in the
lower filesystem. The presence of the ecryptfs_encrypted_view mount
option is intended to force a read-only mount and modifying files is not
supported when the feature is in use. See the following commit for more
information:
e77a56d [PATCH] eCryptfs: Encrypted passthrough
This patch forces the mount to be read-only when the
ecryptfs_encrypted_view mount option is specified by setting the
MS_RDONLY flag on the superblock. Additionally, this patch removes some
broken logic in ecryptfs_open() that attempted to prevent modifications
of files when the encrypted view feature was in use. The check in
ecryptfs_open() was not sufficient to prevent file modifications using
system calls that do not operate on a file descriptor.
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Reported-by: Priya Bansal <p.bansal@samsung.com>
Cc: stable@vger.kernel.org # v2.6.21+: e77a56d [PATCH] eCryptfs: Encrypted passthrough
While the hash function used by the revoke hashtable is good somewhere else,
it's not really good here.
The default hash shift (8) means that one third of the hashing function
gets lost (and is undefined anyways (8 - 12 = negative shift)):
"(block << (hash_shift - 12))) & (table->hash_size - 1)"
Instead, just use the kernel's generic hash function that gets used everywhere
else.
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Due to a switched left and right side of an assignment,
dquot_writeback_dquots() never returned error. This could result in
errors during quota writeback to not be reported to userspace properly.
Fix it.
CC: stable@vger.kernel.org
Coverity-id: 1226884
Signed-off-by: Jan Kara <jack@suse.cz>
The check whether quota format is set even though there are no
quota files with journalled quota is pointless and it actually
makes it impossible to turn off journalled quotas (as there's
no way to unset journalled quota format). Just remove the check.
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
When quiet_error applies rate limiting to buffer_io_error calls, what the
they apply to is unclear because the name is so generic, particularly
if the messages are interleaved with others:
[ 1936.063572] quiet_error: 664293 callbacks suppressed
[ 1936.065297] Buffer I/O error on dev sdr, logical block 257429952, lost async page write
[ 1936.067814] Buffer I/O error on dev sdr, logical block 257429953, lost async page write
Also, the function uses printk_ratelimit(), although printk.h includes a
comment advising "Please don't use... Instead use printk_ratelimited()."
Change buffer_io_error to check the BH_Quiet bit itself, drop the
printk_ratelimit call, and print using printk_ratelimited.
This makes the messages look like:
[ 387.208839] buffer_io_error: 676394 callbacks suppressed
[ 387.210693] Buffer I/O error on dev sdr, logical block 211291776, lost async page write
[ 387.213432] Buffer I/O error on dev sdr, logical block 211291777, lost async page write
Signed-off-by: Robert Elliott <elliott@hp.com>
Reviewed-by: Webb Scales <webbnh@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
buffer.c uses two printk calls to print these messages:
[67353.422338] Buffer I/O error on device sdr, logical block 212868488
[67353.422338] lost page write due to I/O error on sdr
In a busy system, they may be interleaved with other prints,
losing the context for the second message. Merge them into
one line with one printk call so the prints are atomic.
Also, differentiate between async page writes, sync page writes, and
async page reads.
Also, shorten "device" to "dev" to match the block layer prints:
[67353.467906] blk_update_request: critical target error, dev sdr, sector
1707107328
Also, use %llu rather than %Lu.
Resulting prints look like:
[ 1356.437006] blk_update_request: critical target error, dev sdr, sector 1719693992
[ 1361.383522] quiet_error: 659876 callbacks suppressed
[ 1361.385816] Buffer I/O error on dev sdr, logical block 256902912, lost async page write
[ 1361.385819] Buffer I/O error on dev sdr, logical block 256903644, lost async page write
Signed-off-by: Robert Elliott <elliott@hp.com>
Reviewed-by: Webb Scales <webbnh@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We added this new estimator function but forgot to hook it up. The
effect is that NFSv4.1 (and greater) won't do zero-copy reads.
The estimate was also wrong by 8 bytes.
Fixes: ccae70a9ee "nfsd4: estimate sequence response size"
Cc: stable@vger.kernel.org
Reported-by: Chuck Lever <chucklever@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
optimizations.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJUPlLCAAoJENNvdpvBGATwpN8P/jnbDL1RqM9ZEAWfbDhvYumR
Fi59b3IDzSJHuuJeP0nTblVbbWclpO9ljCd18ttsHr8gBXA0ViaEU0XvWbpHIwPN
1fr1/Ovd0wvBdIVdLlaLXTR9skH4lbkiXxv/tkfjVCOSpzqiKID98Z72e/gUjB7Z
8xjAn/mTCnXKnhqMGzi8RC2MP1wgY//ErR21bj6so/8RC8zu4P6JuVj/hI6s0y5i
IPtAmjhdM7nxnS0wJwj7dLT0yNDftDh69qE6CgIwyK+Xn/SZFgYwE6+l02dj3DET
ZcAzTT9ToTMJdWtMu+5Y4LY8ObJ5xqMPbMoUclQ3DWe6nZicvtcBVCjfG/J8pFlY
IFD0nfh/OpX9cQMwJ+5Y8P4TrMiqM+FfuLfu+X83gLyrAyIazwoaZls2lxlEyC0w
M25oAqeKGUeVakVlmDZlVyBf05cu5m62x1rRvpcwMXMNhJl8/xwsSdhdYGeJfbO0
0MfL1n6GmvHvouMXKNsXlat/w3QVaQWVRzqdF9x7Q730fSHC/zxVGO+Po3jz2fBd
fBdfE14BIIU7nkyBVy0CZG5SDmQW4YACocOv/ATmII9j76F9eZQ3zsA8J1x+dLmJ
dP1Uxvsn1C3HW8Ua239j0XUJncglb06iEId0ywdkmWcc1rbzsyZ/NzXN/QBdZmqB
9g4GKAXAyh15PeBTJ5K/
=vWic
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"A large number of cleanups and bug fixes, with some (minor) journal
optimizations"
[ This got sent to me before -rc1, but was stuck in my spam folder. - Linus ]
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (67 commits)
ext4: check s_chksum_driver when looking for bg csum presence
ext4: move error report out of atomic context in ext4_init_block_bitmap()
ext4: Replace open coded mdata csum feature to helper function
ext4: delete useless comments about ext4_move_extents
ext4: fix reservation overflow in ext4_da_write_begin
ext4: add ext4_iget_normal() which is to be used for dir tree lookups
ext4: don't orphan or truncate the boot loader inode
ext4: grab missed write_count for EXT4_IOC_SWAP_BOOT
ext4: optimize block allocation on grow indepth
ext4: get rid of code duplication
ext4: fix over-defensive complaint after journal abort
ext4: fix return value of ext4_do_update_inode
ext4: fix mmap data corruption when blocksize < pagesize
vfs: fix data corruption when blocksize < pagesize for mmaped data
ext4: fold ext4_nojournal_sops into ext4_sops
ext4: support freezing ext2 (nojournal) file systems
ext4: fold ext4_sync_fs_nojournal() into ext4_sync_fs()
ext4: don't check quota format when there are no quota files
jbd2: simplify calling convention around __jbd2_journal_clean_checkpoint_list
jbd2: avoid pointless scanning of checkpoint lists
...
Pull cifs/smb3 updates from Steve French:
"Improved SMB3 support (symlink and device emulation, and remapping by
default the 7 reserved posix characters) and a workaround for cifs
mounts to Mac (working around a commonly encountered Mac server bug)"
* 'for-linus' of git://git.samba.org/sfrench/cifs-2.6:
[CIFS] Remove obsolete comment
Check minimum response length on query_network_interface
Workaround Mac server problem
Remap reserved posix characters by default (part 3/3)
Allow conversion of characters in Mac remap range (part 2)
Allow conversion of characters in Mac remap range. Part 1
mfsymlinks support for SMB2.1/SMB3. Part 2 query symlink
Add mfsymlinks support for SMB2.1/SMB3. Part 1 create symlink
Allow mknod and mkfifo on SMB2/SMB3 mounts
add defines for two new file attributes
This includes a single commit fixing a missing endian conversion.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUQT+qAAoJEDgbc8f8gGmqT1wQAMjq4O0wh0DNI08SrHHszwnI
e5btCmlMfNIvSfKmNcwstEJH5R86qi+8Q7fAKPj73XXlb1HLYjhQltuuysRSlmZE
Ll+eqnOWVJPxlFjyRDKzO7HxejMsqEE29gtmzw9iL43F2M1acmCMgrVLz8VJGHH2
21hK9eo9L51kLVXK2+rSexdJJNDVjuy9N25J1d/r9/9UEIp+gGhD5ZMpBlDmw1md
VJgbL/yv99mjicjVnd5nZE7fhcWFPPV5IyeyUQS0Pg92cAYZUNBBAzta3yU4ira4
tJU4V2k3zoK40TXqRnGNFgFNvm8cfzNUX+9HtWyWmr6VySfRVXFhSDcHa2/NQva9
4T1vVO5Dg/4+ELIdtaGMEv+B5KCv9/hWjmhnlLBIJktJdH+1NmRXjZAg6gM51XRB
ZwQIf929lELeqGTdfJyO4bR/NPcjrEnEpQFBBAuypko06ZJhXLJZqtKD/3o5lCjX
VOgPkuNtylPARdUo/m1EffF4Rm1LREcbMHDMUInaVt6qAeHRqq8QkvBhBQ7VzFX1
Y1dwGWh8m9pA3q48j2GIT1rWbl5Fue8f34RMnn4psRVM0rnQ8WHh3/p5wsSsXlib
BCX5ez5O2aF7pe4Lqj3NkGmyB6PvICRbU2ZEe2uPbvMZnoq0NouO4asCLSeycu52
fYlz9mZ3e+Y+sv2JN3vG
=eEi+
-----END PGP SIGNATURE-----
Merge tag 'dlm-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm fix from David Teigland:
"This includes a single commit fixing a missing endian conversion"
* tag 'dlm-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
dlm: fix missing endian conversion of rcom_status flags
Pull btrfs data corruption fix from Chris Mason:
"I'm testing a pull with more fixes, but wanted to get this one out so
Greg can pick it up.
The corruption isn't easy to hit, you have to do a readonly snapshot
and have orphans in the snapshot. But my review and testing missed
the bug. Filipe has added a better xfstest to cover it"
* 'for-linus-update' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Revert "Btrfs: race free update of commit root for ro snapshots"
Pull NTFS update from Anton Altaparmakov:
"Here is a small NTFS update notably implementing FIBMAP ioctl for NTFS
by adding the bmap address space operation. People seem to still want
FIBMAP"
* git://git.kernel.org/pub/scm/linux/kernel/git/aia21/ntfs:
NTFS: Bump version to 2.1.31.
NTFS: Add bmap address space operation needed for FIBMAP ioctl.
NTFS: Remove changelog from Documentation/filesystems/ntfs.txt.
NTFS: Split ntfs_aops into ntfs_normal_aops and ntfs_compressed_aops in preparation for them diverging.
Highlights include:
Stable fixes:
- Fix an uninitialised pointer Oops in the writeback error path
- Fix a bogus warning (and early exit from the loop) in nfs_generic_pgio
Features:
- Add NFSv4.2 SEEK feature and client support for lseek(SEEK_HOLE/SEEK_DATA)
Other fixes:
- pnfs: replace broken pnfs_put_lseg_async
- Remove dead prototype for nfs4_insert_deviceid_node
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUP4dhAAoJEGcL54qWCgDy1+EQAKpFVKDjGEik5V1ye492mGbS
moP3zvJtGTfwoK+eCV4HqIj0oFh06RqoQc5z4vc0MKd3uuxXgPFj/wmJef48CWot
1BKoDWItzZZlBAJsE0+0i6EWZ05tAZv7Fr9ial6VwchpNAih1K/NPcBix/jEwWYl
WEFhKyGiN6UNhWfL0nXgvyQPD5tevvkr+5qv0A8dm8HGTEBlGhbzt4sfiUbWHPMD
jCVsHBZNOeZQrFwVzMx+LCZaBwb+xeRAs0Pg1U51MetpwnNce9737M9oetbzgkI5
cSACmvo76osaerC2Avxj4t713xeqfur4YRwIajamkUxWbcjXm3vaAQmE744meJOO
IOkIrmiGfMnJHw3f5CFnsAXcXmGu16umRqTexzz5iCqZ65ZuKWgIxs9yzBCOK/Or
f8nbGsCU2EaX8mJxf+esP4SzstxtqHRtkTj4f8D4wWm/W+PXYPCzCsI1q3rDni0U
yNH2Z4gRWcGz6LVfL/9lDHldN5S4WLkJuZQJtyijcy1jYCjwEEUFbX1V41s85GFB
QdQvZ7Cr4NqRyqvR7zyIdH1jTWt2N5mJytsmEbSGoXgBFzRDYCFXfSNjjJ2JfThl
sGeWlP99uEVbTxCTTBbulFXqYrmiNN7mW61+qedqT2Oy5/YjJTSE4PCp1p2YWlx/
/c17q7jOq6cnbKcyu7cl
=ekPm
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-3.18-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Stable fixes:
- fix an uninitialised pointer Oops in the writeback error path
- fix a bogus warning (and early exit from the loop) in nfs_generic_pgio()
Features:
- Add NFSv4.2 SEEK feature and client support for lseek(SEEK_HOLE/SEEK_DATA)
Other fixes:
- pnfs: replace broken pnfs_put_lseg_async
- Remove dead prototype for nfs4_insert_deviceid_node"
* tag 'nfs-for-3.18-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFS: Fix a bogus warning in nfs_generic_pgio
NFS: Fix an uninitialised pointer Oops in the writeback error path
NFSv4.1/pnfs: replace broken pnfs_put_lseg_async
NFSv4: Remove dead prototype for nfs4_insert_deviceid_node()
NFS: Implement SEEK
Pull core block layer changes from Jens Axboe:
"This is the core block IO pull request for 3.18. Apart from the new
and improved flush machinery for blk-mq, this is all mostly bug fixes
and cleanups.
- blk-mq timeout updates and fixes from Christoph.
- Removal of REQ_END, also from Christoph. We pass it through the
->queue_rq() hook for blk-mq instead, freeing up one of the request
bits. The space was overly tight on 32-bit, so Martin also killed
REQ_KERNEL since it's no longer used.
- blk integrity updates and fixes from Martin and Gu Zheng.
- Update to the flush machinery for blk-mq from Ming Lei. Now we
have a per hardware context flush request, which both cleans up the
code should scale better for flush intensive workloads on blk-mq.
- Improve the error printing, from Rob Elliott.
- Backing device improvements and cleanups from Tejun.
- Fixup of a misplaced rq_complete() tracepoint from Hannes.
- Make blk_get_request() return error pointers, fixing up issues
where we NULL deref when a device goes bad or missing. From Joe
Lawrence.
- Prep work for drastically reducing the memory consumption of dm
devices from Junichi Nomura. This allows creating clone bio sets
without preallocating a lot of memory.
- Fix a blk-mq hang on certain combinations of queue depths and
hardware queues from me.
- Limit memory consumption for blk-mq devices for crash dump
scenarios and drivers that use crazy high depths (certain SCSI
shared tag setups). We now just use a single queue and limited
depth for that"
* 'for-3.18/core' of git://git.kernel.dk/linux-block: (58 commits)
block: Remove REQ_KERNEL
blk-mq: allocate cpumask on the home node
bio-integrity: remove the needless fail handle of bip_slab creating
block: include func name in __get_request prints
block: make blk_update_request print prefix match ratelimited prefix
blk-merge: don't compute bi_phys_segments from bi_vcnt for cloned bio
block: fix alignment_offset math that assumes io_min is a power-of-2
blk-mq: Make bt_clear_tag() easier to read
blk-mq: fix potential hang if rolling wakeup depth is too high
block: add bioset_create_nobvec()
block: use bio_clone_fast() in blk_rq_prep_clone()
block: misplaced rq_complete tracepoint
sd: Honor block layer integrity handling flags
block: Replace strnicmp with strncasecmp
block: Add T10 Protection Information functions
block: Don't merge requests if integrity flags differ
block: Integrity checksum flag
block: Relocate bio integrity flags
block: Add a disk flag to block integrity profile
block: Add prefix to block integrity profile flags
...
This reverts commit 9c3b306e1c.
Switching only one commit root during a transaction is wrong because it
leads the fs into an inconsistent state. All commit roots should be
switched at once, at transaction commit time, otherwise backref walking
can often miss important references that were only accessible through
the old commit root. Plus, the root item for the snapshot's root wasn't
getting updated and preventing the next transaction commit to do it.
This made several users get into random corruption issues after creation
of readonly snapshots.
A regression test for xfstests will follow soon.
Cc: stable@vger.kernel.org # 3.17
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Mac server returns that they support CIFS Unix Extensions but
doesn't actually support QUERY_FILE_UNIX_BASIC so mount fails.
Workaround this problem by disabling use of Unix CIFS protocol
extensions if server returns an EOPNOTSUPP error on
QUERY_FILE_UNIX_BASIC during mount.
Signed-off-by: Steve French <smfrench@gmail.com>
This is a bigger patch, but its size is mostly due to
a single change for how we check for remapping illegal characters
in file names - a lot of repeated, small changes to
the way callers request converting file names.
The final patch in the series does the following:
1) changes default behavior for cifs to be more intuitive.
Currently we do not map by default to seven reserved characters,
ie those valid in POSIX but not in NTFS/CIFS/SMB3/Windows,
unless a mount option (mapchars) is specified. Change this
to by default always map and map using the SFM maping
(like the Mac uses) unless the server negotiates the CIFS Unix
Extensions (like Samba does when mounting with the cifs protocol)
when the remapping of the characters is unnecessary. This should
help SMB3 mounts in particular since Samba will likely be
able to implement this mapping with its new "vfs_fruit" module
as it will be doing for the Mac.
2) if the user specifies the existing "mapchars" mount option then
use the "SFU" (Microsoft Services for Unix, SUA) style mapping of
the seven characters instead.
3) if the user specifies "nomapposix" then disable SFM/MAC style mapping
(so no character remapping would be used unless the user specifies
"mapchars" on mount as well, as above).
4) change all the places in the code that check for the superblock
flag on the mount which is set by mapchars and passed in on all
path based operation and change it to use a small function call
instead to set the mapping type properly (and check for the
mapping type in the cifs unicode functions)
Signed-off-by: Steve French <smfrench@gmail.com>
The previous patch allowed remapping reserved characters from directory
listenings, this patch adds conversion the other direction, allowing
opening of files with any of the seven reserved characters.
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
This allows directory listings to Mac to display filenames
correctly which have been created with illegal (to Windows)
characters in their filename. It does not allow
converting the other direction yet ie opening files with
these characters (followon patch).
There are seven reserved characters that need to be remapped when
mounting to Windows, Mac (or any server without Unix Extensions) which
are valid in POSIX but not in the other OS.
: \ < > ? * |
We used the normal UCS-2 remap range for this in order to convert this
to/from UTF8 as did Windows Services for Unix (basically add 0xF000 to
any of the 7 reserved characters), at least when the "mapchars" mount
option was specified.
Mac used a very slightly different "Services for Mac" remap range
0xF021 through 0xF027. The attached patch allows cifs.ko (the kernel
client) to read directories on macs containing files with these
characters and display their names properly. In theory this even
might be useful on mounts to Samba when the vfs_catia or new
"vfs_fruit" module is loaded.
Currently the 7 reserved characters look very strange in directory
listings from cifs.ko to Mac server. This patch allows these file
name characters to be read (requires specifying mapchars on mount).
Two additional changes are needed:
1) Make it more automatic: a way of detecting enough info so that
we know to try to always remap these characters or not. Various
have suggested that the SFM approach be made the default when
the server does not support POSIX Unix extensions (cifs mounts
to Samba for example) so need to make SFM remapping the default
unless mapchars (SFU style mapping) specified on mount or no
mapping explicitly requested or no mapping needed (cifs mounts to Samba).
2) Adding a patch to map the characters the other direction
(ie UTF-8 to UCS-2 on open). This patch does it for translating
readdir entries (ie UCS-2 to UTF-8)
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Adds support on SMB2.1 and SMB3 mounts for emulation of symlinks
via the "Minshall/French" symlink format already used for cifs
mounts when mfsymlinks mount option is used (and also used by Apple).
http://wiki.samba.org/index.php/UNIX_Extensions#Minshall.2BFrench_symlinks
This second patch adds support to query them (recognize them as symlinks
and read them). Third version of patch makes minor corrections
to error handling.
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Stefan Metzmacher <metze@samba.org>
Adds support on SMB2.1 and SMB3 mounts for emulation of symlinks
via the "Minshall/French" symlink format already used for cifs
mounts when mfsymlinks mount option is used (and also used by Apple).
http://wiki.samba.org/index.php/UNIX_Extensions#Minshall.2BFrench_symlinks
This first patch adds support to create them. The next patch will
add support for recognizing them and reading them. Although CIFS/SMB3
have other types of symlinks, in the many use cases they aren't
practical (e.g. either require cifs only mounts with unix extensions
to Samba, or require the user to be Administrator to Windows for SMB3).
This also helps enable running additional xfstests over SMB3 (since some
xfstests directly or indirectly require symlink support).
Signed-off-by: Steve French <smfrench@gmail.com>
CC: Stefan Metzmacher <metze@samba.org>
The "sfu" mount option did not work on SMB2/SMB3 mounts.
With these changes when the "sfu" mount option is passed in
on an smb2/smb2.1/smb3 mount the client can emulate (and
recognize) fifo and device (character and device files).
In addition the "sfu" mount option should not conflict
with "mfsymlinks" (symlink emulation) as we will never
create "sfu" style symlinks, but using "sfu" mount option
will allow us to recognize existing symlinks, created with
Microsoft "Services for Unix" (SFU and SUA).
To enable the "sfu" mount option for SMB2/SMB3 the calling
syntax of the generic cifs/smb2/smb3 sync_read and sync_write
protocol dependent function needed to be changed (we
don't have a file struct in all cases), but this actually
ended up simplifying the code a little.
Signed-off-by: Steve French <smfrench@gmail.com>
Pull percpu consistent-ops changes from Tejun Heo:
"Way back, before the current percpu allocator was implemented, static
and dynamic percpu memory areas were allocated and handled separately
and had their own accessors. The distinction has been gone for many
years now; however, the now duplicate two sets of accessors remained
with the pointer based ones - this_cpu_*() - evolving various other
operations over time. During the process, we also accumulated other
inconsistent operations.
This pull request contains Christoph's patches to clean up the
duplicate accessor situation. __get_cpu_var() uses are replaced with
with this_cpu_ptr() and __this_cpu_ptr() with raw_cpu_ptr().
Unfortunately, the former sometimes is tricky thanks to C being a bit
messy with the distinction between lvalues and pointers, which led to
a rather ugly solution for cpumask_var_t involving the introduction of
this_cpu_cpumask_var_ptr().
This converts most of the uses but not all. Christoph will follow up
with the remaining conversions in this merge window and hopefully
remove the obsolete accessors"
* 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (38 commits)
irqchip: Properly fetch the per cpu offset
percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix
ia64: sn_nodepda cannot be assigned to after this_cpu conversion. Use __this_cpu_write.
percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t
Revert "powerpc: Replace __get_cpu_var uses"
percpu: Remove __this_cpu_ptr
clocksource: Replace __this_cpu_ptr with raw_cpu_ptr
sparc: Replace __get_cpu_var uses
avr32: Replace __get_cpu_var with __this_cpu_write
blackfin: Replace __get_cpu_var uses
tile: Use this_cpu_ptr() for hardware counters
tile: Replace __get_cpu_var uses
powerpc: Replace __get_cpu_var uses
alpha: Replace __get_cpu_var
ia64: Replace __get_cpu_var uses
s390: cio driver &__get_cpu_var replacements
s390: Replace __get_cpu_var uses
mips: Replace __get_cpu_var uses
MIPS: Replace __get_cpu_var uses in FPU emulator.
arm: Replace __this_cpu_ptr with raw_cpu_ptr
...
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUPOQPAAoJEHKvRublQaWiZnQP/0bDfPvhaNKZtmDKYzbqQm9x
Gh3fB53d7TtIei///5eD/LUGObg0Ze8g2k8aAC05hcq5Be8Haitma3iDDSvJq6ig
umYU9BWkv47BHQy0gAVMXyaxHocICq7W9bnOvU4SSEW/sWeneBllyxgCfV7EKSTd
B/OXP+Ovr4FtjAweOACCN8b0M9w4wdxNKfDNV+WDIHAddYBngxrwq7zASKNNCx2N
0u9sXIdTp0Fvxmyx/lYLC5NXN0CUDjB3Ffdx3+eehrBp2lT6JdlkYU403c85cIMP
oIPJVKFbOnM04MfCjhFNTrK9OtC2eD6PoWq+FLL0UJx3YW5HkbLsiHGm/2UTJ2Z1
5QOwDebMxlvrb6f6Gv846ADl7YcByiXkieTDHRlOnplVRNV5Sj8UWAgq+zyq7sWq
2uRuW2UvKx7vYoAwKRwCaIoqpIe3NIvZQzE7C9mGOprIawZ5e0YJzaR6OoBs4Y8i
gmBeFx266URJun7isy1R7JJsMjYzxbEXju9zH/SkghbLHnf8yqafIG+pG1GD7n5R
o2C/5TVXjmEhIoDn8j2ZozaElX4mD7REKILpIGaE4XltExNTexq8neRo9D3ajzif
N5RrMCAkBzIxMz83evDppe3ObtkEaf0K43VCO3AVQ2g9jXg7ttKhR2hb8HRqCMHe
Lp3d8qKyZyL/HZ5F62yX
=zJyI
-----END PGP SIGNATURE-----
Merge tag 'llvmlinux-for-v3.18' of git://git.linuxfoundation.org/llvmlinux/kernel
Pull LLVM updates from Behan Webster:
"These patches remove the use of VLAIS using a new SHASH_DESC_ON_STACK
macro.
Some of the previously accepted VLAIS removal patches haven't used
this macro. I will push new patches to consistently use this macro in
all those older cases for 3.19"
[ More LLVM patches coming in through subsystem trees, and LLVM itself
needs some fixes that are already in many distributions but not in
released versions of LLVM. Some day this will all "just work" - Linus ]
* tag 'llvmlinux-for-v3.18' of git://git.linuxfoundation.org/llvmlinux/kernel:
crypto: LLVMLinux: Remove VLAIS usage from crypto/testmgr.c
security, crypto: LLVMLinux: Remove VLAIS from ima_crypto.c
crypto: LLVMLinux: Remove VLAIS usage from libcrc32c.c
crypto: LLVMLinux: Remove VLAIS usage from crypto/hmac.c
crypto, dm: LLVMLinux: Remove VLAIS usage from dm-crypt
crypto: LLVMLinux: Remove VLAIS from crypto/.../qat_algs.c
crypto: LLVMLinux: Remove VLAIS from crypto/omap_sham.c
crypto: LLVMLinux: Remove VLAIS from crypto/n2_core.c
crypto: LLVMLinux: Remove VLAIS from crypto/mv_cesa.c
crypto: LLVMLinux: Remove VLAIS from crypto/ccp/ccp-crypto-sha.c
btrfs: LLVMLinux: Remove VLAIS
crypto: LLVMLinux: Add macro to remove use of VLAIS in crypto code
Pull Ceph updates from Sage Weil:
"There is the long-awaited discard support for RBD (Guangliang Zhao,
Josh Durgin), a pile of RBD bug fixes that didn't belong in late -rc's
(Ilya Dryomov, Li RongQing), a pile of fs/ceph bug fixes and
performance and debugging improvements (Yan, Zheng, John Spray), and a
smattering of cleanups (Chao Yu, Fabian Frederick, Joe Perches)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (40 commits)
ceph: fix divide-by-zero in __validate_layout()
rbd: rbd workqueues need a resque worker
libceph: ceph-msgr workqueue needs a resque worker
ceph: fix bool assignments
libceph: separate multiple ops with commas in debugfs output
libceph: sync osd op definitions in rados.h
libceph: remove redundant declaration
ceph: additional debugfs output
ceph: export ceph_session_state_name function
ceph: include the initial ACL in create/mkdir/mknod MDS requests
ceph: use pagelist to present MDS request data
libceph: reference counting pagelist
ceph: fix llistxattr on symlink
ceph: send client metadata to MDS
ceph: remove redundant code for max file size verification
ceph: remove redundant io_iter_advance()
ceph: move ceph_find_inode() outside the s_mutex
ceph: request xattrs if xattr_version is zero
rbd: set the remaining discard properties to enable support
rbd: use helpers to handle discard for layered images correctly
...
Pull pivot_root() fix from Andy Lutomirski.
Prevent a leak of unreachable mounts.
* 'CVE-2014-7970' of git://git.kernel.org/pub/scm/linux/kernel/git/luto/linux:
mnt: Prevent pivot_root from creating a loop in the mount tree
Andy Lutomirski recently demonstrated that when chroot is used to set
the root path below the path for the new ``root'' passed to pivot_root
the pivot_root system call succeeds and leaks mounts.
In examining the code I see that starting with a new root that is
below the current root in the mount tree will result in a loop in the
mount tree after the mounts are detached and then reattached to one
another. Resulting in all kinds of ugliness including a leak of that
mounts involved in the leak of the mount loop.
Prevent this problem by ensuring that the new mount is reachable from
the current root of the mount tree.
[Added stable cc. Fixes CVE-2014-7970. --Andy]
Cc: stable@vger.kernel.org
Reported-by: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/87bnpmihks.fsf@x220.int.ebiederm.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
The flags are already converted to le when being sent,
but are not being converted back to cpu when received.
Signed-off-by: Neale Ferguson <neale@sinenomine.net>
Signed-off-by: David Teigland <teigland@redhat.com>
Fix some coccinelle warnings:
fs/ceph/caps.c:2400:6-10: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2401:6-15: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2402:6-17: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2403:6-22: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2404:6-22: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2405:6-19: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2440:4-20: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2469:3-16: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2490:2-18: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2519:3-7: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2549:3-12: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2575:2-6: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2589:3-7: WARNING: Assignment of bool to 0/1
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Current code set new file/directory's initial ACL in a non-atomic
manner.
Client first sends request to MDS to create new file/directory, then set
the initial ACL after the new file/directory is successfully created.
The fix is include the initial ACL in create/mkdir/mknod MDS requests.
So MDS can handle creating file/directory and setting the initial ACL in
one request.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Current code uses page array to present MDS request data. Pages in the
array are allocated/freed by caller of ceph_mdsc_do_request(). If request
is interrupted, the pages can be freed while they are still being used by
the request message.
The fix is use pagelist to present MDS request data. Pagelist is
reference counted.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
this allow pagelist to present data that may be sent multiple times.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Implement version 2 of CEPH_MSG_CLIENT_SESSION syntax,
which includes additional client metadata to allow
the MDS to report on clients by user-sensible names
like hostname.
Signed-off-by: John Spray <john.spray@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Both ceph_update_writeable_page and ceph_setattr will verify file size
with max size ceph supported.
There are two caller for ceph_update_writeable_page, ceph_write_begin and
ceph_page_mkwrite. For ceph_write_begin, we have already verified the size in
generic_write_checks of ceph_write_iter; for ceph_page_mkwrite, we have no
chance to change file size when mmap. Likewise we have already verified the size
in inode_change_ok when we call ceph_setattr.
So let's remove the redundant code for max file size verification.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
ceph_find_inode() may wait on freeing inode, using it inside the s_mutex
may cause deadlock. (the freeing inode is waiting for OSD read reply, but
dispatch thread is blocked by the s_mutex)
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Following sequence of events can happen.
- Client releases an inode, queues cap release message.
- A 'lookup' reply brings the same inode back, but the reply
doesn't contain xattrs because MDS didn't receive the cap release
message and thought client already has up-to-data xattrs.
The fix is force sending a getattr request to MDS if xattrs_version
is 0. The getattr mask is set to CEPH_STAT_CAP_XATTR, so MDS knows client
does not have xattr.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
we may corrupt waiting list if a request in the waiting list is kicked.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
So the recovering MDS does not need to fetch these ununsed inodes during
cache rejoin. This may reduce MDS recovery time.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
REQ_KERNEL is no longer used. Remove it and drop the redundant uio
argument to nfs_file_direct_{read,write}.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Replaced the use of a Variable Length Array In Struct (VLAIS) with a C99
compliant equivalent. This patch instead allocates the appropriate amount of
memory using a char array using the SHASH_DESC_ON_STACK macro.
The new code can be compiled with both gcc and clang.
Signed-off-by: Vinícius Tinti <viniciustinti@gmail.com>
Reviewed-by: Jan-Simon Möller <dl9pf@gmx.de>
Reviewed-by: Mark Charlebois <charlebm@gmail.com>
Signed-off-by: Behan Webster <behanw@converseincode.com>
Acked-by: Chris Mason <clm@fb.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIVAwUAVDwD3ROxKuMESys7AQLayg//Tmdi4eLzcky/HcOfAoVIY3B5Wvs1MBbN
3HhaYWKDeJvWxFmRDfQK0c1dyjBA2Xe7bPhdwQ8S9epAWAoW6D4g3Mg2+YReGLCK
U/CcrMHN77RSydTG0Mj/Z99IynSdf9rwdNrCEy8NiNkGe8Z/JCFPpZurRCc4PL44
4miTUq3ESMTGkUsa9BH+T0ngEka2ZdwnmzlYkdzeqmjmlbFx8RxcEewBeAoAlU73
eihKKyX+1uWX/2DmJol5NtZx+BbNkFsO+pX+s+70TsbjiyILCAmgh5meTpkGsDrW
iJGcgxwhcmyq1aTPcHRmXeNsVenbqRefGUtz7B5Q0x1Uk+ofRYfVVdiyTS2juGbC
DFGyNBUcFqsmbSMxM+yZGSzgR9KbzoZHDR/ppbJfMqIoe+oGju/NE+AZ6Q3f2/Es
AIGc8imc96QU08OnrZtreZxfgFMcFxBoGHvAM9AUr1ue80SWhVRZjwYx/JcIP7Cm
TKyilgb5hfxJ7zon+JuHSqttpeG3zOTjjhcKDmJlybYkKlTeRXm6ZcKVrro5d2+z
GLnH32HQRJvXBZslymqb7OgkxIW4ySO3PcAWTosUv9+zG0BPR1mB0NVQrSLEPk4L
JHA+Mjp8O37pN3kRantVNHk73t0z4qkbi8Ixft0yAus9qNNFMeKh+7NbBRjxUZAU
ARcAbvVMyT0=
=RtLr
-----END PGP SIGNATURE-----
Merge tag 'fscache-fixes-20141013' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull fs-cache fixes from David Howells:
"Two fixes for bugs in CacheFiles and a cleanup in FS-Cache"
* tag 'fscache-fixes-20141013' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
fs/fscache/object-list.c: use __seq_open_private()
CacheFiles: Fix incorrect test for in-memory object collision
CacheFiles: Handle object being killed before being set up
UBIFS is unable to mount a file-system (Hujianyang)
* Few fixes for the ubiblock sybsystem, error path fixes
* The ubiblock subsystem has had the volume size change handling improved
* Few fixes and nicifications in the fastmap subsystem
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUO9l0AAoJECmIfjd9wqK01TwP/jAcA7GEnxUpQ8UFBZhJEIN0
0Ad4oDrGShpuEYgYyRFjCstuXErJBhMwImrevJhRmwxaY2fzGqBeDO9YKKGkKDfa
qjGsQrUaCJgV6qC2iT056ZmI7V/XyZfnZQ4Z8nQbzafoJ3MPbB6ExqBy8CZi8q/6
A516cen/cnZfHOQ1aqN6gyw2l976IzdJx8v0WOeYaXcvfDMrmfY8mkfh7EahOIVm
Kz9BVlVRxxfKPCqMpm+xV8KAOsMueOnKy+6zL7rFh+AvLQBACq44BV1HkZtg2avX
NBAo1RTPumeCht2t4nLJfgc+BJZ7cNpNFAijsWVJxp6umUqlsnbqckAx69O+JE9/
VZjM1KN1suI0bm01bj6xysGvg+JNTMiZ+HEqiseICSWtDbnCT4qDL3MPFgmD9OYh
9ar92Ku2HeY3DakKNd89gqw0ey28cv4i957KleneYzewcfFQ5pC/dp4thcDWa5fH
AHoblC4ShmcURDPYsIKRZsiTUf/uf3iLFIWAGJBDnSRg4dzzjoJkenz4W5ecWFDj
JokceklSf0zm8qAAdIUXw5Sihza1cnSBAIYBxVR808U+bwkCTOFF5xcTQy6wKf3y
NBb+ygh/ugps8B2evJEmp6ByLWQZr8j1q7IokZtglKWN2qOTfzyMxzlWl9vOQJYq
EQytnka5OEEXamr7g1iB
=2XCN
-----END PGP SIGNATURE-----
Merge tag 'upstream-3.18-rc1-v2' of git://git.infradead.org/linux-ubifs
Pull UBI/UBIFS fixes from Artem Bityutskiy:
- fix for a theoretical race condition which could lead to a situation
when UBIFS is unable to mount a file-system (Hujianyang)
- a few fixes for the ubiblock sybsystem, error path fixes
- the ubiblock subsystem has had the volume size change handling
improved
- a few fixes and nicifications in the fastmap subsystem
* tag 'upstream-3.18-rc1-v2' of git://git.infradead.org/linux-ubifs:
UBI: Fastmap: Calc fastmap size correctly
UBIFS: Fix trivial typo in power_cut_emulated()
UBI: Fix trivial typo in __schedule_ubi_work
UBI: wl: Rename cancel flag to shutdown
UBI: ubi_eba_read_leb: Remove in vain variable assignment
UBIFS: Align the dump messages of SB_NODE
UBI: Fix livelock in produce_free_peb()
UBI: return on error in rename_volumes()
UBI: Improve comment on work_sem
UBIFS: Remove bogus assert
UBI: Dispatch update notification if the volume is updated
UBI: block: Add support for the UBI_VOLUME_UPDATED notification
UBI: block: Fix block device size setting
UBI: block: fix dereference on uninitialized dev
UBI: add missing kmem_cache_free() in process_pool_aeb error path
UBIFS: fix free log space calculation
UBIFS: fix a race condition
Convert the ext4_has_group_desc_csum predicate to look for a checksum
driver instead of the metadata_csum flag and change the bg checksum
calculation function to look for GDT_CSUM before taking the crc16
path.
Without this patch, if we mount with ^uninit_bg,^metadata_csum and
later metadata_csum gets turned on by accident, the block group
checksum functions will incorrectly assume that checksumming is
enabled (metadata_csum) but that crc16 should be used
(!s_chksum_driver). This is totally wrong, so fix the predicate
and the checksum formula selection.
(Granted, if the metadata_csum feature bit gets enabled on a live FS
then something underhanded is going on, but we could at least avoid
writing garbage into the on-disk fields.)
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org>
Cc: stable@vger.kernel.org
Pull do_umount fix from Andy Lutomirski:
"This fix really ought to be safe. Inside a mountns owned by a
non-root user namespace, the namespace root almost always has
MNT_LOCKED set (if it doesn't, then there's a bug, because rootfs
could be exposed). In that case, calling umount on "/" will return
-EINVAL with or without this patch.
Outside a userns, this patch will have no effect. may_mount, required
by umount, already checks
ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN)
so an additional capable(CAP_SYS_ADMIN) check will have no effect.
That leaves anything that calls umount on "/" in a non-root userns
while chrooted. This is the case that is currently broken (it
remounts ro, which shouldn't be allowed) and that my patch changes to
-EPERM. If anything relies on *that*, I'd be surprised"
* 'CVE-2014-7975' of git://git.kernel.org/pub/scm/linux/kernel/git/luto/linux:
fs: Add a missing permission check to do_umount
For VMAs that don't want write notifications, PTEs created for read faults
have their write bit set. If the read fault happens after VM_SOFTDIRTY is
cleared, then the PTE's softdirty bit will remain clear after subsequent
writes.
Here's a simple code snippet to demonstrate the bug:
char* m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
MAP_ANONYMOUS | MAP_SHARED, -1, 0);
system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */
assert(*m == '\0'); /* new PTE allows write access */
assert(!soft_dirty(x));
*m = 'x'; /* should dirty the page */
assert(soft_dirty(x)); /* fails */
With this patch, write notifications are enabled when VM_SOFTDIRTY is
cleared. Furthermore, to avoid unnecessary faults, write notifications
are disabled when VM_SOFTDIRTY is set.
As a side effect of enabling and disabling write notifications with
care, this patch fixes a bug in mprotect where vm_page_prot bits set by
drivers were zapped on mprotect. An analogous bug was fixed in mmap by
commit c9d0bf2414 ("mm: uncached vma support with writenotify").
Signed-off-by: Peter Feiner <pfeiner@google.com>
Reported-by: Peter Feiner <pfeiner@google.com>
Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Jamie Liu <jamieliu@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's very common for the buffer heads in the lru to have different block
numbers. By comparing the blocknr before the bdev and size we can
reduce the cost of searching in the very common case where all the
entries have the same bdev and size.
In quick hot cache cycle counting tests on a single fs workstation this
cut the cost of a miss by about 20%.
A diff of the disassembly shows the reordering of the bdev and blocknr
comparisons. This is in such a tiny loop that skipping one comparison
is a meaningful portion of the total work being done:
1628: 83 c1 01 add $0x1,%ecx
162b: 83 f9 08 cmp $0x8,%ecx
162e: 74 60 je 1690 <__find_get_block+0xa0>
1630: 89 c8 mov %ecx,%eax
1632: 65 4c 8b 04 c5 00 00 mov %gs:0x0(,%rax,8),%r8
1639: 00 00
163b: 4d 85 c0 test %r8,%r8
163e: 4c 89 c3 mov %r8,%rbx
1641: 74 e5 je 1628 <__find_get_block+0x38>
- 1643: 4d 3b 68 30 cmp 0x30(%r8),%r13
+ 1643: 4d 3b 68 18 cmp 0x18(%r8),%r13
1647: 75 df jne 1628 <__find_get_block+0x38>
- 1649: 4d 3b 60 18 cmp 0x18(%r8),%r12
+ 1649: 4d 3b 60 30 cmp 0x30(%r8),%r12
164d: 75 d9 jne 1628 <__find_get_block+0x38>
164f: 49 39 50 20 cmp %rdx,0x20(%r8)
1653: 75 d3 jne 1628 <__find_get_block+0x38>
Signed-off-by: Zach Brown <zab@zabbo.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kernel used to contain two functions for length-delimited,
case-insensitive string comparison, strnicmp with correct semantics and
a slightly buggy strncasecmp. The latter is the POSIX name, so strnicmp
was renamed to strncasecmp, and strnicmp made into a wrapper for the new
strncasecmp to avoid breaking existing users.
To allow the compat wrapper strnicmp to be removed at some point in the
future, and to avoid the extra indirection cost, do
s/strnicmp/strncasecmp/g.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kernel used to contain two functions for length-delimited,
case-insensitive string comparison, strnicmp with correct semantics and
a slightly buggy strncasecmp. The latter is the POSIX name, so strnicmp
was renamed to strncasecmp, and strnicmp made into a wrapper for the new
strncasecmp to avoid breaking existing users.
To allow the compat wrapper strnicmp to be removed at some point in the
future, and to avoid the extra indirection cost, do
s/strnicmp/strncasecmp/g.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kernel used to contain two functions for length-delimited,
case-insensitive string comparison, strnicmp with correct semantics and
a slightly buggy strncasecmp. The latter is the POSIX name, so strnicmp
was renamed to strncasecmp, and strnicmp made into a wrapper for the new
strncasecmp to avoid breaking existing users.
To allow the compat wrapper strnicmp to be removed at some point in the
future, and to avoid the extra indirection cost, do
s/strnicmp/strncasecmp/g.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Steve French <sfrench@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch defines maximum block number to 2^31. It also converts
bitmap_size and array_size to unsigned int in omfs_get_imap
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Bob Copeland <me@bobcopeland.com>
Acked-by: Bob Copeland <me@bobcopeland.com>
Tested-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sys_tz is already declared in include/linux/time.h
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Four functions declared variables twice resulting in shadow warnings.
This patch renames internal variables and adds blank line after
declarations.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
head is set to AFFS_HEAD(bh) but never used.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
key is set in affs_fill_super but never used.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
format_corename() can only pass the leader's pid to the core handler,
but there is no simple way to figure out which thread originated the
coredump.
As Jan explains, this also means that there is no simple way to create
the backtrace of the crashed process:
As programs are mostly compiled with implicit gcc -fomit-frame-pointer
one needs program's .eh_frame section (equivalently PT_GNU_EH_FRAME
segment) or .debug_frame section. .debug_frame usually is present only
in separate debug info files usually not even installed on the system.
While .eh_frame is a part of the executable/library (and it is even
always mapped for C++ exceptions unwinding) it no longer has to be
present anywhere on the disk as the program could be upgraded in the
meantime and the running instance has its executable file already
unlinked from disk.
One possibility is to echo 0x3f >/proc/*/coredump_filter and dump all
the file-backed memory including the executable's .eh_frame section.
But that can create huge core files, for example even due to mmapped
data files.
Other possibility would be to read .eh_frame from /proc/PID/mem at the
core_pattern handler time of the core dump. For the backtrace one needs
to read the register state first which can be done from core_pattern
handler:
ptrace(PTRACE_SEIZE, tid, 0, PTRACE_O_TRACEEXIT)
close(0); // close pipe fd to resume the sleeping dumper
waitpid(); // should report EXIT
PTRACE_GETREGS or other requests
The remaining problem is how to get the 'tid' value of the crashed
thread. It could be read from the first NT_PRSTATUS note of the core
file but that makes the core_pattern handler complicated.
Unfortunately %t is already used so this patch uses %i/%I.
Automatic Bug Reporting Tool (https://github.com/abrt/abrt/wiki/overview)
is experimenting with this. It is using the elfutils
(https://fedorahosted.org/elfutils/) unwinder for generating the
backtraces. Apart from not needing matching executables as mentioned
above, another advantage is that we can get the backtrace without saving
the core (which might be quite large) to disk.
[mmilata@redhat.com: final paragraph of changelog]
Signed-off-by: Jan Kratochvil <jan.kratochvil@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
Cc: Mark Wielaard <mjw@redhat.com>
Cc: Martin Milata <mmilata@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge conditional unlock/lock in the same condition to avoid sparse
warning:
fs/reiserfs/journal.c:703:36: warning: context imbalance in 'add_to_chunk' - unexpected unlock
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ucg is defined and set in ufs_bitmap_search but never used.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sys_tz is already declared in include/linux/time.h
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Support for fdatasync() has been implemented in NILFS2 for a long time,
but whenever the corresponding inode is dirty the implementation falls
back to a full-flegded sync(). Since every write operation has to
update the modification time of the file, the inode will almost always
be dirty and fdatasync() will fall back to sync() most of the time. But
this fallback is only necessary for a change of the file size and not
for a change of the various timestamps.
This patch adds a new flag NILFS_I_INODE_SYNC to differentiate between
those two situations.
* If it is set the file size was changed and a full sync is necessary.
* If it is not set then only the timestamps were updated and
fdatasync() can go ahead.
There is already a similar flag I_DIRTY_DATASYNC on the VFS layer with
the exact same semantics. Unfortunately it cannot be used directly,
because NILFS2 doesn't implement write_inode() and doesn't clear the VFS
flags when inodes are written out. So the VFS writeback thread can
clear I_DIRTY_DATASYNC at any time without notifying NILFS2. So
I_DIRTY_DATASYNC has to be mapped onto NILFS_I_INODE_SYNC in
nilfs_update_inode().
Signed-off-by: Andreas Rohner <andreas.rohner@gmx.net>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Under normal circumstances nilfs_sync_fs() writes out the super block,
which causes a flush of the underlying block device. But this depends
on the THE_NILFS_SB_DIRTY flag, which is only set if the pointer to the
last segment crosses a segment boundary. So if only a small amount of
data is written before the call to nilfs_sync_fs(), no flush of the
block device occurs.
In the above case an additional call to blkdev_issue_flush() is needed.
To prevent unnecessary overhead, the new flag nilfs->ns_flushed_device
is introduced, which is cleared whenever new logs are written and set
whenever the block device is flushed. For convenience the function
nilfs_flush_device() is added, which contains the above logic.
Signed-off-by: Andreas Rohner <andreas.rohner@gmx.net>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If rcu-walk mode we don't *have* to return -EISDIR for non-mount-traps
as we will simply drop into REF-walk and handling DCACHE_NEED_AUTOMOUNT
dentrys the slow way. But it is better if we do when possible.
In 'oz_mode', use the same condition as ref-walk: if not a mountpoint,
then it must be -EISDIR.
In regular mode there are most tests needed. Most of them can be
performed without taking any spinlocks. If we find a directory that
isn't obviously empty, and isn't mounted on, we need to call
'simple_empty()' which does take a spinlock. If this turned out to hurt
performance, some other approach could be found to signal when a
directory is known to be empty.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Ian Kent <raven@themaw.net>
Tested-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
->fs_lock protects AUTOFS_INF_EXPIRING. We need to be sure that once
the flag is set, no new references beneath the dentry are taken. So
rcu-walk currently needs to take fs_lock before checking the flag. This
hurts performance.
Change the expiry to a two-stage process. First set AUTOFS_INF_NO_RCU
which forces any path walk into ref-walk mode, then drop the lock and
call synchronize_rcu(). Once that returns we can be sure no rcu-walk is
active beneath the dentry and we can check reference counts again.
Now during an RCU-walk we can test AUTOFS_INF_EXPIRING without taking
the lock as along as we test AUTOFS_INF_NO_RCU too. If either are set,
we must abort the RCU-walk If neither are set, we know that refcounts
will be tested again after we finish the RCU-walk so we are safe to
continue.
->fs_lock is still taken in d_manage() to check for a non-trap
directory. That will be resolved in the next patch.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Ian Kent <raven@themaw.net>
Tested-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Have a "test" function change the value it is testing can be confusing,
particularly as a future patch will be calling this function twice.
So move the update for 'last_used' to avoid repeat expiry to the place
where the final determination on what to expire is known.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Ian Kent <raven@themaw.net>
Tested-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Future patch will potentially call this twice, so make it separate.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Ian Kent <raven@themaw.net>
Tested-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This series teaches autofs about RCU-walk so that we don't drop straight
into REF-walk when we hit an autofs directory, and so that we avoid
spinlocks as much as possible when performing an RCU-walk.
This is needed so that the benefits of the recent NFS support for
RCU-walk are fully available when NFS filesystems are automounted.
Patches have been carefully reviewed and tested both with test suites
and in production - thanks a lot to Ian Kent for his support there.
This patch (of 6):
Any attempt to look up a pathname that passes though an autofs4 mount is
currently forced out of RCU-walk into REF-walk.
This can significantly hurt performance of many-thread work loads on
many-core systems, especially if the automounted filesystem supports
RCU-walk but doesn't get to benefit from it.
So if autofs4_d_manage is called with rcu_walk set, only fail with -ECHILD
if it is necessary to wait longer than a spinlock.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Ian Kent <raven@themaw.net>
Tested-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sys_tz is already declared in include/linux/time.h
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Petr Vandrovec <petr@vandrovec.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
gcc-4.9 on ARM gives us a mysterious warning about the binfmt_misc
parse_command function:
fs/binfmt_misc.c: In function 'parse_command.part.3':
fs/binfmt_misc.c:405:7: warning: array subscript is above array bounds [-Warray-bounds]
I've managed to trace this back to the ARM implementation of memset,
which is called from copy_from_user in case of a fault and which does
#define memset(p,v,n) \
({ \
void *__p = (p); size_t __n = n; \
if ((__n) != 0) { \
if (__builtin_constant_p((v)) && (v) == 0) \
__memzero((__p),(__n)); \
else \
memset((__p),(v),(__n)); \
} \
(__p); \
})
Apparently gcc gets confused by the check for "size != 0" and believes
that the size might be zero when it gets to the line that does "if
(s[count-1] == '\n')", so it would access data outside of the array.
gcc is clearly wrong here, since this condition was already checked
earlier in the function and the 'size' value can not change in the
meantime.
Fortunately, we can work around it and get rid of the warning by
rearranging the function to check for zero size after doing the
copy_from_user. It is still safe to pass a zero size into
copy_from_user, so it does not cause any side effects.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current code places a 256 byte limit on the registration format.
This ends up being fairly limited when you try to do matching against a
binary format like ELF:
- the magic & mask formats cannot have any embedded NUL chars
(string_unescape_inplace halts at the first NUL)
- each escape sequence quadruples the size: \x00 is needed for NUL
- trying to match bytes at the start of the file as well as further
on leads to a lot of \x00 sequences in the mask
- magic & mask have to be the same length (when decoded)
- still need bytes for the other fields
- impossible!
Let's look at a concrete (and common) example: using QEMU to run MIPS
ELFs. The name field uses 11 bytes "qemu-mipsel". The interp uses 20
bytes "/usr/bin/qemu-mipsel". The type & flags takes up 4 bytes. We
need 7 bytes for the delimiter (usually ":"). We can skip offset. So
already we're down to 107 bytes to use with the magic/mask instead of
the real limit of 128 (BINPRM_BUF_SIZE). If people use shell code to
register (which they do the majority of the time), they're down to ~26
possible bytes since the escape sequence must be \x##.
The ELF format looks like (both 32 & 64 bit):
e_ident: 16 bytes
e_type: 2 bytes
e_machine: 2 bytes
Those 20 bytes are enough for most architectures because they have so few
formats in the first place, thus they can be uniquely identified. That
also means for shell users, since 20 is smaller than 26, they can sanely
register a handler.
But for some targets (like MIPS), we need to poke further. The ELF fields
continue on:
e_entry: 4 or 8 bytes
e_phoff: 4 or 8 bytes
e_shoff: 4 or 8 bytes
e_flags: 4 bytes
We only care about e_flags here as that includes the bits to identify
whether the ELF is O32/N32/N64. But now we have to consume another 16
bytes (for 32 bit ELFs) or 28 bytes (for 64 bit ELFs) just to match the
flags. If every byte is escaped, we send 288 more bytes to the kernel
((20 {e_ident,e_type,e_machine} + 12 {e_entry,e_phoff,e_shoff} + 4
{e_flags}) * 2 {mask,magic} * 4 {escape}) and we've clearly blown our
budget.
Even if we try to be clever and do the decoding ourselves (rather than
relying on the kernel to process \x##), we still can't hit the mark --
string_unescape_inplace treats mask & magic as C strings so NUL cannot
be embedded. That leaves us with having to pass \x00 for the 12/24
entry/phoff/shoff bytes (as those will be completely random addresses),
and that is a minimum requirement of 48/96 bytes for the mask alone.
Add up the rest and we blow through it (this is for 64 bit ELFs):
magic: 20 {e_ident,e_type,e_machine} + 24 {e_entry,e_phoff,e_shoff} +
4 {e_flags} = 48 # ^^ See note below.
mask: 20 {e_ident,e_type,e_machine} + 96 {e_entry,e_phoff,e_shoff} +
4 {e_flags} = 120
Remember above we had 107 left over, and now we're at 168. This is of
course the *best* case scenario -- you'll also want to have NUL bytes
in the magic & mask too to match literal zeros.
Note: the reason we can use 24 in the magic is that we can work off of the
fact that for bytes the mask would clobber, we can stuff any value into
magic that we want. So when mask is \x00, we don't need the magic to also
be \x00, it can be an unescaped raw byte like '!'. This lets us handle
more formats (barely) under the current 256 limit, but that's a pretty
tall hoop to force people to jump through.
With all that said, let's bump the limit from 256 bytes to 1920. This way
we support escaping every byte of the mask & magic field (which is 1024
bytes by themselves -- 128 * 4 * 2), and we leave plenty of room for other
fields. Like long paths to the interpreter (when you have source in your
/really/long/homedir/qemu/foo). Since the current code stuffs more than
one structure into the same buffer, we leave a bit of space to easily
round up to 2k. 1920 is just as arbitrary as 256 ;).
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reduce boilerplate code by using __seq_open_private() instead of seq_open()
in fscache_objlist_open().
Signed-off-by: Rob Jones <rob.jones@codethink.co.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Steve Dickson <steved@redhat.com>
When CacheFiles cache objects are in use, they have in-memory representations,
as defined by the cachefiles_object struct. These are kept in a tree rooted in
the cache and indexed by dentry pointer (since there's a unique mapping between
object index key and dentry).
Collisions can occur between a representation already in the tree and a new
representation being set up because it takes time to dispose of an old
representation - particularly if it must be unlinked or renamed.
When such a collision occurs, cachefiles_mark_object_active() is meant to check
to see if the old, already-present representation is in the process of being
discarded (ie. FSCACHE_OBJECT_IS_LIVE is not set on it) - and, if so, wait for
the representation to be removed (ie. CACHEFILES_OBJECT_ACTIVE is then
cleared).
However, the test for whether the old representation is still live is checking
the new object - which always will be live at this point. This leads to an
oops looking like:
CacheFiles: Error: Unexpected object collision
object: OBJ1b354
objstate=LOOK_UP_OBJECT fl=8 wbusy=2 ev=0[0]
ops=0 inp=0 exc=0
parent=ffff88053f5417c0
cookie=ffff880538f202a0 [pr=ffff8805381b7160 nd=ffff880509c6eb78 fl=27]
key=[8] '2490000000000000'
xobject: OBJ1a600
xobjstate=DROP_OBJECT fl=70 wbusy=2 ev=0[0]
xops=0 inp=0 exc=0
xparent=ffff88053f5417c0
xcookie=ffff88050f4cbf70 [pr=ffff8805381b7160 nd= (null) fl=12]
------------[ cut here ]------------
kernel BUG at fs/cachefiles/namei.c:200!
...
Workqueue: fscache_object fscache_object_work_func [fscache]
...
RIP: ... cachefiles_walk_to_object+0x7ea/0x860 [cachefiles]
...
Call Trace:
[<ffffffffa04dadd8>] ? cachefiles_lookup_object+0x58/0x100 [cachefiles]
[<ffffffffa01affe9>] ? fscache_look_up_object+0xb9/0x1d0 [fscache]
[<ffffffffa01afc4d>] ? fscache_parent_ready+0x2d/0x80 [fscache]
[<ffffffffa01b0672>] ? fscache_object_work_func+0x92/0x1f0 [fscache]
[<ffffffff8107e82b>] ? process_one_work+0x16b/0x400
[<ffffffff8107fc16>] ? worker_thread+0x116/0x380
[<ffffffff8107fb00>] ? manage_workers.isra.21+0x290/0x290
[<ffffffff81085edc>] ? kthread+0xbc/0xe0
[<ffffffff81085e20>] ? flush_kthread_worker+0x80/0x80
[<ffffffff81502d0c>] ? ret_from_fork+0x7c/0xb0
[<ffffffff81085e20>] ? flush_kthread_worker+0x80/0x80
Reported-by: Manuel Schölling <manuel.schoelling@gmx.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Steve Dickson <steved@redhat.com>
It is OK for pageused == pagecount in the loop, as long as we don't add
another entry to the *pages array. Move the test so that it only triggers
in that case.
Reported-by: Steve Dickson <SteveD@redhat.com>
Fixes: bba5c1887a (nfs: disallow duplicate pages in pgio page vectors)
Cc: Weston Andros Adamson <dros@primarydata.com>
Cc: stable@vger.kernel.org # 3.16.x
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Pull scheduler updates from Ingo Molnar:
"The main changes in this cycle were:
- Optimized support for Intel "Cluster-on-Die" (CoD) topologies (Dave
Hansen)
- Various sched/idle refinements for better idle handling (Nicolas
Pitre, Daniel Lezcano, Chuansheng Liu, Vincent Guittot)
- sched/numa updates and optimizations (Rik van Riel)
- sysbench speedup (Vincent Guittot)
- capacity calculation cleanups/refactoring (Vincent Guittot)
- Various cleanups to thread group iteration (Oleg Nesterov)
- Double-rq-lock removal optimization and various refactorings
(Kirill Tkhai)
- various sched/deadline fixes
... and lots of other changes"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
sched/dl: Use dl_bw_of() under rcu_read_lock_sched()
sched/fair: Delete resched_cpu() from idle_balance()
sched, time: Fix build error with 64 bit cputime_t on 32 bit systems
sched: Improve sysbench performance by fixing spurious active migration
sched/x86: Fix up typo in topology detection
x86, sched: Add new topology for multi-NUMA-node CPUs
sched/rt: Use resched_curr() in task_tick_rt()
sched: Use rq->rd in sched_setaffinity() under RCU read lock
sched: cleanup: Rename 'out_unlock' to 'out_free_new_mask'
sched: Use dl_bw_of() under RCU read lock
sched/fair: Remove duplicate code from can_migrate_task()
sched, mips, ia64: Remove __ARCH_WANT_UNLOCKED_CTXSW
sched: print_rq(): Don't use tasklist_lock
sched: normalize_rt_tasks(): Don't use _irqsave for tasklist_lock, use task_rq_lock()
sched: Fix the task-group check in tg_has_rt_tasks()
sched/fair: Leverage the idle state info when choosing the "idlest" cpu
sched: Let the scheduler see CPU idle states
sched/deadline: Fix inter- exclusive cpusets migrations
sched/deadline: Clear dl_entity params when setscheduling to different class
sched/numa: Kill the wrong/dead TASK_DEAD check in task_numa_fault()
...
Pull RCU updates from Ingo Molnar:
"The main changes in this cycle were:
- changes related to No-CBs CPUs and NO_HZ_FULL
- RCU-tasks implementation
- torture-test updates
- miscellaneous fixes
- locktorture updates
- RCU documentation updates"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (81 commits)
workqueue: Use cond_resched_rcu_qs macro
workqueue: Add quiescent state between work items
locktorture: Cleanup header usage
locktorture: Cannot hold read and write lock
locktorture: Fix __acquire annotation for spinlock irq
locktorture: Support rwlocks
rcu: Eliminate deadlock between CPU hotplug and expedited grace periods
locktorture: Document boot/module parameters
rcutorture: Rename rcutorture_runnable parameter
locktorture: Add test scenario for rwsem_lock
locktorture: Add test scenario for mutex_lock
locktorture: Make torture scripting account for new _runnable name
locktorture: Introduce torture context
locktorture: Support rwsems
locktorture: Add infrastructure for torturing read locks
torture: Address race in module cleanup
locktorture: Make statistics generic
locktorture: Teach about lock debugging
locktorture: Support mutexes
locktorture: Add documentation
...
This update contains:
o various cleanups
o log recovery debug hooks
o seek hole/data implementation merge
o extent shift rework to fix collapse range bugs
o various sparse warning fixes
o log recovery transaction processing rework to fix use after free bugs
o metadata buffer IO infrastructuer rework to ensure all buffers under IO have
valid reference counts
o various fixes for ondisk flags, writeback and zero range corner cases
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJUOxyCAAoJEK3oKUf0dfodzt8QAKcFdE8hyCAnD8IK85v46gWG
IHnxOlTLrhs/22wfD1fSUcjCBQsQAIloQihvVGStugFnkEUHOUjlZ/oMcGNFPECC
L7B4Ns6WmA9TA8ibgYvLZepautNjzhS5/lGfqSWpw4hQPsJJp2fGyCVF/ZhwnP6D
qPeflVic8E8rgaJp98X8uFyZ+9EEoSF7/9EhmvVNwsO6UaThhIO/oPydx8oNrhKS
k6aADmxNYtFWJb6kUjFbXaJwrFIFLvJc60FZz2eUViVGBx6K8D5FBiVbzZKe2WZ6
VOz4fj63BYI7Nxk4rZGJPoyql+ChO/pIVwH15ZmYRkkgUXs8FGy85mNKMg7DHnFm
K/ZUhW5IBc6GtkwPCjNIM642IQYnTR5SdQfFxMS2EYPBUumcQ3EbD44aGZY69YYu
pP+2g4b1diadNkGACccj6teQ9V0fbyF0lfZqoZMeN/W0As6l9oYa0yFBGsK9sblq
yrPfce+wEy5HBy9M7Fqpvm3bwMunNViqilGZXKlOyodSgXxSF3JwXuc+8/TNwcnL
O0RSD7R7k6TvrmAntTgwT4beZi4ziG+/tVa0rD3mJM/sXyzcP2bwbA1APM74NcHh
p8mrJRci6vtKPwIylQ1xeCeK/WD21OhbJWBYR+0JOEJSnAjtv8flk7mqGLhy+M+Y
yCdHJIfuJ4NKj4X3f0Kc
=TdAB
-----END PGP SIGNATURE-----
Merge tag 'xfs-for-linus-3.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs
Pull xfs update from Dave Chinner:
"This update contains:
- various cleanups
- log recovery debug hooks
- seek hole/data implementation merge
- extent shift rework to fix collapse range bugs
- various sparse warning fixes
- log recovery transaction processing rework to fix use after free
bugs
- metadata buffer IO infrastructuer rework to ensure all buffers
under IO have valid reference counts
- various fixes for ondisk flags, writeback and zero range corner
cases"
* tag 'xfs-for-linus-3.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (56 commits)
xfs: fix agno increment in xfs_inumbers() loop
xfs: xfs_iflush_done checks the wrong log item callback
xfs: flush the range before zero range conversion
xfs: restore buffer_head unwritten bit on ioend cancel
xfs: check for null dquot in xfs_quota_calc_throttle()
xfs: fix crc field handling in xfs_sb_to/from_disk
xfs: don't send null bp to xfs_trans_brelse()
xfs: check for inode size overflow in xfs_new_eof()
xfs: only set extent size hint when asked
xfs: project id inheritance is a directory only flag
xfs: kill time.h
xfs: compat_xfs_bstat does not have forkoff
xfs: simplify xfs_zero_remaining_bytes
xfs: check xfs_buf_read_uncached returns correctly
xfs: introduce xfs_buf_submit[_wait]
xfs: kill xfs_bioerror_relse
xfs: xfs_bioerror can die.
xfs: kill xfs_bdstrat_cb
xfs: rework xfs_buf_bio_endio error handling
xfs: xfs_buf_ioend and xfs_buf_iodone_work duplicate functionality
...
Pull vfs updates from Al Viro:
"The big thing in this pile is Eric's unmount-on-rmdir series; we
finally have everything we need for that. The final piece of prereqs
is delayed mntput() - now filesystem shutdown always happens on
shallow stack.
Other than that, we have several new primitives for iov_iter (Matt
Wilcox, culled from his XIP-related series) pushing the conversion to
->read_iter()/ ->write_iter() a bit more, a bunch of fs/dcache.c
cleanups and fixes (including the external name refcounting, which
gives consistent behaviour of d_move() wrt procfs symlinks for long
and short names alike) and assorted cleanups and fixes all over the
place.
This is just the first pile; there's a lot of stuff from various
people that ought to go in this window. Starting with
unionmount/overlayfs mess... ;-/"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (60 commits)
fs/file_table.c: Update alloc_file() comment
vfs: Deduplicate code shared by xattr system calls operating on paths
reiserfs: remove pointless forward declaration of struct nameidata
don't need that forward declaration of struct nameidata in dcache.h anymore
take dname_external() into fs/dcache.c
let path_init() failures treated the same way as subsequent link_path_walk()
fix misuses of f_count() in ppp and netlink
ncpfs: use list_for_each_entry() for d_subdirs walk
vfs: move getname() from callers to do_mount()
gfs2_atomic_open(): skip lookups on hashed dentry
[infiniband] remove pointless assignments
gadgetfs: saner API for gadgetfs_create_file()
f_fs: saner API for ffs_sb_create_file()
jfs: don't hash direct inode
[s390] remove pointless assignment of ->f_op in vmlogrdr ->open()
ecryptfs: ->f_op is never NULL
android: ->f_op is never NULL
nouveau: __iomem misannotations
missing annotation in fs/file.c
fs: namespace: suppress 'may be used uninitialized' warnings
...
Besides the fact that this replacement improves code readability
it also protects from errors caused direct EXT4_S(sb)->s_es manipulation
which may result attempt to use uninitialized csum machinery.
#Testcase_BEGIN
IMG=/dev/ram0
MNT=/mnt
mkfs.ext4 $IMG
mount $IMG $MNT
#Enable feature directly on disk, on mounted fs
tune2fs -O metadata_csum $IMG
# Provoke metadata update, likey result in OOPS
touch $MNT/test
umount $MNT
#Testcase_END
# Replacement script
@@
expression E;
@@
- EXT4_HAS_RO_COMPAT_FEATURE(E, EXT4_FEATURE_RO_COMPAT_METADATA_CSUM)
+ ext4_has_metadata_csum(E)
https://bugzilla.kernel.org/show_bug.cgi?id=82201
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
caused a regression in xfs_inumbers, which in turn broke
xfsdump, causing incomplete dumps.
The loop in xfs_inumbers() needs to fill the user-supplied
buffers, and iterates via xfs_btree_increment, reading new
ags as needed.
But the first time through the loop, if xfs_btree_increment()
succeeds, we continue, which triggers the ++agno at the bottom
of the loop, and we skip to soon to the next ag - without
the proper setup under next_ag to read the next ag.
Fix this by removing the agno increment from the loop conditional,
and only increment agno if we have actually hit the code under
the next_ag: target.
Cc: stable@vger.kernel.org
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
This comment is 5 years outdated; init_file() no longer exists.
Signed-off-by: Eric Biggers <ebiggers3@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The following pairs of system calls dealing with extended attributes only
differ in their behavior on whether the symbolic link is followed (when
the named file is a symbolic link):
- setxattr() and lsetxattr()
- getxattr() and lgetxattr()
- listxattr() and llistxattr()
- removexattr() and lremovexattr()
Despite this, the implementations all had duplicated code, so this commit
redirects each of the above pairs of system calls to a corresponding
function to which different lookup flags (LOOKUP_FOLLOW or 0) are passed.
For me this reduced the stripped size of xattr.o from 8824 to 8248 bytes.
Signed-off-by: Eric Biggers <ebiggers3@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
As it is, path_lookupat() and path_mounpoint() might end up leaking struct file
reference in some cases.
Spotted-by: Eric Biggers <ebiggers3@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull security subsystem updates from James Morris.
Mostly ima, selinux, smack and key handling updates.
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (65 commits)
integrity: do zero padding of the key id
KEYS: output last portion of fingerprint in /proc/keys
KEYS: strip 'id:' from ca_keyid
KEYS: use swapped SKID for performing partial matching
KEYS: Restore partial ID matching functionality for asymmetric keys
X.509: If available, use the raw subjKeyId to form the key description
KEYS: handle error code encoded in pointer
selinux: normalize audit log formatting
selinux: cleanup error reporting in selinux_nlmsg_perm()
KEYS: Check hex2bin()'s return when generating an asymmetric key ID
ima: detect violations for mmaped files
ima: fix race condition on ima_rdwr_violation_check and process_measurement
ima: added ima_policy_flag variable
ima: return an error code from ima_add_boot_aggregate()
ima: provide 'ima_appraise=log' kernel option
ima: move keyring initialization to ima_init()
PKCS#7: Handle PKCS#7 messages that contain no X.509 certs
PKCS#7: Better handling of unsupported crypto
KEYS: Overhaul key identification when searching for asymmetric keys
KEYS: Implement binary asymmetric key ID handling
...
In patch 'ext4: refactor ext4_move_extents code base', Dmitry Monakhov has
refactored ext4_move_extents' implementation, but forgot to update the
corresponding comments, this patch will try to delete some useless comments.
Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Delalloc write journal reservations only reserve 1 credit,
to update the inode if necessary. However, it may happen
once in a filesystem's lifetime that a file will cross
the 2G threshold, and require the LARGE_FILE feature to
be set in the superblock as well, if it was not set already.
This overruns the transaction reservation, and can be
demonstrated simply on any ext4 filesystem without the LARGE_FILE
feature already set:
dd if=/dev/zero of=testfile bs=1 seek=2147483646 count=1 \
conv=notrunc of=testfile
sync
dd if=/dev/zero of=testfile bs=1 seek=2147483647 count=1 \
conv=notrunc of=testfile
leads to:
EXT4-fs: ext4_do_update_inode:4296: aborting transaction: error 28 in __ext4_handle_dirty_super
EXT4-fs error (device loop0) in ext4_do_update_inode:4301: error 28
EXT4-fs error (device loop0) in ext4_reserve_inode_write:4757: Readonly filesystem
EXT4-fs error (device loop0) in ext4_dirty_inode:4876: error 28
EXT4-fs error (device loop0) in ext4_da_write_end:2685: error 28
Adjust the number of credits based on whether the flag is
already set, and whether the current write may extend past the
LARGE_FILE limit.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@vger.kernel.org
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUNZK4AAoJEAAOaEEZVoIVI08P/iM7eaIVRnqaqtWw/JBzxiba
EMDlJYUBSlv6lYk9s8RJT4bMmcmGAKSYzVAHSoPahzNcqTDdFLeDTLGxJ8uKBbjf
d1qRRdH1yZHGUzCvJq3mEendjfXn435Y3YburUxjLfmzrzW7EbMvndiQsS5dhAm9
PEZ+wrKF/zFL7LuXa1YznYrbqOD/GRsJAXGEWc3kNwfS9avephVG/RI3GtpI2PJj
RY1mf8P7+WOlrShYoEuUo5aqs01MnU70LbqGHzY8/QKH+Cb0SOkCHZPZyClpiA+G
MMJ+o2XWcif3BZYz+dobwz/FpNZ0Bar102xvm2E8fqByr/T20JFjzooTKsQ+PtCk
DetQptrU2gtyZDKtInJUQSDPrs4cvA13TW+OEB1tT8rKBnmyEbY3/TxBpBTB9E6j
eb/V3iuWnywR3iE+yyvx24Qe7Pov6deM31s46+Vj+GQDuWmAUJXemhfzPtZiYpMT
exMXTyDS3j+W+kKqHblfU5f+Bh1eYGpG2m43wJVMLXKV7NwDf8nVV+Wea962ga+w
BAM3ia4JRVgRWJBPsnre3lvGT5kKPyfTZsoG+kOfRxiorus2OABoK+SIZBZ+c65V
Xh8VH5p3qyCUBOynXlHJWFqYWe2wH0LfbPrwe9dQwTwON51WF082EMG5zxTG0Ymf
J2z9Shz68zu0ok8cuSlo
=Hhee
-----END PGP SIGNATURE-----
Merge tag 'locks-v3.18-1' of git://git.samba.org/jlayton/linux
Pull file locking related changes from Jeff Layton:
"This release is a little more busy for file locking changes than the
last:
- a set of patches from Kinglong Mee to fix the lockowner handling in
knfsd
- a pile of cleanups to the internal file lease API. This should get
us a bit closer to allowing for setlease methods that can block.
There are some dependencies between mine and Bruce's trees this cycle,
and I based my tree on top of the requisite patches in Bruce's tree"
* tag 'locks-v3.18-1' of git://git.samba.org/jlayton/linux: (26 commits)
locks: fix fcntl_setlease/getlease return when !CONFIG_FILE_LOCKING
locks: flock_make_lock should return a struct file_lock (or PTR_ERR)
locks: set fl_owner for leases to filp instead of current->files
locks: give lm_break a return value
locks: __break_lease cleanup in preparation of allowing direct removal of leases
locks: remove i_have_this_lease check from __break_lease
locks: move freeing of leases outside of i_lock
locks: move i_lock acquisition into generic_*_lease handlers
locks: define a lm_setup handler for leases
locks: plumb a "priv" pointer into the setlease routines
nfsd: don't keep a pointer to the lease in nfs4_file
locks: clean up vfs_setlease kerneldoc comments
locks: generic_delete_lease doesn't need a file_lock at all
nfsd: fix potential lease memory leak in nfs4_setlease
locks: close potential race in lease_get_mtime
security: make security_file_set_fowner, f_setown and __f_setown void return
locks: consolidate "nolease" routines
locks: remove lock_may_read and lock_may_write
lockd: rip out deferred lock handling from testlock codepath
NFSD: Get reference of lockowner when coping file_lock
...
Pull btrfs updates from Chris Mason:
"The largest set of changes here come from Miao Xie. He's cleaning up
and improving read recovery/repair for raid, and has a number of
related fixes.
I've merged another set of fsync fixes from Filipe, and he's also
improved the way we handle metadata write errors to make sure we force
the FS readonly if things go wrong.
Otherwise we have a collection of fixes and cleanups. Dave Sterba
gets a cookie for removing the most lines (thanks Dave)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (139 commits)
btrfs: Fix compile error when CONFIG_SECURITY is not set.
Btrfs: fix compiles when CONFIG_BTRFS_FS_RUN_SANITY_TESTS is off
btrfs: Make btrfs handle security mount options internally to avoid losing security label.
Btrfs: send, don't delay dir move if there's a new parent inode
btrfs: add more superblock checks
Btrfs: fix race in WAIT_SYNC ioctl
Btrfs: be aware of btree inode write errors to avoid fs corruption
Btrfs: remove redundant btrfs_verify_qgroup_counts declaration.
btrfs: fix shadow warning on cmp
Btrfs: fix compilation errors under DEBUG
Btrfs: fix crash of btrfs_release_extent_buffer_page
Btrfs: add missing end_page_writeback on submit_extent_page failure
btrfs: Fix the wrong condition judgment about subset extent map
Btrfs: fix build_backref_tree issue with multiple shared blocks
Btrfs: cleanup error handling in build_backref_tree
btrfs: move checks for DUMMY_ROOT into a helper
btrfs: new define for the inline extent data start
btrfs: kill extent_buffer_page helper
btrfs: drop constant param from btrfs_release_extent_buffer_page
btrfs: hide typecast to definition of BTRFS_SEND_TRANS_STUB
...
Pull UDF and quota updates from Jan Kara:
"A few UDF fixes and also a few patches which are preparing filesystems
for support of project quotas in VFS"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
udf: Fix loading of special inodes
ocfs2: Back out change to use OCFS2_MAXQUOTAS in ocfs2_setattr()
udf: remove redundant sys_tz declaration
ocfs2: Don't use MAXQUOTAS value
reiserfs: Don't use MAXQUOTAS value
ext3: Don't use MAXQUOTAS value
udf: Fix race between write(2) and close(2)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCgAGBQJUNtXXAAoJENaSAD2qAscKx1UP/jqt4gm1AYFBpkwBhVRUssIQ
wHckk8QPasIdEGeKvyCKXl88sUDLSsJwf/mUpl8pBfKm64LokP2fmUBU9Pkf9hVU
lcuaFNIEmHh8p1IqcfaFbnZOjuuHc9M2ULQLmo5ShoTHNu6JYAP2zRBMfFrEdcMR
vKh+RCARa5jr1CdwTHX+dH5vJQIQXW/qgRK5G5Z6KeBI766jK2BvZxYHivUgLWuC
dV6K4RzvHHJYEVoXjCUhrgGepGwHlDoEgx/Y0GK9vbFPG38IrfSlN6fgvzV6mMYE
ien8FsuHPv5oiBmFM2byRmWpJtWUOViCbMaMmKqY5Ix21E0lUafA7ixH3nSpOHNZ
b29dxmHnEDomCJXLAF0NQUE84yTw6ITLp2FldUR2o+sidnJsDx/hph/KvmsK6d6P
sDfEN/DtzPluZoXKY0jrRtoAhi0citNTgKfrujmx6baBstxRp7AfwGcP4skXJq7w
wkg0Seo449CUaBJK9A4s7nIjMBQ6/3hjvF/NVcry1aY+/RyhJDz6uFrtnOEqW3So
6khwJOOcRkmLLXrk+DzvthDRCVNJhWM80cB5/UBBfVG2Kk3PHa86Jfp5Y4teWugx
zyoacEWiitKcK4Po8skZpLd2vUwny/cD0qS43LT+SMvgRsYZ4XdUnceoY8FkOnLr
ooIFKHCR/+2XVnAaC101
=mjS7
-----END PGP SIGNATURE-----
Merge tag 'ecryptfs-3.18-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs
Pull eCryptfs updates from Tyler Hicks:
"Minor code cleanups and a fix for when eCryptfs metadata is stored in
xattrs"
* tag 'ecryptfs-3.18-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
ecryptfs: remove unneeded buggy code in ecryptfs_do_create()
ecryptfs: avoid to access NULL pointer when write metadata in xattr
ecryptfs: remove unnecessary break after goto
ecryptfs: Remove unnecessary include of syscall.h in keystore.c
fs/ecryptfs/messaging.c: remove null test before kfree
ecryptfs: Drop cast
Use %pd in eCryptFS
which are now ignored (i_goal is basically a hint so it is safe to so this)
and another relating to the saving of the dirent location during rename.
There is one performance improvement, which is an optimisation in rgblk_free
so that multiple block deallocations will now be more efficient,
and one clean up patch to use _RET_IP_ rather than writing it out longhand.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.15 (GNU/Linux)
iQIcBAABAgAGBQJUNQIcAAoJEMrg3m4a/8jSKaQP+wT4SCjKI0TMkVqfyuyDjTAW
5TxV9h3B5sOxlroJY063WSZUFFWCuooLn4rIK+IT72Jju/AtW1NOb+kx6T8PZ+bT
5Qj7sReGNJADwWdFNlrE3l+7SecVHO2fxfoI5zKX06YgL8WDptR+nYo/Hn1kwQFL
4/cukJSkKMLzvrParR/S7RildlG98jQUhS1WtgxUhyyLVC3+b/9HvwAznq6JUX2U
DLePL2rliCTPZEEvq7tKZGi4uv3ZSbhfSgJHeQeYW565OAiYiJ4fmPFXK9LSyxvv
+WWoRLGkKDjwHLejkBwjwVbpDOrzVSXuwGrRh0DixL/0xwSAsraBMhiIr6XzXxq8
3S2uTbl+i+aeuIEwcE/OjDW1Mxp9GovuW0cKdpscVW/Hz6Id5asEtF+YUPBWFi/R
fFydsn7Gjrjtd2K++n+094N6kqhYo6Vud9es0kigp9D7pq4NSaoKSu74qz2sCWel
51SgysflmL92lXM2f619LgVL4IBdciECNKaC/JPiUJSc8K0MgUNWE4wreGoeTsuc
ceQZ8dW1i0w/3MVy5RbjPiEkNpN2ISNs0vDga07dr1Imy/6CJQR+hq9IdLEoku9V
iBJH46EorVWN4Flt2jeM3uvIc7j+w3kMc5Mxia8D4+aF5eIx/D44ucPb1zPDrILV
O+LD3XOy+kxecCn1wesC
=+QNn
-----END PGP SIGNATURE-----
Merge tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw
Pull gfs2 updates from Steven Whitehouse:
"This time we have a couple of bug fixes, one relating to bad i_goal
values which are now ignored (i_goal is basically a hint so it is safe
to so this) and another relating to the saving of the dirent location
during rename.
There is one performance improvement, which is an optimisation in
rgblk_free so that multiple block deallocations will now be more
efficient, and one clean up patch to use _RET_IP_ rather than writing
it out longhand"
* tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw:
GFS2: use _RET_IP_ instead of (unsigned long)__builtin_return_address(0)
GFS2: Use gfs2_rbm_incr in rgblk_free
GFS2: Make rename not save dirent location
GFS2: fix bad inode i_goal values during block allocation
Pull percpu updates from Tejun Heo:
"A lot of activities on percpu front. Notable changes are...
- percpu allocator now can take @gfp. If @gfp doesn't contain
GFP_KERNEL, it tries to allocate from what's already available to
the allocator and a work item tries to keep the reserve around
certain level so that these atomic allocations usually succeed.
This will replace the ad-hoc percpu memory pool used by
blk-throttle and also be used by the planned blkcg support for
writeback IOs.
Please note that I noticed a bug in how @gfp is interpreted while
preparing this pull request and applied the fix 6ae833c7fe
("percpu: fix how @gfp is interpreted by the percpu allocator")
just now.
- percpu_ref now uses longs for percpu and global counters instead of
ints. It leads to more sparse packing of the percpu counters on
64bit machines but the overhead should be negligible and this
allows using percpu_ref for refcnting pages and in-memory objects
directly.
- The switching between percpu and single counter modes of a
percpu_ref is made independent of putting the base ref and a
percpu_ref can now optionally be initialized in single or killed
mode. This allows avoiding percpu shutdown latency for cases where
the refcounted objects may be synchronously created and destroyed
in rapid succession with only a fraction of them reaching fully
operational status (SCSI probing does this when combined with
blk-mq support). It's also planned to be used to implement forced
single mode to detect underflow more timely for debugging.
There's a separate branch percpu/for-3.18-consistent-ops which cleans
up the duplicate percpu accessors. That branch causes a number of
conflicts with s390 and other trees. I'll send a separate pull
request w/ resolutions once other branches are merged"
* 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (33 commits)
percpu: fix how @gfp is interpreted by the percpu allocator
blk-mq, percpu_ref: start q->mq_usage_counter in atomic mode
percpu_ref: make INIT_ATOMIC and switch_to_atomic() sticky
percpu_ref: add PERCPU_REF_INIT_* flags
percpu_ref: decouple switching to percpu mode and reinit
percpu_ref: decouple switching to atomic mode and killing
percpu_ref: add PCPU_REF_DEAD
percpu_ref: rename things to prepare for decoupling percpu/atomic mode switch
percpu_ref: replace pcpu_ prefix with percpu_
percpu_ref: minor code and comment updates
percpu_ref: relocate percpu_ref_reinit()
Revert "blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe"
Revert "percpu: free percpu allocation info for uniprocessor system"
percpu-refcount: make percpu_ref based on longs instead of ints
percpu-refcount: improve WARN messages
percpu: fix locking regression in the failure path of pcpu_alloc()
percpu-refcount: add @gfp to percpu_ref_init()
proportions: add @gfp to init functions
percpu_counter: add @gfp to percpu_counter_init()
percpu_counter: make percpu_counters_lock irq-safe
...
Pull cgroup updates from Tejun Heo:
"Nothing too interesting. Just a handful of cleanup patches"
* 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
Revert "cgroup: remove redundant variable in cgroup_mount()"
cgroup: remove redundant variable in cgroup_mount()
cgroup: fix missing unlock in cgroup_release_agent()
cgroup: remove CGRP_RELEASABLE flag
perf/cgroup: Remove perf_put_cgroup()
cgroup: remove redundant check in cgroup_ino()
cpuset: simplify proc_cpuset_show()
cgroup: simplify proc_cgroup_show()
cgroup: use a per-cgroup work for release agent
cgroup: remove bogus comments
cgroup: remove redundant code in cgroup_rmdir()
cgroup: remove some useless forward declarations
cgroup: fix a typo in comment.
Increase the buffer-head per-CPU LRU size to allow efficient filesystem
operations that access many blocks for each transaction. For example,
creating a file in a large ext4 directory with quota enabled will access
multiple buffer heads and will overflow the LRU at the default 8-block LRU
size:
* parent directory inode table block (ctime, nlinks for subdirs)
* new inode bitmap
* inode table block
* 2 quota blocks
* directory leaf block (not reused, but pollutes one cache entry)
* 2 levels htree blocks (only one is reused, other pollutes cache)
* 2 levels indirect/index blocks (only one is reused)
The buffer-head per-CPU LRU size is raised to 16, as it shows in metadata
performance benchmarks up to 10% gain for create, 4% for lookup and 7% for
destroy.
Signed-off-by: Liang Zhen <liang.zhen@intel.com>
Signed-off-by: Andreas Dilger <andreas.dilger@intel.com>
Signed-off-by: Sebastien Buisson <sebastien.buisson@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Always mark pages with PageBalloon even if balloon compaction is disabled
and expose this mark in /proc/kpageflags as KPF_BALLOON.
Also this patch adds three counters into /proc/vmstat: "balloon_inflate",
"balloon_deflate" and "balloon_migrate". They accumulate balloon
activity. Current size of balloon is (balloon_inflate - balloon_deflate)
pages.
All generic balloon code now gathered under option CONFIG_MEMORY_BALLOON.
It should be selected by ballooning driver which wants use this feature.
Currently virtio-balloon is the only user.
Signed-off-by: Konstantin Khlebnikov <k.khlebnikov@samsung.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If a /proc/pid/pagemap read spans a [VMA, an unmapped region, then a
VM_SOFTDIRTY VMA], the virtual pages in the unmapped region are reported
as softdirty. Here's a program to demonstrate the bug:
int main() {
const uint64_t PAGEMAP_SOFTDIRTY = 1ul << 55;
uint64_t pme[3];
int fd = open("/proc/self/pagemap", O_RDONLY);;
char *m = mmap(NULL, 3 * getpagesize(), PROT_READ,
MAP_ANONYMOUS | MAP_SHARED, -1, 0);
munmap(m + getpagesize(), getpagesize());
pread(fd, pme, 24, (unsigned long) m / getpagesize() * 8);
assert(pme[0] & PAGEMAP_SOFTDIRTY); /* passes */
assert(!(pme[1] & PAGEMAP_SOFTDIRTY)); /* fails */
assert(pme[2] & PAGEMAP_SOFTDIRTY); /* passes */
return 0;
}
(Note that all pages in new VMAs are softdirty until cleared).
Tested:
Used the program given above. I'm going to include this code in
a selftest in the future.
[n-horiguchi@ah.jp.nec.com: prevent pagemap_pte_range() from overrunning]
Signed-off-by: Peter Feiner <pfeiner@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Jamie Liu <jamieliu@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix a deadlock problem caused by direct memory reclaim in o2net_wq. The
situation is as follows:
1) Receive a connect message from another node, node queues a
work_struct o2net_listen_work.
2) o2net_wq processes this work and call the following functions:
o2net_wq
-> o2net_accept_one
-> sock_create_lite
-> sock_alloc()
-> kmem_cache_alloc with GFP_KERNEL
-> ____cache_alloc_node
->__alloc_pages_nodemask
-> do_try_to_free_pages
-> shrink_slab
-> evict
-> ocfs2_evict_inode
-> ocfs2_drop_lock
-> dlmunlock
-> o2net_send_message_vec
then o2net_wq wait for the unlock reply from master.
3) tcp layer received the reply, call o2net_data_ready() and queue
sc_rx_work, waiting o2net_wq to process this work.
4) o2net_wq is a single thread workqueue, it process the work one by
one. Right now it is still doing o2net_listen_work and cannot handle
sc_rx_work. so we deadlock.
Junxiao Bi's patch "mm: clear __GFP_FS when PF_MEMALLOC_NOIO is set"
(http://ozlabs.org/~akpm/mmots/broken-out/mm-clear-__gfp_fs-when-pf_memalloc_noio-is-set.patch)
clears __GFP_FS in memalloc_noio_flags() besides __GFP_IO. We use
memalloc_noio_save() to set process flag PF_MEMALLOC_NOIO so that all
allocations done by this process are done as if GFP_NOIO was specified.
We are not reentering filesystem while doing memory reclaim.
Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9e7814404b "hold task->mempolicy while numa_maps scans." fixed the
race with the exiting task but this is not enough.
The current code assumes that get_vma_policy(task) should either see
task->mempolicy == NULL or it should be equal to ->task_mempolicy saved
by hold_task_mempolicy(), so we can never race with __mpol_put(). But
this can only work if we can't race with do_set_mempolicy(), and thus
we can't race with another do_set_mempolicy() or do_exit() after that.
However, do_set_mempolicy()->down_write(mmap_sem) can not prevent this
race. This task can exec, change it's ->mm, and call do_set_mempolicy()
after that; in this case they take 2 different locks.
Change hold_task_mempolicy() to use get_task_policy(), it never returns
NULL, and change show_numa_map() to use __get_vma_policy() or fall back
to proc_priv->task_mempolicy.
Note: this is the minimal fix, we will cleanup this code later. I think
hold_task_mempolicy() and release_task_mempolicy() should die, we can
move this logic into show_numa_map(). Or we can move get_task_policy()
outside of ->mmap_sem and !CONFIG_NUMA code at least.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sequential read from a block device is expected to be equal or faster than
from the file on a filesystem. But it is not correct due to the lack of
effective readpages() in the address space operations for block device.
This implements readpages() operation for block device by using
mpage_readpages() which can create multipage BIOs instead of BIOs for each
page and reduce system CPU time consumption.
Install 1GB of RAM disk storage:
# modprobe scsi_debug dev_size_mb=1024 delay=0
Sequential read from file on a filesystem:
# mkfs.ext4 /dev/$DEV
# mount /dev/$DEV /mnt
# fio --name=t --size=512m --rw=read --filename=/mnt/file
...
read : io=524288KB, bw=2133.4MB/s, iops=546133, runt= 240msec
Sequential read from a block device:
# fio --name=t --size=512m --rw=read --filename=/dev/$DEV
...
(Without this commit)
read : io=524288KB, bw=1700.2MB/s, iops=435455, runt= 301msec
(With this commit)
read : io=524288KB, bw=2160.4MB/s, iops=553046, runt= 237msec
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add guard_bio_eod() check for mpage code in order to allow us to do IO
even on the odd last sectors of a device, even if the block size is some
multiple of the physical sector size.
Using mpage_readpages() for block device requires this guard check.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patchset implements readpages() operation for block device by using
mpage_readpages() which can create multipage BIOs instead of BIOs for each
page and reduce system CPU time consumption.
This patch (of 3):
guard_bh_eod() is used in submit_bh() to allow us to do IO even on the odd
last sectors of a device, even if the block size is some multiple of the
physical sector size. This makes guard_bh_eod() more generic and renames
it guard_bio_eod() so that we can use it without struct buffer_head
argument.
The reason for this change is that using mpage_readpages() for block
device requires to add this guard check in mpage code.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On some ARCHs modules range is eauql to vmalloc range. E.g on i686
"#define MODULES_VADDR VMALLOC_START"
"#define MODULES_END VMALLOC_END"
This will cause 2 duplicate program segments in /proc/kcore, and no flag
to indicate they are different. This is confusing. And usually people
who need check the elf header or read the content of kcore will check
memory ranges. Two program segments which are the same are unnecessary.
So check if the modules range is equal to vmalloc range. If so, just skip
adding the modules range.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Rename vm_is_stack() to task_of_stack() and change it to return
"struct task_struct *" rather than the global (and thus wrong in
general) pid_t.
- Add the new pid_of_stack() helper which calls task_of_stack() and
uses the right namespace to report the correct pid_t.
Unfortunately we need to define this helper twice, in task_mmu.c
and in task_nommu.c. perhaps it makes sense to add fs/proc/util.c
and move at least pid_of_stack/task_of_stack there to avoid the
code duplication.
- Change show_map_vma() and show_numa_map() to use the new helper.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Ungerer <gerg@uclinux.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
m_start() can use get_proc_task() instead, and "struct inode *"
provides more potentially useful info, see the next changes.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Ungerer <gerg@uclinux.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I do not know if CONFIG_PREEMPT/SMP is possible without CONFIG_MMU
but the usage of task->mm in m_stop(). The task can exit/exec before
we take mmap_sem, in this case m_stop() can hit NULL or unlock the
wrong rw_semaphore.
Also, this code uses priv->task != NULL to decide whether we need
up_read/mmput. This is correct, but we will probably kill priv->task.
Change m_start/m_stop to rely on IS_ERR_OR_NULL() like task_mmu.c does.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Copy-and-paste the changes from "fs/proc/task_mmu.c: shift mm_access()
from m_start() to proc_maps_open()" into task_nommu.c.
Change maps_open() to initialize priv->mm using proc_mem_open(), m_start()
can rely on atomic_inc_not_zero(mm_users) like task_mmu.c does.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change the main loop in m_start() to update m->version. Mostly for
consistency, but this can help to avoid the same loop if the very
1st ->show() fails due to seq_overflow().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add the "last_addr" optimization back. Like before, every ->show()
method checks !seq_overflow() and sets m->version = vma->vm_start.
However, it also checks that m_next_vma(vma) != NULL, otherwise it
sets m->version = -1 for the lockless "EOF" fast-path in m_start().
m_start() can simply do find_vma() + m_next_vma() if last_addr is
not zero, the code looks clear and simple and this case is clearly
separated from "scan vmas" path.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Extract the tail_vma/vm_next calculation from m_next() into the new
trivial helper, m_next_vma().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that m->version is gone we can cleanup m_start(). In particular,
- Remove the "unsigned long" typecast, m->index can't be negative
or exceed ->map_count. But lets use "unsigned int pos" to make
it clear that "pos < map_count" is safe.
- Remove the unnecessary "vma != NULL" check in the main loop. It
can't be NULL unless we have a vm bug.
- This also means that "pos < map_count" case can simply return the
valid vma and avoid "goto" and subsequent checks.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
m_start() carefully documents, checks, and sets "m->version = -1" if
we are going to return NULL. The only problem is that we will be never
called again if m_start() returns NULL, so this is simply pointless
and misleading.
Otoh, ->show() methods m->version = 0 if vma == tail_vma and this is
just wrong, we want -1 in this case. And in fact we also want -1 if
->vm_next == NULL and ->tail_vma == NULL.
And it is not used consistently, the "scan vmas" loop in m_start()
should update last_addr too.
Finally, imo the whole "last_addr" logic in m_start() looks horrible.
find_vma(last_addr) is called unconditionally even if we are not going
to use the result. But the main problem is that this code participates
in tail_vma-or-NULL mess, and this looks simply unfixable.
Remove this optimization. We will add it back after some cleanups.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1. There is no reason to reset ->tail_vma in m_start(), if we return
IS_ERR_OR_NULL() it won't be used.
2. m_start() also clears priv->task to ensure that m_stop() won't use
the stale pointer if we fail before get_task_struct(). But this is
ugly and confusing, move this initialization in m_stop().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1. Kill the first "vma != NULL" check. Firstly this is not possible,
m_next() won't be called if ->start() or the previous ->next()
returns NULL.
And if it was possible the 2nd "vma != tail_vma" check is buggy,
we should not wrongly return ->tail_vma.
2. Make this function readable. The logic is very simple, we should
return check "vma != tail" once and return "vm_next || tail_vma".
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
m_start() drops ->mmap_sem and does mmput() if it retuns vsyscall
vma. This is because in this case m_stop()->vma_stop() obviously
can't use gate_vma->vm_mm.
Now that we have proc_maps_private->mm we can simplify this logic:
- Change m_start() to return with ->mmap_sem held unless it returns
IS_ERR_OR_NULL().
- Change vma_stop() to use priv->mm and avoid the ugly vma checks,
this makes "vm_area_struct *vma" unnecessary.
- This also allows m_start() to use vm_stop().
- Cleanup m_next() to follow the new locking rule.
Note: m_stop() looks very ugly, and this temporary uglifies it
even more. Fixed by the next change.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A simple test-case from Kirill Shutemov
cat /proc/self/maps >/dev/null
chmod +x /proc/self/net/packet
exec /proc/self/net/packet
makes lockdep unhappy, cat/exec take seq_file->lock + cred_guard_mutex in
the opposite order.
It's a false positive and probably we should not allow "chmod +x" on proc
files. Still I think that we should avoid mm_access() and cred_guard_mutex
in sys_read() paths, security checking should happen at open time. Besides,
this doesn't even look right if the task changes its ->mm between m_stop()
and m_start().
Add the new "mm_struct *mm" member into struct proc_maps_private and change
proc_maps_open() to initialize it using proc_mem_open(). Change m_start() to
use priv->mm if atomic_inc_not_zero(mm_users) succeeds or return NULL (eof)
otherwise.
The only complication is that proc_maps_open() users should additionally do
mmdrop() in fop->release(), add the new proc_map_release() helper for that.
Note: this is the user-visible change, if the task execs after open("maps")
the new ->mm won't be visible via this file. I hope this is fine, and this
matches /proc/pid/mem bahaviour.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Extract the mm_access() code from __mem_open() into the new helper,
proc_mem_open(), the next patch will add another caller.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
do_maps_open() and numa_maps_open() are overcomplicated, they could use
__seq_open_private(). Plus they do the same, just sizeof(*priv)
Change them to use a new simple helper, proc_maps_open(ops, psize). This
simplifies the code and allows us to do the next changes.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
get_gate_vma(priv->task->mm) looks ugly and wrong, task->mm can be NULL or
it can changed by exec right after mm_access().
And in theory this race is not harmless, the task can exec and then later
exit and free the new mm_struct. In this case get_task_mm(oldmm) can't
help, get_gate_vma(task->mm) can read the freed/unmapped memory.
I think that priv->task should simply die and hold_task_mempolicy() logic
can be simplified. tail_vma logic asks for cleanups too.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For commit ocfs2 journal, ocfs2 journal thread will acquire the mutex
osb->journal->j_trans_barrier and wake up jbd2 commit thread, then it
will wait until jbd2 commit thread done. In order journal mode, jbd2
needs flushing dirty data pages first, and this needs get page lock.
So osb->journal->j_trans_barrier should be got before page lock.
But ocfs2_write_zero_page() and ocfs2_write_begin_inline() obey this
locking order, and this will cause deadlock and hung the whole cluster.
One deadlock catched is the following:
PID: 13449 TASK: ffff8802e2f08180 CPU: 31 COMMAND: "oracle"
#0 [ffff8802ee3f79b0] __schedule at ffffffff8150a524
#1 [ffff8802ee3f7a58] schedule at ffffffff8150acbf
#2 [ffff8802ee3f7a68] rwsem_down_failed_common at ffffffff8150cb85
#3 [ffff8802ee3f7ad8] rwsem_down_read_failed at ffffffff8150cc55
#4 [ffff8802ee3f7ae8] call_rwsem_down_read_failed at ffffffff812617a4
#5 [ffff8802ee3f7b50] ocfs2_start_trans at ffffffffa0498919 [ocfs2]
#6 [ffff8802ee3f7ba0] ocfs2_zero_start_ordered_transaction at ffffffffa048b2b8 [ocfs2]
#7 [ffff8802ee3f7bf0] ocfs2_write_zero_page at ffffffffa048e9bd [ocfs2]
#8 [ffff8802ee3f7c80] ocfs2_zero_extend_range at ffffffffa048ec83 [ocfs2]
#9 [ffff8802ee3f7ce0] ocfs2_zero_extend at ffffffffa048edfd [ocfs2]
#10 [ffff8802ee3f7d50] ocfs2_extend_file at ffffffffa049079e [ocfs2]
#11 [ffff8802ee3f7da0] ocfs2_setattr at ffffffffa04910ed [ocfs2]
#12 [ffff8802ee3f7e70] notify_change at ffffffff81187d29
#13 [ffff8802ee3f7ee0] do_truncate at ffffffff8116bbc1
#14 [ffff8802ee3f7f50] sys_ftruncate at ffffffff8116bcbd
#15 [ffff8802ee3f7f80] system_call_fastpath at ffffffff81515142
RIP: 00007f8de750c6f7 RSP: 00007fffe786e478 RFLAGS: 00000206
RAX: 000000000000004d RBX: ffffffff81515142 RCX: 0000000000000000
RDX: 0000000000000200 RSI: 0000000000028400 RDI: 000000000000000d
RBP: 00007fffe786e040 R8: 0000000000000000 R9: 000000000000000d
R10: 0000000000000000 R11: 0000000000000206 R12: 000000000000000d
R13: 00007fffe786e710 R14: 00007f8de70f8340 R15: 0000000000028400
ORIG_RAX: 000000000000004d CS: 0033 SS: 002b
crash64> bt
PID: 7610 TASK: ffff88100fd56140 CPU: 1 COMMAND: "ocfs2cmt"
#0 [ffff88100f4d1c50] __schedule at ffffffff8150a524
#1 [ffff88100f4d1cf8] schedule at ffffffff8150acbf
#2 [ffff88100f4d1d08] jbd2_log_wait_commit at ffffffffa01274fd [jbd2]
#3 [ffff88100f4d1d98] jbd2_journal_flush at ffffffffa01280b4 [jbd2]
#4 [ffff88100f4d1dd8] ocfs2_commit_cache at ffffffffa0499b14 [ocfs2]
#5 [ffff88100f4d1e38] ocfs2_commit_thread at ffffffffa0499d38 [ocfs2]
#6 [ffff88100f4d1ee8] kthread at ffffffff81090db6
#7 [ffff88100f4d1f48] kernel_thread_helper at ffffffff81516284
crash64> bt
PID: 7609 TASK: ffff88100f2d4480 CPU: 0 COMMAND: "jbd2/dm-20-86"
#0 [ffff88100def3920] __schedule at ffffffff8150a524
#1 [ffff88100def39c8] schedule at ffffffff8150acbf
#2 [ffff88100def39d8] io_schedule at ffffffff8150ad6c
#3 [ffff88100def39f8] sleep_on_page at ffffffff8111069e
#4 [ffff88100def3a08] __wait_on_bit_lock at ffffffff8150b30a
#5 [ffff88100def3a58] __lock_page at ffffffff81110687
#6 [ffff88100def3ab8] write_cache_pages at ffffffff8111b752
#7 [ffff88100def3be8] generic_writepages at ffffffff8111b901
#8 [ffff88100def3c48] journal_submit_data_buffers at ffffffffa0120f67 [jbd2]
#9 [ffff88100def3cf8] jbd2_journal_commit_transaction at ffffffffa0121372[jbd2]
#10 [ffff88100def3e68] kjournald2 at ffffffffa0127a86 [jbd2]
#11 [ffff88100def3ee8] kthread at ffffffff81090db6
#12 [ffff88100def3f48] kernel_thread_helper at ffffffff81516284
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Alex Chen <alex.chen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The following case may lead to o2net_wq and o2hb thread deadlock on
o2hb_callback_sem.
Currently there are 2 nodes say N1, N2 in the cluster. And N2 down, at
the same time, N3 tries to join the cluster. So N1 will handle node
down (N2) and join (N3) simultaneously.
o2hb o2net_wq
->o2hb_do_disk_heartbeat
->o2hb_check_slot
->o2hb_run_event_list
->o2hb_fire_callbacks
->down_write(&o2hb_callback_sem)
->o2net_hb_node_down_cb
->flush_workqueue(o2net_wq)
->o2net_process_message
->dlm_query_join_handler
->o2hb_check_node_heartbeating
->o2hb_fill_node_map
->down_read(&o2hb_callback_sem)
No need to take o2hb_callback_sem in dlm_query_join_handler,
o2hb_live_lock is enough to protect live node map.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: xMark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: jiangyiwen <jiangyiwen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Firing quorum before connection established can cause unexpected node to
reboot.
Assume there are 3 nodes in the cluster, Node 1, 2, 3. Node 2 and 3 have
wrong ip address of Node 1 in cluster.conf and global heartbeat is enabled
in the cluster. After the heatbeats are started on these three nodes,
Node 1 will reboot due to quorum fencing. It is similar case if Node 1's
networking is not ready when starting the global heartbeat.
The reboot is not friendly as customer is not fully ready for ocfs2 to
work. Fix it by not allowing firing quorum before the connection is
established. In this case, ocfs2 will wait until the wrong configuration
is fixed or networking is up to continue. Also update the log to guide
the user where to check when connection is not built for a long time.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reduce boilerplate code by using seq_open_private() instead of seq_open()
Signed-off-by: Rob Jones <rob.jones@codethink.co.uk>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reduce boilerplate code by using seq_open_private() instead of seq_open()
Note that the code in and using sc_common_open() has been quite
extensively changed. Not least because there was a latent memory leak in
the code as was: if sc_common_open() failed, the previously allocated
buffer was not freed.
Signed-off-by: Rob Jones <rob.jones@codethink.co.uk>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reduce boilerplate code by using seq_open_private() instead of seq_open()
Signed-off-by: Rob Jones <rob.jones@codethink.co.uk>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the branch that free res->lockname.name because the condition
is never satisfied when jump to label error.
Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
dlm_lockres_put() should be called without &res->spinlock, otherwise a
deadlock case may happen.
spin_lock(&res->spinlock)
...
dlm_lockres_put
->dlm_lockres_release
->dlm_print_one_lock_resource
->spin_lock(&res->spinlock)
Signed-off-by: Alex Chen <alex.chen@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In o2net_init, if malloc failed, it directly returns -ENOMEM. Then
o2quo_exit won't be called in init_o2nm.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: joyce.xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ocfs2_inode_info->ip_clusters and ocfs2_dinode->id1.bitmap1.i_total are
defined as type u32, so the shift left operations may overflow if volume
size is large, for example, 2TB and cluster size is 1MB.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Alex Chen <alex.chen@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Refactoring error handling in dlm_alloc_ctxt to simplify code.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Alex Chen <alex.chen@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is supposed to zero pv_minor.
Reported-by: Himangi Saraogi <himangi774@gmail.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fs/ntfs/debug.c:124: WARNING: space prohibited between function name and
open parenthesis '('
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Signed-off-by: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman's commit 2457aec637 ("mm: non-atomically mark page accessed
during page cache allocation where possible") removed mark_page_accessed()
calls from NTFS without updating the matching find_lock_page() to
find_get_page_flags(GFP_LOCK | FGP_ACCESSED) thus causing the page to
never be marked accessed.
This patch fixes that.
Signed-off-by: Anton Altaparmakov <anton@tuxera.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
According to commit 80af258867 ("fanotify: groups can specify their
f_flags for new fd"), file descriptors created as part of file access
notification events inherit flags from the event_f_flags argument passed
to syscall fanotify_init(2)[1].
Unfortunately O_CLOEXEC is currently silently ignored.
Indeed, event_f_flags are only given to dentry_open(), which only seems to
care about O_ACCMODE and O_PATH in do_dentry_open(), O_DIRECT in
open_check_o_direct() and O_LARGEFILE in generic_file_open().
It's a pity, since, according to some lookup on various search engines and
http://codesearch.debian.net/, there's already some userspace code which
use O_CLOEXEC:
- in systemd's readahead[2]:
fanotify_fd = fanotify_init(FAN_CLOEXEC|FAN_NONBLOCK, O_RDONLY|O_LARGEFILE|O_CLOEXEC|O_NOATIME);
- in clsync[3]:
#define FANOTIFY_EVFLAGS (O_LARGEFILE|O_RDONLY|O_CLOEXEC)
int fanotify_d = fanotify_init(FANOTIFY_FLAGS, FANOTIFY_EVFLAGS);
- in examples [4] from "Filesystem monitoring in the Linux
kernel" article[5] by Aleksander Morgado:
if ((fanotify_fd = fanotify_init (FAN_CLOEXEC,
O_RDONLY | O_CLOEXEC | O_LARGEFILE)) < 0)
Additionally, since commit 48149e9d3a ("fanotify: check file flags
passed in fanotify_init"). having O_CLOEXEC as part of fanotify_init()
second argument is expressly allowed.
So it seems expected to set close-on-exec flag on the file descriptors if
userspace is allowed to request it with O_CLOEXEC.
But Andrew Morton raised[6] the concern that enabling now close-on-exec
might break existing applications which ask for O_CLOEXEC but expect the
file descriptor to be inherited across exec().
In the other hand, as reported by Mihai Dontu[7] close-on-exec on the file
descriptor returned as part of file access notify can break applications
due to deadlock. So close-on-exec is needed for most applications.
More, applications asking for close-on-exec are likely expecting it to be
enabled, relying on O_CLOEXEC being effective. If not, it might weaken
their security, as noted by Jan Kara[8].
So this patch replaces call to macro get_unused_fd() by a call to function
get_unused_fd_flags() with event_f_flags value as argument. This way
O_CLOEXEC flag in the second argument of fanotify_init(2) syscall is
interpreted and close-on-exec get enabled when requested.
[1] http://man7.org/linux/man-pages/man2/fanotify_init.2.html
[2] http://cgit.freedesktop.org/systemd/systemd/tree/src/readahead/readahead-collect.c?id=v208#n294
[3] https://github.com/xaionaro/clsync/blob/v0.2.1/sync.c#L1631https://github.com/xaionaro/clsync/blob/v0.2.1/configuration.h#L38
[4] http://www.lanedo.com/~aleksander/fanotify/fanotify-example.c
[5] http://www.lanedo.com/2013/filesystem-monitoring-linux-kernel/
[6] http://lkml.kernel.org/r/20141001153621.65e9258e65a6167bf2e4cb50@linux-foundation.org
[7] http://lkml.kernel.org/r/20141002095046.3715eb69@mdontu-l
[8] http://lkml.kernel.org/r/20141002104410.GB19748@quack.suse.cz
Link: http://lkml.kernel.org/r/cover.1411562410.git.ydroneaud@opteya.com
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Tested-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Mihai Don\u021bu <mihai.dontu@gmail.com>
Cc: Pádraig Brady <P@draigBrady.com>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Jan Kara <jack@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Michael Kerrisk-manpages <mtk.manpages@gmail.com>
Cc: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Cc: Richard Guy Briggs <rgb@redhat.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On some failure paths we may attempt to free user context even if it
wasn't assigned yet. This will cause a NULL ptr deref and a kernel BUG.
The path I was looking at is in inotify_new_group():
oevent = kmalloc(sizeof(struct inotify_event_info), GFP_KERNEL);
if (unlikely(!oevent)) {
fsnotify_destroy_group(group);
return ERR_PTR(-ENOMEM);
}
fsnotify_destroy_group() would get called here, but
group->inotify_data.user is only getting assigned later:
group->inotify_data.user = get_current_user();
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: John McCutchan <john@johnmccutchan.com>
Cc: Robert Love <rlove@rlove.org>
Cc: Eric Paris <eparis@parisplace.org>
Reviewed-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some UDF media have special inodes (like VAT or metadata partition
inodes) whose link_count is 0. Thus commit 4071b91362 (udf: Properly
detect stale inodes) broke loading these inodes because udf_iget()
started returning -ESTALE for them. Since we still need to properly
detect stale inodes queried by NFS, create two variants of udf_iget() -
one which is used for looking up special inodes (which ignores
link_count == 0) and one which is used for other cases which return
ESTALE when link_count == 0.
Fixes: 4071b91362
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Pull timer updates from Thomas Gleixner:
"Nothing really exciting this time:
- a few fixlets in the NOHZ code
- a new ARM SoC timer abomination. One should expect that we have
enough of them already, but they insist on inventing new ones.
- the usual bunch of ARM SoC timer updates. That feels like herding
cats"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource: arm_arch_timer: Consolidate arch_timer_evtstrm_enable
clocksource: arm_arch_timer: Enable counter access for 32-bit ARM
clocksource: arm_arch_timer: Change clocksource name if CP15 unavailable
clocksource: sirf: Disable counter before re-setting it
clocksource: cadence_ttc: Add support for 32bit mode
clocksource: tcb_clksrc: Sanitize IRQ request
clocksource: arm_arch_timer: Discard unavailable timers correctly
clocksource: vf_pit_timer: Support shutdown mode
ARM: meson6: clocksource: Add Meson6 timer support
ARM: meson: documentation: Add timer documentation
clocksource: sh_tmu: Document r8a7779 binding
clocksource: sh_mtu2: Document r7s72100 binding
clocksource: sh_cmt: Document SoC specific bindings
timerfd: Remove an always true check
nohz: Avoid tick's double reprogramming in highres mode
nohz: Fix spurious periodic tick behaviour in low-res dynticks mode
It would make more sense to pass char __user * instead of
char * in callers of do_mount() and do getname() inside do_mount().
Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Seunghun Lee <waydi1@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
hashed dentry can be passed to ->atomic_open() only if
a) it has just passed revalidation and
b) it's negative
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The gcc version 4.9.1 compiler complains Even though it isn't possible for
these variables to not get initialized before they are used.
fs/namespace.c: In function ‘SyS_mount’:
fs/namespace.c:2720:8: warning: ‘kernel_dev’ may be used uninitialized in this function [-Wmaybe-uninitialized]
ret = do_mount(kernel_dev, kernel_dir->name, kernel_type, flags,
^
fs/namespace.c:2699:8: note: ‘kernel_dev’ was declared here
char *kernel_dev;
^
fs/namespace.c:2720:8: warning: ‘kernel_type’ may be used uninitialized in this function [-Wmaybe-uninitialized]
ret = do_mount(kernel_dev, kernel_dir->name, kernel_type, flags,
^
fs/namespace.c:2697:8: note: ‘kernel_type’ was declared here
char *kernel_type;
^
Fix the warnings by simplifying copy_mount_string() as suggested by Al Viro.
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This patch makes it possible to kill a process looping in
cont_expand_zero. A process may spend a lot of time in this function, so
it is desirable to be able to kill it.
It happened to me that I wanted to copy a piece data from the disk to a
file. By mistake, I used the "seek" parameter to dd instead of "skip". Due
to the "seek" parameter, dd attempted to extend the file and became stuck
doing so - the only possibility was to reset the machine or wait many
hours until the filesystem runs out of space and cont_expand_zero fails.
We need this patch to be able to terminate the process.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
total_objects could be 0 and is used as a denom.
While total_objects is a "long", total_objects == 0 unlikely happens for
3.12 and later kernels because 32-bit architectures would not be able to
hold (1 << 32) objects. However, total_objects == 0 may happen for kernels
between 3.1 and 3.11 because total_objects in prune_super() was an "int"
and (e.g.) x86_64 architecture might be able to hold (1 << 32) objects.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: stable <stable@kernel.org> # 3.1+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
schedule_delayed_work() happening when the work is already pending is
a cheap no-op. Don't bother with ->wbuf_queued logics - it's both
broken (cancelling ->wbuf_dwork leaves it set, as spotted by Jeff Harris)
and pointless. It's cheaper to let schedule_delayed_work() handle that
case.
Reported-by: Jeff Harris <jefftharris@gmail.com>
Tested-by: Jeff Harris <jefftharris@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... rather than doing that in the guts of ->load_binary().
[updated to fix the bug spotted by Shentino - for SIGSEGV we really need
something stronger than send_sig_info(); again, better do that in one place]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
the only in-tree instance checks d_unhashed() anyway,
out-of-tree code can preserve the current behaviour by
adding such check if they want it and we get an ability
to use it in cases where we *want* to be notified of
killing being inevitable before ->d_lock is dropped,
whether it's unhashed or not. In particular, autofs
would benefit from that.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The only reason for games with ->d_prune() was __d_drop(), which
was needed only to force dput() into killing the sucker off.
Note that lock_parent() can be called under ->i_lock and won't
drop it, so dentry is safe from somebody managing to kill it
under us - it won't happen while we are holding ->i_lock.
__dentry_kill() is called only with ->d_lockref.count being 0
(here and when picked from shrink list) or 1 (dput() and dropping
the ancestors in shrink_dentry_list()), so it will never be called
twice - the first thing it's doing is making ->d_lockref.count
negative and once that happens, nothing will increment it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Now that d_invalidate always succeeds and flushes mount points use
it in stead of a combination of shrink_dcache_parent and d_drop
in proc_flush_task_mnt. This removes the danger of a mount point
under /proc/<pid>/... becoming unreachable after the d_drop.
Reviewed-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Now that d_invalidate always succeeds it is not longer necessary or
desirable to hard code d_drop calls into filesystem specific
d_revalidate implementations.
Remove the unnecessary d_drop calls and rely on d_invalidate
to drop the dentries. Using d_invalidate ensures that paths
to mount points will not be dropped.
Reviewed-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Now that d_invalidate can no longer fail, stop returning a useless
return code. For the few callers that checked the return code update
remove the handling of d_invalidate failure.
Reviewed-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Now that d_invalidate is the only caller of check_submounts_and_drop,
expand check_submounts_and_drop inline in d_invalidate.
Reviewed-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Now that check_submounts_and_drop can not fail and is called from
d_invalidate there is no longer a need to call check_submounts_and_drom
from filesystem d_revalidate methods so remove it.
Reviewed-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
With the introduction of mount namespaces and bind mounts it became
possible to access files and directories that on some paths are mount
points but are not mount points on other paths. It is very confusing
when rm -rf somedir returns -EBUSY simply because somedir is mounted
somewhere else. With the addition of user namespaces allowing
unprivileged mounts this condition has gone from annoying to allowing
a DOS attack on other users in the system.
The possibility for mischief is removed by updating the vfs to support
rename, unlink and rmdir on a dentry that is a mountpoint and by
lazily unmounting mountpoints on deleted dentries.
In particular this change allows rename, unlink and rmdir system calls
on a dentry without a mountpoint in the current mount namespace to
succeed, and it allows rename, unlink, and rmdir performed on a
distributed filesystem to update the vfs cache even if when there is a
mount in some namespace on the original dentry.
There are two common patterns of maintaining mounts: Mounts on trusted
paths with the parent directory of the mount point and all ancestory
directories up to / owned by root and modifiable only by root
(i.e. /media/xxx, /dev, /dev/pts, /proc, /sys, /sys/fs/cgroup/{cpu,
cpuacct, ...}, /usr, /usr/local). Mounts on unprivileged directories
maintained by fusermount.
In the case of mounts in trusted directories owned by root and
modifiable only by root the current parent directory permissions are
sufficient to ensure a mount point on a trusted path is not removed
or renamed by anyone other than root, even if there is a context
where the there are no mount points to prevent this.
In the case of mounts in directories owned by less privileged users
races with users modifying the path of a mount point are already a
danger. fusermount already uses a combination of chdir,
/proc/<pid>/fd/NNN, and UMOUNT_NOFOLLOW to prevent these races. The
removable of global rename, unlink, and rmdir protection really adds
nothing new to consider only a widening of the attack window, and
fusermount is already safe against unprivileged users modifying the
directory simultaneously.
In principle for perfect userspace programs returning -EBUSY for
unlink, rmdir, and rename of dentires that have mounts in the local
namespace is actually unnecessary. Unfortunately not all userspace
programs are perfect so retaining -EBUSY for unlink, rmdir and rename
of dentries that have mounts in the current mount namespace plays an
important role of maintaining consistency with historical behavior and
making imperfect userspace applications hard to exploit.
v2: Remove spurious old_dentry.
v3: Optimized shrink_submounts_and_drop
Removed unsued afs label
v4: Simplified the changes to check_submounts_and_drop
Do not rename check_submounts_and_drop shrink_submounts_and_drop
Document what why we need atomicity in check_submounts_and_drop
Rely on the parent inode mutex to make d_revalidate and d_invalidate
an atomic unit.
v5: Refcount the mountpoint to detach in case of simultaneous
renames.
Reviewed-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The new function detach_mounts comes in two pieces. The first piece
is a static inline test of d_mounpoint that returns immediately
without taking any locks if d_mounpoint is not set. In the common
case when mountpoints are absent this allows the vfs to continue
running with it's same cacheline foot print.
The second piece of detach_mounts __detach_mounts actually does the
work and it assumes that a mountpoint is present so it is slow and
takes namespace_sem for write, and then locks the mount hash (aka
mount_lock) after a struct mountpoint has been found.
With those two locks held each entry on the list of mounts on a
mountpoint is selected and lazily unmounted until all of the mount
have been lazily unmounted.
v7: Wrote a proper change description and removed the changelog
documenting deleted wrong turns.
Signed-off-by: Eric W. Biederman <ebiederman@twitter.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
I am shortly going to add a new user of struct mountpoint that
needs to look up existing entries but does not want to create
a struct mountpoint if one does not exist. Therefore to keep
the code simple and easy to read split out lookup_mountpoint
from new_mountpoint.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
To spot any possible problems call BUG if a mountpoint
is put when it's list of mounts is not empty.
AV: use hlist instead of list_head
Reviewed-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Eric W. Biederman <ebiederman@twitter.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In preparation for allowing mountpoints to be renamed and unlinked
in remote filesystems and in other mount namespaces test if on a dentry
there is a mount in the local mount namespace before allowing it to
be renamed or unlinked.
The primary motivation here are old versions of fusermount unmount
which is not safe if the a path can be renamed or unlinked while it is
verifying the mount is safe to unmount. More recent versions are simpler
and safer by simply using UMOUNT_NOFOLLOW when unmounting a mount
in a directory owned by an arbitrary user.
Miklos Szeredi <miklos@szeredi.hu> reports this is approach is good
enough to remove concerns about new kernels mixed with old versions
of fusermount.
A secondary motivation for restrictions here is that it removing empty
directories that have non-empty mount points on them appears to
violate the rule that rmdir can not remove empty directories. As
Linus Torvalds pointed out this is useful for programs (like git) that
test if a directory is empty with rmdir.
Therefore this patch arranges to enforce the existing mount point
semantics for local mount namespace.
v2: Rewrote the test to be a drop in replacement for d_mountpoint
v3: Use bool instead of int as the return type of is_local_mountpoint
Reviewed-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The current comments in d_invalidate about what and why it is doing
what it is doing are wildly off-base. Which is not surprising as
the comments date back to last minute bug fix of the 2.2 kernel.
The big fat lie of a comment said: If it's a directory, we can't drop
it for fear of somebody re-populating it with children (even though
dropping it would make it unreachable from that root, we still might
repopulate it if it was a working directory or similar).
[AV] What we really need to avoid is multiple dentry aliases of the
same directory inode; on all filesystems that have ->d_revalidate()
we either declare all positive dentries always valid (and thus never
fed to d_invalidate()) or use d_materialise_unique() and/or d_splice_alias(),
which take care of alias prevention.
The current rules are:
- To prevent mount point leaks dentries that are mount points or that
have childrent that are mount points may not be be unhashed.
- All dentries may be unhashed.
- Directories may be rehashed with d_materialise_unique
check_submounts_and_drop implements this already for well maintained
remote filesystems so implement the current rules in d_invalidate
by just calling check_submounts_and_drop.
The one difference between d_invalidate and check_submounts_and_drop
is that d_invalidate must respect it when a d_revalidate method has
earlier called d_drop so preserve the d_unhashed check in
d_invalidate.
Reviewed-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
d_drop or check_submounts_and_drop called from d_revalidate can result
in renamed directories with child dentries being unhashed. These
renamed and drop directory dentries can be rehashed after
d_materialise_unique uses d_find_alias to find them.
Reviewed-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
On final mntput() we want fs shutdown to happen before return to
userland; however, the only case where we want it happen right
there (i.e. where task_work_add won't do) is MNT_INTERNAL victim.
Those have to be fully synchronous - failure halfway through module
init might count on having vfsmount killed right there. Fortunately,
final mntput on MNT_INTERNAL vfsmounts happens on shallow stack.
So we handle those synchronously and do an analog of delayed fput
logics for everything else.
As the result, we are guaranteed that fs shutdown will always happen
on shallow stack.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Biederman's umount-on-rmdir series changes d_invalidate() to sumarily remove
mounts under the passed in dentry regardless of whether they are busy
or not. So calling this in fs/autofs4/expire.c:autofs4_tree_busy() is
definitely the wrong thing to do becuase it will silently umount entries
instead of just cleaning stale dentrys.
But this call shouldn't be needed and testing shows that automounting
continues to function without it.
As Al Viro correctly surmises the original intent of the call was to
perform what shrink_dcache_parent() does.
If at some time in the future I see stale dentries accumulating
following failed mounts I'll revisit the issue and possibly add a
shrink_dcache_parent() call if needed.
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* external dentry names get a small structure prepended to them
(struct external_name).
* it contains an atomic refcount, matching the number of struct dentry
instances that have ->d_name.name pointing to that external name. The
first thing free_dentry() does is decrementing refcount of external name,
so the instances that are between the call of free_dentry() and
RCU-delayed actual freeing do not contribute.
* __d_move(x, y, false) makes the name of x equal to the name of y,
external or not. If y has an external name, extra reference is grabbed
and put into x->d_name.name. If x used to have an external name, the
reference to the old name is dropped and, should it reach zero, freeing
is scheduled via kfree_rcu().
* free_dentry() in dentry with external name decrements the refcount of
that name and, should it reach zero, does RCU-delayed call that will
free both the dentry and external name. Otherwise it does what it
used to do, except that __d_free() doesn't even look at ->d_name.name;
it simply frees the dentry.
All non-RCU accesses to dentry external name are safe wrt freeing since they
all should happen before free_dentry() is called. RCU accesses might run
into a dentry seen by free_dentry() or into an old name that got already
dropped by __d_move(); however, in both cases dentry must have been
alive and refer to that name at some point after we'd done rcu_read_lock(),
which means that any freeing must be still pending.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
You cannot call pnfs_put_lseg_async() more than once per lseg, so it
is really an inappropriate way to deal with a refcount issue.
Instead, replace it with a function that decrements the refcount, and
puts the final 'free' operation (which is incompatible with locks) on
the workqueue.
Cc: Weston Andros Adamson <dros@primarydata.com>
Fixes: e6cf82d183: pnfs: add pnfs_put_lseg_async
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Accessing do_remount_sb should require global CAP_SYS_ADMIN, but
only one of the two call sites was appropriately protected.
Fixes CVE-2014-7975.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
nfs4_insert_deviceid_node() was removed in 661373b13d
Signed-off-by: Tom Haynes <loghyr@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Pull f2fs updates from Jaegeuk Kim:
"This patch-set introduces a couple of new features such as large
sector size, FITRIM, and atomic/volatile writes.
Several patches enhance power-off recovery and checkpoint routines.
The fsck.f2fs starts to support fixing corrupted partitions with
recovery hints provided by this patch-set.
Summary:
- retain some recovery information for fsck.f2fs
- enhance checkpoint speed
- enhance flush command management
- bug fix for lseek
- tune in-place-update policies
- enhance roll-forward speed
- revisit all the roll-forward and fsync rules
- support larget sector size
- support FITRIM
- support atomic and volatile writes
And several clean-ups and bug fixes are included"
* tag 'f2fs-for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (42 commits)
f2fs: support volatile operations for transient data
f2fs: support atomic writes
f2fs: remove unused return value
f2fs: clean up f2fs_ioctl functions
f2fs: potential shift wrapping buf in f2fs_trim_fs()
f2fs: call f2fs_unlock_op after error was handled
f2fs: check the use of macros on block counts and addresses
f2fs: refactor flush_nat_entries to remove costly reorganizing ops
f2fs: introduce FITRIM in f2fs_ioctl
f2fs: introduce cp_control structure
f2fs: use more free segments until SSR is activated
f2fs: change the ipu_policy option to enable combinations
f2fs: fix to search whole dirty segmap when get_victim
f2fs: fix to clean previous mount option when remount_fs
f2fs: skip punching hole in special condition
f2fs: support large sector size
f2fs: fix to truncate blocks past EOF in ->setattr
f2fs: update i_size when __allocate_data_block
f2fs: use MAX_BIO_BLOCKS(sbi)
f2fs: remove redundant operation during roll-forward recovery
...
Pull nfsd updates from Bruce Fields:
"Highlights:
- support the NFSv4.2 SEEK operation (allowing clients to support
SEEK_HOLE/SEEK_DATA), thanks to Anna.
- end the grace period early in a number of cases, mitigating a
long-standing annoyance, thanks to Jeff
- improve SMP scalability, thanks to Trond"
* 'for-3.18' of git://linux-nfs.org/~bfields/linux: (55 commits)
nfsd: eliminate "to_delegation" define
NFSD: Implement SEEK
NFSD: Add generic v4.2 infrastructure
svcrdma: advertise the correct max payload
nfsd: introduce nfsd4_callback_ops
nfsd: split nfsd4_callback initialization and use
nfsd: introduce a generic nfsd4_cb
nfsd: remove nfsd4_callback.cb_op
nfsd: do not clear rpc_resp in nfsd4_cb_done_sequence
nfsd: fix nfsd4_cb_recall_done error handling
nfsd4: clarify how grace period ends
nfsd4: stop grace_time update at end of grace period
nfsd: skip subsequent UMH "create" operations after the first one for v4.0 clients
nfsd: set and test NFSD4_CLIENT_STABLE bit to reduce nfsdcltrack upcalls
nfsd: serialize nfsdcltrack upcalls for a particular client
nfsd: pass extra info in env vars to upcalls to allow for early grace period end
nfsd: add a v4_end_grace file to /proc/fs/nfsd
lockd: add a /proc/fs/lockd/nlm_end_grace file
nfsd: reject reclaim request when client has already sent RECLAIM_COMPLETE
nfsd: remove redundant boot_time parm from grace_done client tracking op
...
Highlights include:
Stable fixes:
- fix an NFSv4.1 state renewal regression
- fix open/lock state recovery error handling
- fix lock recovery when CREATE_SESSION/SETCLIENTID_CONFIRM fails
- fix statd when reconnection fails
- Don't wake tasks during connection abort
- Don't start reboot recovery if lease check fails
- fix duplicate proc entries
Features:
- pNFS block driver fixes and clean ups from Christoph
- More code cleanups from Anna
- Improve mmap() writeback performance
- Replace use of PF_TRANS with a more generic mechanism for avoiding
deadlocks in nfs_release_page
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUMpFYAAoJEGcL54qWCgDywHYP/A7XNykwOGhoHVP1Cgr3xqoz
gVhAw97AEMZE8xSNVEGS++pJTe59JVzsIsYAwdHMwePV33l3zyzYorae6N9p7zWF
0xVaNQ4qNLVhbrNLAoB5KA/c3/jMnNjF5t15+8akZad5pt4kXLlhSKjyVpdEEtJE
A0eneXShMYEeLZoOJhpQt5bsw0OZ8YbWWEMjGlDqyeelvV3K1+zfivQOoyX6hS4w
XFkPEDmU7zunE/xFP9ZoUaVdLO0TvOWfEZ7STWoHm7NuWfPQiDb9w1mTnuZbZyka
ssezoGcitzwsjCcQ5e1iKTOoFRIsm/zYXFQgFQL7VFMBU1Tss9Of8047EyDkqcPF
GxctsGg0gQ2FkG7yx7JH7AKpyibOIuByQrQQ916coWSf7K0L4H4Rcky3vryroylP
1e1RI49xu215OTm+dLvlvYCv55bqCrTmaUGImZac18+ixD2eh6MNfW2ubSdxk89L
U2rTFV09Bd52N7IQOGQx1FBEI2ZnIFUV4UaFz7v+rGFxOnk6+WYe+iWyb4wC70Yc
8Jh/gTIQDd5aghql3FTieMOyfEvO6Re4pLMXmqEWMAevicx2t8DwkJriRu6X8Iy2
rlDlBPwu5QmRWC20Dc897f0VajwDtwdeB8puod7nobOWzOfx4FrNqLJ+jR3pmHUk
0otvJytqemXt+zkqqHKK
=/OQi
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-3.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Stable fixes:
- fix an NFSv4.1 state renewal regression
- fix open/lock state recovery error handling
- fix lock recovery when CREATE_SESSION/SETCLIENTID_CONFIRM fails
- fix statd when reconnection fails
- don't wake tasks during connection abort
- don't start reboot recovery if lease check fails
- fix duplicate proc entries
Features:
- pNFS block driver fixes and clean ups from Christoph
- More code cleanups from Anna
- Improve mmap() writeback performance
- Replace use of PF_TRANS with a more generic mechanism for avoiding
deadlocks in nfs_release_page"
* tag 'nfs-for-3.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (66 commits)
NFSv4.1: Fix an NFSv4.1 state renewal regression
NFSv4: fix open/lock state recovery error handling
NFSv4: Fix lock recovery when CREATE_SESSION/SETCLIENTID_CONFIRM fails
NFS: Fabricate fscache server index key correctly
SUNRPC: Add missing support for RPC_CLNT_CREATE_NO_RETRANS_TIMEOUT
NFSv3: Fix missing includes of nfs3_fs.h
NFS/SUNRPC: Remove other deadlock-avoidance mechanisms in nfs_release_page()
NFS: avoid waiting at all in nfs_release_page when congested.
NFS: avoid deadlocks with loop-back mounted NFS filesystems.
MM: export page_wakeup functions
SCHED: add some "wait..on_bit...timeout()" interfaces.
NFS: don't use STABLE writes during writeback.
NFSv4: use exponential retry on NFS4ERR_DELAY for async requests.
rpc: Add -EPERM processing for xs_udp_send_request()
rpc: return sent and err from xs_sendpages()
lockd: Try to reconnect if statd has moved
SUNRPC: Don't wake tasks during connection abort
Fixing lease renewal
nfs: fix duplicate proc entries
pnfs/blocklayout: Fix a 64-bit division/remainder issue in bl_map_stripe
...
Fix the following compile error when CONFIG_SECURITY is not set:
error: 'struct security_mnt_opts' has no member named 'num_mnt_opts'
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Commit fccb84c94 moved added some helpers to cleanup our sanity tests,
but it looks like both Dave and I always compile with the tests enabled.
This fixes things to work when they are turned off too.
Signed-off-by: Chris Mason <clm@fb.com>
This patch adds support for volatile writes which keep data pages in memory
until f2fs_evict_inode is called by iput.
For instance, we can use this feature for the sqlite database as follows.
While supporting atomic writes for main database file, we can keep its journal
data temporarily in the page cache by the following sequence.
1. open
-> ioctl(F2FS_IOC_START_VOLATILE_WRITE);
2. writes
: keep all the data in the page cache.
3. flush to the database file with atomic writes
a. ioctl(F2FS_IOC_START_ATOMIC_WRITE);
b. writes
c. ioctl(F2FS_IOC_COMMIT_ATOMIC_WRITE);
4. close
-> drop the cached data
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Like flock locks, leases are owned by the file description. Now that the
i_have_this_lease check in __break_lease is gone, we don't actually use
the fl_owner for leases for anything. So, it's now safe to set this more
appropriately to the same value as the fl_file.
While we're at it, fix up the comments over the fl_owner_t definition
since they're rather out of date.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>