of nasty outcomes when inodes are transitioned from the unlinked state
to the free state. Small file systems are particularly vulnerable to these
problems, and it can manifest as mainly hangs, but also file system
corruption. The patches have been tested for literally many weeks, with a
very gruelling test, so I have a high level of confidence.
- Andreas Gruenbacher wrote a series of 5 patches for various lockups
during the transition of inodes from unlinked to free. The main patch
is titled "Fix gfs2_lookup_by_inum lock inversion" and the other 4 are
support and cleanup patches related to that.
- Ben Marzinski contributed 2 patches with regard to a recreatable
problem when gfs2 tries to write a page to a file that is being
truncated, resulting in a BUG() in gfs2_remove_from_journal.
Note that Ben had to export vfs function __block_write_full_page to get
this to work properly. It's been posted a long time and he talked to
various VFS people about it, and nobody seemed to mind.
- I contributed 3 patches. (1) The first one fixes a memory corruptor:
a race in which one process can overwrite the gl_object pointer set by
another process, causing kernel panic and other symptoms. (2) The second
patch fixes another race that resulted in a false-positive BUG_ON. This
occurred when resource group reservations were freed by one process
while another process was trying to grab a new reservation in the same
resource group. (3) The third patch fixes a problem with doing journal
replay when the journals are not all the same size.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJXklXIAAoJENeLYdPf93o7AbIIAImLEixK+4CaItEArAKG9TXv
WbO+eDJfo6AOtAteB6+MdX2UxXAHJsCY6RmiEIAi5LzlVFiiCgRo4z/QgDARAw3c
2RxlndElaESh82S27sLiFbgZeY7GZv04C0t6AzMkc830BLXiKMs6bXfeq1fzW8Sf
AgAneACVsX0faRWo/XDuQcK81dwZ+qdOnR2+FvtOSFl1KgV0BrtnsW7IHv+5MIot
SREDN7VvSQwQrLgwMlC0PvhwK3KCVvuO9ZziLEPpYJONESJfEmuCpG265+tUJNTw
dIcW3p/vvgow8fb56fSnAxaeplPLlF9qJCq1M9fWZrKVbHg2uyCZMx4P52Fnmz4=
=uUVs
-----END PGP SIGNATURE-----
Merge tag 'gfs2-4.7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Bob Peterson:
"We've got ten patches this time, half of which are related to a
plethora of nasty outcomes when inodes are transitioned from the
unlinked state to the free state. Small file systems are particularly
vulnerable to these problems, and it can manifest as mainly hangs, but
also file system corruption. The patches have been tested for
literally many weeks, with a very gruelling test, so I have a high
level of confidence.
- Andreas Gruenbacher wrote a series of five patches for various
lockups during the transition of inodes from unlinked to free.
The main patch is titled "Fix gfs2_lookup_by_inum lock inversion"
and the other four are support and cleanup patches related to that.
- Ben Marzinski contributed two patches with regard to a recreatable
problem when gfs2 tries to write a page to a file that is being
truncated, resulting in a BUG() in gfs2_remove_from_journal.
Note that Ben had to export vfs function __block_write_full_page to
get this to work properly. It's been posted a long time and he
talked to various VFS people about it, and nobody seemed to mind.
- I contributed 3 patches:
o The first one fixes a memory corruptor: a race in which one
process can overwrite the gl_object pointer set by another
process, causing kernel panic and other symptoms.
o The second patch fixes another race that resulted in a
false-positive BUG_ON. This occurred when resource group
reservations were freed by one process while another process
was trying to grab a new reservation in the same resource
group.
o The third patch fixes a problem with doing journal replay when
the journals are not all the same size"
* tag 'gfs2-4.7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
GFS2: Fix gfs2_replay_incr_blk for multiple journal sizes
GFS2: Check rs_free with rd_rsspin protection
gfs2: writeout truncated pages
fs: export __block_write_full_page
gfs2: Lock holder cleanup
gfs2: Large-filesystem fix for 32-bit systems
gfs2: Get rid of gfs2_ilookup
gfs2: Fix gfs2_lookup_by_inum lock inversion
gfs2: Initialize iopen glock holder for new inodes
GFS2: don't set rgrp gl_object until it's inserted into rgrp tree
Before this patch, if you used gfs2_jadd to add new journals of a
size smaller than the existing journals, replaying those new journals
would withdraw. That's because function gfs2_replay_incr_blk was
using the number of journal blocks (jd_block) from the superblock's
journal pointer. In other words, "My journal's max size" rather than
"the journal we're replaying's size." This patch changes the function
to use the size of the pertinent journal rather than always using the
journal we happen to be using.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
For the last process to close a file opened for write, function
gfs2_rsqa_delete was deleting the file's inode's block reservation
out of the rgrp reservations tree. Then it was checking to make sure
rs_free was 0, but it was performing the check outside the protection
of rd_rsspin spin_lock. The rd_rsspin spin_lock protection is needed
to prevent a race between the process freeing the reservation and
another who is allocating a new set of blocks inside the same rgrp
for the same inode, thus changing its value.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
->atomic_open() can be given an in-lookup dentry *or* a negative one
found in dcache. Use d_in_lookup() to tell one from another, rather
than d_unhashed().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When gfs2 attempts to write a page to a file that is being truncated,
and notices that the page is completely outside of the file size, it
tries to invalidate it. However, this may require a transaction for
journaled data files to revoke any buffers from the page on the active
items list. Unfortunately, this can happen inside a log flush, where a
transaction cannot be started. Also, gfs2 may need to be able to remove
the buffer from the ail1 list before it can finish the log flush.
To deal with this, when writing a page of a file with data journalling
enabled gfs2 now skips the check to see if the write is outside the file
size, and simply writes it anyway. This situation can only occur when
the truncate code still has the file locked exclusively, and hasn't
marked this block as free in the metadata (which happens later in
truc_dealloc). After gfs2 writes this page out, the truncation code
will shortly invalidate it and write out any revokes if necessary.
To do this, gfs2 now implements its own version of block_write_full_page
without the check, and calls the newly exported __block_write_full_page.
It also no longer calls gfs2_writepage_common from gfs2_jdata_writepage.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Make the code more readable by cleaning up the different ways of
initializing lock holders and checking for initialized lock holders:
mark lock holders as uninitialized by setting the holder's glock to NULL
(gfs2_holder_mark_uninitialized) instead of zeroing out the entire
object or using a separate flag. Recognize initialized holders by their
non-NULL glock (gfs2_holder_initialized). Don't zero out holder objects
which are immeditiately initialized via gfs2_holder_init or
gfs2_glock_nq_init.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Commit ff34245d switched from iget5_locked to iget_locked among other
things, but iget_locked doesn't work for filesystems larger than 2^32
blocks on 32-bit systems. Switch back to iget5_locked. Filesystems
larger than 2^32 blocks are unrealistic to work well on 32-bit systems,
so this is mostly a code cleanliness fix.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Now that gfs2_lookup_by_inum only takes the inode glock for new inodes
(and not for cached inodes anymore), there no longer is a need to
optimize the cached-inode case in gfs2_get_dentry or delete_work_func,
and gfs2_ilookup can be removed.
In addition, gfs2_get_dentry wasn't checking the GFS2_DIF_SYSTEM flag in
i_diskflags in the gfs2_ilookup case (see gfs2_lookup_by_inum); this
inconsistency goes away as well.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
The current gfs2_lookup_by_inum takes the glock of a presumed inode
identified by block number, verifies that the block is indeed an inode,
and then instantiates and reads the new inode via gfs2_inode_lookup.
However, instantiating a new inode may block on freeing a previous
instance of that inode (__wait_on_freeing_inode), and freeing an inode
requires to take the glock already held, leading to lock inversion and
deadlock.
Fix this by first instantiating the new inode, then verifying that the
block is an inode (if required), and then reading in the new inode, all
in gfs2_inode_lookup.
If the block we are looking for is not an inode, we discard the new
inode via iget_failed, which marks inodes as bad and unhashes them.
Other tasks waiting on that inode will get back a bad inode back from
ilookup or iget_locked; in that case, retry the lookup.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
In gfs2_init_inode_once, initialize inode->i_iopen_gh.gh_gl to NULL:
otherwise, when gfs2_inode_lookup fails, the iopen glock holder can
remain unset and iget_failed can end up accessing random memory.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Before this patch, function read_rindex_entry would set a rgrp
glock's gl_object pointer to itself before inserting the rgrp into
the rgrp rbtree. The problem is: if another process was also reading
the rgrp in, and had already inserted its newly created rgrp, then
the second call to read_rindex_entry would overwrite that value,
then return a bad return code to the caller. Later, other functions
would reference the now-freed rgrp memory by way of gl_object.
In some cases, that could result in gfs2_rgrp_brelse being called
twice for the same rgrp: once for the failed attempt and once for
the "real" rgrp release. Eventually the kernel would panic.
There are also a number of other things that could go wrong when
a kernel module is accessing freed storage. For example, this could
result in rgrp corruption because the fake rgrp would point to a
fake bitmap in memory too, causing gfs2_inplace_reserve to search
some random memory for free blocks, and find some, since we were
never setting rgd->rd_bits to NULL before freeing it.
This patch fixes the problem by not setting gl_object until we
have successfully inserted the rgrp into the rbtree. Also, it sets
rd_bits to NULL as it frees them, which will ensure any accidental
access to the wrong rgrp will result in a kernel panic rather than
file system corruption, which is preferred.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Pull vfs fixes from Al Viro:
"Followups to the parallel lookup work:
- update docs
- restore killability of the places that used to take ->i_mutex
killably now that we have down_write_killable() merged
- Additionally, it turns out that I missed a prerequisite for
security_d_instantiate() stuff - ->getxattr() wasn't the only thing
that could be called before dentry is attached to inode; with smack
we needed the same treatment applied to ->setxattr() as well"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
switch ->setxattr() to passing dentry and inode separately
switch xattr_handler->set() to passing dentry and inode separately
restore killability of old mutex_lock_killable(&inode->i_mutex) users
add down_write_killable_nested()
update D/f/directory-locking
Most users of IS_ERR_VALUE() in the kernel are wrong, as they
pass an 'int' into a function that takes an 'unsigned long'
argument. This happens to work because the type is sign-extended
on 64-bit architectures before it gets converted into an
unsigned type.
However, anything that passes an 'unsigned short' or 'unsigned int'
argument into IS_ERR_VALUE() is guaranteed to be broken, as are
8-bit integers and types that are wider than 'unsigned long'.
Andrzej Hajda has already fixed a lot of the worst abusers that
were causing actual bugs, but it would be nice to prevent any
users that are not passing 'unsigned long' arguments.
This patch changes all users of IS_ERR_VALUE() that I could find
on 32-bit ARM randconfig builds and x86 allmodconfig. For the
moment, this doesn't change the definition of IS_ERR_VALUE()
because there are probably still architecture specific users
elsewhere.
Almost all the warnings I got are for files that are better off
using 'if (err)' or 'if (err < 0)'.
The only legitimate user I could find that we get a warning for
is the (32-bit only) freescale fman driver, so I did not remove
the IS_ERR_VALUE() there but changed the type to 'unsigned long'.
For 9pfs, I just worked around one user whose calling conventions
are so obscure that I did not dare change the behavior.
I was using this definition for testing:
#define IS_ERR_VALUE(x) ((unsigned long*)NULL == (typeof (x)*)NULL && \
unlikely((unsigned long long)(x) >= (unsigned long long)(typeof(x))-MAX_ERRNO))
which ends up making all 16-bit or wider types work correctly with
the most plausible interpretation of what IS_ERR_VALUE() was supposed
to return according to its users, but also causes a compile-time
warning for any users that do not pass an 'unsigned long' argument.
I suggested this approach earlier this year, but back then we ended
up deciding to just fix the users that are obviously broken. After
the initial warning that caused me to get involved in the discussion
(fs/gfs2/dir.c) showed up again in the mainline kernel, Linus
asked me to send the whole thing again.
[ Updated the 9p parts as per Al Viro - Linus ]
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Andrzej Hajda <a.hajda@samsung.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lkml.org/lkml/2016/1/7/363
Link: https://lkml.org/lkml/2016/5/27/486
Acked-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org> # For nvmem part
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Abhi Das has two patches that fix a GFS2 splice issue (and an adjustment).
- Ben Marzinski has a patch which allows the proper unmount of a GFS2
file system after hitting a withdraw error.
- I have a patch to fix a problem where GFS2 would dereference an error
value, plus three cosmetic / refactoring patches.
- Daniel DeFreez has a patch to fix two glock reference count problems,
where GFS2 was not properly "uninitializing" its glock holder on error
paths.
- Denys Vlasenko has a patch to change a function to not be inlined,
thus reducing the memory footprint of the GFS2 module.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJXP2KLAAoJENeLYdPf93o7nScH/jlDCv+JzwBwAiFEVelgB657
lFY+Q3bPvmcTaHiOqXzco7U1EeyvAuZ+x9duaFBBEGC2/NydIbkAdCP1XjDOTAGZ
aVVAP8qov4adFiXTfBdzMYjJ7DArOy/JRnlC8WK5OiD/3yxsXJZNlAHTgjjL/ZMq
PkhjxX2Lc9ZHGF9YIFTyeoBlqf1qO6WbhkxvSNefgq3c702QPM+T9XND5T1duGDF
ThEcXS2DRRoE+6wr3h+ruvS5kcfxXRLLausAgqNpj2FYQOFCQ6XXgGK8pyXA5V/t
3Vh8mHW0VmSjAfrRWslCRjd+APXoOf+7QHqOHUAA1dvQ6CGfC9qTETH9Wp32JCI=
=8g+a
-----END PGP SIGNATURE-----
Merge tag 'gfs2-4.7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull GFS2 updates from Bob Peterson:
"We've got nine patches this time:
- Abhi Das has two patches that fix a GFS2 splice issue (and an
adjustment).
- Ben Marzinski has a patch which allows the proper unmount of a GFS2
file system after hitting a withdraw error.
- I have a patch to fix a problem where GFS2 would dereference an
error value, plus three cosmetic / refactoring patches.
- Daniel DeFreez has a patch to fix two glock reference count
problems, where GFS2 was not properly "uninitializing" its glock
holder on error paths.
- Denys Vlasenko has a patch to change a function to not be inlined,
thus reducing the memory footprint of the GFS2 module"
* tag 'gfs2-4.7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
GFS2: Refactor gfs2_remove_from_journal
GFS2: Remove allocation parms from gfs2_rbm_find
gfs2: use inode_lock/unlock instead of accessing i_mutex directly
GFS2: Add calls to gfs2_holder_uninit in two error handlers
GFS2: Don't dereference inode in gfs2_inode_lookup until it's valid
GFS2: fs/gfs2/glock.c: Deinline do_error, save 1856 bytes
gfs2: Use gfs2 wrapper to sync inode before calling generic_file_splice_read()
GFS2: Get rid of dead code in inode_go_demote_ok
GFS2: ignore unlock failures after withdraw
Pull remaining vfs xattr work from Al Viro:
"The rest of work.xattr (non-cifs conversions)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
btrfs: Switch to generic xattr handlers
ubifs: Switch to generic xattr handlers
jfs: Switch to generic xattr handlers
jfs: Clean up xattr name mapping
gfs2: Switch to generic xattr handlers
ceph: kill __ceph_removexattr()
ceph: Switch to generic xattr handlers
ceph: Get rid of d_find_alias in ceph_set_acl
Pull networking updates from David Miller:
"Highlights:
1) Support SPI based w5100 devices, from Akinobu Mita.
2) Partial Segmentation Offload, from Alexander Duyck.
3) Add GMAC4 support to stmmac driver, from Alexandre TORGUE.
4) Allow cls_flower stats offload, from Amir Vadai.
5) Implement bpf blinding, from Daniel Borkmann.
6) Optimize _ASYNC_ bit twiddling on sockets, unless the socket is
actually using FASYNC these atomics are superfluous. From Eric
Dumazet.
7) Run TCP more preemptibly, also from Eric Dumazet.
8) Support LED blinking, EEPROM dumps, and rxvlan offloading in mlx5e
driver, from Gal Pressman.
9) Allow creating ppp devices via rtnetlink, from Guillaume Nault.
10) Improve BPF usage documentation, from Jesper Dangaard Brouer.
11) Support tunneling offloads in qed, from Manish Chopra.
12) aRFS offloading in mlx5e, from Maor Gottlieb.
13) Add RFS and RPS support to SCTP protocol, from Marcelo Ricardo
Leitner.
14) Add MSG_EOR support to TCP, this allows controlling packet
coalescing on application record boundaries for more accurate
socket timestamp sampling. From Martin KaFai Lau.
15) Fix alignment of 64-bit netlink attributes across the board, from
Nicolas Dichtel.
16) Per-vlan stats in bridging, from Nikolay Aleksandrov.
17) Several conversions of drivers to ethtool ksettings, from Philippe
Reynes.
18) Checksum neutral ILA in ipv6, from Tom Herbert.
19) Factorize all of the various marvell dsa drivers into one, from
Vivien Didelot
20) Add VF support to qed driver, from Yuval Mintz"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1649 commits)
Revert "phy dp83867: Fix compilation with CONFIG_OF_MDIO=m"
Revert "phy dp83867: Make rgmii parameters optional"
r8169: default to 64-bit DMA on recent PCIe chips
phy dp83867: Make rgmii parameters optional
phy dp83867: Fix compilation with CONFIG_OF_MDIO=m
bpf: arm64: remove callee-save registers use for tmp registers
asix: Fix offset calculation in asix_rx_fixup() causing slow transmissions
switchdev: pass pointer to fib_info instead of copy
net_sched: close another race condition in tcf_mirred_release()
tipc: fix nametable publication field in nl compat
drivers: net: Don't print unpopulated net_device name
qed: add support for dcbx.
ravb: Add missing free_irq() calls to ravb_close()
qed: Remove a stray tab
net: ethernet: fec-mpc52xx: use phy_ethtool_{get|set}_link_ksettings
net: ethernet: fec-mpc52xx: use phydev from struct net_device
bpf, doc: fix typo on bpf_asm descriptions
stmmac: hardware TX COE doesn't work when force_thresh_dma_mode is set
net: ethernet: fs-enet: use phy_ethtool_{get|set}_link_ksettings
net: ethernet: fs-enet: use phydev from struct net_device
...
Pull vfs cleanups from Al Viro:
"More cleanups from Christoph"
* 'work.preadv2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
nfsd: use RWF_SYNC
fs: add RWF_DSYNC aand RWF_SYNC
ceph: use generic_write_sync
fs: simplify the generic_write_sync prototype
fs: add IOCB_SYNC and IOCB_DSYNC
direct-io: remove the offset argument to dio_complete
direct-io: eliminate the offset argument to ->direct_IO
xfs: eliminate the pos variable in xfs_file_dio_aio_write
filemap: remove the pos argument to generic_file_direct_write
filemap: remove pos variables in generic_file_read_iter
Switch to the generic xattr handlers and take the necessary glocks at
the layer below. The following are the new xattr "entry points"; they
are called with the glock held already in the following cases:
gfs2_xattr_get: From SELinux, during lookups.
gfs2_xattr_set: The glock is never held.
gfs2_get_acl: From gfs2_create_inode -> posix_acl_create and
gfs2_setattr -> posix_acl_chmod.
gfs2_set_acl: From gfs2_setattr -> posix_acl_chmod.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This patch makes two simple changes to function gfs2_remove_from_journal.
First, it removes the parameter that specifies the transaction.
Since it's always passed in as current->journal_info, we might as well
set that in the function rather than passing it in. Second, it changes
the meta parameter to use an enum to make the code more clear.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
ta-da!
The main issue is the lack of down_write_killable(), so the places
like readdir.c switched to plain inode_lock(); once killable
variants of rwsem primitives appear, that'll be dealt with.
lockdep side also might need more work
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
i_mutex has been replaced by i_rwsem and directly accessing the
non-existent i_mutex breaks the kernel build.
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
This will allow us to do per-I/O sync file writes, as required by a lot
of fileservers or storage targets.
XXX: Will need a few additional audits for O_DSYNC
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Including blkdev_direct_IO and dax_do_io. It has to be ki_pos to actually
work, so eliminate the superflous argument.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This patch fixes two locations that do not call gfs2_holder_uninit
if gfs2_glock_nq returns an error.
Signed-off-by: Daniel DeFreez <dcdefreez@ucdavis.edu>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Function gfs2_inode_lookup was dereferencing the inode, and after,
it checks for the value being NULL. We need to check that first.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
This function compiles to 522 bytes of machine code.
Error paths are not very time critical.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
gfs2_file_splice_read() f_op grabs and releases the cluster-wide
inode glock to sync the inode size to the latest.
Without this, generic_file_splice_read() uses an older i_size value
and can return EOF for valid offsets in the inode.
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Function inode_go_demote_ok had some code that was only executed
if gl_holders was not empty. However, if gl_holders was not empty,
the only caller, demote_ok(), returns before inode_go_demote_ok
would ever be called. Therefore, it's dead code, so I removed it.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
In certain cases, the 802.11 mesh pathtable code wants to
iterate over all of the entries in the forwarding table from
the receive path, which is inside an RCU read-side critical
section. Enable walks inside atomic sections by allowing
GFP_ATOMIC allocations for the walker state.
Change all existing callsites to pass in GFP_KERNEL.
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Bob Copeland <me@bobcopeland.com>
[also adjust gfs2/glock.c and rhashtable tests]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After gfs2 has withdrawn the filesystem, it may still have many locks not
in the unlocked state. If it is using lock_dlm, it will failed trying
the unlocks since it has already unmounted the lock manager. Instead, it
should set the SDF_SKIP_DLM_UNLOCK flag on withdraw, to signal that
it can skip the lock_manager on unlocks, and failback to lock_nolock
style unlocking.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
We only have six patches ready for this merge window.
- Arnd Bergmann contributed a patch that fixes an uninitialized variable
warning.
- The second patch avoids a kernel panic due to referencing an iopen
glock that may not be held, in an error path.
- The third patch fixes a rounding error that caused xfs_tests direct IO
write "fsx" tests to fail on GFS2.
- The fourth patch tidies up the code path when glocks are being reused
to recreate a dinode that was recently deleted.
- The fifth reverts an ages-old patch that should no longer be needed, and
which interfered with the transition of dinodes from unlinked to free.
- And lastly, a patch to eliminate a function parameter that's not needed.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJW6qcKAAoJENeLYdPf93o70SwH/0czrFBIWaPbRgyuGLNvye4G
qvfIk2Yky3UKfgUA+ZW+cAWkxeKubc9scMwce0elKxm0e03rNwIX0slIOx8hymj3
19AgMnj3kcCJvRLdkBITNqnd6vTY2quadLN3j8I2cCNbHOV0GelEkP4jWcTHB+2F
AG0ZJOsvvrcD1ClgdIvGdV52qipZApS/kgZjLpJBEyzxq8SpRe9vNqMDsDyoKWgi
yjp0+aJ0IJAWA24fzdT5HE4fb5yGRWehg51l6Z2mbfXAvT+oeKcYrQiq1zhUutHw
dT6SUHbjt+y0EulPClsf3r5zjRjVCCbpj0wkUY3kPB98lgnpAD2QnUP/JDA+CD0=
=xRmX
-----END PGP SIGNATURE-----
Merge tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull GFS2 updates from Bob Peterson:
"We only have six patches ready for this merge window:
- Arnd Bergmann contributed a patch that fixes an uninitialized
variable warning.
- The second patch avoids a kernel panic due to referencing an iopen
glock that may not be held, in an error path.
- The third patch fixes a rounding error that caused xfs_tests direct
IO write "fsx" tests to fail on GFS2.
- The fourth patch tidies up the code path when glocks are being
reused to recreate a dinode that was recently deleted.
- The fifth reverts an ages-old patch that should no longer be
needed, and which interfered with the transition of dinodes from
unlinked to free.
- And lastly, a patch to eliminate a function parameter that's not
needed"
* tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
GFS2: Eliminate parameter non_block on gfs2_inode_lookup
GFS2: Don't filter out I_FREEING inodes anymore
GFS2: Prevent delete work from occurring on glocks used for create
GFS2: Fix direct IO write rounding error
gfs2: avoid uninitialized variable warning
GFS2: Check if iopen is held when deleting inode
Now that we're not filtering out I_FREEING inodes from our lookups
anymore, we can eliminate the non_block parameter from the lookup
function.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
This patch basically reverts a very old patch from 2008,
7a9f53b3c1, with the title
"Alternate gfs2_iget to avoid looking up inodes being freed".
The original patch was designed to avoid a deadlock caused by lock
ordering with try_rgrp_unlink. The patch forced the function to not
find inodes that were being removed by VFS. The problem is, that
made it impossible for nodes to delete their own unlinked dinodes
after a certain point in time, because the inode needed was not found
by this filtering process. There is no longer a need for the patch,
since function try_rgrp_unlink no longer locks the inode: All it does
is queue the glock onto the delete work_queue, so there should be no
more deadlock.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch tries to prevent delete work (queued via iopen callback)
from executing if the glock is currently being used to create
a new inode.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
The fsx test in xfstests was failing because it was using direct IO
writes which were using a bad calculation. It was using
loff_t lstart = offset & (PAGE_CACHE_SIZE - 1); when it should be
loff_t lstart = offset & ~(PAGE_CACHE_SIZE - 1);
Thus, the write at offset 0x67e00 was calculating lstart to be
0xe00, the address of our corruption. Instead, it should have been
0x67000. This patch fixes the calculation.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
We get a bogus warning about a potential uninitialized variable
use in gfs2, because the compiler does not figure out that we
never use the leaf number if get_leaf_nr() returns an error:
fs/gfs2/dir.c: In function 'get_first_leaf':
fs/gfs2/dir.c:802:9: warning: 'leaf_no' may be used uninitialized in this function [-Wmaybe-uninitialized]
fs/gfs2/dir.c: In function 'dir_split_leaf':
fs/gfs2/dir.c:1021:8: warning: 'leaf_no' may be used uninitialized in this function [-Wmaybe-uninitialized]
Changing the 'if (!error)' to 'if (!IS_ERR_VALUE(error))' is
sufficient to let gcc understand that this is exactly the same
condition as in IS_ERR() so it can optimize the code path enough
to understand it.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).
Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull security subsystem updates from James Morris:
- EVM gains support for loading an x509 cert from the kernel
(EVM_LOAD_X509), into the EVM trusted kernel keyring.
- Smack implements 'file receive' process-based permission checking for
sockets, rather than just depending on inode checks.
- Misc enhancments for TPM & TPM2.
- Cleanups and bugfixes for SELinux, Keys, and IMA.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (41 commits)
selinux: Inode label revalidation performance fix
KEYS: refcount bug fix
ima: ima_write_policy() limit locking
IMA: policy can be updated zero times
selinux: rate-limit netlink message warnings in selinux_nlmsg_perm()
selinux: export validatetrans decisions
gfs2: Invalid security labels of inodes when they go invalid
selinux: Revalidate invalid inode security labels
security: Add hook to invalidate inode security labels
selinux: Add accessor functions for inode->i_security
security: Make inode argument of inode_getsecid non-const
security: Make inode argument of inode_getsecurity non-const
selinux: Remove unused variable in selinux_inode_init_security
keys, trusted: seal with a TPM2 authorization policy
keys, trusted: select hash algorithm for TPM2 chips
keys, trusted: fix: *do not* allow duplicate key options
tpm_ibmvtpm: properly handle interrupted packet receptions
tpm_tis: Tighten IRQ auto-probing
tpm_tis: Refactor the interrupt setup
tpm_tis: Get rid of the duplicate IRQ probing code
...
Mark those kmem allocations that are known to be easily triggered from
userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
memcg. For the list, see below:
- threadinfo
- task_struct
- task_delay_info
- pid
- cred
- mm_struct
- vm_area_struct and vm_region (nommu)
- anon_vma and anon_vma_chain
- signal_struct
- sighand_struct
- fs_struct
- files_struct
- fdtable and fdtable->full_fds_bits
- dentry and external_name
- inode for all filesystems. This is the most tedious part, because
most filesystems overwrite the alloc_inode method.
The list is far from complete, so feel free to add more objects.
Nevertheless, it should be close to "account everything" approach and
keep most workloads within bounds. Malevolent users will be able to
breach the limit, but this was possible even with the former "account
everything" approach (simply because it did not account everything in
fact).
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch fixes an error condition in which an inode is partially
created in gfs2_create_inode() but then some error is discovered,
which causes it to fail and call iput() before the iopen glock is
created or held. In that case, gfs2_delete_inode would try to
unlock an iopen glock that doesn't yet exist. Therefore, we test
its holder (which must exist) for the HIF_HOLDER bit before trying
to dq it.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Here is a list of patches we've accumulated for GFS2 for the current upstream
merge window. Last window's set was short, but I warned that this one would
be bigger, and so it is. We've got 19 patches:
- A patch from Abhi Das to propagate the GFS2_DIF_SYSTEM bit so that newly
added journals don't get flagged, deleted, and recreated by fsck.gfs2.
- Two patches from Andreas Gruenbacher to improve GFS2 performance where
extended attributes are involved.
- A patch from Andy Price to fix a suspicious rcu dereference error.
- Two patches from Ben Marzinski that rework how GFS2's NFS cookies are
managed. This fixes readdir problems with nfs-over-gfs2.
- A patch from Ben Marzinski that fixes a race in unmounting GFS2.
- A set of four patches from me to move the resource group reservations
inside the gfs2 inode to improve performance and fix a bug whereby
get_write_access improperly prevented some operations like chown.
- A patch from me to spinlock-protect the setting of system statfs file data.
This was causing small discrepancies between df and du.
- A patch from me to reintroduce a timeout while clearing glocks which was
accidentally dropped some time ago.
- A patch from me to wait for iopen glock dequeues in order to improve
deleting of files that were unlinked from a different cluster node.
- A patch from me to ensure metadata address spaces get truncated when an
inode is evicted.
- A patch from me to fix a bug in which a memory leak could occur in some
error cases when inodes were trying to be created.
- A patch to consistently use iopen glocks to transition from the unlinked
state to the deleted state.
- A patch to fix a glock reference count error when inode creation fails.
- A patch from Junxiao Bi to fix an flock panic.
- A patch from Markus Elfring that removes an unnecessary if.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJWlUOCAAoJENeLYdPf93o78/IH/0PfoQtQnYQpvbTPgaFICEUI
rx+uLqPxpJgjJMsbIy+VA/bhZsbuENDbor99Cs6GiTLy3Q/4TKNY9NxDN+aO8o+Q
qrp3ZANkyTGaneCzrXzfgmOxe1G8xRQ7pboRUEt2yPlGK1oLax+PR4uPb+IuJ5Wa
QvSwnZj+9K2LkMQkKPyxoltjZiZLywvNB5dx0F35+dvriQxHXfJ1Naek6gs140MB
SfG4/Zi8dfGOJQjSQTUENatFhyFB0V7CbDPFuBVzTGD8IS4ASgqIBMFjO0VBHkP4
gt4zb0BJAn0aQcbxIaDtsuEffpn+5zdK/Onq5g4Fr9ZJv3fkVpVcOgYui/rkDD0=
=28ev
-----END PGP SIGNATURE-----
Merge tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull GFS2 updates from Bob Peterson:
"Here is a list of patches we've accumulated for GFS2 for the current
upstream merge window. Last window's set was short, but I warned that
this one would be bigger, and so it is. We've got 19 patches:
- A patch from Abhi Das to propagate the GFS2_DIF_SYSTEM bit so that
newly added journals don't get flagged, deleted, and recreated by
fsck.gfs2.
- Two patches from Andreas Gruenbacher to improve GFS2 performance
where extended attributes are involved.
- A patch from Andy Price to fix a suspicious rcu dereference error.
- Two patches from Ben Marzinski that rework how GFS2's NFS cookies
are managed. This fixes readdir problems with nfs-over-gfs2.
- A patch from Ben Marzinski that fixes a race in unmounting GFS2.
- A set of four patches from me to move the resource group
reservations inside the gfs2 inode to improve performance and fix a
bug whereby get_write_access improperly prevented some operations
like chown.
- A patch from me to spinlock-protect the setting of system statfs
file data. This was causing small discrepancies between df and du.
- A patch from me to reintroduce a timeout while clearing glocks
which was accidentally dropped some time ago.
- A patch from me to wait for iopen glock dequeues in order to
improve deleting of files that were unlinked from a different
cluster node.
- A patch from me to ensure metadata address spaces get truncated
when an inode is evicted.
- A patch from me to fix a bug in which a memory leak could occur in
some error cases when inodes were trying to be created.
- A patch to consistently use iopen glocks to transition from the
unlinked state to the deleted state.
- A patch to fix a glock reference count error when inode creation
fails.
- A patch from Junxiao Bi to fix an flock panic.
- A patch from Markus Elfring that removes an unnecessary if"
* tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: fix flock panic issue
GFS2: Don't do glock put on when inode creation fails
GFS2: Always use iopen glock for gl_deletes
GFS2: Release iopen glock in gfs2_create_inode error cases
GFS2: Truncate address space mapping when deleting an inode
GFS2: Wait for iopen glock dequeues
gfs2: clear journal live bit in gfs2_log_flush
gfs2: change gfs2 readdir cookie
gfs2: keep offset when splitting dir leaf blocks
GFS2: Reintroduce a timeout in function gfs2_gl_hash_clear
GFS2: Update master statfs buffer with sd_statfs_spin locked
GFS2: Reduce size of incore inode
GFS2: Make rgrp reservations part of the gfs2_inode structure
GFS2: Extract quota data from reservations structure (revert 5407e24)
gfs2: Extended attribute readahead optimization
gfs2: Extended attribute readahead
GFS2: Use rht_for_each_entry_rcu in glock_hash_walk
GFS2: Delete an unnecessary check before the function call "iput"
gfs2: Automatically set GFS2_DIF_SYSTEM flag on system files