Since we do things like setting flags, etc it really is more appropriate
to use sock_lock().
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If we already hold open state on the client, yet the server gives us a
completely different stateid to the one we already hold, then we
currently treat it as if it were an out-of-sequence update, and wait for
5 seconds for other updates to come in.
This commit fixes the behaviour so that we immediately start processing
of the new stateid, and then leave it to the call to
nfs4_test_and_free_stateid() to decide what to do with the old stateid.
Fixes: b4868b44c5 ("NFSv4: Wait for stateid updates after CLOSE/OPEN_DOWNGRADE")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Remove a redundant call in nfs_updatepage(). nfs_writepage_setup() will
have already called nfs_mark_request_dirty() on success.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
In rpc_task_set_client(), testing for a NULL clnt is not necessary, as
clnt should always be a valid pointer to a rpc_client.
Signed-off-by: Thiago Rafael Becker <trbecker@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
All the callers are now in client.c so we can remove the
EXPORT_SYMBOL_GPL() and make it static.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
This lets us update the server's attributes when the user does a "mount
-o remount" on the filesystem.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Clean up. There are a few places where we want to probe the server, but
don't actually care about the fsinfo result. Change these to use
nfs_probe_server(), which handles the fattr allocation for us.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
And rename it to nfs_probe_server(). I also change it to take the nfs_fh
as an argument so callers can choose what filehandle to probe.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
And call it before doing an FSINFO probe to reset to the baseline
capabilities before probing.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
dprintk call sites that display no other information than the
function name can be replaced with use of the trace "function" or
"function_graph" plug-ins.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
These new events report slightly different information for readpage
and readpages/readahead.
For readpage:
fsx-1387 [006] 380.761896: nfs_aop_readpage: fileid=00:28:2 fhandle=0x36fbbe51 version=1752899355910932437 offset=131072
fsx-1387 [006] 380.761900: nfs_aop_readpage_done: fileid=00:28:2 fhandle=0x36fbbe51 version=1752899355910932437 offset=131072 ret=0
The index of a synchronous single-page read is reported.
For readpages:
fsx-1387 [006] 380.760847: nfs_aop_readahead: fileid=00:28:2 fhandle=0x36fbbe51 version=1752899355909932456 nr_pages=3
fsx-1387 [006] 380.760853: nfs_aop_readahead_done: fileid=00:28:2 fhandle=0x36fbbe51 version=1752899355909932456 nr_pages=3 ret=0
The count of pages requested is reported. nfs_readpages does not
wait for the READ requests to complete.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Clean up: BIT() is preferred over open-coding the shift.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
For certain special cases, RPC-related tracepoints record a -1 as
the task ID or the client ID. It's ugly for a trace event to display
4 billion in these cases.
To help keep SUNRPC tracepoints consistent, create a macro that
defines the print format specifiers for tk_pid and cl_clid. At some
point in the future we might try tk_pid with a wider range of values
than 0..64K so this makes it easier to make that change.
RPC tracepoints now look like this:
<...>-1276 [009] 149.720358: rpc_clnt_new: client=00000005 peer=[192.168.2.55]:20049 program=nfs server=klimt.ib
<...>-1342 [004] 149.921234: rpc_xdr_recvfrom: task:0000001a@00000005 head=[0xff1242d9ab6dc01c,144] page=0 tail=[(nil),0] len=144
<...>-1342 [004] 149.921235: xprt_release_cong: task:0000001a@00000005 snd_task:ffffffff cong=256 cwnd=16384
<...>-1342 [004] 149.921235: xprt_put_cong: task:0000001a@00000005 snd_task:ffffffff cong=0 cwnd=16384
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Clean up: this field is no longer used.
xprt_rdma_pad_optimize is also no longer used, but is left in place
because it is part of the kernel/userspace API.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
This is a buffer to be left persistently registered while a
connection is up. Connection tear-down will automatically DMA-unmap,
invalidate, and dereg the MR. A persistently registered buffer is
lower in cost to provide, and it can never be coalesced into the
RDMA segment that carries the data payload.
An RPC that provisions a Write chunk with a non-aligned length now
uses this MR rather than the tail buffer of the RPC's rq_rcv_buf.
Reviewed-By: Tom Talpey <tom@talpey.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Save some space in the nfs_inode by setting up an anonymous union with
the fields that are peculiar to a specific type of filesystem object.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
We mustn't call nfs_wb_all() on anything other than a regular file.
Furthermore, we can exit early when we don't hold a delegation.
Reported-by: David Wysochanski <dwysocha@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If O_DIRECT bumps the commit_info rpcs_out field, then that could lead
to fsync() hangs. The fix is to ensure that O_DIRECT calls
nfs_commit_end().
Fixes: 723c921e7d ("sched/wait, fs/nfs: Convert wait_on_atomic_t() usage to the new wait_var_event() API")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Partially revert commit 2ce209c42c ("NFS: Wait for requests that are
locked on the commit list"), since it can lead to deadlocks between
commit requests and nfs_join_page_group().
For now we should assume that any locked requests on the commit list are
either about to be removed and committed by another task, or the writes
they describe are about to be retransmitted. In either case, we should
not need to worry.
Fixes: 2ce209c42c ("NFS: Wait for requests that are locked on the commit list")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Generate a trace event whenever the NFS client modifies the size of
a file. These new events aid troubleshooting workloads that trigger
races around size updates.
There are four new trace points, all named nfs_size_something so
they are easy to grep for or enable as a group with a single glob.
Size updated on the server:
kworker/u24:10-194 [010] 369.939174: nfs_size_update: fileid=00:28:2 fhandle=0x36fbbe51 version=1752899344277980615 cursize=250471 newsize=172083
Server-side size update reported via NFSv3 WCC attributes:
fsx-1387 [006] 380.760686: nfs_size_wcc: fileid=00:28:2 fhandle=0x36fbbe51 version=1752899355909932456 cursize=146792 newsize=171216
File has been truncated locally:
fsx-1387 [007] 369.437421: nfs_size_truncate: fileid=00:28:2 fhandle=0x36fbbe51 version=1752899231200117272 cursize=215244 newsize=0
File has been extended locally:
fsx-1387 [007] 369.439213: nfs_size_grow: fileid=00:28:2 fhandle=0x36fbbe51 version=1752899343704248410 cursize=258048 newsize=262144
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The current range of RPC task PIDs is 0..65535. That's not adequate
for distinguishing tasks across multiple rpc_clnts running high
throughput workloads.
To help relieve this situation and to reduce the bottleneck of
having a single atomic for assigning all RPC task PIDs, assign task
PIDs per rpc_clnt.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Clean up: TRACE_DEFINE_ENUM is unnecessary because the target
symbols are all C macros, not enums.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
_nfs4_pnfs_v3/v4_ds_connect do
some work
smp_wmb
ds->ds_clp = clp;
And nfs4_ff_layout_prepare_ds currently does
smp_rmb
if(ds->ds_clp)
...
This patch places the smp_rmb after the if. This ensures that following
reads only happen once nfs4_ff_layout_prepare_ds has checked that data
has been properly initialized.
Fixes: d67ae825a5 ("pnfs/flexfiles: Add the FlexFile Layout Driver")
Signed-off-by: Baptiste Lepers <baptiste.lepers@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The original premise in commit 83672d392f ("NFS: Fix directory caching
problem - with test case and patch.") was that readdirplus was caching
attribute information and replaying it later. This is no longer the
case.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If the directory changed while we were revalidating the dentry, then
don't update the dentry verifier. There is no value in setting the
verifier to an older value, and we could end up overwriting a more up to
date verifier from a parallel revalidation.
Fixes: efeda80da3 ("NFSv4: Fix revalidation of dentries with delegations")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
If a user is doing 'ls -l', we have a heuristic in GETATTR that tells
the readdir code to try to use READDIRPLUS in order to refresh the inode
attributes. In certain cirumstances, we also try to invalidate the
remaining directory entries in order to ensure this refresh.
If there are multiple readers of the directory, we probably should avoid
invalidating the page cache, since the heuristic breaks down in that
situation anyway.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
The check for duplicate readdir cookies should only care if the change
attribute is invalid or the data cache is invalid.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
If we want to revalidate the directory, then just mark the change
attribute as invalid.
Fixes: 13c0b082b6 ("NFS: Replace use of NFS_INO_REVAL_PAGECACHE when checking cache validity")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
NFS_INO_DATA_INVAL_DEFER and NFS_INO_INVALID_DATA should be considered
mutually exclusive.
Fixes: 1c341b7775 ("NFS: Add deferred cache invalidation for close-to-open consistency violations")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Both NFSv3 and NFSv2 generate their change attribute from the ctime
value that was supplied by the server. However the problem is that there
are plenty of servers out there with ctime resolutions of 1ms or worse.
In a modern performance system, this is insufficient when trying to
decide which is the most recent set of attributes when, for instance, a
READ or GETATTR call races with a WRITE or SETATTR.
For this reason, let's revert to labelling the NFSv2/v3 change
attributes as NFS4_CHANGE_TYPE_IS_UNDEFINED. This will ensure we protect
against such races.
Fixes: 7b24dacf08 ("NFS: Another inode revalidation improvement")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
NFS4_CREATE_EXCLUSIVE does not allow the caller to set an access mode,
so for most Linux filesystems, the access call ends up returning no
permissions. However both NFS4_CREATE_EXCLUSIVE4_1 and
NFS4_CREATE_GUARDED allow the client to set the access mode.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The clearing of the XPRT_LOCKED bit has to happen after we clear
xprt->snd_task, but we don't require any extra memory barriers after
that.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The only check for RPC_TASK_RUNNING is the one in rpc_make_runnable(),
which happens under the same spin lock held when we call
rpc_clear_running().
Ditto, the last check for RPC_TASK_QUEUED in rpc_execute() is performed
under the same lock as the one held when we call rpc_clear_queued().
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The premise of commit 6f9f17287e ("SUNRPC: Mitigate cond_resched() in
xprt_transmit()") was that cond_resched() is expensive and unnecessary
when there has been just a single send.
The point of cond_resched() is to ensure that tasks that should pre-empt
this one get a chance to do so when it is safe to do so. The code prior
to commit 6f9f17287e failed to take into account that it was keeping a
rpc_task pinned for longer than it needed to, and so rather than doing a
full revert, let's just move the cond_resched.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If the cached credential exists but doesn't have any expiration callback
then exit early.
Fix up atomicity issues when replacing the credential with a new one
since the existing code could lead to refcount leaks.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
After the success of an operation such as rmdir() or unlink(), we expect
to add the dentry back to the dcache as an ordinary negative dentry.
However in NFS, unless it is labelled with the appropriate verifier for
the parent directory state, then nfs_lookup_revalidate will end up
discarding that dentry and forcing a new lookup.
The fix is to ensure that we relabel the dentry appropriately on
success.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
After the success of an operation such as link(), or symlink(), we
expect to add the dentry back to the dcache as an ordinary positive
dentry.
However in NFS, unless it is labelled with the appropriate verifier for
the parent directory state, then nfs_lookup_revalidate will end up
discarding that dentry and forcing a new lookup.
The fix is to ensure that we relabel the dentry appropriately on
success.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
In commit b212921b13 ("elf: don't use MAP_FIXED_NOREPLACE for elf
executable mappings") we still leave MAP_FIXED_NOREPLACE in place for
load_elf_interp.
Unfortunately, this will cause kernel to fail to start with:
1 (init): Uhuuh, elf segment at 00003ffff7ffd000 requested but the memory is mapped already
Failed to execute /init (error -17)
The reason is that the elf interpreter (ld.so) has overlapping segments.
readelf -l ld-2.31.so
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
LOAD 0x0000000000000000 0x0000000000000000 0x0000000000000000
0x000000000002c94c 0x000000000002c94c R E 0x10000
LOAD 0x000000000002dae0 0x000000000003dae0 0x000000000003dae0
0x00000000000021e8 0x0000000000002320 RW 0x10000
LOAD 0x000000000002fe00 0x000000000003fe00 0x000000000003fe00
0x00000000000011ac 0x0000000000001328 RW 0x10000
The reason for this problem is the same as described in commit
ad55eac74f ("elf: enforce MAP_FIXED on overlaying elf segments").
Not only executable binaries, elf interpreters (e.g. ld.so) can have
overlapping elf segments, so we better drop MAP_FIXED_NOREPLACE and go
back to MAP_FIXED in load_elf_interp.
Fixes: 4ed2863951 ("fs, elf: drop MAP_FIXED usage from elf_map")
Cc: <stable@vger.kernel.org> # v4.19
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Chen Jingwen <chenjingwen6@huawei.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
allocation. Also fix error handling code paths in ext4_dx_readdir()
and ext4_fill_super(). Finally, avoid a grabbing a journal head in
the delayed allocation write in the common cases where we are
overwriting an pre-existing block or appending to an inode.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmFZ2SsACgkQ8vlZVpUN
gaN6DAgAkIeisL1EfQT0VwshEs8y7N6IoX8dydLSRLpNf5oWiJOv2CTY9Qpi6X/C
qNfuLsbJ2NXChvhIAM2hD82hvX21rYc6iqPxgho02VF4eYIP7NzLjwTFKnKbHPB5
TiF498nJTnkcmSrJUEXmSAEdLoCwa5THH9+9HVHXZrkLXPULBtOOJ85mDAcIzVhV
Zqb7yfbpWl0gnF0S0YjNATPtbhcC9EiC4MOVYVesRlgT9B3+k5q4fmVU0euTU9OH
F2H6TNG+Mg/19gTnDP5acB9+eXHvYEqMpe+CaDifR9iFE9PTG/Edhxr6z9roXhHr
kBvEVHSFH+YTEJXghnpS9YDd9Lwc9w==
=WKzd
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"Fix a number of ext4 bugs in fast_commit, inline data, and delayed
allocation.
Also fix error handling code paths in ext4_dx_readdir() and
ext4_fill_super().
Finally, avoid a grabbing a journal head in the delayed allocation
write in the common cases where we are overwriting a pre-existing
block or appending to an inode"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: recheck buffer uptodate bit under buffer lock
ext4: fix potential infinite loop in ext4_dx_readdir()
ext4: flush s_error_work before journal destroy in ext4_fill_super
ext4: fix loff_t overflow in ext4_max_bitmap_size()
ext4: fix reserved space counter leakage
ext4: limit the number of blocks in one ADD_RANGE TLV
ext4: enforce buffer head state assertion in ext4_da_map_blocks
ext4: remove extent cache entries when truncating inline data
ext4: drop unnecessary journal handle in delalloc write
ext4: factor out write end code of inline file
ext4: correct the error path of ext4_write_inline_data_end()
ext4: check and update i_disksize properly
ext4: add error checking to ext4_ext_replay_set_iblocks()
The objtool warning that the kvm instruction emulation code triggered
wasn't very useful:
arch/x86/kvm/emulate.o: warning: objtool: __ex_table+0x4: don't know how to handle reloc symbol type: kvm_fastop_exception
in that it helpfully tells you which symbol name it had trouble figuring
out the relocation for, but it doesn't actually say what the unknown
symbol type was that triggered it all.
In this case it was because of missing type information (type 0, aka
STT_NOTYPE), but on the whole it really should just have printed that
out as part of the message.
Because if this warning triggers, that's very much the first thing you
want to know - why did reloc2sec_off() return failure for that symbol?
So rather than just saying you can't handle some type of symbol without
saying what the type _was_, just print out the type number too.
Fixes: 24ff652573 ("objtool: Teach get_alt_entry() about more relocation types")
Link: https://lore.kernel.org/lkml/CAHk-=wiZwq-0LknKhXN4M+T8jbxn_2i9mcKpO+OaBSSq_Eh7tg@mail.gmail.com/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The recent change to make objtool aware of more symbol relocation types
(commit 24ff652573: "objtool: Teach get_alt_entry() about more
relocation types") also added another check, and resulted in this
objtool warning when building kvm on x86:
arch/x86/kvm/emulate.o: warning: objtool: __ex_table+0x4: don't know how to handle reloc symbol type: kvm_fastop_exception
The reason seems to be that kvm_fastop_exception() is marked as a global
symbol, which causes the relocation to ke kept around for objtool. And
at the same time, the kvm_fastop_exception definition (which is done as
an inline asm statement) doesn't actually set the type of the global,
which then makes objtool unhappy.
The minimal fix is to just not mark kvm_fastop_exception as being a
global symbol. It's only used in that one compilation unit anyway, so
it was always pointless. That's how all the other local exception table
labels are done.
I'm not entirely happy about the kinds of games that the kvm code plays
with doing its own exception handling, and the fact that it confused
objtool is most definitely a symptom of the code being a bit too subtle
and ad-hoc. But at least this trivial one-liner makes objtool no longer
upset about what is going on.
Fixes: 24ff652573 ("objtool: Teach get_alt_entry() about more relocation types")
Link: https://lore.kernel.org/lkml/CAHk-=wiZwq-0LknKhXN4M+T8jbxn_2i9mcKpO+OaBSSq_Eh7tg@mail.gmail.com/
Cc: Borislav Petkov <bp@suse.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>