Reorganise afs_volume objects such that they're in a tree keyed on volume
ID, rooted at on an afs_cell object rather than being in multiple trees,
each of which is rooted on an afs_server object.
afs_server structs become per-cell and acquire a pointer to the cell.
The process of breaking a callback then starts with finding the server by
its network address, following that to the cell and then looking up each
volume ID in the volume tree.
This is simpler than the afs_vol_interest/afs_cb_interest N:M mapping web
and allows those structs and the code for maintaining them to be simplified
or removed.
It does make a couple of things a bit more tricky, though:
(1) Operations now start with a volume, not a server, so there can be more
than one answer as to whether or not the server we'll end up using
supports the FS.InlineBulkStatus RPC.
(2) CB RPC operations that specify the server UUID. There's still a tree
of servers by UUID on the afs_net struct, but the UUIDs in it aren't
guaranteed unique.
Signed-off-by: David Howells <dhowells@redhat.com>
YFS Volume Location servers have an operation by which the cell name may be
queried. Use this to find out what a YFS server thinks the canonical cell
name should be.
Signed-off-by: David Howells <dhowells@redhat.com>
Implement the second phase of cell alias detection. This part handles
alias detection for cells that don't have root.cell volumes and so we have
to find some other volume or fileserver to query.
We take the first volume from each such cell and attempt to look it up in
the new cell. If found, we compare the records, if they are the same, we
judge the cell names to be aliases.
Signed-off-by: David Howells <dhowells@redhat.com>
Put in the first phase of cell alias detection. This part handles alias
detection for cells that have root.cell volumes (which is expected to be
likely).
When a cell becomes newly active, it is probed for its root.cell volume,
and if it has one, this volume is compared against other root.cell volumes
to find out if the list of fileserver UUIDs have any in common - and if
that's the case, do the address lists of those fileservers have any
addresses in common. If they do, the new cell is adjudged to be an alias
of the old cell and the old cell is used instead.
Comparing is aided by the server list in struct afs_server_list being
sorted in UUID order and the addresses in the fileserver address lists
being sorted in address order.
The cell then retains the afs_volume object for the root.cell volume, even
if it's not mounted for future alias checking.
This necessary because:
(1) Whilst fileservers have UUIDs that are meant to be globally unique, in
practice they are not because cells get cloned without changing the
UUIDs - so afs_server records need to be per cell.
(2) Sometimes the DNS is used to make cell aliases - but if we don't know
they're the same, we may end up with multiple superblocks and multiple
afs_server records for the same thing, impairing our ability to
deliver callback notifications of third party changes
(3) The fileserver RPC API doesn't contain the cell name, so it can't tell
us which cell it's notifying and can't see that a change made to to
one cell should notify the same client that's also accessed as the
other cell.
Reported-by: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Implement client support for the YFSVL.GetCellName RPC operation by which
YFS permits the canonical cell name to be queried from a VL server.
Signed-off-by: David Howells <dhowells@redhat.com>
Save more bits from the volume location database record obtained for a
server so that we can use this information in cell alias detection.
Signed-off-by: David Howells <dhowells@redhat.com>
The AFS filesystem driver is handling the CB.ProbeUuid request incorrectly.
The UUID presented in the request is that of the cache manager, not the
fileserver, so afs_deliver_cb_probe_uuid() shouldn't be using that UUID to
look up the server.
Fix this by looking up the server by address instead.
Signed-off-by: David Howells <dhowells@redhat.com>
Don't get the epoch from a server, particularly one that we're looking up
by UUID, as UUIDs may be ambiguous and may map to more than one server - so
we can't draw any conclusions from it.
Reported-by: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Turn the afs_operation struct into the main way that most fileserver
operations are managed. Various things are added to the struct, including
the following:
(1) All the parameters and results of the relevant operations are moved
into it, removing corresponding fields from the afs_call struct.
afs_call gets a pointer to the op.
(2) The target volume is made the main focus of the operation, rather than
the target vnode(s), and a bunch of op->vnode->volume are made
op->volume instead.
(3) Two vnode records are defined (op->file[]) for the vnode(s) involved
in most operations. The vnode record (struct afs_vnode_param)
contains:
- The vnode pointer.
- The fid of the vnode to be included in the parameters or that was
returned in the reply (eg. FS.MakeDir).
- The status and callback information that may be returned in the
reply about the vnode.
- Callback break and data version tracking for detecting
simultaneous third-parth changes.
(4) Pointers to dentries to be updated with new inodes.
(5) An operations table pointer. The table includes pointers to functions
for issuing AFS and YFS-variant RPCs, handling the success and abort
of an operation and handling post-I/O-lock local editing of a
directory.
To make this work, the following function restructuring is made:
(A) The rotation loop that issues calls to fileservers that can be found
in each function that wants to issue an RPC (such as afs_mkdir()) is
extracted out into common code, in a new file called fs_operation.c.
(B) The rotation loops, such as the one in afs_mkdir(), are replaced with
a much smaller piece of code that allocates an operation, sets the
parameters and then calls out to the common code to do the actual
work.
(C) The code for handling the success and failure of an operation are
moved into operation functions (as (5) above) and these are called
from the core code at appropriate times.
(D) The pseudo inode getting stuff used by the dynamic root code is moved
over into dynroot.c.
(E) struct afs_iget_data is absorbed into the operation struct and
afs_iget() expects to be given an op pointer and a vnode record.
(F) Point (E) doesn't work for the root dir of a volume, but we know the
FID in advance (it's always vnode 1, unique 1), so a separate inode
getter, afs_root_iget(), is provided to special-case that.
(G) The inode status init/update functions now also take an op and a vnode
record.
(H) The RPC marshalling functions now, for the most part, just take an
afs_operation struct as their only argument. All the data they need
is held there. The result delivery functions write their answers
there as well.
(I) The call is attached to the operation and then the operation core does
the waiting.
And then the new operation code is, for the moment, made to just initialise
the operation, get the appropriate vnode I/O locks and do the same rotation
loop as before.
This lays the foundation for the following changes in the future:
(*) Overhauling the rotation (again).
(*) Support for asynchronous I/O, where the fileserver rotation must be
done asynchronously also.
Signed-off-by: David Howells <dhowells@redhat.com>
As a prelude to implementing asynchronous fileserver operations in the afs
filesystem, rename struct afs_fs_cursor to afs_operation.
This struct is going to form the core of the operation management and is
going to acquire more members in later.
Signed-off-by: David Howells <dhowells@redhat.com>
Set a flag in the call struct to indicate an unmarshalling error rather
than return and handle an error from the decoding of file statuses. This
flag is checked on a successful return from the delivery function.
Signed-off-by: David Howells <dhowells@redhat.com>
afs_vol_interest objects represent the volume IDs currently being accessed
from a fileserver. These hold lists of afs_cb_interest objects that
repesent the superblocks using that volume ID on that server.
When a callback notification from the server telling of a modification by
another client arrives, the volume ID specified in the notification is
looked up in the server's afs_vol_interest list. Through the
afs_cb_interest list, the relevant superblocks can be iterated over and the
specific inode looked up and marked in each one.
Make the following efficiency improvements:
(1) Hold rcu_read_lock() over the entire processing rather than locking it
each time.
(2) Do all the callbacks for each vid together rather than individually.
Each volume then only needs to be looked up once.
(3) afs_vol_interest objects are now stored in an rb_tree rather than a
flat list to reduce the lookup step count.
(4) afs_vol_interest lookup is now done with RCU, but because it's in an
rb_tree which may rotate under us, a seqlock is used so that if it
changes during the walk, we repeat the walk with a lock held.
With this and the preceding patch which adds RCU-based lookups in the inode
cache, target volumes/vnodes can be taken without the need to take any
locks, except on the target itself.
Signed-off-by: David Howells <dhowells@redhat.com>
Show more information in /proc/net/afs/servers to make it easier to see
what's going on with the server probing.
Signed-off-by: David Howells <dhowells@redhat.com>
When an AFS client accesses a file, it receives a limited-duration callback
promise that the server will notify it if another client changes a file.
This callback duration can be a few hours in length.
If a client mounts a volume and then an application prevents it from being
unmounted, say by chdir'ing into it, but then does nothing for some time,
the rxrpc_peer record will expire and rxrpc-level keepalive will cease.
If there is NAT or a firewall between the client and the server, the route
back for the server may close after a comparatively short duration, meaning
that attempts by the server to notify the client may then bounce.
The client, however, may (so far as it knows) still have a valid unexpired
promise and will then rely on its cached data and will not see changes made
on the server by a third party until it incidentally rechecks the status or
the promise needs renewal.
To deal with this, the client needs to regularly probe the server. This
has two effects: firstly, it keeps a route open back for the server, and
secondly, it causes the server to disgorge any notifications that got
queued up because they couldn't be sent.
Fix this by adding a mechanism to emit regular probes.
Two levels of probing are made available: Under normal circumstances the
'slow' queue will be used for a fileserver - this just probes the preferred
address once every 5 mins or so; however, if server fails to respond to any
probes, the server will shift to the 'fast' queue from which all its
interfaces will be probed every 30s. When it finally responds, the record
will switch back to the slow queue.
Further notes:
(1) Probing is now no longer driven from the fileserver rotation
algorithm.
(2) Probes are dispatched to all interfaces on a fileserver when that an
afs_server object is set up to record it.
(3) The afs_server object is removed from the probe queues when we start
to probe it. afs_is_probing_server() returns true if it's not listed
- ie. it's undergoing probing.
(4) The afs_server object is added back on to the probe queue when the
final outstanding probe completes, but the probed_at time is set when
we're about to launch a probe so that it's not dependent on the probe
duration.
(5) The timer and the work item added for this must be handed a count on
net->servers_outstanding, which they hand on or release. This makes
sure that network namespace cleanup waits for them.
Fixes: d2ddc776a4 ("afs: Overhaul volume and server record caching and fileserver rotation")
Reported-by: Dave Botsch <botsch@cnf.cornell.edu>
Signed-off-by: David Howells <dhowells@redhat.com>
Split the usage count on the afs_server struct to have an active count that
registers who's actually using it separately from the reference count on
the object.
This allows a future patch to dispatch polling probes without advancing the
"unuse" time into the future each time we emit a probe, which would
otherwise prevent unused server records from expiring.
Included in this:
(1) The latter part of afs_destroy_server() in which the RCU destruction
of afs_server objects is invoked and the outstanding server count is
decremented is split out into __afs_put_server().
(2) afs_put_server() now calls __afs_put_server() rather then setting the
management timer.
(3) The calls begun by afs_fs_give_up_all_callbacks() and
afs_fs_get_capabilities() can now take a ref on the server record, so
afs_destroy_server() can just drop its ref and needn't wait for the
completion of these calls. They'll put the ref when they're done.
(4) Because of (3), afs_fs_probe_done() no longer needs to wake up
afs_destroy_server() with server->probe_outstanding.
(5) afs_gc_servers can be simplified. It only needs to check if
server->active is 0 rather than playing games with the refcount.
(6) afs_manage_servers() can propose a server for gc if usage == 0 rather
than if ref == 1. The gc is effected by (5).
Signed-off-by: David Howells <dhowells@redhat.com>
The U-version VLDB volume record retrieved by the VL.GetEntryByNameU rpc op
carries a change counter (the serverUnique field) for each fileserver
listed in the record as backing that volume. This is incremented whenever
the registration details for a fileserver change (such as its address
list). Note that the same value will be seen in all UVLDB records that
refer to that fileserver.
This should be checked before calling the VL server to re-query the address
list for a fileserver. If it's the same, there's no point doing the query.
Reported-by: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
When a lookup is done in an AFS directory, the filesystem will speculate
and fetch up to 49 other statuses for files in the same directory and fetch
those as well, turning them into inodes or updating inodes that already
exist.
However, occasionally, a callback break might go missing due to NAT timing
out, but the afs filesystem doesn't then realise that the directory is not
up to date.
Alleviate this by using one of the status slots to check the directory in
which the lookup is being done.
Reported-by: Dave Botsch <botsch@cnf.cornell.edu>
Suggested-by: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Make the inode hash table RCU searchable so that searches that want to
access or modify an inode without taking a ref on that inode can do so
without taking the inode hash table lock.
The main thing this requires is some RCU annotation on the list
manipulation operations. Inodes are already freed by RCU in most cases.
Users of this interface must take care as the inode may be still under
construction or may be being torn down around them.
There are at least three instances where this can be of use:
(1) Testing whether the inode number iunique() is going to return is
currently unique (the iunique_lock is still held).
(2) Ext4 date stamp updating.
(3) AFS callback breaking.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
cc: linux-ext4@vger.kernel.org
cc: linux-afs@lists.infradead.org
Pull networking fixes from David Miller:
1) Fix RCU warnings in ipv6 multicast router code, from Madhuparna
Bhowmik.
2) Nexthop attributes aren't being checked properly because of
mis-initialized iterator, from David Ahern.
3) Revert iop_idents_reserve() change as it caused performance
regressions and was just working around what is really a UBSAN bug
in the compiler. From Yuqi Jin.
4) Read MAC address properly from ROM in bmac driver (double iteration
proceeds past end of address array), from Jeremy Kerr.
5) Add Microsoft Surface device IDs to r8152, from Marc Payne.
6) Prevent reference to freed SKB in __netif_receive_skb_core(), from
Boris Sukholitko.
7) Fix ACK discard behavior in rxrpc, from David Howells.
8) Preserve flow hash across packet scrubbing in wireguard, from Jason
A. Donenfeld.
9) Cap option length properly for SO_BINDTODEVICE in AX25, from Eric
Dumazet.
10) Fix encryption error checking in kTLS code, from Vadim Fedorenko.
11) Missing BPF prog ref release in flow dissector, from Jakub Sitnicki.
12) dst_cache must be used with BH disabled in tipc, from Eric Dumazet.
13) Fix use after free in mlxsw driver, from Jiri Pirko.
14) Order kTLS key destruction properly in mlx5 driver, from Tariq
Toukan.
15) Check devm_platform_ioremap_resource() return value properly in
several drivers, from Tiezhu Yang.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (71 commits)
net: smsc911x: Fix runtime PM imbalance on error
net/mlx4_core: fix a memory leak bug.
net: ethernet: ti: cpsw: fix ASSERT_RTNL() warning during suspend
net: phy: mscc: fix initialization of the MACsec protocol mode
net: stmmac: don't attach interface until resume finishes
net: Fix return value about devm_platform_ioremap_resource()
net/mlx5: Fix error flow in case of function_setup failure
net/mlx5e: CT: Correctly get flow rule
net/mlx5e: Update netdev txq on completions during closure
net/mlx5: Annotate mutex destroy for root ns
net/mlx5: Don't maintain a case of del_sw_func being null
net/mlx5: Fix cleaning unmanaged flow tables
net/mlx5: Fix memory leak in mlx5_events_init
net/mlx5e: Fix inner tirs handling
net/mlx5e: kTLS, Destroy key object after destroying the TIS
net/mlx5e: Fix allowed tc redirect merged eswitch offload cases
net/mlx5: Avoid processing commands before cmdif is ready
net/mlx5: Fix a race when moving command interface to events mode
net/mlx5: Add command entry handling completion
rxrpc: Fix a memory leak in rxkad_verify_response()
...
Fix a warning due to an uninitialised variable.
le included from ../fs/afs/fs_probe.c:11:
../fs/afs/fs_probe.c: In function 'afs_fileserver_probe_result':
../fs/afs/internal.h:1453:2: warning: 'rtt_us' may be used uninitialized in this function [-Wmaybe-uninitialized]
1453 | printk("[%-6.6s] "FMT"\n", current->comm ,##__VA_ARGS__)
| ^~~~~~
../fs/afs/fs_probe.c:35:15: note: 'rtt_us' was declared here
Signed-off-by: David Howells <dhowells@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl7FQH4ACgkQ+7dXa6fL
C2u05BAAlilXCYGZU23/yQmMxxkmcG+jW+oDV9ySNDl6iJT+OtKxHEGReDD4D2f0
rhaBivgGcOnZy89AGjNrSVROGlwSBXl7ArJcjsfsx8AuNzxHUHQKWlW/k8n87qEt
NCTze7f65IT6NowgYAFgJn5kIpY/9iKuNiCf6NGL3Z35wqxPvwNs6AQSGM495uvB
el/ddkr8QzzjI9Ejsgzj94x4DAOjk4T4WzfWMAgyr1OEqz6vKNKkCwSKPySOsQAK
72JRaGhWA9rfAOkA7nAZpnjHdfFYnkFBOVQzmswOJYRYe3D/QY5D9PUlGIQ5OSjL
yV5YOi/+AUrSif79NfEYXga0r/NFJMFqBg2zo/eiSrhfZZFZMDcagnGhzpGjbYF1
IaeIu4q/MQOQybi8m1GJhvFfPOhdKRn731jlsUvEoxK0TonSu/u64eus+qelQxOd
uiIcu/kLxfPZSznUd8cXZ+Pffce0uBIRWq0nRQZ703TyHY+/gYo7ZGHr/FZNKaK4
lRNP4Nu3goLQCI40R7y7USnpX+kWfd4mYC9zl+VBSXG1JymYbOezXYrNATBCqpo6
9VoYtqDdo8ESksFUBqM7fGRDZ20nah6KdRGmnrPU+rpODHZEZmNN7D/rayU31wua
VIbVw1WluvSVnQ8+b1BwJwvhJQ4CazyGcnbxDqx1zd07EjUiiJI=
=jJvy
-----END PGP SIGNATURE-----
Merge tag 'rxrpc-fixes-20200520' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc: Fix retransmission timeout and ACK discard
Here are a couple of fixes and an extra tracepoint for AF_RXRPC:
(1) Calculate the RTO pretty much as TCP does, rather than making
something up, including an initial 4s timeout (which causes return
probes from the fileserver to fail if a packet goes missing), and add
backoff.
(2) Fix the discarding of out-of-order received ACKs. We mustn't let the
hard-ACK point regress, nor do we want to do unnecessary
retransmission because the soft-ACK list regresses. This is not
trivial, however, due to some loose wording in various old protocol
specs, the ACK field that should be used for this sometimes has the
wrong information in it.
(3) Add a tracepoint to log a discarded ACK.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl7IByoQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpigmEACWJoK7zk5OK2RhavpzOsb2SDu8nz/YAbUe
+R6tXAjwe4Z7lVVa+FW/fmN9/mQcjyYRIbbG564IFs5fe6hPUoOjUzHqvGTOFLHd
Fw8mjVKgWjWAE5GdoX6ATauLhVwwjnImej1PNfO/J5y29o0SQksP8MbM0eGuuNx1
piqxBj0/3h3YyPn1GeJmqxwwcsFhzHDqk7fbkfbQokZk+7SPiKpqWgJBa7AKSlNC
N0WTluT4UOummQZw1RFynPfA4cCuX6XHVgWAa9h7vrJHXigvuMWqLaHG+MBFqeKu
xD6PPnaCnMwcLRe4T2sJvtjxmNSdyr15Q2kGkIi/RhohSIn4u/y8jEA6wTprCP48
rDi30dn1o2LwUj2S1NO3YCOV8jIKWUguztEvKiAXmjf4KDZIDd4/OwrFsJdb4vg9
EuK86SEwXbvFHf9nu1M7pHlGThKfQi0CiK6C6M7Qb/kOthio72wwZ46gGkwLDk5z
DZWHymHBhQw/z1c20loX7pBvFIzLzbuYUThf23UegPzXVqqQfBkqs4BGFcOGuqy6
yfEYF/MAX/O/TQgm2dDQHrhl05AevLu/UQXMXZ8Ha6OrmlC4C2qu3Te/iZO8FUew
YIx5H5XmBh93McjpmJ8VCn7CjE+y/ufNTMdvm8WzCyAIfH40gfcyLangpre26QoJ
CCAARffXrQ==
=ZYUy
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.7-2020-05-22' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"A small collection of small fixes that should go into this release:
- Two fixes for async request preparation (Pavel)
- Busy clear fix for SQPOLL (Xiaoguang)
- Don't use kiocb->private for O_DIRECT buf index, some file systems
use it (Bijan)
- Kill dead check in io_splice()
- Ensure sqo_wait is initialized early
- Cancel task_work if we fail adding to original process
- Only add (IO)pollable requests to iopoll list, fixing a regression
in this merge window"
* tag 'io_uring-5.7-2020-05-22' of git://git.kernel.dk/linux-block:
io_uring: reset -EBUSY error when io sq thread is waken up
io_uring: don't add non-IO requests to iopoll pending list
io_uring: don't use kiocb.private to store buf_index
io_uring: cancel work if task_work_add() fails
io_uring: remove dead check in io_splice()
io_uring: fix FORCE_ASYNC req preparation
io_uring: don't prepare DRAIN reqs twice
io_uring: initialize ctx->sqo_wait earlier
As Ubuntu and Fedora release new version used kernel version equal to or
higher than v5.4, They started to support kernel exfat filesystem.
Linus reported a mount error with new version of exfat on Fedora:
exfat: Unknown parameter 'namecase'
This is because there is a difference in mount option between old
staging/exfat and new exfat. And utf8, debug, and codepage options as
well as namecase have been removed from new exfat.
This patch add the dummy mount options as deprecated option to be
backward compatible with old one.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Correctly set next cursor for detailed_erase_block_info debugfs file
- Don't use crypto_shash_descsize() for digest size in UBIFS
- Remove broken lazytime support from UBIFS
-----BEGIN PGP SIGNATURE-----
iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAl7Fh08WHHJpY2hhcmRA
c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wW2WD/428LjXh+24Y3rekfnCRXG5w+es
yITAfhOmNuzn2vjS1UvCD0HsoBaS/LYbjuaceoyfXF9BG5mcrRTjFH7dVEEWFGDZ
YeRvBFkyt4xBEJtrY/6MW35KPRtnCp4Jau9HR9M5RCcQ5xzOeGtw0r/JMdZe56Av
zc2mLnZag1x5NyS4TvS30nCgj5pxVbO2bdAkyULJwBfPYs0C3TKeIul/4vjRi+57
PjyIUSR7CxpsOJde0tMjDvf23ewn1IUEW+YnewP1qk36ijRw1M6C90ERr4CU9BM5
YTEfjsxAheCItSf8r+BC70gaPBQPADtvHzPFqs9yNMSsLHYdOkkvqT8Bpwisj76d
1zL45DjZZ8UxC3HfSMFPl/dYDWvfddpffNwrimeltoAzzejI/Wk8AX0VqH1IQ3Z1
zDbz0ixP21ADATvrHUxr7UsoeEU9havGV+2sg+4wSU1aLtKIZUTjceizjkTN+9oB
ntHLuv6cS2iop22iSbJGClOv2TjpBlGQNwMDQ7TdD1a0QqxTSPRiguMmf/mDpQa/
MgQGAO6xS5NKRNiEbifniiCugLqpUQBHBPyn+q+4unmfK5sPzzLdpb3vpc0XNmbm
WgwfuMZdfmK0jO27P1/MRG6LUGxXKh5arsi6JrUJVIsdxzV3bdc2xBjkUFOOS/tH
W7fn4QS+WmbPVm09Jg==
=eCh7
-----END PGP SIGNATURE-----
Merge tag 'for-linus-5.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs
Pull UBI and UBIFS fixes from Richard Weinberger:
- Correctly set next cursor for detailed_erase_block_info debugfs file
- Don't use crypto_shash_descsize() for digest size in UBIFS
- Remove broken lazytime support from UBIFS
* tag 'for-linus-5.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
ubi: Fix seq_file usage in detailed_erase_block_info debugfs file
ubifs: fix wrong use of crypto_shash_descsize()
ubifs: remove broken lazytime support
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXsU6SgAKCRDh3BK/laaZ
PABRAP9MCZz/CLH2sEqHqH9KQHScNc4uf4bReiCU1hrLs7PbYwD/Y+vbRMffki7I
B/gt0Dg4kGxG5CV+ckeZK0+p2NWUUgQ=
=PPLW
-----END PGP SIGNATURE-----
Merge tag 'ovl-fixes-5.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs fixes from Miklos Szeredi:
"Fix two bugs introduced in this cycle and one introduced in v5.5"
* tag 'ovl-fixes-5.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: potential crash in ovl_fid_to_fh()
ovl: clear ATTR_OPEN from attr->ia_valid
ovl: clear ATTR_FILE from attr->ia_valid
In io_sq_thread(), currently if we get an -EBUSY error and go to sleep,
we will won't clear it again, which will result in io_sq_thread() will
never have a chance to submit sqes again. Below test program test.c
can reveal this bug:
int main(int argc, char *argv[])
{
struct io_uring ring;
int i, fd, ret;
struct io_uring_sqe *sqe;
struct io_uring_cqe *cqe;
struct iovec *iovecs;
void *buf;
struct io_uring_params p;
if (argc < 2) {
printf("%s: file\n", argv[0]);
return 1;
}
memset(&p, 0, sizeof(p));
p.flags = IORING_SETUP_SQPOLL;
ret = io_uring_queue_init_params(4, &ring, &p);
if (ret < 0) {
fprintf(stderr, "queue_init: %s\n", strerror(-ret));
return 1;
}
fd = open(argv[1], O_RDONLY | O_DIRECT);
if (fd < 0) {
perror("open");
return 1;
}
iovecs = calloc(10, sizeof(struct iovec));
for (i = 0; i < 10; i++) {
if (posix_memalign(&buf, 4096, 4096))
return 1;
iovecs[i].iov_base = buf;
iovecs[i].iov_len = 4096;
}
ret = io_uring_register_files(&ring, &fd, 1);
if (ret < 0) {
fprintf(stderr, "%s: register %d\n", __FUNCTION__, ret);
return ret;
}
for (i = 0; i < 10; i++) {
sqe = io_uring_get_sqe(&ring);
if (!sqe)
break;
io_uring_prep_readv(sqe, 0, &iovecs[i], 1, 0);
sqe->flags |= IOSQE_FIXED_FILE;
ret = io_uring_submit(&ring);
sleep(1);
printf("submit %d\n", i);
}
for (i = 0; i < 10; i++) {
io_uring_wait_cqe(&ring, &cqe);
printf("receive: %d\n", i);
if (cqe->res != 4096) {
fprintf(stderr, "ret=%d, wanted 4096\n", cqe->res);
ret = 1;
}
io_uring_cqe_seen(&ring, cqe);
}
close(fd);
io_uring_queue_exit(&ring);
return 0;
}
sudo ./test testfile
above command will hang on the tenth request, to fix this bug, when io
sq_thread is waken up, we reset the variable 'ret' to be zero.
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We normally disable any commands that aren't specifically poll commands
for a ring that is setup for polling, but we do allow buffer provide and
remove commands to support buffer selection for polled IO. Once a
request is issued, we add it to the poll list to poll for completion. But
we should not do that for non-IO commands, as those request complete
inline immediately and aren't pollable. If we do, we can leave requests
on the iopoll list after they are freed.
Fixes: ddf0322db7 ("io_uring: add IORING_OP_PROVIDE_BUFFERS")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull vfs fix from Al Viro:
"Stable fodder fix: copy_fdtable() would get screwed on 64bit boxen
with sysctl_nr_open raised to 512M or higher, which became possible
since 2.6.25.
Nobody sane would set the things up that way, but..."
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fix multiplication overflow in copy_fdtable()
cpy and set really should be size_t; we won't get an overflow on that,
since sysctl_nr_open can't be set above ~(size_t)0 / sizeof(void *),
so nr that would've managed to overflow size_t on that multiplication
won't get anywhere near copy_fdtable() - we'll fail with EMFILE
before that.
Cc: stable@kernel.org # v2.6.25+
Fixes: 9cfe015aa4 (get rid of NR_OPEN and introduce a sysctl_nr_open)
Reported-by: Thiago Macieira <thiago.macieira@intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
kiocb.private is used in iomap_dio_rw() so store buf_index separately.
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Move 'buf_index' to a hole in io_kiocb.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add an extra validation of the len parameter, as for ext4 some files
might have smaller file size limits than others. This also means the
redundant size check in ext4_ioctl_get_es_cache can go away, as all
size checking is done in the shared fiemap handler.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200505154324.3226743-3-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
ext4 supports max number of logical blocks in a file to be 0xffffffff.
(This is since ext4_extent's ee_block is __le32).
This means that EXT4_MAX_LOGICAL_BLOCK should be 0xfffffffe (starting
from 0 logical offset). This patch fixes this.
The issue was seen when ext4 moved to iomap_fiemap API and when
overlayfs was mounted on top of ext4. Since overlayfs was missing
filemap_check_ranges(), so it could pass a arbitrary huge length which
lead to overflow of map.m_len logic.
This patch fixes that.
Fixes: d3b6f23f71 ("ext4: move ext4_fiemap to use iomap framework")
Reported-by: syzbot+77fa5bdb65cc39711820@syzkaller.appspotmail.com
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20200505154324.3226743-2-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
- Fix potential memory leak in exfat_find.
- Set exfat's splice_write to iter_file_splice_write to fix the splice
failure on direct-opened file
-----BEGIN PGP SIGNATURE-----
iQJMBAABCgA2FiEE6NzKS6Uv/XAAGHgyZwv7A1FEIQgFAl7CCAkYHG5hbWphZS5q
ZW9uQHNhbXN1bmcuY29tAAoJEGcL+wNRRCEIX3AQAM7cV9GZecl6YfQu5AIeFbHT
uvSnvuW5O5JS9qdra4knSTthHYJ8eUucjcPlxUtHhs4oznm+erjZc9A0tRwDQyjy
EjoZZGEBOphWFLCY28K9LdJZD89JhNh9v5XUD9dId3XFnznaRjvZRHlbCVzqAWG1
DUcRedNEderpkg0FySEBIx6EHhKX6+YgkKOWlGG8r8bqdRrgZbjyAyduRdKlyX31
7XIeS4qFMDWLrqcbJdmL9pljx4VH2MswNIXK6kA2pydMwItGhod2yRWzFMYPeTDm
fTRDKzHvfA3J30h3wMI5FJu/ikfuVqsmp8i5rND7v/eRP13uuxZCSI2MfnUzHEj2
ciWxGfr5kFGg/1eAjNtOy3AnS5wsaEQ0ixYFGgKb8ENvToyT4cHa+9X2y0PrVnRu
bOyqJTBwlSisqp3DiK8aAhklHHbX1/CheGOLMj1B48H42eREUHFn/yPYroOb+Ot/
CiRH4feACSCMRGn8HdlgnguOs4zwZIWtLQWpfqhu4CJSNFa3IW6PSl53U1vPzuXG
v2Cdxn6D1gCqxsFbSmzmMJVkNfILrY7sLSU9lqrXWCQ4T6I8FpBxIvU8CCi1boQD
7hpdXstL/0xhb/gTFQL2uJ2MasQdSzVQgl6dmGK5riJkqwgaWz4FDro+IF3JxdQT
qtUZ5nd6e33pl6PwK3nt
=JN5f
-----END PGP SIGNATURE-----
Merge tag 'for-5.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat
Pull exfat fixes from Namjae Jeon:
- Fix potential memory leak in exfat_find
- Set exfat's splice_write to iter_file_splice_write to fix a splice
failure on direct-opened files
* tag 'for-5.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat:
exfat: fix possible memory leak in exfat_find()
exfat: use iter_file_splice_write
Don't call req->page_done() on each page as we finish filling it with
the data coming from the network. Whilst this might speed up the
application a bit, it's a problem if there's a network failure and the
operation has to be reissued.
If this happens, an oops occurs because afs_readpages_page_done() clears
the pointer to each page it unlocks and when a retry happens, the
pointers to the pages it wants to fill are now NULL (and the pages have
been unlocked anyway).
Instead, wait till the operation completes successfully and only then
release all the pages after clearing any terminal gap (the server can
give us less data than we requested as we're allowed to ask for more
than is available).
KASAN produces a bug like the following, and even without KASAN, it can
oops and panic.
BUG: KASAN: wild-memory-access in _copy_to_iter+0x323/0x5f4
Write of size 1404 at addr 0005088000000000 by task md5sum/5235
CPU: 0 PID: 5235 Comm: md5sum Not tainted 5.7.0-rc3-fscache+ #250
Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
Call Trace:
memcpy+0x39/0x58
_copy_to_iter+0x323/0x5f4
__skb_datagram_iter+0x89/0x2a6
skb_copy_datagram_iter+0x129/0x135
rxrpc_recvmsg_data.isra.0+0x615/0xd42
rxrpc_kernel_recv_data+0x1e9/0x3ae
afs_extract_data+0x139/0x33a
yfs_deliver_fs_fetch_data64+0x47a/0x91b
afs_deliver_to_call+0x304/0x709
afs_wait_for_call_to_complete+0x1cc/0x4ad
yfs_fs_fetch_data+0x279/0x288
afs_fetch_data+0x1e1/0x38d
afs_readpages+0x593/0x72e
read_pages+0xf5/0x21e
__do_page_cache_readahead+0x128/0x23f
ondemand_readahead+0x36e/0x37f
generic_file_buffered_read+0x234/0x680
new_sync_read+0x109/0x17e
vfs_read+0xe6/0x138
ksys_read+0xd8/0x14d
do_syscall_64+0x6e/0x8a
entry_SYSCALL_64_after_hwframe+0x49/0xb3
Fixes: 196ee9cd2d ("afs: Make afs_fs_fetch_data() take a list of pages")
Fixes: 30062bd13e ("afs: Implement YFS support in the fs client")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We currently move it to the io_wqe_manager for execution, but we cannot
safely do so as we may lack some of the state to execute it out of
context. As we cancel work anyway when the ring/task exits, just mark
this request as canceled and io_async_task_func() will do the right
thing.
Fixes: aa96bf8a9e ("io_uring: use io-wq manager as backup task if task is exiting")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
'es' is malloced from exfat_get_dentry_set() in exfat_find() and should
be freed before leaving from the error handling cases, otherwise it will
cause memory leak.
Fixes: 5f2aa07507 ("exfat: add inode operations")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Doing copy_file_range() on exfat with a file opened for direct IO leads
to an -EFAULT:
# xfs_io -f -d -c "truncate 32768" \
-c "copy_range -d 16384 -l 16384 -f 0" /mnt/test/junk
copy_range: Bad address
and the reason seems to be that we go through:
default_file_splice_write
splice_from_pipe
__splice_from_pipe
write_pipe_buf
__kernel_write
new_sync_write
generic_file_write_iter
generic_file_direct_write
exfat_direct_IO
do_blockdev_direct_IO
iov_iter_get_pages
and land in iterate_all_kinds(), which does "return -EFAULT" for our kvec
iter.
Setting exfat's splice_write to iter_file_splice_write fixes this and lets
fsx (which originally detected the problem) run to success from
the xfstests harness.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
crypto_shash_descsize() returns the size of the shash_desc context
needed to compute the hash, not the size of the hash itself.
crypto_shash_digestsize() would be correct, or alternatively using
c->hash_len and c->hmac_desc_len which already store the correct values.
But actually it's simpler to just use stack arrays, so do that instead.
Fixes: 49525e5eec ("ubifs: Add helper functions for authentication support")
Fixes: da8ef65f95 ("ubifs: Authenticate replayed journal")
Cc: <stable@vger.kernel.org> # v4.20+
Cc: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
We checked for 'force_nonblock' higher up, so it's definitely false
at this point. Kill the check, it's a remnant of when we tried to do
inline splice without always punting to async context.
Fixes: 2fb3e82284 ("io_uring: punt splice async because of inode mutex")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull execve fix from Eric Biederman:
"While working on my exec cleanups I found a bug in exec that I
introduced by accident a couple of years ago. I apparently missed the
fact that bprm->file can change.
Now I have a very personal motive to clean up exec and make it more
approachable.
The change is just moving woud_dump to where it acts on the final
bprm->file not the initial bprm->file. I have been careful and tested
and verify this fix works"
* 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
exec: Move would_dump into flush_old_exec
I goofed when I added mm->user_ns support to would_dump. I missed the
fact that in the case of binfmt_loader, binfmt_em86, binfmt_misc, and
binfmt_script bprm->file is reassigned. Which made the move of
would_dump from setup_new_exec to __do_execve_file before exec_binprm
incorrect as it can result in would_dump running on the script instead
of the interpreter of the script.
The net result is that the code stopped making unreadable interpreters
undumpable. Which allows them to be ptraced and written to disk
without special permissions. Oops.
The move was necessary because the call in set_new_exec was after
bprm->mm was no longer valid.
To correct this mistake move the misplaced would_dump from
__do_execve_file into flos_old_exec, before exec_mmap is called.
I tested and confirmed that without this fix I can attach with gdb to
a script with an unreadable interpreter, and with this fix I can not.
Cc: stable@vger.kernel.org
Fixes: f84df2a6f2 ("exec: Ensure mm->user_ns contains the execed files")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
As for other not inlined requests, alloc req->io for FORCE_ASYNC reqs,
so they can be prepared properly.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If req->io is not NULL, it's already prepared. Don't do it again,
it's dangerous.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl6/cDAQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpo1DEADMUm23Ha/X5YSXujM55hzV6MKQQ+BjNwl1
WIU0qdDkzmoiqSpcengjHPo/2CZvjlavXGaLDWKqfz1pojld9iBhCVgFaOm03Aiq
yKbG5gu5UcKNGtlE7cObhkaX8WFDtyZBuoyx+MB6F2eOdO2sbuOEV4o4bRB0w6Gt
hn91zHtQRS5WPBxxLGbj8xDOWrrhbiHhfM2QkUX9kd10D9UcOdHsdPt2hPs2L6OF
sKkjwcVEd8oR4A5l1pKEwrbm1+7Gn1H0fR8HuBtaZ3ZU65ugaPB6cj6Brn8GDBuT
JZ4ThKPqb/6jYVpxWVKCMYGLydSDZYxsQWJP7FgICqeqQgb6q+wHYScgJTPx8v4/
rJgfgWSPbxVf/kKhjFj2JYo8QDcR5R7IUlO4TxcRaIockPdRdc4kTc7EFtzodUoN
/5wWAuVuIRBicu+KBshrB3RNTT7NCcYfSom+eqMPIvWTvw1vIpK/5ULV6WZjqFbV
bPt3mUjEYV3AfQ9l39pkoclQ3YWWq718l1iuyHdD+1u/41D85UTfGHa/m9NibO5g
TP6Jj3FWLeD0UCpUiTEuUfYfdXxnP7R5SNnF+F15fucuHtRILh7rhTCqGbG4crLw
0a1fOx0wmf1JUxQmQtJLQTQ4LdUfwqBbSekKrviClc8HizvFroQyxaFMIjgRXn8X
F37nPAzswA==
=kqDx
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.7-2020-05-15' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"Two small fixes that should go into this release:
- Check and handle zero length splice (Pavel)
- Fix a regression in this merge window for fixed files used with
polled block IO"
* tag 'io_uring-5.7-2020-05-15' of git://git.kernel.dk/linux-block:
io_uring: polled fixed file must go through free iteration
io_uring: fix zero len do_splice()