Sabrina Dubroca says:
============
XFRM states and policies are complex objects, and there are many
reasons why the kernel can reject userspace's request to create
one. This series makes it a bit clearer by providing extended ack
messages for policy creation.
A few other operations that reuse the same helper functions are also
getting partial extack support in this series. More patches will
follow to complete extack support, in particular for state creation.
Note: The policy->share attribute seems to be entirely ignored in the
kernel outside of checking its value in verify_newpolicy_info(). There
are some (very) old comments in copy_from_user_policy and
copy_to_user_policy suggesting that it should at least be copied
to/from userspace. I don't know what it was intended for.
============
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
xfrm_user_rcv_msg() already handles extack, we just need to pass it down.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Allow specifying the xfrm interface if_id and link as part of a route
metadata using the lwtunnel infrastructure.
This allows for example using a single xfrm interface in collect_md
mode as the target of multiple routes each specifying a different if_id.
With the appropriate changes to iproute2, considering an xfrm device
ipsec1 in collect_md mode one can for example add a route specifying
an if_id like so:
ip route add <SUBNET> dev ipsec1 encap xfrm if_id 1
In which case traffic routed to the device via this route would use
if_id in the xfrm interface policy lookup.
Or in the context of vrf, one can also specify the "link" property:
ip route add <SUBNET> dev ipsec1 encap xfrm if_id 1 link_dev eth15
Note: LWT_XFRM_LINK uses NLA_U32 similar to IFLA_XFRM_LINK even though
internally "link" is signed. This is consistent with other _LINK
attributes in other devices as well as in bpf and should not have an
effect as device indexes can't be negative.
Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This commit adds support for 'collect_md' mode on xfrm interfaces.
Each net can have one collect_md device, created by providing the
IFLA_XFRM_COLLECT_METADATA flag at creation. This device cannot be
altered and has no if_id or link device attributes.
On transmit to this device, the if_id is fetched from the attached dst
metadata on the skb. If exists, the link property is also fetched from
the metadata. The dst metadata type used is METADATA_XFRM which holds
these properties.
On the receive side, xfrmi_rcv_cb() populates a dst metadata for each
packet received and attaches it to the skb. The if_id used in this case is
fetched from the xfrm state, and the link is fetched from the incoming
device. This information can later be used by upper layers such as tc,
ebpf, and ip rules.
Because the skb is scrubed in xfrmi_rcv_cb(), the attachment of the dst
metadata is postponed until after scrubing. Similarly, xfrm_input() is
adapted to avoid dropping metadata dsts by only dropping 'valid'
(skb_valid_dst(skb) == true) dsts.
Policy matching on packets arriving from collect_md xfrmi devices is
done by using the xfrm state existing in the skb's sec_path.
The xfrm_if_cb.decode_cb() interface implemented by xfrmi_decode_session()
is changed to keep the details of the if_id extraction tucked away
in xfrm_interface.c.
Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
XFRM interfaces provide the association of various XFRM transformations
to a netdevice using an 'if_id' identifier common to both the XFRM data
structures (polcies, states) and the interface. The if_id is configured by
the controlling entity (usually the IKE daemon) and can be used by the
administrator to define logical relations between different connections.
For example, different connections can share the if_id identifier so
that they pass through the same interface, . However, currently it is
not possible for connections using a different if_id to use the same
interface while retaining the logical separation between them, without
using additional criteria such as skb marks or different traffic
selectors.
When having a large number of connections, it is useful to have a the
logical separation offered by the if_id identifier but use a single
network interface. Similar to the way collect_md mode is used in IP
tunnels.
This patch attempts to enable different configuration mechanisms - such
as ebpf programs, LWT encapsulations, and TC - to attach metadata
to skbs which would carry the if_id. This way a single xfrm interface in
collect_md mode can demux traffic based on this configuration on tx and
provide this metadata on rx.
The XFRM metadata is somewhat similar to ip tunnel metadata in that it
has an "id", and shares similar configuration entities (bpf, tc, ...),
however, it does not necessarily represent an IP tunnel or use other
ip tunnel information, and also has an optional "link" property which
can be used for affecting underlying routing decisions.
Additional xfrm related criteria may also be added in the future.
Therefore, a new metadata type is introduced, to be used in subsequent
patches in the xfrm interface and configuration entities.
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Drop unused argument from xfrm_policy_match,
__xfrm_policy_eval_candidates and xfrm_policy_eval_candidates.
No functional changes intended.
Signed-off-by: Hongbin Wang <wh_bin@126.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
A TODO in net/ipsec.c asks to refactor the code in xfrm_fill_key() to
use set/map to avoid manually comparing each algorithm with the "name"
parameter passed to the function as an argument. This patch refactors
the code to create an array of structs where each struct contains the
algorithm name and its corresponding key length.
Signed-off-by: Gautam Menghani <gautammenghani201@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
A little longer PR than usual but it's all fixes, no late features.
It's long partially because of timing, and partially because of
follow ups to stuff that got merged a week or so before the merge
window and wasn't as widely tested. Maybe the Bluetooth fixes are
a little alarming so we'll address that, but the rest seems okay
and not scary.
Notably we're including a fix for the netfilter Kconfig [1], your
WiFi warning [2] and a bluetooth fix which should unblock syzbot [3].
Current release - regressions:
- Bluetooth:
- don't try to cancel uninitialized works [3]
- L2CAP: fix use-after-free caused by l2cap_chan_put
- tls: rx: fix device offload after recent rework
- devlink: fix UAF on failed reload and leftover locks in mlxsw
Current release - new code bugs:
- netfilter:
- flowtable: fix incorrect Kconfig dependencies [1]
- nf_tables: fix crash when nf_trace is enabled
- bpf:
- use proper target btf when exporting attach_btf_obj_id
- arm64: fixes for bpf trampoline support
- Bluetooth:
- ISO: unlock on error path in iso_sock_setsockopt()
- ISO: fix info leak in iso_sock_getsockopt()
- ISO: fix iso_sock_getsockopt for BT_DEFER_SETUP
- ISO: fix memory corruption on iso_pinfo.base
- ISO: fix not using the correct QoS
- hci_conn: fix updating ISO QoS PHY
- phy: dp83867: fix get nvmem cell fail
Previous releases - regressions:
- wifi: cfg80211: fix validating BSS pointers in
__cfg80211_connect_result [2]
- atm: bring back zatm uAPI after ATM had been removed
- properly fix old bug making bonding ARP monitor mode not being
able to work with software devices with lockless Tx
- tap: fix null-deref on skb->dev in dev_parse_header_protocol
- revert "net: usb: ax88179_178a needs FLAG_SEND_ZLP" it helps
some devices and breaks others
- netfilter:
- nf_tables: many fixes rejecting cross-object linking
which may lead to UAFs
- nf_tables: fix null deref due to zeroed list head
- nf_tables: validate variable length element extension
- bgmac: fix a BUG triggered by wrong bytes_compl
- bcmgenet: indicate MAC is in charge of PHY PM
Previous releases - always broken:
- bpf:
- fix bad pointer deref in bpf_sys_bpf() injected via test infra
- disallow non-builtin bpf programs calling the prog_run command
- don't reinit map value in prealloc_lru_pop
- fix UAFs during the read of map iterator fd
- fix invalidity check for values in sk local storage map
- reject sleepable program for non-resched map iterator
- mptcp:
- move subflow cleanup in mptcp_destroy_common()
- do not queue data on closed subflows
- virtio_net: fix memory leak inside XDP_TX with mergeable
- vsock: fix memory leak when multiple threads try to connect()
- rework sk_user_data sharing to prevent psock leaks
- geneve: fix TOS inheriting for ipv4
- tunnels & drivers: do not use RT_TOS for IPv6 flowlabel
- phy: c45 baset1: do not skip aneg configuration if clock role
is not specified
- rose: avoid overflow when /proc displays timer information
- x25: fix call timeouts in blocking connects
- can: mcp251x: fix race condition on receive interrupt
- can: j1939:
- replace user-reachable WARN_ON_ONCE() with netdev_warn_once()
- fix memory leak of skbs in j1939_session_destroy()
Misc:
- docs: bpf: clarify that many things are not uAPI
- seg6: initialize induction variable to first valid array index
(to silence clang vs objtool warning)
- can: ems_usb: fix clang 14's -Wunaligned-access warning
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmL1TtkACgkQMUZtbf5S
Iruz8Q/+O5xFFsjxuyZD0Mw9d3Jeo3ZI9PeeDvcYl5dZXVegpxqorujTFntxv1Ad
JC8o5qqms3kO51d+W/yai6iDacEHX2YcJrupZve+vGvpOEVmBRY5O0E1AckJ18+u
ItmjSVESkybUP5P08/An7Y0dMmj9Xb2z84dGkLe+n8lg6/fimo6Ki6yZjcOBOALu
AYquMXUcnwztRMbTFjscbJjBd4xFMKZEtthljYtPdIReIN976wmMNYYx+jcPK7ha
g39Kv6maklp4euerkGIJ/AMnOWHaOGCFjIaz7rr4444NDfrKdt/jeirUXJaz77Jo
TJM2UOwgOeg6WZkSa3cmdq6UdjdkJ6LTe2CJFf1wJ1qfhAi+s8yWoszsM2Enp+66
c/mo9jTCMAjmgEJF11idZuz2S697/5j0hvbfM3ZPgNyNBgn8qxz/Z56fNOisx95u
TkoKKFnGH+mcm/et+omBcyLBtBVK2+/6B6mpl6btf4DOkPn5KFYWHV67uV3ksHzQ
ye+pnzidoIG0yKbRM2EQKXk7ELKROpl52xUHko93ZinMJt0Q7jBm7tZhJozNFEzi
hWgUvpmNXgawzLYQcJ9jJmKw3PmYZnRhvYZB/1r91YamM28Hd58k9WfpWtUtjYJN
N0X58L6JSnKPqzR70pcFppz6iBlh0tHdcEQGWhhKU5ScS3FDxGc=
=C5Ck
-----END PGP SIGNATURE-----
Merge tag 'net-6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bluetooth, bpf, can and netfilter.
A little larger than usual but it's all fixes, no late features. It's
large partially because of timing, and partially because of follow ups
to stuff that got merged a week or so before the merge window and
wasn't as widely tested. Maybe the Bluetooth fixes are a little
alarming so we'll address that, but the rest seems okay and not scary.
Notably we're including a fix for the netfilter Kconfig [1], your WiFi
warning [2] and a bluetooth fix which should unblock syzbot [3].
Current release - regressions:
- Bluetooth:
- don't try to cancel uninitialized works [3]
- L2CAP: fix use-after-free caused by l2cap_chan_put
- tls: rx: fix device offload after recent rework
- devlink: fix UAF on failed reload and leftover locks in mlxsw
Current release - new code bugs:
- netfilter:
- flowtable: fix incorrect Kconfig dependencies [1]
- nf_tables: fix crash when nf_trace is enabled
- bpf:
- use proper target btf when exporting attach_btf_obj_id
- arm64: fixes for bpf trampoline support
- Bluetooth:
- ISO: unlock on error path in iso_sock_setsockopt()
- ISO: fix info leak in iso_sock_getsockopt()
- ISO: fix iso_sock_getsockopt for BT_DEFER_SETUP
- ISO: fix memory corruption on iso_pinfo.base
- ISO: fix not using the correct QoS
- hci_conn: fix updating ISO QoS PHY
- phy: dp83867: fix get nvmem cell fail
Previous releases - regressions:
- wifi: cfg80211: fix validating BSS pointers in
__cfg80211_connect_result [2]
- atm: bring back zatm uAPI after ATM had been removed
- properly fix old bug making bonding ARP monitor mode not being able
to work with software devices with lockless Tx
- tap: fix null-deref on skb->dev in dev_parse_header_protocol
- revert "net: usb: ax88179_178a needs FLAG_SEND_ZLP" it helps some
devices and breaks others
- netfilter:
- nf_tables: many fixes rejecting cross-object linking which may
lead to UAFs
- nf_tables: fix null deref due to zeroed list head
- nf_tables: validate variable length element extension
- bgmac: fix a BUG triggered by wrong bytes_compl
- bcmgenet: indicate MAC is in charge of PHY PM
Previous releases - always broken:
- bpf:
- fix bad pointer deref in bpf_sys_bpf() injected via test infra
- disallow non-builtin bpf programs calling the prog_run command
- don't reinit map value in prealloc_lru_pop
- fix UAFs during the read of map iterator fd
- fix invalidity check for values in sk local storage map
- reject sleepable program for non-resched map iterator
- mptcp:
- move subflow cleanup in mptcp_destroy_common()
- do not queue data on closed subflows
- virtio_net: fix memory leak inside XDP_TX with mergeable
- vsock: fix memory leak when multiple threads try to connect()
- rework sk_user_data sharing to prevent psock leaks
- geneve: fix TOS inheriting for ipv4
- tunnels & drivers: do not use RT_TOS for IPv6 flowlabel
- phy: c45 baset1: do not skip aneg configuration if clock role is
not specified
- rose: avoid overflow when /proc displays timer information
- x25: fix call timeouts in blocking connects
- can: mcp251x: fix race condition on receive interrupt
- can: j1939:
- replace user-reachable WARN_ON_ONCE() with netdev_warn_once()
- fix memory leak of skbs in j1939_session_destroy()
Misc:
- docs: bpf: clarify that many things are not uAPI
- seg6: initialize induction variable to first valid array index (to
silence clang vs objtool warning)
- can: ems_usb: fix clang 14's -Wunaligned-access warning"
* tag 'net-6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (117 commits)
net: atm: bring back zatm uAPI
dpaa2-eth: trace the allocated address instead of page struct
net: add missing kdoc for struct genl_multicast_group::flags
nfp: fix use-after-free in area_cache_get()
MAINTAINERS: use my korg address for mt7601u
mlxsw: minimal: Fix deadlock in ports creation
bonding: fix reference count leak in balance-alb mode
net: usb: qmi_wwan: Add support for Cinterion MV32
bpf: Shut up kern_sys_bpf warning.
net/tls: Use RCU API to access tls_ctx->netdev
tls: rx: device: don't try to copy too much on detach
tls: rx: device: bound the frag walk
net_sched: cls_route: remove from list when handle is 0
selftests: forwarding: Fix failing tests with old libnet
net: refactor bpf_sk_reuseport_detach()
net: fix refcount bug in sk_psock_get (2)
selftests/bpf: Ensure sleepable program is rejected by hash map iter
selftests/bpf: Add write tests for sk local storage map iterator
selftests/bpf: Add tests for reading a dangling map iter fd
bpf: Only allow sleepable program for resched-able iterator
...
- Replace direct references to the fwnode field in struct device with
dev_fwnode() and device_match_fwnode() (Andy Shevchenko).
- Make the ACPI code handling device properties support properties
with buffer values (Sakari Ailus).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmL1PkQSHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxFWsQAIlWhHFfeDboSPK9EUngbyrJFbEC2cpn
3RSwTfJkpJNdN1YzCZGiTug//ImAncS9z0Lip4A5in50sfGbWtkQS54fGn5t1e+l
HL+PGDIM07ye+aUXgnjSmg/wXoAYSCg+vURQeB/8GphRbWwXwyYBV8iqrFnMEfr+
EX5lQTtKtFvjdMbO4wmpdydsCueZcgLKkc57T1WoiWvoggKGVcVs0OkeDAOH4/Kw
Gwt9cAl69jbFE3R4fFtdzySZlTnTAA8J6Mfer8yOnqcnL6JneI+GpOVsJQ06n7d3
JgPLeQNlYR6TGGmZIC0u5rsIDXaiT4//Yjnca3vHmUy1hqvsyyUVHW1t0JiPaj4h
EGdpl4XP9+4CJyXeo9u2RM/dJF94nFtqHgJKdUuFS2wYXytrUny+jvccfaeEXCUa
N7Z3IV9JIJeuQKDTy5EyngXwXa+FfpjrrrZr17j7goJ+xQ/zf4oAFQejiuv70aqQ
5v9iwwZ2Dco/n/oDMKP8tUOplzPi4V42HPAmffvc8bheEE5IOiumkDB6KgVciZa5
/M91nTnjaTMtH+ObGfwe2hG8HIskI+O/5IO/bWv9IKikBFezpBaRDXXHpfBbOefF
64SVGmeoDkMCvmPQEGqSfx/PcltIY13wqtcJ651xKkUee8tbx76giPWpeAAguqWl
bS3Q7ndP+gC3
=iaSF
-----END PGP SIGNATURE-----
Merge tag 'acpi-5.20-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more ACPI updates from Rafael Wysocki:
"These fix up direct references to the fwnode field in struct device
and extend ACPI device properties support.
Specifics:
- Replace direct references to the fwnode field in struct device with
dev_fwnode() and device_match_fwnode() (Andy Shevchenko)
- Make the ACPI code handling device properties support properties
with buffer values (Sakari Ailus)"
* tag 'acpi-5.20-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: property: Fix error handling in acpi_init_properties()
ACPI: VIOT: Do not dereference fwnode in struct device
ACPI: property: Read buffer properties as integers
ACPI: property: Add support for parsing buffer property UUID
ACPI: property: Unify integer value reading functions
ACPI: property: Switch node property referencing from ifs to a switch
ACPI: property: Move property ref argument parsing into a new function
ACPI: property: Use acpi_object_type consistently in property ref parsing
ACPI: property: Tie data nodes to acpi handles
ACPI: property: Return type of acpi_add_nondev_subnodes() should be bool
- Remove iomap_writepage and all callers, since the mm apparently never
called the zonefs or gfs2 writepage functions.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmL1H7kACgkQ+H93GTRK
tOvzfw/+JJQM3WjwCUg+11O9E+oKS3wbczr0yAd2m8j+EqapdndXzIVevcZKXoTx
K4zOK9oDecPtRKgQkvrDt7HrMB7oYv8tuSzyfcsNVHbMA6U3twkLdr5c19/lm9uj
rnP2Xrs0RkiiFpImmTHsviPEyzniJ+BjtRDF7FxSFELxREae4EQW3YX2MjffvqQA
dT+xXptWiOSa3ygwfoGqVeOLOMt0DqXICiV0GLrGxD6S7TLRRIPo7ojYS4703vUL
VFTAUvhC4CD9/vsEwPnl91Jq2s06tO3LE4V6vJDPI7/uQFPcubLmcK8GpaYB6+OQ
q9Fhpc9cU/3JTKt6Sw9uNOqA5hfUKBdJmhWE3FqZ2arql2C9tY2o+cHvRBKZWMZ9
FdLKSwsuDpL+pYsWOPn7wU8BHZVTDDl7CtDNTCurNkkNgaAbK8C0X7QcT16RRyDF
SAPHlg0XFewLgJ+9HNyDv70VT1VLYiJNq/h0d/EMO1+FuT4ArBOTOSe4zNNXqD3w
vVFtbBhjGMf1ffqiMM5GdOPh0vxacL8jfxM7xyQ4yooSkecZCEvtNnuCysNTFDbl
53b9bjk+OSuWCb7efE6p82wU+gr617Zp2/YxALl4E0FlozeRHuRimWBtABZqi/g6
aKJL42ASY+PLJPACDjo0LhDFuCRbd75OATUGtBva7mkYWUANlMc=
=FuyV
-----END PGP SIGNATURE-----
Merge tag 'iomap-6.0-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull more iomap updates from Darrick Wong:
"In the past 10 days or so I've not heard any ZOMG STOP style
complaints about removing ->writepage support from gfs2 or zonefs, so
here's the pull request removing them (and the underlying fs iomap
support) from the kernel:
- Remove iomap_writepage and all callers, since the mm apparently
never called the zonefs or gfs2 writepage functions"
* tag 'iomap-6.0-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
iomap: remove iomap_writepage
zonefs: remove ->writepage
gfs2: remove ->writepage
gfs2: stop using generic_writepages in gfs2_ail1_start_one
Luis and others, almost exclusively in the filesystem. Several patches
touch files outside of our normal purview to set the stage for bringing
in Jeff's long awaited ceph+fscrypt series in the near future. All of
them have appropriate acks and sat in linux-next for a while.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmL1HF8THGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHziwOuB/97JKHFuOlP1HrD6fYe5a0ul9zC9VG4
57XPDNqG2PSmfXCjvZhyVU4n53sUlJTqzKDSTXydoPCMQjtyHvysA6gEvcgUJFPd
PHaZDCd9TmqX8my67NiTK70RVpNR9BujJMVMbOfM+aaisl0K6WQbitO+BfhEiJcK
QStdKm5lPyf02ESH9jF+Ga0DpokARaLbtDFH7975owxske6gWuoPBCJNrkMooKiX
LjgEmNgH1F/sJSZXftmKdlw9DtGBFaLQBdfbfSB5oVPRb7chI7xBeraNr6Od3rls
o4davbFkcsOr+s6LJPDH2BJobmOg+HoMoma7ezspF7ZqBF4Uipv5j3VC
=1427
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-5.20-rc1' of https://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"We have a good pile of various fixes and cleanups from Xiubo, Jeff,
Luis and others, almost exclusively in the filesystem.
Several patches touch files outside of our normal purview to set the
stage for bringing in Jeff's long awaited ceph+fscrypt series in the
near future. All of them have appropriate acks and sat in linux-next
for a while"
* tag 'ceph-for-5.20-rc1' of https://github.com/ceph/ceph-client: (27 commits)
libceph: clean up ceph_osdc_start_request prototype
libceph: fix ceph_pagelist_reserve() comment typo
ceph: remove useless check for the folio
ceph: don't truncate file in atomic_open
ceph: make f_bsize always equal to f_frsize
ceph: flush the dirty caps immediatelly when quota is approaching
libceph: print fsid and epoch with osd id
libceph: check pointer before assigned to "c->rules[]"
ceph: don't get the inline data for new creating files
ceph: update the auth cap when the async create req is forwarded
ceph: make change_auth_cap_ses a global symbol
ceph: fix incorrect old_size length in ceph_mds_request_args
ceph: switch back to testing for NULL folio->private in ceph_dirty_folio
ceph: call netfs_subreq_terminated with was_async == false
ceph: convert to generic_file_llseek
ceph: fix the incorrect comment for the ceph_mds_caps struct
ceph: don't leak snap_rwsem in handle_cap_grant
ceph: prevent a client from exceeding the MDS maximum xattr size
ceph: choose auth MDS for getxattr with the Xs caps
ceph: add session already open notify support
...
* Documentation formatting fixes
* Make rseq selftest compatible with glibc-2.35
* Fix handling of illegal LEA reg, reg
* Cleanup creation of debugfs entries
* Fix steal time cache handling bug
* Fixes for MMIO caching
* Optimize computation of number of LBRs
* Fix uninitialized field in guest_maxphyaddr < host_maxphyaddr path
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmL0qwIUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroML1gf/SK6by+Gi0r7WSkrDjU94PKZ8D6Y3
fErMhratccc9IfL3p90IjCVhEngfdQf5UVHExA5TswgHHAJTpECzuHya9TweQZc5
2rrTvufup0MNALfzkSijrcI80CBvrJc6JyOCkv0BLp7yqXUrnrm0OOMV2XniS7y0
YNn2ZCy44tLqkNiQrLhJQg3EsXu9l7okGpHSVO6iZwC7KKHvYkbscVFa/AOlaAwK
WOZBB+1Ee+/pWhxsngM1GwwM3ZNU/jXOSVjew5plnrD4U7NYXIDATszbZAuNyxqV
5gi+wvTF1x9dC6Tgd3qF7ouAqtT51BdRYaI9aYHOYgvzqdNFHWJu3XauDQ==
=vI6Q
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull more kvm updates from Paolo Bonzini:
- Xen timer fixes
- Documentation formatting fixes
- Make rseq selftest compatible with glibc-2.35
- Fix handling of illegal LEA reg, reg
- Cleanup creation of debugfs entries
- Fix steal time cache handling bug
- Fixes for MMIO caching
- Optimize computation of number of LBRs
- Fix uninitialized field in guest_maxphyaddr < host_maxphyaddr path
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (26 commits)
KVM: x86/MMU: properly format KVM_CAP_VM_DISABLE_NX_HUGE_PAGES capability table
Documentation: KVM: extend KVM_CAP_VM_DISABLE_NX_HUGE_PAGES heading underline
KVM: VMX: Adjust number of LBR records for PERF_CAPABILITIES at refresh
KVM: VMX: Use proper type-safe functions for vCPU => LBRs helpers
KVM: x86: Refresh PMU after writes to MSR_IA32_PERF_CAPABILITIES
KVM: selftests: Test all possible "invalid" PERF_CAPABILITIES.LBR_FMT vals
KVM: selftests: Use getcpu() instead of sched_getcpu() in rseq_test
KVM: selftests: Make rseq compatible with glibc-2.35
KVM: Actually create debugfs in kvm_create_vm()
KVM: Pass the name of the VM fd to kvm_create_vm_debugfs()
KVM: Get an fd before creating the VM
KVM: Shove vcpu stats_id init into kvm_vcpu_init()
KVM: Shove vm stats_id init into kvm_create_vm()
KVM: x86/mmu: Add sanity check that MMIO SPTE mask doesn't overlap gen
KVM: x86/mmu: rename trace function name for asynchronous page fault
KVM: x86/xen: Stop Xen timer before changing IRQ
KVM: x86/xen: Initialize Xen timer only once
KVM: SVM: Disable SEV-ES support if MMIO caching is disable
KVM: x86/mmu: Fully re-evaluate MMIO caching when SPTE masks change
KVM: x86: Tag kvm_mmu_x86_module_init() with __init
...
We should trace the allocated address instead of page struct.
Fixes: 27c874867c ("dpaa2-eth: Use a single page per Rx buffer")
Signed-off-by: Chen Lin <chen.lin5@zte.com.cn>
Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Link: https://lore.kernel.org/r/20220811151651.3327-1-chen45464546@163.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge changes adding support for device properties with buffer values
to the ACPI device properties handling code.
* acpi-properties:
ACPI: property: Fix error handling in acpi_init_properties()
ACPI: property: Read buffer properties as integers
ACPI: property: Add support for parsing buffer property UUID
ACPI: property: Unify integer value reading functions
ACPI: property: Switch node property referencing from ifs to a switch
ACPI: property: Move property ref argument parsing into a new function
ACPI: property: Use acpi_object_type consistently in property ref parsing
ACPI: property: Tie data nodes to acpi handles
ACPI: property: Return type of acpi_add_nondev_subnodes() should be bool
Multicast group flags were added in commit 4d54cc3211 ("mptcp: avoid
lock_fast usage in accept path"), but it missed adding the kdoc.
Mention which flags go into that field, and do the same for
op structs.
Link: https://lore.kernel.org/r/20220809232012.403730-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- changes to input core to properly queue synthetic events (such as
autorepeat) and to release multitouch contacts when an input device is
inhibited or suspended
- reworked quirk handling in i8042 driver that consolidates multiple
DMI tables into one and adds several quirks for TUXEDO line of
laptops
- update to mt6779 keypad to better reflect organization of the hardware
- changes to mtk-pmic-keys driver preparing it to handle more variants
- facelift of adp5588-keys driver
- improvements to iqs7222 driver
- adjustments to various DT binding documents for input devices
- other assorted driver fixes.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQST2eWILY88ieB2DOtAj56VGEWXnAUCYvFsGAAKCRBAj56VGEWX
nMI6AQDQUwQpKtmCoDmxLTi/8Oy7qq0j9Sn8FNyWOaFykK3iTAD/eJNsY+PSn+M2
bPzg1bduYK/eqZkomMaL2dAytHpr9AM=
=XmKz
-----END PGP SIGNATURE-----
Merge tag 'input-for-v5.20-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Pull input updates from Dmitry Torokhov:
- changes to input core to properly queue synthetic events (such as
autorepeat) and to release multitouch contacts when an input device
is inhibited or suspended
- reworked quirk handling in i8042 driver that consolidates multiple
DMI tables into one and adds several quirks for TUXEDO line of
laptops
- update to mt6779 keypad to better reflect organization of the
hardware
- changes to mtk-pmic-keys driver preparing it to handle more variants
- facelift of adp5588-keys driver
- improvements to iqs7222 driver
- adjustments to various DT binding documents for input devices
- other assorted driver fixes.
* tag 'input-for-v5.20-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (54 commits)
Input: adc-joystick - fix ordering in adc_joystick_probe()
dt-bindings: input: ariel-pwrbutton: use spi-peripheral-props.yaml
Input: deactivate MT slots when inhibiting or suspending devices
Input: properly queue synthetic events
dt-bindings: input: iqs7222: Use central 'linux,code' definition
Input: i8042 - add dritek quirk for Acer Aspire One AO532
dt-bindings: input: gpio-keys: accept also interrupt-extended
dt-bindings: input: gpio-keys: reference input.yaml and document properties
dt-bindings: input: gpio-keys: enforce node names to match all properties
dt-bindings: input: Convert adc-keys to DT schema
dt-bindings: input: Centralize 'linux,input-type' definition
dt-bindings: input: Use common 'linux,keycodes' definition
dt-bindings: input: Centralize 'linux,code' definition
dt-bindings: input: Increase maximum keycode value to 0x2ff
Input: mt6779-keypad - implement row/column selection
Input: mt6779-keypad - match hardware matrix organization
Input: i8042 - add additional TUXEDO devices to i8042 quirk tables
Input: goodix - switch use of acpi_gpio_get_*_resource() APIs
Input: i8042 - add TUXEDO devices to i8042 quirk tables
Input: i8042 - add debug output for quirks
...
area_cache_get() is used to distribute cache->area and set cache->id,
and if cache->id is not 0 and cache->area->kref refcount is 0, it will
release the cache->area by nfp_cpp_area_release(). area_cache_get()
set cache->id before cpp->op->area_init() and nfp_cpp_area_acquire().
But if area_init() or nfp_cpp_area_acquire() fails, the cache->id is
is already set but the refcount is not increased as expected. At this
time, calling the nfp_cpp_area_release() will cause use-after-free.
To avoid the use-after-free, set cache->id after area_init() and
nfp_cpp_area_acquire() complete successfully.
Note: This vulnerability is triggerable by providing emulated device
equipped with specified configuration.
BUG: KASAN: use-after-free in nfp6000_area_init (drivers/net/ethernet/netronome/nfp/nfpcore/nfp6000_pcie.c:760)
Write of size 4 at addr ffff888005b7f4a0 by task swapper/0/1
Call Trace:
<TASK>
nfp6000_area_init (drivers/net/ethernet/netronome/nfp/nfpcore/nfp6000_pcie.c:760)
area_cache_get.constprop.8 (drivers/net/ethernet/netronome/nfp/nfpcore/nfp_cppcore.c:884)
Allocated by task 1:
nfp_cpp_area_alloc_with_name (drivers/net/ethernet/netronome/nfp/nfpcore/nfp_cppcore.c:303)
nfp_cpp_area_cache_add (drivers/net/ethernet/netronome/nfp/nfpcore/nfp_cppcore.c:802)
nfp6000_init (drivers/net/ethernet/netronome/nfp/nfpcore/nfp6000_pcie.c:1230)
nfp_cpp_from_operations (drivers/net/ethernet/netronome/nfp/nfpcore/nfp_cppcore.c:1215)
nfp_pci_probe (drivers/net/ethernet/netronome/nfp/nfp_main.c:744)
Freed by task 1:
kfree (mm/slub.c:4562)
area_cache_get.constprop.8 (drivers/net/ethernet/netronome/nfp/nfpcore/nfp_cppcore.c:873)
nfp_cpp_read (drivers/net/ethernet/netronome/nfp/nfpcore/nfp_cppcore.c:924 drivers/net/ethernet/netronome/nfp/nfpcore/nfp_cppcore.c:973)
nfp_cpp_readl (drivers/net/ethernet/netronome/nfp/nfpcore/nfp_cpplib.c:48)
Signed-off-by: Jialiang Wang <wangjialiang0806@163.com>
Reviewed-by: Yinjun Zhang <yinjun.zhang@corigine.com>
Acked-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20220810073057.4032-1-wangjialiang0806@163.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Drop devl_lock() / devl_unlock() from ports creation and removal flows
since the devlink instance lock is now taken by mlxsw_core.
Fixes: 72a4c8c94e ("mlxsw: convert driver to use unlocked devlink API during init/fini")
Signed-off-by: Vadim Pasternak <vadimp@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/f4afce5ab0318617f3866b85274be52542d59b32.1660211614.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit d5410ac7b0 ("net:bonding:support balance-alb interface
with vlan to bridge") introduced a reference count leak by not releasing
the reference acquired by ip_dev_find(). Remedy this by insuring the
reference is released.
Fixes: d5410ac7b0 ("net:bonding:support balance-alb interface with vlan to bridge")
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://lore.kernel.org/r/26758.1660194413@famine
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This reverts commit 258fafcd06.
The clang -Wformat warning is terminally broken, and the clang people
can't seem to get their act together.
This test program causes a warning with clang:
#include <stdio.h>
int main(int argc, char **argv)
{
printf("%hhu\n", 'a');
}
resulting in
t.c:5:19: warning: format specifies type 'unsigned char' but the argument has type 'int' [-Wformat]
printf("%hhu\n", 'a');
~~~~ ^~~
%d
and apparently clang people consider that a feature, because they don't
want to face the reality of how either C character constants, C
arithmetic, and C varargs functions work.
The rest of the world just shakes their head at that kind of
incompetence, and turns off -Wformat for clang again.
And no, the "you should use a pointless cast to shut this up" is not a
valid answer. That warning should not exist in the first place, or at
least be optinal with some "-Wformat-me-harder" kind of option.
[ Admittedly, there's also very little reason to *ever* use '%hh[ud]' in
C, but what little reason there is is entirely about 'I want to see
only the low 8 bits of the argument'. So I would suggest nobody ever
use that format in the first place, but if they do, the clang
behavious is simply always wrong. Because '%hhu' takes an 'int'. It's
that simple. ]
Reported-by: Sudip Mukherjee (Codethink) <sudipm.mukherjee@gmail.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Shut up this warning:
kernel/bpf/syscall.c:5089:5: warning: no previous prototype for function 'kern_sys_bpf' [-Wmissing-prototypes]
int kern_sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)
Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
There is unexpected warning on KVM_CAP_VM_DISABLE_NX_HUGE_PAGES capability
table, which cause the table to be rendered as paragraph text instead.
The warning is due to missing colon at capability name and returns keyword,
as well as improper alignment on multi-line returns field.
Fix the warning by adding missing colons and aligning the field.
Link: https://lore.kernel.org/lkml/20220627181937.3be67263@canb.auug.org.au/
Fixes: 084cc29f8b ("KVM: x86/MMU: Allow NX huge pages to be disabled on a per-vm basis")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: David Matlack <dmatlack@google.com>
Cc: Ben Gardon <bgardon@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: kvm@vger.kernel.org
Cc: linux-next@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Message-Id: <20220627095151.19339-3-bagasdotme@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Extend heading underline for KVM_CAP_VM_DISABLE_NX_HUGE_PAGE to match
the heading text length.
Link: https://lore.kernel.org/lkml/20220627181937.3be67263@canb.auug.org.au/
Fixes: 084cc29f8b ("KVM: x86/MMU: Allow NX huge pages to be disabled on a per-vm basis")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: David Matlack <dmatlack@google.com>
Cc: Ben Gardon <bgardon@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: kvm@vger.kernel.org
Cc: linux-next@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Message-Id: <20220627095151.19339-2-bagasdotme@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, tls_device_down synchronizes with tls_device_resync_rx using
RCU, however, the pointer to netdev is stored using WRITE_ONCE and
loaded using READ_ONCE.
Although such approach is technically correct (rcu_dereference is
essentially a READ_ONCE, and rcu_assign_pointer uses WRITE_ONCE to store
NULL), using special RCU helpers for pointers is more valid, as it
includes additional checks and might change the implementation
transparently to the callers.
Mark the netdev pointer as __rcu and use the correct RCU helpers to
access it. For non-concurrent access pass the right conditions that
guarantee safe access (locks taken, refcount value). Also use the
correct helper in mlx5e, where even READ_ONCE was missing.
The transition to RCU exposes existing issues, fixed by this commit:
1. bond_tls_device_xmit could read netdev twice, and it could become
NULL the second time, after the NULL check passed.
2. Drivers shouldn't stop processing the last packet if tls_device_down
just set netdev to NULL, before tls_dev_del was called. This prevents a
possible packet drop when transitioning to the fallback software mode.
Fixes: 89df6a8104 ("net/bonding: Implement TLS TX device offload")
Fixes: c55dcdd435 ("net/tls: Fix use-after-free after the TLS device goes down and up")
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Link: https://lore.kernel.org/r/20220810081602.1435800-1-maximmi@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Another device offload bug, we use the length of the output
skb as an indication of how much data to copy. But that skb
is sized to offset + record length, and we start from offset.
So we end up double-counting the offset which leads to
skb_copy_bits() returning -EFAULT.
Reported-by: Tariq Toukan <tariqt@nvidia.com>
Fixes: 84c61fe1a7 ("tls: rx: do not use the standard strparser")
Tested-by: Ran Rozenstein <ranro@nvidia.com>
Link: https://lore.kernel.org/r/20220809175544.354343-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We can't do skb_walk_frags() on the input skbs, because
the input skbs is really just a pointer to the tcp read
queue. We need to bound the "is decrypted" check by the
amount of data in the message.
Note that the walk in tls_device_reencrypt() is after a
CoW so the skb there is safe to walk. Actually in the
current implementation it can't have frags at all, but
whatever, maybe one day it will.
Reported-by: Tariq Toukan <tariqt@nvidia.com>
Fixes: 84c61fe1a7 ("tls: rx: do not use the standard strparser")
Tested-by: Ran Rozenstein <ranro@nvidia.com>
Link: https://lore.kernel.org/r/20220809175544.354343-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When a route filter is replaced and the old filter has a 0 handle, the old
one won't be removed from the hashtable, while it will still be freed.
The test was there since before commit 1109c00547 ("net: sched: RCU
cls_route"), when a new filter was not allocated when there was an old one.
The old filter was reused and the reinserting would only be necessary if an
old filter was replaced. That was still wrong for the same case where the
old handle was 0.
Remove the old filter from the list independently from its handle value.
This fixes CVE-2022-2588, also reported as ZDI-CAN-17440.
Reported-by: Zhenpeng Lin <zplin@u.northwestern.edu>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Reviewed-by: Kamal Mostafa <kamal@canonical.com>
Cc: <stable@vger.kernel.org>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20220809170518.164662-1-cascardo@canonical.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The custom multipath hash tests use mausezahn in order to test how
changes in various packet fields affect the packet distribution across
the available nexthops.
The tool uses the libnet library for various low-level packet
construction and injection. The library started using the
"SO_BINDTODEVICE" socket option for IPv6 sockets in version 1.1.6 and
for IPv4 sockets in version 1.2.
When the option is not set, packets are not routed according to the
table associated with the VRF master device and tests fail.
Fix this by prefixing the command with "ip vrf exec", which will cause
the route lookup to occur in the VRF routing table. This makes the tests
pass regardless of the libnet library version.
Fixes: 511e8db540 ("selftests: forwarding: Add test for custom multipath hash")
Fixes: 185b0c190b ("selftests: forwarding: Add test for custom multipath hash with IPv4 GRE")
Fixes: b7715acba4 ("selftests: forwarding: Add test for custom multipath hash with IPv6 GRE")
Reported-by: Ivan Vecera <ivecera@redhat.com>
Tested-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Link: https://lore.kernel.org/r/20220809113320.751413-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Borkmann says:
====================
bpf 2022-08-10
We've added 23 non-merge commits during the last 7 day(s) which contain
a total of 19 files changed, 424 insertions(+), 35 deletions(-).
The main changes are:
1) Several fixes for BPF map iterator such as UAFs along with selftests, from Hou Tao.
2) Fix BPF syscall program's {copy,strncpy}_from_bpfptr() to not fault, from Jinghao Jia.
3) Reject BPF syscall programs calling BPF_PROG_RUN, from Alexei Starovoitov and YiFei Zhu.
4) Fix attach_btf_obj_id info to pick proper target BTF, from Stanislav Fomichev.
5) BPF design Q/A doc update to clarify what is not stable ABI, from Paul E. McKenney.
6) Fix BPF map's prealloc_lru_pop to not reinitialize, from Kumar Kartikeya Dwivedi.
7) Fix bpf_trampoline_put to avoid leaking ftrace hash, from Jiri Olsa.
8) Fix arm64 JIT to address sparse errors around BPF trampoline, from Xu Kuohai.
9) Fix arm64 JIT to use kvcalloc instead of kcalloc for internal program address
offset buffer, from Aijun Sun.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (23 commits)
selftests/bpf: Ensure sleepable program is rejected by hash map iter
selftests/bpf: Add write tests for sk local storage map iterator
selftests/bpf: Add tests for reading a dangling map iter fd
bpf: Only allow sleepable program for resched-able iterator
bpf: Check the validity of max_rdwr_access for sock local storage map iterator
bpf: Acquire map uref in .init_seq_private for sock{map,hash} iterator
bpf: Acquire map uref in .init_seq_private for sock local storage map iterator
bpf: Acquire map uref in .init_seq_private for hash map iterator
bpf: Acquire map uref in .init_seq_private for array map iterator
bpf: Disallow bpf programs call prog_run command.
bpf, arm64: Fix bpf trampoline instruction endianness
selftests/bpf: Add test for prealloc_lru_pop bug
bpf: Don't reinit map value in prealloc_lru_pop
bpf: Allow calling bpf_prog_test kfuncs in tracing programs
bpf, arm64: Allocate program buffer using kvcalloc instead of kcalloc
selftests/bpf: Excercise bpf_obj_get_info_by_fd for bpf2bpf
bpf: Use proper target btf when exporting attach_btf_obj_id
mptcp, btf: Add struct mptcp_sock definition when CONFIG_MPTCP is disabled
bpf: Cleanup ftrace hash in bpf_trampoline_put
BPF: Fix potential bad pointer dereference in bpf_sys_bpf()
...
====================
Link: https://lore.kernel.org/r/20220810190624.10748-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Refactor sk_user_data dereference using more generic function
__rcu_dereference_sk_user_data_with_flags(), which improve its
maintainability
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Hawkins Jiawei <yin31149@gmail.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Syzkaller reports refcount bug as follows:
------------[ cut here ]------------
refcount_t: saturated; leaking memory.
WARNING: CPU: 1 PID: 3605 at lib/refcount.c:19 refcount_warn_saturate+0xf4/0x1e0 lib/refcount.c:19
Modules linked in:
CPU: 1 PID: 3605 Comm: syz-executor208 Not tainted 5.18.0-syzkaller-03023-g7e062cda7d90 #0
<TASK>
__refcount_add_not_zero include/linux/refcount.h:163 [inline]
__refcount_inc_not_zero include/linux/refcount.h:227 [inline]
refcount_inc_not_zero include/linux/refcount.h:245 [inline]
sk_psock_get+0x3bc/0x410 include/linux/skmsg.h:439
tls_data_ready+0x6d/0x1b0 net/tls/tls_sw.c:2091
tcp_data_ready+0x106/0x520 net/ipv4/tcp_input.c:4983
tcp_data_queue+0x25f2/0x4c90 net/ipv4/tcp_input.c:5057
tcp_rcv_state_process+0x1774/0x4e80 net/ipv4/tcp_input.c:6659
tcp_v4_do_rcv+0x339/0x980 net/ipv4/tcp_ipv4.c:1682
sk_backlog_rcv include/net/sock.h:1061 [inline]
__release_sock+0x134/0x3b0 net/core/sock.c:2849
release_sock+0x54/0x1b0 net/core/sock.c:3404
inet_shutdown+0x1e0/0x430 net/ipv4/af_inet.c:909
__sys_shutdown_sock net/socket.c:2331 [inline]
__sys_shutdown_sock net/socket.c:2325 [inline]
__sys_shutdown+0xf1/0x1b0 net/socket.c:2343
__do_sys_shutdown net/socket.c:2351 [inline]
__se_sys_shutdown net/socket.c:2349 [inline]
__x64_sys_shutdown+0x50/0x70 net/socket.c:2349
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x46/0xb0
</TASK>
During SMC fallback process in connect syscall, kernel will
replaces TCP with SMC. In order to forward wakeup
smc socket waitqueue after fallback, kernel will sets
clcsk->sk_user_data to origin smc socket in
smc_fback_replace_callbacks().
Later, in shutdown syscall, kernel will calls
sk_psock_get(), which treats the clcsk->sk_user_data
as psock type, triggering the refcnt warning.
So, the root cause is that smc and psock, both will use
sk_user_data field. So they will mismatch this field
easily.
This patch solves it by using another bit(defined as
SK_USER_DATA_PSOCK) in PTRMASK, to mark whether
sk_user_data points to a psock object or not.
This patch depends on a PTRMASK introduced in commit f1ff5ce2cd
("net, sk_msg: Clear sk_user_data pointer on clone if tagged").
For there will possibly be more flags in the sk_user_data field,
this patch also refactor sk_user_data flags code to be more generic
to improve its maintainability.
Reported-and-tested-by: syzbot+5f26f85569bd179c18ce@syzkaller.appspotmail.com
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: Hawkins Jiawei <yin31149@gmail.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Users of GNU ld (BFD) from binutils 2.39+ will observe multiple
instances of a new warning when linking kernels in the form:
ld: warning: arch/x86/boot/pmjump.o: missing .note.GNU-stack section implies executable stack
ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker
ld: warning: arch/x86/boot/compressed/vmlinux has a LOAD segment with RWX permissions
Generally, we would like to avoid the stack being executable. Because
there could be a need for the stack to be executable, assembler sources
have to opt-in to this security feature via explicit creation of the
.note.GNU-stack feature (which compilers create by default) or command
line flag --noexecstack. Or we can simply tell the linker the
production of such sections is irrelevant and to link the stack as
--noexecstack.
LLVM's LLD linker defaults to -z noexecstack, so this flag isn't
strictly necessary when linking with LLD, only BFD, but it doesn't hurt
to be explicit here for all linkers IMO. --no-warn-rwx-segments is
currently BFD specific and only available in the current latest release,
so it's wrapped in an ld-option check.
While the kernel makes extensive usage of ELF sections, it doesn't use
permissions from ELF segments.
Link: https://lore.kernel.org/linux-block/3af4127a-f453-4cf7-f133-a181cce06f73@kernel.dk/
Link: https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=ba951afb99912da01a6e8434126b8fac7aa75107
Link: https://github.com/llvm/llvm-project/issues/57009
Reported-and-tested-by: Jens Axboe <axboe@kernel.dk>
Suggested-by: Fangrui Song <maskray@google.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Users of GNU ld (BFD) from binutils 2.39+ will observe multiple
instances of a new warning when linking kernels in the form:
ld: warning: vmlinux: missing .note.GNU-stack section implies executable stack
ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker
ld: warning: vmlinux has a LOAD segment with RWX permissions
Generally, we would like to avoid the stack being executable. Because
there could be a need for the stack to be executable, assembler sources
have to opt-in to this security feature via explicit creation of the
.note.GNU-stack feature (which compilers create by default) or command
line flag --noexecstack. Or we can simply tell the linker the
production of such sections is irrelevant and to link the stack as
--noexecstack.
LLVM's LLD linker defaults to -z noexecstack, so this flag isn't
strictly necessary when linking with LLD, only BFD, but it doesn't hurt
to be explicit here for all linkers IMO. --no-warn-rwx-segments is
currently BFD specific and only available in the current latest release,
so it's wrapped in an ld-option check.
While the kernel makes extensive usage of ELF sections, it doesn't use
permissions from ELF segments.
Link: https://lore.kernel.org/linux-block/3af4127a-f453-4cf7-f133-a181cce06f73@kernel.dk/
Link: https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=ba951afb99912da01a6e8434126b8fac7aa75107
Link: https://github.com/llvm/llvm-project/issues/57009
Reported-and-tested-by: Jens Axboe <axboe@kernel.dk>
Suggested-by: Fangrui Song <maskray@google.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It turns out that gcc-12.1 has some nasty problems with register
allocation on a 32-bit x86 build for the 64-bit values used in the
generic blake2b implementation, where the pattern of 64-bit rotates and
xor operations ends up making gcc generate horrible code.
As a result it ends up with a ridiculously large stack frame for all the
spills it generates, resulting in the following build problem:
crypto/blake2b_generic.c: In function ‘blake2b_compress_one_generic’:
crypto/blake2b_generic.c:109:1: error: the frame size of 2640 bytes is larger than 2048 bytes [-Werror=frame-larger-than=]
on the same test-case, clang ends up generating a stack frame that is
just 296 bytes (and older gcc versions generate a slightly bigger one at
428 bytes - still nowhere near that almost 3kB monster stack frame of
gcc-12.1).
The issue is fixed both in mainline and the GCC 12 release branch [1],
but current release compilers end up failing the i386 allmodconfig build
due to this issue.
Disable the warning for now by simply raising the frame size for this
one file, just to keep this issue from having people turn off WERROR.
Link: https://lore.kernel.org/all/CAHk-=wjxqgeG2op+=W9sqgsWqCYnavC+SRfVyopu9-31S6xw+Q@mail.gmail.com/
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930 [1]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Highlights include:
Stable fixes:
- pNFS/flexfiles: Fix infinite looping when the RDMA connection errors out
Bugfixes:
- NFS: fix port value parsing
- SUNRPC: Reinitialise the backchannel request buffers before reuse
- SUNRPC: fix expiry of auth creds
- NFSv4: Fix races in the legacy idmapper upcall
- NFS: O_DIRECT fixes from Jeff Layton
- NFSv4.1: Fix OP_SEQUENCE error handling
- SUNRPC: Fix an RPC/RDMA performance regression
- NFS: Fix case insensitive renames
- NFSv4/pnfs: Fix a use-after-free bug in open
- NFSv4.1: RECLAIM_COMPLETE must handle EACCES
Features:
- NFSv4.1: session trunking enhancements
- NFSv4.2: READ_PLUS performance optimisations
- NFS: relax the rules for rsize/wsize mount options
- NFS: don't unhash dentry during unlink/rename
- SUNRPC: Fail faster on bad verifier
- NFS/SUNRPC: Various tracing improvements
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmLzy24ACgkQZwvnipYK
APJJvhAAnVmv3jeOLjwDm+3X9xBevTq8PXAikWVeJgbSKCdjfqXO1J+XF49MpxXl
N8PQZMqnwWkF3WvhvYobblwGSl6gJ3RnhVuTgdv4jiPl0ZWyS3XngJY0dCTnNgdL
5O9AjoOMtnGVZwN5j8agymA8f2TUcel5mED6sAk10t2zDZY7VxuQqjp0m696jYjF
0PKBmxoC4+6tXtYJS7d2PGRCTjfEUx2BLnnGuLOKsB7X8f63XmxJiu8/AIiY7spr
M/M+BjAF3ok86dT1LjGlkvSNp23H70Wmsv98udfvxIWBm1l4972oCR/CS+kh3/mU
dYM+oQ3JxxgKEN7Fdak+zU/+qma9q5z2rPFpSIT1fMEuaKN/7H2cbiHPi5RnEBLa
AHWilX/lWBIMnZJZd9g3yYcGe6E/pkT6TqW5JY+2510koyfNER4IismAWMx2iYKU
D0WSZOkmEBS/OYZxpTnqGwvS4L1szo9DN3c+yG2KXLifnmVPpjXZO25wahqSuUo3
V6eYUCXRJmVg+IuXGsMNdrjxGYxD12xChoYzx5RlXls2lwHGeZr+iG3aL3+XayHa
I1Kji3500UmfEOUUUr4UiQ428dOdL3QqNzVzdymN8Vh4d7v64LUL0GSseY+10Xrs
xcbR6l/hwjBIo+I1Bi2mmv3W10tKErFy9eBIKzql3D6VHg7ESOo=
=U00h
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Stable fixes:
- pNFS/flexfiles: Fix infinite looping when the RDMA connection
errors out
Bugfixes:
- NFS: fix port value parsing
- SUNRPC: Reinitialise the backchannel request buffers before reuse
- SUNRPC: fix expiry of auth creds
- NFSv4: Fix races in the legacy idmapper upcall
- NFS: O_DIRECT fixes from Jeff Layton
- NFSv4.1: Fix OP_SEQUENCE error handling
- SUNRPC: Fix an RPC/RDMA performance regression
- NFS: Fix case insensitive renames
- NFSv4/pnfs: Fix a use-after-free bug in open
- NFSv4.1: RECLAIM_COMPLETE must handle EACCES
Features:
- NFSv4.1: session trunking enhancements
- NFSv4.2: READ_PLUS performance optimisations
- NFS: relax the rules for rsize/wsize mount options
- NFS: don't unhash dentry during unlink/rename
- SUNRPC: Fail faster on bad verifier
- NFS/SUNRPC: Various tracing improvements"
* tag 'nfs-for-5.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (46 commits)
NFS: Improve readpage/writepage tracing
NFS: Improve O_DIRECT tracing
NFS: Improve write error tracing
NFS: don't unhash dentry during unlink/rename
NFSv4/pnfs: Fix a use-after-free bug in open
NFS: nfs_async_write_reschedule_io must not recurse into the writeback code
SUNRPC: Don't reuse bvec on retransmission of the request
SUNRPC: Reinitialise the backchannel request buffers before reuse
NFSv4.1: RECLAIM_COMPLETE must handle EACCES
NFSv4.1 probe offline transports for trunking on session creation
SUNRPC create a function that probes only offline transports
SUNRPC export xprt_iter_rewind function
SUNRPC restructure rpc_clnt_setup_test_and_add_xprt
NFSv4.1 remove xprt from xprt_switch if session trunking test fails
SUNRPC create an rpc function that allows xprt removal from rpc_clnt
SUNRPC enable back offline transports in trunking discovery
SUNRPC create an iterator to list only OFFLINE xprts
NFSv4.1 offline trunkable transports on DESTROY_SESSION
SUNRPC add function to offline remove trunkable transports
SUNRPC expose functions for offline remote xprt functionality
...
Now that the PMU is refreshed when MSR_IA32_PERF_CAPABILITIES is written
by host userspace, zero out the number of LBR records for a vCPU during
PMU refresh if PMU_CAP_LBR_FMT is not set in PERF_CAPABILITIES instead of
handling the check at run-time.
guest_cpuid_has() is expensive due to the linear search of guest CPUID
entries, intel_pmu_lbr_is_enabled() is checked on every VM-Enter, _and_
simply enumerating the same "Model" as the host causes KVM to set the
number of LBR records to a non-zero value.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220727233424.2968356-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>