This patch added a new command MPTCP_PM_CMD_SET_FLAGS in PM netlink:
In mptcp_nl_cmd_set_flags, parse the input address, get the backup value
according to whether the address's FLAG_BACKUP flag is set from the
user-space. Then check whether this address had been added in the local
address list. If it had been, then call mptcp_nl_addr_backup to deal with
this address.
In mptcp_nl_addr_backup, traverse all the existing msk sockets to find
the relevant sockets, and call mptcp_pm_nl_mp_prio_send_ack to send out
a MP_PRIO ACK packet.
Finally in mptcp_nl_cmd_set_flags, set or clear the address's FLAG_BACKUP
flag.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch added the incoming MP_PRIO logic:
Added a flag named mp_prio in struct mptcp_options_received, to mark the
MP_PRIO is received, and save the priority value to struct
mptcp_options_received's backup member. Then invoke
mptcp_pm_mp_prio_received with the receiving subsocket and the backup
value.
In mptcp_pm_mp_prio_received, get the subflow context according the input
subsocket, and change the subflow's backup as the incoming priority value.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch added the outgoing MP_PRIO logic:
In mptcp_pm_nl_mp_prio_send_ack, find the related subflow and subsocket
according to the input parameter addr. Save the input priority value to
suflow's backup, then set subflow's send_mp_prio flag to true, and save
the input priority value to suflow's request_bkup. Finally, send out a
pure ACK on the related subsocket.
In mptcp_established_options_mp_prio, check whether the subflow's
send_mp_prio is set. If it is, this is the packet for sending MP_PRIO.
So save subflow->request_bkup value to mptcp_out_options's backup, and
change the option type to OPTION_MPTCP_PRIO.
In mptcp_write_options, clear the send_mp_prio flag and send out the
MP_PRIO suboption with mptcp_out_options's backup value.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently the address ID set by the netlink PM from user-space is
overridden by the kernel. This patch added the address ID assignment
bitmap to allow user-space to set the address ID.
Use a per netns bitmask id_bitmap (256 bits) to keep track of in-use IDs.
And use next_id to keep track of the highest ID currently in use. If the
user-space provides an ID at endpoint creation time, try to use it. If
already in use, endpoint creation fails. Otherwise pick the first ID
available after the highest currently in use, with wrap-around.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Introduced in commit 37b8da1a3c ("net: dsa: Move FDB add/del
implementation inside DSA") in net/dsa/legacy.c, these functions were
moved again to slave.c as part of commit 2a93c1a365 ("net: dsa: Allow
compiling out legacy support"), before actually deleting net/dsa/slave.c
in 93e86b3bc8 ("net: dsa: Remove legacy probing support"). Along with
that movement there should have been a deletion of the prototypes from
dsa_priv.h, they are not useful.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20210108233054.1222278-1-olteanv@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
GRO_DROP can only be returned from napi_gro_frags()
if the skb has not been allocated by a prior napi_get_frags()
Since drivers must use napi_get_frags() and test its result
before populating the skb with metadata, we can safely remove
GRO_DROP since it offers no practical use.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
Acked-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There are cases where GSO segment's length exceeds the egress MTU:
- Forwarding of a TCP GRO skb, when DF flag is not set.
- Forwarding of an skb that arrived on a virtualisation interface
(virtio-net/vhost/tap) with TSO/GSO size set by other network
stack.
- Local GSO skb transmitted on an NETIF_F_TSO tunnel stacked over an
interface with a smaller MTU.
- Arriving GRO skb (or GSO skb in a virtualised environment) that is
bridged to a NETIF_F_TSO tunnel stacked over an interface with an
insufficient MTU.
If so:
- Consume the SKB and its segments.
- Issue an ICMP packet with 'Packet Too Big' message containing the
MTU, allowing the source host to reduce its Path MTU appropriately.
Note: These cases are handled in the same manner in IPv4 output finish.
This patch aligns the behavior of IPv6 and the one of IPv4.
Fixes: 9e50849054 ("netfilter: ipv6: move POSTROUTING invocation before fragmentation")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/1610027418-30438-1-git-send-email-ayal@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If register_netdevice() fails at the very last stage - the
notifier call - some subsystems may have already seen it and
grabbed a reference. struct net_device can't be freed right
away without calling netdev_wait_all_refs().
Now that we have a clean interface in form of dev->needs_free_netdev
and lenient free_netdev() we can undo what commit 93ee31f14f ("[NET]:
Fix free_netdev on register_netdev failure.") has done and complete
the unregistration path by bringing the net_set_todo() call back.
After registration fails user is still expected to explicitly
free the net_device, so make sure ->needs_free_netdev is cleared,
otherwise rolling back the registration will cause the old double
free for callers who release rtnl_lock before the free.
This also solves the problem of priv_destructor not being called
on notifier error.
net_set_todo() will be moved back into unregister_netdevice_queue()
in a follow up.
Reported-by: Hulk Robot <hulkci@huawei.com>
Reported-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There are two flavors of handling netdev registration:
- ones called without holding rtnl_lock: register_netdev() and
unregister_netdev(); and
- those called with rtnl_lock held: register_netdevice() and
unregister_netdevice().
While the semantics of the former are pretty clear, the same can't
be said about the latter. The netdev_todo mechanism is utilized to
perform some of the device unregistering tasks and it hooks into
rtnl_unlock() so the locked variants can't actually finish the work.
In general free_netdev() does not mix well with locked calls. Most
drivers operating under rtnl_lock set dev->needs_free_netdev to true
and expect core to make the free_netdev() call some time later.
The part where this becomes most problematic is error paths. There is
no way to unwind the state cleanly after a call to register_netdevice(),
since unreg can't be performed fully without dropping locks.
Make free_netdev() more lenient, and defer the freeing if device
is being unregistered. This allows error paths to simply call
free_netdev() both after register_netdevice() failed, and after
a call to unregister_netdevice() but before dropping rtnl_lock.
Simplify the error paths which are currently doing gymnastics
around free_netdev() handling.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
sparse complains about some harmless endianness issues:
> net/ipv4/ip_tunnel_core.c:225:43: warning: cast to restricted __be16
> net/ipv4/ip_tunnel_core.c:225:43: warning: incorrect type in initializer (different base types)
> net/ipv4/ip_tunnel_core.c:225:43: expected restricted __be16 [usertype] mtu
> net/ipv4/ip_tunnel_core.c:225:43: got unsigned short [usertype]
iptunnel_pmtud_build_icmp() uses the wrong flavour of byte-order conversion
when storing the MTU into the ICMPv4 packet. Use htons(), just like
iptunnel_pmtud_build_icmpv6() does.
> net/ipv4/ip_tunnel_core.c:248:35: warning: cast from restricted __be16
> net/ipv4/ip_tunnel_core.c:248:35: warning: incorrect type in argument 3 (different base types)
> net/ipv4/ip_tunnel_core.c:248:35: expected unsigned short type
> net/ipv4/ip_tunnel_core.c:248:35: got restricted __be16 [usertype]
> net/ipv4/ip_tunnel_core.c:341:35: warning: cast from restricted __be16
> net/ipv4/ip_tunnel_core.c:341:35: warning: incorrect type in argument 3 (different base types)
> net/ipv4/ip_tunnel_core.c:341:35: expected unsigned short type
> net/ipv4/ip_tunnel_core.c:341:35: got restricted __be16 [usertype]
eth_header() wants the Ethertype in host-order, use the correct flavour of
byte-order conversion.
> net/ipv4/ip_tunnel_core.c:600:45: warning: restricted __be16 degrades to integer
> net/ipv4/ip_tunnel_core.c:609:30: warning: incorrect type in assignment (different base types)
> net/ipv4/ip_tunnel_core.c:609:30: expected int type
> net/ipv4/ip_tunnel_core.c:609:30: got restricted __be16 [usertype]
> net/ipv4/ip_tunnel_core.c:619:30: warning: incorrect type in assignment (different base types)
> net/ipv4/ip_tunnel_core.c:619:30: expected int type
> net/ipv4/ip_tunnel_core.c:619:30: got restricted __be16 [usertype]
> net/ipv4/ip_tunnel_core.c:629:30: warning: incorrect type in assignment (different base types)
> net/ipv4/ip_tunnel_core.c:629:30: expected int type
> net/ipv4/ip_tunnel_core.c:629:30: got restricted __be16 [usertype]
The TUNNEL_* types are big-endian, so adjust the type of the local
variable in ip_tun_parse_opts().
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Link: https://lore.kernel.org/r/20210107144008.25777-1-jwi@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
reuse->socks[] is modified concurrently by reuseport_add_sock. To
prevent reading values that have not been fully initialized, only read
the array up until the last known safe index instead of incorrectly
re-reading the last index of the array.
Fixes: acdcecc612 ("udp: correct reuseport selection with connected sockets")
Signed-off-by: Baptiste Lepers <baptiste.lepers@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20210107051110.12247-1-baptiste.lepers@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
skbs in fraglist could be shared by a BPF filter loaded at TC. If TC
writes, it will call skb_ensure_writable -> pskb_expand_head to create
a private linear section for the head_skb. And then call
skb_clone_fraglist -> skb_get on each skb in the fraglist.
skb_segment_list overwrites part of the skb linear section of each
fragment itself. Even after skb_clone, the frag_skbs share their
linear section with their clone in PF_PACKET.
Both sk_receive_queue of PF_PACKET and PF_INET (or PF_INET6) can have
a link for the same frag_skbs chain. If a new skb (not frags) is
queued to one of the sk_receive_queue, multiple ptypes can see and
release this. It causes use-after-free.
[ 4443.426215] ------------[ cut here ]------------
[ 4443.426222] refcount_t: underflow; use-after-free.
[ 4443.426291] WARNING: CPU: 7 PID: 28161 at lib/refcount.c:190
refcount_dec_and_test_checked+0xa4/0xc8
[ 4443.426726] pstate: 60400005 (nZCv daif +PAN -UAO)
[ 4443.426732] pc : refcount_dec_and_test_checked+0xa4/0xc8
[ 4443.426737] lr : refcount_dec_and_test_checked+0xa0/0xc8
[ 4443.426808] Call trace:
[ 4443.426813] refcount_dec_and_test_checked+0xa4/0xc8
[ 4443.426823] skb_release_data+0x144/0x264
[ 4443.426828] kfree_skb+0x58/0xc4
[ 4443.426832] skb_queue_purge+0x64/0x9c
[ 4443.426844] packet_set_ring+0x5f0/0x820
[ 4443.426849] packet_setsockopt+0x5a4/0xcd0
[ 4443.426853] __sys_setsockopt+0x188/0x278
[ 4443.426858] __arm64_sys_setsockopt+0x28/0x38
[ 4443.426869] el0_svc_common+0xf0/0x1d0
[ 4443.426873] el0_svc_handler+0x74/0x98
[ 4443.426880] el0_svc+0x8/0xc
Fixes: 3a1296a38d (net: Support GRO/GSO fraglist chaining.)
Signed-off-by: Dongseok Yi <dseok.yi@samsung.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/1610072918-174177-1-git-send-email-dseok.yi@samsung.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The function sockfd_lookup uses fget on the value that is stored in
the file field of the returned structure, so fput should ultimately
be applied to this value. This can be done directly, but it seems
better to use the specific macro sockfd_put, which does the same
thing.
The cleanup was done using the following semantic patch:
(http://www.emn.fr/x-info/coccinelle/)
// <smpl>
@@
expression s;
@@
s = sockfd_lookup(...)
...
+ sockfd_put(s);
?- fput(s->file);
// </smpl>
Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20201229134834.22962-1-zhengyongjun3@huawei.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Underlying xfrm output supports gso packets.
Declare support in hw_features and adapt the xmit MTU check to pass GSO
packets.
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
The function nh_check_attr_group() is called to validate nexthop groups.
The intention of that code seems to have been to bounce all attributes
above NHA_GROUP_TYPE except for NHA_FDB. However instead it bounces all
these attributes except when NHA_FDB attribute is present--then it accepts
them.
NHA_FDB validation that takes place before, in rtm_to_nh_config(), already
bounces NHA_OIF, NHA_BLACKHOLE, NHA_ENCAP and NHA_ENCAP_TYPE. Yet further
back, NHA_GROUPS and NHA_MASTER are bounced unconditionally.
But that still leaves NHA_GATEWAY as an attribute that would be accepted in
FDB nexthop groups (with no meaning), so long as it keeps the address
family as unspecified:
# ip nexthop add id 1 fdb via 127.0.0.1
# ip nexthop add id 10 fdb via default group 1
The nexthop code is still relatively new and likely not used very broadly,
and the FDB bits are newer still. Even though there is a reproducer out
there, it relies on an improbable gateway arguments "via default", "via
all" or "via any". Given all this, I believe it is OK to reformulate the
condition to do the right thing and bounce NHA_GATEWAY.
Fixes: 38428d6871 ("nexthop: support for fdb ecmp nexthops")
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In case of error, remove the nexthop group entry from the list to which
it was previously added.
Fixes: 430a049190 ("nexthop: Add support for nexthop groups")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A reference was not taken for the current nexthop entry, so do not try
to put it in the error path.
Fixes: 430a049190 ("nexthop: Add support for nexthop groups")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Unlike the rest of the skb_zcopy_ functions, these routines
operate on a 'struct ubuf', not a skb. Remove the 'skb_'
prefix from the naming to make things clearer.
Suggested-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, when an ubuf is attached to a new skb, the shared
flags word is initialized to a fixed value. Instead of doing
this, set the default flags in the ubuf, and have new skbs
inherit from this default.
This is needed when setting up different zerocopy types.
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In preparation for expanded zerocopy (TX and RX), move
the zerocopy related bits out of tx_flags into their own
flag word.
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
At Willem's suggestion, rename the sock_zerocopy_* functions
so that they match the MSG_ZEROCOPY flag, which makes it clear
they are specific to this zerocopy implementation.
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
RX zerocopy fragment pages which are not allocated from the
system page pool require special handling. Give the callback
in skb_zcopy_clear() a chance to process them first.
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The sock_zerocopy_put_abort function contains logic which is
specific to the current zerocopy implementation. Add a wrapper
which checks the callback and dispatches apppropriately.
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add an optional skb parameter to the zerocopy callback parameter,
which is passed down from skb_zcopy_clear(). This gives access
to the original skb, which is needed for upcoming RX zero-copy
error handling.
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Replace sock_zerocopy_put with the generic skb_zcopy_put()
function. Pass 'true' as the success argument, as this
is identical to no change.
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Before this change, the caller of sock_zerocopy_callback would
need to save the zerocopy status, decrement and check the refcount,
and then call the callback function - the callback was only invoked
when the refcount reached zero.
Now, the caller just passes the status into the callback function,
which saves the status and handles its own refcounts.
This makes the behavior of the sock_zerocopy_callback identical
to the tpacket and vhost callbacks.
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
All 'struct ubuf_info' users should have a callback defined
as of commit 0a4a060bb2 ("sock: fix zerocopy_success regression
with msg_zerocopy").
Remove the dead code path to consume_skb(), which makes
assumptions about how the structure was allocated.
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This effectively reverts commit 60724d4bae ("net: dsa: Add support for
DSA specific notifiers"). The reason is that since commit 2f1e8ea726
("net: dsa: link interfaces with the DSA master to get rid of lockdep
warnings"), it appears that there is a generic way to achieve the same
purpose. The only user thus far, the Broadcom SYSTEMPORT driver, was
converted to use the generic notifiers.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Using the NETDEV_CHANGEUPPER notifications, drivers can be aware when
they are enslaved to e.g. a bridge by calling netif_is_bridge_master().
Export this helper from DSA to get the equivalent functionality of
determining whether the upper interface of a CHANGEUPPER notifier is a
DSA switch interface or not.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
It is a bit strange to see something as specific as Broadcom SYSTEMPORT
bits in the main DSA include file. Move these away into a separate
header, and have the tagger and the SYSTEMPORT driver include them.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Some DSA switches (and not only) cannot learn source MAC addresses from
packets injected from the CPU. They only perform hardware address
learning from inbound traffic.
This can be problematic when we have a bridge spanning some DSA switch
ports and some non-DSA ports (which we'll call "foreign interfaces" from
DSA's perspective).
There are 2 classes of problems created by the lack of learning on
CPU-injected traffic:
- excessive flooding, due to the fact that DSA treats those addresses as
unknown
- the risk of stale routes, which can lead to temporary packet loss
To illustrate the second class, consider the following situation, which
is common in production equipment (wireless access points, where there
is a WLAN interface and an Ethernet switch, and these form a single
bridging domain).
AP 1:
+------------------------------------------------------------------------+
| br0 |
+------------------------------------------------------------------------+
+------------+ +------------+ +------------+ +------------+ +------------+
| swp0 | | swp1 | | swp2 | | swp3 | | wlan0 |
+------------+ +------------+ +------------+ +------------+ +------------+
| ^ ^
| | |
| | |
| Client A Client B
|
|
|
+------------+ +------------+ +------------+ +------------+ +------------+
| swp0 | | swp1 | | swp2 | | swp3 | | wlan0 |
+------------+ +------------+ +------------+ +------------+ +------------+
+------------------------------------------------------------------------+
| br0 |
+------------------------------------------------------------------------+
AP 2
- br0 of AP 1 will know that Clients A and B are reachable via wlan0
- the hardware fdb of a DSA switch driver today is not kept in sync with
the software entries on other bridge ports, so it will not know that
clients A and B are reachable via the CPU port UNLESS the hardware
switch itself performs SA learning from traffic injected from the CPU.
Nonetheless, a substantial number of switches don't.
- the hardware fdb of the DSA switch on AP 2 may autonomously learn that
Client A and B are reachable through swp0. Therefore, the software br0
of AP 2 also may or may not learn this. In the example we're
illustrating, some Ethernet traffic has been going on, and br0 from AP
2 has indeed learnt that it can reach Client B through swp0.
One of the wireless clients, say Client B, disconnects from AP 1 and
roams to AP 2. The topology now looks like this:
AP 1:
+------------------------------------------------------------------------+
| br0 |
+------------------------------------------------------------------------+
+------------+ +------------+ +------------+ +------------+ +------------+
| swp0 | | swp1 | | swp2 | | swp3 | | wlan0 |
+------------+ +------------+ +------------+ +------------+ +------------+
| ^
| |
| Client A
|
|
| Client B
| |
| v
+------------+ +------------+ +------------+ +------------+ +------------+
| swp0 | | swp1 | | swp2 | | swp3 | | wlan0 |
+------------+ +------------+ +------------+ +------------+ +------------+
+------------------------------------------------------------------------+
| br0 |
+------------------------------------------------------------------------+
AP 2
- br0 of AP 1 still knows that Client A is reachable via wlan0 (no change)
- br0 of AP 1 will (possibly) know that Client B has left wlan0. There
are cases where it might never find out though. Either way, DSA today
does not process that notification in any way.
- the hardware FDB of the DSA switch on AP 1 may learn autonomously that
Client B can be reached via swp0, if it receives any packet with
Client 1's source MAC address over Ethernet.
- the hardware FDB of the DSA switch on AP 2 still thinks that Client B
can be reached via swp0. It does not know that it has roamed to wlan0,
because it doesn't perform SA learning from the CPU port.
Now Client A contacts Client B.
AP 1 routes the packet fine towards swp0 and delivers it on the Ethernet
segment.
AP 2 sees a frame on swp0 and its fdb says that the destination is swp0.
Hairpinning is disabled => drop.
This problem comes from the fact that these switches have a 'blind spot'
for addresses coming from software bridging. The generic solution is not
to assume that hardware learning can be enabled somehow, but to listen
to more bridge learning events. It turns out that the bridge driver does
learn in software from all inbound frames, in __br_handle_local_finish.
A proper SWITCHDEV_FDB_ADD_TO_DEVICE notification is emitted for the
addresses serviced by the bridge on 'foreign' interfaces. The software
bridge also does the right thing on migration, by notifying that the old
entry is deleted, so that does not need to be special-cased in DSA. When
it is deleted, we just need to delete our static FDB entry towards the
CPU too, and wait.
The problem is that DSA currently only cares about SWITCHDEV_FDB_ADD_TO_DEVICE
events received on its own interfaces, such as static FDB entries.
Luckily we can change that, and DSA can listen to all switchdev FDB
add/del events in the system and figure out if those events were emitted
by a bridge that spans at least one of DSA's own ports. In case that is
true, DSA will also offload that address towards its own CPU port, in
the eventuality that there might be bridge clients attached to the DSA
switch who want to talk to the station connected to the foreign
interface.
In terms of implementation, we need to keep the fdb_info->added_by_user
check for the case where the switchdev event was targeted directly at a
DSA switch port. But we don't need to look at that flag for snooped
events. So the check is currently too late, we need to move it earlier.
This also simplifies the code a bit, since we avoid uselessly allocating
and freeing switchdev_work.
We could probably do some improvements in the future. For example,
multi-bridge support is rudimentary at the moment. If there are two
bridges spanning a DSA switch's ports, and both of them need to service
the same MAC address, then what will happen is that the migration of one
of those stations will trigger the deletion of the FDB entry from the
CPU port while it is still used by other bridge. That could be improved
with reference counting but is left for another time.
This behavior needs to be enabled at driver level by setting
ds->assisted_learning_on_cpu_port = true. This is because we don't want
to inflict a potential performance penalty (accesses through
MDIO/I2C/SPI are expensive) to hardware that really doesn't need it
because address learning on the CPU port works there.
Reported-by: DENG Qingfang <dqfext@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Right now, the following would happen for a switch driver that does not
implement .port_fdb_add or .port_fdb_del.
dsa_slave_switchdev_event returns NOTIFY_OK and schedules:
-> dsa_slave_switchdev_event_work
-> dsa_port_fdb_add
-> dsa_port_notify(DSA_NOTIFIER_FDB_ADD)
-> dsa_switch_fdb_add
-> if (!ds->ops->port_fdb_add) return -EOPNOTSUPP;
-> an error is printed with dev_dbg, and
dsa_fdb_offload_notify(switchdev_work) is not called.
We can avoid scheduling the worker for nothing and say NOTIFY_DONE.
Because we don't call dsa_fdb_offload_notify, the static FDB entry will
remain just in the software bridge.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We'll need to start listening to SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE
events even for interfaces where dsa_slave_dev_check returns false, so
we need that check inside the switch-case statement for SWITCHDEV_FDB_*.
This movement also avoids a useless allocation / free of switchdev_work
on the untreated "default event" case.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently DSA doesn't add FDB entries on the CPU port, because it only
does so through switchdev, which is associated with a net_device, and
there are none of those for the CPU port.
But actually FDB addresses on the CPU port have some use cases of their
own, if the switchdev operations are initiated from within the DSA
layer. There is just one problem with the existing code: it passes a
structure in dsa_switchdev_event_work which was retrieved directly from
switchdev, so it contains a net_device. We need to generalize the
contents to something that covers the CPU port as well: the "ds, port"
tuple is fine for that.
Note that the new procedure for notifying the successful FDB offload is
inspired from the rocker model.
Also, nothing was being done if added_by_user was false. Let's check for
that a lot earlier, and don't actually bother to schedule the worker
for nothing.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The dev_close() call was added in commit c9eb3e0f87 ("net: dsa: Add
support for learning FDB through notification") "to indicate inconsistent
situation" when we could not delete an FDB entry from the port.
bridge fdb del d8:58:d7:00:ca:6d dev swp0 self master
It is a bit drastic and at the same time not helpful if the above fails
to only print with netdev_dbg log level, but on the other hand to bring
the interface down.
So increase the verbosity of the error message, and drop dev_close().
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently the bridge emits atomic switchdev notifications for
dynamically learnt FDB entries. Monitoring these notifications works
wonders for switchdev drivers that want to keep their hardware FDB in
sync with the bridge's FDB.
For example station A wants to talk to station B in the diagram below,
and we are concerned with the behavior of the bridge on the DUT device:
DUT
+-------------------------------------+
| br0 |
| +------+ +------+ +------+ +------+ |
| | | | | | | | | |
| | swp0 | | swp1 | | swp2 | | eth0 | |
+-------------------------------------+
| | |
Station A | |
| |
+--+------+--+ +--+------+--+
| | | | | | | |
| | swp0 | | | | swp0 | |
Another | +------+ | | +------+ | Another
switch | br0 | | br0 | switch
| +------+ | | +------+ |
| | | | | | | |
| | swp1 | | | | swp1 | |
+--+------+--+ +--+------+--+
|
Station B
Interfaces swp0, swp1, swp2 are handled by a switchdev driver that has
the following property: frames injected from its control interface bypass
the internal address analyzer logic, and therefore, this hardware does
not learn from the source address of packets transmitted by the network
stack through it. So, since bridging between eth0 (where Station B is
attached) and swp0 (where Station A is attached) is done in software,
the switchdev hardware will never learn the source address of Station B.
So the traffic towards that destination will be treated as unknown, i.e.
flooded.
This is where the bridge notifications come in handy. When br0 on the
DUT sees frames with Station B's MAC address on eth0, the switchdev
driver gets these notifications and can install a rule to send frames
towards Station B's address that are incoming from swp0, swp1, swp2,
only towards the control interface. This is all switchdev driver private
business, which the notification makes possible.
All is fine until someone unplugs Station B's cable and moves it to the
other switch:
DUT
+-------------------------------------+
| br0 |
| +------+ +------+ +------+ +------+ |
| | | | | | | | | |
| | swp0 | | swp1 | | swp2 | | eth0 | |
+-------------------------------------+
| | |
Station A | |
| |
+--+------+--+ +--+------+--+
| | | | | | | |
| | swp0 | | | | swp0 | |
Another | +------+ | | +------+ | Another
switch | br0 | | br0 | switch
| +------+ | | +------+ |
| | | | | | | |
| | swp1 | | | | swp1 | |
+--+------+--+ +--+------+--+
|
Station B
Luckily for the use cases we care about, Station B is noisy enough that
the DUT hears it (on swp1 this time). swp1 receives the frames and
delivers them to the bridge, who enters the unlikely path in br_fdb_update
of updating an existing entry. It moves the entry in the software bridge
to swp1 and emits an addition notification towards that.
As far as the switchdev driver is concerned, all that it needs to ensure
is that traffic between Station A and Station B is not forever broken.
If it does nothing, then the stale rule to send frames for Station B
towards the control interface remains in place. But Station B is no
longer reachable via the control interface, but via a port that can
offload the bridge port learning attribute. It's just that the port is
prevented from learning this address, since the rule overrides FDB
updates. So the rule needs to go. The question is via what mechanism.
It sure would be possible for this switchdev driver to keep track of all
addresses which are sent to the control interface, and then also listen
for bridge notifier events on its own ports, searching for the ones that
have a MAC address which was previously sent to the control interface.
But this is cumbersome and inefficient. Instead, with one small change,
the bridge could notify of the address deletion from the old port, in a
symmetrical manner with how it did for the insertion. Then the switchdev
driver would not be required to monitor learn/forget events for its own
ports. It could just delete the rule towards the control interface upon
bridge entry migration. This would make hardware address learning be
possible again. Then it would take a few more packets until the hardware
and software FDB would be in sync again.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Conntrack reassembly records the largest fragment size seen in IPCB.
However, when this gets forwarded/transmitted, fragmentation will only
be forced if one of the fragmented packets had the DF bit set.
In that case, a flag in IPCB will force fragmentation even if the
MTU is large enough.
This should work fine, but this breaks with ip tunnels.
Consider client that sends a UDP datagram of size X to another host.
The client fragments the datagram, so two packets, of size y and z, are
sent. DF bit is not set on any of these packets.
Middlebox netfilter reassembles those packets back to single size-X
packet, before routing decision.
packet-size-vs-mtu checks in ip_forward are irrelevant, because DF bit
isn't set. At output time, ip refragmentation is skipped as well
because x is still smaller than the mtu of the output device.
If ttransmit device is an ip tunnel, the packet size increases to
x+overhead.
Also, tunnel might be configured to force DF bit on outer header.
In this case, packet will be dropped (exceeds MTU) and an ICMP error is
generated back to sender.
But sender already respects the announced MTU, all the packets that
it sent did fit the announced mtu.
Force refragmentation as per original sizes unconditionally so ip tunnel
will encapsulate the fragments instead.
The only other solution I see is to place ip refragmentation in
the ip_tunnel code to handle this case.
Fixes: d6b915e29f ("ip_fragment: don't forward defragmented DF packet")
Reported-by: Christian Perle <christian.perle@secunet.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
For some reason ip_tunnel insist on setting the DF bit anyway when the
inner header has the DF bit set, EVEN if the tunnel was configured with
'nopmtudisc'.
This means that the script added in the previous commit
cannot be made to work by adding the 'nopmtudisc' flag to the
ip tunnel configuration. Doing so breaks connectivity even for the
without-conntrack/netfilter scenario.
When nopmtudisc is set, the tunnel will skip the mtu check, so no
icmp error is sent to client. Then, because inner header has DF set,
the outer header gets added with DF bit set as well.
IP stack then sends an error to itself because the packet exceeds
the device MTU.
Fixes: 23a3647bc4 ("ip_tunnels: Use skb-len to PMTU check.")
Cc: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move the NETIF_F_RX_UDP_TUNNEL_PORT feature check into
udp_tunnel_nic_*_port() helpers, since they're always
done right before the call.
Add similar checks before calling the notifier.
udp_tunnel_nic invokes the notifier without checking
features which could result in some wasted cycles.
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
All drivers use udp_tunnel_nic_*_port() helpers, prepare for
NDO removal by invoking those helpers directly.
The helpers are safe to call on all devices, they check if
device has the UDP tunnel state initialized.
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Route removal is handled by two code paths. The main removal path is via
fib6_del_route() which will handle purging any PMTU exceptions from the
cache, removing all per-cpu copies of the DST entry used by the route, and
releasing the fib6_info struct.
The second removal location is during fib6_add_rt2node() during a route
replacement operation. This path also calls fib6_purge_rt() to handle
cleaning up the per-cpu copies of the DST entries and releasing the
fib6_info associated with the older route, but it does not flush any PMTU
exceptions that the older route had. Since the older route is removed from
the tree during the replacement, we lose any way of accessing it again.
As these lingering DSTs and the fib6_info struct are holding references to
the underlying netdevice struct as well, unregistering that device from the
kernel can never complete.
Fixes: 2b760fcf5c ("ipv6: hook up exception table to store dst cache")
Signed-off-by: Sean Tranchetti <stranche@codeaurora.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/1609892546-11389-1-git-send-email-stranche@quicinc.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQFHBAABCgAxFiEEK3kIWJt9yTYMP3ehqclaivrt76kFAl/12xATHG1rbEBwZW5n
dXRyb25peC5kZQAKCRCpyVqK+u3vqcB6B/4xTRQCOLF1BN59AdixqGf9gdXjIpQs
R5h4WJXqh9OPS4XSckSVBVGYniT8KdfBLL0IY4NV8q2s6R6Yo70G3GTiaYXI4cX/
BaKMRcvX6p1i2qZxTtHDGYxucnKCd+yRwfWkL4GLsCNcmTWCkqUL+nGgx6UmS6eB
9fMHcLXiP9vSfJQUfAOLR33Bu2C0QslI5fX5hi8PIdKEz9cHGf3PAuaRvtWEWi3a
1Zi/iXrv/0XDcT+OE5/2tAU6n8slIA2MU/80r7KdgUrWXOfuueH/pqepYgVU0Nf2
eFH+gDYBlZucd4ef37TPIAdPTw3z4F9irOVI1iwRIWdK3aQ3jyinsVWF
=0lW0
-----END PGP SIGNATURE-----
Merge tag 'linux-can-next-for-5.12-20210106' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next
Marc Kleine-Budde says:
====================
pull-request: can-next 2021-01-06
The first 16 patches are by me and target the tcan4x5x SPI glue driver for the
m_can CAN driver. First there are a several cleanup commits, then the SPI
regmap part is converted to 8 bits per word, to make it possible to use that
driver on SPI controllers that only support the 8 bit per word mode (such as
the SPI cores on the raspberry pi).
Oliver Hartkopp contributes a patch for the CAN_RAW protocol. The getsockopt()
for CAN_RAW_FILTER is changed to return -ERANGE if the filterset does not fit
into the provided user space buffer.
The last two patches are by Joakim Zhang and add wakeup support to the flexcan
driver for the i.MX8QM SoC. The dt-bindings docs are extended to describe the
added property.
* tag 'linux-can-next-for-5.12-20210106' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next:
can: flexcan: add CAN wakeup function for i.MX8QM
dt-bindings: can: fsl,flexcan: add fsl,scu-index property to indicate a resource
can: raw: return -ERANGE when filterset does not fit into user space buffer
can: tcan4x5x: add support for half-duplex controllers
can: tcan4x5x: rework SPI access
can: tcan4x5x: add {wr,rd}_table
can: tcan4x5x: add max_raw_{read,write} of 256
can: tcan4x5x: tcan4x5x_regmap: set reg_stride to 4
can: tcan4x5x: fix max register value
can: tcan4x5x: tcan4x5x_regmap_init(): use spi as context pointer
can: tcan4x5x: tcan4x5x_regmap_write(): remove not needed casts and replace 4 by sizeof
can: tcan4x5x: rename regmap_spi_gather_write() -> tcan4x5x_regmap_gather_write()
can: tcan4x5x: remove regmap async support
can: tcan4x5x: tcan4x5x_bus: remove not needed read_flag_mask
can: tcan4x5x: mark struct regmap_bus tcan4x5x_bus as constant
can: tcan4x5x: move regmap code into seperate file
can: tcan4x5x: rename tcan4x5x.c -> tcan4x5x-core.c
can: tcan4x5x: beautify indention of tcan4x5x_of_match and tcan4x5x_id_table
can: tcan4x5x: replace DEVICE_NAME by KBUILD_MODNAME
====================
Link: https://lore.kernel.org/r/20210107094900.173046-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Looking for an -EINVAL all over the dsa code could take hours for
inexperienced DSA users.
Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20210106090915.21439-1-zajec5@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Multiple filters (struct can_filter) can be set with the setsockopt()
function, which was originally intended as a write-only operation.
As getsockopt() also provides a CAN_RAW_FILTER option to read back the
given filters, the caller has to provide an appropriate user space buffer.
In the case this buffer is too small the getsockopt() silently truncates
the filter information and gives no information about the needed space.
This is safe but not convenient for the programmer.
In net/core/sock.c the SO_PEERGROUPS sockopt had a similar requirement
and solved it by returning -ERANGE in the case that the provided data
does not fit into the given user space buffer and fills the required size
into optlen, so that the caller can retry with a matching buffer length.
This patch adopts this approach for CAN_RAW_FILTER getsockopt().
Reported-by: Phillip Schichtel <phillip@schich.tel>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Tested-By: Phillip Schichtel <phillip@schich.tel>
Link: https://lore.kernel.org/r/20201216174928.21663-1-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
u128_xor() takes pointers to quantities that are assumed to be at least
64-bit aligned, which is not guaranteed to be the case in the smp_c1()
routine. So switch to crypto_xor() instead.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
A null-ptr-deref bug is reported by Hulk Robot like this:
--------------
KASAN: null-ptr-deref in range [0x0000000000000128-0x000000000000012f]
Call Trace:
qrtr_ns_remove+0x22/0x40 [ns]
qrtr_proto_fini+0xa/0x31 [qrtr]
__x64_sys_delete_module+0x337/0x4e0
do_syscall_64+0x34/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x468ded
--------------
When qrtr_ns_init fails in qrtr_proto_init, qrtr_ns_remove which would
be called later on would raise a null-ptr-deref because qrtr_ns.workqueue
has been destroyed.
Fix it by making qrtr_ns_init have a return value and adding a check in
qrtr_proto_init.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Qinglang Miao <miaoqinglang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
VLAN checks for NETREG_UNINITIALIZED to distinguish between
registration failure and unregistration in progress.
Since commit cb626bf566 ("net-sysfs: Fix reference count leak")
registration failure may, however, result in NETREG_UNREGISTERED
as well as NETREG_UNINITIALIZED.
This fix is similer to cebb69754f ("rtnetlink: Fix
memory(net_device) leak when ->newlink fails")
Fixes: cb626bf566 ("net-sysfs: Fix reference count leak")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
If there is a NCI command in work queue after closing the NCI device at
nci_unregister_device, The NCI command timer starts at flush_workqueue
function and then NCI command timeout handler would be called 5 second
after flushing the NCI command work queue and destroying the queue.
At that time, the timeout handler would try to use NCI command work queue
that is destroyed already. it will causes the problem. To avoid this
abnormal situation, change the sequence to prevent the NCI command timeout
handler from being called after destroying the NCI command work queue.
Signed-off-by: Bongsu Jeon <bongsu.jeon@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function sockfd_lookup uses fget on the value that is stored in
the file field of the returned structure, so fput should ultimately be
applied to this value. This can be done directly, but it seems better
to use the specific macro sockfd_put, which does the same thing.
Perform a source code refactoring by using the following semantic patch.
// <smpl>
@@
expression s;
@@
s = sockfd_lookup(...)
...
+ sockfd_put(s);
- fput(s->file);
// </smpl>
Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Without crc32 support, this fails to link:
arm-linux-gnueabi-ld: net/wireless/scan.o: in function `cfg80211_scan_6ghz':
scan.c:(.text+0x928): undefined reference to `crc32_le'
Fixes: c8cb5b854b ("nl80211/cfg80211: support 6 GHz scanning")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the existing offsetof() macro instead of duplicating code.
Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
and bpf trees.
Current release - regressions:
- mt76: - usb: fix NULL pointer dereference in mt76u_status_worker
- sdio: fix NULL pointer dereference in mt76s_process_tx_queue
- net: ipa: fix interconnect enable bug
Current release - always broken:
- netfilter: ipset: fixes possible oops in mtype_resize
- ath11k: fix number of coding issues found by static analysis tools
and spurious error messages
Previous releases - regressions:
- e1000e: re-enable s0ix power saving flows for systems with
the Intel i219-LM Ethernet controllers to fix power
use regression
- virtio_net: fix recursive call to cpus_read_lock() to avoid
a deadlock
- ipv4: ignore ECN bits for fib lookups in fib_compute_spec_dst()
- net-sysfs: take the rtnl lock around XPS configuration
- xsk: - fix memory leak for failed bind
- rollback reservation at NETDEV_TX_BUSY
- r8169: work around power-saving bug on some chip versions
Previous releases - always broken:
- dcb: validate netlink message in DCB handler
- tun: fix return value when the number of iovs exceeds MAX_SKB_FRAGS
to prevent unnecessary retries
- vhost_net: fix ubuf refcount when sendmsg fails
- bpf: save correct stopping point in file seq iteration
- ncsi: use real net-device for response handler
- neighbor: fix div by zero caused by a data race (TOCTOU)
- bareudp: - fix use of incorrect min_headroom size
- fix false positive lockdep splat from the TX lock
- net: mvpp2: - clear force link UP during port init procedure
in case bootloader had set it
- add TCAM entry to drop flow control pause frames
- fix PPPoE with ipv6 packet parsing
- fix GoP Networking Complex Control config of port 3
- fix pkt coalescing IRQ-threshold configuration
- xsk: fix race in SKB mode transmit with shared cq
- ionic: account for vlan tag len in rx buffer len
- net: stmmac: ignore the second clock input, current clock framework
does not handle exclusive clock use well, other drivers
may reconfigure the second clock
Misc:
- ppp: change PPPIOCUNBRIDGECHAN ioctl request number to follow
existing scheme
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAl/zsqQACgkQMUZtbf5S
IrvfqA/+MbjN9TRccZRgYVzPVzlP5jswi7VZIjikPrNxCdwgQd8bDMfeaD6I1PcX
WHf35vtD8zh729qz9DheWXFp7kDQ1fY0Z59KA25xf/ulFEkZPl3RBg70rSgv4rc+
T82dVo6x33DPe6NkspDC+Uhjz2IxcS/P7F9N7DtbavrfNuDyX8+0U/FFQIL0xOyG
DuhwecCh0vJFGcWXTWtK1vP1CPD98L28KS2Od+EZsUUZOKt1WMyGrAgNcT6uYXmO
NIYNy+FPyvvIwTLupoFE7oU4LA0sZozyvzcTDugXBF5EKoR8BwBFk0FfWzN9Oxge
LrmhNBSTeYyiw8XMOwSIfxwZnBm7mJFQqTHR1+Y83Qw1SR6PfSUZgkEkW2SYgprL
9CzE3O3P3Ci7TSx7fvZUn8B1q5J0DfZR6ZYyor9zl55e+ikraRYtXsk47bf9AGXl
owpHXEYWHFmgOP+LVdf1BUjuiE3vnCBJBsHlMbRkxiNPKravWtPSiM2yTu6fEbpT
pMXCgFQBL/IqwzX01zuw7teg40YLVaFnmFdQbYDwA5p9VODlQvHzn2K4GyuktswX
wxHYU5WRWtCkBfE+nbAROKzE7MuH9jtPtV1ZeuseTqYGBRuvEvudX8ypEvKS45pP
OWkzFsSXd9q7M6cxftipwjcyLiIO+UGdizNHvDUyEQOPAyYPKb4=
=N4/x
-----END PGP SIGNATURE-----
Merge tag 'net-5.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Networking fixes, including fixes from netfilter, wireless and bpf
trees.
Current release - regressions:
- mt76: fix NULL pointer dereference in mt76u_status_worker and
mt76s_process_tx_queue
- net: ipa: fix interconnect enable bug
Current release - always broken:
- netfilter: fixes possible oops in mtype_resize in ipset
- ath11k: fix number of coding issues found by static analysis tools
and spurious error messages
Previous releases - regressions:
- e1000e: re-enable s0ix power saving flows for systems with the
Intel i219-LM Ethernet controllers to fix power use regression
- virtio_net: fix recursive call to cpus_read_lock() to avoid a
deadlock
- ipv4: ignore ECN bits for fib lookups in fib_compute_spec_dst()
- sysfs: take the rtnl lock around XPS configuration
- xsk: fix memory leak for failed bind and rollback reservation at
NETDEV_TX_BUSY
- r8169: work around power-saving bug on some chip versions
Previous releases - always broken:
- dcb: validate netlink message in DCB handler
- tun: fix return value when the number of iovs exceeds MAX_SKB_FRAGS
to prevent unnecessary retries
- vhost_net: fix ubuf refcount when sendmsg fails
- bpf: save correct stopping point in file seq iteration
- ncsi: use real net-device for response handler
- neighbor: fix div by zero caused by a data race (TOCTOU)
- bareudp: fix use of incorrect min_headroom size and a false
positive lockdep splat from the TX lock
- mvpp2:
- clear force link UP during port init procedure in case
bootloader had set it
- add TCAM entry to drop flow control pause frames
- fix PPPoE with ipv6 packet parsing
- fix GoP Networking Complex Control config of port 3
- fix pkt coalescing IRQ-threshold configuration
- xsk: fix race in SKB mode transmit with shared cq
- ionic: account for vlan tag len in rx buffer len
- stmmac: ignore the second clock input, current clock framework does
not handle exclusive clock use well, other drivers may reconfigure
the second clock
Misc:
- ppp: change PPPIOCUNBRIDGECHAN ioctl request number to follow
existing scheme"
* tag 'net-5.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (99 commits)
net: dsa: lantiq_gswip: Fix GSWIP_MII_CFG(p) register access
net: dsa: lantiq_gswip: Enable GSWIP_MII_CFG_EN also for internal PHYs
net: lapb: Decrease the refcount of "struct lapb_cb" in lapb_device_event
r8169: work around power-saving bug on some chip versions
net: usb: qmi_wwan: add Quectel EM160R-GL
selftests: mlxsw: Set headroom size of correct port
net: macb: Correct usage of MACB_CAPS_CLK_HW_CHG flag
ibmvnic: fix: NULL pointer dereference.
docs: networking: packet_mmap: fix old config reference
docs: networking: packet_mmap: fix formatting for C macros
vhost_net: fix ubuf refcount incorrectly when sendmsg fails
bareudp: Fix use of incorrect min_headroom size
bareudp: set NETIF_F_LLTX flag
net: hdlc_ppp: Fix issues when mod_timer is called while timer is running
atlantic: remove architecture depends
erspan: fix version 1 check in gre_parse_header()
net: hns: fix return value check in __lb_other_process()
net: sched: prevent invalid Scell_log shift count
net: neighbor: fix a crash caused by mod zero
ipv4: Ignore ECN bits for fib lookups in fib_compute_spec_dst()
...
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Missing sanitization of rateest userspace string, bug has been
triggered by syzbot, patch from Florian Westphal.
2) Report EOPNOTSUPP on missing set features in nft_dynset, otherwise
error reporting to userspace via EINVAL is misleading since this is
reserved for malformed netlink requests.
3) New binaries with old kernels might silently accept several set
element expressions. New binaries set on the NFT_SET_EXPR and
NFT_DYNSET_F_EXPR flags to request for several expressions per
element, hence old kernels which do not support for this bail out
with EOPNOTSUPP.
* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf:
netfilter: nftables: add set expression flags
netfilter: nft_dynset: report EOPNOTSUPP on missing set feature
netfilter: xt_RATEEST: reject non-null terminated string from userspace
====================
Link: https://lore.kernel.org/r/20210103192920.18639-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In lapb_device_event, lapb_devtostruct is called to get a reference to
an object of "struct lapb_cb". lapb_devtostruct increases the refcount
of the object and returns a pointer to it. However, we didn't decrease
the refcount after we finished using the pointer. This patch fixes this
problem.
Fixes: a4989fa911 ("net/lapb: support netdev events")
Cc: Martin Schiller <ms@dev.tdt.de>
Signed-off-by: Xie He <xie.he.0141@gmail.com>
Link: https://lore.kernel.org/r/20201231174331.64539-1-xie.he.0141@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since a few years, kernel addresses are no longer included in oops
dumps, at least on x86. All we get is a symbol name with offset and
size.
This is a problem for ceph_connection_operations handlers, especially
con->ops->dispatch(). All three handlers have the same name and there
is little context to disambiguate between e.g. monitor and OSD clients
because almost everything is inlined. gdb sneakily stops at the first
matching symbol, so one has to resort to nm and addr2line.
Some of these are already prefixed with mon_, osd_ or mds_. Let's do
the same for all others.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Try and avoid leaving bits and pieces of session key and connection
secret (gets split into GCM key and a pair of GCM IVs) around.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Use three-way comparison for address components to avoid integer
wraparound in the result of xfrm_policy_addr_delta(). This ensures
that the search trees are built and traversed correctly.
Treat IPv4 and IPv6 similarly by returning 0 when prefixlen == 0.
Prefix /0 has only one equivalence class.
Fixes: 9cf545ebd5 ("xfrm: policy: store inexact policies in a tree ordered by destination address")
Signed-off-by: Visa Hankala <visa@hankala.org>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
xfrm_probe_algs() probes kernel crypto modules and changes the
availability of struct xfrm_algo_desc. But there is a small window
where ealg->available and aalg->available get changed between
count_ah_combs()/count_esp_combs() and dump_ah_combs()/dump_esp_combs(),
in this case we may allocate a smaller skb but later put a larger
amount of data and trigger the panic in skb_put().
Fix this by relaxing the checks when counting the size, that is,
skipping the test of ->available. We may waste some memory for a few
of sizeof(struct sadb_comb), but it is still much better than a panic.
Reported-by: syzbot+b2bf2652983d23734c5c@syzkaller.appspotmail.com
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
The disable_xfrm flag signals that xfrm should not be performed during
routing towards a device before reaching device xmit.
For xfrm interfaces this is usually desired as they perform the outbound
policy lookup as part of their xmit using their if_id.
Before this change enabling this flag on xfrm interfaces prevented them
from xmitting as xfrm_lookup_with_ifid() would not perform a policy lookup
in case the original dst had the DST_NOXFRM flag.
This optimization is incorrect when the lookup is done by the xfrm
interface xmit logic.
Fix by performing policy lookup when invoked by xfrmi as if_id != 0.
Similarly it's unlikely for the 'no policy exists on net' check to yield
any performance benefits when invoked from xfrmi.
Fixes: f203b76d78 ("xfrm: Add virtual xfrm interfaces")
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Daniel Borkmann says:
====================
pull-request: bpf 2020-12-28
The following pull-request contains BPF updates for your *net* tree.
There is a small merge conflict between bpf tree commit 69ca310f34
("bpf: Save correct stopping point in file seq iteration") and net tree
commit 66ed594409 ("bpf/task_iter: In task_file_seq_get_next use
task_lookup_next_fd_rcu"). The get_files_struct() does not exist anymore
in net, so take the hunk in HEAD and add the `info->tid = curr_tid` to
the error path:
[...]
curr_task = task_seq_get_next(ns, &curr_tid, true);
if (!curr_task) {
info->task = NULL;
info->tid = curr_tid;
return NULL;
}
/* set info->task and info->tid */
[...]
We've added 10 non-merge commits during the last 9 day(s) which contain
a total of 11 files changed, 75 insertions(+), 20 deletions(-).
The main changes are:
1) Various AF_XDP fixes such as fill/completion ring leak on failed bind and
fixing a race in skb mode's backpressure mechanism, from Magnus Karlsson.
2) Fix latency spikes on lockdep enabled kernels by adding a rescheduling
point to BPF hashtab initialization, from Eric Dumazet.
3) Fix a splat in task iterator by saving the correct stopping point in the
seq file iteration, from Jonathan Lemon.
4) Fix BPF maps selftest by adding retries in case hashtab returns EBUSY
errors on update/deletes, from Andrii Nakryiko.
5) Fix BPF selftest error reporting to something more user friendly if the
vmlinux BTF cannot be found, from Kamal Mostafa.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Both version 0 and version 1 use ETH_P_ERSPAN, but version 0 does not
have an erspan header. So the check in gre_parse_header() is wrong,
we have to distinguish version 1 from version 0.
We can just check the gre header length like is_erspan_type1().
Fixes: cb73ee40b1 ("net: ip_gre: use erspan key field for tunnel lookup")
Reported-by: syzbot+f583ce3d4ddf9836b27a@syzkaller.appspotmail.com
Cc: William Tu <u9012063@gmail.com>
Cc: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Check Scell_log shift size in red_check_params() and modify all callers
of red_check_params() to pass Scell_log.
This prevents a shift out-of-bounds as detected by UBSAN:
UBSAN: shift-out-of-bounds in ./include/net/red.h:252:22
shift exponent 72 is too large for 32-bit type 'int'
Fixes: 8afa10cbe2 ("net_sched: red: Avoid illegal values")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: syzbot+97c5bd9cc81eca63d36e@syzkaller.appspotmail.com
Cc: Nogah Frankel <nogahf@mellanox.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: netdev@vger.kernel.org
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
pneigh_enqueue() tries to obtain a random delay by mod
NEIGH_VAR(p, PROXY_DELAY). However, NEIGH_VAR(p, PROXY_DELAY)
migth be zero at that point because someone could write zero
to /proc/sys/net/ipv4/neigh/[device]/proxy_delay after the
callers check it.
This patch uses prandom_u32_max() to get a random delay instead
which avoids potential division by zero.
Signed-off-by: weichenchen <weichen.chen@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RT_TOS() only clears one of the ECN bits. Therefore, when
fib_compute_spec_dst() resorts to a fib lookup, it can return
different results depending on the value of the second ECN bit.
For example, ECT(0) and ECT(1) packets could be treated differently.
$ ip netns add ns0
$ ip netns add ns1
$ ip link add name veth01 netns ns0 type veth peer name veth10 netns ns1
$ ip -netns ns0 link set dev lo up
$ ip -netns ns1 link set dev lo up
$ ip -netns ns0 link set dev veth01 up
$ ip -netns ns1 link set dev veth10 up
$ ip -netns ns0 address add 192.0.2.10/24 dev veth01
$ ip -netns ns1 address add 192.0.2.11/24 dev veth10
$ ip -netns ns1 address add 192.0.2.21/32 dev lo
$ ip -netns ns1 route add 192.0.2.10/32 tos 4 dev veth10 src 192.0.2.21
$ ip netns exec ns1 sysctl -wq net.ipv4.icmp_echo_ignore_broadcasts=0
With TOS 4 and ECT(1), ns1 replies using source address 192.0.2.21
(ping uses -Q to set all TOS and ECN bits):
$ ip netns exec ns0 ping -c 1 -b -Q 5 192.0.2.255
[...]
64 bytes from 192.0.2.21: icmp_seq=1 ttl=64 time=0.544 ms
But with TOS 4 and ECT(0), ns1 replies using source address 192.0.2.11
because the "tos 4" route isn't matched:
$ ip netns exec ns0 ping -c 1 -b -Q 6 192.0.2.255
[...]
64 bytes from 192.0.2.11: icmp_seq=1 ttl=64 time=0.597 ms
After this patch the ECN bits don't affect the result anymore:
$ ip netns exec ns0 ping -c 1 -b -Q 6 192.0.2.255
[...]
64 bytes from 192.0.2.21: icmp_seq=1 ttl=64 time=0.591 ms
Fixes: 35ebf65e85 ("ipv4: Create and use fib_compute_spec_dst() helper.")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Accesses to dev->xps_rxqs_map (when using dev->num_tc) should be
protected by the rtnl lock, like we do for netif_set_xps_queue. I didn't
see an actual bug being triggered, but let's be safe here and take the
rtnl lock while accessing the map in sysfs.
Fixes: 8af2c06ff4 ("net-sysfs: Add interface for Rx queue(s) map per Tx queue")
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Two race conditions can be triggered when storing xps rxqs, resulting in
various oops and invalid memory accesses:
1. Calling netdev_set_num_tc while netif_set_xps_queue:
- netif_set_xps_queue uses dev->tc_num as one of the parameters to
compute the size of new_dev_maps when allocating it. dev->tc_num is
also used to access the map, and the compiler may generate code to
retrieve this field multiple times in the function.
- netdev_set_num_tc sets dev->tc_num.
If new_dev_maps is allocated using dev->tc_num and then dev->tc_num
is set to a higher value through netdev_set_num_tc, later accesses to
new_dev_maps in netif_set_xps_queue could lead to accessing memory
outside of new_dev_maps; triggering an oops.
2. Calling netif_set_xps_queue while netdev_set_num_tc is running:
2.1. netdev_set_num_tc starts by resetting the xps queues,
dev->tc_num isn't updated yet.
2.2. netif_set_xps_queue is called, setting up the map with the
*old* dev->num_tc.
2.3. netdev_set_num_tc updates dev->tc_num.
2.4. Later accesses to the map lead to out of bound accesses and
oops.
A similar issue can be found with netdev_reset_tc.
One way of triggering this is to set an iface up (for which the driver
uses netdev_set_num_tc in the open path, such as bnx2x) and writing to
xps_rxqs in a concurrent thread. With the right timing an oops is
triggered.
Both issues have the same fix: netif_set_xps_queue, netdev_set_num_tc
and netdev_reset_tc should be mutually exclusive. We do that by taking
the rtnl lock in xps_rxqs_store.
Fixes: 8af2c06ff4 ("net-sysfs: Add interface for Rx queue(s) map per Tx queue")
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Accesses to dev->xps_cpus_map (when using dev->num_tc) should be
protected by the rtnl lock, like we do for netif_set_xps_queue. I didn't
see an actual bug being triggered, but let's be safe here and take the
rtnl lock while accessing the map in sysfs.
Fixes: 184c449f91 ("net: Add support for XPS with QoS via traffic classes")
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Two race conditions can be triggered when storing xps cpus, resulting in
various oops and invalid memory accesses:
1. Calling netdev_set_num_tc while netif_set_xps_queue:
- netif_set_xps_queue uses dev->tc_num as one of the parameters to
compute the size of new_dev_maps when allocating it. dev->tc_num is
also used to access the map, and the compiler may generate code to
retrieve this field multiple times in the function.
- netdev_set_num_tc sets dev->tc_num.
If new_dev_maps is allocated using dev->tc_num and then dev->tc_num
is set to a higher value through netdev_set_num_tc, later accesses to
new_dev_maps in netif_set_xps_queue could lead to accessing memory
outside of new_dev_maps; triggering an oops.
2. Calling netif_set_xps_queue while netdev_set_num_tc is running:
2.1. netdev_set_num_tc starts by resetting the xps queues,
dev->tc_num isn't updated yet.
2.2. netif_set_xps_queue is called, setting up the map with the
*old* dev->num_tc.
2.3. netdev_set_num_tc updates dev->tc_num.
2.4. Later accesses to the map lead to out of bound accesses and
oops.
A similar issue can be found with netdev_reset_tc.
One way of triggering this is to set an iface up (for which the driver
uses netdev_set_num_tc in the open path, such as bnx2x) and writing to
xps_cpus in a concurrent thread. With the right timing an oops is
triggered.
Both issues have the same fix: netif_set_xps_queue, netdev_set_num_tc
and netdev_reset_tc should be mutually exclusive. We do that by taking
the rtnl lock in xps_cpus_store.
Fixes: 184c449f91 ("net: Add support for XPS with QoS via traffic classes")
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
crypto_shash_setkey() and crypto_aead_setkey() will do a (small)
GFP_ATOMIC allocation to align the key if it isn't suitably aligned.
It's not a big deal, but at the same time easy to avoid.
The actual alignment requirement is dynamic, queryable with
crypto_shash_alignmask() and crypto_aead_alignmask(), but shouldn't
be stricter than 16 bytes for our algorithms.
Fixes: cd1a677cad ("libceph, ceph: implement msgr2.1 protocol (crc and secure modes)")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
auth_signature frame is 68 bytes in plain mode and 96 bytes in
secure mode but we are requesting 68 bytes in both modes. By luck,
this doesn't actually result in any invalid memory accesses because
the allocation is satisfied out of kmalloc-96 slab and so exactly
96 bytes are allocated, but KASAN rightfully complains.
Fixes: cd1a677cad ("libceph, ceph: implement msgr2.1 protocol (crc and secure modes)")
Reported-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The set flag NFT_SET_EXPR provides a hint to the kernel that userspace
supports for multiple expressions per set element. In the same
direction, NFT_DYNSET_F_EXPR specifies that dynset expression defines
multiple expressions per set element.
This allows new userspace software with old kernels to bail out with
EOPNOTSUPP. This update is similar to ef516e8625 ("netfilter:
nf_tables: reintroduce the NFT_SET_CONCAT flag"). The NFT_SET_EXPR flag
needs to be set on when the NFTA_SET_EXPRESSIONS attribute is specified.
The NFT_SET_EXPR flag is not set on with NFTA_SET_EXPR to retain
backward compatibility in old userspace binaries.
Fixes: 48b0ae046e ("netfilter: nftables: netlink support for several set element expressions")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If userspace requests a feature which is not available the original set
definition, then bail out with EOPNOTSUPP. If userspace sends
unsupported dynset flags (new feature not supported by this kernel),
then report EOPNOTSUPP to userspace. EINVAL should be only used to
report malformed netlink messages from userspace.
Fixes: 22fe54d5fe ("netfilter: nf_tables: add support for dynamic set updates")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Looks like this was missed when patching the source to clear the structures
throughout, causing this one instance to clear the struct after the response
id is assigned.
Fixes: eddb773211 ("Bluetooth: A2MP: Fix not initializing all members")
Signed-off-by: Christopher William Snowhill <chris@kode54.net>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
When aggregating ncsi interfaces and dedicated interfaces to bond
interfaces, the ncsi response handler will use the wrong net device to
find ncsi_dev, so that the ncsi interface will not work properly.
Here, we use the original net device to fix it.
Fixes: 138635cc27 ("net/ncsi: NCSI response packet handler")
Signed-off-by: John Wang <wangzhiqiang.bj@bytedance.com>
Link: https://lore.kernel.org/r/20201223055523.2069-1-wangzhiqiang.bj@bytedance.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
DCB uses the same handler function for both RTM_GETDCB and RTM_SETDCB
messages. dcb_doit() bounces RTM_SETDCB mesasges if the user does not have
the CAP_NET_ADMIN capability.
However, the operation to be performed is not decided from the DCB message
type, but from the DCB command. Thus DCB_CMD_*_GET commands are used for
reading DCB objects, the corresponding SET and DEL commands are used for
manipulation.
The assumption is that set-like commands will be sent via an RTM_SETDCB
message, and get-like ones via RTM_GETDCB. However, this assumption is not
enforced.
It is therefore possible to manipulate DCB objects without CAP_NET_ADMIN
capability by sending the corresponding command in an RTM_GETDCB message.
That is a bug. Fix it by validating the type of the request message against
the type used for the response.
Fixes: 2f90b8657e ("ixgbe: this patch adds support for DCB to the kernel and ixgbe driver")
Signed-off-by: Petr Machata <me@pmachata.org>
Link: https://lore.kernel.org/r/a2a9b88418f3a58ef211b718f2970128ef9e3793.1608673640.git.me@pmachata.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- fix long-standing limitation on open-unlink-fop pattern
- add refcount to p9_fid (fixes the above and will allow for more
cleanups and simplifications in the future)
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE/IPbcYBuWt0zoYhOq06b7GqY5nAFAl/gWOYACgkQq06b7GqY
5nBcHxAArtfxZ4wGa+OMoWt19UvF0bUeYUYdy75rp6awpXmQyMfMZY9oOrZ9L+St
mS3oLd8Mq7MeJQ+iGKPPOX+085aNhRtxjQOiHlv02fi09zVyNqj1vptq+rTRkKZL
8KTs+wYSIX2x1WensDBdohYFiWf87vCWyywpQ/1Vm0585mI8B88N3/H5EpkLbPZn
1bgXkakgd2hS+rKu6vJGV8lCJ8eKOSpQ3WUK0WQhY6ysJEL9dbl00qyNQ4kq2UgI
i3CpWC/86297KJOehPmegpVwqTIL5wrCkhQy7sNxDkR9E4Q3VoKTNSrnRDtBI0x/
mRCEnc/2Pf76Zv3LpGXnLFSnDLTQaO842Hla92RZJCcHaDtGDwV9Znq6sZgn+pdV
wlEpT383s+7AC/X1EFvLPMr5PHrEM1KW25YEYQI78X2hh8G+utW5p3+vu70b0PyT
Z7P9yBW4aU4cZb3XcZZihatZG2P1LxHyjNMmn4YBdUeQuitdiJGc2KwjVyKXi9PX
+t6xxI+YAIhu/5I9fRtOoIDQNTjsRRE+4FzRKd2kvxW5Z5sV/bZCf6CHnQ2uKYVP
fEMYd0GhjrUABbJPLLdFlBXaV+yigvwKtABDnJ5KSA9NlDSVBGewTRa2I+9DoclL
yQS4hELuvQsUNCo4AkICixo9vABt0j1OhXlo+B0WVk8Ilh9Wxpc=
=DKlc
-----END PGP SIGNATURE-----
Merge tag '9p-for-5.11-rc1' of git://github.com/martinetd/linux
Pull 9p update from Dominique Martinet:
- fix long-standing limitation on open-unlink-fop pattern
- add refcount to p9_fid (fixes the above and will allow for more
cleanups and simplifications in the future)
* tag '9p-for-5.11-rc1' of git://github.com/martinetd/linux:
9p: Remove unnecessary IS_ERR() check
9p: Uninitialized variable in v9fs_writeback_fid()
9p: Fix writeback fid incorrectly being attached to dentry
9p: apply review requests for fid refcounting
9p: add refcount to p9_fid struct
fs/9p: search open fids first
fs/9p: track open fids
fs/9p: fix create-unlink-getattr idiom
When setting xfrm replay_window to values higher than 32, a rare
page-fault occurs in xfrm_replay_advance_bmp:
BUG: unable to handle page fault for address: ffff8af350ad7920
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD ad001067 P4D ad001067 PUD 0
Oops: 0002 [#1] SMP PTI
CPU: 3 PID: 30 Comm: ksoftirqd/3 Kdump: loaded Not tainted 5.4.52-050452-generic #202007160732
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
RIP: 0010:xfrm_replay_advance_bmp+0xbb/0x130
RSP: 0018:ffffa1304013ba40 EFLAGS: 00010206
RAX: 000000000000010d RBX: 0000000000000002 RCX: 00000000ffffff4b
RDX: 0000000000000018 RSI: 00000000004c234c RDI: 00000000ffb3dbff
RBP: ffffa1304013ba50 R08: ffff8af330ad7920 R09: 0000000007fffffa
R10: 0000000000000800 R11: 0000000000000010 R12: ffff8af29d6258c0
R13: ffff8af28b95c700 R14: 0000000000000000 R15: ffff8af29d6258fc
FS: 0000000000000000(0000) GS:ffff8af339ac0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff8af350ad7920 CR3: 0000000015ee4000 CR4: 00000000001406e0
Call Trace:
xfrm_input+0x4e5/0xa10
xfrm4_rcv_encap+0xb5/0xe0
xfrm4_udp_encap_rcv+0x140/0x1c0
Analysis revealed offending code is when accessing:
replay_esn->bmp[nr] |= (1U << bitnr);
with 'nr' being 0x07fffffa.
This happened in an SMP system when reordering of packets was present;
A packet arrived with a "too old" sequence number (outside the window,
i.e 'diff > replay_window'), and therefore the following calculation:
bitnr = replay_esn->replay_window - (diff - pos);
yields a negative result, but since bitnr is u32 we get a large unsigned
quantity (in crash dump above: 0xffffff4b seen in ecx).
This was supposed to be protected by xfrm_input()'s former call to:
if (x->repl->check(x, skb, seq)) {
However, the state's spinlock x->lock is *released* after '->check()'
is performed, and gets re-acquired before '->advance()' - which gives a
chance for a different core to update the xfrm state, e.g. by advancing
'replay_esn->seq' when it encounters more packets - leading to a
'diff > replay_window' situation when original core continues to
xfrm_replay_advance_bmp().
An attempt to fix this issue was suggested in commit bcf66bf54a
("xfrm: Perform a replay check after return from async codepaths"),
by calling 'x->repl->recheck()' after lock is re-acquired, but fix
applied only to asyncronous crypto algorithms.
Augment the fix, by *always* calling 'recheck()' - irrespective if we're
using async crypto.
Fixes: 0ebea8ef35 ("[IPSEC]: Move state lock into x->type->input")
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
1) Incorrect loop in error path of nft_set_elem_expr_clone(),
from Colin Ian King.
2) Missing xt_table_get_private_protected() to access table
private data in x_tables, from Subash Abhinov Kasiviswanathan.
3) Possible oops in ipset hash type resize, from Vasily Averin.
4) Fix shift-out-of-bounds in ipset hash type, also from Vasily.
* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf:
netfilter: ipset: fix shift-out-of-bounds in htable_bits()
netfilter: ipset: fixes possible oops in mtype_resize
netfilter: x_tables: Update remaining dereference to RCU
netfilter: nftables: fix incorrect increment of loop counter
====================
Link: https://lore.kernel.org/r/20201218120409.3659-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
taprio_graft() can insert a NULL element in the array of child qdiscs. As
a consquence, taprio_reset() might not reset child qdiscs completely, and
taprio_destroy() might leak resources. Fix it by ensuring that loops that
iterate over q->qdiscs[] don't end when they find the first NULL item.
Fixes: 44d4775ca5 ("net/sched: sch_taprio: reset child qdiscs before freeing them")
Fixes: 5a781ccbd1 ("tc: Add support for configuring the taprio scheduler")
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Link: https://lore.kernel.org/r/13edef6778fef03adc751582562fba4a13e06d6a.1608240532.git.dcaratti@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Just like MGMT_OP_START_DISCOVERY, we should reject
MGMT_OP_START_SERVICE_DISCOVERY with MGMT_STATUS_BUSY when we are paused
for suspend.
Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Add a missing SUSPEND_SCAN_ENABLE in passive scan, remove the separate
function for configuring le scan during suspend and update the request
complete function to clear both enable and disable tasks.
Fixes: dce0a4be80 ("Bluetooth: Set missing suspend task bits")
Reviewed-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
On 64-bit systems the packet procfs header field names following 'sk'
are not aligned correctly:
sk RefCnt Type Proto Iface R Rmem User Inode
00000000605d2c64 3 3 0003 7 1 450880 0 16643
00000000080e9b80 2 2 0000 0 0 0 0 17404
00000000b23b8a00 2 2 0000 0 0 0 0 17421
...
With this change field names are correctly aligned:
sk RefCnt Type Proto Iface R Rmem User Inode
000000005c3b1d97 3 3 0003 7 1 21568 0 16178
000000007be55bb7 3 3 fbce 8 1 0 0 16250
00000000be62127d 3 3 fbcd 8 1 0 0 16254
...
Signed-off-by: Baruch Siach <baruch@tkos.co.il>
Link: https://lore.kernel.org/r/54917251d8433735d9a24e935a6cb8eb88b4058a.1608103684.git.baruch@tkos.co.il
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daire Byrne reports a ~50% aggregrate throughput regression on his
Linux NFS server after commit da1661b93b ("SUNRPC: Teach server to
use xprt_sock_sendmsg for socket sends"), which replaced
kernel_send_page() calls in NFSD's socket send path with calls to
sock_sendmsg() using iov_iter.
Investigation showed that tcp_sendmsg() was not using zero-copy to
send the xdr_buf's bvec pages, but instead was relying on memcpy.
This means copying every byte of a large NFS READ payload.
It looks like TLS sockets do indeed support a ->sendpage method,
so it's really not necessary to use xprt_sock_sendmsg() to support
TLS fully on the server. A mechanical reversion of da1661b93b is
not possible at this point, but we can re-implement the server's
TCP socket sendmsg path using kernel_sendpage().
Reported-by: Daire Byrne <daire@dneg.com>
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=209439
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Rollback the reservation in the completion ring when we get a
NETDEV_TX_BUSY. When this error is received from the driver, we are
supposed to let the user application retry the transmit again. And in
order to do this, we need to roll back the failed send so it can be
retried. Unfortunately, we did not cancel the reservation we had made
in the completion ring. By not doing this, we actually make the
completion ring one entry smaller per NETDEV_TX_BUSY error we get, and
after enough of these errors the completion ring will be of size zero
and transmit will stop working.
Fix this by cancelling the reservation when we get a NETDEV_TX_BUSY
error.
Fixes: 642e450b6b ("xsk: Do not discard packet when NETDEV_TX_BUSY")
Reported-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/20201218134525.13119-3-magnus.karlsson@gmail.com
Fix a race when multiple sockets are simultaneously calling sendto()
when the completion ring is shared in the SKB case. This is the case
when you share the same netdev and queue id through the
XDP_SHARED_UMEM bind flag. The problem is that multiple processes can
be in xsk_generic_xmit() and call the backpressure mechanism in
xskq_prod_reserve(xs->pool->cq). As this is a shared resource in this
specific scenario, a race might occur since the rings are
single-producer single-consumer.
Fix this by moving the tx_completion_lock from the socket to the pool
as the pool is shared between the sockets that share the completion
ring. (The pool is not shared when this is not the case.) And then
protect the accesses to xskq_prod_reserve() with this lock. The
tx_completion_lock is renamed cq_lock to better reflect that it
protects accesses to the potentially shared completion ring.
Fixes: 35fcde7f8d ("xsk: support for Tx")
Reported-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/20201218134525.13119-2-magnus.karlsson@gmail.com
Fix a possible memory leak when a bind of an AF_XDP socket fails. When
the fill and completion rings are created, they are tied to the
socket. But when the buffer pool is later created at bind time, the
ownership of these two rings are transferred to the buffer pool as
they might be shared between sockets (and the buffer pool cannot be
created until we know what we are binding to). So, before the buffer
pool is created, these two rings are cleaned up with the socket, and
after they have been transferred they are cleaned up together with
the buffer pool.
The problem is that ownership was transferred before it was absolutely
certain that the buffer pool could be created and initialized
correctly and when one of these errors occurred, the fill and
completion rings did neither belong to the socket nor the pool and
where therefore leaked. Solve this by moving the ownership transfer
to the point where the buffer pool has been completely set up and
there is no way it can fail.
Fixes: 7361f9c3d7 ("xsk: Move fill and completion rings to buffer pool")
Reported-by: syzbot+cfa88ddd0655afa88763@syzkaller.appspotmail.com
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/20201214085127.3960-1-magnus.karlsson@gmail.com
Current release - always broken:
- net/smc: fix access to parent of an ib device
- devlink: use _BITUL() macro instead of BIT() in the UAPI header
- handful of mptcp fixes
Previous release - regressions:
- intel: AF_XDP: clear the status bits for the next_to_use descriptor
- dpaa2-eth: fix the size of the mapped SGT buffer
Previous release - always broken:
- mptcp: fix security context on server socket
- ethtool: fix string set id check
- ethtool: fix error paths in ethnl_set_channels()
- lan743x: fix rx_napi_poll/interrupt ping-pong
- qca: ar9331: fix sleeping function called from invalid context bug
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAl/bxLMACgkQMUZtbf5S
IrtTkRAAoxZFOsuXeTejJqUPlUwxCfAWb4HhND9k2xLNqi2H85b16v56Fy/R6AzK
AKVHYFn0a0BetcicpjNZmV8bC7Xeu7YaEF8OmzBZ0TQ2OiCSHdMWFUOlo/2Q9ogN
xc0q4umaq10SekUEeanOyHta5Y9YEVaiRk/y6Eue5FGo9jbCbgDXKoxBvFzwlNkr
yuXFZOvY9CwlzKuBWFnpBNP6tizdG2q42JDif/v1nF0wLSEt7KkwKPOButusWbjd
WdqeZ6vazg2z6/PLJ/GCgr8vGrF5Ublp8uccz0kfTW3Cmu2jZH8SxsdaUcL2WRiR
rqnu0sYe6grQCF78lUZIpJp5K5TMpXI5UtjN2Kwv1J/yo8rSXMFqTNgdWNLxUkTF
VzhoufTrjQIR1ERAzEWUdz2JA2VyIoJ5YrqHFPwJworlpKDKWHVjn4rMmsxValoN
G+F3BzEBUlPbsV8IQa0uD9tW2qxtZ1g51uueKNkX9s5m/7mXdRoIFD6JWqoY4vg6
8oWhc3xpGry3u+pOSOWHMM1FVDpgiclF6ybuERcTHFcn821RdiLtfIaRKQQG72z2
oYoadbbwrR3CnYmYY0l/7LExYSuW0kaPPiuAPKRZiPb1vK3qv0X50GXylyFYdSfT
JPytadfHz5xo9bVE8x1sDCTkwNGPdWs+w6fyHyHNGp2EZVSJyno=
=nTjM
-----END PGP SIGNATURE-----
Merge tag 'net-5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Current release - always broken:
- net/smc: fix access to parent of an ib device
- devlink: use _BITUL() macro instead of BIT() in the UAPI header
- handful of mptcp fixes
Previous release - regressions:
- intel: AF_XDP: clear the status bits for the next_to_use descriptor
- dpaa2-eth: fix the size of the mapped SGT buffer
Previous release - always broken:
- mptcp: fix security context on server socket
- ethtool: fix string set id check
- ethtool: fix error paths in ethnl_set_channels()
- lan743x: fix rx_napi_poll/interrupt ping-pong
- qca: ar9331: fix sleeping function called from invalid context bug"
* tag 'net-5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (32 commits)
net/sched: sch_taprio: reset child qdiscs before freeing them
nfp: move indirect block cleanup to flower app stop callback
octeontx2-af: Fix undetected unmap PF error check
net: nixge: fix spelling mistake in Kconfig: "Instuments" -> "Instruments"
qlcnic: Fix error code in probe
mptcp: fix pending data accounting
mptcp: push pending frames when subflow has free space
mptcp: properly annotate nested lock
mptcp: fix security context on server socket
net/mlx5: Fix compilation warning for 32-bit platform
mptcp: clear use_ack and use_map when dropping other suboptions
devlink: use _BITUL() macro instead of BIT() in the UAPI header
net: korina: fix return value
net/smc: fix access to parent of an ib device
ethtool: fix error paths in ethnl_set_channels()
nfc: s3fwrn5: Remove unused NCI prop commands
nfc: s3fwrn5: Remove the delay for NFC sleep
phy: fix kdoc warning
tipc: do sanity check payload of a netlink message
use __netdev_notify_peers in hyperv
...
Highlights include:
Features:
- NFSv3: Add emulation of lookupp() to improve open_by_filehandle()
support.
- A series of patches to improve readdir performance, particularly with
large directories.
- Basic support for using NFS/RDMA with the pNFS files and flexfiles
drivers.
- Micro-optimisations for RDMA.
- RDMA tracing improvements.
Bugfixes:
- Fix a long standing bug with xs_read_xdr_buf() when receiving partial
pages (Dan Aloni).
- Various fixes for getxattr and listxattr, when used over non-TCP
transports.
- Fixes for containerised NFS from Sargun Dhillon.
- switch nfsiod to be an UNBOUND workqueue (Neil Brown).
- READDIR should not ask for security label information if there is no
LSM policy. (Olga Kornievskaia)
- Avoid using interval-based rebinding with TCP in lockd (Calum Mackay).
- A series of RPC and NFS layer fixes to support the NFSv4.2 READ_PLUS code.
- A couple of fixes for pnfs/flexfiles read failover
Cleanups:
- Various cleanups for the SUNRPC xdr code in conjunction with the
READ_PLUS fixes.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAl/aiaIACgkQZwvnipYK
APIOihAAvONscxrFSaGRh2ICNv9I/zXW/A5+R3qnkESPVLTqTPJVphoN7FlINAr1
B74pg6n4T4viycbvsogU2+kHrlJZO7B8lTkJL7ynm9Wgyw8+2Ga4QEn1bsAoqmuY
b91p/+LfOLKrYeeojoH31PC73uOYYG1WHXJhjq0l9b5CTgThWpj6O3gDaFEbFvmz
A7V3yqSp04sV70YxUhwelBHZ5BXdiXIKsPnIwvXXHuY7IcamrE4EA3wGCwtxkBnu
4dwbOtRXURNSev0r3n6FsH4wZl+/nvp9UpnGdPtVv94F1zm2JKLwkhoJejS/vpjq
eyKc7ZXBQ0uHbTWI2Yj1YjA61VIUO0R0EDuyTAnRKDeaarID42n5kMG7J8cIglZR
jQfyx99xm0eSrdwxC09tcRL/lBzYcOfc6pJo5P9BtaFtRvbp9iFIHuFKlrXbULd4
WrZzDMhiKVYGSTcTpfQyVoK2rCvn6W1Ida4iYeI0gkJ1v9X90UhbtJOyggn/bxyL
DV/Qy40+l48n7CZfPU2eDv4WXqjKGRibpDoWMBLwUH20dDEX6kKYv3BfApFYGqyO
/GTPAFUZarCy8BENvzZv/Jb9mt5pDQM5p9ZXpdUOhydLMMA+pauaT/Gr+pAHPIPx
MPj546Gh2cEaT883xvRrJmQTG0nw/WscPNcHaJcgL5oYltmuwck=
=IKWG
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.11-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Features:
- NFSv3: Add emulation of lookupp() to improve open_by_filehandle()
support
- A series of patches to improve readdir performance, particularly
with large directories
- Basic support for using NFS/RDMA with the pNFS files and flexfiles
drivers
- Micro-optimisations for RDMA
- RDMA tracing improvements
Bugfixes:
- Fix a long standing bug with xs_read_xdr_buf() when receiving
partial pages (Dan Aloni)
- Various fixes for getxattr and listxattr, when used over non-TCP
transports
- Fixes for containerised NFS from Sargun Dhillon
- switch nfsiod to be an UNBOUND workqueue (Neil Brown)
- READDIR should not ask for security label information if there is
no LSM policy (Olga Kornievskaia)
- Avoid using interval-based rebinding with TCP in lockd (Calum
Mackay)
- A series of RPC and NFS layer fixes to support the NFSv4.2
READ_PLUS code
- A couple of fixes for pnfs/flexfiles read failover
Cleanups:
- Various cleanups for the SUNRPC xdr code in conjunction with the
READ_PLUS fixes"
* tag 'nfs-for-5.11-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (90 commits)
NFS/pNFS: Fix a typo in ff_layout_resend_pnfs_read()
pNFS/flexfiles: Avoid spurious layout returns in ff_layout_choose_ds_for_read
NFSv4/pnfs: Add tracing for the deviceid cache
fs/lockd: convert comma to semicolon
NFSv4.2: fix error return on memory allocation failure
NFSv4.2/pnfs: Don't use READ_PLUS with pNFS yet
NFSv4.2: Deal with potential READ_PLUS data extent buffer overflow
NFSv4.2: Don't error when exiting early on a READ_PLUS buffer overflow
NFSv4.2: Handle hole lengths that exceed the READ_PLUS read buffer
NFSv4.2: decode_read_plus_hole() needs to check the extent offset
NFSv4.2: decode_read_plus_data() must skip padding after data segment
NFSv4.2: Ensure we always reset the result->count in decode_read_plus()
SUNRPC: When expanding the buffer, we may need grow the sparse pages
SUNRPC: Cleanup - constify a number of xdr_buf helpers
SUNRPC: Clean up open coded setting of the xdr_stream 'nwords' field
SUNRPC: _copy_to/from_pages() now check for zero length
SUNRPC: Cleanup xdr_shrink_bufhead()
SUNRPC: Fix xdr_expand_hole()
SUNRPC: Fixes for xdr_align_data()
SUNRPC: _shift_data_left/right_pages should check the shift length
...
adds the option of full in-transit encryption using AES-GCM algorithm
(myself). On top of that we have a series to avoid intermittent
errors during recovery with recover_session=clean and some MDS request
encoding work from Jeff, a cap handling fix and assorted observability
improvements from Luis and Xiubo and a good number of cleanups. Luis
also ran into a corner case with quotas which sadly means that we are
back to denying cross-quota-realm renames.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl/beWITHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi4i0CACnvd87l2n7dndig7p5d5lVsmo8tAFs
wHYHaIVisWKMcqKoT+YJajSgzaonxjzvYiyCzwLxV7s7vI7cswAwjEfYT7tTDRp2
pnO1+4N/1ftznnTk/1QdqwOQLUg5UtdgWvFCaXQF+Vr/YroZomKJPaK8fXK882pC
9FBjoLNy1HWySsoXPCxJktmDzpEEyYRNJg0vquxm7mxwTgQErupWlwEFjNg5LBkm
gC0UoKhCE3DeUrXnoq21Ga62RIajxHofTooNx7dg+JiSVgluW+nORaWDYJXNzwLC
j5puSe4pWIah+gmcwIFuyNz4ddkvVL4URvsYPGkVFYXlEefQjErc10Jh
=6b9f
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-5.11-rc1' of git://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"The big ticket item here is support for msgr2 on-wire protocol, which
adds the option of full in-transit encryption using AES-GCM algorithm
(myself).
On top of that we have a series to avoid intermittent errors during
recovery with recover_session=clean and some MDS request encoding work
from Jeff, a cap handling fix and assorted observability improvements
from Luis and Xiubo and a good number of cleanups.
Luis also ran into a corner case with quotas which sadly means that we
are back to denying cross-quota-realm renames"
* tag 'ceph-for-5.11-rc1' of git://github.com/ceph/ceph-client: (59 commits)
libceph: drop ceph_auth_{create,update}_authorizer()
libceph, ceph: make use of __ceph_auth_get_authorizer() in msgr1
libceph, ceph: implement msgr2.1 protocol (crc and secure modes)
libceph: introduce connection modes and ms_mode option
libceph, rbd: ignore addr->type while comparing in some cases
libceph, ceph: get and handle cluster maps with addrvecs
libceph: factor out finish_auth()
libceph: drop ac->ops->name field
libceph: amend cephx init_protocol() and build_request()
libceph, ceph: incorporate nautilus cephx changes
libceph: safer en/decoding of cephx requests and replies
libceph: more insight into ticket expiry and invalidation
libceph: move msgr1 protocol specific fields to its own struct
libceph: move msgr1 protocol implementation to its own file
libceph: separate msgr1 protocol implementation
libceph: export remaining protocol independent infrastructure
libceph: export zero_page
libceph: rename and export con->flags bits
libceph: rename and export con->state states
libceph: make con->state an int
...
htable_bits() can call jhash_size(32) and trigger shift-out-of-bounds
UBSAN: shift-out-of-bounds in net/netfilter/ipset/ip_set_hash_gen.h:151:6
shift exponent 32 is too large for 32-bit type 'unsigned int'
CPU: 0 PID: 8498 Comm: syz-executor519
Not tainted 5.10.0-rc7-next-20201208-syzkaller #0
Call Trace:
__dump_stack lib/dump_stack.c:79 [inline]
dump_stack+0x107/0x163 lib/dump_stack.c:120
ubsan_epilogue+0xb/0x5a lib/ubsan.c:148
__ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:395
htable_bits net/netfilter/ipset/ip_set_hash_gen.h:151 [inline]
hash_mac_create.cold+0x58/0x9b net/netfilter/ipset/ip_set_hash_gen.h:1524
ip_set_create+0x610/0x1380 net/netfilter/ipset/ip_set_core.c:1115
nfnetlink_rcv_msg+0xecc/0x1180 net/netfilter/nfnetlink.c:252
netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2494
nfnetlink_rcv+0x1ac/0x420 net/netfilter/nfnetlink.c:600
netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline]
netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1330
netlink_sendmsg+0x907/0xe40 net/netlink/af_netlink.c:1919
sock_sendmsg_nosec net/socket.c:652 [inline]
sock_sendmsg+0xcf/0x120 net/socket.c:672
____sys_sendmsg+0x6e8/0x810 net/socket.c:2345
___sys_sendmsg+0xf3/0x170 net/socket.c:2399
__sys_sendmsg+0xe5/0x1b0 net/socket.c:2432
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This patch replaces htable_bits() by simple fls(hashsize - 1) call:
it alone returns valid nbits both for round and non-round hashsizes.
It is normal to set any nbits here because it is validated inside
following htable_size() call which returns 0 for nbits>31.
Fixes: 1feab10d7e6d("netfilter: ipset: Unified hash type generation")
Reported-by: syzbot+d66bfadebca46cf61a2b@syzkaller.appspotmail.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
currently mtype_resize() can cause oops
t = ip_set_alloc(htable_size(htable_bits));
if (!t) {
ret = -ENOMEM;
goto out;
}
t->hregion = ip_set_alloc(ahash_sizeof_regions(htable_bits));
Increased htable_bits can force htable_size() to return 0.
In own turn ip_set_alloc(0) returns not 0 but ZERO_SIZE_PTR,
so follwoing access to t->hregion should trigger an OOPS.
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This fixes the dereference to fetch the RCU pointer when holding
the appropriate xtables lock.
Reported-by: kernel test robot <lkp@intel.com>
Fixes: cc00bcaa58 ("netfilter: x_tables: Switch synchronization to RCU")
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When sendmsg() needs to wait for memory, the pending data
is not updated. That causes a drift in forward memory allocation,
leading to stall and/or warnings at socket close time.
This change addresses the above issue moving the pending data
counter update inside the sendmsg() main loop.
Fixes: 6e628cd3a8 ("mptcp: use mptcp release_cb for delayed tasks")
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When multiple subflows are active, we can receive a
window update on subflow with no write space available.
MPTCP will try to push frames on such subflow and will
fail. Pending frames will be pushed only after receiving
a window update on a subflow with some wspace available.
Overall the above could lead to suboptimal aggregate
bandwidth usage.
Instead, we should try to push pending frames as soon as
the subflow reaches both conditions mentioned above.
We can finally enable self-tests with asymmetric links,
as the above makes them finally pass.
Fixes: 6f8a612a33 ("mptcp: keep track of advertised windows right edge")
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
MPTCP closes the subflows while holding the msk-level lock.
While acquiring the subflow socket lock we need to use the
correct nested annotation, or we can hit a lockdep splat
at runtime.
Reported-and-tested-by: Geliang Tang <geliangtang@gmail.com>
Fixes: e16163b6e2 ("mptcp: refactor shutdown and close")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently MPTCP is not propagating the security context
from the ingress request socket to newly created msk
at clone time.
Address the issue invoking the missing security helper.
Fixes: cf7da0d66c ("mptcp: Create SUBFLOW socket for incoming connections")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A smaller set of patches, nothing stands out as being particularly major
this cycle:
- Driver bug fixes and updates: bnxt_re, cxgb4, rxe, hns, i40iw, cxgb4,
mlx4 and mlx5
- Bug fixes and polishing for the new rts ULP
- Cleanup of uverbs checking for allowed driver operations
- Use sysfs_emit all over the place
- Lots of bug fixes and clarity improvements for hns
- hip09 support for hns
- NDR and 50/100Gb signaling rates
- Remove dma_virt_ops and go back to using the IB DMA wrappers
- mlx5 optimizations for contiguous DMA regions
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAl/aNXUACgkQOG33FX4g
mxqlMQ/+O6UhxKnDAnMB+HzDGvOm+KXNHOQBuzxz4ZWXqtUrW8WU5ca3PhXovc4z
/QX0HhMhQmVsva5mjp1OGVATxQ2E+yasqFLg4QXAFWFR3N7s0u/sikE9i1DoPvOC
lsmLTeRauCFaE4mJD5nvYwm+riECX0GmyVVW7v6V05xwAp0hwdhyU7Kb6Yh3lxsE
umTz+onPNJcD6Tc4snziyC5QEp5ebEjAaj4dVI1YPR5X0c2RwC5E1CIDI6u4OQ2k
j7/+Kvo8LNdYNERGiR169x6c1L7WS6dYnGMMeXRgyy0BVbVdRGDnvCV9VRmF66w5
99fHfDjNMNmqbGNt/4/gwNdVrR9aI4jMZWCh7SmsguX6XwNOlhYldy3x3WnlkfkQ
e4O0huJceJqcB2Uya70GqufnAetRXsbjzcvWxpR5YAwRmcRkm1f6aGK3BxPjWEbr
BbYRpiKMxxT4yTe65BuuThzx6g4pNQHe0z3BM/dzMJQAX+PZcs1CPQR8F8PbCrZR
Ad7qw4HJ587PoSxPi3toVMpYZRP6cISh1zx9q/JCj8cxH9Ri4MovUCS3cF63Ny3B
1LJ2q0x8FuLLjgZJogKUyEkS8OO6q7NL8WumjvrYWWx19+jcYsV81jTRGSkH3bfY
F7Esv5K2T1F2gVsCe1ZFFplQg6ja1afIcc+LEl8cMJSyTdoSub4=
=9t8b
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"A smaller set of patches, nothing stands out as being particularly
major this cycle. The biggest item would be the new HIP09 HW support
from HNS, otherwise it was pretty quiet for new work here:
- Driver bug fixes and updates: bnxt_re, cxgb4, rxe, hns, i40iw,
cxgb4, mlx4 and mlx5
- Bug fixes and polishing for the new rts ULP
- Cleanup of uverbs checking for allowed driver operations
- Use sysfs_emit all over the place
- Lots of bug fixes and clarity improvements for hns
- hip09 support for hns
- NDR and 50/100Gb signaling rates
- Remove dma_virt_ops and go back to using the IB DMA wrappers
- mlx5 optimizations for contiguous DMA regions"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (147 commits)
RDMA/cma: Don't overwrite sgid_attr after device is released
RDMA/mlx5: Fix MR cache memory leak
RDMA/rxe: Use acquire/release for memory ordering
RDMA/hns: Simplify AEQE process for different types of queue
RDMA/hns: Fix inaccurate prints
RDMA/hns: Fix incorrect symbol types
RDMA/hns: Clear redundant variable initialization
RDMA/hns: Fix coding style issues
RDMA/hns: Remove unnecessary access right set during INIT2INIT
RDMA/hns: WARN_ON if get a reserved sl from users
RDMA/hns: Avoid filling sl in high 3 bits of vlan_id
RDMA/hns: Do shift on traffic class when using RoCEv2
RDMA/hns: Normalization the judgment of some features
RDMA/hns: Limit the length of data copied between kernel and userspace
RDMA/mlx4: Remove bogus dev_base_lock usage
RDMA/uverbs: Fix incorrect variable type
RDMA/core: Do not indicate device ready when device enablement fails
RDMA/core: Clean up cq pool mechanism
RDMA/core: Update kernel documentation for ib_create_named_qp()
MAINTAINERS: SOFT-ROCE: Change Zhu Yanjun's email address
...
The parent of an ib device is used to retrieve the PCI device
attributes. It turns out that there are possible cases when an ib device
has no parent set in the device structure, which may lead to page
faults when trying to access this memory.
Fix that by checking the parent pointer and consolidate the pci device
specific processing in a new function.
Fixes: a3db10efcc ("net/smc: Add support for obtaining SMCR device list")
Reported-by: syzbot+600fef7c414ee7e2d71b@syzkaller.appspotmail.com
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Link: https://lore.kernel.org/r/20201215091058.49354-2-kgraul@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix two error paths in ethnl_set_channels() to avoid lock-up caused
but unreleased RTNL.
Fixes: e19c591eaf ("ethtool: set device channel counts with CHANNELS_SET request")
Reported-by: LiLiang <liali@redhat.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Link: https://lore.kernel.org/r/20201215090810.801777-1-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When we initialize nlmsghdr with no payload inside tipc_nl_compat_dumpit()
the parsing function returns -EINVAL. We fix it by making the parsing call
conditional.
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Link: https://lore.kernel.org/r/20201215033151.76139-1-hoang.h.le@dektech.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl/XeDUQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpnF9D/4+l1r1G5AcsSsgEvu1aCjP83LLWrHIAA5+
ca3OY6vwOjBvqI7oOoPcYJeYJ9uuGGQc31tDFJtP6Sl6Gk31AB4iSddyrowaX+t+
UJyJNfsgWKiLjY48EyQJ0gIqjuvPq8hPGMGClJb1A7+w87fqBC5UwCWEnJmE7MaX
401kIw0CRVWYTnDEOYxToss6D6gQ30E8UZjdJ0cG4g8xVQBY2kKwYR3F9tDlAwsY
CF+RCKpibcKwnaNZJBL67ClWjj1hC0ivg0O0G+W1UYysesKKdWFRI2rmxvH55K5T
7tHlfVuVPladNmlLVNZnCvyqBrFHyAZPmOsdv3xQOvJ7pZPaxKV9xIYryQKZW4H4
9tKkj3T1aop/fDGqIMxgymZsWW+1vvxAmM+7WkdOPHwHRSakJ5wGIj6Ekpton+5y
aixJUFq390o/o+S8PDO7mgzdvYrasv3iLl5UxnIcU3rq30wxnRKit4vUZny8DlzF
gOTw7QSocximhGYci+Uz4d4/XdK2CHc6eZDkQDltgJXxIrdsrN0qKxMCEsMKgCR1
RMiDv+52MP6kp/wpXiOHQF25YRnUOW0qfEjWKK6Ye28DGuKPPuIXtN/BUD3rjdIc
IJX3lDfOI3PgXNX24nOarucrF+ootyRmE6tGTVZhCVBhUXGR+MGatGfkeCqnmNzZ
gny2+UrGIQ==
=ly9V
-----END PGP SIGNATURE-----
Merge tag 'for-5.11/io_uring-2020-12-14' of git://git.kernel.dk/linux-block
Pull io_uring updates from Jens Axboe:
"Fairly light set of changes this time around, and mostly some bits
that were pushed out to 5.11 instead of 5.10, fixes/cleanups, and a
few features. In particular:
- Cleanups around iovec import (David Laight, Pavel)
- Add timeout support for io_uring_enter(2), which enables us to
clean up liburing and avoid a timeout sqe submission in the
completion path.
The big win here is that it allows setups that split SQ and CQ
handling into separate threads to avoid locking, as the CQ side
will no longer submit when timeouts are needed when waiting for
events (Hao Xu)
- Add support for socket shutdown, and renameat/unlinkat.
- SQPOLL cleanups and improvements (Xiaoguang Wang)
- Allow SQPOLL setups for CAP_SYS_NICE, and enable regular
(non-fixed) files to be used.
- Cancelation improvements (Pavel)
- Fixed file reference improvements (Pavel)
- IOPOLL related race fixes (Pavel)
- Lots of other little fixes and cleanups (mostly Pavel)"
* tag 'for-5.11/io_uring-2020-12-14' of git://git.kernel.dk/linux-block: (43 commits)
io_uring: fix io_cqring_events()'s noflush
io_uring: fix racy IOPOLL flush overflow
io_uring: fix racy IOPOLL completions
io_uring: always let io_iopoll_complete() complete polled io
io_uring: add timeout update
io_uring: restructure io_timeout_cancel()
io_uring: fix files cancellation
io_uring: use bottom half safe lock for fixed file data
io_uring: fix miscounting ios_left
io_uring: change submit file state invariant
io_uring: check kthread stopped flag when sq thread is unparked
io_uring: share fixed_file_refs b/w multiple rsrcs
io_uring: replace inflight_wait with tctx->wait
io_uring: don't take fs for recvmsg/sendmsg
io_uring: only wake up sq thread while current task is in io worker context
io_uring: don't acquire uring_lock twice
io_uring: initialize 'timeout' properly in io_sq_thread()
io_uring: refactor io_sq_thread() handling
io_uring: always batch cancel in *cancel_files()
io_uring: pass files into kill timeouts/poll
...
There are some use cases for netdev_notify_peers in the context
when rtnl lock is already held. Introduce lockless version
of netdev_notify_peers call to save the extra code to call
call_netdevice_notifiers(NETDEV_NOTIFY_PEERS, dev);
call_netdevice_notifiers(NETDEV_RESEND_IGMP, dev);
After that, convert netdev_notify_peers to call the new helper.
Suggested-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Lijun Pan <ljp@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAl/YBtEUHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNnwA/9Ek8DG/1t8CEoJxpoRvwovQxNo+bi
0rCT9vqvx9PeCwoZi/0Vp6oKmpE1HADvbeB/+e00VrbLYnzE3oRY6VkpjoZRofKS
vc0/MzHSFxFUR1OTHwCefcXlPLK+bfitQbX5jEMeVyQCXNXXIrN7CnJf1LmCeLTR
kQBPlEN9lt7HyNVAi34FhOD/TQbWnFHgl2z5puffgri6cWnc+TALKMYytUZ+rYex
NYndDJW5b3g5kTat2eErn0FruxfzloGs0xMIiWb+z2i9kl41D+dkKPdAN7idqCSC
Jv0nJP/bDftzA0wOe9szmGaLQzu7YnCN5kiWcSspatZVnon42Cy/tp9tiuPGLRFU
XtelDfpyX6o3CLN0tX7LQEO+GYxPzvM6iaR2OrsChWPozUIIR3TLQg7jJN4bvNKl
TR6gCGZCoAeS5JLNGjzVKxT/oKQY+tCLLlYXQdQY6swNFi3EKmPr+K1D9lgm98fO
f3d1QmWiZZNmtxxoVogT0qoQYjkfgpnm3dVx813Vt+lwHlVpHGMEPpO27iD3/RYb
w2yWOJaGKwMD8iL0l+Cm6CPW0/nE5FFISQjWgC8b4Vgxlyan6+L9eViqGICkrUQ2
Edo0i1YFFZ4utHYkDf1VYBbJ+36KyCtdktgLAcbgnePiPB3E1XBsXTIIStSUIbVQ
iEbTkBlsCG4GIeU=
=6Cqb
-----END PGP SIGNATURE-----
Merge tag 'selinux-pr-20201214' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux
Pull selinux updates from Paul Moore:
"While we have a small number of SELinux patches for v5.11, there are a
few changes worth highlighting:
- Change the LSM network hooks to pass flowi_common structs instead
of the parent flowi struct as the LSMs do not currently need the
full flowi struct and they do not have enough information to use it
safely (missing information on the address family).
This patch was discussed both with Herbert Xu (representing team
netdev) and James Morris (representing team
LSMs-other-than-SELinux).
- Fix how we handle errors in inode_doinit_with_dentry() so that we
attempt to properly label the inode on following lookups instead of
continuing to treat it as unlabeled.
- Tweak the kernel logic around allowx, auditallowx, and dontauditx
SELinux policy statements such that the auditx/dontauditx are
effective even without the allowx statement.
Everything passes our test suite"
* tag 'selinux-pr-20201214' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
lsm,selinux: pass flowi_common instead of flowi to the LSM hooks
selinux: Fix fall-through warnings for Clang
selinux: drop super_block backpointer from superblock_security_struct
selinux: fix inode_doinit_with_dentry() LABEL_INVALID error handling
selinux: allow dontauditx and auditallowx rules to take effect without allowx
selinux: fix error initialization in inode_doinit_with_dentry()
The intention of the err_expr cleanup path is to iterate over the
allocated expr_array objects and free them, starting from i - 1 and
working down to the start of the array. Currently the loop counter
is being incremented instead of decremented and also the index i is
being used instead of k, repeatedly destroying the same expr_array
element. Fix this by decrementing k and using k as the index into
expr_array.
Addresses-Coverity: ("Infinite loop")
Fixes: 8cfd9b0f85 ("netfilter: nftables: generalize set expressions support")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
- Improve support for re-exporting NFS mounts
- Replace NFSv4 XDR decoding C macros with xdr_stream helpers
- Support for multiple RPC/RDMA chunks per RPC transaction
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAl/Q4dIACgkQM2qzM29m
f5fInw//eDrmXBEhxbzcgeqNilGU5Qkn4INJtAcOGwPcw5Kjp4UVNGFpZNPqIDSf
FP0Yw0d/rW7UggwCviPcs/adLTasU9skq1jgAv8d0ig4DtPbeqFo6BvbY+G2JxVF
EfTeHzr6w6er8HRqyuLN4hjm1rQIpQlDHaYU4QcMs4fjPVv88eYLiwnYGYf3X46i
vBYstu1IRxHhg2x4O833xmiL6VbkZDQoWwDjGICylxUBcNUtAmq/sETjTa4JVEJj
4vgXdcJmAFjNgAOrmoR3DISsr9mvCvKN9g3C0+hHiRERTGEon//HzvscWH74wT48
o0LUW0ZWgpmunTcmiSNeeiHNsUXJyy3A/xyEdteqqnvSxulxlqkQzb15Eb+92+6n
BHGT/sOz1zz+/l9NCpdeEl5AkSA9plV8Iqd/kzwFwe1KwHMjldeMw/mhMut8EM2j
b54EMsp40ipITAwBHvcygCXiWAn/mPex6bCr17Dijo6MsNLsyd+cDsazntbNzwz3
RMGMf2TPOi8tWswrTUS9J5xKk5LAEWX/6Z/hTA1YlsB3PfrhXO97ztrytxvoO/bp
M0NREA+NNMn/JyyL8FT3ID5peaLVHhA1GHw9CcUw3C7OVzmsEg29D4zNo02dF1TC
LIyekp0kbSGGY1jLOeMLsa6Jr+2+40CcctsooVkRA+3rN0tJQvw=
=1uP3
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.11' of git://git.linux-nfs.org/projects/cel/cel-2.6
Pull nfsd updates from Chuck Lever:
"Several substantial changes this time around:
- Previously, exporting an NFS mount via NFSD was considered to be an
unsupported feature. With v5.11, the community has attempted to
make re-exporting a first-class feature of NFSD.
This would enable the Linux in-kernel NFS server to be used as an
intermediate cache for a remotely-located primary NFS server, for
example, even with other NFS server implementations, like a NetApp
filer, as the primary.
- A short series of patches brings support for multiple RPC/RDMA data
chunks per RPC transaction to the Linux NFS server's RPC/RDMA
transport implementation.
This is a part of the RPC/RDMA spec that the other premiere
NFS/RDMA implementation (Solaris) has had for a very long time, and
completes the implementation of RPC/RDMA version 1 in the Linux
kernel's NFS server.
- Long ago, NFSv4 support was introduced to NFSD using a series of C
macros that hid dprintk's and goto's. Over time, the kernel's XDR
implementation has been greatly improved, but these C macros have
remained and become fallow. A series of patches in this pull
request completely replaces those macros with the use of current
kernel XDR infrastructure. Benefits include:
- More robust input sanitization in NFSD's NFSv4 XDR decoders.
- Make it easier to use common kernel library functions that use
XDR stream APIs (for example, GSS-API).
- Align the structure of the source code with the RFCs so it is
easier to learn, verify, and maintain our XDR implementation.
- Removal of more than a hundred hidden dprintk() call sites.
- Removal of some explicit manipulation of pages to help make the
eventual transition to xdr->bvec smoother.
- On top of several related fixes in 5.10-rc, there are a few more
fixes to get the Linux NFSD implementation of NFSv4.2 inter-server
copy up to speed.
And as usual, there is a pinch of seasoning in the form of a
collection of unrelated minor bug fixes and clean-ups.
Many thanks to all who contributed this time around!"
* tag 'nfsd-5.11' of git://git.linux-nfs.org/projects/cel/cel-2.6: (131 commits)
nfsd: Record NFSv4 pre/post-op attributes as non-atomic
nfsd: Set PF_LOCAL_THROTTLE on local filesystems only
nfsd: Fix up nfsd to ensure that timeout errors don't result in ESTALE
exportfs: Add a function to return the raw output from fh_to_dentry()
nfsd: close cached files prior to a REMOVE or RENAME that would replace target
nfsd: allow filesystems to opt out of subtree checking
nfsd: add a new EXPORT_OP_NOWCC flag to struct export_operations
Revert "nfsd4: support change_attr_type attribute"
nfsd4: don't query change attribute in v2/v3 case
nfsd: minor nfsd4_change_attribute cleanup
nfsd: simplify nfsd4_change_info
nfsd: only call inode_query_iversion in the I_VERSION case
nfs_common: need lock during iterate through the list
NFSD: Fix 5 seconds delay when doing inter server copy
NFSD: Fix sparse warning in nfs4proc.c
SUNRPC: Remove XDRBUF_SPARSE_PAGES flag in gss_proxy upcall
sunrpc: clean-up cache downcall
nfsd: Fix message level for normal termination
NFSD: Remove macros that are no longer used
NFSD: Replace READ* macros in nfsd4_decode_compound()
...
Core:
- support "prefer busy polling" NAPI operation mode, where we defer softirq
for some time expecting applications to periodically busy poll
- AF_XDP: improve efficiency by more batching and hindering
the adjacency cache prefetcher
- af_packet: make packet_fanout.arr size configurable up to 64K
- tcp: optimize TCP zero copy receive in presence of partial or unaligned
reads making zero copy a performance win for much smaller messages
- XDP: add bulk APIs for returning / freeing frames
- sched: support fragmenting IP packets as they come out of conntrack
- net: allow virtual netdevs to forward UDP L4 and fraglist GSO skbs
BPF:
- BPF switch from crude rlimit-based to memcg-based memory accounting
- BPF type format information for kernel modules and related tracing
enhancements
- BPF implement task local storage for BPF LSM
- allow the FENTRY/FEXIT/RAW_TP tracing programs to use bpf_sk_storage
Protocols:
- mptcp: improve multiple xmit streams support, memory accounting and
many smaller improvements
- TLS: support CHACHA20-POLY1305 cipher
- seg6: add support for SRv6 End.DT4/DT6 behavior
- sctp: Implement RFC 6951: UDP Encapsulation of SCTP
- ppp_generic: add ability to bridge channels directly
- bridge: Connectivity Fault Management (CFM) support as is defined in
IEEE 802.1Q section 12.14.
Drivers:
- mlx5: make use of the new auxiliary bus to organize the driver internals
- mlx5: more accurate port TX timestamping support
- mlxsw:
- improve the efficiency of offloaded next hop updates by using
the new nexthop object API
- support blackhole nexthops
- support IEEE 802.1ad (Q-in-Q) bridging
- rtw88: major bluetooth co-existance improvements
- iwlwifi: support new 6 GHz frequency band
- ath11k: Fast Initial Link Setup (FILS)
- mt7915: dual band concurrent (DBDC) support
- net: ipa: add basic support for IPA v4.5
Refactor:
- a few pieces of in_interrupt() cleanup work from Sebastian Andrzej Siewior
- phy: add support for shared interrupts; get rid of multiple driver
APIs and have the drivers write a full IRQ handler, slight growth
of driver code should be compensated by the simpler API which
also allows shared IRQs
- add common code for handling netdev per-cpu counters
- move TX packet re-allocation from Ethernet switch tag drivers to
a central place
- improve efficiency and rename nla_strlcpy
- number of W=1 warning cleanups as we now catch those in a patchwork
build bot
Old code removal:
- wan: delete the DLCI / SDLA drivers
- wimax: move to staging
- wifi: remove old WDS wifi bridging support
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAl/YXmUACgkQMUZtbf5S
IrvSQBAAgOrt4EFopEvVqlTHZbqI45IEqgtXS+YWmlgnjZCgshyMj8q1yK1zzane
qYxr/NNJ9kV3FdtaynmmHPgEEEfR5kJ/D3B2BsxYDkaDDrD0vbNsBGw+L+/Gbhxl
N/5l/9FjLyLY1D+EErknuwR5XGuQ6BSDVaKQMhYOiK2hgdnAAI4hszo8Chf6wdD0
XDBslQ7vpD/05r+eMj0IkS5dSAoGOIFXUxhJ5dqrDbRHiKsIyWqA3PLbYemfAhxI
s2XckjfmSgGE3FKL8PSFu+EcfHbJQQjLcULJUnqgVcdwEEtRuE9ggEi52nZRXMWM
4e8sQJAR9Fx7pZy0G1xfS149j6iPU5LjRlU9TNSpVABz14Vvvo3gEL6gyIdsz+xh
hMN7UBdp0FEaP028CXoIYpaBesvQqj0BSndmee8qsYAtN6j+QKcM2AOSr7JN1uMH
C/86EDoGAATiEQIVWJvnX5MPmlAoblyLA+RuVhmxkIBx2InGXkFmWqRkXT5l4jtk
LVl8/TArR4alSQqLXictXCjYlCm9j5N4zFFtEVasSYi7/ZoPfgRNWT+lJ2R8Y+Zv
+htzGaFuyj6RJTVeFQMrkl3whAtBamo2a0kwg45NnxmmXcspN6kJX1WOIy82+MhD
Yht7uplSs7MGKA78q/CDU0XBeGjpABUvmplUQBIfrR/jKLW2730=
=GXs1
-----END PGP SIGNATURE-----
Merge tag 'net-next-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core:
- support "prefer busy polling" NAPI operation mode, where we defer
softirq for some time expecting applications to periodically busy
poll
- AF_XDP: improve efficiency by more batching and hindering the
adjacency cache prefetcher
- af_packet: make packet_fanout.arr size configurable up to 64K
- tcp: optimize TCP zero copy receive in presence of partial or
unaligned reads making zero copy a performance win for much smaller
messages
- XDP: add bulk APIs for returning / freeing frames
- sched: support fragmenting IP packets as they come out of conntrack
- net: allow virtual netdevs to forward UDP L4 and fraglist GSO skbs
BPF:
- BPF switch from crude rlimit-based to memcg-based memory accounting
- BPF type format information for kernel modules and related tracing
enhancements
- BPF implement task local storage for BPF LSM
- allow the FENTRY/FEXIT/RAW_TP tracing programs to use
bpf_sk_storage
Protocols:
- mptcp: improve multiple xmit streams support, memory accounting and
many smaller improvements
- TLS: support CHACHA20-POLY1305 cipher
- seg6: add support for SRv6 End.DT4/DT6 behavior
- sctp: Implement RFC 6951: UDP Encapsulation of SCTP
- ppp_generic: add ability to bridge channels directly
- bridge: Connectivity Fault Management (CFM) support as is defined
in IEEE 802.1Q section 12.14.
Drivers:
- mlx5: make use of the new auxiliary bus to organize the driver
internals
- mlx5: more accurate port TX timestamping support
- mlxsw:
- improve the efficiency of offloaded next hop updates by using
the new nexthop object API
- support blackhole nexthops
- support IEEE 802.1ad (Q-in-Q) bridging
- rtw88: major bluetooth co-existance improvements
- iwlwifi: support new 6 GHz frequency band
- ath11k: Fast Initial Link Setup (FILS)
- mt7915: dual band concurrent (DBDC) support
- net: ipa: add basic support for IPA v4.5
Refactor:
- a few pieces of in_interrupt() cleanup work from Sebastian Andrzej
Siewior
- phy: add support for shared interrupts; get rid of multiple driver
APIs and have the drivers write a full IRQ handler, slight growth
of driver code should be compensated by the simpler API which also
allows shared IRQs
- add common code for handling netdev per-cpu counters
- move TX packet re-allocation from Ethernet switch tag drivers to a
central place
- improve efficiency and rename nla_strlcpy
- number of W=1 warning cleanups as we now catch those in a patchwork
build bot
Old code removal:
- wan: delete the DLCI / SDLA drivers
- wimax: move to staging
- wifi: remove old WDS wifi bridging support"
* tag 'net-next-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1922 commits)
net: hns3: fix expression that is currently always true
net: fix proc_fs init handling in af_packet and tls
nfc: pn533: convert comma to semicolon
af_vsock: Assign the vsock transport considering the vsock address flags
af_vsock: Set VMADDR_FLAG_TO_HOST flag on the receive path
vsock_addr: Check for supported flag values
vm_sockets: Add VMADDR_FLAG_TO_HOST vsock flag
vm_sockets: Add flags field in the vsock address data structure
net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is disabled
tcp: Add logic to check for SYN w/ data in tcp_simple_retransmit
net: mscc: ocelot: install MAC addresses in .ndo_set_rx_mode from process context
nfc: s3fwrn5: Release the nfc firmware
net: vxget: clean up sparse warnings
mlxsw: spectrum_router: Use eXtended mezzanine to offload IPv4 router
mlxsw: spectrum: Set KVH XLT cache mode for Spectrum2/3
mlxsw: spectrum_router_xm: Introduce basic XM cache flushing
mlxsw: reg: Add Router LPM Cache Enable Register
mlxsw: reg: Add Router LPM Cache ML Delete Register
mlxsw: spectrum_router_xm: Implement L-value tracking for M-index
mlxsw: reg: Add XM Router M Table Register
...
proc_fs was used, in af_packet, without a surrounding #ifdef,
although there is no hard dependency on proc_fs.
That caused the initialization of the af_packet module to fail
when CONFIG_PROC_FS=n.
Specifically, proc_create_net() was used in af_packet.c,
and when it fails, packet_net_init() returns -ENOMEM.
It will always fail when the kernel is compiled without proc_fs,
because, proc_create_net() for example always returns NULL.
The calling order that starts in af_packet.c is as follows:
packet_init()
register_pernet_subsys()
register_pernet_operations()
__register_pernet_operations()
ops_init()
ops->init() (packet_net_ops.init=packet_net_init())
proc_create_net()
It worked in the past because register_pernet_subsys()'s return value
wasn't checked before this Commit 36096f2f4f ("packet: Fix error path in
packet_init.").
It always returned an error, but was not checked before, so everything
was working even when CONFIG_PROC_FS=n.
The fix here is simply to add the necessary #ifdef.
This also fixes a similar error in tls_proc.c, that was found by Jakub
Kicinski.
Fixes: d26b698dd3 ("net/tls: add skeleton of MIB statistics")
Fixes: 36096f2f4f ("packet: Fix error path in packet_init")
Signed-off-by: Yonatan Linik <yonatanlinik@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The vsock flags field can be set in the connect path (user space app)
and the (listen) receive path (kernel space logic).
When the vsock transport is assigned, the remote CID is used to
distinguish between types of connection.
Use the vsock flags value (in addition to the CID) from the remote
address to decide which vsock transport to assign. For the sibling VMs
use case, all the vsock packets need to be forwarded to the host, so
always assign the guest->host transport if the VMADDR_FLAG_TO_HOST flag
is set. For the other use cases, the vsock transport assignment logic is
not changed.
Changelog
v3 -> v4
* Update the "remote_flags" local variable type to reflect the change of
the "svm_flags" field to be 1 byte in size.
v2 -> v3
* Update bitwise check logic to not compare result to the flag value.
v1 -> v2
* Use bitwise operator to check the vsock flag.
* Use the updated "VMADDR_FLAG_TO_HOST" flag naming.
* Merge the checks for the g2h transport assignment in one "if" block.
Signed-off-by: Andra Paraschiv <andraprs@amazon.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The vsock flags can be set during the connect() setup logic, when
initializing the vsock address data structure variable. Then the vsock
transport is assigned, also considering this flags field.
The vsock transport is also assigned on the (listen) receive path. The
flags field needs to be set considering the use case.
Set the value of the vsock flags of the remote address to the one
targeted for packets forwarding to the host, if the following conditions
are met:
* The source CID of the packet is higher than VMADDR_CID_HOST.
* The destination CID of the packet is higher than VMADDR_CID_HOST.
Changelog
v3 -> v4
* No changes.
v2 -> v3
* No changes.
v1 -> v2
* Set the vsock flag on the receive path in the vsock transport
assignment logic.
* Use bitwise operator for the vsock flag setup.
* Use the updated "VMADDR_FLAG_TO_HOST" flag naming.
Signed-off-by: Andra Paraschiv <andraprs@amazon.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Check if the provided flags value from the vsock address data structure
includes the supported flags in the corresponding kernel version.
The first byte of the "svm_zero" field is used as "svm_flags", so add
the flags check instead.
Changelog
v3 -> v4
* New patch in v4.
Signed-off-by: Andra Paraschiv <andraprs@amazon.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
With NETIF_F_HW_TLS_TX packets are encrypted in HW. This cannot be
logically done when HW_CSUM offload is off.
Fixes: 2342a8512a ("net: Add TLS TX offload features")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Boris Pismenny <borisp@nvidia.com>
Link: https://lore.kernel.org/r/20201213143929.26253-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There are cases where a fastopen SYN may trigger either a ICMP_TOOBIG
message in the case of IPv6 or a fragmentation request in the case of
IPv4. This results in the socket stalling for a second or more as it does
not respond to the message by retransmitting the SYN frame.
Normally a SYN frame should not be able to trigger a ICMP_TOOBIG or
ICMP_FRAG_NEEDED however in the case of fastopen we can have a frame that
makes use of the entire MSS. In the case of fastopen it does, and an
additional complication is that the retransmit queue doesn't contain the
original frames. As a result when tcp_simple_retransmit is called and
walks the list of frames in the queue it may not mark the frames as lost
because both the SYN and the data packet each individually are smaller than
the MSS size after the adjustment. This results in the socket being stalled
until the retransmit timer kicks in and forces the SYN frame out again
without the data attached.
In order to resolve this we can reduce the MSS the packets are compared
to in tcp_simple_retransmit to -1 for cases where we are still in the
TCP_SYN_SENT state for a fastopen socket. Doing this we will mark all of
the packets related to the fastopen SYN as lost.
Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Link: https://lore.kernel.org/r/160780498125.3272.15437756269539236825.stgit@localhost.localdomain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
syzbot reproduces BUG_ON in skb_checksum_help():
tun creates (bogus) skb with huge partial-checksummed area and
small ip packet inside. Then ip_rcv trims the skb based on size
of internal ip packet, after that csum offset points beyond of
trimmed skb. Then checksum_tg() called via netfilter hook
triggers BUG_ON:
offset = skb_checksum_start_offset(skb);
BUG_ON(offset >= skb_headlen(skb));
To work around the problem this patch forces pskb_trim_rcsum_slow()
to return -EINVAL in described scenario. It allows its callers to
drop such kind of packets.
Link: https://syzkaller.appspot.com/bug?id=b419a5ca95062664fe1a60b764621eb4526e2cd0
Reported-by: syzbot+7010af67ced6105e5ab6@syzkaller.appspotmail.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/1b2494af-2c56-8ee2-7bc0-923fcad1cdf8@virtuozzo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- migrate_disable/enable() support which originates from the RT tree and
is now a prerequisite for the new preemptible kmap_local() API which aims
to replace kmap_atomic().
- A fair amount of topology and NUMA related improvements
- Improvements for the frequency invariant calculations
- Enhanced robustness for the global CPU priority tracking and decision
making
- The usual small fixes and enhancements all over the place
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/XwK4THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoX28D/9cVrvziSQGfBfuQWnUiw8iOIq1QBa2
Me+Tvenhfrlt7xU6rbP9ciFu7eTN+fS06m5uQPGI+t22WuJmHzbmw1bJVXfkvYfI
/QoU+Hg7DkDAn1p7ZKXh0dRkV0nI9ixxSHl0E+Zf1ATBxCUMV2SO85flg6z/4qJq
3VWUye0dmR7/bhtkIjv5rwce9v2JB2g1AbgYXYTW9lHVoUdGoMSdiZAF4tGyHLnx
sJ6DMqQ+k+dmPyYO0z5MTzjW/fXit4n9w2e3z9TvRH/uBu58WSW1RBmQYX6aHBAg
dhT9F4lvTs6lJY23x5RSFWDOv6xAvKF5a0xfb8UZcyH5EoLYrPRvm42a0BbjdeRa
u0z7LbwIlKA+RFdZzFZWz8UvvO0ljyMjmiuqZnZ5dY9Cd80LSBuxrWeQYG0qg6lR
Y2povhhCepEG+q8AXIe2YjHKWKKC1s/l/VY3CNnCzcd21JPQjQ4Z5eWGmHif5IED
CntaeFFhZadR3w02tkX35zFmY3w4soKKrbI4EKWrQwd+cIEQlOSY7dEPI/b5BbYj
MWAb3P4EG9N77AWTNmbhK4nN0brEYb+rBbCA+5dtNBVhHTxAC7OTWElJOC2O66FI
e06dREjvwYtOkRUkUguWwErbIai2gJ2MH0VILV3hHoh64oRk7jjM8PZYnjQkdptQ
Gsq0rJW5iiu/OQ==
=Oz1V
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Thomas Gleixner:
- migrate_disable/enable() support which originates from the RT tree
and is now a prerequisite for the new preemptible kmap_local() API
which aims to replace kmap_atomic().
- A fair amount of topology and NUMA related improvements
- Improvements for the frequency invariant calculations
- Enhanced robustness for the global CPU priority tracking and decision
making
- The usual small fixes and enhancements all over the place
* tag 'sched-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (61 commits)
sched/fair: Trivial correction of the newidle_balance() comment
sched/fair: Clear SMT siblings after determining the core is not idle
sched: Fix kernel-doc markup
x86: Print ratio freq_max/freq_base used in frequency invariance calculations
x86, sched: Use midpoint of max_boost and max_P for frequency invariance on AMD EPYC
x86, sched: Calculate frequency invariance for AMD systems
irq_work: Optimize irq_work_single()
smp: Cleanup smp_call_function*()
irq_work: Cleanup
sched: Limit the amount of NUMA imbalance that can exist at fork time
sched/numa: Allow a floating imbalance between NUMA nodes
sched: Avoid unnecessary calculation of load imbalance at clone time
sched/numa: Rename nr_running and break out the magic number
sched: Make migrate_disable/enable() independent of RT
sched/topology: Condition EAS enablement on FIE support
arm64: Rebuild sched domains on invariance status changes
sched/topology,schedutil: Wrap sched domains rebuild
sched/uclamp: Allow to reset a task uclamp constraint value
sched/core: Fix typos in comments
Documentation: scheduler: fix information on arch SD flags, sched_domain and sched_debug
...
I got a warining report:
br_sysfs_addbr: can't create group bridge4/bridge
------------[ cut here ]------------
sysfs group 'bridge' not found for kobject 'bridge4'
WARNING: CPU: 2 PID: 9004 at fs/sysfs/group.c:279 sysfs_remove_group fs/sysfs/group.c:279 [inline]
WARNING: CPU: 2 PID: 9004 at fs/sysfs/group.c:279 sysfs_remove_group+0x153/0x1b0 fs/sysfs/group.c:270
Modules linked in: iptable_nat
...
Call Trace:
br_dev_delete+0x112/0x190 net/bridge/br_if.c:384
br_dev_newlink net/bridge/br_netlink.c:1381 [inline]
br_dev_newlink+0xdb/0x100 net/bridge/br_netlink.c:1362
__rtnl_newlink+0xe11/0x13f0 net/core/rtnetlink.c:3441
rtnl_newlink+0x64/0xa0 net/core/rtnetlink.c:3500
rtnetlink_rcv_msg+0x385/0x980 net/core/rtnetlink.c:5562
netlink_rcv_skb+0x134/0x3d0 net/netlink/af_netlink.c:2494
netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline]
netlink_unicast+0x4a0/0x6a0 net/netlink/af_netlink.c:1330
netlink_sendmsg+0x793/0xc80 net/netlink/af_netlink.c:1919
sock_sendmsg_nosec net/socket.c:651 [inline]
sock_sendmsg+0x139/0x170 net/socket.c:671
____sys_sendmsg+0x658/0x7d0 net/socket.c:2353
___sys_sendmsg+0xf8/0x170 net/socket.c:2407
__sys_sendmsg+0xd3/0x190 net/socket.c:2440
do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
In br_device_event(), if the bridge sysfs fails to be added,
br_device_event() should return error. This can prevent warining
when removing bridge sysfs that do not exist.
Fixes: bb900b27a2 ("bridge: allow creating bridge devices with netlink")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Tested-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Link: https://lore.kernel.org/r/20201211122921.40386-1-wanghai38@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently the xmit path of the MPTCP protocol creates smaller-
than-max-size skbs, which is suboptimal for the performances.
There are a few things to improve:
- when coalescing to an existing skb, must clear the PUSH flag
- tcp_build_frag() expect the available space as an argument.
When coalescing is enable MPTCP already subtracted the
to-be-coalesced skb len. We must increment said argument
accordingly.
Before:
./use_mptcp.sh netperf -H 127.0.0.1 -t TCP_STREAM
[...]
131072 16384 16384 30.00 24414.86
After:
./use_mptcp.sh netperf -H 127.0.0.1 -t TCP_STREAM
[...]
131072 16384 16384 30.05 28357.69
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There is no need to unconditionally acquire the join list
lock, we can simply splice the join list into the subflow
list and traverse only the latter.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
parse the MPTCP FASTCLOSE subtype.
If provided key matches the local one, schedule the work queue to close
(with tcp reset) all subflows.
The MPTCP socket moves to closed state immediately.
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Because TCP-level resets only affect the subflow, there is a MPTCP
option to indicate that the MPTCP-level connection should be closed
immediately without a mptcp-level fin exchange.
This is the 'MPTCP fast close option'. It can be carried on ack
segments or TCP resets. In the latter case, its needed to parse mptcp
options also for reset packets so that MPTCP can act accordingly.
Next patch will add receive side fastclose support in MPTCP.
Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When processing options from tcp reset path its possible that
tcp_done(ssk) drops the last reference on the mptcp socket which
results in use-after-free.
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use the macro MPTCPOPT_HMAC_LEN instead of a constant in struct
mptcp_options_received.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When the PM netlink flushes the addresses, invoke the remove address
function mptcp_nl_remove_subflow_and_signal_addr to remove the addresses
and the subflows. Since this function should not be invoked under lock,
move __flush_addrs out of the pernet->lock.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
It has been observed that the kernel sockets created for the subflows
(except the first one) are not in the same cgroup as their parents.
That's because the additional subflows are created by kernel workers.
This is a problem with eBPF programs attached to the parent's
cgroup won't be executed for the children. But also with any other features
of CGroup linked to a sk.
This patch fixes this behaviour.
As the subflow sockets are created by the kernel, we can't use
'mem_cgroup_sk_alloc' because of the current context being the one of the
kworker. This is why we have to do low level memcg manipulation, if
required.
Suggested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Nicolas Rybowski <nicolas.rybowski@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, the exception actions are not processed correctly as the wrong
dataset is passed. This change fixes this, including the misleading
comment.
In addition, a check was added to make sure we work on an IPv4 packet,
and not just assume if it's not IPv6 it's IPv4.
This was all tested using OVS with patch,
https://patchwork.ozlabs.org/project/openvswitch/list/?series=21639,
applied and sending packets with a TTL of 1 (and 0), both with IPv4
and IPv6.
Fixes: 69929d4c49 ("net: openvswitch: fix TTL decrement action netlink message format")
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Link: https://lore.kernel.org/r/160733569860.3007.12938188180387116741.stgit@wsfd-netdev64.ntdv.lab.eng.bos.redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCX9daOgAKCRCRxhvAZXjc
ohPkAQChXUB2BAjtIzXlCkZoDBbzHHblm5DZ37oy/4xYFmAcEwEA5sw6dQqyGHnF
GEP9def51HvXLpBV2BzNUGggo1SoGgQ=
=w/cO
-----END PGP SIGNATURE-----
Merge tag 'fixes-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull misc fixes from Christian Brauner:
"This contains several fixes which felt worth being combined into a
single branch:
- Use put_nsproxy() instead of open-coding it switch_task_namespaces()
- Kirill's work to unify lifecycle management for all namespaces. The
lifetime counters are used identically for all namespaces types.
Namespaces may of course have additional unrelated counters and
these are not altered. This work allows us to unify the type of the
counters and reduces maintenance cost by moving the counter in one
place and indicating that basic lifetime management is identical
for all namespaces.
- Peilin's fix adding three byte padding to Dmitry's
PTRACE_GET_SYSCALL_INFO uapi struct to prevent an info leak.
- Two smal patches to convert from the /* fall through */ comment
annotation to the fallthrough keyword annotation which I had taken
into my branch and into -next before df561f6688 ("treewide: Use
fallthrough pseudo-keyword") made it upstream which fixed this
tree-wide.
Since I didn't want to invalidate all testing for other commits I
didn't rebase and kept them"
* tag 'fixes-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
nsproxy: use put_nsproxy() in switch_task_namespaces()
sys: Convert to the new fallthrough notation
signal: Convert to the new fallthrough notation
time: Use generic ns_common::count
cgroup: Use generic ns_common::count
mnt: Use generic ns_common::count
user: Use generic ns_common::count
pid: Use generic ns_common::count
ipc: Use generic ns_common::count
uts: Use generic ns_common::count
net: Use generic ns_common::count
ns: Add a common refcount into ns_common
ptrace: Prevent kernel-infoleak in ptrace_get_syscall_info()
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
1) Missing dependencies in NFT_BRIDGE_REJECT, from Randy Dunlap.
2) Use atomic_inc_return() instead of atomic_add_return() in IPVS,
from Yejune Deng.
3) Simplify check for overquota in xt_nfacct, from Kaixu Xia.
4) Move nfnl_acct_list away from struct net, from Miao Wang.
5) Pass actual sk in reject actions, from Jan Engelhardt.
6) Add timeout and protoinfo to ctnetlink destroy events,
from Florian Westphal.
7) Four patches to generalize set infrastructure to support
for multiple expressions per set element.
* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next:
netfilter: nftables: netlink support for several set element expressions
netfilter: nftables: generalize set extension to support for several expressions
netfilter: nftables: move nft_expr before nft_set
netfilter: nftables: generalize set expressions support
netfilter: ctnetlink: add timeout and protoinfo to destroy events
netfilter: use actual socket sk for REJECT action
netfilter: nfnl_acct: remove data from struct net
netfilter: Remove unnecessary conversion to bool
ipvs: replace atomic_add_return()
netfilter: nft_reject_bridge: fix build errors due to code movement
====================
Link: https://lore.kernel.org/r/20201212230513.3465-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf-next 2020-12-14
1) Expose bpf_sk_storage_*() helpers to iterator programs, from Florent Revest.
2) Add AF_XDP selftests based on veth devs to BPF selftests, from Weqaar Janjua.
3) Support for finding BTF based kernel attach targets through libbpf's
bpf_program__set_attach_target() API, from Andrii Nakryiko.
4) Permit pointers on stack for helper calls in the verifier, from Yonghong Song.
5) Fix overflows in hash map elem size after rlimit removal, from Eric Dumazet.
6) Get rid of direct invocation of llc in BPF selftests, from Andrew Delgadillo.
7) Fix xsk_recvmsg() to reorder socket state check before access, from Björn Töpel.
8) Add new libbpf API helper to retrieve ring buffer epoll fd, from Brendan Jackman.
9) Batch of minor BPF selftest improvements all over the place, from Florian Lehner,
KP Singh, Jiri Olsa and various others.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (31 commits)
selftests/bpf: Add a test for ptr_to_map_value on stack for helper access
bpf: Permits pointers on stack for helper calls
libbpf: Expose libbpf ring_buffer epoll_fd
selftests/bpf: Add set_attach_target() API selftest for module target
libbpf: Support modules in bpf_program__set_attach_target() API
selftests/bpf: Silence ima_setup.sh when not running in verbose mode.
selftests/bpf: Drop the need for LLVM's llc
selftests/bpf: fix bpf_testmod.ko recompilation logic
samples/bpf: Fix possible hang in xdpsock with multiple threads
selftests/bpf: Make selftest compilation work on clang 11
selftests/bpf: Xsk selftests - adding xdpxceiver to .gitignore
selftests/bpf: Drop tcp-{client,server}.py from Makefile
selftests/bpf: Xsk selftests - Bi-directional Sockets - SKB, DRV
selftests/bpf: Xsk selftests - Socket Teardown - SKB, DRV
selftests/bpf: Xsk selftests - DRV POLL, NOPOLL
selftests/bpf: Xsk selftests - SKB POLL, NOPOLL
selftests/bpf: Xsk selftests framework
bpf: Only provide bpf_sock_from_file with CONFIG_NET
bpf: Return -ENOTSUPP when attaching to non-kernel BTF
xsk: Validate socket state in xsk_recvmsg, prior touching socket members
...
====================
Link: https://lore.kernel.org/r/20201214214316.20642-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Implement msgr2.1 wire protocol, available since nautilus 14.2.11
and octopus 15.2.5. msgr2.0 wire protocol is not implemented -- it
has several security, integrity and robustness issues and therefore
considered deprecated.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
msgr2 supports two connection modes: crc (plain) and secure (on-wire
encryption). Connection mode is picked by server based on input from
client.
Introduce ms_mode option:
ms_mode=legacy - msgr1 (default)
ms_mode=crc - crc mode, if denied fail
ms_mode=secure - secure mode, if denied fail
ms_mode=prefer-crc - crc mode, if denied agree to secure mode
ms_mode=prefer-secure - secure mode, if denied agree to crc mode
ms_mode affects all connections, we don't separate connections to mons
like it's done in userspace with ms_client_mode vs ms_mon_client_mode.
For now the default is legacy, to be flipped to prefer-crc after some
time.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
For libceph, this ensures that libceph instance sharing (share option)
continues to work. For rbd, this avoids blocklisting alive lock owners
(locker addr is always LEGACY, while watcher addr is ANY in nautilus).
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In preparation for msgr2, make the cluster send us maps with addrvecs
including both LEGACY and MSGR2 addrs instead of a single LEGACY addr.
This means advertising support for SERVER_NAUTILUS and also some older
features: SERVER_MIMIC, MONENC and MONNAMES.
MONNAMES and MONENC are actually pre-argonaut, we just never updated
ceph_monmap_decode() for them. Decoding is unconditional, see commit
23c625ce30 ("libceph: assume argonaut on the server side").
SERVER_MIMIC doesn't bear any meaning for the kernel client.
Since ceph_decode_entity_addrvec() is guarded by encoding version
checks (and in msgr2 case it is guarded implicitly by the fact that
server is speaking msgr2), we assume MSG_ADDR2 for it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In preparation for msgr2, factor out finish_auth() so it is suitable
for both existing MAuth message based authentication and upcoming msgr2
authentication exchange.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In msgr2, initial authentication happens with an exchange of msgr2
control frames -- MAuth message and struct ceph_mon_request_header
aren't used. Make that optional.
Stop reporting cephx protocol as "x". Use "cephx" instead.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
- request service tickets together with auth ticket. Currently we get
auth ticket via CEPHX_GET_AUTH_SESSION_KEY op and then request service
tickets via CEPHX_GET_PRINCIPAL_SESSION_KEY op in a separate message.
Since nautilus, desired service tickets are shared togther with auth
ticket in CEPHX_GET_AUTH_SESSION_KEY reply.
- propagate session key and connection secret, if any. In preparation
for msgr2, update handle_reply() and verify_authorizer_reply() auth
ops to propagate session key and connection secret. Since nautilus,
if secure mode is negotiated, connection secret is shared either in
CEPHX_GET_AUTH_SESSION_KEY reply (for mons) or in a final authorizer
reply (for osds and mdses).
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Make it clear that "need" is a union of "missing" and "have, but up
for renewal" and dout when the ticket goes missing due to expiry or
invalidation by client.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
A pure move, no other changes.
Note that ceph_tcp_recv{msg,page}() and ceph_tcp_send{msg,page}()
helpers are also moved. msgr2 will bring its own, more efficient,
variants based on iov_iter. Switching msgr1 to them was considered
but decided against to avoid subtle regressions.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In preparation for msgr2, define internal messenger <-> protocol
interface (as opposed to external messenger <-> client interface, which
is struct ceph_connection_operations) consisting of try_read(),
try_write(), revoke(), revoke_incoming(), opened(), reset_session() and
reset_protocol() ops. The semantics are exactly the same as they are
now.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In preparation for msgr2, rename msgr1 specific states and move the
defines to the header file.
Also drop state transition comments. They don't cover all possible
transitions (e.g. NEGOTIATING -> STANDBY, etc) and currently do more
harm than good.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
unsigned long is a leftover from when con->state used to be a set of
bits managed with set_bit(), clear_bit(), etc. Save a bit of memory.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Our messenger instance addr->port is normally zero -- anything else is
nonsensical because as a client we connect to multiple servers and don't
listen on any port. However, a user can supply an arbitrary addr:port
via ip option and the port is currently preserved. Zero it.
Conversely, make sure our addr->nonce is non-zero. A zero nonce is
special: in combination with a zero port, it is used to blocklist the
entire ip.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Move the logic of grabbing the next message from the queue into its own
function. Like ceph_con_in_msg_alloc(), this is protocol independent.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
ceph_con_in_msg_alloc() is protocol independent, but con->in_hdr (and
struct ceph_msg_header in general) is msgr1 specific. While the struct
is deeply ingrained inside and outside the messenger, con->in_hdr field
can be separated.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Stick with pr_info message because session reset isn't an error most of
the time. When it is (i.e. if the server denies the reconnect attempt),
we get a bunch of other pr_err messages.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
con->peer_global_seq is part of session state. Clear it when
the server tells us to reset, not just in ceph_con_close().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Move protocol reset bits into ceph_con_reset_protocol(), leaving
just session reset bits.
Note that con->out_skip is now reset on faults. This fixes a crash
in the case of a stateful session getting a fault while in the middle
of revoking a message.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
A fault due to a version mismatch or a feature set mismatch used to be
treated differently from other faults: the connection would get closed
without trying to reconnect and there was a ->bad_proto() connection op
for notifying about that.
This changed a long time ago, see commits 6384bb8b8e ("libceph: kill
bad_proto ceph connection op") and 0fa6ebc600 ("libceph: fix protocol
feature mismatch failure path"). Nowadays these aren't any different
from other faults (i.e. we try to reconnect even though the mismatch
won't resolve until the server is replaced). reset_connection() calls
there are rather confusing because reset_connection() resets a session
together an individual instance of the protocol. This is cleaned up
in the next patch.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The current setting allows the backoff to climb up to 5 minutes. This
is too high -- it becomes hard to tell whether the client is stuck on
something or just in backoff.
In userspace, ms_max_backoff is defaulted to 15 seconds. Let's do the
same.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Pull crypto updates from Herbert Xu:
"API:
- Add speed testing on 1420-byte blocks for networking
Algorithms:
- Improve performance of chacha on ARM for network packets
- Improve performance of aegis128 on ARM for network packets
Drivers:
- Add support for Keem Bay OCS AES/SM4
- Add support for QAT 4xxx devices
- Enable crypto-engine retry mechanism in caam
- Enable support for crypto engine on sdm845 in qce
- Add HiSilicon PRNG driver support"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (161 commits)
crypto: qat - add capability detection logic in qat_4xxx
crypto: qat - add AES-XTS support for QAT GEN4 devices
crypto: qat - add AES-CTR support for QAT GEN4 devices
crypto: atmel-i2c - select CONFIG_BITREVERSE
crypto: hisilicon/trng - replace atomic_add_return()
crypto: keembay - Add support for Keem Bay OCS AES/SM4
dt-bindings: Add Keem Bay OCS AES bindings
crypto: aegis128 - avoid spurious references crypto_aegis128_update_simd
crypto: seed - remove trailing semicolon in macro definition
crypto: x86/poly1305 - Use TEST %reg,%reg instead of CMP $0,%reg
crypto: x86/sha512 - Use TEST %reg,%reg instead of CMP $0,%reg
crypto: aesni - Use TEST %reg,%reg instead of CMP $0,%reg
crypto: cpt - Fix sparse warnings in cptpf
hwrng: ks-sa - Add dependency on IOMEM and OF
crypto: lib/blake2s - Move selftest prototype into header file
crypto: arm/aes-ce - work around Cortex-A57/A72 silion errata
crypto: ecdh - avoid unaligned accesses in ecdh_set_secret()
crypto: ccree - rework cache parameters handling
crypto: cavium - Use dma_set_mask_and_coherent to simplify code
crypto: marvell/octeontx - Use dma_set_mask_and_coherent to simplify code
...
If we're shifting the page data to the right, and this happens to be a
sparse page array, then we may need to allocate new pages in order to
receive the data.
Reported-by: "Mkrtchyan, Tigran" <tigran.mkrtchyan@desy.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
There are a number of xdr helpers for struct xdr_buf that do not change
the structure itself. Mark those as taking const pointers for
documentation purposes.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Move the setting of the xdr_stream 'nwords' field into the helpers that
reset the xdr_stream cursor.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
We do want to try to grow the buffer if possible, but if that attempt
fails, we still want to move the data and truncate the XDR message.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The main use case right now for xdr_align_data() is to shift the page
data to the left, and in practice shrink the total XDR data buffer.
This patch ensures that we fix up the accounting for the buffer length
as we shift that data around.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Olga K. observed that rpcrdma_marsh_req() allocates sparse pages
only when it has determined that a Reply chunk is necessary. There
are plenty of cases where no Reply chunk is needed, but the
XDRBUF_SPARSE_PAGES flag is set. The result would be a crash in
rpcrdma_inline_fixup() when it tries to copy parts of the received
Reply into a missing page.
To avoid crashing, handle sparse page allocation up front.
Until XATTR support was added, this issue did not appear often
because the only SPARSE_PAGES consumer always expected a reply large
enough to always require a Reply chunk.
Reported-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When receiving pages data, return value 'ret' when positive includes
`buf->page_base`, so we should subtract that before it is used for
changing `offset` and comparing against `want`.
This was discovered on the very rare cases where the server returned a
chunk of bytes that when added to the already received amount of bytes
for the pages happened to match the current `recv.len`, for example
on this case:
buf->page_base : 258356
actually received from socket: 1740
ret : 260096
want : 260096
In this case neither of the two 'if ... goto out' trigger, and we
continue to tail parsing.
Worth to mention that the ensuing EMSGSIZE from the continued execution of
`xs_read_xdr_buf` may be observed by an application due to 4 superfluous
bytes being added to the pages data.
Fixes: 277e4ab7d5 ("SUNRPC: Simplify TCP receive code by switching to using iterators")
Signed-off-by: Dan Aloni <dan@kernelim.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
According to the X.25 documentation, there was a plan to implement
X.25-over-802.2-LLC. It never finished but left various code stubs in the
X.25 code. At this time it is unlikely that it would ever finish so it
may be better to remove those code stubs.
Also change the documentation to make it clear that this is not a ongoing
plan anymore. Change words like "will" to "could", "would", etc.
Cc: Martin Schiller <ms@dev.tdt.de>
Signed-off-by: Xie He <xie.he.0141@gmail.com>
Link: https://lore.kernel.org/r/20201209033346.83742-1-xie.he.0141@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
On a few of our systems, I found frequent 'unshare(CLONE_NEWNET)' calls
make the number of active slab objects including 'sock_inode_cache' type
rapidly and continuously increase. As a result, memory pressure occurs.
In more detail, I made an artificial reproducer that resembles the
workload that we found the problem and reproduce the problem faster. It
merely repeats 'unshare(CLONE_NEWNET)' 50,000 times in a loop. It takes
about 2 minutes. On 40 CPU cores / 70GB DRAM machine, the available
memory continuously reduced in a fast speed (about 120MB per second,
15GB in total within the 2 minutes). Note that the issue don't
reproduce on every machine. On my 6 CPU cores machine, the problem
didn't reproduce.
'cleanup_net()' and 'fqdir_work_fn()' are functions that deallocate the
relevant memory objects. They are asynchronously invoked by the work
queues and internally use 'rcu_barrier()' to ensure safe destructions.
'cleanup_net()' works in a batched maneer in a single thread worker,
while 'fqdir_work_fn()' works for each 'fqdir_exit()' call in the
'system_wq'. Therefore, 'fqdir_work_fn()' called frequently under the
workload and made the contention for 'rcu_barrier()' high. In more
detail, the global mutex, 'rcu_state.barrier_mutex' became the
bottleneck.
This commit avoids such contention by doing the 'rcu_barrier()' and
subsequent lightweight works in a batched manner, as similar to that of
'cleanup_net()'. The fqdir hashtable destruction, which is done before
the 'rcu_barrier()', is still allowed to run in parallel for fast
processing, but this commit makes it to use a dedicated work queue
instead of the 'system_wq', to make sure that the number of threads is
bounded.
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201211112405.31158-1-sjpark@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2020-12-12
Just one patch this time:
1) Redact the SA keys with kernel lockdown confidentiality.
If enabled, no secret keys are sent to uuserspace.
From Antony Antony.
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next:
xfrm: redact SA secret with lockdown confidentiality
====================
Link: https://lore.kernel.org/r/20201212085737.2101294-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch adds three new netlink attributes to encapsulate a list of
expressions per set elements:
- NFTA_SET_EXPRESSIONS: this attribute provides the set definition in
terms of expressions. New set elements get attached the list of
expressions that is specified by this new netlink attribute.
- NFTA_SET_ELEM_EXPRESSIONS: this attribute allows users to restore (or
initialize) the stateful information of set elements when adding an
element to the set.
- NFTA_DYNSET_EXPRESSIONS: this attribute specifies the list of
expressions that the set element gets when it is inserted from the
packet path.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch replaces NFT_SET_EXPR by NFT_SET_EXT_EXPRESSIONS. This new
extension allows to attach several expressions to one set element (not
only one single expression as NFT_SET_EXPR provides). This patch
prepares for support for several expressions per set element in the
netlink userspace API.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* validate key indices for key deletion
* more preamble support in mac80211
* various 6 GHz scan fixes/improvements
* a common SAR power limitations API
* various small fixes & code improvements
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl/TfYAACgkQB8qZga/f
l8ShjQ/9Hd6KjvA7keATtdjR7rDHo7H2nBKV/LukpuHsiTRrXTVOAfkcUTOb2hfR
7SJMzsUXdJGivbwm4lkx5TrIgiJm1hfW3zG0PFOs/bIuXs/KICrb+kLgQWiRIUfa
RIinf8BGPH3GgcCHDcWFUrnfnBVchrPUx2wIHoCQbCzLHIhB6q6x8jEJA67+smpv
57tDfUhm6pf6OYOqVN8HYlo0uRAIn1ImneplDelCmCI1dzlneEkMhqZuBXqWpD/I
C5vU+MjoOsJiW1XkmYOMe6VKQ/Bve06GUWs830S7aROOEfByv+ptlR9IjqvHvPIm
UI9NivfXQiZr6S7yD1m2xV7a14UMCIzYarwaM/I/NHAWF/Y4vzHFVzQjLfVKqMCV
dxxsWN+Yg7Gx3T5Fj3NNiQgnPF9ASVqgMrlC59ga+4If0y60V7dOSFuo9HF7AWgP
NIWKVI3He7Mb5TciM+BX5YQWkJiCSZXs427WLO6p0bp3kAgS6N6BThUraGCogXVF
1BT/y5G3QzZwg02vL3lxgWglXoH/e63UYCPt0r+i5c83Z+n4YnFSUZyRSViy9Elj
DkCgdxmP0OtM+FaHxLdYm+FL4GXaGWQVNORIDP0ViSrstPgSxhWIgVj/pKNcKC7g
bJI/IXm7eQkW5SXOafmhVV0TmvDOt/mM46E0CeWcPTqIhetk3lM=
=+qd6
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-net-next-2020-12-11' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
A new set of wireless changes:
* validate key indices for key deletion
* more preamble support in mac80211
* various 6 GHz scan fixes/improvements
* a common SAR power limitations API
* various small fixes & code improvements
* tag 'mac80211-next-for-net-next-2020-12-11' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next: (35 commits)
mac80211: add ieee80211_set_sar_specs
nl80211: add common API to configure SAR power limitations
mac80211: fix a mistake check for rx_stats update
mac80211: mlme: save ssid info to ieee80211_bss_conf while assoc
mac80211: Update rate control on channel change
mac80211: don't filter out beacons once we start CSA
mac80211: Fix calculation of minimal channel width
mac80211: ignore country element TX power on 6 GHz
mac80211: use bitfield helpers for BA session action frames
mac80211: support Rx timestamp calculation for all preamble types
mac80211: don't set set TDLS STA bandwidth wider than possible
mac80211: support driver-based disconnect with reconnect hint
cfg80211: support immediate reconnect request hint
mac80211: use struct assignment for he_obss_pd
cfg80211: remove struct ieee80211_he_bss_color
nl80211: validate key indexes for cfg80211_registered_device
cfg80211: include block-tx flag in channel switch started event
mac80211: disallow band-switch during CSA
ieee80211: update reduced neighbor report TBTT info length
cfg80211: Save the regulatory domain when setting custom regulatory
...
====================
Link: https://lore.kernel.org/r/20201211142552.209018-1-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, the set infrastucture allows for one single expressions per
element. This patch extends the existing infrastructure to allow for up
to two expressions. This is not updating the netlink API yet, this is
coming as an initial preparation patch.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
DESTROY events do not include the remaining timeout.
Add the timeout if the entry was removed explicitly. This can happen
when a conntrack gets deleted prematurely, e.g. due to a tcp reset,
module removal, netdev notifier (nat/masquerade device went down),
ctnetlink and so on.
Add the protocol state too for the destroy message to check for abnormal
state on connection termination.
Joint work with Pablo.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
xdp_return_frame_bulk() needs to pass a xdp_buff
to __xdp_return().
strlcpy got converted to strscpy but here it makes no
functional difference, so just keep the right code.
Conflicts:
net/netfilter/nf_tables_api.c
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This change registers ieee80211_set_sar_specs to
mac80211_config_ops, so cfg80211 can call it.
Signed-off-by: Carl Huang <cjhuang@codeaurora.org>
Reviewed-by: Brian Norris <briannorris@chromium.org>
Reviewed-by: Abhishek Kumar <kuabhs@chromium.org>
Link: https://lore.kernel.org/r/20201203103728.3034-3-cjhuang@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
NL80211_CMD_SET_SAR_SPECS is added to configure SAR from
user space. NL80211_ATTR_SAR_SPEC is used to pass the SAR
power specification when used with NL80211_CMD_SET_SAR_SPECS.
Wireless driver needs to register SAR type, supported frequency
ranges to wiphy, so user space can query it. The index in
frequency range is used to specify which sub band the power
limitation applies to. The SAR type is for compatibility, so later
other SAR mechanism can be implemented without breaking the user
space SAR applications.
Normal process is user space queries the SAR capability, and
gets the index of supported frequency ranges and associates the
power limitation with this index and sends to kernel.
Here is an example of message send to kernel:
8c 00 00 00 08 00 01 00 00 00 00 00 38 00 2b 81
08 00 01 00 00 00 00 00 2c 00 02 80 14 00 00 80
08 00 02 00 00 00 00 00 08 00 01 00 38 00 00 00
14 00 01 80 08 00 02 00 01 00 00 00 08 00 01 00
48 00 00 00
NL80211_CMD_SET_SAR_SPECS: 0x8c
NL80211_ATTR_WIPHY: 0x01(phy idx is 0)
NL80211_ATTR_SAR_SPEC: 0x812b (NLA_NESTED)
NL80211_SAR_ATTR_TYPE: 0x00 (NL80211_SAR_TYPE_POWER)
NL80211_SAR_ATTR_SPECS: 0x8002 (NLA_NESTED)
freq range 0 power: 0x38 in 0.25dbm unit (14dbm)
freq range 1 power: 0x48 in 0.25dbm unit (18dbm)
Signed-off-by: Carl Huang <cjhuang@codeaurora.org>
Reviewed-by: Brian Norris <briannorris@chromium.org>
Reviewed-by: Abhishek Kumar <kuabhs@chromium.org>
Link: https://lore.kernel.org/r/20201203103728.3034-2-cjhuang@codeaurora.org
[minor edits, NLA parse cleanups]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
It should be !is_multicast_ether_addr() in ieee80211_rx_h_sta_process()
for the rx_stats update, below commit remove the !, this patch is to
change it back.
It lead the rx rate "iw wlan0 station dump" become invalid for some
scenario when IEEE80211_HW_USES_RSS is set.
Fixes: 09a740ce35 ("mac80211: receive and process S1G beacons")
Signed-off-by: Wen Gong <wgong@codeaurora.org>
Link: https://lore.kernel.org/r/1607483189-3891-1-git-send-email-wgong@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The ssid info of ieee80211_bss_conf is filled in ieee80211_start_ap()
for AP mode. For STATION mode, it is empty, save the info from struct
ieee80211_mgd_assoc_data, the struct ieee80211_mgd_assoc_data will be
freed after assoc, so the ssid info of ieee80211_mgd_assoc_data can not
access after assoc, save ssid info to ieee80211_bss_conf, then ssid info
can be still access after assoc.
Signed-off-by: Wen Gong <wgong@codeaurora.org>
Link: https://lore.kernel.org/r/1607312195-3583-2-git-send-email-wgong@codeaurora.org
[reset on disassoc]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
A channel change or a channel bandwidth change can impact the
rate control logic. However, the rate control logic was not updated
before/after such a change, which might result in unexpected
behavior.
Fix this by updating the stations rate control logic when the
corresponding channel context changes.
Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20201206145305.600d967fe3c9.I48305f25cfcc9c032c77c51396e9e9b882748a86@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
I hit a bug in which we started a CSA with an action frame,
but the AP changed its mind and didn't change the beacon.
The CSA wasn't cancelled and we lost the connection.
The beacons were ignored because they never changed: they
never contained any CSA IE. Because they never changed, the
CRC of the beacon didn't change either which made us ignore
the beacons instead of processing them.
Now what happens is:
1) beacon has CRC X and it is valid. No CSA IE in the beacon
2) as long as beacon's CRC X, don't process their IEs
3) rx action frame with CSA
4) invalidate the beacon's CRC
5) rx beacon, CRC is still X, but now it is invalid
6) process the beacon, detect there is no CSA IE
7) abort CSA
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20201206145305.83470b8407e6.I739b907598001362744692744be15335436b8351@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When calculating the minimal channel width for channel context,
the current operation Rx channel width of a station was used and not
the overall channel width capability of the station, i.e., both for
Tx and Rx.
Fix ieee80211_get_sta_bw() to use the maximal channel width the
station is capable. While at it make the function static.
Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20201206145305.4387040b99a0.I74bcf19238f75a5960c4098b10e355123d933281@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Updates to the 802.11ax draft are coming that deprecate the
country element in favour of the transmit power envelope
element, and make the maximum transmit power level field in
the triplets reserved, so if we parse them we'd use 0 dBm
transmit power.
Follow suit and completely ignore the element on 6 GHz for
purposes of determining TX power.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20201206145305.9abf9f6b4f88.Icb6e52af586edcc74f1f0360e8f6fc9ef2bfe8f5@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When we set up a TDLS station, we set sta->sta.bandwidth solely based
on the capabilities, because the "what's the current bandwidth" check
is bypassed and only applied for other types of stations.
This leads to the unfortunate scenario that the sta->sta.bandwidth is
160 MHz if both stations support it, but we never actually configure
this bandwidth unless the AP is already using 160 MHz; even for wider
bandwidth support we only go up to 80 MHz (at least right now.)
For iwlwifi, this can also lead to firmware asserts, telling us that
we've configured the TX rates for a higher bandwidth than is actually
available due to the PHY configuration.
For non-TDLS, we check against the interface's requested bandwidth,
but we explicitly skip this check for TDLS to cope with the wider BW
case. Change this to
(a) still limit to the TDLS peer's own chandef, which gets factored
into the overall PHY configuration we request from the driver,
and
(b) limit it to when the TDLS peer is authorized, because it's only
factored into the channel context in this case.
Fixes: 504871e602 ("mac80211: fix bandwidth computation for TDLS peers")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20201206145305.fcc7d29c4590.I11f77e9e25ddf871a3c8d5604650c763e2c5887a@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
We don't really use this struct, we're now using
struct cfg80211_he_bss_color instead.
Change the one place in mac80211 that's using the old
name to use struct assignment instead of memcpy() and
thus remove the wrong sizeof while at it.
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20201206145305.f6698d97ae4e.Iba2dffcb79c4ab80bde7407609806010b55edfdf@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
syzbot discovered a bug in which an OOB access was being made because
an unsuitable key_idx value was wrongly considered to be acceptable
while deleting a key in nl80211_del_key().
Since we don't know the cipher at the time of deletion, if
cfg80211_validate_key_settings() were to be called directly in
nl80211_del_key(), even valid keys would be wrongly determined invalid,
and deletion wouldn't occur correctly.
For this reason, a new function - cfg80211_valid_key_idx(), has been
created, to determine if the key_idx value provided is valid or not.
cfg80211_valid_key_idx() is directly called in 2 places -
nl80211_del_key(), and cfg80211_validate_key_settings().
Reported-by: syzbot+49d4cab497c2142ee170@syzkaller.appspotmail.com
Tested-by: syzbot+49d4cab497c2142ee170@syzkaller.appspotmail.com
Suggested-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Anant Thazhemadam <anant.thazhemadam@gmail.com>
Link: https://lore.kernel.org/r/20201204215825.129879-1-anant.thazhemadam@gmail.com
Cc: stable@vger.kernel.org
[also disallow IGTK key IDs if no IGTK cipher is supported]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If the AP advertises a band switch during CSA, we will not have
the right information to continue working with it, since it will
likely (have to) change its capabilities and we don't track any
capability changes at all. Additionally, we store e.g. supported
rates per band, and that information would become invalid.
Since this is a fringe scenario, just disconnect explicitly.
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20201129172929.0e2327107c06.I461adb07704e056b054a4a7c29b80c95a9f56637@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Accept a scan request with the duration set even if the driver
does not support setting the scan dwell. The duration can be used
as a hint to the driver, but the driver may use its internal logic
for setting the scan dwell.
Signed-off-by: Avraham Stern <avraham.stern@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20201129172929.9491a12f9226.Ia9c5b24fcefc5ce5592537507243391633a27e5f@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In case of scan request with wildcard SSID, or in case of more
than one SSID in scan request, need to scan PSC channels even though
all the co-located APs found during the legacy bands scan indicated
that all the APs in their ESS are co-located, as we might find different
networks on the PSC channels.
Signed-off-by: Ayala Beker <ayala.beker@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20201129172929.736415a9ca5d.If5b3578ae85e11a707a5da07e66ba85928ba702c@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Support the driver incrementing MIC error and replay detected
counters when having detected a bad frame, if it drops it directly
instead of relying on mac80211 to do the checks.
These are then exposed to userspace, though currently only in some
cases and in debugfs.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20201129172929.fb59be9c6de8.Ife2260887366f585afadd78c983ebea93d2bb54b@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Instead of casting callback functions to type iw_handler, which trips
indirect call checking with Clang's Control-Flow Integrity (CFI), add
stub functions with the correct function type for the callbacks.
Reported-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20201117205902.405316-1-samitolvanen@google.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Don't populate the const array bws on the stack but instead it
static. Makes the object code smaller by 80 bytes:
Before:
text data bss dec hex filename
85694 16865 1216 103775 1955f ./net/wireless/reg.o
After:
text data bss dec hex filename
85518 16961 1216 103695 1950f ./net/wireless/reg.o
(gcc version 10.2.0)
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/r/20201116181636.362729-1-colin.king@canonical.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Commits
d3fd65484c ("net: core: add dev_sw_netstats_tx_add")
451b05f413 ("net: netdevice.h: sw_netstats_rx_add helper)
have added API to update net device per-cpu TX/RX stats.
Use core API instead of ieee80211_tx/rx_stats().
Signed-off-by: Lev Stipakov <lev@openvpn.net>
Reviewed-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://lore.kernel.org/r/20201113214623.144663-1-lev@openvpn.net
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The WLAN device may exist yet not be usable. This can happen
when the WLAN device is controllable by both the host and
some platform internal component.
We need some arbritration that is vendor specific, but when
the device is not available for the host, we need to reflect
this state towards the user space.
Add a reason field to the rfkill object (and event) so that
userspace can know why the device is in rfkill: because some
other platform component currently owns the device, or
because the actual hw rfkill signal is asserted.
Capable userspace can now determine the reason for the rfkill
and possibly do some negotiation on a side band channel using
a proprietary protocol to gain ownership on the device in case
the device is owned by some other component. When the host
gains ownership on the device, the kernel can remove the
RFKILL_HARD_BLOCK_NOT_OWNER reason and the hw rfkill state
will be off. Then, the userspace can bring the device up and
start normal operation.
The rfkill_event structure is enlarged to include the additional
byte, it is now 9 bytes long. Old user space will ask to read
only 8 bytes so that the kernel can know not to feed them with
more data. When the user space writes 8 bytes, new kernels will
just read what is present in the file descriptor. This new byte
is read only from the userspace standpoint anyway.
If a new user space uses an old kernel, it'll ask to read 9 bytes
but will get only 8, and it'll know that it didn't get the new
state. When it'll write 9 bytes, the kernel will again ignore
this new byte which is read only from the userspace standpoint.
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Link: https://lore.kernel.org/r/20201104134641.28816-1-emmanuel.grumbach@intel.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Alexei Starovoitov says:
====================
pull-request: bpf 2020-12-10
The following pull-request contains BPF updates for your *net* tree.
We've added 21 non-merge commits during the last 12 day(s) which contain
a total of 21 files changed, 163 insertions(+), 88 deletions(-).
The main changes are:
1) Fix propagation of 32-bit signed bounds from 64-bit bounds, from Alexei.
2) Fix ring_buffer__poll() return value, from Andrii.
3) Fix race in lwt_bpf, from Cong.
4) Fix test_offload, from Toke.
5) Various xsk fixes.
Please consider pulling these changes from:
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git
Thanks a lot!
Also thanks to reporters, reviewers and testers of commits in this pull-request:
Cong Wang, Hulk Robot, Jakub Kicinski, Jean-Philippe Brucker, John
Fastabend, Magnus Karlsson, Maxim Mikityanskiy, Yonghong Song
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
We use rcu_assign_pointer to assign both the table and the entries,
but the entries are not marked as __rcu. This generates sparse
warnings.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
A prior patch increased the size of struct tcp_zerocopy_receive
but did not update do_tcp_getsockopt() handling to properly account
for this.
This patch simply reintroduces content erroneously cut from the
referenced prior patch that handles the new struct size.
Fixes: 18fb76ed53 ("net-zerocopy: Copy straggler unaligned data for TCP Rx. zerocopy.")
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When CAN_ISOTP_SF_BROADCAST is set in the CAN_ISOTP_OPTS flags the CAN_ISOTP
socket is switched into functional addressing mode, where only single frame
(SF) protocol data units can be send on the specified CAN interface and the
given tp.tx_id after bind().
In opposite to normal and extended addressing this socket does not register a
CAN-ID for reception which would be needed for a 1-to-1 ISOTP connection with a
segmented bi-directional data transfer.
Sending SFs on this socket is therefore a TX-only 'broadcast' operation.
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Thomas Wagner <thwa1@web.de>
Link: https://lore.kernel.org/r/20201206144731.4609-1-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
1. When the x25 module gets loaded, layer 2 may already be running and
connected. In this case, although we are in X25_LINK_STATE_0, we still
need to handle the Restart Request received, rather than ignore it.
2. When we are in X25_LINK_STATE_2, we have already sent a Restart Request
and is waiting for the Restart Confirmation with t20timer. t20timer will
restart itself repeatedly forever so it will always be there, as long as we
are in State 2. So we don't need to check x25_t20timer_pending again.
Fixes: d023b2b9cc ("net/x25: fix restart request/confirm handling")
Cc: Martin Schiller <ms@dev.tdt.de>
Signed-off-by: Xie He <xie.he.0141@gmail.com>
Acked-by: Martin Schiller <ms@dev.tdt.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the workqueue disposes of the msk, the subflows can still
receive some data from the peer after __mptcp_close_ssk()
completes.
The above could trigger a race between the msk receive path and the
msk destruction. Acquiring the mptcp_data_lock() in __mptcp_destroy_sock()
will not save the day: the rx path could be reached even after msk
destruction completes.
Instead use the subflow 'disposable' flag to prevent entering
the msk receive path after __mptcp_close_ssk().
Fixes: e16163b6e2 ("mptcp: refactor shutdown and close")
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a MPTCP listener socket is closed with unaccepted
children pending, the ULP release callback will be invoked,
but nobody will call into __mptcp_close_ssk() on the
corresponding subflow.
As a consequence, at ULP release time, the 'disposable' flag
will be cleared and the subflow context memory will be leaked.
This change addresses the issue always freeing the context if
the subflow is still in the accept queue at ULP release time.
Additionally, this fixes an incorrect code reference in the
related comment.
Note: this fix leverages the changes introduced by the previous
commit.
Fixes: e16163b6e2 ("mptcp: refactor shutdown and close")
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Christoph reported the following splat:
WARNING: CPU: 0 PID: 4615 at net/ipv4/inet_connection_sock.c:1031 inet_csk_listen_stop+0x8e8/0xad0 net/ipv4/inet_connection_sock.c:1031
Modules linked in:
CPU: 0 PID: 4615 Comm: syz-executor.4 Not tainted 5.9.0 #37
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:inet_csk_listen_stop+0x8e8/0xad0 net/ipv4/inet_connection_sock.c:1031
Code: 03 00 00 00 e8 79 b2 3d ff e9 ad f9 ff ff e8 1f 76 ba fe be 02 00 00 00 4c 89 f7 e8 62 b2 3d ff e9 14 f9 ff ff e8 08 76 ba fe <0f> 0b e9 97 f8 ff ff e8 fc 75 ba fe be 03 00 00 00 4c 89 f7 e8 3f
RSP: 0018:ffffc900037f7948 EFLAGS: 00010293
RAX: ffff88810a349c80 RBX: ffff888114ee1b00 RCX: ffffffff827b14cd
RDX: 0000000000000000 RSI: ffffffff827b1c38 RDI: 0000000000000005
RBP: ffff88810a2a8000 R08: ffff88810a349c80 R09: fffff520006fef1f
R10: 0000000000000003 R11: fffff520006fef1e R12: ffff888114ee2d00
R13: dffffc0000000000 R14: 0000000000000001 R15: ffff888114ee1d68
FS: 00007f2ac1945700(0000) GS:ffff88811b400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffd44798bc0 CR3: 0000000109810002 CR4: 0000000000170ef0
Call Trace:
__tcp_close+0xd86/0x1110 net/ipv4/tcp.c:2433
__mptcp_close_ssk+0x256/0x430 net/mptcp/protocol.c:1761
__mptcp_destroy_sock+0x49b/0x770 net/mptcp/protocol.c:2127
mptcp_close+0x62d/0x910 net/mptcp/protocol.c:2184
inet_release+0xe9/0x1f0 net/ipv4/af_inet.c:434
__sock_release+0xd2/0x280 net/socket.c:596
sock_close+0x15/0x20 net/socket.c:1277
__fput+0x276/0x960 fs/file_table.c:281
task_work_run+0x109/0x1d0 kernel/task_work.c:151
get_signal+0xe8f/0x1d40 kernel/signal.c:2561
arch_do_signal+0x88/0x1b60 arch/x86/kernel/signal.c:811
exit_to_user_mode_loop kernel/entry/common.c:161 [inline]
exit_to_user_mode_prepare+0x9b/0xf0 kernel/entry/common.c:191
syscall_exit_to_user_mode+0x22/0x150 kernel/entry/common.c:266
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f2ac1254469
Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ff 49 2b 00 f7 d8 64 89 01 48
RSP: 002b:00007f2ac1944dc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffbf RBX: 000000000069bf00 RCX: 00007f2ac1254469
RDX: 0000000000000000 RSI: 0000000000008982 RDI: 0000000000000003
RBP: 000000000069bf00 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000069bf0c
R13: 00007ffeb53f178f R14: 00000000004668b0 R15: 0000000000000003
After commit 0397c6d85f ("mptcp: keep unaccepted MPC subflow into
join list"), the msk's workqueue and/or PM can touch the MPC
subflow - and acquire its socket lock - even if it's still unaccepted.
If the above event races with the relevant listener socket close, we
can end-up with the above splat.
This change addresses the issue delaying the MPC socket insertion
in conn_list at accept time - that is, partially reverting the
blamed commit.
We must additionally ensure that mptcp_pm_fully_established()
happens after accept() time, or the PM will not be able to
handle properly such event - conn_list could be empty otherwise.
In the receive path, we check the subflow list node to ensure
it is out of the listener queue. Be sure client subflows do
not match transiently such condition moving them into the join
list earlier at creation time.
Since we now have multiple mptcp_pm_fully_established() call sites
from different code-paths, said helper can now race with itself.
Use an additional PM status bit to avoid multiple notifications.
Reported-by: Christoph Paasch <cpaasch@apple.com>
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/103
Fixes: 0397c6d85f ("mptcp: keep unaccepted MPC subflow into join list"),
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the local variable sk has been defined, use it instead of
open-coding.
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the RM_ADDR signal had been reused with add_addr_signal, it's not
suitable to call it add_addr_signal or mptcp_add_addr_status. So this
patch renamed add_addr_signal to addr_signal, and renamed
mptcp_add_addr_status to mptcp_addr_signal_status.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch reused add_addr_signal for the RM_ADDR announcing signal, by
defining a new ADD_ADDR status named MPTCP_RM_ADDR_SIGNAL. Then the flag
rm_addr_signal in PM could be dropped.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch printed out more debugging information for the ADD_ADDR
suboption parsing on the incoming path.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch added a new parameter 'port' for mptcp_pm_announce_addr. If
this parameter is true, we set the MPTCP_ADD_ADDR_PORT bit of the
add_addr_signal. That means the announced address is added with a port
number.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The process is similar to that of the ADD_ADDR IPv6, this patch also sent
out a pure ack for the ADD_ADDR using port.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch added a new add_addr_signal type named MPTCP_ADD_ADDR_PORT,
to identify it is an address with port to be added.
It also added a new parameter 'port' for both mptcp_add_addr_len and
mptcp_pm_add_addr_signal.
In mptcp_established_options_add_addr, we check whether the announced
address is added with port. If it is, we put this port number to
mptcp_out_options's port field.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch uses adding up size to get the ADD_ADDR suboption length rather
than returning the ADD_ADDR size constants.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In rfc8684, the length of ADD_ADDR suboption with IPv4 address and port
is 18 octets, but mptcp_write_options is 32-bit aligned, so we need to
pad it to 20 octets. All the other port related option lengths need to
be added up 2 octets similarly.
This patch added a new field 'port' in mptcp_out_options. When this
field is set with a port number, we need to add up 4 octets for the
ADD_ADDR suboption, and put the port number into the suboption.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The length of ADD_ADDR6 is 12 octets longer than ADD_ADDR. That's the
only difference between them.
This patch dropped the duplicate code between ADD_ADDR and ADD_ADDR6
suboptions writing, and unify them into one.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are two differences between ADD_ADDR suboption and ADD_ADDR echo
suboption: The length of the former is 8 octets longer than the length
of the latter. The former's echo-flag is 0, and latter's echo-flag is 1.
This patch added two local variables, len and echo, to unify ADD_ADDR
and ADD_ADDR echo suboptions writing.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Switch to RCU in x_tables to fix possible NULL pointer dereference,
from Subash Abhinov Kasiviswanathan.
2) Fix netlink dump of dynset timeouts later than 23 days.
3) Add comment for the indirect serialization of the nft commit mutex
with rtnl_mutex.
4) Remove bogus check for confirmed conntrack when matching on the
conntrack ID, from Brett Mastbergen.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace a comma between expression statements by a semicolon.
Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When cwnd is not a multiple of the TSO skb size of N*MSS, we can get
into persistent scenarios where we have the following sequence:
(1) ACK for full-sized skb of N*MSS arrives
-> tcp_write_xmit() transmit full-sized skb with N*MSS
-> move pacing release time forward
-> exit tcp_write_xmit() because pacing time is in the future
(2) TSQ callback or TCP internal pacing timer fires
-> try to transmit next skb, but TSO deferral finds remainder of
available cwnd is not big enough to trigger an immediate send
now, so we defer sending until the next ACK.
(3) repeat...
So we can get into a case where we never mark ourselves as
cwnd-limited for many seconds at a time, even with
bulk/infinite-backlog senders, because:
o In case (1) above, every time in tcp_write_xmit() we have enough
cwnd to send a full-sized skb, we are not fully using the cwnd
(because cwnd is not a multiple of the TSO skb size). So every time we
send data, we are not cwnd limited, and so in the cwnd-limited
tracking code in tcp_cwnd_validate() we mark ourselves as not
cwnd-limited.
o In case (2) above, every time in tcp_write_xmit() that we try to
transmit the "remainder" of the cwnd but defer, we set the local
variable is_cwnd_limited to true, but we do not send any packets, so
sent_pkts is zero, so we don't call the cwnd-limited logic to update
tp->is_cwnd_limited.
Fixes: ca8a226343 ("tcp: make cwnd-limited checks measurement-based, and gentler")
Reported-by: Ingemar Johansson <ingemar.s.johansson@ericsson.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201209035759.1225145-1-ncardwell.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The offending commit introduces a cleanup callback that is invoked
when the driver module is removed to clean up the tunnel device
flow block. But it returns on the first iteration of the for loop.
The remaining indirect flow blocks will never be freed.
Fixes: 1fac52da59 ("net: flow_offload: consolidate indirect flow_block infrastructure")
CC: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
For DCTCP, we have to retain the ECT bits set by the congestion control
algorithm on the socket when reflecting syn TOS in syn-ack, in order to
make ECN work properly.
Fixes: ac8f1710c1 ("tcp: reflect tos value received in SYN to the socket")
Reported-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Wei Wang <weiwan@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Syzbot reported a stack overflow in bitmap_from_arr32() called from
ethnl_parse_bitset() when bitset from netlink message is longer than
target bitmap length. While ethnl_compact_sanity_checks() makes sure that
trailing part is all zeros (i.e. the request does not try to touch bits
kernel does not recognize), we also need to cap change_bits to nbits so
that we don't try to write past the prepared bitmaps.
Fixes: 88db6d1e4f ("ethtool: add ethnl_parse_bitset() helper")
Reported-by: syzbot+9d39fa49d4df294aab93@syzkaller.appspotmail.com
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Link: https://lore.kernel.org/r/3487ee3a98e14cd526f55b6caaa959d2dcbcad9f.1607465316.git.mkubecek@suse.cz
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The isotp socket can be widely configured in its behaviour regarding addressing
types, fill-ups, receive pattern tests and link layer length. Usually all
these settings need to be fixed before bind() and can not be changed
afterwards.
This patch adds a check to enforce the common usage pattern.
Fixes: e057dd3fc2 ("can: add ISO 15765-2:2016 transport protocol")
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Tested-by: Thomas Wagner <thwa1@web.de>
Link: https://lore.kernel.org/r/20201203140604.25488-2-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Link: https://lore.kernel.org/r/20201204133508.742120-3-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since commit 7f0a838254 ("bpf, xdp: Maintain info on attached XDP BPF
programs in net_device"), the XDP program attachment info is now maintained
in the core code. This interacts badly with the xdp_attachment_flags_ok()
check that prevents unloading an XDP program with different load flags than
it was loaded with. In practice, two kinds of failures are seen:
- An XDP program loaded without specifying a mode (and which then ends up
in driver mode) cannot be unloaded if the program mode is specified on
unload.
- The dev_xdp_uninstall() hook always calls the driver callback with the
mode set to the type of the program but an empty flags argument, which
means the flags_ok() check prevents the program from being removed,
leading to bpf prog reference leaks.
The original reason this check was added was to avoid ambiguity when
multiple programs were loaded. With the way the checks are done in the core
now, this is quite simple to enforce in the core code, so let's add a check
there and get rid of the xdp_attachment_flags_ok() callback entirely.
Fixes: 7f0a838254 ("bpf, xdp: Maintain info on attached XDP BPF programs in net_device")
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/bpf/160752225751.110217.10267659521308669050.stgit@toke.dk
There's no need to defer allocation of pages for the receive buffer.
- This upcall is quite infrequent
- gssp_alloc_receive_pages() can allocate the pages with GFP_KERNEL,
unlike the transport
- gssp_alloc_receive_pages() knows exactly how many pages are needed
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Olga Kornievskaia <kolga@netapp.com>
We can simplify code around cache_downcall unifying memory
allocations using kvmalloc. This has the benefit of getting rid of
cache_slow_downcall (and queue_io_mutex), and also matches userland
allocation size and limits.
Signed-off-by: Roberto Bergantinos Corpas <rbergant@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Since commit 656c8e9cc1 ("netfilter: conntrack: Use consistent ct id
hash calculation") the ct id will not change from initialization to
confirmation. Removing the confirmation check allows for things like
adding an element to a 'typeof ct id' set in prerouting upon reception
of the first packet of a new connection, and then being able to
reference that set consistently both before and after the connection
is confirmed.
Fixes: 656c8e9cc1 ("netfilter: conntrack: Use consistent ct id hash calculation")
Signed-off-by: Brett Mastbergen <brett.mastbergen@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This moves the bpf_sock_from_file definition into net/core/filter.c
which only gets compiled with CONFIG_NET and also moves the helper proto
usage next to other tracing helpers that are conditional on CONFIG_NET.
This avoids
ld: kernel/trace/bpf_trace.o: in function `bpf_sock_from_file':
bpf_trace.c:(.text+0xe23): undefined reference to `sock_from_file'
When compiling a kernel with BPF and without NET.
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: KP Singh <kpsingh@kernel.org>
Link: https://lore.kernel.org/bpf/20201208173623.1136863-1-revest@chromium.org
Before commit a337531b94 ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")
small tcp_rmem[1] values were overridden by tcp_fixup_rcvbuf() to accommodate various MSS.
This is no longer the case, and Hazem Mohamed Abuelfotoh reported
that DRS would not work for MTU 9000 endpoints receiving regular (1500 bytes) frames.
Root cause is that tcp_init_buffer_space() uses tp->rcv_wnd for upper limit
of rcvq_space.space computation, while it can select later a smaller
value for tp->rcv_ssthresh and tp->window_clamp.
ss -temoi on receiver would show :
skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) rcv_space:62496 rcv_ssthresh:56596
This means that TCP can not increase its window in tcp_grow_window(),
and that DRS can never kick.
Fix this by making sure that rcvq_space.space is not bigger than number of bytes
that can be held in TCP receive queue.
People unable/unwilling to change their kernel can work around this issue by
selecting a bigger tcp_rmem[1] value as in :
echo "4096 196608 6291456" >/proc/sys/net/ipv4/tcp_rmem
Based on an initial report and patch from Hazem Mohamed Abuelfotoh
https://lore.kernel.org/netdev/20201204180622.14285-1-abuehaze@amazon.com/
Fixes: a337531b94 ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")
Fixes: 041a14d267 ("tcp: start receiver buffer autotuning sooner")
Reported-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simplify the return expression.
Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Reviewed-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a spelling mistake in the Kconfig help text. Fix it.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth-next 2020-12-07
Here's the main bluetooth-next pull request for the 5.11 kernel.
- Updated Bluetooth entries in MAINTAINERS to include Luiz von Dentz
- Added support for Realtek 8822CE and 8852A devices
- Added support for MediaTek MT7615E device
- Improved workarounds for fake CSR devices
- Fix Bluetooth qualification test case L2CAP/COS/CFD/BV-14-C
- Fixes for LL Privacy support
- Enforce 16 byte encryption key size for FIPS security level
- Added new mgmt commands for extended advertising support
- Multiple other smaller fixes & improvements
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
`tipc_node_apply_property` does a null check on a `tipc_link_entry`
pointer but also accesses the same pointer out of the null check block.
This triggers a warning on Coverity Static Analyzer because we're
implying that `e->link` can BE null.
Move "Update MTU for node link entry" line into if block to make sure
that we're not in a state that `e->link` is null.
Signed-off-by: Cengiz Can <cengiz@kernel.wtf>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add an explicit comment in the code to describe the indirect
serialization of the holders of the commit_mutex with the rtnl_mutex.
Commit 90d2723c6d ("netfilter: nf_tables: do not hold reference on
netdevice from preparation phase") already describes this, but a comment
in this case is better for reference.
Reported-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Use nf_msecs_to_jiffies64 and nf_jiffies64_to_msecs as provided by
8e1102d5a1 ("netfilter: nf_tables: support timeouts larger than 23
days"), otherwise ruleset listing breaks.
Fixes: a8b1e36d0d ("netfilter: nft_dynset: fix element timeout for HZ != 1000")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
These warnings become somewhat more informative when they include the
MTU value that could not be set and not just the errno.
Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Link: https://lore.kernel.org/r/20201205133944.10182-1-rasmus.villemoes@prevas.dk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In AF_XDP the socket state needs to be checked, prior touching the
members of the socket. This was not the case for the recvmsg
implementation. Fix that by moving the xsk_is_bound() call.
Fixes: 45a8668184 ("xsk: Add support for recvmsg()")
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20201207082008.132263-1-bjorn.topel@gmail.com
When running concurrent iptables rules replacement with data, the per CPU
sequence count is checked after the assignment of the new information.
The sequence count is used to synchronize with the packet path without the
use of any explicit locking. If there are any packets in the packet path using
the table information, the sequence count is incremented to an odd value and
is incremented to an even after the packet process completion.
The new table value assignment is followed by a write memory barrier so every
CPU should see the latest value. If the packet path has started with the old
table information, the sequence counter will be odd and the iptables
replacement will wait till the sequence count is even prior to freeing the
old table info.
However, this assumes that the new table information assignment and the memory
barrier is actually executed prior to the counter check in the replacement
thread. If CPU decides to execute the assignment later as there is no user of
the table information prior to the sequence check, the packet path in another
CPU may use the old table information. The replacement thread would then free
the table information under it leading to a use after free in the packet
processing context-
Unable to handle kernel NULL pointer dereference at virtual
address 000000000000008e
pc : ip6t_do_table+0x5d0/0x89c
lr : ip6t_do_table+0x5b8/0x89c
ip6t_do_table+0x5d0/0x89c
ip6table_filter_hook+0x24/0x30
nf_hook_slow+0x84/0x120
ip6_input+0x74/0xe0
ip6_rcv_finish+0x7c/0x128
ipv6_rcv+0xac/0xe4
__netif_receive_skb+0x84/0x17c
process_backlog+0x15c/0x1b8
napi_poll+0x88/0x284
net_rx_action+0xbc/0x23c
__do_softirq+0x20c/0x48c
This could be fixed by forcing instruction order after the new table
information assignment or by switching to RCU for the synchronization.
Fixes: 80055dab5d ("netfilter: x_tables: make xt_replace_table wait until old rules are not used anymore")
Reported-by: Sean Tranchetti <stranche@codeaurora.org>
Reported-by: kernel test robot <lkp@intel.com>
Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Steffen Klassert says:
====================
pull request (net): ipsec 2020-12-07
1) Sysbot reported fixes for the new 64/32 bit compat layer.
From Dmitry Safonov.
2) Fix a memory leak in xfrm_user_policy that was introduced
by adding the 64/32 bit compat layer. From Yu Kuai.
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
net: xfrm: fix memory leak in xfrm_user_policy()
xfrm/compat: Don't allocate memory with __GFP_ZERO
xfrm/compat: memset(0) 64-bit padding at right place
xfrm/compat: Translate by copying XFRMA_UNSPEC attribute
====================
Link: https://lore.kernel.org/r/20201207093937.2874932-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When do cat /proc/net/netstat, the output isn't append with a new line, it looks like this:
[root@localhost ~]# cat /proc/net/netstat
...
MPTcpExt: 0 0 0 0 0 0 0 0 0 0 0 0 0[root@localhost ~]#
This is because in mptcp_seq_show(), if mptcp isn't in use, net->mib.mptcp_statistics is NULL,
so it just puts all 0 after "MPTcpExt:", and return, forgot the '\n'.
After this patch:
[root@localhost ~]# cat /proc/net/netstat
...
MPTcpExt: 0 0 0 0 0 0 0 0 0 0 0 0 0
[root@localhost ~]#
Fixes: fc518953bc ("mptcp: add and use MIB counter infrastructure")
Signed-off-by: Jianguo Wu <wujianguo@chinatelecom.cn>
Acked-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/142e2fd9-58d9-bb13-fb75-951cccc2331e@163.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When enabling multicast snooping, bridge module deadlocks on multicast_lock
if 1) IPv6 is enabled, and 2) there is an existing querier on the same L2
network.
The deadlock was caused by the following sequence: While holding the lock,
br_multicast_open calls br_multicast_join_snoopers, which eventually causes
IP stack to (attempt to) send out a Listener Report (in igmp6_join_group).
Since the destination Ethernet address is a multicast address, br_dev_xmit
feeds the packet back to the bridge via br_multicast_rcv, which in turn
calls br_multicast_add_group, which then deadlocks on multicast_lock.
The fix is to move the call br_multicast_join_snoopers outside of the
critical section. This works since br_multicast_join_snoopers only deals
with IP and does not modify any multicast data structures of the bridge,
so there's no need to hold the lock.
Steps to reproduce:
1. sysctl net.ipv6.conf.all.force_mld_version=1
2. have another querier
3. ip link set dev bridge type bridge mcast_snooping 0 && \
ip link set dev bridge type bridge mcast_snooping 1 < deadlock >
A typical call trace looks like the following:
[ 936.251495] _raw_spin_lock+0x5c/0x68
[ 936.255221] br_multicast_add_group+0x40/0x170 [bridge]
[ 936.260491] br_multicast_rcv+0x7ac/0xe30 [bridge]
[ 936.265322] br_dev_xmit+0x140/0x368 [bridge]
[ 936.269689] dev_hard_start_xmit+0x94/0x158
[ 936.273876] __dev_queue_xmit+0x5ac/0x7f8
[ 936.277890] dev_queue_xmit+0x10/0x18
[ 936.281563] neigh_resolve_output+0xec/0x198
[ 936.285845] ip6_finish_output2+0x240/0x710
[ 936.290039] __ip6_finish_output+0x130/0x170
[ 936.294318] ip6_output+0x6c/0x1c8
[ 936.297731] NF_HOOK.constprop.0+0xd8/0xe8
[ 936.301834] igmp6_send+0x358/0x558
[ 936.305326] igmp6_join_group.part.0+0x30/0xf0
[ 936.309774] igmp6_group_added+0xfc/0x110
[ 936.313787] __ipv6_dev_mc_inc+0x1a4/0x290
[ 936.317885] ipv6_dev_mc_inc+0x10/0x18
[ 936.321677] br_multicast_open+0xbc/0x110 [bridge]
[ 936.326506] br_multicast_toggle+0xec/0x140 [bridge]
Fixes: 4effd28c12 ("bridge: join all-snoopers multicast address")
Signed-off-by: Joseph Huang <Joseph.Huang@garmin.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Link: https://lore.kernel.org/r/20201204235628.50653-1-Joseph.Huang@garmin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
migrate_disable() is just a wrapper for preempt_disable() in
non-RT kernel. It is safe to replace it, and RT kernel will
benefit.
Note that it is introduced since Feb 2020.
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201205075946.497763-2-xiyou.wangcong@gmail.com
The per-cpu bpf_redirect_info is shared among all skb_do_redirect()
and BPF redirect helpers. Callers on RX path are all in BH context,
disabling preemption is not sufficient to prevent BH interruption.
In production, we observed strange packet drops because of the race
condition between LWT xmit and TC ingress, and we verified this issue
is fixed after we disable BH.
Although this bug was technically introduced from the beginning, that
is commit 3a0af8fd61 ("bpf: BPF for lightweight tunnel infrastructure"),
at that time call_rcu() had to be call_rcu_bh() to match the RCU context.
So this patch may not work well before RCU flavor consolidation has been
completed around v5.0.
Update the comments above the code too, as call_rcu() is now BH friendly.
Signed-off-by: Dongdong Wang <wangdongdong.6@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Link: https://lore.kernel.org/bpf/20201205075946.497763-1-xiyou.wangcong@gmail.com
Increment the mgmt revision due to the recently added new commands.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
When suspending, mark SUSPEND_SCAN_ENABLE and SUSPEND_SCAN_DISABLE tasks
correctly when either classic or le scanning is modified.
Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Howard Chung <howardchung@google.com>
Reviewed-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
For advertising, we wish to know the LE tx power capabilities of the
controller in userspace, so this patch edits the Security Info MGMT
command to be more generic, such that other various controller
capabilities can be included in the EIR data. This change also includes
the LE min and max tx power into this newly-named command.
The change was tested by manually verifying that the MGMT command
returns the tx power range as expected in userspace.
Reviewed-by: Sonny Sasaka <sonnysasaka@chromium.org>
Signed-off-by: Daniel Winkler <danielwinkler@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Queries tx power via HCI_LE_Read_Transmit_Power command when the hci
device is initialized, and stores resulting min/max LE power in hdev
struct. If command isn't available (< BT5 support), min/max values
both default to HCI_TX_POWER_INVALID.
This patch is manually verified by ensuring BT5 devices correctly query
and receive controller tx power range.
Reviewed-by: Sonny Sasaka <sonnysasaka@chromium.org>
Signed-off-by: Daniel Winkler <danielwinkler@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
This patch takes the min/max intervals and tx power optionally provided
in mgmt interface, stores them in the advertisement struct, and uses
them when configuring the hci requests. While tx power is not used if
extended advertising is unavailable, software rotation will use the min
and max advertising intervals specified by the client.
This change is validated manually by ensuring the min/max intervals are
propagated to the controller on both hatch (extended advertising) and
kukui (no extended advertising) chromebooks, and that tx power is
propagated correctly on hatch. These tests are performed with multiple
advertisements simultaneously.
Reviewed-by: Sonny Sasaka <sonnysasaka@chromium.org>
Signed-off-by: Daniel Winkler <danielwinkler@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
This patch adds support for the new advertising add interface, with the
first command setting advertising parameters and the second to set
advertising data. The set parameters command allows the caller to leave
some fields "unset", with a params bitfield defining which params were
purposefully set. Unset parameters will be given defaults when calling
hci_add_adv_instance. The data passed to the param mgmt command is
allowed to be flexible, so in the future if bluetoothd passes a larger
structure with new params, the mgmt command will ignore the unknown
members at the end.
This change has been validated on both hatch (extended advertising) and
kukui (no extended advertising) chromebooks running bluetoothd that
support this new interface. I ran the following manual tests:
- Set several (3) advertisements using modified test_advertisement.py
- For each, validate correct data and parameters in btmon trace
- Verified both for software rotation and extended adv
Automatic test suite also run, testing many (25) scenarios of single and
multi-advertising for data/parameter correctness.
Reviewed-by: Sonny Sasaka <sonnysasaka@chromium.org>
Signed-off-by: Daniel Winkler <danielwinkler@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
We wish to handle advertising data separately from advertising
parameters in our new MGMT requests. This change adds a helper that
allows the advertising data and scan response to be updated for an
existing advertising instance.
Reviewed-by: Sonny Sasaka <sonnysasaka@chromium.org>
Signed-off-by: Daniel Winkler <danielwinkler@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
This patch add a configurable parameter to switch off the interleave
scan feature.
Signed-off-by: Howard Chung <howardchung@google.com>
Reviewed-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Refactor read default system configuration function so that it's capable
of returning different types than u16
Signed-off-by: Howard Chung <howardchung@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
This patch adds code to handle the active scan during interleave
scan. The interleave scan will be canceled when users start active scan,
and it will be restarted after active scan stopped.
Signed-off-by: Howard Chung <howardchung@google.com>
Reviewed-by: Alain Michaud <alainm@chromium.org>
Reviewed-by: Manish Mandlik <mmandlik@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
This patch adds code to handle the system suspension during interleave
scan. The interleave scan will be canceled when the system is going to
sleep, and will be restarted after waking up.
Signed-off-by: Howard Chung <howardchung@google.com>
Reviewed-by: Alain Michaud <alainm@chromium.org>
Reviewed-by: Manish Mandlik <mmandlik@chromium.org>
Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Reviewed-by: Miao-chen Chou <mcchou@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
This patch implements the interleaving between allowlist scan and
no-filter scan. It'll be used to save power when at least one monitor is
registered and at least one pending connection or one device to be
scanned for.
The durations of the allowlist scan and the no-filter scan are
controlled by MGMT command: Set Default System Configuration. The
default values are set randomly for now.
Signed-off-by: Howard Chung <howardchung@google.com>
Reviewed-by: Alain Michaud <alainm@chromium.org>
Reviewed-by: Manish Mandlik <mmandlik@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
During controller initialization, an LE Set RPA Timeout command is sent
to the controller if supported. However, the value checked to determine
if the command is supported is incorrect. Page 1921 of the Bluetooth
Core Spec v5.2 shows that bit 2 of octet 35 of the Supported_Commands
field corresponds to the LE Set RPA Timeout command, but currently
bit 6 of octet 35 is checked. This patch checks the correct value
instead.
This issue led to the error seen in the following btmon output during
initialization of an adapter (rtl8761b) and prevented initialization
from completing.
< HCI Command: LE Set Resolvable Private Address Timeout (0x08|0x002e) plen 2
Timeout: 900 seconds
> HCI Event: Command Complete (0x0e) plen 4
LE Set Resolvable Private Address Timeout (0x08|0x002e) ncmd 2
Status: Unsupported Remote Feature / Unsupported LMP Feature (0x1a)
= Close Index: 00:E0:4C:6B:E5:03
The error did not appear when running with this patch.
Signed-off-by: Edward Vear <edwardvear@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
This renames get_adv_instance_scan_rsp to adv_instance_is_scannable and
make it return a bool since it was not actually properly return the size
of the scan response as one could expect.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Current code is actually failing on the following tests of mgmt-tester
because get_adv_instance_scan_rsp_len did not account for flags that
cause scan response data to be included resulting in non-scannable
instance when in fact it should be scannable.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
This test case is meant to verify that multiple
unknown options is included in the response.
BLUETOOTH CORE SPECIFICATION Version 5.2 | Vol 3, Part A
page 1057
'On an unknown option failure (Result=0x0003),
the option(s) that contain anoption type field that is not
understood by the recipient of the L2CAP_CONFIGURATION_REQ
packet shall be included in the L2CAP_CONFIGURATION_RSP
packet unless they are hints.'
Before this patch:
> ACL Data RX: Handle 11 flags 0x02 dlen 24
L2CAP: Configure Request (0x04) ident 18 len 16
Destination CID: 64
Flags: 0x0000
Option: Unknown (0x10) [mandatory]
10 00 11 02 11 00 12 02 12 00
< ACL Data TX: Handle 11 flags 0x00 dlen 17
L2CAP: Configure Response (0x05) ident 18 len 9
Source CID: 64
Flags: 0x0000
Result: Failure - unknown options (0x0003)
Option: Unknown (0x10) [mandatory]
12
After this patch:
> ACL Data RX: Handle 11 flags 0x02 dlen 24
L2CAP: Configure Request (0x04) ident 5 len 16
Destination CID: 64
Flags: 0x0000
Option: Unknown (0x10) [mandatory]
10 00 11 02 11 00 12 02 12 00
< ACL Data TX: Handle 11 flags 0x00 dlen 23
L2CAP: Configure Response (0x05) ident 5 len 15
Source CID: 64
Flags: 0x0000
Result: Failure - unknown options (0x0003)
Option: Unknown (0x10) [mandatory]
10 11 01 11 12 01 12
Signed-off-by: Jimmy Wahlberg <jimmywa@spotify.com>
Reviewed-by: Luiz Augusto Von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Guillaume noticed that: for segments udp_queue_rcv_one_skb() returns the
proto, and it should pass "ret" unmodified to ip_protocol_deliver_rcu().
Otherwize, with a negtive value passed, it will underflow inet_protos.
This can be reproduced with IPIP FOU:
# ip fou add port 5555 ipproto 4
# ethtool -K eth1 rx-gro-list on
Fixes: cf329aa42b ("udp: cope with UDP GRO packet misdirection")
Reported-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- bump version strings, by Simon Wunderlich
- update include for min/max helpers, by Sven Eckelmann
- add infrastructure and netlink functions for routing algo selection,
by Sven Eckelmann (2 patches)
- drop deprecated debugfs and sysfs support and obsoleted
functionality, by Sven Eckelmann (3 patches)
- drop unused include in fragmentation.c, by Simon Wunderlich
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAl/KWIkWHHN3QHNpbW9u
d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoTrTEACUImdOCWv4+NnEQfChQv6Y3i18
gJABXoOkWLfFpGBUlw/uYzFKpMEWZ0orHig9gucC+rmjNc8veWwAOugJoTPTKQJZ
/4yndhM0x39vWex03rdDmyqzCEh1V1Q9VcdEuD6XbJDaK5F4jDu3NQVneOijIkN+
5PzhlvtUlfe8csykOCOoC9Y5wy82fEhcEvuSq+Z6dU3Cb3EGHtEUtZ4orDkpnnml
7XEcn5C5+OFGlz/ikiszKumTtNK+dmGluOxoyfAzEjQHK7PoTorcXFS2YUoSWeqQ
gmYZ56RBqEHjo4eqcaEgcqq5v8cTPCEMCB8UQjAffxrhloRKHRhQOysG1+OnzGA8
IQ2ARHLQCVPVraXF2ixE0D3BvjKmtMmcvZOCXwhCHDajn9jFKAh0+hnInDyv6Fp1
7eUfpHACL9EQDxKWXeQg37X2mk3hHJ+4zgZOYidahVeKbiiexe2heaHTYAbr9rIf
8hvtlgMg4AnwL3IxadrKwsbJ5t7TEPLTInf47hPvpRg7SmthDTcgso4VDwmWgB8W
Tlug8/NoXXDCmDhXUpvyi9+idHe0J8xvHl/2xGC7aSsPAbuhqOuefMKr36YhXJTY
vBA5Ih5ppylJ8Dzwa0TbonvQbOAinA8YTa6izKxY4e+xTPB2jz/WT2vciEZv2+ig
vNIPFrLZ7OFMohCDfQ==
=tNKF
-----END PGP SIGNATURE-----
Merge tag 'batadv-next-pullrequest-20201204' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
This cleanup patchset includes the following patches:
- bump version strings, by Simon Wunderlich
- update include for min/max helpers, by Sven Eckelmann
- add infrastructure and netlink functions for routing algo selection,
by Sven Eckelmann (2 patches)
- drop deprecated debugfs and sysfs support and obsoleted
functionality, by Sven Eckelmann (3 patches)
- drop unused include in fragmentation.c, by Simon Wunderlich
* tag 'batadv-next-pullrequest-20201204' of git://git.open-mesh.org/linux-merge:
batman-adv: Drop unused soft-interface.h include in fragmentation.c
batman-adv: Drop legacy code for auto deleting mesh interfaces
batman-adv: Drop deprecated debugfs support
batman-adv: Drop deprecated sysfs support
batman-adv: Allow selection of routing algorithm over rtnetlink
batman-adv: Prepare infrastructure for newlink settings
batman-adv: Add new include for min/max helpers
batman-adv: Start new development cycle
====================
Link: https://lore.kernel.org/r/20201204154631.21063-1-sw@simonwunderlich.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
implement the NCI 2.x initial sequence to support NCI 2.x NFCC.
Since NCI 2.0, CORE_RESET and CORE_INIT sequence have been changed.
If NFCEE supports NCI 2.x, then NCI 2.x initial sequence will work.
In NCI 1.0, Initial sequence and payloads are as below:
(DH) (NFCC)
| -- CORE_RESET_CMD --> |
| <-- CORE_RESET_RSP -- |
| -- CORE_INIT_CMD --> |
| <-- CORE_INIT_RSP -- |
CORE_RESET_RSP payloads are Status, NCI version, Configuration Status.
CORE_INIT_CMD payloads are empty.
CORE_INIT_RSP payloads are Status, NFCC Features,
Number of Supported RF Interfaces, Supported RF Interface,
Max Logical Connections, Max Routing table Size,
Max Control Packet Payload Size, Max Size for Large Parameters,
Manufacturer ID, Manufacturer Specific Information.
In NCI 2.0, Initial Sequence and Parameters are as below:
(DH) (NFCC)
| -- CORE_RESET_CMD --> |
| <-- CORE_RESET_RSP -- |
| <-- CORE_RESET_NTF -- |
| -- CORE_INIT_CMD --> |
| <-- CORE_INIT_RSP -- |
CORE_RESET_RSP payloads are Status.
CORE_RESET_NTF payloads are Reset Trigger,
Configuration Status, NCI Version, Manufacturer ID,
Manufacturer Specific Information Length,
Manufacturer Specific Information.
CORE_INIT_CMD payloads are Feature1, Feature2.
CORE_INIT_RSP payloads are Status, NFCC Features,
Max Logical Connections, Max Routing Table Size,
Max Control Packet Payload Size,
Max Data Packet Payload Size of the Static HCI Connection,
Number of Credits of the Static HCI Connection,
Max NFC-V RF Frame Size, Number of Supported RF Interfaces,
Supported RF Interfaces.
Signed-off-by: Bongsu Jeon <bongsu.jeon@samsung.com>
Link: https://lore.kernel.org/r/20201202223147.3472-1-bongsu.jeon@samsung.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We add the support to remove a specific node down with 128bit
node identifier, as an alternative to legacy 32-bit node address.
example:
$tipc peer remove identiy <1001002|16777777>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Link: https://lore.kernel.org/r/20201203035045.4564-1-hoang.h.le@dektech.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix to return a negative error code from the error handling
case instead of 0, as done elsewhere in this function.
Changing 'return start' to 'return action_start' can fix this bug.
Fixes: 69929d4c49 ("net: openvswitch: fix TTL decrement action netlink message format")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Reviewed-by: Eelco Chaudron <echaudro@redhat.com>
Link: https://lore.kernel.org/r/20201204114314.1596-1-wanghai38@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix to return a negative error code from the error handling
case instead of 0, as done elsewhere in this function.
Fixes: f8ed289fab ("bridge: vlan: use br_vlan_(get|put)_master to deal with refcounts")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Link: https://lore.kernel.org/r/1607071737-33875-1-git-send-email-zhangchangzhong@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix to return a negative error code from the error handling
case instead of 0, as done elsewhere in this function.
Fixes: d15662682d ("ipv4: Allow ipv6 gateway with ipv4 routes")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/1607071695-33740-1-git-send-email-zhangchangzhong@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
with the following tdc testcase:
83be: (qdisc, fq_pie) Create FQ-PIE with invalid number of flows
as fq_pie_init() fails, fq_pie_destroy() is called to clean up. Since the
timer is not yet initialized, it's possible to observe a splat like this:
INFO: trying to register non-static key.
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
CPU: 0 PID: 975 Comm: tc Not tainted 5.10.0-rc4+ #298
Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014
Call Trace:
dump_stack+0x99/0xcb
register_lock_class+0x12dd/0x1750
__lock_acquire+0xfe/0x3970
lock_acquire+0x1c8/0x7f0
del_timer_sync+0x49/0xd0
fq_pie_destroy+0x3f/0x80 [sch_fq_pie]
qdisc_create+0x916/0x1160
tc_modify_qdisc+0x3c4/0x1630
rtnetlink_rcv_msg+0x346/0x8e0
netlink_unicast+0x439/0x630
netlink_sendmsg+0x719/0xbf0
sock_sendmsg+0xe2/0x110
____sys_sendmsg+0x5ba/0x890
___sys_sendmsg+0xe9/0x160
__sys_sendmsg+0xd3/0x170
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
[...]
ODEBUG: assert_init not available (active state 0) object type: timer_list hint: 0x0
WARNING: CPU: 0 PID: 975 at lib/debugobjects.c:508 debug_print_object+0x162/0x210
[...]
Call Trace:
debug_object_assert_init+0x268/0x380
try_to_del_timer_sync+0x6a/0x100
del_timer_sync+0x9e/0xd0
fq_pie_destroy+0x3f/0x80 [sch_fq_pie]
qdisc_create+0x916/0x1160
tc_modify_qdisc+0x3c4/0x1630
rtnetlink_rcv_msg+0x346/0x8e0
netlink_rcv_skb+0x120/0x380
netlink_unicast+0x439/0x630
netlink_sendmsg+0x719/0xbf0
sock_sendmsg+0xe2/0x110
____sys_sendmsg+0x5ba/0x890
___sys_sendmsg+0xe9/0x160
__sys_sendmsg+0xd3/0x170
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
fix it moving timer_setup() before any failure, like it was done on 'red'
with former commit 608b4adab1 ("net_sched: initialize timer earlier in
red_init()").
Fixes: ec97ecf1eb ("net: sched: add Flow Queue PIE packet scheduler")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Link: https://lore.kernel.org/r/2e78e01c504c633ebdff18d041833cf2e079a3a4.1607020450.git.dcaratti@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Zapping pages is required only if we are calling vm_insert_page into a
region where pages had previously been mapped. Receive zerocopy allows
reusing such regions, and hitherto called zap_page_range() before
calling vm_insert_page() in that range.
zap_page_range() can also be triggered from userspace with
madvise(MADV_DONTNEED). If userspace is configured to call this before
reusing a segment, or if there was nothing mapped at this virtual
address to begin with, we can avoid calling zap_page_range() under the
socket lock. That said, if userspace does not do that, then we are
still responsible for calling zap_page_range().
This patch adds a flag that the user can use to hint to the kernel
that a zap is not required. If the flag is not set, or if an older
user application does not have a flags field at all, then the kernel
calls zap_page_range as before. Also, if the flag is set but a zap is
still required, the kernel performs that zap as necessary. Thus
incorrectly indicating that a zap can be avoided does not change the
correctness of operation. It also increases the batchsize for
vm_insert_pages and prefetches the page struct for the batch since
we're about to bump the refcount.
An alternative mechanism could be to not have a flag, assume by
default a zap is not needed, and fall back to zapping if needed.
However, this would harm performance for older applications for which
a zap is necessary, and thus we implement it with an explicit flag
so newer applications can opt in.
When using RPC-style traffic with medium sized (tens of KB) RPCs, this
change yields an efficency improvement of about 30% for QPS/CPU usage.
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Set zerocopy hint, event when falling back to copy, so that the
pending data can be efficiently received using zerocopy when
possible.
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Sometimes, we may call tcp receive zerocopy when inq is 0,
or inq < PAGE_SIZE, or inq is generally small enough that
it is cheaper to copy rather than remap pages.
In these cases, we may want to either return early (inq=0) or
attempt to use the provided copy buffer to simply copy
the received data.
This allows us to save both system call overhead and
the latency of acquiring mmap_sem in read mode for cases where
it would be useless to do so.
This patchset enables this behaviour by:
1. Returning quickly if inq is 0.
2. Attempting to perform a regular copy if a hybrid copybuffer is
provided and it is large enough to absorb all available bytes.
3. Return quickly if no such buffer was provided and there are less
than PAGE_SIZE bytes available.
For small RPC ping-pong workloads, normally we would have
1 getsockopt(), 1 recvmsg() and 1 sendmsg() call per RPC. With this
change, we remove the recvmsg() call entirely, reducing the syscall
overhead by about 33%. In testing with small (hundreds of bytes)
RPC traffic, this yields a syscall reduction of about 33% and
an efficiency gain of about 3-5% when defined as QPS/CPU Util.
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Sometimes, we may call tcp receive zerocopy when inq is 0,
or inq < PAGE_SIZE, in which case we cannot remap pages. In this case,
simply return the appropriate hint for regular copying without taking
mmap_sem.
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Refactor frag-is-remappable test for tcp receive zerocopy. This is
part of a patch set that introduces short-circuited hybrid copies
for small receive operations, which results in roughly 33% fewer
syscalls for small RPC scenarios.
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Refactor skb frag fast-forwarding for tcp receive zerocopy. This is
part of a patch set that introduces short-circuited hybrid copies
for small receive operations, which results in roughly 33% fewer
syscalls for small RPC scenarios.
skb_advance_to_frag(), given a skb and an offset into the skb,
iterates from the first frag for the skb until we're at the frag
specified by the offset. Assuming the offset provided refers to how
many bytes in the skb are already read, the returned frag points to
the next frag we may read from, while offset_frag is set to the number
of bytes from this frag that we have already read.
If frag is not null and offset_frag is equal to 0, then we may be able
to map this frag's page into the process address space with
vm_insert_page(). However, if offset_frag is not equal to 0, then we
cannot do so.
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Refactor tcp_recvmsg() by splitting it into locked and unlocked
portions. Callers already holding the socket lock and not using
ERRQUEUE/cmsg/busy polling can simply call tcp_recvmsg_locked().
This is in preparation for a short-circuit copy performed by
TCP receive zerocopy for small (< PAGE_SIZE, or otherwise requested
by the user) reads.
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When TCP receive zerocopy does not successfully map the entire
requested space, it outputs a 'hint' that the caller should recvmsg().
Augment zerocopy to accept a user buffer that it tries to copy this
hint into - if it is possible to copy the entire hint, it will do so.
This elides a recvmsg() call for received traffic that isn't exactly
page-aligned in size.
This was tested with RPC-style traffic of arbitrary sizes. Normally,
each received message required at least one getsockopt() call, and one
recvmsg() call for the remaining unaligned data.
With this change, almost all of the recvmsg() calls are eliminated,
leading to a savings of about 25%-50% in number of system calls
for RPC-style workloads.
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Iterators are currently used to expose kernel information to userspace
over fast procfs-like files but iterators could also be used to
manipulate local storage. For example, the task_file iterator could be
used to initialize a socket local storage with associations between
processes and sockets or to selectively delete local storage values.
Signed-off-by: Florent Revest <revest@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: KP Singh <kpsingh@google.com>
Link: https://lore.kernel.org/bpf/20201204113609.1850150-3-revest@google.com
Currently, the sock_from_file prototype takes an "err" pointer that is
either not set or set to -ENOTSOCK IFF the returned socket is NULL. This
makes the error redundant and it is ignored by a few callers.
This patch simplifies the API by letting callers deduce the error based
on whether the returned socket is NULL or not.
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Florent Revest <revest@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: KP Singh <kpsingh@google.com>
Link: https://lore.kernel.org/bpf/20201204113609.1850150-1-revest@google.com
SRv6 End.DT6 is defined in the SRv6 Network Programming [1].
The Linux kernel already offers an implementation of the SRv6
End.DT6 behavior which permits IPv6 L3 VPNs over SRv6 networks. This
implementation is not particularly suitable in contexts where we need to
deploy IPv6 L3 VPNs among different tenants which share the same network
address schemes. The underlying problem lies in the fact that the
current version of DT6 (called legacy DT6 from now on) needs a complex
configuration to be applied on routers which requires ad-hoc routes and
routing policy rules to ensure the correct isolation of tenants.
Consequently, a new implementation of DT6 has been introduced with the
aim of simplifying the construction of IPv6 L3 VPN services in the
multi-tenant environment using SRv6 networks. To accomplish this task,
we reused the same VRF infrastructure and SRv6 core components already
exploited for implementing the SRv6 End.DT4 behavior.
Currently the two End.DT6 implementations coexist seamlessly and can be
used depending on the context and the user preferences. So, in order to
support both versions of DT6 a new attribute (vrftable) has been
introduced which allows us to differentiate the implementation of the
behavior to be used.
A SRv6 End.DT6 legacy behavior is still instantiated using a command
like the following one:
$ ip -6 route add 2001:db8::1 encap seg6local action End.DT6 table 100 dev eth0
While to instantiate the SRv6 End.DT6 in VRF mode, the command is still
pretty straight forward:
$ ip -6 route add 2001:db8::1 encap seg6local action End.DT6 vrftable 100 dev eth0.
Obviously as in the case of SRv6 End.DT4, the VRF strict_mode parameter
must be set (net.vrf.strict_mode=1) and the VRF associated with table
100 must exist.
Please note that the instances of SRv6 End.DT6 legacy and End.DT6 VRF
mode can coexist in the same system/configuration without problems.
[1] https://tools.ietf.org/html/draft-ietf-spring-srv6-network-programming
Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
SRv6 End.DT4 is defined in the SRv6 Network Programming [1].
The SRv6 End.DT4 is used to implement IPv4 L3VPN use-cases in
multi-tenants environments. It decapsulates the received packets and it
performs IPv4 routing lookup in the routing table of the tenant.
The SRv6 End.DT4 Linux implementation leverages a VRF device in order to
force the routing lookup into the associated routing table.
To make the End.DT4 work properly, it must be guaranteed that the routing
table used for routing lookup operations is bound to one and only one
VRF during the tunnel creation. Such constraint has to be enforced by
enabling the VRF strict_mode sysctl parameter, i.e:
$ sysctl -wq net.vrf.strict_mode=1.
At JANOG44, LINE corporation presented their multi-tenant DC architecture
using SRv6 [2]. In the slides, they reported that the Linux kernel is
missing the support of SRv6 End.DT4 behavior.
The SRv6 End.DT4 behavior can be instantiated using a command similar to
the following:
$ ip route add 2001:db8::1 encap seg6local action End.DT4 vrftable 100 dev eth0
We introduce the "vrftable" extension in iproute2 in a following patch.
[1] https://tools.ietf.org/html/draft-ietf-spring-srv6-network-programming
[2] https://speakerdeck.com/line_developers/line-data-center-networking-with-srv6
Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We introduce two callbacks used for customizing the creation/destruction of
a SRv6 behavior. Such callbacks are defined in the new struct
seg6_local_lwtunnel_ops and hereafter we provide a brief description of
them:
- build_state(...): used for calling the custom constructor of the
behavior during its initialization phase and after all the attributes
have been parsed successfully;
- destroy_state(...): used for calling the custom destructor of the
behavior before it is completely destroyed.
Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Before this patch, each SRv6 behavior specifies a set of required
attributes that must be provided by the userspace application when such
behavior is going to be instantiated. If at least one of the required
attributes is not provided, the creation of the behavior fails.
The SRv6 behavior framework lacks a way to manage optional attributes.
By definition, an optional attribute for a SRv6 behavior consists of an
attribute which may or may not be provided by the userspace. Therefore,
if an optional attribute is missing (and thus not supplied by the user)
the creation of the behavior goes ahead without any issue.
This patch explicitly differentiates the required attributes from the
optional attributes. In particular, each behavior can declare a set of
required attributes and a set of optional ones.
The semantic of the required attributes remains *totally* unaffected by
this patch. The introduction of the optional attributes does NOT impact
on the backward compatibility of the existing SRv6 behaviors.
It is essential to note that if an (optional or required) attribute is
supplied to a SRv6 behavior which does not expect it, the behavior
simply discards such attribute without generating any error or warning.
This operating mode remained unchanged both before and after the
introduction of the optional attributes extension.
The optional attributes are one of the key components used to implement
the SRv6 End.DT6 behavior based on the Virtual Routing and Forwarding
(VRF) framework. The optional attributes make possible the coexistence
of the already existing SRv6 End.DT6 implementation with the new SRv6
End.DT6 VRF-based implementation without breaking any backward
compatibility. Further details on the SRv6 End.DT6 behavior (VRF mode)
are reported in subsequent patches.
From the userspace point of view, the support for optional attributes DO
NOT require any changes to the userspace applications, i.e: iproute2
unless new attributes (required or optional) are needed.
Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Depending on the attribute (i.e.: SEG6_LOCAL_SRH, SEG6_LOCAL_TABLE, etc),
the parse() callback performs some validity checks on the provided input
and updates the tunnel state (slwt) with the result of the parsing
operation. However, an attribute may also need to reserve some additional
resources (i.e.: memory or setting up an eBPF program) in the parse()
callback to complete the parsing operation.
The parse() callbacks are invoked by the parse_nla_action() for each
attribute belonging to a specific behavior. Given a behavior with N
attributes, if the parsing of the i-th attribute fails, the
parse_nla_action() returns immediately with an error. Nonetheless, the
resources acquired during the parsing of the i-1 attributes are not freed
by the parse_nla_action().
Attributes which acquire resources must release them *in an explicit way*
in both the seg6_local_{build/destroy}_state(). However, adding a new
attribute of this type requires changes to
seg6_local_{build/destroy}_state() to release the resources correctly.
The seg6local infrastructure still lacks a simple and structured way to
release the resources acquired in the parse() operations.
We introduced a new callback in the struct seg6_action_param named
destroy(). This callback releases any resource which may have been acquired
in the parse() counterpart. Each attribute may or may not implement the
destroy() callback depending on whether it needs to free some acquired
resources.
The destroy() callback comes with several of advantages:
1) we can have many attributes as we want for a given behavior with no
need to explicitly free the taken resources;
2) As in case of the seg6_local_build_state(), the
seg6_local_destroy_state() does not need to handle the release of
resources directly. Indeed, it calls the destroy_attrs() function which
is in charge of calling the destroy() callback for every set attribute.
We do not need to patch seg6_local_{build/destroy}_state() anymore as
we add new attributes;
3) the code is more readable and better structured. Indeed, all the
information needed to handle a given attribute are contained in only
one place;
4) it facilitates the integration with new features introduced in further
patches.
Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
First set of patches for v5.11. rtw88 getting improvements to work
better with Bluetooth and other driver also getting some new features.
mhi-ath11k-immutable branch was pulled from mhi tree to avoid
conflicts with mhi tree.
Major changes:
rtw88
* major bluetooth co-existance improvements
wilc1000
* Wi-Fi Multimedia (WMM) support
ath11k
* Fast Initial Link Setup (FILS) discovery and unsolicited broadcast
probe response support
* qcom,ath11k-calibration-variant Device Tree setting
* cold boot calibration support
* new DFS region: JP
wnc36xx
* enable connection monitoring and keepalive in firmware
ath10k
* firmware IRAM recovery feature
mhi
* merge mhi-ath11k-immutable branch to make MHI API change go smoothly
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJfyTQyAAoJEG4XJFUm622bCdcIAIyVnqdW7pnoDmWIyQmAEnD9
vGARkzghPHXnufpOzohyDdxT12X9klhrxSVIgzEgH1/pl3i1PpnF6KXyGFCC44Lw
wrLXhQygPzmIW1IZtJJE3G72WExXoRjWx6LD1I7C7oEIduqFixXADmK2tKzFp795
Jxum+sOeT6+Dk1OvO/fIroBHX73mRE9zAuiTIMpt2G1j8uXs9QVfcTbTrUshLASN
0sX9J6JutltBuM4G7+bFpVzKnLnlQ7ebUaF6nvTCQsgHWZwkS7yAubSWX9sFohbR
UXgQHNE83s/esOg7nBxAfqTKP8mbxsobmxZtxE5GR5vFY5FJDxqP9Zc2KzPp39w=
=CbX/
-----END PGP SIGNATURE-----
Merge tag 'wireless-drivers-next-2020-12-03' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next
Kalle Valo says:
====================
wireless-drivers-next patches for v5.11
First set of patches for v5.11. rtw88 getting improvements to work
better with Bluetooth and other driver also getting some new features.
mhi-ath11k-immutable branch was pulled from mhi tree to avoid
conflicts with mhi tree.
Major changes:
rtw88
* major bluetooth co-existance improvements
wilc1000
* Wi-Fi Multimedia (WMM) support
ath11k
* Fast Initial Link Setup (FILS) discovery and unsolicited broadcast
probe response support
* qcom,ath11k-calibration-variant Device Tree setting
* cold boot calibration support
* new DFS region: JP
wnc36xx
* enable connection monitoring and keepalive in firmware
ath10k
* firmware IRAM recovery feature
mhi
* merge mhi-ath11k-immutable branch to make MHI API change go smoothly
* tag 'wireless-drivers-next-2020-12-03' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next: (180 commits)
wl1251: remove trailing semicolon in macro definition
airo: remove trailing semicolon in macro definition
wilc1000: added queue support for WMM
wilc1000: call complete() for failure in wilc_wlan_txq_add_cfg_pkt()
wilc1000: free resource in wilc_wlan_txq_add_mgmt_pkt() for failure path
wilc1000: free resource in wilc_wlan_txq_add_net_pkt() for failure path
wilc1000: added 'ndo_set_mac_address' callback support
brcmfmac: expose firmware config files through modinfo
wlcore: Switch to using the new API kobj_to_dev()
rtw88: coex: add feature to enhance HID coexistence performance
rtw88: coex: upgrade coexistence A2DP mechanism
rtw88: coex: add action for coexistence in hardware initial
rtw88: coex: add function to avoid cck lock
rtw88: coex: change the coexistence mechanism for WLAN connected
rtw88: coex: change the coexistence mechanism for HID
rtw88: coex: update AFH information while in free-run mode
rtw88: coex: update the mechanism for A2DP + PAN
rtw88: coex: add debug message
rtw88: coex: run coexistence when WLAN entering/leaving LPS
Revert "rtl8xxxu: Add Buffalo WI-U3-866D to list of supported devices"
...
====================
Link: https://lore.kernel.org/r/20201203185732.9CFA5C433ED@smtp.codeaurora.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If force_zc is set, we should exit out with an error, not fall back to
copy mode.
Fixes: 921b68692a ("xsk: Enable sharing of dma mappings")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/1607077277-41995-1-git-send-email-zhangchangzhong@huawei.com
Alexei Starovoitov says:
====================
pull-request: bpf-next 2020-12-03
The main changes are:
1) Support BTF in kernel modules, from Andrii.
2) Introduce preferred busy-polling, from Björn.
3) bpf_ima_inode_hash() and bpf_bprm_opts_set() helpers, from KP Singh.
4) Memcg-based memory accounting for bpf objects, from Roman.
5) Allow bpf_{s,g}etsockopt from cgroup bind{4,6} hooks, from Stanislav.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (118 commits)
selftests/bpf: Fix invalid use of strncat in test_sockmap
libbpf: Use memcpy instead of strncpy to please GCC
selftests/bpf: Add fentry/fexit/fmod_ret selftest for kernel module
selftests/bpf: Add tp_btf CO-RE reloc test for modules
libbpf: Support attachment of BPF tracing programs to kernel modules
libbpf: Factor out low-level BPF program loading helper
bpf: Allow to specify kernel module BTFs when attaching BPF programs
bpf: Remove hard-coded btf_vmlinux assumption from BPF verifier
selftests/bpf: Add CO-RE relocs selftest relying on kernel module BTF
selftests/bpf: Add support for marking sub-tests as skipped
selftests/bpf: Add bpf_testmod kernel module for testing
libbpf: Add kernel module BTF support for CO-RE relocations
libbpf: Refactor CO-RE relocs to not assume a single BTF object
libbpf: Add internal helper to load BTF data by FD
bpf: Keep module's btf_data_size intact after load
bpf: Fix bpf_put_raw_tracepoint()'s use of __module_address()
selftests/bpf: Add Userspace tests for TCP_WINDOW_CLAMP
bpf: Adds support for setting window clamp
samples/bpf: Fix spelling mistake "recieving" -> "receiving"
bpf: Fix cold build of test_progs-no_alu32
...
====================
Link: https://lore.kernel.org/r/20201204021936.85653-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
During restarrt, mac80211 is supposed to reconfigure the driver.
When there's a monitor interface, the interface is added and the
channel context for it was created, but not assigned to it as it
was not considered running during the restart.
Fix this by setting SDATA_STATE_RUNNING while adding monitor
interfaces.
Signed-off-by: Borwankar, Antara <antara.borwankar@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20201129172929.e1df99693a4c.I494579f28018c2d0b9d4083a664cf872c28405ae@changeid
[reword commit log]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
ieee80211_chandef_he_6ghz_oper() needs to return true if it
determined a value 6 GHz chandef, fix that.
Fixes: 1d00ce807e ("mac80211: support S1G association")
Signed-off-by: Wen Gong <wgong@codeaurora.org>
Link: https://lore.kernel.org/r/1606121152-3452-1-git-send-email-wgong@codeaurora.org
[rewrite commit message]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The commit 992b03b88e ("batman-adv: Don't always reallocate the
fragmentation skb head") removed the last user of functions from
soft-interface.h.
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The only way to automatically drop batadv mesh interfaces when all soft
interfaces were removed was dropped with the sysfs support. It is no longer
needed to have them handled by kernel anymore.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The debugfs support in batman-adv was marked as deprecated by the commit
00caf6a2b3 ("batman-adv: Mark debugfs functionality as deprecated") and
scheduled for removal in 2021.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The sysfs in batman-adv support was marked as deprecated by the commit
42cdd52148 ("batman-adv: ABI: Mark sysfs files as deprecated") and
scheduled for removal in 2021.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
A batadv net_device is associated to a B.A.T.M.A.N. routing algorithm. This
algorithm has to be selected before the interface is initialized and cannot
be changed after that. The only way to select this algorithm was a module
parameter which specifies the default algorithm used during the creation of
the net_device.
This module parameter is writeable over
/sys/module/batman_adv/parameters/routing_algo and thus allows switching of
the routing algorithm:
1. change routing_algo parameter
2. create new batadv net_device
But this is not race free because another process can be scheduled between
1 + 2 and in that time frame change the routing_algo parameter again.
It is much cleaner to directly provide this information inside the
rtnetlink's RTM_NEWLINK message. The two processes would be (in regards of
the creation parameter of their batadv interfaces) be isolated. This also
eases the integration of batadv devices inside tools like network-manager
or systemd-networkd which are not expecting to operate on /sys before a new
net_device is created.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The batadv generic netlink family can be used to retrieve the current state
and set various configuration settings. But there are also settings which
must be set before the actual interface is created.
The rtnetlink already uses IFLA_INFO_DATA to allow net_device families to
transfer such configurations. The minimal required functionality for this
is now available for the batadv rtnl_link_ops. Also a new IFLA class of
attributes will be attached to it because rtnetlink only allows 51
different attributes but batadv_nl_attrs already contains 62 attributes.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The commit b296a6d533 ("kernel.h: split out min()/max() et al. helpers")
moved the min/max helper functionality from kernel.h to minmax.h. Adjust
the kernel code accordingly to avoid fragile indirect includes.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Remove a permeating assumption thoughout BPF verifier of vmlinux BTF. Instead,
wherever BTF type IDs are involved, also track the instance of struct btf that
goes along with the type ID. This allows to gradually add support for kernel
module BTFs and using/tracking module types across BPF helper calls and
registers.
This patch also renames btf_id() function to btf_obj_id() to minimize naming
clash with using btf_id to denote BTF *type* ID, rather than BTF *object*'s ID.
Also, altough btf_vmlinux can't get destructed and thus doesn't need
refcounting, module BTFs need that, so apply BTF refcounting universally when
BPF program is using BTF-powered attachment (tp_btf, fentry/fexit, etc). This
makes for simpler clean up code.
Now that BTF type ID is not enough to uniquely identify a BTF type, extend BPF
trampoline key to include BTF object ID. To differentiate that from target
program BPF ID, set 31st bit of type ID. BTF type IDs (at least currently) are
not allowed to take full 32 bits, so there is no danger of confusing that bit
with a valid BTF type ID.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201203204634.1325171-10-andrii@kernel.org
Adds a new bpf_setsockopt for TCP sockets, TCP_BPF_WINDOW_CLAMP,
which sets the maximum receiver window size. It will be useful for
limiting receiver window based on RTT.
Signed-off-by: Prankur gupta <prankgup@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201202213152.435886-2-prankgup@fb.com
RFC 8684 says:
If the token is unknown or the host wants to refuse subflow establishment
(for example, due to a limit on the number of subflows it will permit),
the receiver will send back a reset (RST) signal, analogous to an unknown
port in TCP, containing an MP_TCPRST option (Section 3.6) with an
"MPTCP specific error" reason code.
mptcp-next doesn't support MP_TCPRST yet, this can be added in another
change.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The Multipath-TCP standard (RFC 8684) says that an MPTCP host should send
a TCP reset if the token in a MP_JOIN request is unknown.
At this time we don't do this, the 3whs completes and the 'new subflow'
is reset afterwards. There are two ways to allow MPTCP to send the
reset.
1. override 'send_synack' callback and emit the rst from there.
The drawback is that the request socket gets inserted into the
listeners queue just to get removed again right away.
2. Send the reset from the 'route_req' function instead.
This avoids the 'add&remove request socket', but route_req lacks the
skb that is required to send the TCP reset.
Instead of just adding the skb to that function for MPTCP sake alone,
Paolo suggested to merge init_req and route_req functions.
This saves one indirection from syn processing path and provides the skb
to the merged function at the same time.
'send reset on unknown mptcp join token' is added in next patch.
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
when 'act_mpls' is used to mangle the LSE, the current value is read from
the packet dereferencing 4 bytes at mpls_hdr(): ensure that the label is
contained in the skb "linear" area.
Found by code inspection.
v2:
- use MPLS_HLEN instead of sizeof(new_lse), thanks to Jakub Kicinski
Fixes: 2a2ea50870 ("net: sched: add mpls manipulation actions to TC")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Guillaume Nault <gnault@redhat.com>
Link: https://lore.kernel.org/r/3243506cba43d14858f3bd21ee0994160e44d64a.1606987058.git.dcaratti@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
when openvswitch is configured to mangle the LSE, the current value is
read from the packet dereferencing 4 bytes at mpls_hdr(): ensure that
the label is contained in the skb "linear" area.
Found by code inspection.
Fixes: d27cf5c59a ("net: core: add MPLS update core helper and use in OvS")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Link: https://lore.kernel.org/r/aa099f245d93218b84b5c056b67b6058ccf81a66.1606987185.git.dcaratti@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
skb_mpls_dec_ttl() reads the LSE without ensuring that it is contained in
the skb "linear" area. Fix this calling pskb_may_pull() before reading the
current ttl.
Found by code inspection.
Fixes: 2a2ea50870 ("net: sched: add mpls manipulation actions to TC")
Reported-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Link: https://lore.kernel.org/r/53659f28be8bc336c113b5254dc637cc76bbae91.1606987074.git.dcaratti@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Do not use rlimit-based memory accounting for xskmap maps.
It has been replaced with the memcg-based memory accounting.
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-31-guro@fb.com
Do not use rlimit-based memory accounting for sockmap and sockhash maps.
It has been replaced with the memcg-based memory accounting.
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-29-guro@fb.com
Include internal metadata into the memcg-based memory accounting.
Also include the memory allocated on updating an element.
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-17-guro@fb.com
The .x25_addr[] address comes from the user and is not necessarily
NUL terminated. This leads to a couple problems. The first problem is
that the strlen() in x25_bind() can read beyond the end of the buffer.
The second problem is more subtle and could result in memory corruption.
The call tree is:
x25_connect()
--> x25_write_internal()
--> x25_addr_aton()
The .x25_addr[] buffers are copied to the "addresses" buffer from
x25_write_internal() so it will lead to stack corruption.
Verify that the strings are NUL terminated and return -EINVAL if they
are not.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Fixes: a9288525d2 ("X25: Dont let x25_bind use addresses containing characters")
Reported-by: "kiyin(尹亮)" <kiyin@tencent.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Martin Schiller <ms@dev.tdt.de>
Link: https://lore.kernel.org/r/X8ZeAKm8FnFpN//B@mwanda
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Modify the tx writeable condition from the queue is not full to the
number of present tx queues is less than the half of the total number
of queues. Because the tx queue not full is a very short time, this will
cause a large number of EPOLLOUT events, and cause a large number of
process wake up.
Fixes: 35fcde7f8d ("xsk: support for Tx")
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/508fef55188d4e1160747ead64c6dcda36735880.1606555939.git.xuanzhuo@linux.alibaba.com
datagram_poll will judge the current socket status (EPOLLIN, EPOLLOUT)
based on the traditional socket information (eg: sk_wmem_alloc), but
this does not apply to xsk. So this patch uses sock_poll_wait instead of
datagram_poll, and the mask is calculated by xsk_poll.
Fixes: c497176cb2 ("xsk: add Rx receive functions and poll support")
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/e82f4697438cd63edbf271ebe1918db8261b7c09.1606555939.git.xuanzhuo@linux.alibaba.com
I have to now lock/unlock socket for the bind hook execution.
That shouldn't cause any overhead because the socket is unbound
and shouldn't receive any traffic.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrey Ignatov <rdna@fb.com>
Link: https://lore.kernel.org/bpf/20201202172516.3483656-3-sdf@google.com
If a packet is ready in receive queue, and application isssues
a recvmsg()/recvfrom()/recvmmsg() request asking for zero bytes,
we hang in mptcp_recvmsg().
Fixes: ea4ca586b1 ("mptcp: refine MPTCP-level ack scheduling")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Link: https://lore.kernel.org/r/20201202171657.1185108-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
syzkaller managed to crash the kernel using an NBMA ip6gre interface. I
could reproduce it creating an NBMA ip6gre interface and forwarding
traffic to it:
skbuff: skb_under_panic: text:ffffffff8250e927 len:148 put:44 head:ffff8c03c7a33
------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:109!
Call Trace:
skb_push+0x10/0x10
ip6gre_header+0x47/0x1b0
neigh_connected_output+0xae/0xf0
ip6gre tunnel provides its own header_ops->create, and sets it
conditionally when initializing the tunnel in NBMA mode. When
header_ops->create is used, dev->hard_header_len should reflect the
length of the header created. Otherwise, when not used,
dev->needed_headroom should be used.
Fixes: eb95f52fc7 ("net: ipv6_gre: Fix GRO to work on IPv6 over GRE tap")
Cc: Maria Pasechnik <mariap@mellanox.com>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Link: https://lore.kernel.org/r/20201130161911.464106-1-atenart@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
'snprintf' returns the number of characters which would have been written
if enough space had been available, excluding the terminating null byte.
Thus, the return value of 'sizeof(buf)' means that the last character
has been dropped.
Signed-off-by: Fedor Tokarev <ftokarev@gmail.com>
Fixes: 2f34b8bfae ("SUNRPC: add links for all client xprts to debugfs")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
While we always want to align to the next page and/or the beginning of
the tail buffer when we call xdr_set_next_page(), the functions
xdr_align_data() and xdr_expand_hole() really want to align to the next
object in that next page or tail.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
rpc_prepare_reply_pages() currently expects the 'hdrsize' argument to
contain the length of the data that we expect to want placed in the head
kvec plus a count of 1 word of padding that is placed after the page data.
This is very confusing when trying to read the code, and sometimes leads
to callers adding an arbitrary value of '1' just in order to satisfy the
requirement (whether or not the page data actually needs such padding).
This patch aims to clarify the code by changing the 'hdrsize' argument
to remove that 1 word of padding. This means we need to subtract the
padding from all the existing callers.
Fixes: 02ef04e432 ("NFS: Account for XDR pad of buf->pages")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Fix up xdr_read_pages() so that it can handle object lengths that are
larger than the page length, by simply aligning to the next object in
the buffer tail.
The function will continue to return the length of the truncate object
data that actually fit into the pages.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Allow xdr_set_iov() to set a base so that we can use it to set the
cursor to a specific position in the kvec buffer.
If the new base overflows the kvec/pages buffer in either xdr_set_iov()
or xdr_set_page_base(), then truncate it so that we point to the end of
the buffer.
Finally, change both function to return the number of bytes remaining to
read in their buffers.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
We already know that the head buffer and page are empty, so if there is
any data, it is in the tail.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
After we've looked up the transport module, we need to ensure it can't
go away until we've finished running the transport setup code.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
According to RFC5666, the correct netid for an IPv6 addressed RDMA
transport is "rdma6", which we've supported as a mount option since
Linux-4.7. The problem is when we try to load the module "xprtrdma6",
that will fail, since there is no modulealias of that name.
Fixes: 181342c5eb ("xprtrdma: Add rdma6 option to support NFS/RDMA IPv6")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Currently, we wake up the tasks by priority queue ordering, which means
that we ignore the batching that is supposed to help with QoS issues.
Fixes: c049f8ea9a ("SUNRPC: Remove the bh-safe lock requirement on the rpc_wait_queue->lock")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
There's no need to defer allocation of pages for the receive buffer.
- This upcall is quite infrequent
- gssp_alloc_receive_pages() can allocate the pages with GFP_KERNEL,
unlike the transport
- gssp_alloc_receive_pages() knows exactly how many pages are needed
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Olga Kornievskaia <aglo@umich.edu>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Introduce get link command which loops through
all available links of all available link groups. It
uses the SMC-R linkgroup list as entry point, not
the socket list, which makes linkgroup diagnosis
possible, in case linkgroup does not contain active
connections anymore.
Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Introduce get linkgroup command which loops through
all available SMCR linkgroups. It uses the SMC-R linkgroup
list as entry point, not the socket list, which makes
linkgroup diagnosis possible, in case linkgroup does not
contain active connections anymore.
Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add new netlink command to obtain system information
of the smc module.
Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Encapsulate the smc ism v2 capability boolean value
in a function for better information hiding.
Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
During link creation add net-device ifindex and ib-device
name to link structure. This is needed for diagnostic purposes.
When diagnostic information is gathered, we need to traverse
device, linkgroup and link structures, to be able to do that
we need to hold a spinlock for the linkgroup list, without this
diagnostic information in link structure, another device list
mutex holding would be necessary to dereference the device
pointer in the link structure which would be impossible when
holding a spinlock already.
Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
During smc ib-device creation, add network device ifindex to smc
ib-device structure. Register for netdevice changes and update ib-device
accordingly. This is needed for diagnostic purposes.
Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add link counters to the structure of the smc ib device, one counter per
ib port. Increase/decrease the counters as needed in the corresponding
routines.
Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add connection counters to the structure of the link.
Increase/decrease the counters as needed in the corresponding
routines.
Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use active link of the connection directly and not
via linkgroup array structure when obtaining link
data of the connection.
Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The helper smc_connect_abort() can be used by the listen processing
functions, too. And rename this helper to smc_conn_abort() to make the
purpose clearer.
No functional change.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Recent changes made to remove AES constants started using protocol
aware salt_size. ctx->prot_info's salt_size is filled in tls sw case,
but not in tls offload mode, and was working so far because of the
hard coded value was used.
Fixes: 6942a284fb ("net/tls: make inline helpers protocol-aware")
Signed-off-by: Rohit Maheshwari <rohitm@chelsio.com>
Link: https://lore.kernel.org/r/20201201090752.27355-1-rohitm@chelsio.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl/EM9oeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG/3kH/RNkFyTlHlUkZpJx
8Ks2yWgUln7YhZcmOaG/IcIyWnhCgo3l35kiaH7XxM+rPMZzidp51MHUllaTAQDc
u+5EFHMJsmTWUfE8ocHPb1cPdYEDSoVr6QUsixbL9+uADpRz+VZVtWMb89EiyMrC
wvLIzpnqY5UNriWWBxD0hrmSsT4g9XCsauer4k2KB+zvebwg6vFOMCFLFc2qz7fb
ABsrPFqLZOMp+16chGxyHP7LJ6ygI/Hwf7tPW8ppv4c+hes4HZg7yqJxXhV02QbJ
s10s6BTcEWMqKg/T6L/VoScsMHWUcNdvrr3uuPQhgup240XdmB1XO8rOKddw27e7
VIjrjNw=
=4ZaP
-----END PGP SIGNATURE-----
Merge tag 'v5.10-rc6' into rdma.git for-next
For dependencies in following patches
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The last user of the RTNL brother of dev_getfirstbyhwtype (the latter
being synchronized under RCU) has been deleted in commit b4db2b35fc
("afs: Use core kernel UUID generation").
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20201129200550.2433401-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix Return: kernel-doc notation in all net/tipc/ source files.
Also keep ReST list notation intact for output formatting.
Fix a few typos in comments.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix socket.c kernel-doc warnings in preparation for adding to the
networking docbook.
Also, for rcvbuf_limit(), use bullet notation so that the lines do
not run together.
../net/tipc/socket.c:130: warning: Function parameter or member 'cong_links' not described in 'tipc_sock'
../net/tipc/socket.c:130: warning: Function parameter or member 'probe_unacked' not described in 'tipc_sock'
../net/tipc/socket.c:130: warning: Function parameter or member 'snd_win' not described in 'tipc_sock'
../net/tipc/socket.c:130: warning: Function parameter or member 'peer_caps' not described in 'tipc_sock'
../net/tipc/socket.c:130: warning: Function parameter or member 'rcv_win' not described in 'tipc_sock'
../net/tipc/socket.c:130: warning: Function parameter or member 'group' not described in 'tipc_sock'
../net/tipc/socket.c:130: warning: Function parameter or member 'oneway' not described in 'tipc_sock'
../net/tipc/socket.c:130: warning: Function parameter or member 'nagle_start' not described in 'tipc_sock'
../net/tipc/socket.c:130: warning: Function parameter or member 'snd_backlog' not described in 'tipc_sock'
../net/tipc/socket.c:130: warning: Function parameter or member 'msg_acc' not described in 'tipc_sock'
../net/tipc/socket.c:130: warning: Function parameter or member 'pkt_cnt' not described in 'tipc_sock'
../net/tipc/socket.c:130: warning: Function parameter or member 'expect_ack' not described in 'tipc_sock'
../net/tipc/socket.c:130: warning: Function parameter or member 'nodelay' not described in 'tipc_sock'
../net/tipc/socket.c:130: warning: Function parameter or member 'group_is_open' not described in 'tipc_sock'
../net/tipc/socket.c:267: warning: Function parameter or member 'sk' not described in 'tsk_advance_rx_queue'
../net/tipc/socket.c:295: warning: Function parameter or member 'sk' not described in 'tsk_rej_rx_queue'
../net/tipc/socket.c:295: warning: Function parameter or member 'error' not described in 'tsk_rej_rx_queue'
../net/tipc/socket.c:894: warning: Function parameter or member 'tsk' not described in 'tipc_send_group_msg'
../net/tipc/socket.c:1187: warning: Function parameter or member 'net' not described in 'tipc_sk_mcast_rcv'
../net/tipc/socket.c:1323: warning: Function parameter or member 'inputq' not described in 'tipc_sk_conn_proto_rcv'
../net/tipc/socket.c:1323: warning: Function parameter or member 'xmitq' not described in 'tipc_sk_conn_proto_rcv'
../net/tipc/socket.c:1885: warning: Function parameter or member 'sock' not described in 'tipc_recvmsg'
../net/tipc/socket.c:1993: warning: Function parameter or member 'sock' not described in 'tipc_recvstream'
../net/tipc/socket.c:2313: warning: Function parameter or member 'xmitq' not described in 'tipc_sk_filter_rcv'
../net/tipc/socket.c:2404: warning: Function parameter or member 'xmitq' not described in 'tipc_sk_enqueue'
../net/tipc/socket.c:2456: warning: Function parameter or member 'net' not described in 'tipc_sk_rcv'
../net/tipc/socket.c:2693: warning: Function parameter or member 'kern' not described in 'tipc_accept'
../net/tipc/socket.c:3816: warning: Excess function parameter 'sysctl_tipc_sk_filter' description in 'tipc_sk_filtering'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix node.c kernel-doc warnings in preparation for adding to the
networking docbook.
../net/tipc/node.c:141: warning: Function parameter or member 'kref' not described in 'tipc_node'
../net/tipc/node.c:141: warning: Function parameter or member 'bc_entry' not described in 'tipc_node'
../net/tipc/node.c:141: warning: Function parameter or member 'failover_sent' not described in 'tipc_node'
../net/tipc/node.c:141: warning: Function parameter or member 'peer_id' not described in 'tipc_node'
../net/tipc/node.c:141: warning: Function parameter or member 'peer_id_string' not described in 'tipc_node'
../net/tipc/node.c:141: warning: Function parameter or member 'conn_sks' not described in 'tipc_node'
../net/tipc/node.c:141: warning: Function parameter or member 'keepalive_intv' not described in 'tipc_node'
../net/tipc/node.c:141: warning: Function parameter or member 'timer' not described in 'tipc_node'
../net/tipc/node.c:141: warning: Function parameter or member 'peer_net' not described in 'tipc_node'
../net/tipc/node.c:141: warning: Function parameter or member 'peer_hash_mix' not described in 'tipc_node'
../net/tipc/node.c:273: warning: Function parameter or member '__n' not described in 'tipc_node_crypto_rx'
../net/tipc/node.c:822: warning: Function parameter or member 'n' not described in '__tipc_node_link_up'
../net/tipc/node.c:822: warning: Function parameter or member 'bearer_id' not described in '__tipc_node_link_up'
../net/tipc/node.c:822: warning: Function parameter or member 'xmitq' not described in '__tipc_node_link_up'
../net/tipc/node.c:888: warning: Function parameter or member 'n' not described in 'tipc_node_link_up'
../net/tipc/node.c:888: warning: Function parameter or member 'bearer_id' not described in 'tipc_node_link_up'
../net/tipc/node.c:888: warning: Function parameter or member 'xmitq' not described in 'tipc_node_link_up'
../net/tipc/node.c:948: warning: Function parameter or member 'n' not described in '__tipc_node_link_down'
../net/tipc/node.c:948: warning: Function parameter or member 'bearer_id' not described in '__tipc_node_link_down'
../net/tipc/node.c:948: warning: Function parameter or member 'xmitq' not described in '__tipc_node_link_down'
../net/tipc/node.c:948: warning: Function parameter or member 'maddr' not described in '__tipc_node_link_down'
../net/tipc/node.c:1537: warning: Function parameter or member 'net' not described in 'tipc_node_get_linkname'
../net/tipc/node.c:1537: warning: Function parameter or member 'len' not described in 'tipc_node_get_linkname'
../net/tipc/node.c:1891: warning: Function parameter or member 'n' not described in 'tipc_node_check_state'
../net/tipc/node.c:1891: warning: Function parameter or member 'xmitq' not described in 'tipc_node_check_state'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix name_table.c kernel-doc warnings in preparation for adding to the
networking docbook.
../net/tipc/name_table.c:115: warning: Function parameter or member 'start' not described in 'service_range_foreach_match'
../net/tipc/name_table.c:115: warning: Function parameter or member 'end' not described in 'service_range_foreach_match'
../net/tipc/name_table.c:127: warning: Function parameter or member 'start' not described in 'service_range_match_first'
../net/tipc/name_table.c:127: warning: Function parameter or member 'end' not described in 'service_range_match_first'
../net/tipc/name_table.c:176: warning: Function parameter or member 'start' not described in 'service_range_match_next'
../net/tipc/name_table.c:176: warning: Function parameter or member 'end' not described in 'service_range_match_next'
../net/tipc/name_table.c:225: warning: Function parameter or member 'type' not described in 'tipc_publ_create'
../net/tipc/name_table.c:225: warning: Function parameter or member 'lower' not described in 'tipc_publ_create'
../net/tipc/name_table.c:225: warning: Function parameter or member 'upper' not described in 'tipc_publ_create'
../net/tipc/name_table.c:225: warning: Function parameter or member 'scope' not described in 'tipc_publ_create'
../net/tipc/name_table.c:225: warning: Function parameter or member 'node' not described in 'tipc_publ_create'
../net/tipc/name_table.c:225: warning: Function parameter or member 'port' not described in 'tipc_publ_create'
../net/tipc/name_table.c:225: warning: Function parameter or member 'key' not described in 'tipc_publ_create'
../net/tipc/name_table.c:252: warning: Function parameter or member 'type' not described in 'tipc_service_create'
../net/tipc/name_table.c:252: warning: Function parameter or member 'hd' not described in 'tipc_service_create'
../net/tipc/name_table.c:367: warning: Function parameter or member 'sr' not described in 'tipc_service_remove_publ'
../net/tipc/name_table.c:367: warning: Function parameter or member 'node' not described in 'tipc_service_remove_publ'
../net/tipc/name_table.c:367: warning: Function parameter or member 'key' not described in 'tipc_service_remove_publ'
../net/tipc/name_table.c:383: warning: Function parameter or member 'pa' not described in 'publication_after'
../net/tipc/name_table.c:383: warning: Function parameter or member 'pb' not described in 'publication_after'
../net/tipc/name_table.c:401: warning: Function parameter or member 'service' not described in 'tipc_service_subscribe'
../net/tipc/name_table.c:401: warning: Function parameter or member 'sub' not described in 'tipc_service_subscribe'
../net/tipc/name_table.c:546: warning: Function parameter or member 'net' not described in 'tipc_nametbl_translate'
../net/tipc/name_table.c:546: warning: Function parameter or member 'type' not described in 'tipc_nametbl_translate'
../net/tipc/name_table.c:546: warning: Function parameter or member 'instance' not described in 'tipc_nametbl_translate'
../net/tipc/name_table.c:546: warning: Function parameter or member 'dnode' not described in 'tipc_nametbl_translate'
../net/tipc/name_table.c:762: warning: Function parameter or member 'net' not described in 'tipc_nametbl_withdraw'
../net/tipc/name_table.c:762: warning: Function parameter or member 'type' not described in 'tipc_nametbl_withdraw'
../net/tipc/name_table.c:762: warning: Function parameter or member 'lower' not described in 'tipc_nametbl_withdraw'
../net/tipc/name_table.c:762: warning: Function parameter or member 'upper' not described in 'tipc_nametbl_withdraw'
../net/tipc/name_table.c:762: warning: Function parameter or member 'key' not described in 'tipc_nametbl_withdraw'
../net/tipc/name_table.c:796: warning: Function parameter or member 'sub' not described in 'tipc_nametbl_subscribe'
../net/tipc/name_table.c:826: warning: Function parameter or member 'sub' not described in 'tipc_nametbl_unsubscribe'
../net/tipc/name_table.c:876: warning: Function parameter or member 'net' not described in 'tipc_service_delete'
../net/tipc/name_table.c:876: warning: Function parameter or member 'sc' not described in 'tipc_service_delete'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix name_distr.c kernel-doc warnings in preparation for adding to the
networking docbook.
../net/tipc/name_distr.c:55: warning: Function parameter or member 'i' not described in 'publ_to_item'
../net/tipc/name_distr.c:55: warning: Function parameter or member 'p' not described in 'publ_to_item'
../net/tipc/name_distr.c:70: warning: Function parameter or member 'net' not described in 'named_prepare_buf'
../net/tipc/name_distr.c:70: warning: Function parameter or member 'type' not described in 'named_prepare_buf'
../net/tipc/name_distr.c:70: warning: Function parameter or member 'size' not described in 'named_prepare_buf'
../net/tipc/name_distr.c:70: warning: Function parameter or member 'dest' not described in 'named_prepare_buf'
../net/tipc/name_distr.c:88: warning: Function parameter or member 'net' not described in 'tipc_named_publish'
../net/tipc/name_distr.c:88: warning: Function parameter or member 'publ' not described in 'tipc_named_publish'
../net/tipc/name_distr.c:116: warning: Function parameter or member 'net' not described in 'tipc_named_withdraw'
../net/tipc/name_distr.c:116: warning: Function parameter or member 'publ' not described in 'tipc_named_withdraw'
../net/tipc/name_distr.c:147: warning: Function parameter or member 'net' not described in 'named_distribute'
../net/tipc/name_distr.c:147: warning: Function parameter or member 'seqno' not described in 'named_distribute'
../net/tipc/name_distr.c:199: warning: Function parameter or member 'net' not described in 'tipc_named_node_up'
../net/tipc/name_distr.c:199: warning: Function parameter or member 'dnode' not described in 'tipc_named_node_up'
../net/tipc/name_distr.c:199: warning: Function parameter or member 'capabilities' not described in 'tipc_named_node_up'
../net/tipc/name_distr.c:225: warning: Function parameter or member 'net' not described in 'tipc_publ_purge'
../net/tipc/name_distr.c:225: warning: Function parameter or member 'publ' not described in 'tipc_publ_purge'
../net/tipc/name_distr.c:225: warning: Function parameter or member 'addr' not described in 'tipc_publ_purge'
../net/tipc/name_distr.c:272: warning: Function parameter or member 'net' not described in 'tipc_update_nametbl'
../net/tipc/name_distr.c:272: warning: Function parameter or member 'i' not described in 'tipc_update_nametbl'
../net/tipc/name_distr.c:272: warning: Function parameter or member 'node' not described in 'tipc_update_nametbl'
../net/tipc/name_distr.c:272: warning: Function parameter or member 'dtype' not described in 'tipc_update_nametbl'
../net/tipc/name_distr.c:353: warning: Function parameter or member 'net' not described in 'tipc_named_rcv'
../net/tipc/name_distr.c:353: warning: Function parameter or member 'namedq' not described in 'tipc_named_rcv'
../net/tipc/name_distr.c:353: warning: Function parameter or member 'rcv_nxt' not described in 'tipc_named_rcv'
../net/tipc/name_distr.c:353: warning: Function parameter or member 'open' not described in 'tipc_named_rcv'
../net/tipc/name_distr.c:383: warning: Function parameter or member 'net' not described in 'tipc_named_reinit'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix link.c kernel-doc warnings in preparation for adding to the
networking docbook.
../net/tipc/link.c:200: warning: Function parameter or member 'session' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'snd_nxt_state' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'rcv_nxt_state' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'in_session' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'active' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'if_name' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'rst_cnt' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'drop_point' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'failover_reasm_skb' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'failover_deferdq' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'transmq' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'backlog' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'snd_nxt' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'rcv_unacked' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'deferdq' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'window' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'min_win' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'ssthresh' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'max_win' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'cong_acks' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'checkpoint' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'reasm_tnlmsg' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'last_gap' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'last_ga' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'bc_rcvlink' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'bc_sndlink' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'nack_state' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'bc_peer_is_up' not described in 'tipc_link'
../net/tipc/link.c:473: warning: Function parameter or member 'self' not described in 'tipc_link_create'
../net/tipc/link.c:473: warning: Function parameter or member 'peer_id' not described in 'tipc_link_create'
../net/tipc/link.c:473: warning: Excess function parameter 'ownnode' description in 'tipc_link_create'
../net/tipc/link.c:544: warning: Function parameter or member 'ownnode' not described in 'tipc_link_bc_create'
../net/tipc/link.c:544: warning: Function parameter or member 'peer' not described in 'tipc_link_bc_create'
../net/tipc/link.c:544: warning: Function parameter or member 'peer_id' not described in 'tipc_link_bc_create'
../net/tipc/link.c:544: warning: Function parameter or member 'peer_caps' not described in 'tipc_link_bc_create'
../net/tipc/link.c:544: warning: Function parameter or member 'bc_sndlink' not described in 'tipc_link_bc_create'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix kernel-doc warnings in bearer.c:
../net/tipc/bearer.c:77: warning: Function parameter or member 'name' not described in 'tipc_media_find'
../net/tipc/bearer.c:91: warning: Function parameter or member 'type' not described in 'media_find_id'
../net/tipc/bearer.c:105: warning: Function parameter or member 'buf' not described in 'tipc_media_addr_printf'
../net/tipc/bearer.c:105: warning: Function parameter or member 'len' not described in 'tipc_media_addr_printf'
../net/tipc/bearer.c:105: warning: Function parameter or member 'a' not described in 'tipc_media_addr_printf'
../net/tipc/bearer.c:174: warning: Function parameter or member 'net' not described in 'tipc_bearer_find'
../net/tipc/bearer.c:174: warning: Function parameter or member 'name' not described in 'tipc_bearer_find'
../net/tipc/bearer.c:238: warning: Function parameter or member 'net' not described in 'tipc_enable_bearer'
../net/tipc/bearer.c:238: warning: Function parameter or member 'name' not described in 'tipc_enable_bearer'
../net/tipc/bearer.c:238: warning: Function parameter or member 'disc_domain' not described in 'tipc_enable_bearer'
../net/tipc/bearer.c:238: warning: Function parameter or member 'prio' not described in 'tipc_enable_bearer'
../net/tipc/bearer.c:238: warning: Function parameter or member 'attr' not described in 'tipc_enable_bearer'
../net/tipc/bearer.c:350: warning: Function parameter or member 'net' not described in 'tipc_reset_bearer'
../net/tipc/bearer.c:350: warning: Function parameter or member 'b' not described in 'tipc_reset_bearer'
../net/tipc/bearer.c:374: warning: Function parameter or member 'net' not described in 'bearer_disable'
../net/tipc/bearer.c:374: warning: Function parameter or member 'b' not described in 'bearer_disable'
../net/tipc/bearer.c:462: warning: Function parameter or member 'net' not described in 'tipc_l2_send_msg'
../net/tipc/bearer.c:479: warning: Function parameter or member 'net' not described in 'tipc_l2_send_msg'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
kernel-doc and Sphinx fixes to eliminate lots of warnings
in preparation for adding to the networking docbook.
../net/tipc/crypto.c:57: warning: cannot understand function prototype: 'enum '
../net/tipc/crypto.c:69: warning: cannot understand function prototype: 'enum '
../net/tipc/crypto.c:130: warning: Function parameter or member 'tfm' not described in 'tipc_tfm'
../net/tipc/crypto.c:130: warning: Function parameter or member 'list' not described in 'tipc_tfm'
../net/tipc/crypto.c:172: warning: Function parameter or member 'stat' not described in 'tipc_crypto_stats'
../net/tipc/crypto.c:232: warning: Function parameter or member 'flags' not described in 'tipc_crypto'
../net/tipc/crypto.c:329: warning: Function parameter or member 'ukey' not described in 'tipc_aead_key_validate'
../net/tipc/crypto.c:329: warning: Function parameter or member 'info' not described in 'tipc_aead_key_validate'
../net/tipc/crypto.c:482: warning: Function parameter or member 'aead' not described in 'tipc_aead_tfm_next'
../net/tipc/trace.c:43: warning: cannot understand function prototype: 'unsigned long sysctl_tipc_sk_filter[5] __read_mostly = '
Documentation/networking/tipc:57: ../net/tipc/msg.c:584: WARNING: Unexpected indentation.
Documentation/networking/tipc:63: ../net/tipc/name_table.c:536: WARNING: Unexpected indentation.
Documentation/networking/tipc:63: ../net/tipc/name_table.c:537: WARNING: Block quote ends without a blank line; unexpected unindent.
Documentation/networking/tipc:78: ../net/tipc/socket.c:3809: WARNING: Unexpected indentation.
Documentation/networking/tipc:78: ../net/tipc/socket.c:3807: WARNING: Inline strong start-string without end-string.
Documentation/networking/tipc:72: ../net/tipc/node.c:904: WARNING: Unexpected indentation.
Documentation/networking/tipc:39: ../net/tipc/crypto.c:97: WARNING: Block quote ends without a blank line; unexpected unindent.
Documentation/networking/tipc:39: ../net/tipc/crypto.c:98: WARNING: Block quote ends without a blank line; unexpected unindent.
Documentation/networking/tipc:39: ../net/tipc/crypto.c:141: WARNING: Inline strong start-string without end-string.
../net/tipc/discover.c:82: warning: Function parameter or member 'skb' not described in 'tipc_disc_init_msg'
../net/tipc/msg.c:69: warning: Function parameter or member 'gfp' not described in 'tipc_buf_acquire'
../net/tipc/msg.c:382: warning: Function parameter or member 'offset' not described in 'tipc_msg_build'
../net/tipc/msg.c:708: warning: Function parameter or member 'net' not described in 'tipc_msg_lookup_dest'
../net/tipc/subscr.c:65: warning: Function parameter or member 'seq' not described in 'tipc_sub_check_overlap'
../net/tipc/subscr.c:65: warning: Function parameter or member 'found_lower' not described in 'tipc_sub_check_overlap'
../net/tipc/subscr.c:65: warning: Function parameter or member 'found_upper' not described in 'tipc_sub_check_overlap'
../net/tipc/udp_media.c:75: warning: Function parameter or member 'proto' not described in 'udp_media_addr'
../net/tipc/udp_media.c:75: warning: Function parameter or member 'port' not described in 'udp_media_addr'
../net/tipc/udp_media.c:75: warning: Function parameter or member 'ipv4' not described in 'udp_media_addr'
../net/tipc/udp_media.c:75: warning: Function parameter or member 'ipv6' not described in 'udp_media_addr'
../net/tipc/udp_media.c:98: warning: Function parameter or member 'rcast' not described in 'udp_bearer'
Also fixed a typo of "duest" to "dest".
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix tipc header files for adding to the networking docbook.
Remove some uses of "/**" that were not kernel-doc notation.
Fix some source formatting to eliminate Sphinx warnings.
Add missing struct member and function argument kernel-doc descriptions.
Correct the description of a couple of struct members that were
marked as "(FIXME)".
Documentation/networking/tipc:18: ../net/tipc/name_table.h:65: WARNING: Unexpected indentation.
Documentation/networking/tipc:18: ../net/tipc/name_table.h:66: WARNING: Block quote ends without a blank line; unexpected unindent.
../net/tipc/bearer.h:128: warning: Function parameter or member 'min_win' not described in 'tipc_media'
../net/tipc/bearer.h:128: warning: Function parameter or member 'max_win' not described in 'tipc_media'
../net/tipc/bearer.h:171: warning: Function parameter or member 'min_win' not described in 'tipc_bearer'
../net/tipc/bearer.h:171: warning: Function parameter or member 'max_win' not described in 'tipc_bearer'
../net/tipc/bearer.h:171: warning: Function parameter or member 'disc' not described in 'tipc_bearer'
../net/tipc/bearer.h:171: warning: Function parameter or member 'up' not described in 'tipc_bearer'
../net/tipc/bearer.h:171: warning: Function parameter or member 'refcnt' not described in 'tipc_bearer'
../net/tipc/name_distr.h:68: warning: Function parameter or member 'port' not described in 'distr_item'
../net/tipc/name_table.h:111: warning: Function parameter or member 'services' not described in 'name_table'
../net/tipc/name_table.h:111: warning: Function parameter or member 'cluster_scope_lock' not described in 'name_table'
../net/tipc/name_table.h:111: warning: Function parameter or member 'rc_dests' not described in 'name_table'
../net/tipc/name_table.h:111: warning: Function parameter or member 'snd_nxt' not described in 'name_table'
../net/tipc/subscr.h:67: warning: Function parameter or member 'kref' not described in 'tipc_subscription'
../net/tipc/subscr.h:67: warning: Function parameter or member 'net' not described in 'tipc_subscription'
../net/tipc/subscr.h:67: warning: Function parameter or member 'service_list' not described in 'tipc_subscription'
../net/tipc/subscr.h:67: warning: Function parameter or member 'conid' not described in 'tipc_subscription'
../net/tipc/subscr.h:67: warning: Function parameter or member 'inactive' not described in 'tipc_subscription'
../net/tipc/subscr.h:67: warning: Function parameter or member 'lock' not described in 'tipc_subscription'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In commit 682cd3cf94
("tipc: confgiure and apply UDP bearer MTU on running links"), we
introduced a function to change UDP bearer MTU and applied this new value
across existing per-link. However, we did not apply this new MTU value at
node level. This lead to packet dropped at link level if its size is
greater than new MTU value.
To fix this issue, we also apply this new MTU value for node level.
Fixes: 682cd3cf94 ("tipc: confgiure and apply UDP bearer MTU on running links")
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Link: https://lore.kernel.org/r/20201130025544.3602-1-hoang.h.le@dektech.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Drivers that support bridge offload need to be notified about changes to
the bridge's VLAN protocol so that they could react accordingly and
potentially veto the change.
Add a new switchdev attribute to communicate the change to drivers.
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
For IPV6_2292PKTOPTIONS, do_ipv6_getsockopt() stores the user pointer
optval in the msg_control field of the msghdr.
Hence, sparse rightfully warns at ./net/ipv6/ipv6_sockglue.c:1151:33:
warning: incorrect type in assignment (different address spaces)
expected void *msg_control
got char [noderef] __user *optval
Since commit 1f466e1f15 ("net: cleanly handle kernel vs user buffers for
->msg_control"), user pointers shall be stored in the msg_control_user
field, and kernel pointers in the msg_control field. This allows to
propagate __user annotations nicely through this struct.
Store optval in msg_control_user to properly record and propagate the
memory space annotation of this pointer.
Note that msg_control_is_user is set to true, so the key invariant, i.e.,
use msg_control_user if and only if msg_control_is_user is true, holds.
The msghdr is further used in the six alternative put_cmsg() calls, with
msg_control_is_user being true, put_cmsg() picks msg_control_user
preserving the __user annotation and passes that properly to
copy_to_user().
No functional change. No change in object code.
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20201127093421.21673-1-lukas.bulwahn@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
It turns out that usage of skb extensions can cause memory leaks. Ido
Schimmel reported: "[...] there are instances that blindly overwrite
'skb->extensions' by invoking skb_copy_header() after __alloc_skb()."
Therefore, give up on using skb extensions for KCOV handle, and instead
directly store kcov_handle in sk_buff.
Fixes: 6370cc3bbd ("net: add kcov handle to skb extensions")
Fixes: 85ce50d337 ("net: kcov: don't select SKB_EXTENSIONS when there is no NET")
Fixes: 97f53a08cb ("net: linux/skbuff.h: combine SKB_EXTENSIONS + KCOV handling")
Link: https://lore.kernel.org/linux-wireless/20201121160941.GA485907@shredder.lan/
Reported-by: Ido Schimmel <idosch@idosch.org>
Signed-off-by: Marco Elver <elver@google.com>
Link: https://lore.kernel.org/r/20201125224840.2014773-1-elver@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Functions tfilter_notify_chain() and tcf_get_next_proto() are always called
with rtnl lock held in current implementation. Moreover, attempting to call
them without rtnl lock would cause a warning down the call chain in
function __tcf_get_next_proto() that requires the lock to be held by
callers. Remove the 'rtnl_held' argument in order to simplify the code and
make rtnl lock requirement explicit.
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Link: https://lore.kernel.org/r/20201127151205.23492-1-vladbu@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
True to the message of commit v5.10-rc1-105-g46d6c5ae953c, _do_
actually make use of state->sk when possible, such as in the REJECT
modules.
Reported-by: Minqiang Chen <ptpt52@gmail.com>
Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jan Engelhardt <jengelh@inai.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch removes nfnl_acct_list from struct net to reduce the default
memory footprint for the netns structure.
Signed-off-by: Miao Wang <shankerwangmiao@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Here we could use the '!=' expression to fix the following coccicheck
warning:
./net/netfilter/xt_nfacct.c:30:41-46: WARNING: conversion to bool not needed here
Reported-by: Tosk Robot <tencent_os_robot@tencent.com>
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We have some tasks triggered by the subflow receive path
which require to access the msk socket status, specifically:
mptcp_clean_una() and mptcp_push_pending()
We have almost everything in place to defer to the msk
release_cb such tasks when the msk sock is owned.
Since the worker is no more used to clean the acked data,
for fallback sockets we need to explicitly flush them.
As an added bonus we can move the wake-up code in __mptcp_clean_una(),
simplify a lot mptcp_poll() and move the timer update under
the data lock.
The worker is now used only to process and send DATA_FIN
packets and do the mptcp-level retransmissions.
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Extending the data_lock scope in mptcp_incoming_option
we can use that to protect both snd_una and wnd_end.
In the typical case, we will have a single atomic op instead of 2
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move the TX skbs allocation in mptcp_sendmsg() scope,
and tentatively pre-allocate a skbs number proportional
to the sendmsg() length.
Use the ssk tx skb cache to prevent the subflow allocation.
This allows removing the msk skb extension cache and will
make possible the later patches.
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Such spinlock is currently used only to protect the 'owned'
flag inside the socket lock itself. With this patch, we extend
its scope to protect the whole msk receive path and
sk_forward_memory.
Given the above, we can always move data into the msk receive
queue (and OoO queue) from the subflow.
We leverage the previous commit, so that we need to acquire the
spinlock in the tx path only when moving fwd memory.
recvmsg() must now explicitly acquire the socket spinlock
when moving skbs out of sk_receive_queue. To reduce the number of
lock operations required we use a second rx queue and splice the
first into the latter in mptcp_lock_sock(). Additionally rmem
allocated memory is bulk-freed via release_cb()
Acked-by: Florian Westphal <fw@strlen.de>
Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This leverages the previous commit to reserve the wmem
required for the sendmsg() operation when the msk socket
lock is first acquired.
Some heuristics are used to get a reasonable [over] estimation of
the whole memory required. If we can't forward alloc such amount
fallback to a reasonable small chunk, otherwise enter the wait
for memory path.
When sendmsg() needs more memory it looks at wmem_reserved
first and if that is exhausted, move more space from
sk_forward_alloc.
The reserved memory is not persistent and is released at the
next socket unlock via the release_cb().
Overall this will simplify the next patch.
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This allows invoking an additional callback under the
socket spin lock.
Will be used by the next patches to avoid additional
spin lock contention.
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add napi_id to the xdp_rxq_info structure, and make sure the XDP
socket pick up the napi_id in the Rx path. The napi_id is used to find
the corresponding NAPI structure for socket busy polling.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/bpf/20201130185205.196029-7-bjorn.topel@gmail.com
Wire-up XDP socket busy-poll support for recvmsg() and sendmsg(). If
the XDP socket prefers busy-polling, make sure that no wakeup/IPI is
performed.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20201130185205.196029-6-bjorn.topel@gmail.com
Add a check for need wake up in sendmsg(), so that if a user calls
sendmsg() when no wakeup is needed, do not trigger a wakeup.
To simplify the need wakeup check in the syscall, unconditionally
enable the need wakeup flag for Tx. This has a side-effect for poll();
If poll() is called for a socket without enabled need wakeup, a Tx
wakeup is unconditionally performed.
The wakeup matrix for AF_XDP now looks like:
need wakeup | poll() | sendmsg() | recvmsg()
------------+--------------+-------------+------------
disabled | wake Tx | wake Tx | nop
enabled | check flag; | check flag; | check flag;
| wake Tx/Rx | wake Tx | wake Rx
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20201130185205.196029-5-bjorn.topel@gmail.com
Add support for non-blocking recvmsg() to XDP sockets. Previously,
only sendmsg() was supported by XDP socket. Now, for symmetry and the
upcoming busy-polling support, recvmsg() is added.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20201130185205.196029-4-bjorn.topel@gmail.com
This option lets a user set a per socket NAPI budget for
busy-polling. If the options is not set, it will use the default of 8.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/bpf/20201130185205.196029-3-bjorn.topel@gmail.com
The existing busy-polling mode, enabled by the SO_BUSY_POLL socket
option or system-wide using the /proc/sys/net/core/busy_read knob, is
an opportunistic. That means that if the NAPI context is not
scheduled, it will poll it. If, after busy-polling, the budget is
exceeded the busy-polling logic will schedule the NAPI onto the
regular softirq handling.
One implication of the behavior above is that a busy/heavy loaded NAPI
context will never enter/allow for busy-polling. Some applications
prefer that most NAPI processing would be done by busy-polling.
This series adds a new socket option, SO_PREFER_BUSY_POLL, that works
in concert with the napi_defer_hard_irqs and gro_flush_timeout
knobs. The napi_defer_hard_irqs and gro_flush_timeout knobs were
introduced in commit 6f8b12d661 ("net: napi: add hard irqs deferral
feature"), and allows for a user to defer interrupts to be enabled and
instead schedule the NAPI context from a watchdog timer. When a user
enables the SO_PREFER_BUSY_POLL, again with the other knobs enabled,
and the NAPI context is being processed by a softirq, the softirq NAPI
processing will exit early to allow the busy-polling to be performed.
If the application stops performing busy-polling via a system call,
the watchdog timer defined by gro_flush_timeout will timeout, and
regular softirq handling will resume.
In summary; Heavy traffic applications that prefer busy-polling over
softirq processing should use this option.
Example usage:
$ echo 2 | sudo tee /sys/class/net/ens785f1/napi_defer_hard_irqs
$ echo 200000 | sudo tee /sys/class/net/ens785f1/gro_flush_timeout
Note that the timeout should be larger than the userspace processing
window, otherwise the watchdog will timeout and fall back to regular
softirq processing.
Enable the SO_BUSY_POLL/SO_PREFER_BUSY_POLL options on your socket.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/bpf/20201130185205.196029-2-bjorn.topel@gmail.com
It turns out that it does exist a path where xdp_return_buff() is
being passed an XDP buffer of type MEM_TYPE_XSK_BUFF_POOL. This path
is when AF_XDP zero-copy mode is enabled, and a buffer is redirected
to a DEVMAP with an attached XDP program that drops the buffer.
This change simply puts the handling of MEM_TYPE_XSK_BUFF_POOL back
into xdp_return_buff().
Fixes: 82c41671ca ("xdp: Simplify xdp_return_{frame, frame_rx_napi, buff}")
Reported-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Link: https://lore.kernel.org/bpf/20201127171726.123627-1-bjorn.topel@gmail.com
Convert the READ_BUF macro in nfs4xdr.c from open code to instead
use the new xdr_stream-style decoders already in use by the encode
side (and by the in-kernel NFS client implementation). Once this
conversion is done, each individual NFSv4 argument decoder can be
independently cleaned up to replace these macros with C code.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
A "permanent" struct xdr_stream is allocated in struct svc_rqst so
that it is usable by all server-side decoders. A per-rqst scratch
buffer is also allocated to handle decoding XDR data items that
cross page boundaries.
To demonstrate how it will be used, add the first call site for the
new svcxdr_init_decode() API.
As an additional part of the overall conversion, add symbolic
constants for successful and failed XDR operations. Returning "0" is
overloaded. Sometimes it means something failed, but sometimes it
means success. To make it more clear when XDR decoding functions
succeed or fail, introduce symbolic constants.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Commit c509f15a58 ("SUNRPC: Split the xdr_buf event class") added
display of the rqst's XID to the svc_xdr_buf_class. However, when
the recvfrom tracepoint fires, rq_xid has yet to be filled in with
the current XID. So it ends up recording the previous XID that was
handled by that svc_rqst.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
An efficient way to handle multiple Read chunks is to post them all
together and then take a single completion. This is also how the
code is already structured: when the Read completion fires, all
portions of the incoming RPC message are available to be assembled.
The difficult problem is setting up the Read sink buffers so that
the server pulls the client's data into place, making subsequent
pull-up unnecessary. There are several cases:
* No Read chunks. No-op.
* One data item Read chunk. This is the fast case, where the inline
part of the RPC-over-RDMA message becomes the head and tail, and
the data item chunk is placed in buf->pages.
* A Position-zero Read chunk. Treated like TCP: the Read chunk is
pulled into contiguous pages.
+ A Position-zero Read chunk with data item chunks. Treated like
TCP: all of the Read chunks are pulled into contiguous pages.
+ Multiple data item chunks. Treated like TCP: the inline part is
copied and the data item chunks are pulled into contiguous pages.
The "*" cases are already supported. This patch adds support for the
"+" cases.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
As a pre-requisite for handling multiple Read chunks in each Read
list, convert svc_rdma_recv_read_chunk() to use the new parsed Read
chunk list.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
I'm about to change the purpose of ri_chunklen: Instead of tracking
the number of bytes in one Read chunk, it will track the total
number of bytes in the Read list. Rename it for clarity.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
We already have trace_svcrdma_decode_rseg(), which records each
ingress Read segment. Instead of reporting those again when they
are about to be posted as RDMA Reads, let's fire one tracepoint
before posting each type of chunk.
So we'll get:
nfsd-1998 [002] 321.666615: svcrdma_decode_rseg: cq.id=4 cid=42 segno=0 position=0 192@0x013ca9ebfae14000:0xb0010b05
nfsd-1998 [002] 321.666615: svcrdma_decode_rseg: cq.id=4 cid=42 segno=1 position=0 7688@0x013ca9ebf914e000:0xb0010a05
nfsd-1998 [002] 321.666615: svcrdma_decode_rseg: cq.id=4 cid=42 segno=2 position=0 28@0x013ca9ebfae15000:0xb0010905
nfsd-1998 [002] 321.666622: svcrdma_decode_rqst: cq.id=4 cid=42 xid=0x013ca9eb vers=1 credits=128 proc=RDMA_NOMSG hdrlen=100
nfsd-1998 [002] 321.666642: svcrdma_post_read_chunk: cq.id=3 cid=112 sqecount=3
kworker/2:1H-221 [002] 321.673949: svcrdma_wc_read: cq.id=3 cid=112 status=SUCCESS (0/0x0)
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor svc_rdma_send_reply_chunk() so that it Sends only the parts
of rq_res that do not contain a result payload.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor: svc_rdma_map_reply_msg() is restructured to DMA map only
the parts of rq_res that do not contain a result payload.
This change has been tested to confirm that it does not cause a
regression in the no Write chunk and single Write chunk cases.
Multiple Write chunks have not been tested.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
When counting the number of SGEs needed to construct a Send request,
do not count result payloads. And, when copying the Reply message
into the pull-up buffer, result payloads are not to be copied to the
Send buffer.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor: Instead of re-parsing the ingress RPC Call transport
header when constructing the egress RPC Reply transport header, use
the new parsed Write list and Reply chunk, which are version-
agnostic and already XDR decoded.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor: Instead of re-parsing the ingress RPC Call transport
header when constructing RDMA Writes, use the new parsed chunk lists
for the Write list and Reply chunk, which are version-agnostic and
already XDR-decoded.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor: Don't duplicate header decoding smarts here. Instead, use
the new parsed chunk lists.
Note that the XID sanity test is also removed. The XID is already
looked up by the cb handler, and is rejected if it's not recognized.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
This simple data structure binds the location of each data payload
inside of an RPC message to the chunk that will be used to push it
to or pull it from the client.
There are several benefits to this small additional overhead:
* It enables support for more than one chunk in incoming Read and
Write lists.
* It translates the version-specific on-the-wire format into a
generic in-memory structure, enabling support for multiple
versions of the RPC/RDMA transport protocol.
* It enables the server to re-organize a chunk list if it needs to
adjust where Read chunk data lands in server memory without
altering the contents of the XDR-encoded Receive buffer.
Construction of these lists is done while sanity checking each
incoming RPC/RDMA header. Subsequent patches will make use of the
generated data structures.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The only RPC/RDMA ordering requirement between RDMA Writes and RDMA
Sends is that the responder must post the Writes on the Send queue
before posting the Send that conveys the RPC Reply for that Write
payload.
The Linux NFS server implementation now has a transport method that
can post result Payload Writes earlier than svc_rdma_sendto:
->xpo_result_payload()
This gets RDMA Writes going earlier so they are more likely to be
complete at the remote end before the Send completes.
Some care must be taken with pulled-up Replies. We don't want to
push the Write chunk and then send the same payload data via Send.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Have the NFSD encoders annotate the boundaries of every
direct-data-placement eligible result data payload. Then change
svcrdma to use that annotation instead of the xdr->page_len
when handling Write chunks.
For NFSv4 on RDMA, that enables the ability to recognize multiple
result payloads per compound. This is a pre-requisite for supporting
multiple Write chunks per RPC transaction.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Clean up: "result payload" is a less confusing name for these
payloads. "READ payload" reflects only the NFS usage.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor for subsequent changes.
Constify the xdr_buf argument to ensure the code here does not
modify it, and to enable callers to pass in a
"const struct xdr_buf *".
At the same time, rename the helper functions, which emit RDMA
Writes, not RDMA Sends, and add documenting comments.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Clean up: Ensure the code in rw.c does not modify the argument, and
enable callers to also use "const struct xdr_buf *".
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
When space in the Reply chunk runs out in the middle of a segment,
we end up passing a zero-length SGL to rdma_rw_ctx_init(), and it
oopses.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
1) Fix insufficient validation of IPSET_ATTR_IPADDR_IPV6 reported
by syzbot.
2) Remove spurious reports on nf_tables when lockdep gets disabled,
from Florian Westphal.
3) Fix memleak in the error path of error path of
ip_vs_control_net_init(), from Wang Hai.
4) Fix missing control data in flow dissector, otherwise IP address
matching in hardware offload infra does not work.
5) Fix hardware offload match on prefix IP address when userspace
does not send a bitwise expression to represent the prefix.
* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf:
netfilter: nftables_offload: build mask based from the matching bytes
netfilter: nftables_offload: set address type in control dissector
ipvs: fix possible memory leak in ip_vs_control_net_init
netfilter: nf_tables: avoid false-postive lockdep splat
netfilter: ipset: prevent uninit-value in hash_ip6_add
====================
Link: https://lore.kernel.org/r/20201127190313.24947-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When inet_rtm_getroute() was converted to use the RCU variants of
ip_route_input() and ip_route_output_key(), the TOS parameters
stopped being masked with IPTOS_RT_MASK before doing the route lookup.
As a result, "ip route get" can return a different route than what
would be used when sending real packets.
For example:
$ ip route add 192.0.2.11/32 dev eth0
$ ip route add unreachable 192.0.2.11/32 tos 2
$ ip route get 192.0.2.11 tos 2
RTNETLINK answers: No route to host
But, packets with TOS 2 (ECT(0) if interpreted as an ECN bit) would
actually be routed using the first route:
$ ping -c 1 -Q 2 192.0.2.11
PING 192.0.2.11 (192.0.2.11) 56(84) bytes of data.
64 bytes from 192.0.2.11: icmp_seq=1 ttl=64 time=0.173 ms
--- 192.0.2.11 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.173/0.173/0.173/0.000 ms
This patch re-applies IPTOS_RT_MASK in inet_rtm_getroute(), to
return results consistent with real route lookups.
Fixes: 3765d35ed8 ("net: ipv4: Convert inet_rtm_getroute to rcu versions of route lookup")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/b2d237d08317ca55926add9654a48409ac1b8f5b.1606412894.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf 2020-11-28
1) Do not reference the skb for xsk's generic TX side since when looped
back into RX it might crash in generic XDP, from Björn Töpel.
2) Fix umem cleanup on a partially set up xsk socket when being destroyed,
from Magnus Karlsson.
3) Fix an incorrect netdev reference count when failing xsk_bind() operation,
from Marek Majtyka.
4) Fix bpftool to set an error code on failed calloc() in build_btf_type_table(),
from Zhen Lei.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Add MAINTAINERS entry for BPF LSM
bpftool: Fix error return value in build_btf_type_table
net, xsk: Avoid taking multiple skbuff references
xsk: Fix incorrect netdev reference count
xsk: Fix umem cleanup bug at socket destruct
MAINTAINERS: Update XDP and AF_XDP entries
====================
Link: https://lore.kernel.org/r/20201128005104.1205-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- Fix head/tailroom issues for fragments, by Sven Eckelmann (3 patches)
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAl/BORsWHHN3QHNpbW9u
d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoX+AD/wJYmiGL8beD3eCBQdTjmvJNRbZ
MIlzse2LsiSk/4G/UbrEr11M4pc6gFDghH/VXs2MxD2hd6qJsJH7ulAQKMPzTEqD
WhwlyJ08O9HIivaN2Ri1ibwmLIDAYlvR2bsWu95weI/hDdE5ups/kIvBudM8b+fC
HKQ/e9wGaiB+zUUobvl6bP0Av6u3xPNTHFkXgGTo1shgLISAYFsmEoYQ3yZYrYKy
P9ckwLoJwMD6AODMRvWVmW4dzi5QSBWN/84XgQSoqdlZeNklIcTY2P5XWdePo7cv
wTnSxm4gJwdU7i8dBXLbtOYzmavzq0JkQHnRcFcKm6RWMbaeNNwfYUwMLp4h4ZtG
Mjh0qt4IyuRb5Tf7jkoZ9UR4fIftpuMNMMvE37Cfr7wUZKWtVbG9jHyydU2MZ678
vAF5NigYHa88i+MSdbXq8fPflmhddiHsN05nzI3olzX7tBFWNDf+71P82Gm4zKuM
YC4ZlZH7Rrb79C13CJm5xKN0MI5TYGDYJeFLV4Km7hZ9H9EyDJcsuSFPgV0R0hg5
Y5rWIWRrK/lU1aSXPSEuyL1P+tbH5z5xeYxdPUz2WroqNTyre6ycXy+wuoyCw5Ij
t91wqD8pNC1/3nynBbI2kcLeGFPjDTEfWYcYN7OE3Ts6AXr3fs8x1XUcUsseAZ1P
oxaUxzzfCcDno3abtg==
=mORx
-----END PGP SIGNATURE-----
Merge tag 'batadv-net-pullrequest-20201127' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
Here are some batman-adv bugfixes:
- Fix head/tailroom issues for fragments, by Sven Eckelmann (3 patches)
* tag 'batadv-net-pullrequest-20201127' of git://git.open-mesh.org/linux-merge:
batman-adv: Don't always reallocate the fragmentation skb head
batman-adv: Reserve needed_*room for fragments
batman-adv: Consider fragmentation for needed_headroom
====================
Link: https://lore.kernel.org/r/20201127173849.19208-1-sw@simonwunderlich.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Netfilter changes PACKET_OTHERHOST to PACKET_HOST before invoking the
hooks as, while it's an expected value for a bridge, routing expects
PACKET_HOST. The change is undone later on after hook traversal. This
can be seen with pairs of functions updating skb>pkt_type and then
reverting it to its original value:
For hook NF_INET_PRE_ROUTING:
setup_pre_routing / br_nf_pre_routing_finish
For hook NF_INET_FORWARD:
br_nf_forward_ip / br_nf_forward_finish
But the third case where netfilter does this, for hook
NF_INET_POST_ROUTING, the packet type is changed in br_nf_post_routing
but never reverted. A comment says:
/* We assume any code from br_dev_queue_push_xmit onwards doesn't care
* about the value of skb->pkt_type. */
But when having a tunnel (say vxlan) attached to a bridge we have the
following call trace:
br_nf_pre_routing
br_nf_pre_routing_ipv6
br_nf_pre_routing_finish
br_nf_forward_ip
br_nf_forward_finish
br_nf_post_routing <- pkt_type is updated to PACKET_HOST
br_nf_dev_queue_xmit <- but not reverted to its original value
vxlan_xmit
vxlan_xmit_one
skb_tunnel_check_pmtu <- a check on pkt_type is performed
In this specific case, this creates issues such as when an ICMPv6 PTB
should be sent back. When CONFIG_BRIDGE_NETFILTER is enabled, the PTB
isn't sent (as skb_tunnel_check_pmtu checks if pkt_type is PACKET_HOST
and returns early).
If the comment is right and no one cares about the value of
skb->pkt_type after br_dev_queue_push_xmit (which isn't true), resetting
it to its original value should be safe.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20201123174902.622102-1-atenart@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
By setting NF_FLOWTABLE_COUNTER. Otherwise, the updates added by
commit ef803b3cf9 ("netfilter: flowtable: add counter support in HW
offload") are not effective when using act_ct.
While at it, now that we have the flag set, protect the call to
nf_ct_acct_update() by commit beb97d3a31 ("net/sched: act_ct: update
nf_conn_acct for act_ct SW offload in flowtable") with the check on
NF_FLOWTABLE_COUNTER, as also done on other places.
Note that this shouldn't impact performance as these stats are only
enabled when net.netfilter.nf_conntrack_acct is enabled.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: wenxu <wenxu@ucloud.cn>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Link: https://lore.kernel.org/r/481a65741261fd81b0a0813e698af163477467ec.1606415787.git.marcelo.leitner@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Trivial conflict in CAN, keep the net-next + the byteswap wrapper.
Conflicts:
drivers/net/can/usb/gs_usb.c
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We update the terminology in the code so that deprecated structure
names and macros are replaced with those currently recommended in
the user API.
struct tipc_portid -> struct tipc_socket_addr
struct tipc_name -> struct tipc_service_addr
struct tipc_name_seq -> struct tipc_service_range
TIPC_ADDR_ID -> TIPC_SOCKET_ADDR
TIPC_ADDR_NAME -> TIPC_SERVICE_ADDR
TIPC_ADDR_NAMESEQ -> TIPC_SERVICE_RANGE
TIPC_CFG_SRV -> TIPC_NODE_STATE
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The 32-bit node number, aka node hash or node address, is calculated
based on the 128-bit node identity when it is not set explicitly by
the user. In future commits we will need to perform this hash operation
on peer nodes while feeling safe that we obtain the same result.
We do this by interpreting the initial hash as a network byte order
number. Whenever we need to use the number locally on a node
we must therefore translate it to host byte order to obtain an
architecure independent result.
Furthermore, given the context where we use this number, we must not
allow it to be zero unless the node identity also is zero. Hence, in
the rare cases when the xor-ed hash value may end up as zero we replace
it with a fix number, knowing that the code anyway is capable of
handling hash collisions.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We refactor the tipc_sk_bind() function, so that the lock handling
is handled separately from the logics. We also move some sanity
tests to earlier in the call chain, to the function tipc_bind().
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Remove obsolete function x25_kill_by_device(). It's not used any more.
Signed-off-by: Martin Schiller <ms@dev.tdt.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We have to take the actual link state into account to handle
restart requests/confirms well.
Signed-off-by: Martin Schiller <ms@dev.tdt.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
1. DTE interface changes immediately to LAPB_STATE_1 and start sending
SABM(E).
2. DCE interface sends N2-times DM and changes to LAPB_STATE_1
afterwards if there is no response in the meantime.
Signed-off-by: Martin Schiller <ms@dev.tdt.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch allows layer2 (LAPB) to react to netdev events itself and
avoids the detour via layer3 (X.25).
1. Establish layer2 on NETDEV_UP events, if the carrier is already up.
2. Call lapb_disconnect_request() on NETDEV_GOING_DOWN events to signal
the peer that the connection will go down.
(Only when the carrier is up.)
3. When a NETDEV_DOWN event occur, clear all queues, enter state
LAPB_STATE_0 and stop all timers.
4. The NETDEV_CHANGE event makes it possible to handle carrier loss and
detection.
In case of Carrier Loss, clear all queues, enter state LAPB_STATE_0
and stop all timers.
In case of Carrier Detection, we start timer t1 on a DCE interface,
and on a DTE interface we change to state LAPB_STATE_1 and start
sending SABM(E).
Signed-off-by: Martin Schiller <ms@dev.tdt.de>
Acked-by: Xie He <xie.he.0141@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
1. Add / remove x25_link_device by NETDEV_REGISTER/UNREGISTER and also
by NETDEV_POST_TYPE_CHANGE/NETDEV_PRE_TYPE_CHANGE.
This change is needed so that the x25_neigh struct for an interface
is already created when it shows up and is kept independently if the
interface goes UP or DOWN.
This is used in an upcomming commit, where x25 params of an neighbour
will get configurable through ioctls.
2. NETDEV_CHANGE event makes it possible to handle carrier loss and
detection. If carrier is lost, clean up everything related to this
neighbour by calling x25_link_terminated().
3. Also call x25_link_terminated() for NETDEV_DOWN events and remove the
call to x25_clear_forward_by_dev() in x25_route_device_down(), as
this is already called by x25_kill_by_neigh() which gets called by
x25_link_terminated().
4. Do nothing for NETDEV_UP and NETDEV_GOING_DOWN events, as these will
be handled in layer 2 (LAPB) and layer3 (X.25) will be informed by
layer2 when layer2 link is established and layer3 link should be
initiated.
Signed-off-by: Martin Schiller <ms@dev.tdt.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently kernel tc subsystem can do conntrack in cat_ct. But when several
fragment packets go through the act_ct, function tcf_ct_handle_fragments
will defrag the packets to a big one. But the last action will redirect
mirred to a device which maybe lead the reassembly big packet over the mtu
of target device.
This patch add support for a xmit hook to mirred, that gets executed before
xmiting the packet. Then, when act_ct gets loaded, it configs that hook.
The frag xmit hook maybe reused by other modules.
Signed-off-by: wenxu <wenxu@ucloud.cn>
Acked-by: Cong Wang <cong.wang@bytedance.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The mru in the qdisc_skb_cb should be init as 0. Only defrag packets in the
act_ct will set the value.
Fixes: 038ebb1a71 ("net/sched: act_ct: fix miss set mru for ovs after defrag in act_ct")
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
RFC 7905 defines special behavior for ChaCha-Poly TLS sessions.
The differences are in the calculation of nonce and the absence
of explicit IV. This behavior is like TLSv1.3 partly.
Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Inline functions defined in tls.h have a lot of AES-specific
constants. Remove these constants and change argument to struct
tls_prot_info to have an access to cipher type in later patches
Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The functions xsk_map_put() and xsk_map_inc() are simple wrappers and
as such, replace these functions with the functions bpf_map_inc() and
bpf_map_put() and remove some error testing code.
Signed-off-by: Zhu Yanjun <zyjzyj2000@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/1606402998-12562-1-git-send-email-yanjunz@nvidia.com
-----BEGIN PGP SIGNATURE-----
iQFHBAABCgAxFiEEK3kIWJt9yTYMP3ehqclaivrt76kFAl/Ay8sTHG1rbEBwZW5n
dXRyb25peC5kZQAKCRCpyVqK+u3vqZvTB/9TfdD9pr5E4zIfStIlir7bvamDRK3y
QVQd4+gaB6gLKNAQo0Kh+osbMZAm43ixzEtoLEHDOd+XaaNDyJdUpNUXyf+/ZZgs
twMoCuY1vnO/1LPtVhLuggiyUWhc8YFPs7ZuFYmAWS+kYDMG4qxF3rMHcTjKw+bD
TcD383ZldO8I2cTmdhT8ldx3tk0UUSCvG284Db43asUkTOhPs8SrkFLVlAZZfcNc
d3gxzQYuMyR/s/ektyFP39ed9PO/OOQcg401XbT7aXvnCgaecTJHLIDfSvz6f69M
HHybn8CWs68+Ne4k6kC4c5tGJzz3QGt+q1dO1J1IENG99DsDpoEnfNiS
=xlpC
-----END PGP SIGNATURE-----
Merge tag 'linux-can-fixes-for-5.10-20201127' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:
====================
pull-request: can 2020-11-27
The first patch is by me and target the gs_usb driver and fixes the endianess
problem with candleLight firmware.
Another patch by me for the mcp251xfd driver add sanity checking to bail out if
no IRQ is configured.
The next three patches target the m_can driver. A patch by me removes the
hardcoded IRQF_TRIGGER_FALLING from the request_threaded_irq() as this clashes
with the trigger level specified in the DT. Further a patch by me fixes the
nominal bitiming tseg2 min value for modern m_can cores. Pankaj Sharma's patch
add support for cores version 3.3.x.
The last patch by Oliver Hartkopp is for af_can and converts a WARN() into a
pr_warn(), which is triggered by the syzkaller. It was able to create a
situation where the closing of a socket runs simultaneously to the notifier
call chain for removing the CAN network device in use.
* tag 'linux-can-fixes-for-5.10-20201127' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
can: af_can: can_rx_unregister(): remove WARN() statement from list operation sanity check
can: m_can: m_can_dev_setup(): add support for bosch mcan version 3.3.0
can: m_can: fix nominal bitiming tseg2 min for version >= 3.1
can: m_can: m_can_open(): remove IRQF_TRIGGER_FALLING from request_threaded_irq()'s flags
can: mcp251xfd: mcp251xfd_probe(): bail out if no IRQ was given
can: gs_usb: fix endianess problem with candleLight firmware
====================
Link: https://lore.kernel.org/r/20201127100301.512603-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When setting sk_err, set it to ee_errno, not ee_origin.
Commit f5f99309fa ("sock: do not set sk_err in
sock_dequeue_err_skb") disabled updating sk_err on errq dequeue,
which is correct for most error types (origins):
- sk->sk_err = err;
Commit 38b257938a ("sock: reset sk_err when the error queue is
empty") reenabled the behavior for IMCP origins, which do require it:
+ if (icmp_next)
+ sk->sk_err = SKB_EXT_ERR(skb_next)->ee.ee_origin;
But read from ee_errno.
Fixes: 38b257938a ("sock: reset sk_err when the error queue is empty")
Reported-by: Ayush Ranjan <ayushranjan@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Link: https://lore.kernel.org/r/20201126151220.2819322-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If an msk listener receives an MPJ carrying an invalid token, it
will zero the request socket msk entry. That should later
cause fallback and subflow reset - as per RFC - at
subflow_syn_recv_sock() time due to failing hmac validation.
Since commit 4cf8b7e48a ("subflow: introduce and use
mptcp_can_accept_new_subflow()"), we unconditionally dereference
- in mptcp_can_accept_new_subflow - the subflow request msk
before performing hmac validation. In the above scenario we
hit a NULL ptr dereference.
Address the issue doing the hmac validation earlier.
Fixes: 4cf8b7e48a ("subflow: introduce and use mptcp_can_accept_new_subflow()")
Tested-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Link: https://lore.kernel.org/r/03b2cfa3ac80d8fc18272edc6442a9ddf0b1e34e.1606400227.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, the openvswitch module is not accepting the correctly formated
netlink message for the TTL decrement action. For both setting and getting
the dec_ttl action, the actions should be nested in the
OVS_DEC_TTL_ATTR_ACTION attribute as mentioned in the openvswitch.h uapi.
When the original patch was sent, it was tested with a private OVS userspace
implementation. This implementation was unfortunately not upstreamed and
reviewed, hence an erroneous version of this patch was sent out.
Leaving the patch as-is would cause problems as the kernel module could
interpret additional attributes as actions and vice-versa, due to the
actions not being encapsulated/nested within the actual attribute, but
being concatinated after it.
Fixes: 744676e777 ("openvswitch: add TTL decrement action")
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Link: https://lore.kernel.org/r/160622121495.27296.888010441924340582.stgit@wsfd-netdev64.ntdv.lab.eng.bos.redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Userspace might match on prefix bytes of header fields if they are on
the byte boundary, this requires that the mask is adjusted accordingly.
Use NFT_OFFLOAD_MATCH_EXACT() for meta since prefix byte matching is not
allowed for this type of selector.
The bitwise expression might be optimized out by userspace, hence the
kernel needs to infer the prefix from the number of payload bytes to
match on. This patch adds nft_payload_offload_mask() to calculate the
bitmask to match on the prefix.
Fixes: c9626a2cbd ("netfilter: nf_tables: add hardware offload support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds nft_flow_rule_set_addr_type() to set the address type
from the nft_payload expression accordingly.
If the address type is not set in the control dissector then a rule that
matches either on source or destination IP address does not work.
After this patch, nft hardware offload generates the flow dissector
configuration as tc-flower does to match on an IP address.
This patch has been also tested functionally to make sure packets are
filtered out by the NIC.
This is also getting the code aligned with the existing netfilter flow
offload infrastructure which is also setting the control dissector.
Fixes: c9626a2cbd ("netfilter: nf_tables: add hardware offload support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
redact XFRM SA secret in the netlink response to xfrm_get_sa()
or dumpall sa.
Enable lockdown, confidentiality mode, at boot or at run time.
e.g. when enabled:
cat /sys/kernel/security/lockdown
none integrity [confidentiality]
ip xfrm state
src 172.16.1.200 dst 172.16.1.100
proto esp spi 0x00000002 reqid 2 mode tunnel
replay-window 0
aead rfc4106(gcm(aes)) 0x0000000000000000000000000000000000000000 96
note: the aead secret is redacted.
Redacting secret is also a FIPS 140-2 requirement.
v1->v2
- add size checks before memset calls
v2->v3
- replace spaces with tabs for consistency
v3->v4
- use kernel lockdown instead of a /proc setting
v4->v5
- remove kconfig option
Reviewed-by: Stephan Mueller <smueller@chronox.de>
Signed-off-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
To detect potential bugs in CAN protocol implementations (double removal of
receiver entries) a WARN() statement has been used if no matching list item was
found for removal.
The fault injection issued by syzkaller was able to create a situation where
the closing of a socket runs simultaneously to the notifier call chain for
removing the CAN network device in use.
This case is very unlikely in real life but it doesn't break anything.
Therefore we just replace the WARN() statement with pr_warn() to preserve the
notification for the CAN protocol development.
Reported-by: syzbot+381d06e0c8eaacb8706f@syzkaller.appspotmail.com
Reported-by: syzbot+d0ddd88c9a7432f041e6@syzkaller.appspotmail.com
Reported-by: syzbot+76d62d3b8162883c7d11@syzkaller.appspotmail.com
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/r/20201126192140.14350-1-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
When a packet is fragmented by batman-adv, the original batman-adv header
is not modified. Only a new fragmentation is inserted between the original
one and the ethernet header. The code must therefore make sure that it has
a writable region of this size in the skbuff head.
But it is not useful to always reallocate the skbuff by this size even when
there would be more than enough headroom still in the skb. The reallocation
is just to costly during in this codepath.
Fixes: ee75ed8887 ("batman-adv: Fragment and send skbs larger than mtu")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The batadv net_device is trying to propagate the needed_headroom and
needed_tailroom from the lower devices. This is needed to avoid cost
intensive reallocations using pskb_expand_head during the transmission.
But the fragmentation code split the skb's without adding extra room at the
end/beginning of the various fragments. This reduced the performance of
transmissions over complex scenarios (batadv on vxlan on wireguard) because
the lower devices had to perform the reallocations at least once.
Fixes: ee75ed8887 ("batman-adv: Fragment and send skbs larger than mtu")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
If a batman-adv packets has to be fragmented, then the original batman-adv
packet header is not stripped away. Instead, only a new header is added in
front of the packet after it was split.
This size must be considered to avoid cost intensive reallocations during
the transmission through the various device layers.
Fixes: 7bca68c784 ("batman-adv: Add lower layer needed_(head|tail)room to own ones")
Reported-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
tls_device_offload_cleanup_rx doesn't clear tls_ctx->netdev after
calling tls_dev_del if TLX TX offload is also enabled. Clearing
tls_ctx->netdev gets postponed until tls_device_gc_task. It leaves a
time frame when tls_device_down may get called and call tls_dev_del for
RX one extra time, confusing the driver, which may lead to a crash.
This patch corrects this racy behavior by adding a flag to prevent
tls_device_down from calling tls_dev_del the second time.
Fixes: e8f6979981 ("net/tls: Add generic NIC offload infrastructure")
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20201125221810.69870-1-saeedm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When devlink reload operation is not used, netdev of an Ethernet port may
be present in different net namespace than the net namespace of the
devlink instance.
Ensure that both the devlink instance and devlink port netdev are located
in same net namespace.
Fixes: 070c63f20f ("net: devlink: allow to change namespaces during reload")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A netdevice of a devlink port can be moved to different net namespace
than its parent devlink instance.
This scenario occurs when devlink reload is not used.
When netdevice is undergoing migration to net namespace, its ifindex
and name may change.
In such use case, devlink port query may read stale netdev attributes.
Fix it by reading them under rtnl lock.
Fixes: bfcd3a4661 ("Introduce devlink infrastructure")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There are reports wrt lockdep splat in nftables, e.g.:
------------[ cut here ]------------
WARNING: CPU: 2 PID: 31416 at net/netfilter/nf_tables_api.c:622
lockdep_nfnl_nft_mutex_not_held+0x28/0x38 [nf_tables]
...
These are caused by an earlier, unrelated bug such as a n ABBA deadlock
in a different subsystem.
In such an event, lockdep is disabled and lockdep_is_held returns true
unconditionally. This then causes the WARN() in nf_tables.
Make the WARN conditional on lockdep still active to avoid this.
Fixes: f102d66b33 ("netfilter: nf_tables: use dedicated mutex to guard transactions")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Link: https://lore.kernel.org/linux-kselftest/CA+G9fYvFUpODs+NkSYcnwKnXm62tmP=ksLeBPmB+KFrB2rvCtQ@mail.gmail.com/
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Use napi_consume_skb() to assert the case when it is not called
in a atomic softirq context.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We can enter the main mptcp_recvmsg() loop even when
no subflows are connected. As note by Eric, that would
result in a divide by zero oops on ack generation.
Address the issue by checking the subflow status before
sending the ack.
Additionally protect mptcp_recvmsg() against invocation
with weird socket states.
v1 -> v2:
- removed unneeded inline keyword - Jakub
Reported-and-suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Fixes: ea4ca586b1 ("mptcp: refine MPTCP-level ack scheduling")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/5370c0ae03449239e3d1674ddcfb090cf6f20abe.1606253206.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Extend MRP to support LC mode(link check) for the interconnect port.
This applies only to the interconnect ring.
Opposite to RC mode(ring check) the LC mode is using CFM frames to
detect when the link goes up or down and based on that the userspace
will need to react.
One advantage of the LC mode over RC mode is that there will be fewer
frames in the normal rings. Because RC mode generates InTest on all
ports while LC mode sends CFM frame only on the interconnect port.
All 4 nodes part of the interconnect ring needs to have the same mode.
And it is not possible to have running LC and RC mode at the same time
on a node.
Whenever the MIM starts it needs to detect the status of the other 3
nodes in the interconnect ring so it would send a frame called
InLinkStatus, on which the clients needs to reply with their link
status.
This patch adds InLinkStatus frame type and extends existing rules on
how to forward this frame.
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Link: https://lore.kernel.org/r/20201124082525.273820-1-horatiu.vultur@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently both filter and action flags use same "TCA_" prefix which makes
them hard to distinguish to code and confusing for users. Create aliases
for existing action flags constants with "TCA_ACT_" prefix.
Signed-off-by: Vlad Buslov <vlad@buslov.dev>
Link: https://lore.kernel.org/r/20201124164054.893168-1-vlad@buslov.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
On close this timer might be scheduled. mptcp uses sk_reset_timer for
this, so the a reference on the mptcp socket is taken.
This causes a refcount leak which can for example be reproduced
with 'mp_join_server_v4.pkt' from the mptcp-packetdrill repo.
The leak has nothing to do with join requests, v1_mp_capable_bind_no_cs.pkt
works too when replacing the last ack mpcapable to v1 instead of v0.
unreferenced object 0xffff888109bba040 (size 2744):
comm "packetdrill", [..]
backtrace:
[..] sk_prot_alloc.isra.0+0x2b/0xc0
[..] sk_clone_lock+0x2f/0x740
[..] mptcp_sk_clone+0x33/0x1a0
[..] subflow_syn_recv_sock+0x2b1/0x690 [..]
Fixes: e16163b6e2 ("mptcp: refactor shutdown and close")
Cc: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20201124162446.11448-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After cited commit, gro_cells_destroy() became damn slow
on hosts with a lot of cores.
This is because we have one additional synchronize_net() per cpu as
stated in the changelog.
gro_cells_init() is setting NAPI_STATE_NO_BUSY_POLL, and this was enough
to not have one synchronize_net() call per netif_napi_del()
We can factorize all the synchronize_net() to a single one,
right before freeing per-cpu memory.
Fixes: 5198d545db ("net: remove napi_hash_del() from driver-facing API")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201124203822.1360107-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- set module owner to THIS_MODULE, by Taehee Yoo
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAl+9DbQWHHN3QHNpbW9u
d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoWlxD/0TOgtU9fVykvjMG3U0rnB6pvp7
rFH1Srqg5hsJ2PNLAF8fr0o4PDTyiAztirdJ4C4+ietug0ozxe4GZARB+/rPEk/3
u558cDMZaqLOo195g70s+8cr0XXPy8Bt4ip/X4uW8DNTluKMiQcKNDrVkc6Cysa/
5cnT4wf6RftG9eMbtG1wdEzqG/goQpmrP+bo+J1QdBQEycqMwegTWYZyU9behS+m
u47FIa20XQ8FmoZx3oooGNb7z9iDHzwVc7PMHw19Xv14bahtz1BwydERdJMde3hP
UD4HeAQla6Gl0ZO2pS5MuCQxEu8aphel75a3g+D4XCdyyIl1B3FzK1F+DWdoHlnx
g0g7xv0nnfpF2wVhGkgIKfssQXino9YPyIjl/zLJ7YXXrTZCzkOjGiumSvrvXgox
mTtvDvbSIOylQksPF5UJtoOo6mWihg7HERyrOwfIV6UxKuqbKhv3e17gUlkZPIcp
ec4iFjGheE0H00zSzBkeb+76M0XQmDTbvmjUfOxaxP4o+u5W27QE+HznOTrQ9uBa
FrMt++XjxSfmVsiFrpqE4LQyYfcFjO3Ei5yYfR3tYfDx4qegDvqksLbzb3Fkp2G9
Pz9HEc+PYJk/w3N2Q+zSpbDJfUvHjyIPDeqsTdtbbUPUyfkGmghT7F9V+vof2jqm
6pijD3ji8JN1wmh9yw==
=Qr0B
-----END PGP SIGNATURE-----
Merge tag 'batadv-net-pullrequest-20201124' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
Here is a batman-adv bugfix:
- set module owner to THIS_MODULE, by Taehee Yoo
* tag 'batadv-net-pullrequest-20201124' of git://git.open-mesh.org/linux-merge:
batman-adv: set .owner to THIS_MODULE
====================
Link: https://lore.kernel.org/r/20201124134417.17269-1-sw@simonwunderlich.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When a BPF program is used to select between a type of TCP congestion
control algorithm that uses either ECN or not there is a case where the
synack for the frame was coming up without the ECT0 bit set. A bit of
research found that this was due to the final socket being configured to
dctcp while the listener socket was staying in cubic.
To reproduce it all that is needed is to monitor TCP traffic while running
the sample bpf program "samples/bpf/tcp_cong_kern.c". What is observed,
assuming tcp_dctcp module is loaded or compiled in and the traffic matches
the rules in the sample file, is that for all frames with the exception of
the synack the ECT0 bit is set.
To address that it is necessary to make one additional call to
tcp_bpf_ca_needs_ecn using the request socket and then use the output of
that to set the ECT0 bit for the tos/tclass of the packet.
Fixes: 91b5b21c7c ("bpf: Add support for changing congestion control")
Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://lore.kernel.org/r/160593039663.2604.1374502006916871573.stgit@localhost.localdomain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In bug report [0] a warning in r8169 driver was reported that was
caused by an invalid GSO SKB (gso_type was 0). See [1] for a discussion
about this issue. Still the origin of the invalid GSO SKB isn't clear.
It shouldn't be a network drivers task to check for invalid GSO SKB's.
Also, even if issue [0] can be fixed, we can't be sure that a
similar issue doesn't pop up again at another place.
Therefore let gso_features_check() check for such invalid GSO SKB's.
[0] https://bugzilla.kernel.org/show_bug.cgi?id=209423
[1] https://www.spinics.net/lists/netdev/msg690794.html
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://lore.kernel.org/r/97c78d21-7f0b-d843-df17-3589f224d2cf@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 642e450b6b ("xsk: Do not discard packet when NETDEV_TX_BUSY")
addressed the problem that packets were discarded from the Tx AF_XDP
ring, when the driver returned NETDEV_TX_BUSY. Part of the fix was
bumping the skbuff reference count, so that the buffer would not be
freed by dev_direct_xmit(). A reference count larger than one means
that the skbuff is "shared", which is not the case.
If the "shared" skbuff is sent to the generic XDP receive path,
netif_receive_generic_xdp(), and pskb_expand_head() is entered the
BUG_ON(skb_shared(skb)) will trigger.
This patch adds a variant to dev_direct_xmit(), __dev_direct_xmit(),
where a user can select the skbuff free policy. This allows AF_XDP to
avoid bumping the reference count, but still keep the NETDEV_TX_BUSY
behavior.
Fixes: 642e450b6b ("xsk: Do not discard packet when NETDEV_TX_BUSY")
Reported-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20201123175600.146255-1-bjorn.topel@gmail.com
Fix reload stats structure exposed to the user. Change stats structure
hierarchy to have the reload action as a parent of the stat entry and
then stat entry includes value per limit. This will also help to avoid
string concatenation on iproute2 output.
Reload stats structure before this fix:
"stats": {
"reload": {
"driver_reinit": 2,
"fw_activate": 1,
"fw_activate_no_reset": 0
}
}
After this fix:
"stats": {
"reload": {
"driver_reinit": {
"unspecified": 2
},
"fw_activate": {
"unspecified": 1,
"no_reset": 0
}
}
Fixes: a254c26426 ("devlink: Add reload stats")
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/1606109785-25197-1-git-send-email-moshe@mellanox.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a packet trap to report packets that were dropped due to a
blackhole nexthop.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl+8E4YACgkQ+7dXa6fL
C2tehQ/8C1O+p1UBmQ4k/eXkYqsIvRQJGXGheSClxvym8tQiut/sZEKsCqucFmO8
Y2gKEYJB3xYIqbZjjJUFYbL70Q5qG/cXgKn3rCpMjhn/0UAoUSwaGyoURcLwaz2B
asipDSvbvZk7OwX4ScUTi/b/7AHtO6l16Y6yRzUqneOsSkXYk1+HhK2/4rPDKEJf
cxocGIjjWs4d1erG2Aqgn3BZ2GsDNb88NJYi5qgc0ywFgRacX7oL0WFeu6pD1Ja+
8vZLjFYB/bDeLBlvzpnWal99lUiSee8BbUoQn1iGJ3EekUVWkkNNfjhpYRjzIIR5
GsJfbDHP16PHhLZgxBvNBwkWoqqBw3X/Zn0dkBDZgjlZQvLbeds6QRd6Ag/M2CJI
D1QtmBDBtr4UZfRFm+5hfpTqf2jeOc4o68WROH3FEdsnKvICkttKucE9LaykokFD
gQLNmkn2tb1D1yS2Y4iPucV3kYUU1Y6yw++Ck0cE2isBauN0gHtr9UAXZEc57ILc
hXCFrJTbGkph3TimQDd1QbXZ28eExRBnn+H5RyYHeEidv03d2UO5uFZaJ5b5UGX0
sE+3aeWLTH3fFTBvMnYPyFjzalSky8V5zOlUltwVCx7PpiDe6P7rKpO+ZXEi4OXP
mHz7HM6h1iEnnrqEGNz7L39pTqKVbfBxDMvFra07dyKkvVthF1w=
=TvbK
-----END PGP SIGNATURE-----
Merge tag 'rxrpc-next-20201123' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc: Prelude to gssapi support
Here are some patches that do some reorganisation of the security class
handling in rxrpc to allow implementation of the RxGK security class that
will allow AF_RXRPC to use GSSAPI-negotiated tokens and better crypto. The
RxGK security class is not included in this patchset.
It does the following things:
(1) Add a keyrings patch to provide the original key description, as
provided to add_key(), to the payload preparser so that it can
interpret the content on that basis. Unfortunately, the rxrpc_s key
type wasn't written to interpret its payload as anything other than a
string of bytes comprising a key, but for RxGK, more information is
required as multiple Kerberos enctypes are supported.
(2) Remove the rxk5 security class key parsing. The rxk5 class never got
rolled out in OpenAFS and got replaced with rxgk.
(3) Support the creation of rxrpc keys with multiple tokens of different
types. If some types are not supported, the ENOPKG error is
suppressed if at least one other token's type is supported.
(4) Punt the handling of server keys (rxrpc_s type) to the appropriate
security class.
(5) Organise the security bits in the rxrpc_connection struct into a
union to make it easier to override for other classes.
(6) Move some bits from core code into rxkad that won't be appropriate to
rxgk.
* tag 'rxrpc-next-20201123' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
rxrpc: Ask the security class how much space to allow in a packet
rxrpc: rxkad: Don't use pskb_pull() to advance through the response packet
rxrpc: Organise connection security to use a union
rxrpc: Don't reserve security header in Tx DATA skbuff
rxrpc: Merge prime_packet_security into init_connection_security
rxrpc: Fix example key name in a comment
rxrpc: Ignore unknown tokens in key payload unless no known tokens
rxrpc: Make the parsing of xdr payloads more coherent
rxrpc: Allow security classes to give more info on server keys
rxrpc: Don't leak the service-side session key to userspace
rxrpc: Hand server key parsing off to the security class
rxrpc: Split the server key type (rxrpc_s) into its own file
rxrpc: Don't retain the server key in the connection
rxrpc: Support keys with multiple authentication tokens
rxrpc: List the held token types in the key description in /proc/keys
rxrpc: Remove the rxk5 security class as it's now defunct
keys: Provide the original description to the key preparser
====================
Link: https://lore.kernel.org/r/160616220405.830164.2239716599743995145.stgit@warthog.procyon.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Get rid of the __call_single_node union and cleanup the API a little
to avoid external code relying on the structure layout as much.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
In the patchset merged by commit b9fcf0a0d8
("Merge branch 'support-AF_PACKET-for-layer-3-devices'") L3 devices which
did not have header_ops were given one for the purpose of protocol parsing
on af_packet transmit path.
That change made af_packet receive path regard these devices as having a
visible L3 header and therefore aligned incoming skb->data to point to the
skb's mac_header. Some devices, such as ipip, xfrmi, and others, do not
reset their mac_header prior to ingress and therefore their incoming
packets became malformed.
Ideally these devices would reset their mac headers, or af_packet would be
able to rely on dev->hard_header_len being 0 for such cases, but it seems
this is not the case.
Fix by changing af_packet RX ll visibility criteria to include the
existence of a '.create()' header operation, which is used when creating
a device hard header - via dev_hard_header() - by upper layers, and does
not exist in these L3 devices.
As this predicate may be useful in other situations, add it as a common
dev_has_header() helper in netdevice.h.
Fixes: b9fcf0a0d8 ("Merge branch 'support-AF_PACKET-for-layer-3-devices'")
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Acked-by: Jason A. Donenfeld <Jason@zx2c4.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20201121062817.3178900-1-eyal.birger@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
linux/netdevice.h is included in very many places, touching any
of its dependecies causes large incremental builds.
Drop the linux/ethtool.h include, linux/netdevice.h just needs
a forward declaration of struct ethtool_ops.
Fix all the places which made use of this implicit include.
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Shannon Nelson <snelson@pensando.io>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Link: https://lore.kernel.org/r/20201120225052.1427503-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Remove unused and add needed includes. No functional change.
Suggested-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Starting from commit 8692cefc43 ("virtio_vsock: Fix race condition
in virtio_transport_recv_pkt"), we discard packets in
virtio_transport_recv_pkt() if the socket has been released.
When the socket is connected, we schedule a delayed work to wait the
RST packet from the other peer, also if SHUTDOWN_MASK is set in
sk->sk_shutdown.
This is done to complete the virtio-vsock shutdown algorithm, releasing
the port assigned to the socket definitively only when the other peer
has consumed all the packets.
If we discard the RST packet received, the socket will be closed only
when the VSOCK_CLOSE_TIMEOUT is reached.
Sergio discovered the issue while running ab(1) HTTP benchmark using
libkrun [1] and observing a latency increase with that commit.
To avoid this issue, we discard packet only if the socket is really
closed (SOCK_DONE flag is set).
We also set SOCK_DONE in virtio_transport_release() when we don't need
to wait any packets from the other peer (we didn't schedule the delayed
work). In this case we remove the socket from the vsock lists, releasing
the port assigned.
[1] https://github.com/containers/libkrun
Fixes: 8692cefc43 ("virtio_vsock: Fix race condition in virtio_transport_recv_pkt")
Cc: justin.he@arm.com
Reported-by: Sergio Lopez <slp@redhat.com>
Tested-by: Sergio Lopez <slp@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Acked-by: Jia He <justin.he@arm.com>
Link: https://lore.kernel.org/r/20201120104736.73749-1-sgarzare@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When the TCP stack is in SYN flood mode, the server child socket is
created from the SYN cookie received in a TCP packet with the ACK flag
set.
The child socket is created when the server receives the first TCP
packet with a valid SYN cookie from the client. Usually, this packet
corresponds to the final step of the TCP 3-way handshake, the ACK
packet. But is also possible to receive a valid SYN cookie from the
first TCP data packet sent by the client, and thus create a child socket
from that SYN cookie.
Since a client socket is ready to send data as soon as it receives the
SYN+ACK packet from the server, the client can send the ACK packet (sent
by the TCP stack code), and the first data packet (sent by the userspace
program) almost at the same time, and thus the server will equally
receive the two TCP packets with valid SYN cookies almost at the same
instant.
When such event happens, the TCP stack code has a race condition that
occurs between the momement a lookup is done to the established
connections hashtable to check for the existence of a connection for the
same client, and the moment that the child socket is added to the
established connections hashtable. As a consequence, this race condition
can lead to a situation where we add two child sockets to the
established connections hashtable and deliver two sockets to the
userspace program to the same client.
This patch fixes the race condition by checking if an existing child
socket exists for the same client when we are adding the second child
socket to the established connections socket. If an existing child
socket exists, we drop the packet and discard the second child socket
to the same client.
Signed-off-by: Ricardo Dias <rdias@singlestore.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201120111133.GA67501@rdias-suse-pc.lan
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As pointed out by Herbert in a recent related patch, the LSM hooks do
not have the necessary address family information to use the flowi
struct safely. As none of the LSMs currently use any of the protocol
specific flowi information, replace the flowi pointers with pointers
to the address family independent flowi_common struct.
Reported-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl+69egeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGTSYH/ifRBlaxy5UiHFc0
2zdR7pkjWrYfDTTT3sazIAhdlzzcfnkUqgFxOP45F4ZIqeTzunH3sUY+5UlT9IX7
liUgnLxQ/1R9Gx8kPGQfu+tLCey78xVFydGsqJoW9sPRw2R+apMdGGa/lOrk+OXz
DXIN+dDnGFqwCCNJpK+rxQQhFf++IPpSI8z6Y23moOFhsDZrEziHuVFy2FGyRM6z
prZ/us/tcobE8ptCk1RmOxLoJ1DR6UxpA2vLimTE+JD8siOsSWPbjE0KudnWCnd5
BLqIjrsPJbSxyuzzK3v9dnO5wMv7tMDuMIuYM/MQTXDttNwtsqt/aP6gdnUCym7N
5eHEj5g=
=MuO1
-----END PGP SIGNATURE-----
Merge tag 'v5.10-rc5' into rdma.git for-next
For dependencies in following patches
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Ask the security class how much header and trailer space to allow for when
allocating a packet, given how much data is remaining.
This will allow the rxgk security class to stick both a trailer in as well
as a header as appropriate in the future.
Signed-off-by: David Howells <dhowells@redhat.com>
In the rxkad security class, don't use pskb_pull() to advance through the
contents of the response packet. There's no point, especially as the next
and last access to the skbuff still has to allow for the wire header in the
offset (which we didn't advance over).
Better to just add the displacement to the next offset.
Signed-off-by: David Howells <dhowells@redhat.com>
Organise the security information in the rxrpc_connection struct to use a
union to allow for different data for different security classes.
Signed-off-by: David Howells <dhowells@redhat.com>
Insert the security header into the skbuff representing a DATA packet to be
transmitted rather than using skb_reserve() when the packet is allocated.
This makes it easier to apply crypto that spans the security header and the
data, particularly in the upcoming RxGK class where we have a common
encrypt-and-checksum function that is used in a number of circumstances.
Signed-off-by: David Howells <dhowells@redhat.com>
Merge the ->prime_packet_security() into the ->init_connection_security()
hook as they're always called together.
Signed-off-by: David Howells <dhowells@redhat.com>
When parsing a payload for an rxrpc-type key, ignore any tokens that are
not of a known type and don't give an error for them - unless there are no
tokens of a known type.
Signed-off-by: David Howells <dhowells@redhat.com>
Make the parsing of xdr-encoded payloads, as passed to add_key, more
coherent. Shuttling back and forth between various variables was a bit
hard to follow.
Signed-off-by: David Howells <dhowells@redhat.com>
Allow a security class to give more information on an rxrpc_s-type key when
it is viewed in /proc/keys. This will allow the upcoming RxGK security
class to show the enctype name here.
Signed-off-by: David Howells <dhowells@redhat.com>
Don't let someone reading a service-side rxrpc-type key get access to the
session key that was exchanged with the client. The server application
will, at some point, need to be able to read the information in the ticket,
but this probably shouldn't include the key material.
Signed-off-by: David Howells <dhowells@redhat.com>
Hand responsibility for parsing a server key off to the security class. We
can determine which class from the description. This is necessary as rxgk
server keys have different lookup requirements and different content
requirements (dependent on crypto type) to those of rxkad server keys.
Signed-off-by: David Howells <dhowells@redhat.com>
Split the server private key type (rxrpc_s) out into its own file rather
than mingling it with the authentication/client key type (rxrpc) since they
don't really bear any relation.
Signed-off-by: David Howells <dhowells@redhat.com>
Don't retain a pointer to the server key in the connection, but rather get
it on demand when the server has to deal with a response packet.
This is necessary to implement RxGK (GSSAPI-mediated transport class),
where we can't know which key we'll need until we've challenged the client
and got back the response.
This also means that we don't need to do a key search in the accept path in
softirq mode.
Also, whilst we're at it, allow the security class to ask for a kvno and
encoding-type variant of a server key as RxGK needs different keys for
different encoding types. Keys of this type have an extra bit in the
description:
"<service-id>:<security-index>:<kvno>:<enctype>"
Signed-off-by: David Howells <dhowells@redhat.com>
rxrpc-type keys can have multiple tokens attached for different security
classes. Currently, rxrpc always picks the first one, whether or not the
security class it indicates is supported.
Add preliminary support for choosing which security class will be used
(this will need to be directed from a higher layer) and go through the
tokens to find one that's supported.
Signed-off-by: David Howells <dhowells@redhat.com>
No functional changes in this patch, needed to provide io_uring support
for shutdown(2).
Cc: netdev@vger.kernel.org
Cc: David S. Miller <davem@davemloft.net>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fix incorrect netdev reference count in xsk_bind operation. Incorrect
reference count of the device appears when a user calls bind with the
XDP_ZEROCOPY flag on an interface which does not support zero-copy.
In such a case, an error is returned but the reference count is not
decreased. This change fixes the fault, by decreasing the reference count
in case of such an error.
The problem being corrected appeared in '162c820ed896' for the first time,
and the code was moved to new file location over the time with commit
'c2d3d6a47462'. This specific patch applies to all version starting
from 'c2d3d6a47462'. The same solution should be applied but on different
file (net/xdp/xdp_umem.c) and function (xdp_umem_assign_dev) for versions
from '162c820ed896' to 'c2d3d6a47462' excluded.
Fixes: 162c820ed8 ("xdp: hold device for umem regardless of zero-copy mode")
Signed-off-by: Marek Majtyka <marekx.majtyka@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20201120151443.105903-1-marekx.majtyka@intel.com
-----BEGIN PGP SIGNATURE-----
iQFHBAABCgAxFiEEK3kIWJt9yTYMP3ehqclaivrt76kFAl+3q+gTHG1rbEBwZW5n
dXRyb25peC5kZQAKCRCpyVqK+u3vqYdjB/46Qv583QBIbM8YaXN+XTSfxSufp0Ku
G0R88mJKdOalgeiD58TaZ3hAegjOW44nCMmZP64mQB7XNvRiyNk8j69VDsRy+4Zy
r86Hi0TIlpeM5oNHb84SpvIO01elmuYNlaoU9t7NQV5YlKA2Lq+kcmMLGa7Z1zVk
IZ2KF2eF2V83p/4A5t1cgQV6LRlDBgAeoe/9pyG0mNX3Fv8lI3WIqjRkuevSm6iJ
XMEuc7kjUZ8Yd59Vvnipm89teD5lCyIpxeJ2hrOsiUlkwyosdCxewMeohJ596a2L
t5DoiT+3nJLOBQ/IKCYe5RoCf7sFZ8nfjqcNYeOm7Yw1CbXg9KwGZTU7
=BedD
-----END PGP SIGNATURE-----
Merge tag 'linux-can-next-for-5.11-20201120' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next
Marc Kleine-Budde says:
====================
pull-request: can-next 2020-11-20
The first patch is by Yegor Yefremov and he improves the j1939 documentaton by
adding tables for the CAN identifier and its fields.
Then there are 8 patches by Oliver Hartkopp targeting the CAN driver
infrastructure and drivers. These add support for optional DLC element to the
Classical CAN frame structure. See patch ea7800565a ("can: add optional DLC
element to Classical CAN frame structure") for details. Oliver's last patch
adds len8_dlc support to several drivers. Stefan Mätje provides a patch to add
len8_dlc support to the esd_usb2 driver.
The next patch is by Oliver Hartkopp, too and adds support for modification of
Classical CAN DLCs to CAN GW sockets.
The next 3 patches target the nxp,flexcan DT bindings. One patch by my adds the
missing uint32 reference to the clock-frequency property. Joakim Zhang's
patches fix the fsl,clk-source property and add the IMX_SC_R_CAN() macro to the
imx firmware header file, which will be used in the flexcan driver later.
Another patch by Joakim Zhang prepares the flexcan driver for SCU based
stop-mode, by giving the existing, GPR based stop-mode, a _GPR postfix.
The next 5 patches are by me, target the flexcan driver, and clean up the
.ndo_open and .ndo_stop callbacks. These patches try to fix a sporadically
hanging flexcan_close() during simultanious ifdown, sending of CAN messages and
probably open CAN bus. I was never able to reproduce, but these seem to fix the
problem at the reporting user. As these changes are rather big, I'd like to
mainline them via net-next/master.
The next patches are by Jimmy Assarsson and Christer Beskow, they add support
for new USB devices to the existing kvaser_usb driver.
The last patch is by Kaixu Xia and simplifies the return in the
mcp251xfd_chip_softreset() function in the mcp251xfd driver.
* tag 'linux-can-next-for-5.11-20201120' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next: (25 commits)
can: mcp251xfd: remove useless code in mcp251xfd_chip_softreset
can: kvaser_usb: Add new Kvaser hydra devices
can: kvaser_usb: kvaser_usb_hydra: Add support for new device variant
can: kvaser_usb: Add new Kvaser Leaf v2 devices
can: kvaser_usb: Add USB_{LEAF,HYDRA}_PRODUCT_ID_END defines
can: flexcan: flexcan_close(): change order if commands to properly shut down the controller
can: flexcan: flexcan_open(): completely initialize controller before requesting IRQ
can: flexcan: flexcan_rx_offload_setup(): factor out mailbox and rx-offload setup into separate function
can: flexcan: move enabling/disabling of interrupts from flexcan_chip_{start,stop}() to callers
can: flexcan: factor out enabling and disabling of interrupts into separate function
can: flexcan: rename macro FLEXCAN_QUIRK_SETUP_STOP_MODE -> FLEXCAN_QUIRK_SETUP_STOP_MODE_GPR
dt-bindings: firmware: add IMX_SC_R_CAN(x) macro for CAN
dt-bindings: can: fsl,flexcan: fix fsl,clk-source property
dt-bindings: can: fsl,flexcan: add uint32 reference to clock-frequency property
can: gw: support modification of Classical CAN DLCs
can: drivers: add len8_dlc support for esd_usb2 CAN adapter
can: drivers: add len8_dlc support for various CAN adapters
can: drivers: introduce helpers to access Classical CAN DLC values
can: update documentation for DLC usage in Classical CAN
can: rename CAN FD related can_len2dlc and can_dlc2len helpers
...
====================
Link: https://lore.kernel.org/r/20201120133318.3428231-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Child sockets erroneously inherit their parent's sk_type (ie. SOCK_*),
instead of the PF_IUCV protocol that the parent was created with in
iucv_sock_create().
We're currently not using sk->sk_protocol ourselves, so this shouldn't
have much impact (except eg. getting the output in skb_dump() right).
Fixes: eac3731bd0 ("[S390]: Add AF_IUCV socket support")
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Link: https://lore.kernel.org/r/20201120100657.34407-1-jwi@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use netdev->tstats instead of a member of net_bridge for storing
a pointer to the per-cpu counters. This allows us to use core
functionality for statistics handling.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://lore.kernel.org/r/9bad2be2-fd84-7c6e-912f-cee433787018@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When setting congestion control via a BPF program it is seen that the
SYN/ACK for packets within a given flow will not include the ECT0 flag. A
bit of simple printk debugging shows that when this is configured without
BPF we will see the value INET_ECN_xmit value initialized in
tcp_assign_congestion_control however when we configure this via BPF the
socket is in the closed state and as such it isn't configured, and I do not
see it being initialized when we transition the socket into the listen
state. The result of this is that the ECT0 bit is configured based on
whatever the default state is for the socket.
Any easy way to reproduce this is to monitor the following with tcpdump:
tools/testing/selftests/bpf/test_progs -t bpf_tcp_ca
Without this patch the SYN/ACK will follow whatever the default is. If dctcp
all SYN/ACK packets will have the ECT0 bit set, and if it is not then ECT0
will be cleared on all SYN/ACK packets. With this patch applied the SYN/ACK
bit matches the value seen on the other packets in the given stream.
Fixes: 91b5b21c7c ("bpf: Add support for changing congestion control")
Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
An issue was recently found where DCTCP SYN/ACK packets did not have the
ECT bit set in the L3 header. A bit of code review found that the recent
change referenced below had gone though and added a mask that prevented the
ECN bits from being populated in the L3 header.
This patch addresses that by rolling back the mask so that it is only
applied to the flags coming from the incoming TCP request instead of
applying it to the socket tos/tclass field. Doing this the ECT bits were
restored in the SYN/ACK packets in my testing.
One thing that is not addressed by this patch set is the fact that
tcp_reflect_tos appears to be incompatible with ECN based congestion
avoidance algorithms. At a minimum the feature should likely be documented
which it currently isn't.
Fixes: ac8f1710c1 ("tcp: reflect tos value received in SYN to the socket")
Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Acked-by: Wei Wang <weiwan@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Send timely MPTCP-level ack is somewhat difficult when
the insertion into the msk receive level is performed
by the worker.
It needs TCP-level dup-ack to notify the MPTCP-level
ack_seq increase, as both the TCP-level ack seq and the
rcv window are unchanged.
We can actually avoid processing incoming data with the
worker, and let the subflow or recevmsg() send ack as needed.
When recvmsg() moves the skbs inside the msk receive queue,
the msk space is still unchanged, so tcp_cleanup_rbuf() could
end-up skipping TCP-level ack generation. Anyway, when
__mptcp_move_skbs() is invoked, a known amount of bytes is
going to be consumed soon: we update rcv wnd computation taking
them in account.
Additionally we need to explicitly trigger tcp_cleanup_rbuf()
when recvmsg() consumes a significant amount of the receive buffer.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
OoO handling attempts to detect when packet is out-of-window by testing
current ack sequence and remaining space vs. sequence number.
This doesn't work reliably. Store the highest allowed sequence number
that we've announced and use it to detect oow packets.
Do this when mptcp options get written to the packet (wire format).
For this to work we need to move the write_options call until after
stack selected a new tcp window.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When ADD_ADDR suboption includes an IPv6 address, the size is 28 octets.
It will not fit when other MPTCP suboptions are included in this packet,
e.g. DSS. So here we send out an ADD_ADDR dedicated packet to carry only
ADD_ADDR suboption, no other MPTCP suboptions.
In mptcp_pm_announce_addr, we check whether this is an IPv6 ADD_ADDR.
If it is, we set the flag MPTCP_ADD_ADDR_IPV6 to true. Then we call
mptcp_pm_nl_add_addr_send_ack to sent out a new pure ACK packet.
In mptcp_established_options_add_addr, we check whether this is a pure
ACK packet for ADD_ADDR. If it is, we drop all other MPTCP suboptions
in this packet, only put ADD_ADDR suboption in it.
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch changed the 'add_addr_signal' type from bool to char, so that
we could encode the addr type there.
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This will simplify all operation dealing with subflows
before accept time (e.g. data fin processing, add_addr).
The join list is already flushed by mptcp_stream_accept()
before returning the newly created msk to the user space.
This also fixes an potential bug present into the old code:
conn_list was manipulated without helding the msk lock
in mptcp_stream_accept().
Tested-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In case a subflow path is blocked, MPTCP-level retransmit may not take
place anymore because such subflow is likely to have unacked data left
in its write queue.
Ignore subflows that have experienced loss and test next candidate.
Fixes: 3b1d6210a9 ("mptcp: implement and use MPTCP-level retransmission")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We need to cope with some more state transition for
fallback sockets, or could still end-up moving to TCP_CLOSE
too early and avoid spooling some pending data
Fixes: e16163b6e2 ("mptcp: refactor shutdown and close")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Only mptcp_close() can actually cancel the workqueue,
no need to add and use this flag.
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If dsa_switch_ops::port_txtstamp() returns false, clone will be freed
immediately. Shouldn't store a pointer to freed memory.
Signed-off-by: Christian Eggers <ceggers@arri.de>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Tested-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20201119110906.25558-1-ceggers@arri.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In case when tcp socket received FIN after some data and the
parser haven't started before reading data caller will receive
an empty buffer. This behavior differs from plain TCP socket and
leads to special treating in user-space.
The flow that triggers the race is simple. Server sends small
amount of data right after the connection is configured to use TLS
and closes the connection. In this case receiver sees TLS Handshake
data, configures TLS socket right after Change Cipher Spec record.
While the configuration is in process, TCP socket receives small
Application Data record, Encrypted Alert record and FIN packet. So
the TCP socket changes sk_shutdown to RCV_SHUTDOWN and sk_flag with
SK_DONE bit set. The received data is not parsed upon arrival and is
never sent to user-space.
Patch unpauses parser directly if we have unparsed data in tcp
receive queue.
Fixes: fcf4793e27 ("tls: check RCV_SHUTDOWN in tls_wait_data")
Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Link: https://lore.kernel.org/r/1605801588-12236-1-git-send-email-vfedorenko@novek.ru
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rose_send_frame() dereferences `neigh->dev` when called from
rose_transmit_clear_request(), and the first occurrence of the
`neigh` is in rose_loopback_timer() as `rose_loopback_neigh`,
and it is initialized in rose_add_loopback_neigh() as NULL.
i.e when `rose_loopback_neigh` used in rose_loopback_timer()
its `->dev` was still NULL and rose_loopback_timer() was calling
rose_rx_call_request() without checking for NULL.
- net/rose/rose_link.c
This bug seems to get triggered in this line:
rose_call = (ax25_address *)neigh->dev->dev_addr;
Fix it by adding NULL checking for `rose_loopback_neigh->dev`
in rose_loopback_timer().
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Reported-by: syzbot+a1c743815982d9496393@syzkaller.appspotmail.com
Tested-by: syzbot+a1c743815982d9496393@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?id=9d2a7ca8c7f2e4b682c97578dfa3f236258300b3
Signed-off-by: Anmol Karn <anmol.karan123@gmail.com>
Link: https://lore.kernel.org/r/20201119191043.28813-1-anmol.karan123@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The static checker is fooled by the non-static locking scheme
implemented by the mentioned helpers.
Let's make its life easier adding some unconditional annotation
so that the helpers are now interpreted as a plain spinlock from
sparse.
v1 -> v2:
- add __releases() annotation to unlock_sock_fast()
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/6ed7ae627d8271fb7f20e0a9c6750fbba1ac2635.1605634911.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There is no easy way to distinguish if a conntracked tcp packet is
marked invalid because of tcp_in_window() check error or because
it doesn't belong to an existing connection. With this patch,
openvswitch sets liberal tcp flag for the established sessions so
that out of window packets are not marked invalid.
A helper function - nf_ct_set_tcp_be_liberal(nf_conn) is added which
sets this flag for both the directions of the nf_conn.
Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Numan Siddique <nusiddiq@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20201116130126.3065077-1-nusiddiq@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix a bug that is triggered when a partially setup socket is
destroyed. For a fully setup socket, a socket that has been bound to a
device, the cleanup of the umem is performed at the end of the buffer
pool's cleanup work queue item. This has to be performed in a work
queue, and not in RCU cleanup, as it is doing a vunmap that cannot
execute in interrupt context. However, when a socket has only been
partially set up so that a umem has been created but the buffer pool
has not, the code erroneously directly calls the umem cleanup function
instead of using a work queue, and this leads to a BUG_ON() in
vunmap().
As there in this case is no buffer pool, we cannot use its work queue,
so we need to introduce a work queue for the umem and schedule this for
the cleanup. So in the case there is no pool, we are going to use the
umem's own work queue to schedule the cleanup. But if there is a
pool, the cleanup of the umem is still being performed by the pool's
work queue, as it is important that the umem is cleaned up after the
pool.
Fixes: e5e1a4bc91 ("xsk: Fix possible memory leak at socket close")
Reported-by: Marek Majtyka <marekx.majtyka@intel.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Marek Majtyka <marekx.majtyka@intel.com>
Link: https://lore.kernel.org/bpf/1605873219-21629-1-git-send-email-magnus.karlsson@gmail.com
Add support for data length code modifications for Classical CAN.
The netlink configuration interface always allowed to pass any value
that fits into a byte, therefore only the modification process had to be
extended to handle the raw DLC represenation of Classical CAN frames.
When a DLC value from 0 .. F is provided for Classical CAN frame
modifications the 'len' value is modified as-is with the exception that
potentially existing 9 .. F DLC values in the len8_dlc element are moved
to the 'len' element for the modification operation by mod_retrieve_ccdlc().
After the modification the Classical CAN frame DLC information is brought
back into the correct format by mod_store_ccdlc() which is filling 'len'
and 'len8_dlc' accordingly.
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/r/20201119084921.2621-1-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
The naming of can_dlc as element of struct can_frame and also as variable
name is misleading as it claims to be a 'data length CODE' but in reality
it always was a plain data length.
With the indroduction of a new 'len' element in struct can_frame we can now
remove can_dlc as name and make clear which of the former uses was a plain
length (-> 'len') or a data length code (-> 'dlc') value.
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/r/20201120100444.3199-1-socketcan@hartkopp.net
[mkl: gs_usb: keep struct gs_host_frame::can_dlc as is]
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Reset MHI device channels when driver remove is called due to
module unload or any crash scenario. This will make sure that
MHI channels no longer remain enabled for transfers since the
MHI stack does not take care of this anymore after the auto-start
channels feature was removed.
Signed-off-by: Bhaumik Bhatt <bbhatt@codeaurora.org>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
This patch adds an IPv6 routes encapsulation attribute
to the result of netlink RTM_GETROUTE requests
(i.e. ip route get 2001:db8::).
Signed-off-by: Oliver Herms <oliver.peter.herms@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20201118230651.GA8861@tws
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We must start the retransmission timer only there are
pending data in the rtx queue.
Otherwise we can hit a WARN_ON in mptcp_reset_timer(),
as syzbot demonstrated.
Reported-and-tested-by: syzbot+42aa53dafb66a07e5a24@syzkaller.appspotmail.com
Fixes: d9ca1de8c0 ("mptcp: move page frag allocation in mptcp_sendmsg()")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Link: https://lore.kernel.org/r/1a72039f112cae048c44d398ffa14e0a1432db3d.1605737083.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When performing a flash update via devlink, device drivers may inform
user space of status updates via
devlink_flash_update_(begin|end|timeout|status)_notify functions.
It is expected that drivers do not send any status notifications unless
they send a begin and end message. If a driver sends a status
notification without sending the appropriate end notification upon
finishing (regardless of success or failure), the current implementation
of the devlink userspace program can get stuck endlessly waiting for the
end notification that will never come.
The current ice driver implementation may send such a status message
without the appropriate end notification in rare cases.
Fixing the ice driver is relatively simple: we just need to send the
begin_notify at the start of the function and always send an end_notify
no matter how the function exits.
Rather than assuming driver authors will always get this right in the
future, lets just fix the API so that it is not possible to get wrong.
Make devlink_flash_update_begin_notify and
devlink_flash_update_end_notify static, and call them in devlink.c core
code. Always send the begin_notify just before calling the driver's
flash_update routine. Always send the end_notify just after the routine
returns regardless of success or failure.
Doing this makes the status notification easier to use from the driver,
as it no longer needs to worry about catching failures and cleaning up
by calling devlink_flash_update_end_notify. It is now no longer possible
to do the wrong thing in this regard. We also save a couple of lines of
code in each driver.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>