Action sample doesn't properly handle psample_group pointer in overwrite
case. Following issues need to be fixed:
- In tcf_sample_init() function RCU_INIT_POINTER() is used to set
s->psample_group, even though we neither setting the pointer to NULL, nor
preventing concurrent readers from accessing the pointer in some way.
Use rcu_swap_protected() instead to safely reset the pointer.
- Old value of s->psample_group is not released or deallocated in any way,
which results resource leak. Use psample_group_put() on non-NULL value
obtained with rcu_swap_protected().
- The function psample_group_put() that released reference to struct
psample_group pointed by rcu-pointer s->psample_group doesn't respect rcu
grace period when deallocating it. Extend struct psample_group with rcu
head and use kfree_rcu when freeing it.
Fixes: 5c5670fae4 ("net/sched: Introduce sample tc action")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Only the first fragment in a datagram contains the L4 headers. When the
Open vSwitch module parses a packet, it always sets the IP protocol
field in the key, but can only set the L4 fields on the first fragment.
The original behavior would not clear the L4 portion of the key, so
garbage values would be sent in the key for "later" fragments. This
patch clears the L4 fields in that circumstance to prevent sending those
garbage values as part of the upcall.
Signed-off-by: Justin Pettit <jpettit@ovn.org>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When IP fragments are reassembled before being sent to conntrack, the
key from the last fragment is used. Unless there are reordering
issues, the last fragment received will not contain the L4 ports, so the
key for the reassembled datagram won't contain them. This patch updates
the key once we have a reassembled datagram.
The handle_fragments() function works on L3 headers so we pull the L3/L4
flow key update code from key_extract into a new function
'key_extract_l3l4'. Then we add a another new function
ovs_flow_key_update_l3l4() and export it so that it is accessible by
handle_fragments() for conntrack packet reassembly.
Co-authored-by: Justin Pettit <jpettit@ovn.org>
Signed-off-by: Greg Rose <gvrose8192@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that 'TCQ_F_CPUSTATS' bit can be cleared, depending on the value of
'TCQ_F_NOLOCK' bit in the parent qdisc, we need to be sure that per-cpu
counters are present when 'reset()' is called for pfifo_fast qdiscs.
Otherwise, the following script:
# tc q a dev lo handle 1: root htb default 100
# tc c a dev lo parent 1: classid 1:100 htb \
> rate 95Mbit ceil 100Mbit burst 64k
[...]
# tc f a dev lo parent 1: protocol arp basic classid 1:100
[...]
# tc q a dev lo parent 1:100 handle 100: pfifo_fast
[...]
# tc q d dev lo root
can generate the following splat:
Unable to handle kernel paging request at virtual address dfff2c01bd148000
Mem abort info:
ESR = 0x96000004
Exception class = DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
Data abort info:
ISV = 0, ISS = 0x00000004
CM = 0, WnR = 0
[dfff2c01bd148000] address between user and kernel address ranges
Internal error: Oops: 96000004 [#1] SMP
[...]
pstate: 80000005 (Nzcv daif -PAN -UAO)
pc : pfifo_fast_reset+0x280/0x4d8
lr : pfifo_fast_reset+0x21c/0x4d8
sp : ffff800d09676fa0
x29: ffff800d09676fa0 x28: ffff200012ee22e4
x27: dfff200000000000 x26: 0000000000000000
x25: ffff800ca0799958 x24: ffff1001940f332b
x23: 0000000000000007 x22: ffff200012ee1ab8
x21: 0000600de8a40000 x20: 0000000000000000
x19: ffff800ca0799900 x18: 0000000000000000
x17: 0000000000000002 x16: 0000000000000000
x15: 0000000000000000 x14: 0000000000000000
x13: 0000000000000000 x12: ffff1001b922e6e2
x11: 1ffff001b922e6e1 x10: 0000000000000000
x9 : 1ffff001b922e6e1 x8 : dfff200000000000
x7 : 0000000000000000 x6 : 0000000000000000
x5 : 1fffe400025dc45c x4 : 1fffe400025dc357
x3 : 00000c01bd148000 x2 : 0000600de8a40000
x1 : 0000000000000007 x0 : 0000600de8a40004
Call trace:
pfifo_fast_reset+0x280/0x4d8
qdisc_reset+0x6c/0x370
htb_reset+0x150/0x3b8 [sch_htb]
qdisc_reset+0x6c/0x370
dev_deactivate_queue.constprop.5+0xe0/0x1a8
dev_deactivate_many+0xd8/0x908
dev_deactivate+0xe4/0x190
qdisc_graft+0x88c/0xbd0
tc_get_qdisc+0x418/0x8a8
rtnetlink_rcv_msg+0x3a8/0xa78
netlink_rcv_skb+0x18c/0x328
rtnetlink_rcv+0x28/0x38
netlink_unicast+0x3c4/0x538
netlink_sendmsg+0x538/0x9a0
sock_sendmsg+0xac/0xf8
___sys_sendmsg+0x53c/0x658
__sys_sendmsg+0xc8/0x140
__arm64_sys_sendmsg+0x74/0xa8
el0_svc_handler+0x164/0x468
el0_svc+0x10/0x14
Code: 910012a0 92400801 d343fc03 11000c21 (38fb6863)
Fix this by testing the value of 'TCQ_F_CPUSTATS' bit in 'qdisc->flags',
before dereferencing 'qdisc->cpu_qstats'.
Changes since v1:
- coding style improvements, thanks to Stefano Brivio
Fixes: 8a53e616de ("net: sched: when clearing NOLOCK, clear TCQ_F_CPUSTATS, too")
CC: Paolo Abeni <pabeni@redhat.com>
Reported-by: Li Shuang <shuali@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Rutsky reported stuck TCP sessions after memory pressure
events. Edge Trigger epoll() user would never receive an EPOLLOUT
notification allowing them to retry a sendmsg().
Jason tested the case of sk_stream_alloc_skb() returning NULL,
but there are other paths that could lead both sendmsg() and sendpage()
to return -1 (EAGAIN), with an empty skb queued on the write queue.
This patch makes sure we remove this empty skb so that
Jason code can detect that the queue is empty, and
call sk->sk_write_space(sk) accordingly.
Fixes: ce5ec44099 ("tcp: ensure epoll edge trigger wakeup when write queue is empty")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jason Baron <jbaron@akamai.com>
Reported-by: Vladimir Rutsky <rutsky@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The rds6_inc_info_copy() function has a couple struct members which
are leaking stack information. The ->tos field should hold actual
information and the ->flags field needs to be zeroed out.
Fixes: 3eb450367d ("rds: add type of service(tos) infrastructure")
Fixes: b7ff8b1036 ("rds: Extend RDS API for IPv6 support")
Reported-by: 黄ID蝴蝶 <butterflyhuangxx@gmail.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After witnessing the discussion in https://lkml.org/lkml/2019/8/14/151
w.r.t. ioctl extensibility, it became clear that such an issue might
prevent that the 3 RSV bits inside the DSA 802.1Q tag might also suffer
the same fate and be useless for further extension.
So clearly specify that the reserved bits should currently be
transmitted as zero and ignored on receive. The DSA tagger already does
this (and has always did), and is the only known user so far (no
Wireshark dissection plugin, etc). So there should be no incompatibility
to speak of.
Fixes: 0471dd429c ("net: dsa: tag_8021q: Create a stable binary format")
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The net pointer in struct xt_tgdtor_param is not explicitly
initialized therefore is still NULL when dereferencing it.
So we have to find a way to pass the correct net pointer to
ipt_destroy_target().
The best way I find is just saving the net pointer inside the per
netns struct tcf_idrinfo, which could make this patch smaller.
Fixes: 0c66dc1ea3 ("netfilter: conntrack: register hooks in netns when needed by ruleset")
Reported-and-tested-by: itugrok@yahoo.com
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Highlights include:
Stable fixes:
- Fix a page lock leak in nfs_pageio_resend()
- Ensure O_DIRECT reports an error if the bytes read/written is 0
- Don't handle errors if the bind/connect succeeded
- Revert "NFSv4/flexfiles: Abort I/O early if the layout segment was invalidat
ed"
Bugfixes:
- Don't refresh attributes with mounted-on-file information
- Fix return values for nfs4_file_open() and nfs_finish_open()
- Fix pnfs layoutstats reporting of I/O errors
- Don't use soft RPC calls for pNFS/flexfiles I/O, and don't abort for
soft I/O errors when the user specifies a hard mount.
- Various fixes to the error handling in sunrpc
- Don't report writepage()/writepages() errors twice.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAl1lgz4ACgkQZwvnipYK
APIsHhAApqaVaGzwfeR87zq+QaaVOzYzejyvFgs3wh/Lc5xPH+SlQ6NxLbs8ppdT
srrOHV9E2MA4JgqoHaIBMTqWacQ0UfQQ/6qLEFCrps9/0QHs7fg0CAHS5emmgk2v
rD6Mezr5Nx8h5/QJCBEZXfas5lxsICz1EYJ4Pk8QT6IoyeC+fvarGZKvzIQJ3KDN
8yrdv5kCVtN7noREf1KDIqIlYvFbIEoOoglNA40G49e1ffT9Oz6qzTcg19HFO50x
eAIxc9u4KCUY/ASCvcv9biQ5200l7QSCqmR7/Xlj/+4aClKp6Ay058j0awxtHHDy
NlZt6V3XGlm1/SVpvtU/XXWcyJmQwX7kOVIEYOFmt+lEqC7ZBzWEpAaJ8h4DMLLc
PIxIWBSmXNxp6LPNI0dZFf7O6UZ3ZMRacav+HHu7mjWolEB22f4jQJs+RxNhnfLU
fg180YWBMX4V/98S7iigxZkRd+qqQhddYtku+o+bp3h4m6mVrrYNm11J0o0GWQWf
Lio9nlkLq9hkYpdBwkH4PtIv3b+O5f9yhfEYn15eF27Ru0Bob0+DiBkzlflcrJve
W2VfNAj+jxP3Wg0QAI40BSqUB3b+zVtZW5FenAUEK7NxhhPi6jrIsVhhVgGFZIAd
i1xwYUg6fDjielhGOxMTF66ilvduA9uBCFAnTD3iSBoZmF63vew=
=YHhU
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.3-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:
Stable fixes:
- Fix a page lock leak in nfs_pageio_resend()
- Ensure O_DIRECT reports an error if the bytes read/written is 0
- Don't handle errors if the bind/connect succeeded
- Revert "NFSv4/flexfiles: Abort I/O early if the layout segment was
invalidat ed"
Bugfixes:
- Don't refresh attributes with mounted-on-file information
- Fix return values for nfs4_file_open() and nfs_finish_open()
- Fix pnfs layoutstats reporting of I/O errors
- Don't use soft RPC calls for pNFS/flexfiles I/O, and don't abort
for soft I/O errors when the user specifies a hard mount.
- Various fixes to the error handling in sunrpc
- Don't report writepage()/writepages() errors twice"
* tag 'nfs-for-5.3-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFS: remove set but not used variable 'mapping'
NFSv2: Fix write regression
NFSv2: Fix eof handling
NFS: Fix writepage(s) error handling to not report errors twice
NFS: Fix spurious EIO read errors
pNFS/flexfiles: Don't time out requests on hard mounts
SUNRPC: Handle connection breakages correctly in call_status()
Revert "NFSv4/flexfiles: Abort I/O early if the layout segment was invalidated"
SUNRPC: Handle EADDRINUSE and ENOBUFS correctly
pNFS/flexfiles: Turn off soft RPC calls
SUNRPC: Don't handle errors if the bind/connect succeeded
NFS: On fatal writeback errors, we need to call nfs_inode_remove_request()
NFS: Fix initialisation of I/O result struct in nfs_pgio_rpcsetup
NFS: Ensure O_DIRECT reports an error if the bytes read/written is 0
NFSv4/pnfs: Fix a page lock leak in nfs_pageio_resend()
NFSv4: Fix return value in nfs_finish_open()
NFSv4: Fix return values for nfs4_file_open()
NFS: Don't refresh attributes with mounted-on-file information
Pull networking fixes from David Miller:
1) Use 32-bit index for tails calls in s390 bpf JIT, from Ilya
Leoshkevich.
2) Fix missed EPOLLOUT events in TCP, from Eric Dumazet. Same fix for
SMC from Jason Baron.
3) ipv6_mc_may_pull() should return 0 for malformed packets, not
-EINVAL. From Stefano Brivio.
4) Don't forget to unpin umem xdp pages in error path of
xdp_umem_reg(). From Ivan Khoronzhuk.
5) Fix sta object leak in mac80211, from Johannes Berg.
6) Fix regression by not configuring PHYLINK on CPU port of bcm_sf2
switches. From Florian Fainelli.
7) Revert DMA sync removal from r8169 which was causing regressions on
some MIPS Loongson platforms. From Heiner Kallweit.
8) Use after free in flow dissector, from Jakub Sitnicki.
9) Fix NULL derefs of net devices during ICMP processing across
collect_md tunnels, from Hangbin Liu.
10) proto_register() memory leaks, from Zhang Lin.
11) Set NLM_F_MULTI flag in multipart netlink messages consistently,
from John Fastabend.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (66 commits)
r8152: Set memory to all 0xFFs on failed reg reads
openvswitch: Fix conntrack cache with timeout
ipv4: mpls: fix mpls_xmit for iptunnel
nexthop: Fix nexthop_num_path for blackhole nexthops
net: rds: add service level support in rds-info
net: route dump netlink NLM_F_MULTI flag missing
s390/qeth: reject oversized SNMP requests
sock: fix potential memory leak in proto_register()
MAINTAINERS: Add phylink keyword to SFF/SFP/SFP+ MODULE SUPPORT
xfrm/xfrm_policy: fix dst dev null pointer dereference in collect_md mode
ipv4/icmp: fix rt dst dev null pointer dereference
openvswitch: Fix log message in ovs conntrack
bpf: allow narrow loads of some sk_reuseport_md fields with offset > 0
bpf: fix use after free in prog symbol exposure
bpf: fix precision tracking in presence of bpf2bpf calls
flow_dissector: Fix potential use-after-free on BPF_PROG_DETACH
Revert "r8169: remove not needed call to dma_sync_single_for_device"
ipv6: propagate ipv6_add_dev's error returns out of ipv6_find_idev
net/ncsi: Fix the payload copying for the request coming from Netlink
qed: Add cleanup in qed_slowpath_start()
...
If the connection breaks while we're waiting for a reply from the
server, then we want to immediately try to reconnect.
Fixes: ec6017d903 ("SUNRPC fix regression in umount of a secure mount")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
This reverts commit a79f194aa4.
The mechanism for aborting I/O is racy, since we are not guaranteed that
the request is asleep while we're changing both task->tk_status and
task->tk_action.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v5.1
If a connect or bind attempt returns EADDRINUSE, that means we want to
retry with a different port. It is not a fatal connection error.
Similarly, ENOBUFS is not fatal, but just indicates a memory allocation
issue. Retry after a short delay.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Don't handle errors in call_bind_status()/call_connect_status()
if it turns out that a previous call caused it to succeed.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v5.1+
This patch addresses a conntrack cache issue with timeout policy.
Currently, we do not check if the timeout extension is set properly in the
cached conntrack entry. Thus, after packet recirculate from conntrack
action, the timeout policy is not applied properly. This patch fixes the
aforementioned issue.
Fixes: 06bd2bdf19 ("openvswitch: Add timeout support to ct action")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When using mpls over gre/gre6 setup, rt->rt_gw4 address is not set, the
same for rt->rt_gw_family. Therefore, when rt->rt_gw_family is checked
in mpls_xmit(), neigh_xmit() call is skipped. As a result, such setup
doesn't work anymore.
This issue was found with LTP mpls03 tests.
Fixes: 1550c17193 ("ipv4: Prepare rtable for IPv6 gateway")
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
>From IB specific 7.6.5 SERVICE LEVEL, Service Level (SL)
is used to identify different flows within an IBA subnet.
It is carried in the local route header of the packet.
Before this commit, run "rds-info -I". The outputs are as
below:
"
RDS IB Connections:
LocalAddr RemoteAddr Tos SL LocalDev RemoteDev
192.2.95.3 192.2.95.1 2 0 fe80::21:28:1a:39 fe80::21:28:10:b9
192.2.95.3 192.2.95.1 1 0 fe80::21:28:1a:39 fe80::21:28:10:b9
192.2.95.3 192.2.95.1 0 0 fe80::21:28:1a:39 fe80::21:28:10:b9
"
After this commit, the output is as below:
"
RDS IB Connections:
LocalAddr RemoteAddr Tos SL LocalDev RemoteDev
192.2.95.3 192.2.95.1 2 2 fe80::21:28:1a:39 fe80::21:28:10:b9
192.2.95.3 192.2.95.1 1 1 fe80::21:28:1a:39 fe80::21:28:10:b9
192.2.95.3 192.2.95.1 0 0 fe80::21:28:1a:39 fe80::21:28:10:b9
"
The commit fe3475af3b ("net: rds: add per rds connection cache
statistics") adds cache_allocs in struct rds_info_rdma_connection
as below:
struct rds_info_rdma_connection {
...
__u32 rdma_mr_max;
__u32 rdma_mr_size;
__u8 tos;
__u32 cache_allocs;
};
The peer struct in rds-tools of struct rds_info_rdma_connection is as
below:
struct rds_info_rdma_connection {
...
uint32_t rdma_mr_max;
uint32_t rdma_mr_size;
uint8_t tos;
uint8_t sl;
uint32_t cache_allocs;
};
The difference between userspace and kernel is the member variable sl.
In the kernel struct, the member variable sl is missing. This will
introduce risks. So it is necessary to use this commit to avoid this risk.
Fixes: fe3475af3b ("net: rds: add per rds connection cache statistics")
CC: Joe Jin <joe.jin@oracle.com>
CC: JUNXIAO_BI <junxiao.bi@oracle.com>
Suggested-by: Gerd Rausch <gerd.rausch@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
An excerpt from netlink(7) man page,
In multipart messages (multiple nlmsghdr headers with associated payload
in one byte stream) the first and all following headers have the
NLM_F_MULTI flag set, except for the last header which has the type
NLMSG_DONE.
but, after (ee28906) there is a missing NLM_F_MULTI flag in the middle of a
FIB dump. The result is user space applications following above man page
excerpt may get confused and may stop parsing msg believing something went
wrong.
In the golang netlink lib [0] the library logic stops parsing believing the
message is not a multipart message. Found this running Cilium[1] against
net-next while adding a feature to auto-detect routes. I noticed with
multiple route tables we no longer could detect the default routes on net
tree kernels because the library logic was not returning them.
Fix this by handling the fib_dump_info_fnhe() case the same way the
fib_dump_info() handles it by passing the flags argument through the
call chain and adding a flags argument to rt_fill_info().
Tested with Cilium stack and auto-detection of routes works again. Also
annotated libs to dump netlink msgs and inspected NLM_F_MULTI and
NLMSG_DONE flags look correct after this.
Note: In inet_rtm_getroute() pass rt_fill_info() '0' for flags the same
as is done for fib_dump_info() so this looks correct to me.
[0] https://github.com/vishvananda/netlink/
[1] https://github.com/cilium/
Fixes: ee28906fd7 ("ipv4: Dump route exceptions if requested")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If protocols registered exceeded PROTO_INUSE_NR, prot will be
added to proto_list, but no available bit left for prot in
proto_inuse_idx.
Changes since v2:
* Propagate the error code properly
Signed-off-by: zhanglin <zhang.lin16@zte.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
In decode_session{4,6} there is a possibility that the skb dst dev is NULL,
e,g, with tunnel collect_md mode, which will cause kernel crash.
Here is what the code path looks like, for GRE:
- ip6gre_tunnel_xmit
- ip6gre_xmit_ipv6
- __gre6_xmit
- ip6_tnl_xmit
- if skb->len - t->tun_hlen - eth_hlen > mtu; return -EMSGSIZE
- icmpv6_send
- icmpv6_route_lookup
- xfrm_decode_session_reverse
- decode_session4
- oif = skb_dst(skb)->dev->ifindex; <-- here
- decode_session6
- oif = skb_dst(skb)->dev->ifindex; <-- here
The reason is __metadata_dst_init() init dst->dev to NULL by default.
We could not fix it in __metadata_dst_init() as there is no dev supplied.
On the other hand, the skb_dst(skb)->dev is actually not needed as we
called decode_session{4,6} via xfrm_decode_session_reverse(), so oif is not
used by: fl4->flowi4_oif = reverse ? skb->skb_iif : oif;
So make a dst dev check here should be clean and safe.
v4: No changes.
v3: No changes.
v2: fix the issue in decode_session{4,6} instead of updating shared dst dev
in {ip_md, ip6}_tunnel_xmit.
Fixes: 8d79266bc4 ("ip6_tunnel: add collect_md mode to IPv6 tunnels")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Tested-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In __icmp_send() there is a possibility that the rt->dst.dev is NULL,
e,g, with tunnel collect_md mode, which will cause kernel crash.
Here is what the code path looks like, for GRE:
- ip6gre_tunnel_xmit
- ip6gre_xmit_ipv4
- __gre6_xmit
- ip6_tnl_xmit
- if skb->len - t->tun_hlen - eth_hlen > mtu; return -EMSGSIZE
- icmp_send
- net = dev_net(rt->dst.dev); <-- here
The reason is __metadata_dst_init() init dst->dev to NULL by default.
We could not fix it in __metadata_dst_init() as there is no dev supplied.
On the other hand, the reason we need rt->dst.dev is to get the net.
So we can just try get it from skb->dev when rt->dst.dev is NULL.
v4: Julian Anastasov remind skb->dev also could be NULL. We'd better
still use dst.dev and do a check to avoid crash.
v3: No changes.
v2: fix the issue in __icmp_send() instead of updating shared dst dev
in {ip_md, ip6}_tunnel_xmit.
Fixes: c8b34e680a ("ip_tunnel: Add tnl_update_pmtu in ip_md_tunnel_xmit")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes: 06bd2bdf19 ("openvswitch: Add timeout support to ct action")
Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stefan Schmidt says:
====================
pull-request: ieee802154 for net 2019-08-24
An update from ieee802154 for your *net* tree.
Yue Haibing fixed two bugs discovered by KASAN in the hwsim driver for
ieee802154 and Colin Ian King cleaned up a redundant variable assignment.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf 2019-08-24
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix verifier precision tracking with BPF-to-BPF calls, from Alexei.
2) Fix a use-after-free in prog symbol exposure, from Daniel.
3) Several s390x JIT fixes plus BE related fixes in BPF kselftests, from Ilya.
4) Fix memory leak by unpinning XDP umem pages in error path, from Ivan.
5) Fix a potential use-after-free on flow dissector detach, from Jakub.
6) Fix bpftool to close prog fd after showing metadata, from Quentin.
7) BPF kselftest config and TEST_PROGS_EXTENDED fixes, from Anders.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
test_select_reuseport fails on s390 due to verifier rejecting
test_select_reuseport_kern.o with the following message:
; data_check.eth_protocol = reuse_md->eth_protocol;
18: (69) r1 = *(u16 *)(r6 +22)
invalid bpf_context access off=22 size=2
This is because on big-endian machines casts from __u32 to __u16 are
generated by referencing the respective variable as __u16 with an offset
of 2 (as opposed to 0 on little-endian machines).
The verifier already has all the infrastructure in place to allow such
accesses, it's just that they are not explicitly enabled for
eth_protocol field. Enable them for eth_protocol field by using
bpf_ctx_range instead of offsetof.
Ditto for ip_protocol, bind_inany and len, since they already allow
narrowing, and the same problem can arise when working with them.
Fixes: 2dbb9b9e6d ("bpf: Introduce BPF_PROG_TYPE_SK_REUSEPORT")
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Call to bpf_prog_put(), with help of call_rcu(), queues an RCU-callback to
free the program once a grace period has elapsed. The callback can run
together with new RCU readers that started after the last grace period.
New RCU readers can potentially see the "old" to-be-freed or already-freed
pointer to the program object before the RCU update-side NULLs it.
Reorder the operations so that the RCU update-side resets the protected
pointer before the end of the grace period after which the program will be
freed.
Fixes: d58e468b11 ("flow_dissector: implements flow dissector BPF hook")
Reported-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Petar Penkov <ppenkov@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Currently, ipv6_find_idev returns NULL when ipv6_add_dev fails,
ignoring the specific error value. This results in addrconf_add_dev
returning ENOBUFS in all cases, which is unfortunate in cases such as:
# ip link add dummyX type dummy
# ip link set dummyX mtu 1200 up
# ip addr add 2000::/64 dev dummyX
RTNETLINK answers: No buffer space available
Commit a317a2f19d ("ipv6: fail early when creating netdev named all
or default") introduced error returns in ipv6_add_dev. Before that,
that function would simply return NULL for all failures.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
an assert and a NULL pointer dereference) plus a small series from Luis
fixing instances of vfree() under spinlock.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl1f2fITHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi83fB/0a+TnNY8Q2aEeB9Y/0sckSpRCsMGMV
syt2krwKC0EYM1f2dkJdgCjlSjMzMcHPseP3g5odRXgyPKJt5O9oE7l3vGDC4Oyt
chqhEh86UzG6Kcptx6tIzsAGYS9S4NzxR5sfXF6oRu8m1bwk1n5IhKxYjQDTvAMd
RxwvpdguNA9xvHeUvLMTpy2R3qE3uQ2dxierutW67GeyeCPkvyBmazzi72Q36hlL
y1w8DWaPBemBk5QEM9vmz5i2xQeLO4h4ejhP4LcXyVjJtfvAPl0JWOsHMK4uWRJf
6XjbGDaGYvID0hTQLlEw/k73976HmRxSbaXRtCZN+IG3yWGTL8ID6GqI
=kaFB
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-5.3-rc6' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"Three important fixes tagged for stable (an indefinite hang, a crash
on an assert and a NULL pointer dereference) plus a small series from
Luis fixing instances of vfree() under spinlock"
* tag 'ceph-for-5.3-rc6' of git://github.com/ceph/ceph-client:
libceph: fix PG split vs OSD (re)connect race
ceph: don't try fill file_lock on unsuccessful GETFILELOCK reply
ceph: clear page dirty before invalidate page
ceph: fix buffer free while holding i_ceph_lock in fill_inode()
ceph: fix buffer free while holding i_ceph_lock in __ceph_build_xattrs_blob()
ceph: fix buffer free while holding i_ceph_lock in __ceph_setxattr()
libceph: allow ceph_buffer_put() to receive a NULL ceph_buffer
The request coming from Netlink should use the OEM generic handler.
The standard command handler expects payload in bytes/words/dwords
but the actual payload is stored in data if the request is coming from Netlink.
Signed-off-by: Justin Lee <justin.lee1@dell.com>
Reviewed-by: Vijay Khemka <vijaykhemka@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can't rely on ->peer_features in calc_target() because it may be
called both when the OSD session is established and open and when it's
not. ->peer_features is not valid unless the OSD session is open. If
this happens on a PG split (pg_num increase), that could mean we don't
resend a request that should have been resent, hanging the client
indefinitely.
In userspace this was fixed by looking at require_osd_release and
get_xinfo[osd].features fields of the osdmap. However these fields
belong to the OSD section of the osdmap, which the kernel doesn't
decode (only the client section is decoded).
Instead, let's drop this feature check. It effectively checks for
luminous, so only pre-luminous OSDs would be affected in that on a PG
split the kernel might resend a request that should not have been
resent. Duplicates can occur in other scenarios, so both sides should
already be prepared for them: see dup/replay logic on the OSD side and
retry_attempt check on the client side.
Cc: stable@vger.kernel.org
Fixes: 7de030d6b1 ("libceph: resend on PG splits if OSD has RESEND_ON_SPLIT")
Link: https://tracker.ceph.com/issues/41162
Reported-by: Jerry Lee <leisurelysw24@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Tested-by: Jerry Lee <leisurelysw24@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
it expects a unsigned int, but got a __be32
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 93a714d6b5 ("multicast: Extend ip address command to enable
multicast group join/leave on") we added a new flag IFA_F_MCAUTOJOIN
to make user able to add multicast address on ethernet interface.
This works for IPv4, but not for IPv6. See the inet6_addr_add code.
static int inet6_addr_add()
{
...
if (cfg->ifa_flags & IFA_F_MCAUTOJOIN) {
ipv6_mc_config(net->ipv6.mc_autojoin_sk, true...)
}
ifp = ipv6_add_addr(idev, cfg, true, extack); <- always fail with maddr
if (!IS_ERR(ifp)) {
...
} else if (cfg->ifa_flags & IFA_F_MCAUTOJOIN) {
ipv6_mc_config(net->ipv6.mc_autojoin_sk, false...)
}
}
But in ipv6_add_addr() it will check the address type and reject multicast
address directly. So this feature is never worked for IPv6.
We should not remove the multicast address check totally in ipv6_add_addr(),
but could accept multicast address only when IFA_F_MCAUTOJOIN flag supplied.
v2: update commit description
Fixes: 93a714d6b5 ("multicast: Extend ip address command to enable multicast group join/leave on")
Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- fix uninit-value in batadv_netlink_get_ifindex(), by Eric Dumazet
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAl1dRtEWHHN3QHNpbW9u
d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoeN6D/9IExhEX0OZQFf6fkHJ71yKAN0j
FjBBmumtu33brjFCtieix+jBRGny9qAO80I2q3X2yRQlx/SpHVuXUyHPrN+azNMJ
w2mb7Mw1l3+Y4KbdhM1dz+CTafmdJzgZoFsKIAamD+IP5uIUg1sSL4mLcgka35ol
GFZa82LE7hjTY1sR0azKCbbeAIlwMkoLWy5YsUmIDzJZ6eWJSEt/kEuNIWuhIhl4
VZeT6VH96xWSY1rqwgIiv0fXBuiE/n3//nMY5WquM0yFA1XJBG9jRFhmNcca8oPi
ehodBtZBlHyBXnAnDkSEJY7nux3+jsj4N4X+LDjmv6yWH5EmqMvM3NysXalM6sBj
SZeZyoxJZxl+WE658ZPoVawg3NuABOdOuY6tw/FLvFtBH3rKQzvlz0nXp19m1kef
TrGjCsJsVO/pnrXlQMB0+JnQooRId/UGUTHfgqqP2hySn6a2KcGrxCd7tWxBQ5oW
a3DSmVo7QEwxvEPnIPpxPFv698Rkheb+gQZVrOBjmE4YmdmuaCFUDt04zDRwFjsz
dH+tX3kgQvHZEwVqWylikNqkFDWuako7+N1fZ9DxWZh8tnDe6FXXgQ+DL80nuATS
DGI7ojpAhe8YUyCqOHsCD3lW1hHUmtmzj/N2OrN+U2ROO745RCqs9Co+Kn2ugRhq
YOdgbYJL6aRzu/D52w==
=9t+O
-----END PGP SIGNATURE-----
Merge tag 'batadv-net-for-davem-20190821' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
Here is a batman-adv bugfix:
- fix uninit-value in batadv_netlink_get_ifindex(), by Eric Dumazet
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 96cce12ff6 ("cfg80211: fix processing world
regdomain when non modular").
Re-triggering a reg_process_hint with the last request on all events,
can make the regulatory domain fail in case of multiple WiFi modules. On
slower boards (espacially with mdev), enumeration of the WiFi modules
can end up in an intersected regulatory domain, and user cannot set it
with 'iw reg set' anymore.
This is happening, because:
- 1st module enumerates, queues up a regulatory request
- request gets processed by __reg_process_hint_driver():
- checks if previous was set by CORE -> yes
- checks if regulator domain changed -> yes, from '00' to e.g. 'US'
-> sends request to the 'crda'
- 2nd module enumerates, queues up a regulator request (which triggers
the reg_todo() work)
- reg_todo() -> reg_process_pending_hints() sees, that the last request
is not processed yet, so it tries to process it again.
__reg_process_hint driver() will run again, and:
- checks if the last request's initiator was the core -> no, it was
the driver (1st WiFi module)
- checks, if the previous initiator was the driver -> yes
- checks if the regulator domain changed -> yes, it was '00' (set by
core, and crda call did not return yet), and should be changed to 'US'
------> __reg_process_hint_driver calls an intersect
Besides, the reg_process_hint call with the last request is meaningless
since the crda call has a timeout work. If that timeout expires, the
first module's request will lost.
Cc: stable@vger.kernel.org
Fixes: 96cce12ff6 ("cfg80211: fix processing world regdomain when non modular")
Signed-off-by: Robert Hodaszi <robert.hodaszi@digi.com>
Link: https://lore.kernel.org/r/20190614131600.GA13897@a1-hr
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Fix two shortcomings in the Extended Key ID API:
1) Allow the userspace to install pairwise keys using keyid 1 without
NL80211_KEY_NO_TX set. This allows the userspace to install and
activate pairwise keys with keyid 1 in the same way as for keyid 0,
simplifying the API usage for e.g. FILS and FT key installs.
2) IEEE 802.11 - 2016 restricts Extended Key ID usage to CCMP/GCMP
ciphers in IEEE 802.11 - 2016 "9.4.2.25.4 RSN capabilities".
Enforce that when installing a key.
Cc: stable@vger.kernel.org # 5.2
Fixes: 6cdd3979a2 ("nl80211/cfg80211: Extended Key ID support")
Signed-off-by: Alexander Wetzel <alexander@wetzel-home.de>
Link: https://lore.kernel.org/r/20190805123400.51567-1-alexander@wetzel-home.de
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
in ip_mc_inc_group, memory allocation flag, not mcast mode, is expected
by __ip_mc_inc_group
similar issue in __ip_mc_join_group, both mcase mode and gfp_t are needed
here, so use ____ip_mc_inc_group(...)
Fixes: 9fb20801da ("net: Fix ip_mc_{dec,inc}_group allocation context")
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The NCSI spec indicates that if the data does not end on a 32 bit
boundary, one to three padding bytes equal to 0x00 shall be present to
align the checksum field to a 32-bit boundary.
Signed-off-by: Terry S. Duncan <terry.s.duncan@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, we are only explicitly setting SOCK_NOSPACE on a write timeout
for non-blocking sockets. Epoll() edge-trigger mode relies on SOCK_NOSPACE
being set when -EAGAIN is returned to ensure that EPOLLOUT is raised.
Expand the setting of SOCK_NOSPACE to non-blocking sockets as well that can
use SO_SNDTIMEO to adjust their write timeout. This mirrors the behavior
that Eric Dumazet introduced for tcp sockets.
Signed-off-by: Jason Baron <jbaron@akamai.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Ursula Braun <ubraun@linux.ibm.com>
Cc: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix mem leak caused by missed unpin routine for umem pages.
Fixes: 8aef7340ae ("xsk: introduce xdp_umem_page")
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Remove IP MASQUERADING record in MAINTAINERS file,
from Denis Efremov.
2) Counter arguments are swapped in ebtables, from
Todd Seidelmann.
3) Missing netlink attribute validation in flow_offload
extension.
4) Incorrect alignment in xt_nfacct that breaks 32-bits
userspace / 64-bits kernels, from Juliana Rodrigueiro.
5) Missing include guard in nf_conntrack_h323_types.h,
from Masahiro Yamada.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
As Jason Baron explained in commit 790ba4566c ("tcp: set SOCK_NOSPACE
under memory pressure"), it is crucial we properly set SOCK_NOSPACE
when needed.
However, Jason patch had a bug, because the 'nonblocking' status
as far as sk_stream_wait_memory() is concerned is governed
by MSG_DONTWAIT flag passed at sendmsg() time :
long timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
So it is very possible that tcp sendmsg() calls sk_stream_wait_memory(),
and that sk_stream_wait_memory() returns -EAGAIN with SOCK_NOSPACE
cleared, if sk->sk_sndtimeo has been set to a small (but not zero)
value.
This patch removes the 'noblock' variable since we must always
set SOCK_NOSPACE if -EAGAIN is returned.
It also renames the do_nonblock label since we might reach this
code path even if we were in blocking mode.
Fixes: 790ba4566c ("tcp: set SOCK_NOSPACE under memory pressure")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jason Baron <jbaron@akamai.com>
Reported-by: Vladimir Rutsky <rutsky@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When running a 64-bit kernel with a 32-bit iptables binary, the size of
the xt_nfacct_match_info struct diverges.
kernel: sizeof(struct xt_nfacct_match_info) : 40
iptables: sizeof(struct xt_nfacct_match_info)) : 36
Trying to append nfacct related rules results in an unhelpful message.
Although it is suggested to look for more information in dmesg, nothing
can be found there.
# iptables -A <chain> -m nfacct --nfacct-name <acct-object>
iptables: Invalid argument. Run `dmesg' for more information.
This patch fixes the memory misalignment by enforcing 8-byte alignment
within the struct's first revision. This solution is often used in many
other uapi netfilter headers.
Signed-off-by: Juliana Rodrigueiro <juliana.rodrigueiro@intra2net.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The netlink attribute policy for NFTA_FLOW_TABLE_NAME is missing.
Fixes: a3c90f7a23 ("netfilter: nf_tables: flow offload expression")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The ordering of arguments to the x_tables ADD_COUNTER macro
appears to be wrong in ebtables (cf. ip_tables.c, ip6_tables.c,
and arp_tables.c).
This causes data corruption in the ebtables userspace tools
because they get incorrect packet & byte counts from the kernel.
Fixes: d72133e628 ("netfilter: ebtables: use ADD_COUNTER macro")
Signed-off-by: Todd Seidelmann <tseidelmann@linode.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds initial support for offloading basechains using the
priority range from 1 to 65535. This is restricting the netfilter
priority range to 16-bit integer since this is what most drivers assume
so far from tc. It should be possible to extend this range of supported
priorities later on once drivers are updated to support for 32-bit
integer priorities.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth 2019-08-17
Here's a set of Bluetooth fixes for the 5.3-rc series:
- Multiple fixes for Qualcomm (btqca & hci_qca) drivers
- Minimum encryption key size debugfs setting (this is required for
Bluetooth Qualification)
- Fix hidp_send_message() to have a meaningful return value
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
For testing and qualification purposes it is useful to allow changing
the minimum encryption key size value that the host stack is going to
enforce. This adds a new debugfs setting min_encrypt_key_size to achieve
this functionality.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>