The bpfilter_umh will be stopped via __stop_umh() when the bpfilter
error occurred.
The bpfilter_umh() couldn't start again because there is no restart
routine.
The section of the bpfilter_umh_{start/end} is no longer .init.rodata
because these area should be reused in the restart routine. hence
the section name is changed to .bpfilter_umh.
The bpfilter_ops->start() is restart callback. it will be called when
bpfilter_umh is stopped.
The stop bit means bpfilter_umh is stopped. this bit is set by both
start and stop routine.
Before this patch,
Test commands:
$ iptables -vnL
$ kill -9 <pid of bpfilter_umh>
$ iptables -vnL
[ 480.045136] bpfilter: write fail -32
$ iptables -vnL
All iptables commands will fail.
After this patch,
Test commands:
$ iptables -vnL
$ kill -9 <pid of bpfilter_umh>
$ iptables -vnL
$ iptables -vnL
Now, all iptables commands will work.
Fixes: d2ba09c17a ("net: add skeleton of bpfilter kernel module")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now, UMH process is killed, do_exit() calls the umh_info->cleanup callback
to release members of the umh_info.
This patch makes bpfilter_umh's cleanup routine to use the
umh_info->cleanup callback.
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A UMH process which is created by the fork_usermode_blob() such as
bpfilter needs to release members of the umh_info when process is
terminated.
But the do_exit() does not release members of the umh_info. hence module
which uses UMH needs own code to detect whether UMH process is
terminated or not.
But this implementation needs extra code for checking the status of
UMH process. it eventually makes the code more complex.
The new PF_UMH flag is added and it is used to identify UMH processes.
The exit_umh() does not release members of the umh_info.
Hence umh_info->cleanup callback should release both members of the
umh_info and the private data.
Suggested-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The functions isdn_tty_tiocmset() and isdn_tty_set_termios() may be
concurrently executed.
isdn_tty_tiocmset
isdn_tty_modem_hup
line 719: kfree(info->dtmf_state);
line 721: kfree(info->silence_state);
line 723: kfree(info->adpcms);
line 725: kfree(info->adpcmr);
isdn_tty_set_termios
isdn_tty_modem_hup
line 719: kfree(info->dtmf_state);
line 721: kfree(info->silence_state);
line 723: kfree(info->adpcms);
line 725: kfree(info->adpcmr);
Thus, some concurrency double-free bugs may occur.
These possible bugs are found by a static tool written by myself and
my manual code review.
To fix these possible bugs, the mutex lock "modem_info_mutex" used in
isdn_tty_tiocmset() is added in isdn_tty_set_termios().
Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The vsock core only supports 32bit CID, but the Virtio-vsock spec define
CID (dst_cid and src_cid) as u64 and the upper 32bits is reserved as
zero. This inconsistency causes one bug in vhost vsock driver. The
scenarios is:
0. A hash table (vhost_vsock_hash) is used to map an CID to a vsock
object. And hash_min() is used to compute the hash key. hash_min() is
defined as:
(sizeof(val) <= 4 ? hash_32(val, bits) : hash_long(val, bits)).
That means the hash algorithm has dependency on the size of macro
argument 'val'.
0. In function vhost_vsock_set_cid(), a 64bit CID is passed to
hash_min() to compute the hash key when inserting a vsock object into
the hash table.
0. In function vhost_vsock_get(), a 32bit CID is passed to hash_min()
to compute the hash key when looking up a vsock for an CID.
Because the different size of the CID, hash_min() returns different hash
key, thus fails to look up the vsock object for an CID.
To fix this bug, we keep CID as u64 in the IOCTLs and virtio message
headers, but explicitly convert u64 to u32 when deal with the hash table
and vsock core.
Fixes: 834e772c8d ("vhost/vsock: fix use-after-free in network stack callers")
Link: https://github.com/stefanha/virtio/blob/vsock/trunk/content.tex
Signed-off-by: Zha Bin <zhabin@linux.alibaba.com>
Reviewed-by: Liu Jiang <gerry@linux.alibaba.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jose Abreu says:
====================
net: stmmac: Misc Fixes
Some small fixes for stmmac targeting -net. Detailed info in commit log.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, TX is given a budget which is consumed by stmmac_tx_clean()
and stmmac_rx() is given the remaining non-consumed budget.
This is wrong and in case we are sending a large number of packets this
can starve RX because remaining budget will be low.
Let's give always the same budget for RX and TX clean.
While at it, check if we missed any interrupts while we were in NAPI
callback by looking at DMA interrupt status.
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RX Watchdog can be disabled by platform definitions but currently we are
initializing the descriptors before checking if Watchdog must be
disabled or not.
Fix this by checking earlier if user wants Watchdog disabled or not.
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Check if CBS is currently supported before trying to configure it in HW.
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In DMA interrupt handler we were clearing all interrupts status, even
the ones that were not active. Fix this and only clear the active
interrupts.
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit b7d0f08e91, the enable / disable of PCI device is not
managed which will result in IO regions not being automatically unmapped.
As regions continue mapped it is currently not possible to remove and
then probe again the PCI module of stmmac.
Fix this by manually unmapping regions on remove callback.
Changes from v1:
- Fix build error
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Fixes: b7d0f08e91 ("net: stmmac: Fix WoL for PCI-based setups")
Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf 2019-01-11
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix TCP-BPF support for correctly setting the initial window
via TCP_BPF_IW on an active TFO sender, from Yuchung.
2) Fix a panic in BPF's stack_map_get_build_id()'s ELF parsing on
32 bit archs caused by page_address() returning NULL, from Song.
3) Fix BTF pretty print in kernel and bpftool when bitfield member
offset is greater than 256. Also add test cases, from Yonghong.
4) Fix improper argument handling in xdp1 sample, from Ioana.
5) Install missing tcp_server.py and tcp_client.py files from
BPF selftests, from Anders.
6) Add test_libbpf to gitignore in libbpf and BPF selftests,
from Stanislav.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Yonghong Song says:
====================
The previous BTF kind_flag support patch set introduced a bug
for kernel bpffs pretty printing and another bug for bpftool
map pretty printing. If a bitfield struct member offset is
greater than 256 bits, printed value for that struct
member will be incorrect.
- Patch #1 fixed the bug in kernel bpffs pretty printing.
- Patch #2 enhanced the test_btf test case to cover the
issue exposed by patch #1.
- Patch #3 fixed the bug in bpftool map pretty printing.
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Commit 8772c8bc09 ("tools: bpftool: support pretty print
with kind_flag set") added bpftool map dump with kind_flag
support. When bitfield_size can be retrieved directly from
btf_member, function btf_dumper_bitfield() is called to
dump the bitfield. The implementation passed the
wrong parameter "bit_offset" to the function. The excepted
value is the bit_offset within a byte while the passed-in
value is the struct member offset.
This commit fixed the bug with passing correct "bit_offset"
with adjusted data pointer.
Fixes: 8772c8bc09 ("tools: bpftool: support pretty print with kind_flag set")
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch modified test_btf pretty print test to cover
the bitfield with struct member equal to or greater 256.
Without the previous kernel patch fix, the modified test will fail:
$ test_btf -p
......
BTF pretty print array(#1)......unexpected pprint output
expected: 0: {0,0,0,0x3,0x0,0x3,{0|[0,0,0,0,0,0,0,0]},ENUM_ZERO,4,0x1}
read: 0: {0,0,0,0x3,0x0,0x3,{0|[0,0,0,0,0,0,0,0]},ENUM_ZERO,4,0x0}
BTF pretty print array(#2)......unexpected pprint output
expected: 0: {0,0,0,0x3,0x0,0x3,{0|[0,0,0,0,0,0,0,0]},ENUM_ZERO,4,0x1}
read: 0: {0,0,0,0x3,0x0,0x3,{0|[0,0,0,0,0,0,0,0]},ENUM_ZERO,4,0x0}
PASS:6 SKIP:0 FAIL:2
With the kernel fix, the modified test will succeed:
$ test_btf -p
......
BTF pretty print array(#1)......OK
BTF pretty print array(#2)......OK
PASS:8 SKIP:0 FAIL:0
Fixes: 9d5f9f701b ("bpf: btf: fix struct/union/fwd types with kind_flag")
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Commit 9d5f9f701b ("bpf: btf: fix struct/union/fwd types
with kind_flag") introduced kind_flag and used bitfield_size
in the btf_member to directly pretty print member values.
The commit contained a bug where the incorrect parameters could be
passed to function btf_bitfield_seq_show(). The bits_offset
parameter in the function expects a value less than 8.
Instead, the member offset in the structure is passed.
The below is btf_bitfield_seq_show() func signature:
void btf_bitfield_seq_show(void *data, u8 bits_offset,
u8 nr_bits, struct seq_file *m)
both bits_offset and nr_bits are u8 type. If the bitfield
member offset is greater than 256, incorrect value will
be printed.
This patch fixed the issue by calculating correct proper
data offset and bits_offset similar to non kind_flag case.
Fixes: 9d5f9f701b ("bpf: btf: fix struct/union/fwd types with kind_flag")
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
linux 5.0-rc1 shows following warning on bpi-r2/mt7623 bootup:
[ 5.170597] WARNING: CPU: 3 PID: 1 at drivers/net/phy/phy.c:548 phy_start_aneg+0x110/0x144
[ 5.178826] called from state READY
....
[ 5.264111] [<c0629fd4>] (phy_start_aneg) from [<c0e3e720>] (mtk_init+0x414/0x47c)
[ 5.271630] r7:df5f5eec r6:c0f08c48 r5:00000000 r4:dea67800
[ 5.277256] [<c0e3e30c>] (mtk_init) from [<c07dabbc>] (register_netdevice+0x98/0x51c)
[ 5.285035] r8:00000000 r7:00000000 r6:c0f97080 r5:c0f08c48 r4:dea67800
[ 5.291693] [<c07dab24>] (register_netdevice) from [<c07db06c>] (register_netdev+0x2c/0x44)
[ 5.299989] r8:00000000 r7:dea2e608 r6:deacea00 r5:dea2e604 r4:dea67800
[ 5.306646] [<c07db040>] (register_netdev) from [<c06326d8>] (mtk_probe+0x668/0x7ac)
[ 5.314336] r5:dea2e604 r4:dea2e040
[ 5.317890] [<c0632070>] (mtk_probe) from [<c05a78fc>] (platform_drv_probe+0x58/0xa8)
[ 5.325670] r10:c0f86bac r9:00000000 r8:c0fbe578 r7:00000000 r6:c0f86bac r5:00000000
[ 5.333445] r4:deacea10
[ 5.335963] [<c05a78a4>] (platform_drv_probe) from [<c05a5248>] (really_probe+0x2d8/0x424)
maybe other boards using this generic driver are affected
v2:
optimization:
- phy_set_max_speed() is only needed if you want to reduce the
max speed, typically if the PHY supports 1Gbps but the MAC
supports 100Mbps only.
- The pause parameters are autonegotiated. Except you have a specific
need you normally don't need to manually fiddle with this.
- phy_start_aneg() is called implicitly by the phylib state machine,
you shouldn't call it manually except you have a good excuse.
- netif_carrier_on/netif_carrier_off in mtk_phy_link_adjust() isn't
needed. It's done by phy_link_change() in phylib.
Signed-off-by: Frank Wunderlich <frank-w@public-files.de>
Reviewed-by: Heiner Kallweit <hkallweit1@gmail.com>
Acked-by: Sean Wang <sean.wang@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previously upon SYN timeouts the sender recomputes the txhash to
try a different path. However this does not apply on the initial
timeout of SYN-data (active Fast Open). Therefore an active IPv6
Fast Open connection may incur one second RTO penalty to take on
a new path after the second SYN retransmission uses a new flow label.
This patch removes this undesirable behavior so Fast Open changes
the flow label just like the regular connections. This also helps
avoid falsely disabling Fast Open on the sender which triggers
after two consecutive SYN timeouts on Fast Open.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The 6390 copper ports have an errata which require poking magic values
into undocumented magic registers and then performing a software
reset.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
A network device stack with multiple layers of bonding devices can
trigger a false positive lockdep warning. Adding lockdep nest levels
fixes this. Update the level on both enslave and unlink, to avoid the
following series of events ..
ip netns add test
ip netns exec test bash
ip link set dev lo addr 00:11:22:33:44:55
ip link set dev lo down
ip link add dev bond1 type bond
ip link add dev bond2 type bond
ip link set dev lo master bond1
ip link set dev bond1 master bond2
ip link set dev bond1 nomaster
ip link set dev bond2 master bond1
.. from still generating a splat:
[ 193.652127] ======================================================
[ 193.658231] WARNING: possible circular locking dependency detected
[ 193.664350] 4.20.0 #8 Not tainted
[ 193.668310] ------------------------------------------------------
[ 193.674417] ip/15577 is trying to acquire lock:
[ 193.678897] 00000000a40e3b69 (&(&bond->stats_lock)->rlock#3/3){+.+.}, at: bond_get_stats+0x58/0x290
[ 193.687851]
but task is already holding lock:
[ 193.693625] 00000000807b9d9f (&(&bond->stats_lock)->rlock#2/2){+.+.}, at: bond_get_stats+0x58/0x290
[..]
[ 193.851092] lock_acquire+0xa7/0x190
[ 193.855138] _raw_spin_lock_nested+0x2d/0x40
[ 193.859878] bond_get_stats+0x58/0x290
[ 193.864093] dev_get_stats+0x5a/0xc0
[ 193.868140] bond_get_stats+0x105/0x290
[ 193.872444] dev_get_stats+0x5a/0xc0
[ 193.876493] rtnl_fill_stats+0x40/0x130
[ 193.880797] rtnl_fill_ifinfo+0x6c5/0xdc0
[ 193.885271] rtmsg_ifinfo_build_skb+0x86/0xe0
[ 193.890091] rtnetlink_event+0x5b/0xa0
[ 193.894320] raw_notifier_call_chain+0x43/0x60
[ 193.899225] netdev_change_features+0x50/0xa0
[ 193.904044] bond_compute_features.isra.46+0x1ab/0x270
[ 193.909640] bond_enslave+0x141d/0x15b0
[ 193.913946] do_set_master+0x89/0xa0
[ 193.918016] do_setlink+0x37c/0xda0
[ 193.921980] __rtnl_newlink+0x499/0x890
[ 193.926281] rtnl_newlink+0x48/0x70
[ 193.930238] rtnetlink_rcv_msg+0x171/0x4b0
[ 193.934801] netlink_rcv_skb+0xd1/0x110
[ 193.939103] rtnetlink_rcv+0x15/0x20
[ 193.943151] netlink_unicast+0x3b5/0x520
[ 193.947544] netlink_sendmsg+0x2fd/0x3f0
[ 193.951942] sock_sendmsg+0x38/0x50
[ 193.955899] ___sys_sendmsg+0x2ba/0x2d0
[ 193.960205] __x64_sys_sendmsg+0xad/0x100
[ 193.964687] do_syscall_64+0x5a/0x460
[ 193.968823] entry_SYSCALL_64_after_hwframe+0x49/0xbe
Fixes: 7e2556e400 ("bonding: avoid lockdep confusion in bond_get_stats()")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As Naresh reported, test_stacktrace_build_id() causes panic on i386 and
arm32 systems. This is caused by page_address() returns NULL in certain
cases.
This patch fixes this error by using kmap_atomic/kunmap_atomic instead
of page_address.
Fixes: 615755a77b (" bpf: extend stackmap to save binary_build_id+offset instead of address")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
When test_tcpbpf_user runs it complains that it can't find files
tcp_server.py and tcp_client.py.
Rework so that tcp_server.py and tcp_client.py gets installed, added them
to the variable TEST_PROGS_EXTENDED.
Fixes: d6d4f60c3a ("bpf: add selftest for tcpbpf")
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Use optind as index for argv instead of a hardcoded value.
When the program has options this leads to improper parameter handling.
Fixes: dc378a1ab5 ("samples: bpf: get ifindex from ifname")
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Acked-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
We build test_libbpf with CXX to make sure linking against C++ works.
$ make -s -C tools/lib/bpf
$ git status -sb
? tools/lib/bpf/test_libbpf
$ make -s -C tools/testing/selftests/bpf
$ git status -sb
? tools/lib/bpf/test_libbpf
? tools/testing/selftests/bpf/test_libbpf
Fixes: 8c4905b995 ("libbpf: make sure bpf headers are c++ include-able")
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
There are some lines that have indentation issues, fix these.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are handful of lines that have indentation issues, fix these.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 2efd4fca70 ("ip: in cmsg IP(V6)_ORIGDSTADDR call
pskb_may_pull") avoided a read beyond the end of the skb linear
segment by calling pskb_may_pull.
That function can trigger a BUG_ON in pskb_expand_head if the skb is
shared, which it is when when peeking. It can also return ENOMEM.
Avoid both by switching to safer skb_header_pointer.
Fixes: 2efd4fca70 ("ip: in cmsg IP(V6)_ORIGDSTADDR call pskb_may_pull")
Reported-by: syzbot <syzkaller@googlegroups.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
BUG: unable to handle kernel NULL pointer dereference at 00000000000000d1
Call Trace:
? napi_gro_frags+0xa7/0x2c0
tun_get_user+0xb50/0xf20
tun_chr_write_iter+0x53/0x70
new_sync_write+0xff/0x160
vfs_write+0x191/0x1e0
__x64_sys_write+0x5e/0xd0
do_syscall_64+0x47/0xf0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
I think there is a subtle race between sending a packet via tap and
attaching it:
CPU0: CPU1:
tun_chr_ioctl(TUNSETIFF)
tun_set_iff
tun_attach
rcu_assign_pointer(tfile->tun, tun);
tun_fops->write_iter()
tun_chr_write_iter
tun_napi_alloc_frags
napi_get_frags
napi->skb = napi_alloc_skb
tun_napi_init
netif_napi_add
napi->skb = NULL
napi->skb is NULL here
napi_gro_frags
napi_frags_skb
skb = napi->skb
skb_reset_mac_header(skb)
panic()
Move rcu_assign_pointer(tfile->tun) and rcu_assign_pointer(tun->tfiles) to
be the last thing we do in tun_attach(); this should guarantee that when we
call tun_get() we always get an initialized object.
v2 changes:
* remove extra napi_mutex locks/unlocks for napi operations
Reported-by: syzbot <syzkaller@googlegroups.com>
Fixes: 90e33d4594 ("tun: enable napi_gro_frags() for TUN/TAP driver")
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The existing BPF TCP initial congestion window (TCP_BPF_IW) does not
to work on (active) Fast Open sender. This is because it changes the
(initial) window only if data_segs_out is zero -- but data_segs_out
is also incremented on SYN-data. This patch fixes the issue by
proerly accounting for SYN-data additionally.
Fixes: fc7478103c ("bpf: Adds support for setting initial cwnd")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
'dev' is non NULL when the addr_len check triggers so it must goto a label
that does the dev_put otherwise dev will have a leaked refcount.
This bug causes the ib_ipoib module to become unloadable when using
systemd-network as it triggers this check on InfiniBand links.
Fixes: 99137b7888 ("packet: validate address length")
Reported-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf 2019-01-08
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix BSD'ism in sendmsg(2) to rewrite unspecified IPv6 dst for
unconnected UDP sockets with [::1] _after_ cgroup BPF invocation,
from Andrey.
2) Follow-up fix to the speculation fix where we need to reject a
corner case for sanitation when ptr and scalars are mixed in the
same alu op. Also, some unrelated minor doc fixes, from Daniel.
3) Fix BPF kselftest's incorrect uses of create_and_get_cgroup()
by not assuming fd of zero value to be the result of an error
case, from Stanislav.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a VLAN on a bridge port, delete it and make sure the PVID VLAN is
not affected.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a VLAN is deleted from a bridge port we should not change the PVID
unless the deleted VLAN is the PVID.
Fixes: fe9ccc785d ("mlxsw: spectrum_switchdev: Don't batch VLAN operations")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When running the test on the Spectrum ASIC the generated packets are
counted on the ingress filter and injected back to the pipeline because
of the 'pass' action. The router block then drops the packets due to
checksum error, as the test generates packets with zero checksum.
When running the test on an emulator that is not as strict about
checksum errors the test fails since packets are counted twice. Once by
the emulated ASIC on its ingress filter and again by the kernel as the
emulator does not perform checksum validation and allows the packets to
be trapped by a matching host route.
Fix this by changing the action to 'drop', which will prevent the packet
from continuing further in the pipeline to the router block.
For veth pairs this change is essentially a NOP given packets are only
processed once (by the kernel).
Fixes: a0b61f3d8e ("selftests: forwarding: vxlan_bridge_1d: Add an ECN decap test")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When adding / deleting VLANs to / from a bridge port, the bridge driver
first tries to propagate the information via switchdev and falls back to
the 8021q driver in case the underlying driver does not support
switchdev. This can result in a memory leak [1] when VXLAN and mlxsw
ports are enslaved to the bridge:
$ ip link set dev vxlan0 master br0
# No mlxsw ports are enslaved to 'br0', so mlxsw ignores the switchdev
# notification and the bridge driver adds the VLAN on 'vxlan0' via the
# 8021q driver
$ bridge vlan add vid 10 dev vxlan0 pvid untagged
# mlxsw port is enslaved to the bridge
$ ip link set dev swp1 master br0
# mlxsw processes the switchdev notification and the 8021q driver is
# skipped
$ bridge vlan del vid 10 dev vxlan0
This results in 'struct vlan_info' and 'struct vlan_vid_info' being
leaked, as they were allocated by the 8021q driver during VLAN addition,
but never freed as the 8021q driver was skipped during deletion.
Fix this by introducing a new VLAN private flag that indicates whether
the VLAN was added on the port by switchdev or the 8021q driver. If the
VLAN was added by the 8021q driver, then we make sure to delete it via
the 8021q driver as well.
[1]
unreferenced object 0xffff88822d20b1e8 (size 256):
comm "bridge", pid 2532, jiffies 4295216998 (age 1188.830s)
hex dump (first 32 bytes):
e0 42 97 ce 81 88 ff ff 00 00 00 00 00 00 00 00 .B..............
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<00000000f82d851d>] kmem_cache_alloc_trace+0x1be/0x330
[<00000000e0178b02>] vlan_vid_add+0x661/0x920
[<00000000218ebd5f>] __vlan_add+0x1be9/0x3a00
[<000000006eafa1ca>] nbp_vlan_add+0x8b3/0xd90
[<000000003535392c>] br_vlan_info+0x132/0x410
[<00000000aedaa9dc>] br_afspec+0x75c/0x870
[<00000000f5716133>] br_setlink+0x3dc/0x6d0
[<00000000aceca5e2>] rtnl_bridge_setlink+0x615/0xb30
[<00000000a2f2d23e>] rtnetlink_rcv_msg+0x3a3/0xa80
[<0000000064097e69>] netlink_rcv_skb+0x152/0x3c0
[<000000008be8d614>] rtnetlink_rcv+0x21/0x30
[<000000009ab2ca25>] netlink_unicast+0x52f/0x740
[<00000000e7d9ac96>] netlink_sendmsg+0x9c7/0xf50
[<000000005d1e2050>] sock_sendmsg+0xbe/0x120
[<00000000d51426bc>] ___sys_sendmsg+0x778/0x8f0
[<00000000b9d7b2cc>] __sys_sendmsg+0x112/0x270
unreferenced object 0xffff888227454308 (size 32):
comm "bridge", pid 2532, jiffies 4295216998 (age 1188.882s)
hex dump (first 32 bytes):
88 b2 20 2d 82 88 ff ff 88 b2 20 2d 82 88 ff ff .. -...... -....
81 00 0a 00 01 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<00000000f82d851d>] kmem_cache_alloc_trace+0x1be/0x330
[<0000000018050631>] vlan_vid_add+0x3e6/0x920
[<00000000218ebd5f>] __vlan_add+0x1be9/0x3a00
[<000000006eafa1ca>] nbp_vlan_add+0x8b3/0xd90
[<000000003535392c>] br_vlan_info+0x132/0x410
[<00000000aedaa9dc>] br_afspec+0x75c/0x870
[<00000000f5716133>] br_setlink+0x3dc/0x6d0
[<00000000aceca5e2>] rtnl_bridge_setlink+0x615/0xb30
[<00000000a2f2d23e>] rtnetlink_rcv_msg+0x3a3/0xa80
[<0000000064097e69>] netlink_rcv_skb+0x152/0x3c0
[<000000008be8d614>] rtnetlink_rcv+0x21/0x30
[<000000009ab2ca25>] netlink_unicast+0x52f/0x740
[<00000000e7d9ac96>] netlink_sendmsg+0x9c7/0xf50
[<000000005d1e2050>] sock_sendmsg+0xbe/0x120
[<00000000d51426bc>] ___sys_sendmsg+0x778/0x8f0
[<00000000b9d7b2cc>] __sys_sendmsg+0x112/0x270
Fixes: d70e42b22d ("mlxsw: spectrum: Enable VxLAN enslavement to VLAN-aware bridges")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Cc: Roopa Prabhu <roopa@cumulusnetworks.com>
Cc: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Cc: bridge@lists.linux-foundation.org
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a test case for the issue fixed by previous commit. In case the
offloading of an unsupported VxLAN tunnel was triggered by adding the
mapped VLAN to a local port, then error should be returned to the user.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adding a VLAN on a port can trigger the offload of a VXLAN tunnel which
is already a member in the VLAN. In case the configuration of the VXLAN
is not supported, the driver would return -EOPNOTSUPP.
This is problematic since bridge code does not interpret this as error,
but rather that it should try to setup the VLAN using the 8021q driver
instead of switchdev.
Fixes: d70e42b22d ("mlxsw: spectrum: Enable VxLAN enslavement to VLAN-aware bridges")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Drivers are not supposed to return errors in switchdev commit phase if
they returned OK in prepare phase. Otherwise, a WARNING is emitted.
However, when the offloading of a VXLAN tunnel is triggered by the
addition of a VLAN on a local port, it is not possible to guarantee that
the commit phase will succeed without doing a lot of work.
In these cases, the artificial division between prepare and commit phase
does not make sense, so simply do the work in the prepare phase.
Fixes: d70e42b22d ("mlxsw: spectrum: Enable VxLAN enslavement to VLAN-aware bridges")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When VXLAN is a loadable module, MLXSW_SPECTRUM must not be built-in:
drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c:2547: undefined
reference to `vxlan_fdb_find_uc'
Add Kconfig dependency to enforce usable configurations.
Fixes: 1231e04f5b ("mlxsw: spectrum_switchdev: Add support for VxLAN encapsulation")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: kbuild test robot <lkp@intel.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make sure that lag port TX is disabled before mlxsw_sp_port_lag_leave()
is called and prevent from possible EMAD error.
Fixes: 0d65fc1304 ("mlxsw: spectrum: Implement LAG port join/leave")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Removal of the mlxsw driver on Spectrum-2 platforms hits an ASSERT_RTNL()
in Spectrum-2 ACL Bloom filter and in ERP removal paths. This happens
because the multicast router implementation in Spectrum-2 relies on ACLs.
Taking the RTNL lock upon driver removal is useless since the driver first
removes its ports and unregisters from notifiers so concurrent writes
cannot happen at that time. The assertions were originally put as a
reminder for future work involving ERP background optimization, but having
these assertions only during addition serves this purpose as well.
Therefore remove the ASSERT_RTNL() in both places related to ERP and Bloom
filter removal.
Fixes: cf7221a4f5 ("mlxsw: spectrum_router: Add Multicast routing support for Spectrum-2")
Signed-off-by: Nir Dotan <nird@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When writing to C-TCAM, mlxsw driver uses cregion->ops->entry_insert().
In case of C-TCAM HW insertion error, the opposite action should take
place.
Add error handling case in which the C-TCAM region entry is removed, by
calling cregion->ops->entry_remove().
Fixes: a0a777b940 ("mlxsw: spectrum_acl: Start using A-TCAM")
Signed-off-by: Nir Dotan <nird@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This soft dependency works around an issue where sometimes the genphy
driver is used instead of the dedicated PHY driver. The root cause of
the issue isn't clear yet. People reported the unloading/re-loading
module r8169 helps, and also configuring this soft dependency in
the modprobe config files. Important just seems to be that the
realtek module is loaded before r8169.
Once this has been applied preliminary fix 38af4b903210 ("net: phy:
add workaround for issue where PHY driver doesn't bind to the device")
will be removed.
Fixes: f1e911d5d0 ("r8169: add basic phylib support")
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It has been noticed that some phys do not have the registers
required by the previous implementation.
To fix this, instead of using phy_read, the required information
is extracted from the phy_device structure.
fixes: 23f0703c12 ("lan743x: Add main source files for new lan743x driver")
Signed-off-by: Bryan Whitehead <Bryan.Whitehead@microchip.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ioctl command is read/write (or just read, if the fact that user space
writes n_samples field is ignored).
Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Otherwise it is impossible to use it for something else, as it will break
userspace that puts garbage there.
The same check should be done in other structures, but the fact that
data in reserved fields is ignored is already part of the kernel ABI.
Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf 2019-01-08
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix BSD'ism in sendmsg(2) to rewrite unspecified IPv6 dst for
unconnected UDP sockets with [::1] _after_ cgroup BPF invocation,
from Andrey.
2) Follow-up fix to the speculation fix where we need to reject a
corner case for sanitation when ptr and scalars are mixed in the
same alu op. Also, some unrelated minor doc fixes, from Daniel.
3) Fix BPF kselftest's incorrect uses of create_and_get_cgroup()
by not assuming fd of zero value to be the result of an error
case, from Stanislav.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
Two trivial doc follow-ups to i) remove deprecated kern_version
mentioning in the design qa and ii) to mention stand-alone build
and license of libbpf. Thanks!
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>