linux/net/ipv4
Xin Long 202f59afd4 netfilter: ipt_CLUSTERIP: do not hold dev
It's a terrible thing to hold dev in iptables target. When the dev is
being removed, unregister_netdevice has to wait for the dev to become
free. dmesg will keep logging the err:

  kernel:unregister_netdevice: waiting for veth0_in to become free. \
  Usage count = 1

until iptables rules with this target are removed manually.

The worse thing is when deleting a netns, a virtual nic will be deleted
instead of reset to init_net in default_device_ops exit/exit_batch. As
it is earlier than to flush the iptables rules in iptable_filter_net_ops
exit, unregister_netdevice will block to wait for the nic to become free.

As unregister_netdevice is actually waiting for iptables rules flushing
while iptables rules have to be flushed after unregister_netdevice. This
'dead lock' will cause unregister_netdevice to block there forever. As
the netns is not available to operate at that moment, iptables rules can
not even be flushed manually either.

The reproducer can be:

  # ip netns add test
  # ip link add veth0_in type veth peer name veth0_out
  # ip link set veth0_in netns test
  # ip netns exec test ip link set lo up
  # ip netns exec test ip link set veth0_in up
  # ip netns exec test iptables -I INPUT -d 1.2.3.4 -i veth0_in -j \
    CLUSTERIP --new --clustermac 89:d4:47:eb:9a:fa --total-nodes 3 \
    --local-node 1 --hashmode sourceip-sourceport
  # ip netns del test

This issue can be triggered by all virtual nics with ipt_CLUSTERIP.

This patch is to fix it by not holding dev in ipt_CLUSTERIP, but saving
the dev->ifindex instead of the dev.

As Pablo Neira Ayuso's suggestion, it will refresh c->ifindex and dev's
mc by registering a netdevice notifier, just as what xt_TEE does. So it
removes the old codes updating dev's mc, and also no need to initialize
c->ifindex with dev->ifindex.

But as one config can be shared by more than one targets, and the netdev
notifier is per config, not per target. It couldn't get e->ip.iniface
in the notifier handler. So e->ip.iniface has to be saved into config.

Note that for backwards compatibility, this patch doesn't remove the
codes checking if the dev exists before creating a config.

v1->v2:
  - As Pablo Neira Ayuso's suggestion, register a netdevice notifier to
    manage c->ifindex and dev's mc.

Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-06-19 19:06:20 +02:00
..
netfilter netfilter: ipt_CLUSTERIP: do not hold dev 2017-06-19 19:06:20 +02:00
af_inet.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2017-05-02 16:40:27 -07:00
ah4.c IPsec: do not ignore crypto err in ah4 input 2017-01-16 12:57:48 +01:00
arp.c arp: always override existing neigh entries with gratuitous ARP 2017-05-21 13:26:45 -04:00
cipso_ipv4.c netlabel: out of bound access in cipso_v4_validate() 2017-02-04 19:44:22 -05:00
datagram.c
devinet.c net: rtnetlink: plumb extended ack to doit function 2017-04-17 15:35:38 -04:00
esp4_offload.c esp4/6: Fix GSO path for non-GSO SW-crypto packets 2017-04-19 07:48:57 +02:00
esp4.c net/esp4: Fix invalid esph pointer crash 2017-05-01 14:58:50 -04:00
fib_frontend.c net: ipv4: Add extack messages for route add failures 2017-05-22 12:12:20 -04:00
fib_lookup.h net: ipv4: Plumb extack through route add functions 2017-05-22 12:12:19 -04:00
fib_notifier.c ipv4: fib: Remove redundant argument 2017-03-10 09:45:09 -08:00
fib_rules.c ipv4: fib_rules: Dump FIB rules when registering FIB notifier 2017-03-16 10:18:34 -07:00
fib_semantics.c net: ipv4: Add extack messages for route add failures 2017-05-22 12:12:20 -04:00
fib_trie.c net: ipv4: Plumb extack through route add functions 2017-05-22 12:12:19 -04:00
fou.c fou: make local function static 2017-05-21 13:42:36 -04:00
gre_demux.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-06-30 05:03:36 -04:00
gre_offload.c net: add recursion limit to GRO 2016-10-20 14:32:22 -04:00
icmp.c net: ipv4: add support for ECMP hash policy choice 2017-03-21 15:27:19 -07:00
igmp.c igmp, mld: Fix memory leak in igmpv3/mld_del_delrec() 2017-02-09 16:43:45 -05:00
inet_connection_sock.c inet: fix warning about missing prototype 2017-05-21 13:42:36 -04:00
inet_diag.c tcp: remove early retransmit 2017-01-13 22:37:16 -05:00
inet_fragment.c net: disable fragment reassembly if high_thresh is zero 2016-06-05 22:56:42 -04:00
inet_hashtables.c treewide: use kv[mz]alloc* rather than opencoded variants 2017-05-08 17:15:13 -07:00
inet_timewait_sock.c ipv4: Namespaceify tcp_tw_recycle and tcp_max_tw_buckets knob 2016-12-29 11:38:31 -05:00
inetpeer.c net: Add helper function to compare inetpeer addresses 2015-08-28 13:32:36 -07:00
ip_forward.c ipv4: allow local fragmentation in ip_finish_output_gso() 2016-11-03 16:10:26 -04:00
ip_fragment.c inet: frag: release spinlock before calling icmp_send() 2017-03-22 15:40:45 -07:00
ip_gre.c ip_tunnel: Allow policy-based routing through tunnels 2017-04-21 13:21:31 -04:00
ip_input.c net: Add sysctl to toggle early demux for tcp and udp 2017-03-24 13:17:07 -07:00
ip_options.c Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
ip_output.c udp: avoid ufo handling on IP payload compression packets 2017-03-09 18:28:42 -08:00
ip_sockglue.c ipv4: get rid of ip_ra_lock 2017-04-30 22:44:04 -04:00
ip_tunnel_core.c netlink: pass extended ACK struct to parsing functions 2017-04-13 13:58:22 -04:00
ip_tunnel.c ip_tunnel: Allow policy-based routing through tunnels 2017-04-21 13:21:31 -04:00
ip_vti.c vti: check nla_put_* return value 2017-05-08 15:10:31 -04:00
ipcomp.c
ipconfig.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-04-06 08:24:51 -07:00
ipip.c ip_tunnel: Allow policy-based routing through tunnels 2017-04-21 13:21:31 -04:00
ipmr.c ipmr: vrf: Find VIFs using the actual device 2017-05-16 12:52:17 -04:00
Kconfig Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next 2017-02-16 21:25:49 -05:00
Makefile ipv4: fib: Move FIB notification code to a separate file 2017-03-10 09:45:09 -08:00
netfilter.c netfilter: use skb_to_full_sk in ip_route_me_harder 2017-02-28 12:49:36 +01:00
ping.c ping: implement proper locking 2017-03-24 20:50:28 -07:00
proc.c net/tcp_fastopen: Add snmp counter for blackhole detection 2017-04-24 14:27:17 -04:00
protocol.c net: Add sysctl to toggle early demux for tcp and udp 2017-03-24 13:17:07 -07:00
raw_diag.c net: ip, raw_diag -- Use jump for exiting from nested loop 2016-11-03 15:25:26 -04:00
raw.c ipv4, ipv6: ensure raw socket message is big enough to hold an IP header 2017-05-04 11:02:46 -04:00
route.c Revert "ipv4: restore rt->fi for reference counting" 2017-05-08 22:35:32 -04:00
syncookies.c tcp: switch TCP TS option (RFC 7323) to 1ms clock 2017-05-17 16:06:01 -04:00
sysctl_net_ipv4.c net/tcp_fastopen: Disable active side TFO in certain scenarios 2017-04-24 14:27:17 -04:00
tcp_bbr.c tcp: switch TCP TS option (RFC 7323) to 1ms clock 2017-05-17 16:06:01 -04:00
tcp_bic.c tcp: bic, cubic: use tcp_jiffies32 instead of tcp_time_stamp 2017-05-17 16:06:01 -04:00
tcp_cdg.c sched/headers: Prepare for new header dependencies before moving code to <linux/sched/clock.h> 2017-03-02 08:42:27 +01:00
tcp_cong.c tcp: memset ca_priv data to 0 properly 2017-04-26 14:58:32 -04:00
tcp_cubic.c tcp: bic, cubic: use tcp_jiffies32 instead of tcp_time_stamp 2017-05-17 16:06:01 -04:00
tcp_dctcp.c Revert "dctcp: update cwnd on congestion event" 2016-12-06 11:34:24 -05:00
tcp_diag.c net: diag: Fix refcnt leak in error path destroying socket 2016-08-23 23:11:36 -07:00
tcp_fastopen.c net/tcp_fastopen: Add snmp counter for blackhole detection 2017-04-24 14:27:17 -04:00
tcp_highspeed.c tcp: add cwnd_undo functions to various tcp cc algorithms 2016-11-21 13:20:17 -05:00
tcp_htcp.c tcp: replace misc tcp_time_stamp to tcp_jiffies32 2017-05-17 16:06:01 -04:00
tcp_hybla.c tcp: make undo_cwnd mandatory for congestion modules 2016-11-21 13:20:17 -05:00
tcp_illinois.c tcp: add cwnd_undo functions to various tcp cc algorithms 2016-11-21 13:20:17 -05:00
tcp_input.c tcp: warn on negative reordering values 2017-05-19 16:55:46 -04:00
tcp_ipv4.c tcp: switch TCP TS option (RFC 7323) to 1ms clock 2017-05-17 16:06:01 -04:00
tcp_lp.c tcp: switch TCP TS option (RFC 7323) to 1ms clock 2017-05-17 16:06:01 -04:00
tcp_metrics.c tcp: use tcp_jiffies32 to feed tp->snd_cwnd_stamp 2017-05-17 16:06:01 -04:00
tcp_minisocks.c tcp: switch TCP TS option (RFC 7323) to 1ms clock 2017-05-17 16:06:01 -04:00
tcp_nv.c tcpnv: do not export local function 2017-05-21 13:42:36 -04:00
tcp_offload.c gso: Support partial splitting at the frag_list pointer 2016-09-19 20:59:34 -04:00
tcp_output.c tcp: switch TCP TS option (RFC 7323) to 1ms clock 2017-05-17 16:06:01 -04:00
tcp_probe.c tcp: Revert "tcp: tcp_probe: use spin_lock_bh()" 2017-02-21 13:26:03 -05:00
tcp_rate.c tcp: switch TCP TS option (RFC 7323) to 1ms clock 2017-05-17 16:06:01 -04:00
tcp_recovery.c tcp: switch TCP TS option (RFC 7323) to 1ms clock 2017-05-17 16:06:01 -04:00
tcp_scalable.c tcp: add cwnd_undo functions to various tcp cc algorithms 2016-11-21 13:20:17 -05:00
tcp_timer.c tcp: fix tcp_probe_timer() for TCP_USER_TIMEOUT 2017-05-21 13:50:34 -04:00
tcp_vegas.c tcp: make undo_cwnd mandatory for congestion modules 2016-11-21 13:20:17 -05:00
tcp_vegas.h tcp: replace cnt & rtt with struct in pkts_acked() 2016-05-11 14:43:19 -04:00
tcp_veno.c tcp: add cwnd_undo functions to various tcp cc algorithms 2016-11-21 13:20:17 -05:00
tcp_westwood.c tcp_westwood: use tcp_jiffies32 instead of tcp_time_stamp 2017-05-17 16:06:01 -04:00
tcp_yeah.c tcp: add cwnd_undo functions to various tcp cc algorithms 2016-11-21 13:20:17 -05:00
tcp.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-05-22 23:32:48 -04:00
tunnel4.c tunnels: correct conditional build of MPLS and IPv6 2016-07-11 13:27:06 -07:00
udp_diag.c net: inet: diag: expose the socket mark to privileged processes. 2016-09-08 16:13:09 -07:00
udp_impl.h udp: make *udp*_queue_rcv_skb() functions static 2017-05-18 10:23:33 -04:00
udp_offload.c udp: disable inner UDP checksum offloads in IPsec case 2017-04-24 13:48:54 -04:00
udp_tunnel.c net: Remove deprecated tunnel specific UDP offload functions 2016-06-17 20:23:32 -07:00
udp.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-05-18 16:11:32 -04:00
udplite.c udplite: call proper backlog handlers 2016-11-24 15:32:14 -05:00
xfrm4_input.c esp: Add a software GRO codepath 2017-02-15 11:04:11 +01:00
xfrm4_mode_beet.c
xfrm4_mode_transport.c xfrm: Add encapsulation header offsets while SKB is not encrypted 2017-04-14 10:07:39 +02:00
xfrm4_mode_tunnel.c xfrm: Add encapsulation header offsets while SKB is not encrypted 2017-04-14 10:07:39 +02:00
xfrm4_output.c xfrm: Add an IPsec hardware offloading API 2017-04-14 10:06:10 +02:00
xfrm4_policy.c xfrm: policy: make policy backend const 2017-02-09 10:22:19 +01:00
xfrm4_protocol.c xfrm: input: constify xfrm_input_afinfo 2017-02-09 10:22:17 +01:00
xfrm4_state.c xfrm: remove unused function 2017-01-10 10:57:12 +01:00
xfrm4_tunnel.c