linux

Author	SHA1	Message	Date
David S. Miller	9fa684ec86	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains a larger than usual batch of Netfilter fixes for your net tree. This series contains a mixture of old bugs and recently introduced bugs, they are: 1) Fix a crash when using nft_dynset with nft_set_rbtree, which doesn't support the set element updates from the packet path. From Liping Zhang. 2) Fix leak when nft_expr_clone() fails, from Liping Zhang. 3) Fix a race when inserting new elements to the set hash from the packet path, also from Liping. 4) Handle segmented TCP SIP packets properly, basically avoid that the INVITE in the allow header create bogus expectations by performing stricter SIP message parsing, from Ulrich Weber. 5) nft_parse_u32_check() should return signed integer for errors, from John Linville. 6) Fix wrong allocation instead of connlabels, allocate 16 instead of 32 bytes, from Florian Westphal. 7) Fix compilation breakage when building the ip_vs_sync code with CONFIG_OPTIMIZE_INLINING on x86, from Arnd Bergmann. 8) Destroy the new set if the transaction object cannot be allocated, also from Liping Zhang. 9) Use device to route duplicated packets via nft_dup only when set by the user, otherwise packets may not follow the right route, again from Liping. 10) Fix wrong maximum genetlink attribute definition in IPVS, from WANG Cong. 11) Ignore untracked conntrack objects from xt_connmark, from Florian Westphal. 12) Allow to use conntrack helpers that are registered NFPROTO_UNSPEC via CT target, otherwise we cannot use the h.245 helper, from Florian. 13) Revisit garbage collection heuristic in the new workqueue-based timer approach for conntrack to evict objects earlier, again from Florian. 14) Fix crash in nf_tables when inserting an element into a verdict map, from Liping Zhang. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-09 20:38:18 -05:00
Davide Caratti	0e54d2179f	netfilter: conntrack: simplify init/uninit of L4 protocol trackers modify registration and deregistration of layer-4 protocol trackers to facilitate inclusion of new elements into the current list of builtin protocols. Both builtin (TCP, UDP, ICMP) and non-builtin (DCCP, GRE, SCTP, UDPlite) layer-4 protocol trackers usually register/deregister themselves using consecutive calls to nf_ct_l4proto_{,pernet}_{,un}register(...). This sequence is interrupted and rolled back in case of error; in order to simplify addition of builtin protocols, the input of the above functions has been modified to allow registering/unregistering multiple protocols. Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-11-09 23:49:25 +01:00
Sebastian Andrzej Siewior	a4fc1bfc42	net/flowcache: Convert to hotplug state machine Install the callbacks via the state machine. Use multi state support to avoid custom list handling for the multiple instances. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: netdev@vger.kernel.org Cc: rt@linutronix.de Cc: "David S. Miller" <davem@davemloft.net> Link: http://lkml.kernel.org/r/20161103145021.28528-10-bigeasy@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2016-11-09 23:45:28 +01:00
Liping Zhang	4e24877e61	netfilter: nf_tables: simplify the basic expressions' init routine Some basic expressions are built into nf_tables.ko, such as nft_cmp, nft_lookup, nft_range and so on. But these basic expressions' init routine is a little ugly, too many goto errX labels, and we forget to call nft_range_module_exit in the exit routine, although it is harmless. Acctually, the init and exit routines of these basic expressions are same, i.e. do nft_register_expr in the init routine and do nft_unregister_expr in the exit routine. So it's better to arrange them into an array and deal with them together. Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-11-09 23:42:23 +01:00
Hadar Hen Zion	24ba898d43	net/dst: Add dst port to dst_metadata utility functions Add dst port parameter to __ip_tun_set_dst and __ipv6_tun_set_dst utility functions. Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-09 13:41:54 -05:00
Hadar Hen Zion	f4d997fd61	net/sched: cls_flower: Add UDP port to tunnel parameters The current IP tunneling classification supports only IP addresses and key. Enhance UDP based IP tunneling classification parameters by adding UDP src and dst port. Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-09 13:41:54 -05:00
Hadar Hen Zion	9ba6a9a9f7	flow_dissector: Add enums for encapsulation keys New encapsulation keys were added to the flower classifier, which allow classification according to outer (encapsulation) headers attributes such as key and IP addresses. In order to expose those attributes outside flower, add corresponding enums in the flow dissector. Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-09 13:41:54 -05:00
Hadar Hen Zion	9ce183b4c4	net/sched: act_tunnel_key: add helper inlines to access tcf_tunnel_key Needed for drivers to pick the relevant action when offloading tunnel key act. Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-09 13:41:53 -05:00
Jesper Dangaard Brouer	d0a81f67cd	net: make default TX queue length a defined constant The default TX queue length of Ethernet devices have been a magic constant of 1000, ever since the initial git import. Looking back in historical trees[1][2] the value used to be 100, with the same comment "Ethernet wants good queues". The commit[3] that changed this from 100 to 1000 didn't describe why, but from conversations with Robert Olsson it seems that it was changed when Ethernet devices went from 100Mbit/s to 1Gbit/s, because the link speed increased x10 the queue size were also adjusted. This value later caused much heartache for the bufferbloat community. This patch merely moves the value into a defined constant. [1] https://git.kernel.org/cgit/linux/kernel/git/davem/netdev-vger-cvs.git/ [2] https://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/ [3] https://git.kernel.org/tglx/history/c/98921832c232 Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-07 20:15:55 -05:00
Paolo Abeni	7c13f97ffd	udp: do fwd memory scheduling on dequeue A new argument is added to __skb_recv_datagram to provide an explicit skb destructor, invoked under the receive queue lock. The UDP protocol uses such argument to perform memory reclaiming on dequeue, so that the UDP protocol does not set anymore skb->desctructor. Instead explicit memory reclaiming is performed at close() time and when skbs are removed from the receive queue. The in kernel UDP protocol users now need to call a skb_recv_udp() variant instead of skb_recv_datagram() to properly perform memory accounting on dequeue. Overall, this allows acquiring only once the receive queue lock on dequeue. Tested using pktgen with random src port, 64 bytes packet, wire-speed on a 10G link as sender and udp_sink as the receiver, using an l4 tuple rxhash to stress the contention, and one or more udp_sink instances with reuseport. nr sinks vanilla patched 1 440 560 3 2150 2300 6 3650 3800 9 4450 4600 12 6250 6450 v1 -> v2: - do rmem and allocated memory scheduling under the receive lock - do bulk scheduling in first_packet_length() and in udp_destruct_sock() - avoid the typdef for the dequeue callback Suggested-by: Eric Dumazet <edumazet@google.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-07 13:24:41 -05:00
Paolo Abeni	ad959036a7	net/sock: add an explicit sk argument for ip_cmsg_recv_offset() So that we can use it even after orphaining the skbuff. Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-07 13:24:41 -05:00
Lorenzo Colitti	e2d118a1cb	net: inet: Support UID-based routing in IP protocols. - Use the UID in routing lookups made by protocol connect() and sendmsg() functions. - Make sure that routing lookups triggered by incoming packets (e.g., Path MTU discovery) take the UID of the socket into account. - For packets not associated with a userspace socket, (e.g., ping replies) use UID 0 inside the user namespace corresponding to the network namespace the socket belongs to. This allows all namespaces to apply routing and iptables rules to kernel-originated traffic in that namespaces by matching UID 0. This is better than using the UID of the kernel socket that is sending the traffic, because the UID of kernel sockets created at namespace creation time (e.g., the per-processor ICMP and TCP sockets) is the UID of the user that created the socket, which might not be mapped in the namespace. Tested: compiles allnoconfig, allyesconfig, allmodconfig Tested: https://android-review.googlesource.com/253302 Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-04 14:45:23 -04:00
Lorenzo Colitti	622ec2c9d5	net: core: add UID to flows, rules, and routes - Define a new FIB rule attributes, FRA_UID_RANGE, to describe a range of UIDs. - Define a RTA_UID attribute for per-UID route lookups and dumps. - Support passing these attributes to and from userspace via rtnetlink. The value INVALID_UID indicates no UID was specified. - Add a UID field to the flow structures. Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-04 14:45:23 -04:00
Lorenzo Colitti	86741ec254	net: core: Add a UID field to struct sock. Protocol sockets (struct sock) don't have UIDs, but most of the time, they map 1:1 to userspace sockets (struct socket) which do. Various operations such as the iptables xt_owner match need access to the "UID of a socket", and do so by following the backpointer to the struct socket. This involves taking sk_callback_lock and doesn't work when there is no socket because userspace has already called close(). Simplify this by adding a sk_uid field to struct sock whose value matches the UID of the corresponding struct socket. The semantics are as follows: 1. Whenever sk_socket is non-null: sk_uid is the same as the UID in sk_socket, i.e., matches the return value of sock_i_uid. Specifically, the UID is set when userspace calls socket(), fchown(), or accept(). 2. When sk_socket is NULL, sk_uid is defined as follows: - For a socket that no longer has a sk_socket because userspace has called close(): the previous UID. - For a cloned socket (e.g., an incoming connection that is established but on which userspace has not yet called accept): the UID of the socket it was cloned from. - For a socket that has never had an sk_socket: UID 0 inside the user namespace corresponding to the network namespace the socket belongs to. Kernel sockets created by sock_create_kern are a special case of #1 and sk_uid is the user that created them. For kernel sockets created at network namespace creation time, such as the per-processor ICMP and TCP sockets, this is the user that created the network namespace. Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-04 14:45:22 -04:00
Eric Dumazet	c3f24cfb3e	dccp: do not release listeners too soon Andrey Konovalov reported following error while fuzzing with syzkaller : IPv4: Attempt to release alive inet socket ffff880068e98940 kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] SMP KASAN Modules linked in: CPU: 1 PID: 3905 Comm: a.out Not tainted 4.9.0-rc3+ #333 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 task: ffff88006b9e0000 task.stack: ffff880068770000 RIP: 0010:[<ffffffff819ead5f>] [<ffffffff819ead5f>] selinux_socket_sock_rcv_skb+0xff/0x6a0 security/selinux/hooks.c:4639 RSP: 0018:ffff8800687771c8 EFLAGS: 00010202 RAX: ffff88006b9e0000 RBX: 1ffff1000d0eee3f RCX: 1ffff1000d1d312a RDX: 1ffff1000d1d31a6 RSI: dffffc0000000000 RDI: 0000000000000010 RBP: ffff880068777360 R08: 0000000000000000 R09: 0000000000000002 R10: dffffc0000000000 R11: 0000000000000006 R12: ffff880068e98940 R13: 0000000000000002 R14: ffff880068777338 R15: 0000000000000000 FS: 00007f00ff760700(0000) GS:ffff88006cd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020008000 CR3: 000000006a308000 CR4: 00000000000006e0 Stack: ffff8800687771e0 ffffffff812508a5 ffff8800686f3168 0000000000000007 ffff88006ac8cdfc ffff8800665ea500 0000000041b58ab3 ffffffff847b5480 ffffffff819eac60 ffff88006b9e0860 ffff88006b9e0868 ffff88006b9e07f0 Call Trace: [<ffffffff819c8dd5>] security_sock_rcv_skb+0x75/0xb0 security/security.c:1317 [<ffffffff82c2a9e7>] sk_filter_trim_cap+0x67/0x10e0 net/core/filter.c:81 [<ffffffff82b81e60>] __sk_receive_skb+0x30/0xa00 net/core/sock.c:460 [<ffffffff838bbf12>] dccp_v4_rcv+0xdb2/0x1910 net/dccp/ipv4.c:873 [<ffffffff83069d22>] ip_local_deliver_finish+0x332/0xad0 net/ipv4/ip_input.c:216 [< inline >] NF_HOOK_THRESH ./include/linux/netfilter.h:232 [< inline >] NF_HOOK ./include/linux/netfilter.h:255 [<ffffffff8306abd2>] ip_local_deliver+0x1c2/0x4b0 net/ipv4/ip_input.c:257 [< inline >] dst_input ./include/net/dst.h:507 [<ffffffff83068500>] ip_rcv_finish+0x750/0x1c40 net/ipv4/ip_input.c:396 [< inline >] NF_HOOK_THRESH ./include/linux/netfilter.h:232 [< inline >] NF_HOOK ./include/linux/netfilter.h:255 [<ffffffff8306b82f>] ip_rcv+0x96f/0x12f0 net/ipv4/ip_input.c:487 [<ffffffff82bd9fb7>] __netif_receive_skb_core+0x1897/0x2a50 net/core/dev.c:4213 [<ffffffff82bdb19a>] __netif_receive_skb+0x2a/0x170 net/core/dev.c:4251 [<ffffffff82bdb493>] netif_receive_skb_internal+0x1b3/0x390 net/core/dev.c:4279 [<ffffffff82bdb6b8>] netif_receive_skb+0x48/0x250 net/core/dev.c:4303 [<ffffffff8241fc75>] tun_get_user+0xbd5/0x28a0 drivers/net/tun.c:1308 [<ffffffff82421b5a>] tun_chr_write_iter+0xda/0x190 drivers/net/tun.c:1332 [< inline >] new_sync_write fs/read_write.c:499 [<ffffffff8151bd44>] __vfs_write+0x334/0x570 fs/read_write.c:512 [<ffffffff8151f85b>] vfs_write+0x17b/0x500 fs/read_write.c:560 [< inline >] SYSC_write fs/read_write.c:607 [<ffffffff81523184>] SyS_write+0xd4/0x1a0 fs/read_write.c:599 [<ffffffff83fc02c1>] entry_SYSCALL_64_fastpath+0x1f/0xc2 It turns out DCCP calls __sk_receive_skb(), and this broke when lookups no longer took a reference on listeners. Fix this issue by adding a @refcounted parameter to __sk_receive_skb(), so that sock_put() is used only when needed. Fixes: `3b24d854cb` ("tcp/dccp: do not touch listener sk_refcnt under synflood") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-03 16:16:50 -04:00
Lance Richardson	9ee6c5dc81	ipv4: allow local fragmentation in ip_finish_output_gso() Some configurations (e.g. geneve interface with default MTU of 1500 over an ethernet interface with 1500 MTU) result in the transmission of packets that exceed the configured MTU. While this should be considered to be a "bad" configuration, it is still allowed and should not result in the sending of packets that exceed the configured MTU. Fix by dropping the assumption in ip_finish_output_gso() that locally originated gso packets will never need fragmentation. Basic testing using iperf (observing CPU usage and bandwidth) have shown no measurable performance impact for traffic not requiring fragmentation. Fixes: `c7ba65d7b6` ("net: ip: push gso skb forwarding handling down the stack") Reported-by: Jan Tluka <jtluka@redhat.com> Signed-off-by: Lance Richardson <lrichard@redhat.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-03 16:10:26 -04:00
David Ahern	da96786e26	net: tcp: check skb is non-NULL for exact match on lookups Andrey reported the following error report while running the syzkaller fuzzer: general protection fault: 0000 [#1] SMP KASAN Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: CPU: 0 PID: 648 Comm: syz-executor Not tainted 4.9.0-rc3+ #333 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 task: ffff8800398c4480 task.stack: ffff88003b468000 RIP: 0010:[<ffffffff83091106>] [< inline >] inet_exact_dif_match include/net/tcp.h:808 RIP: 0010:[<ffffffff83091106>] [<ffffffff83091106>] __inet_lookup_listener+0xb6/0x500 net/ipv4/inet_hashtables.c:219 RSP: 0018:ffff88003b46f270 EFLAGS: 00010202 RAX: 0000000000000004 RBX: 0000000000004242 RCX: 0000000000000001 RDX: 0000000000000000 RSI: ffffc90000e3c000 RDI: 0000000000000054 RBP: ffff88003b46f2d8 R08: 0000000000004000 R09: ffffffff830910e7 R10: 0000000000000000 R11: 000000000000000a R12: ffffffff867fa0c0 R13: 0000000000004242 R14: 0000000000000003 R15: dffffc0000000000 FS: 00007fb135881700(0000) GS:ffff88003ec00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020cc3000 CR3: 000000006d56a000 CR4: 00000000000006f0 Stack: 0000000000000000 000000000601a8c0 0000000000000000 ffffffff00004242 424200003b9083c2 ffff88003def4041 ffffffff84e7e040 0000000000000246 ffff88003a0911c0 0000000000000000 ffff88003a091298 ffff88003b9083ae Call Trace: [<ffffffff831100f4>] tcp_v4_send_reset+0x584/0x1700 net/ipv4/tcp_ipv4.c:643 [<ffffffff83115b1b>] tcp_v4_rcv+0x198b/0x2e50 net/ipv4/tcp_ipv4.c:1718 [<ffffffff83069d22>] ip_local_deliver_finish+0x332/0xad0 net/ipv4/ip_input.c:216 ... MD5 has a code path that calls __inet_lookup_listener with a null skb, so inet{6}_exact_dif_match needs to check skb against null before pulling the flag. Fixes: `a04a480d43` ("net: Require exact match for TCP socket lookups if dif is l3mdev") Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Tested-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-03 16:05:44 -04:00
Willem de Bruijn	70ecc24841	ipv4: add IP_RECVFRAGSIZE cmsg The IP stack records the largest fragment of a reassembled packet in IPCB(skb)->frag_max_size. When reading a datagram or raw packet that arrived fragmented, expose the value to allow applications to estimate receive path MTU. Tested: Sent data over a veth pair of which the source has a small mtu. Sent data using netcat, received using a dedicated process. Verified that the cmsg IP_RECVFRAGSIZE is returned only when data arrives fragmented, and in that cases matches the veth mtu. ip link add veth0 type veth peer name veth1 ip netns add from ip netns add to ip link set dev veth1 netns to ip netns exec to ip addr add dev veth1 192.168.10.1/24 ip netns exec to ip link set dev veth1 up ip link set dev veth0 netns from ip netns exec from ip addr add dev veth0 192.168.10.2/24 ip netns exec from ip link set dev veth0 up ip netns exec from ip link set dev veth0 mtu 1300 ip netns exec from ethtool -K veth0 ufo off dd if=/dev/zero bs=1 count=1400 2>/dev/null > payload ip netns exec to ./recv_cmsg_recvfragsize -4 -u -p 6000 & ip netns exec from nc -q 1 -u 192.168.10.1 6000 < payload using github.com/wdebruij/kerneltools/blob/master/tests/recvfragsize.c Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-03 15:41:11 -04:00
Pablo Neira Ayuso	01886bd91f	netfilter: remove hook_entries field from nf_hook_state This field is only useful for nf_queue, so store it in the nf_queue_entry structure instead, away from the core path. Pass hook_head to nf_hook_slow(). Since we always have a valid entry on the first iteration in nf_iterate(), we can use 'do { ... } while (entry)' loop instead. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-11-03 11:52:58 +01:00
Pablo Neira Ayuso	0e5a1c7eb3	netfilter: nf_tables: use hook state from xt_action_param structure Don't copy relevant fields from hook state structure, instead use the one that is already available in struct xt_action_param. This patch also adds a set of new wrapper functions to fetch relevant hook state structure fields. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-11-03 11:52:34 +01:00
Pablo Neira Ayuso	613dbd9572	netfilter: x_tables: move hook state into xt_action_param structure Place pointer to hook state in xt_action_param structure instead of copying the fields that we need. After this change xt_action_param fits into one cacheline. This patch also adds a set of new wrapper functions to fetch relevant hook state structure fields. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-11-03 10:56:21 +01:00
Eli Cooper	23f4ffedb7	ip6_tunnel: Clear IP6CB in ip6tunnel_xmit() skb->cb may contain data from previous layers. In the observed scenario, the garbage data were misinterpreted as IP6CB(skb)->frag_max_size, so that small packets sent through the tunnel are mistakenly fragmented. This patch unconditionally clears the control buffer in ip6tunnel_xmit(), which affects ip6_tunnel, ip6_udp_tunnel and ip6_gre. Currently none of these tunnels set IP6CB(skb)->flags, otherwise it needs to be done earlier. Cc: stable@vger.kernel.org Signed-off-by: Eli Cooper <elicooper@gmx.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-02 15:18:36 -04:00
David S. Miller	4cb551a100	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for your net-next tree. This includes better integration with the routing subsystem for nf_tables, explicit notrack support and smaller updates. More specifically, they are: 1) Add fib lookup expression for nf_tables, from Florian Westphal. This new expression provides a native replacement for iptables addrtype and rp_filter matches. This is more flexible though, since we can populate the kernel flowi representation to inquire fib to accomodate new usecases, such as RTBH through skb mark. 2) Introduce rt expression for nf_tables, from Anders K. Pedersen. This new expression allow you to access skbuff route metadata, more specifically nexthop and classid fields. 3) Add notrack support for nf_tables, to skip conntracking, requested by many users already. 4) Add boilerplate code to allow to use nf_log infrastructure from nf_tables ingress. 5) Allow to mangle pkttype from nf_tables prerouting chain, to emulate the xtables cluster match, from Liping Zhang. 6) Move socket lookup code into generic nf_socket_* infrastructure so we can provide a native replacement for the xtables socket match. 7) Make sure nfnetlink_queue data that is updated on every packets is placed in a different cache from read-only data, from Florian Westphal. 8) Handle NF_STOLEN from nf_tables core, also from Florian Westphal. 9) Start round robin number generation in nft_numgen from zero, instead of n-1, for consistency with xtables statistics match, patch from Liping Zhang. 10) Set GFP_NOWARN flag in skbuff netlink allocations in nfnetlink_log, given we retry with a smaller allocation on failure, from Calvin Owens. 11) Cleanup xt_multiport to use switch(), from Gao feng. 12) Remove superfluous check in nft_immediate and nft_cmp, from Liping Zhang. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-02 14:57:47 -04:00
Pablo Neira Ayuso	8db4c5be88	netfilter: move socket lookup infrastructure to nf_socket_ipv{4,6}.c We need this split to reuse existing codebase for the upcoming nf_tables socket expression. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-11-01 20:50:31 +01:00
Pablo Neira Ayuso	1fddf4bad0	netfilter: nf_log: add packet logging for netdev family Move layer 2 packet logging into nf_log_l2packet() that resides in nf_log_common.c, so this can be shared by both bridge and netdev families. This patch adds the boiler plate code to register the netdev logging family. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-11-01 20:50:30 +01:00
Florian Westphal	f6d0cbcf09	netfilter: nf_tables: add fib expression Add FIB expression, supported for ipv4, ipv6 and inet family (the latter just dispatches to ipv4 or ipv6 one based on nfproto). Currently supports fetching output interface index/name and the rtm_type associated with an address. This can be used for adding path filtering. rtm_type is useful to e.g. enforce a strong-end host model where packets are only accepted if daddr is configured on the interface the packet arrived on. The fib expression is a native nftables alternative to the xtables addrtype and rp_filter matches. FIB result order for oif/oifname retrieval is as follows: - if packet is local (skb has rtable, RTF_LOCAL set, this will also catch looped-back multicast packets), set oif to the loopback interface. - if fib lookup returns an error, or result points to local, store zero result. This means '--local' option of -m rpfilter is not supported. It is possible to use 'fib type local' or add explicit saddr/daddr matching rules to create exceptions if this is really needed. - store result in the destination register. In case of multiple routes, search set for desired oif in case strict matching is requested. ipv4 and ipv6 behave fib expressions are supposed to behave the same. [ I have collapsed Arnd Bergmann's ("netfilter: nf_tables: fib warnings") http://patchwork.ozlabs.org/patch/688615/ to address fallout from this patch after rebasing nf-next, that was posted to address compilation warnings. --pablo ] Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-11-01 20:50:14 +01:00
Eric Dumazet	bd68a2a854	net: set SK_MEM_QUANTUM to 4096 Systems with large pages (64KB pages for example) do not always have huge quantity of memory. A big SK_MEM_QUANTUM value leads to fewer interactions with the global counters (like tcp_memory_allocated) but might trigger memory pressure much faster, giving suboptimal TCP performance since windows are lowered to ridiculous values. Note that sysctl_mem units being in pages and in ABI, we also need to change sk_prot_mem_limits() accordingly. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-31 20:59:58 -04:00
Xin Long	dae399d7fd	sctp: hold transport instead of assoc when lookup assoc in rx path Prior to this patch, in rx path, before calling lock_sock, it needed to hold assoc when got it by __sctp_lookup_association, in case other place would free/put assoc. But in __sctp_lookup_association, it lookup and hold transport, then got assoc by transport->assoc, then hold assoc and put transport. It means it didn't hold transport, yet it was returned and later on directly assigned to chunk->transport. Without the protection of sock lock, the transport may be freed/put by other places, which would cause a use-after-free issue. This patch is to fix this issue by holding transport instead of assoc. As holding transport can make sure to access assoc is also safe, and actually it looks up assoc by searching transport rhashtable, to hold transport here makes more sense. Note that the function will be renamed later on on another patch. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-31 16:20:33 -04:00
David S. Miller	27058af401	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Mostly simple overlapping changes. For example, David Ahern's adjacency list revamp in 'net-next' conflicted with an adjacency list traversal bug fix in 'net'. Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-30 12:42:58 -04:00
Linus Torvalds	2a26d99b25	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking fixes from David Miller: "Lots of fixes, mostly drivers as is usually the case. 1) Don't treat zero DMA address as invalid in vmxnet3, from Alexey Khoroshilov. 2) Fix element timeouts in netfilter's nft_dynset, from Anders K. Pedersen. 3) Don't put aead_req crypto struct on the stack in mac80211, from Ard Biesheuvel. 4) Several uninitialized variable warning fixes from Arnd Bergmann. 5) Fix memory leak in cxgb4, from Colin Ian King. 6) Fix bpf handling of VLAN header push/pop, from Daniel Borkmann. 7) Several VRF semantic fixes from David Ahern. 8) Set skb->protocol properly in ip6_tnl_xmit(), from Eli Cooper. 9) Socket needs to be locked in udp_disconnect(), from Eric Dumazet. 10) Div-by-zero on 32-bit fix in mlx4 driver, from Eugenia Emantayev. 11) Fix stale link state during failover in NCSCI driver, from Gavin Shan. 12) Fix netdev lower adjacency list traversal, from Ido Schimmel. 13) Propvide proper handle when emitting notifications of filter deletes, from Jamal Hadi Salim. 14) Memory leaks and big-endian issues in rtl8xxxu, from Jes Sorensen. 15) Fix DESYNC_FACTOR handling in ipv6, from Jiri Bohac. 16) Several routing offload fixes in mlxsw driver, from Jiri Pirko. 17) Fix broadcast sync problem in TIPC, from Jon Paul Maloy. 18) Validate chunk len before using it in SCTP, from Marcelo Ricardo Leitner. 19) Revert a netns locking change that causes regressions, from Paul Moore. 20) Add recursion limit to GRO handling, from Sabrina Dubroca. 21) GFP_KERNEL in irq context fix in ibmvnic, from Thomas Falcon. 22) Avoid accessing stale vxlan/geneve socket in data path, from Pravin Shelar" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (189 commits) geneve: avoid using stale geneve socket. vxlan: avoid using stale vxlan socket. qede: Fix out-of-bound fastpath memory access net: phy: dp83848: add dp83822 PHY support enic: fix rq disable tipc: fix broadcast link synchronization problem ibmvnic: Fix missing brackets in init_sub_crq_irqs ibmvnic: Fix releasing of sub-CRQ IRQs in interrupt context Revert "ibmvnic: Fix releasing of sub-CRQ IRQs in interrupt context" arch/powerpc: Update parameters for csum_tcpudp_magic & csum_tcpudp_nofold net/mlx4_en: Save slave ethtool stats command net/mlx4_en: Fix potential deadlock in port statistics flow net/mlx4: Fix firmware command timeout during interrupt test net/mlx4_core: Do not access comm channel if it has not yet been initialized net/mlx4_en: Fix panic during reboot net/mlx4_en: Process all completions in RX rings after port goes up net/mlx4_en: Resolve dividing by zero in 32-bit system net/mlx4_core: Change the default value of enable_qos net/mlx4_core: Avoid setting ports to auto when only one port type is supported net/mlx4_core: Fix the resource-type enum in res tracker to conform to FW spec ...	2016-10-29 20:33:20 -07:00
pravin shelar	c6fcc4fc5f	vxlan: avoid using stale vxlan socket. When vxlan device is closed vxlan socket is freed. This operation can race with vxlan-xmit function which dereferences vxlan socket. Following patch uses RCU mechanism to avoid this situation. Signed-off-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-29 20:56:31 -04:00
David S. Miller	32ab0a38f0	Among various cleanups and improvements, we have the following: * client FILS authentication support in mac80211 (Jouni) * AP/VLAN multicast improvements (Michael Braun) * config/advertising support for differing beacon intervals on multiple virtual interfaces (Purushottam Kushwaha, myself) * deprecate the old WDS mode for cfg80211-based drivers, the mode is hardly usable since it doesn't support any "modern" features like WPA encryption (2003), HT (2009) or VHT (2014), I'm not even sure WEP (introduced in 1997) could be done. -----BEGIN PGP SIGNATURE----- iQIcBAABCgAGBQJYEy/9AAoJEGt7eEactAAd/V0P/0FmHGS8HjlSjm+1p6sbWKbt 5v8bb3cuKHQiYiUM6euIXql2OYuOEHVAQEpNoPXN9CsfKFYgbIH6yW6d8HtKNedV n9lmMy/U6yJX9nYt7yMIQ3kLkbEg+YU58B9Hf47waWXLLSNVumS8rfNBn43EoNQf VKWYPWpetsCRIWJ1fnLuxvMHCtOOYCxH+491BUonof32+DKPEAsAbnszZ2ElufTR 7KNyA3K6leOtTd5Ml52dvLOGNc+h2C83VAMxiShq/6r8OnlX5tPifaubzd9n3m41 jiJJH/92ESrtF2AaWEm8slcgtcfHS/O7y/FSoV4r0PMSvPTBdjwQ9nqCsbONd831 vjj6c6YWNxgHPcISX0XcWz+FHnLJdUGaDUtHjAJYw4oH4gaRXwfSw0U+jvdlSMUf 2CBUArk5f0OEguzwa/5X4Jio3OPPIj4jY/lKplcpLOUu8K2FWTLDuIlww/FHXovs rDzTLQeXZkx+MkszkTJN42qSEfOFly91J6OA2Wju+emBqrLIbkAGmvyLVg8U8BQd gG7oltgmZ6Xg6fEnUQqpIDO7UJlQ+GXAU04SpNDMv1j/ueUJskxlr3hYM7E9ueQv LJcZcVV0RAwNRw52cEsdcYCMLuSMYRrO1OHlkl0wd+x2hFrUCWnVzUgLEUhNBV+c ICmNMr96nKhrZI217yzF =SjfQ -----END PGP SIGNATURE----- Merge tag 'mac80211-next-for-davem-2016-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next Johannes Berg says: ==================== Among various cleanups and improvements, we have the following: * client FILS authentication support in mac80211 (Jouni) * AP/VLAN multicast improvements (Michael Braun) * config/advertising support for differing beacon intervals on multiple virtual interfaces (Purushottam Kushwaha, myself) * deprecate the old WDS mode for cfg80211-based drivers, the mode is hardly usable since it doesn't support any "modern" features like WPA encryption (2003), HT (2009) or VHT (2014), I'm not even sure WEP (introduced in 1997) could be done. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-29 17:28:45 -04:00
David S. Miller	880b583ce1	Just two fixes: * a fix to process all events while suspending, so any potential calls into the driver are done before it is suspended * small markup fixes for the sphinx documentation conversion that's coming into the tree via the doc tree -----BEGIN PGP SIGNATURE----- iQIcBAABCgAGBQJYEbSGAAoJEGt7eEactAAdOo4P/A1RyOl2uTspwB4lZFMMA5Tp IIwaB+WlNcjGq05C8kemdQJ30AKJUy6aKMTb5FIuFXWLRK6u5uIoQsYe6cRI8WBs bgTpo5RMcNGx6dUgFS/XwM2bLXUNXAXOt+43c2kWSjnqQZpQuSXh8/4R8ybZxyZB SNyKCzkx1LYPox4Qnufmi+pz/hI8hHRY7gMdyhUWU2uUk9u/nuY0Zxyvj4OFgyow SOGasCPoDzT9556hUGyG9M8HmwBiRZUzfzo5mSR2V0a9Tij4ZOvoSkcAUDbj38fZ 0UPJ48xwUDMF88W+NEO+rrd1IQXtlsHJk6x/mAQpSkEvzymb5BEvraVeTV3+chgo fn45ZDP/GbTwqJ3YacSmKgyGwaGPF9XN2+Iuxotp0UeuYxvxwnkWNr0JbRecDPau ZL/P2qV7Y3gjy0+6nHafrYIYPuPFZjwPIJ7k6x7mVtO6zfJhtKF57YP/YoFYD78l TtCEE5o7lAYkdWKa2nm4Fg16NM897PS/GTM9dQCFtnYLV7ynZPY2TGf4TelqJ3cz FU1/vlM0t/jRkkRKIhlmy8re4ASjTEccf//dllz1CDyPdxHY1NmGftftz539s7qW NpCvvO0IXDchBT1jeoSMwcMzlZSfQeIKtkp1P0z4lyPiMcmekdgUQ7sb84FhwSa+ SYRpiZ/Xv6xa5Djhonci =toQY -----END PGP SIGNATURE----- Merge tag 'mac80211-for-davem-2016-10-27' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== Just two fixes: * a fix to process all events while suspending, so any potential calls into the driver are done before it is suspended * small markup fixes for the sphinx documentation conversion that's coming into the tree via the doc tree ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-29 15:54:16 -04:00
Eric Dumazet	5ea8ea2cb7	tcp/dccp: drop SYN packets if accept queue is full Per listen(fd, backlog) rules, there is really no point accepting a SYN, sending a SYNACK, and dropping the following ACK packet if accept queue is full, because application is not draining accept queue fast enough. This behavior is fooling TCP clients that believe they established a flow, while there is nothing at server side. They might then send about 10 MSS (if using IW10) that will be dropped anyway while server is under stress. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-29 15:09:21 -04:00
Thomas Graf	b15ca182ed	netlink: Add nla_memdup() to wrap kmemdup() use on nlattr Wrap several common instances of: kmemdup(nla_data(attr), nla_len(attr), GFP_KERNEL); Signed-off-by: Thomas Graf <tgraf@suug.ch> Acked-by: Johannes Berg <johannes@sipsolutions.net> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-29 14:57:42 -04:00
David Ahern	d5d32e4b76	net: ipv6: Do not consider link state for nexthop validation Similar to IPv4, do not consider link state when validating next hops. Currently, if the link is down default routes can fail to insert: $ ip -6 ro add vrf blue default via 2100:2::64 dev eth2 RTNETLINK answers: No route to host With this patch the command succeeds. Fixes: `8c14586fc3` ("net: ipv6: Use passed in table for nexthop lookups") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-27 16:33:12 -04:00
David Ahern	830218c1ad	net: ipv6: Fix processing of RAs in presence of VRF rt6_add_route_info and rt6_add_dflt_router were updated to pull the FIB table from the device index, but the corresponding rt6_get_route_info and rt6_get_dflt_router functions were not leading to the failure to process RA's: ICMPv6: RA: ndisc_router_discovery failed to add default route Fix the 'get' functions by using the table id associated with the device when applicable. Also, now that default routes can be added to tables other than the default table, rt6_purge_dflt_routers needs to be updated as well to look at all tables. To handle that efficiently, add a flag to the table denoting if it is has a default route via RA. Fixes: `ca254490c8` ("net: Add VRF support to IPv6 stack") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-27 16:30:52 -04:00
Johannes Berg	2ae0f17df1	genetlink: use idr to track families Since generic netlink family IDs are small integers, allocated densely, IDR is an ideal match for lookups. Replace the existing hand-written hash-table with IDR for allocation and lookup. This lets the families only be written to once, during register, since the list_head can be removed and removal of a family won't cause any writes. It also slightly reduces the code size (by about 1.3k on x86-64). Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-27 16:16:09 -04:00
Johannes Berg	489111e5c2	genetlink: statically initialize families Instead of providing macros/inline functions to initialize the families, make all users initialize them statically and get rid of the macros. This reduces the kernel code size by about 1.6k on x86-64 (with allyesconfig). Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-27 16:16:09 -04:00
Johannes Berg	a07ea4d994	genetlink: no longer support using static family IDs Static family IDs have never really been used, the only use case was the workaround I introduced for those users that assumed their family ID was also their multicast group ID. Additionally, because static family IDs would never be reserved by the generic netlink code, using a relatively low ID would only work for built-in families that can be registered immediately after generic netlink is started, which is basically only the control family (apart from the workaround code, which I also had to add code for so it would reserve those IDs) Thus, anything other than GENL_ID_GENERATE is flawed and luckily not used except in the cases I mentioned. Move those workarounds into a few lines of code, and then get rid of GENL_ID_GENERATE entirely, making it more robust. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-27 16:16:09 -04:00
Johannes Berg	c90c39dab3	genetlink: introduce and use genl_family_attrbuf() This helper function allows family implementations to access their family's attrbuf. This gets rid of the attrbuf usage in families, and also adds locking validation, since it's not valid to use the attrbuf with parallel_ops or outside of the dumpit callback. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-27 16:16:08 -04:00
Antonio Quartulli	4fe77d82ef	skbedit: allow the user to specify bitmask for mark The user may want to use only some bits of the skb mark in his skbedit rules because the remaining part might be used by something else. Introduce the "mask" parameter to the skbedit actor in order to implement such functionality. When the mask is specified, only those bits selected by the latter are altered really changed by the actor, while the rest is left untouched. Signed-off-by: Antonio Quartulli <antonio@open-mesh.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-27 16:07:25 -04:00
Florian Westphal	cdb436d181	netfilter: conntrack: avoid excess memory allocation This is now a fixed-size extension, so we don't need to pass a variable alloc size. This (harmless) error results in allocating 32 instead of the needed 16 bytes for this extension as the size gets passed twice. Fixes: `23014011ba` ("netfilter: conntrack: support a fixed size of 128 distinct labels") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-10-27 18:29:02 +02:00
John W. Linville	f1d505bb76	netfilter: nf_tables: fix type mismatch with error return from nft_parse_u32_check Commit `36b701fae1` ("netfilter: nf_tables: validate maximum value of u32 netlink attributes") introduced nft_parse_u32_check with a return value of "unsigned int", yet on error it returns "-ERANGE". This patch corrects the mismatch by changing the return value to "int", which happens to match the actual users of nft_parse_u32_check already. Found by Coverity, CID 1373930. Note that commit `21a9e0f156` ("netfilter: nft_exthdr: fix error handling in nft_exthdr_init()) attempted to address the issue, but did not address the return type of nft_parse_u32_check. Signed-off-by: John W. Linville <linville@tuxdriver.com> Cc: Laura Garcia Liebana <nevola@gmail.com> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Dan Carpenter <dan.carpenter@oracle.com> Fixes: `36b701fae1` ("netfilter: nf_tables: validate maximum value...") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-10-27 18:29:01 +02:00
Liping Zhang	61f9e2924f	netfilter: nf_tables: fix leak when expr clone fail When nft_expr_clone failed, a series of problems will happen: 1. module refcnt will leak, we call __module_get at the beginning but we forget to put it back if ops->clone returns fail 2. memory will be leaked, if clone fail, we just return NULL and forget to free the alloced element 3. set->nelems will become incorrect when set->size is specified. If clone fail, we should decrease the set->nelems Now this patch fixes these problems. And fortunately, clone fail will only happen on counter expression when memory is exhausted. Fixes: `086f332167` ("netfilter: nf_tables: add clone interface to expression operations") Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-10-27 18:20:45 +02:00
vamsi krishna	088e8df82f	cfg80211: Add support to update connection parameters Add functionality to update the connection parameters when in connected state, so that driver/firmware uses the updated parameters for subsequent roaming. This is for drivers that support internal BSS selection and roaming. The new command does not change the current association state, i.e., it can be used to update IE contents for future (re)associations without causing an immediate disassociation or reassociation with the current BSS. This commit implements the required functionality for updating IEs for (Re)Association Request frame only. Other parameters can be added in future when required. Signed-off-by: vamsi krishna <vamsin@qti.qualcomm.com> Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-10-27 16:03:28 +02:00
Michael Braun	ce0ce13a1c	cfg80211: configure multicast to unicast for AP interfaces Add the ability to configure if an AP (and associated VLANs) will do multicast-to-unicast conversion for ARP, IPv4 and IPv6 frames (possibly within 802.1Q). If enabled, such frames are to be sent to each station separately, with the DA replaced by their own MAC address rather than the group address. Note that this may break certain expectations of the receiver, such as the ability to drop unicast IP packets received within multicast L2 frames, or the ability to not send ICMP destination unreachable messages for packets received in L2 multicast (which is required, but the receiver can't tell the difference if this new option is enabled.) This also doesn't implement the 802.11 DMS (directed multicast service). Signed-off-by: Michael Braun <michael-dev@fami-braun.de> [fix disabling, add better documentation & commit message] Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-10-27 16:03:27 +02:00
Jouni Malinen	348bd45669	cfg80211: Add KEK/nonces for FILS association frames The new nl80211 attributes can be used to provide KEK and nonces to allow the driver to encrypt and decrypt FILS (Re)Association Request/Response frames in station mode. Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-10-27 16:03:24 +02:00
Jouni Malinen	3f817fe718	cfg80211: Define IEEE P802.11ai (FILS) information elements Define the Element IDs and Element ID Extensions from IEEE P802.11ai/D11.0. In addition, add a new cfg80211_find_ext_ie() wrapper to make it easier to find information elements that used the Element ID Extension field. Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-10-27 16:03:23 +02:00
Jouni Malinen	11b6b5a4ce	cfg80211: Rename SAE_DATA to more generic AUTH_DATA This adds defines and nl80211 extensions to allow FILS Authentication to be implemented similarly to SAE. FILS does not need the special rules for the Authentication transaction number and Status code fields, but it does need to add non-IE fields. The previously used NL80211_ATTR_SAE_DATA can be reused for this to avoid having to duplicate that implementation. Rename that attribute to more generic NL80211_ATTR_AUTH_DATA (with backwards compatibility define for NL80211_SAE_DATA). Also document the special rules related to the Authentication transaction number and Status code fiels. Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-10-27 16:03:20 +02:00
Johannes Berg	4c8dea638c	cfg80211: validate beacon int as part of iface combinations Remove the pointless checking against interface combinations in the initial basic beacon interval validation, that currently isn't taking into account radar detection or channels properly. Instead, just validate the basic range there, and then delay real checking to the interface combination validation that drivers must do. This means that drivers wanting to use the beacon_int_min_gcd will now have to pass the new_beacon_int when validating the AP/mesh start. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-10-27 09:18:07 +02:00
Arend Van Spriel	73c7da3dae	cfg80211: add generic helper to check interface is running Add a helper using wdev to check if interface is running. This deals with both non-netdev and netdev interfaces. In struct wireless_dev replace 'p2p_started' and 'nan_started' by 'is_running' as those are mutually exclusive anyway, and unify all the code to use wdev_running(). Signed-off-by: Arend van Spriel <arend.vanspriel@broadcom.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-10-27 09:08:44 +02:00
Eric Dumazet	10df8e6152	udp: fix IP_CHECKSUM handling First bug was added in commit `ad6f939ab1` ("ip: Add offset parameter to ip_cmsg_recv") : Tom missed that ipv4 udp messages could be received on AF_INET6 socket. ip_cmsg_recv(msg, skb) should have been replaced by ip_cmsg_recv_offset(msg, skb, sizeof(struct udphdr)); Then commit `e6afc8ace6` ("udp: remove headers from UDP packets before queueing") forgot to adjust the offsets now UDP headers are pulled before skb are put in receive queue. Fixes: `ad6f939ab1` ("ip: Add offset parameter to ip_cmsg_recv") Fixes: `e6afc8ace6` ("udp: remove headers from UDP packets before queueing") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Sam Kumar <samanthakumar@google.com> Cc: Willem de Bruijn <willemb@google.com> Tested-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-26 17:33:22 -04:00
Stephen Hemminger	293de7dee4	doc: update docbook annotations for socket and skb The skbuff and sock structure both had missing parameter annotation values. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-26 17:31:23 -04:00
Jani Nikula	b4f7f4ad42	mac80211: fix some sphinx warnings Signed-off-by: Jani Nikula <jani.nikula@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-10-26 08:01:07 +02:00
Cyrill Gorcunov	432490f9d4	net: ip, diag -- Add diag interface for raw sockets In criu we are actively using diag interface to collect sockets present in the system when dumping applications. And while for unix, tcp, udp[lite], packet, netlink it works as expected, the raw sockets do not have. Thus add it. v2: - add missing sock_put calls in raw_diag_dump_one (by eric.dumazet@) - implement @destroy for diag requests (by dsa@) v3: - add export of raw_abort for IPv6 (by dsa@) - pass net-admin flag into inet_sk_diag_fill due to changes in net-next branch (by dsa@) v4: - use @pad in struct inet_diag_req_v2 for raw socket protocol specification: raw module carries sockets which may have custom protocol passed from socket() syscall and sole @sdiag_protocol is not enough to match underlied ones - start reporting protocol specifed in socket() call when sockets are raw ones for the same reason: user space tools like ss may parse this attribute and use it for socket matching v5 (by eric.dumazet@): - use sock_hold in raw_sock_get instead of atomic_inc, we're holding (raw_v4_hashinfo\|raw_v6_hashinfo)->lock when looking up so counter won't be zero here. v6: - use sdiag_raw_protocol() helper which will access @pad structure used for raw sockets protocol specification: we can't simply rename this member without breaking uapi v7: - sine sdiag_raw_protocol() helper is not suitable for uapi lets rather make an alias structure with proper names. __check_inet_diag_req_raw helper will catch if any of structure unintentionally changed. CC: David S. Miller <davem@davemloft.net> CC: Eric Dumazet <eric.dumazet@gmail.com> CC: David Ahern <dsa@cumulusnetworks.com> CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> CC: James Morris <jmorris@namei.org> CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> CC: Patrick McHardy <kaber@trash.net> CC: Andrey Vagin <avagin@openvz.org> CC: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-23 19:35:24 -04:00
Thomas Graf	f76a9db351	lwt: Remove unused len field The field is initialized by ILA and MPLS but never used. Remove it. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-23 17:45:01 -04:00
Paolo Abeni	f970bd9e3a	udp: implement memory accounting helpers Avoid using the generic helpers. Use the receive queue spin lock to protect the memory accounting operation, both on enqueue and on dequeue. On dequeue perform partial memory reclaiming, trying to leave a quantum of forward allocated memory. On enqueue use a custom helper, to allow some optimizations: - use a plain spin_lock() variant instead of the slightly costly spin_lock_irqsave(), - avoid dst_force check, since the calling code has already dropped the skb dst - avoid orphaning the skb, since skb_steal_sock() already did the work for us The above needs custom memory reclaiming on shutdown, provided by the udp_destruct_sock(). v5 -> v6: - don't orphan the skb on enqueue v4 -> v5: - replace the mem_lock with the receive queue spin lock - ensure that the bh is always allowed to enqueue at least a skb, even if sk_rcvbuf is exceeded v3 -> v4: - reworked memory accunting, simplifying the schema - provide an helper for both memory scheduling and enqueuing v1 -> v2: - use a udp specific destrctor to perform memory reclaiming - remove a couple of helpers, unneeded after the above cleanup - do not reclaim memory on dequeue if not under memory pressure - reworked the fwd accounting schema to avoid potential integer overflow Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-22 17:05:05 -04:00
Paolo Abeni	f8c3bf00d4	net/socket: factor out helpers for memory and queue manipulation Basic sock operations that udp code can use with its own memory accounting schema. No functional change is introduced in the existing APIs. v4 -> v5: - avoid whitespace changes v2 -> v4: - avoid exporting __sock_enqueue_skb v1 -> v2: - avoid export sock_rmem_free Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-22 17:05:05 -04:00
WANG Cong	8651be8f14	ipv6: fix a potential deadlock in do_ipv6_setsockopt() Baozeng reported this deadlock case: CPU0 CPU1 ---- ---- lock([ 165.136033] sk_lock-AF_INET6); lock([ 165.136033] rtnl_mutex); lock([ 165.136033] sk_lock-AF_INET6); lock([ 165.136033] rtnl_mutex); Similar to commit `87e9f03159` ("ipv4: fix a potential deadlock in mcast getsockopt() path") this is due to we still have a case, ipv6_sock_mc_close(), where we acquire sk_lock before rtnl_lock. Close this deadlock with the similar solution, that is always acquire rtnl lock first. Fixes: `baf606d9c9` ("ipv4,ipv6: grab rtnl before locking the socket") Reported-by: Baozeng Ding <sploving1@gmail.com> Tested-by: Baozeng Ding <sploving1@gmail.com> Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-21 11:29:02 -04:00
Eric Dumazet	286c72deab	udp: must lock the socket in udp_disconnect() Baozeng Ding reported KASAN traces showing uses after free in udp_lib_get_port() and other related UDP functions. A CONFIG_DEBUG_PAGEALLOC=y kernel would eventually crash. I could write a reproducer with two threads doing : static int sock_fd; static void thr1(void arg) { for (;;) { connect(sock_fd, (const struct sockaddr )arg, sizeof(struct sockaddr_in)); } } static void thr2(void arg) { struct sockaddr_in unspec; for (;;) { memset(&unspec, 0, sizeof(unspec)); connect(sock_fd, (const struct sockaddr )&unspec, sizeof(unspec)); } } Problem is that udp_disconnect() could run without holding socket lock, and this was causing list corruptions. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Baozeng Ding <sploving1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-20 14:45:52 -04:00
Ilan Peer	0711d63878	cfg80211: allow aborting in-progress connection atttempts On a disconnect request from userspace, cfg80211 currently calls called rdev_disconnect() only in case that 'current_bss' was set, i.e. connection had been established. Change this to allow the userspace call to succeed and call the driver's disconnect() method also while the connection attempt is in progress, to be able to abort attempts. Signed-off-by: Ilan Peer <ilan.peer@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> [change commit subject/message] Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-10-19 12:15:38 +02:00
Emmanuel Grumbach	f438ceb81d	mac80211: uapsd_queues is in QoS IE order The uapsd_queue field is in QoS IE order and not in IEEE80211_AC_*'s order. This means that mac80211 would get confused between BK and BE which is certainly not such a big deal but needs to be fixed. Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-10-19 12:13:54 +02:00
Sara Sharon	f3fe4e93dd	mac80211: add a HW flag for supporting HW TX fragmentation Currently mac80211 determines whether HW does fragmentation by checking whether the set_frag_threshold callback is set or not. However, some drivers may want to set the HW fragmentation capability depending on HW generation. Allow this by checking a HW flag instead of checking the callback. Signed-off-by: Sara Sharon <sara.sharon@intel.com> [added the flag to ath10k and wlcore] Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-10-19 12:12:44 +02:00
Emmanuel Grumbach	0aa419ec6e	mac80211: allow the driver not to pass the tid to ieee80211_sta_uapsd_trigger iwlwifi will check internally that the tid maps to an AC that is trigger enabled, but can't know what tid exactly. Allow the driver to pass a generic tid and make mac80211 assume that a trigger frame was received. Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-10-19 12:12:19 +02:00
Johannes Berg	a1264c3d6c	wireless: radiotap: fix timestamp sampling position values The values don't match the radiotap spec, corrected that. Reported-by: Oz Shalev <oz.shalev@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-10-19 12:11:36 +02:00
David S. Miller	5cbee55736	This is relatively small, mostly to get the SG/crypto from stack removal fix that crashes things when VMAP stack is used in conjunction with software crypto. Aside from that, we have: * a fix for AP_VLAN usage with the nl80211 frame command * two fixes (and two preparation patches) for A-MSDU, one to discard group-addressed (multicast) and unexpected 4-address A-MSDUs, the other to validate A-MSDU inner MAC addresses properly to prevent controlled port bypass -----BEGIN PGP SIGNATURE----- iQIcBAABCgAGBQJYBcgKAAoJEGt7eEactAAdhUkP/jMVQbLMZ1Jcc9+lsPVGUIga I9GeQ4lcnD+4ASeJUhTtemC1IMNL4zMVqaIxbznDXKP7rZRrODVvCPk2TYIw9c5S rzF/TRierMFttLu3xY757nAsYg6T7F03JdOQ3SKIb3xOD8pXCWQoVRN14ldroRno 4stOAtDrpD5wvK2JhlWv1EYlxGVLqLcakZt/BwgDX/cJGkAx49Q/s29FUnesB9Ep sCH5chffeQskOL9CrSwboNmucgt4HGQORc4UL/KtPOEBtyfu/LCXEKSqAKVyQZtZ OerouOHWqQE5lT2K6qD/KKFW4lV2t1h+xzqsvZk4ZR5o3s+PAGai6D/wf+JgY9Hk uor9ju/e0htcI9m0aFdHDnltV0OOwIhR2bxWTuBBUkyFVtdQQY+1MRTTtuunWIB4 SDYv6LrNL/0HAIuTlPQH99rnsFNnRZCtTpdbT7GRckAMeWMvy19bF2ZB1FXuSn+h 5dxIo0qkw8nv4Y9wQ6QmgOcSzYyidUrCgLTO516qXVAKY0kl/u4q/zPr0Fmx/qfY oxspelDv0qd2NMQwJ/AmwjAjkQBulv5DVLu+cDXdOMkc/EbhzWyvetcHiNukxjHn mukCBxTlLoDLug2LFkAPIddEutj+VUEefkf/pD/js8uYuyd9ZnPjiIh6fG25il9a cHbMYtANt2EnZjwI9Z74 =T6t1 -----END PGP SIGNATURE----- Merge tag 'mac80211-for-davem-2016-10-18' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== This is relatively small, mostly to get the SG/crypto from stack removal fix that crashes things when VMAP stack is used in conjunction with software crypto. Aside from that, we have: * a fix for AP_VLAN usage with the nl80211 frame command * two fixes (and two preparation patches) for A-MSDU, one to discard group-addressed (multicast) and unexpected 4-address A-MSDUs, the other to validate A-MSDU inner MAC addresses properly to prevent controlled port bypass ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-18 10:26:15 -04:00
David Ahern	a04a480d43	net: Require exact match for TCP socket lookups if dif is l3mdev Currently, socket lookups for l3mdev (vrf) use cases can match a socket that is bound to a port but not a device (ie., a global socket). If the sysctl tcp_l3mdev_accept is not set this leads to ack packets going out based on the main table even though the packet came in from an L3 domain. The end result is that the connection does not establish creating confusion for users since the service is running and a socket shows in ss output. Fix by requiring an exact dif to sk_bound_dev_if match if the skb came through an interface enslaved to an l3mdev device and the tcp_l3mdev_accept is not set. skb's through an l3mdev interface are marked by setting a flag in inet{6}_skb_parm. The IPv6 variant is already set; this patch adds the flag for IPv4. Using an skb flag avoids a device lookup on the dif. The flag is set in the VRF driver using the IP{6}CB macros. For IPv4, the inet_skb_parm struct is moved in the cb per commit `971f10eca1`, so the match function in the TCP stack needs to use TCP_SKB_CB. For IPv6, the move is done after the socket lookup, so IP6CB is used. The flags field in inet_skb_parm struct needs to be increased to add another flag. There is currently a 1-byte hole following the flags, so it can be expanded to u16 without increasing the size of the struct. Fixes: `193125dbd8` ("net: Introduce VRF device driver") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-17 10:17:05 -04:00
Michael Braun	a3e2f4b6ed	mac80211: fix A-MSDU outer SA/DA According to IEEE 802.11-2012 section 8.3.2 table 8-19, the outer SA/DA of A-MSDU frames need to be changed depending on FromDS/ToDS values. Signed-off-by: Michael Braun <michael-dev@fami-braun.de> [use ether_addr_copy and add alignment annotations] Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-10-17 11:43:33 +02:00
Tom Herbert	1104d9ba44	lwtunnel: Add destroy state operation Users of lwt tunnels may set up some secondary state in build_state function. Add a corresponding destroy_state function to allow users to clean up state. This destroy state function is called from lwstate_free. Also, we now free lwstate using kfree_rcu so user can assume structure is not freed before rcu. Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-15 17:33:41 -04:00
Linus Torvalds	d4d24d2d0a	A single commit converting the mac80211 DocBook template over to Sphinx. Only 32 more to go... -----BEGIN PGP SIGNATURE----- iQIcBAABAgAGBQJYAOP4AAoJEI3ONVYwIuV6CIgQAKqtI3i99xOJcuVJfojHYo0p LRLwIX0RxkQCb+nCPLJTjH+NLQ5Zw3BLFTmizcewJJuYnv8eBbBcAEsegvrkIl7B 0KmHEttdWFkujE+kISmfI6WsvxiFt+VbcjqgFMNM7D5Xw352x3v3X9VMPO7P/5lz ztWCdYZxhH2qFmeDiNmKMnPqtUJjOppTR73jqMzPHUI4PQcFxzGaTRwntuCJQ/XA fRwcTEQAX3r/xdCDb7+tIq00i+J8ZDTqwng9/8GqlWyjeDQZG8CmaGvDBwA1+n+X zG6lmOHLPIBppOF8rUQ1Q1ZlZl5x0jPDoo19mGdlQ+IgZocdo4z43XTc0c+oLguA zjiXKJXn1EJvl4iKLeF6nkxJxESJioCNg3eXqFPLFjYDSWzrK7umTkiJMLB9UbqN ThqrxgVrMpKjSug9KKqItu47WZ4s+dczkkngyiqMUUo34RDnfCQXwCBO7JAdzyyH XnwrCVj6hD8SIIv2REWNAiBTzIqEZxNmc9qvwj+Xy18hXKYhqqYRtf35QL3adp9R Nigl9dtcTdCkNJiaVYgfcTz/9ZMLcrKcMFV27ExMYZiDce2T7YWnnE5/VpheXi0r /EULZLxKJgu99SHACLmK1ZWD8YuoqlRZVQtk8LTNOBAiu8sasrwVYy7meQWlsL/8 Q/bmUmWXUD3I6CwIaS1R =U9Pc -----END PGP SIGNATURE----- Merge tag 'docs-4.9-2' of git://git.lwn.net/linux Pull one more documentation update from Jonathan Corbet: "A single commit converting the mac80211 DocBook template over to Sphinx. Only 32 more to go..." * tag 'docs-4.9-2' of git://git.lwn.net/linux: docs-rst: sphinxify 802.11 documentation	2016-10-14 14:11:22 -07:00
Jiri Bohac	76506a986d	IPv6: fix DESYNC_FACTOR The IPv6 temporary address generation uses a variable called DESYNC_FACTOR to prevent hosts updating the addresses at the same time. Quoting RFC 4941: ... The value DESYNC_FACTOR is a random value (different for each client) that ensures that clients don't synchronize with each other and generate new addresses at exactly the same time ... DESYNC_FACTOR is defined as: DESYNC_FACTOR -- A random value within the range 0 - MAX_DESYNC_FACTOR. It is computed once at system start (rather than each time it is used) and must never be greater than (TEMP_VALID_LIFETIME - REGEN_ADVANCE). First, I believe the RFC has a typo in it and meant to say: "and must never be greater than (TEMP_PREFERRED_LIFETIME - REGEN_ADVANCE)" The reason is that at various places in the RFC, DESYNC_FACTOR is used in a calculation like (TEMP_PREFERRED_LIFETIME - DESYNC_FACTOR) or (TEMP_PREFERRED_LIFETIME - REGEN_ADVANCE - DESYNC_FACTOR). It needs to be smaller than (TEMP_PREFERRED_LIFETIME - REGEN_ADVANCE) for the result of these calculations to be larger than zero. It's never used in a calculation together with TEMP_VALID_LIFETIME. I already submitted an errata to the rfc-editor: https://www.rfc-editor.org/errata_search.php?rfc=4941 The Linux implementation of DESYNC_FACTOR is very wrong: max_desync_factor is used in places DESYNC_FACTOR should be used. max_desync_factor is initialized to the RFC-recommended value for MAX_DESYNC_FACTOR (600) but the whole point is to get a _random_ value. And nothing ensures that the value used is not greater than (TEMP_PREFERRED_LIFETIME - REGEN_ADVANCE), which leads to underflows. The effect can easily be observed when setting the temp_prefered_lft sysctl e.g. to 60. The preferred lifetime of the temporary addresses will be bogus. TEMP_PREFERRED_LIFETIME and REGEN_ADVANCE are not constants and can be influenced by these three sysctls: regen_max_retry, dad_transmits and temp_prefered_lft. Thus, the upper bound for desync_factor needs to be re-calculated each time a new address is generated and if desync_factor is larger than the new upper bound, a new random value needs to be re-generated. And since we already have max_desync_factor configurable per interface, we also need to calculate and store desync_factor per interface. Signed-off-by: Jiri Bohac <jbohac@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-14 10:59:15 -04:00
Jiri Bohac	9d6280da39	IPv6: Drop the temporary address regen_timer The randomized interface identifier (rndid) was periodically updated from the regen_timer timer. Simplify the code by updating the rndid only when needed by ipv6_try_regen_rndid(). This makes the follow-up DESYNC_FACTOR fix much simpler. Also it fixes a reference counting error in this error path, where an in6_dev_put was missing: err = addrconf_sysctl_register(ndev); if (err) { ipv6_mc_destroy_dev(ndev); - del_timer(&ndev->regen_timer); snmp6_unregister_dev(ndev); goto err_release; Signed-off-by: Jiri Bohac <jbohac@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-14 10:59:15 -04:00
Shmulik Ladkani	5724b8b569	net/sched: tc_mirred: Rename public predicates 'is_tcf_mirred_redirect' and 'is_tcf_mirred_mirror' These accessors are used in various drivers that support tc offloading, to detect properties of a given 'tc_action'. 'is_tcf_mirred_redirect' tests that the action is TCA_EGRESS_REDIR. 'is_tcf_mirred_mirror' tests that the action is TCA_EGRESS_MIRROR. As a prep towards supporting INGRESS redir/mirror, rename these predicates to reflect their true meaning: s/is_tcf_mirred_redirect/is_tcf_mirred_egress_redirect/ s/is_tcf_mirred_mirror/is_tcf_mirred_egress_mirror/ Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Cc: Hariprasad S <hariprasad@chelsio.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: Saeed Mahameed <saeedm@mellanox.com> Cc: Jiri Pirko <jiri@mellanox.com> Cc: Ido Schimmel <idosch@mellanox.com> Cc: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-14 10:23:06 -04:00
Shmulik Ladkani	165779231f	net/sched: act_mirred: Rename tcfm_ok_push to tcfm_mac_header_xmit and make it a bool 'tcfm_ok_push' specifies whether a mac_len sized push is needed upon egress to the target device (if action is performed at ingress). Rename it to 'tcfm_mac_header_xmit' as this is actually an attribute of the target device (and use a bool instead of int). This allows to decouple the attribute from the action to be taken. Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-14 10:23:06 -04:00
David S. Miller	8eed1cd4cd	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	2016-10-14 10:00:27 -04:00
Linus Torvalds	29fbff8698	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking fixes from David Miller: 1) Fix various build warnings in tlan/qed/xen-netback drivers, from Arnd Bergmann. 2) Propagate proper error code in strparser's strp_recv(), from Geert Uytterhoeven. 3) Fix accidental broadcast of RTM_GETTFILTER responses, from Eric Dumazret. 4) Need to use list_for_each_entry_safe() in qed driver, from Wei Yongjun. 5) Openvswitch 802.1AD bug fixes from Jiri Benc. 6) Cure BUILD_BUG_ON() in mlx5 driver, from Tom Herbert. 7) Fix UDP ipv6 checksumming in netvsc driver, from Stephen Hemminger. 8) stmmac driver fixes from Giuseppe CAVALLARO. 9) Fix access to mangled IP6CB in tcp, from Eric Dumazet. 10) Fix info leaks in tipc and rtnetlink, from Dan Carpenter. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (27 commits) net: bridge: add the multicast_flood flag attribute to brport_attrs net: axienet: Remove unused parameter from __axienet_device_reset liquidio: CN23XX: fix a loop timeout net: rtnl: info leak in rtnl_fill_vfinfo() tipc: info leak in __tipc_nl_add_udp_addr() net: ipv4: Do not drop to make_route if oif is l3mdev net: phy: Trigger state machine on state change and not polling. ipv6: tcp: restore IP6CB for pktoptions skbs netvsc: Remove mistaken udp.h inclusion. xen-netback: fix type mismatch warning stmmac: fix error check when init ptp stmmac: fix ptp init for gmac4 qed: fix old-style function definition netvsc: fix checksum on UDP IPV6 net_sched: reorder pernet ops and act ops registrations xen-netback: fix guest Rx stall detection (after guest Rx refactor) drivers/ptp: Fix kernel memory disclosure net/mlx5: Add MLX5_ARRAY_SET64 to fix BUILD_BUG_ON qmi_wwan: add support for Quectel EC21 and EC25 openvswitch: add NETIF_F_HW_VLAN_STAG_TX to internal dev ...	2016-10-13 21:40:23 -07:00
David Ahern	6104e112f4	net: ipv4: Do not drop to make_route if oif is l3mdev Commit `e0d56fdd73` was a bit aggressive removing l3mdev calls in the IPv4 stack. If the fib_lookup fails we do not want to drop to make_route if the oif is an l3mdev device. Also reverts `19664c6a00` ("net: l3mdev: Remove netif_index_is_l3_master") which removed netif_index_is_l3_master. Fixes: `e0d56fdd73` ("net: l3mdev: remove redundant calls") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-13 12:05:26 -04:00
Xin Long	8ae808eb85	sctp: remove the old ttl expires policy The prsctp polices include ttl expires policy already, we should remove the old ttl expires codes, and just adjust the new polices' codes to be compatible with the old one for users. This patch is to remove all the old expires codes, and if prsctp polices are not set, it will still set msg's expires_at and check the expires in sctp_check_abandoned. Note that asoc->prsctp_enable is set by default, so users can't feel any difference even if they use the old expires api in userspace. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-13 09:44:14 -04:00
Xin Long	cc6ac9bccf	sctp: reuse sent_count to avoid retransmitted chunks for RTT measurements Now sctp uses chunk->resent to record if a chunk is retransmitted, for RTT measurements with retransmitted DATA chunks. chunk->sent_count was introduced to record how many times one chunk has been sent for prsctp RTX policy before. We actually can know if one chunk is retransmitted by checking chunk->sent_count is greater than 1. This patch is to remove resent from sctp_chunk and reuse sent_count to avoid retransmitted chunks for RTT measurements. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-13 09:44:13 -04:00
Purushottam Kushwaha	0c317a02ca	cfg80211: support virtual interfaces with different beacon intervals This commit provides a mechanism for the host drivers to advertise the support for different beacon intervals among the respective interface combinations in a group, through NL80211_IFACE_COMB_BI_MIN_GCD (u32). This value will be compared against GCD of all beaconing interfaces of matching combinations. If the driver doesn't advertise this value, the old behaviour where all beacon intervals must be identical is retained. If it is specified, then any beacon interval for an interface in the interface combination as well as the GCD of all active beacon intervals in the combination must be greater or equal to this value. Signed-off-by: Purushottam Kushwaha <pkushwah@qti.qualcomm.com> [change commit message, some variable names, small other things] Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-10-13 14:28:29 +02:00
Purushottam Kushwaha	e227300c83	cfg80211: pass struct to interface combination check/iter Move the growing parameter list to a structure for the interface combination check and iteration functions in cfg80211 and mac80211 to make the code easier to understand. Signed-off-by: Purushottam Kushwaha <pkushwah@qti.qualcomm.com> [edit commit message] Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-10-13 13:39:49 +02:00
Johannes Berg	8b935ee2ea	cfg80211: add ability to check DA/SA in A-MSDU decapsulation We should not accept arbitrary DA/SA inside A-MSDUs, it could be used to circumvent protections, like allowing a station to send frames and make them seem to come from somewhere else. Add the necessary infrastructure in cfg80211 to allow such checks, in further patches we'll start using them. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-10-12 09:19:10 +02:00
Johannes Berg	7f6990c830	cfg80211: let ieee80211_amsdu_to_8023s() take only header-less SKB There's only a single case where has_80211_header is passed as true, which is in mac80211. Given that there's only simple code that needs to be done before calling it, export that function from cfg80211 instead and let mac80211 call it itself. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-10-12 09:19:10 +02:00
Linus Torvalds	4cdf8dbe2d	Merge branch 'work.uaccess2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull uaccess.h prepwork from Al Viro: "Preparations to tree-wide switch to use of linux/uaccess.h (which, obviously, will allow to start unifying stuff for real). The last step there, ie PATT='^[[:blank:]]#[[:blank:]]include[[:blank:]]<asm/uaccess.h>' sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \ `git grep -l "$PATT"\|grep -v ^include/linux/uaccess.h` is not taken here - I would prefer to do it once just before or just after -rc1. However, everything should be ready for it" 'work.uaccess2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: remove a stray reference to asm/uaccess.h in docs sparc64: separate extable_64.h, switch elf_64.h to it score: separate extable.h, switch module.h to it mips: separate extable.h, switch module.h to it x86: separate extable.h, switch sections.h to it remove stray include of asm/uaccess.h from cacheflush.h mn10300: remove a bogus processor.h->uaccess.h include xtensa: split uaccess.h into C and asm sides bonding: quit messing with IOCTL kill __kernel_ds_p off mn10300: finish verify_area() off frv: move HAVE_ARCH_UNMAPPED_AREA to pgtable.h exceptions: detritus removal	2016-10-11 23:38:39 -07:00
Johannes Berg	819bf59376	docs-rst: sphinxify 802.11 documentation This is just a very basic conversion, I've split up the original multi-book template, and also split up the multi-part mac80211 part in the original book; neither of those were handled by the automatic pandoc conversion. Fix errors that showed up, resulting in a much nicer rendering, at least for the interface combinations documentation. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net>	2016-10-11 16:19:17 -06:00
Linus Torvalds	14986a34e1	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull namespace updates from Eric Biederman: "This set of changes is a number of smaller things that have been overlooked in other development cycles focused on more fundamental change. The devpts changes are small things that were a distraction until we managed to kill off DEVPTS_MULTPLE_INSTANCES. There is an trivial regression fix to autofs for the unprivileged mount changes that went in last cycle. A pair of ioctls has been added by Andrey Vagin making it is possible to discover the relationships between namespaces when referring to them through file descriptors. The big user visible change is starting to add simple resource limits to catch programs that misbehave. With namespaces in general and user namespaces in particular allowing users to use more kinds of resources, it has become important to have something to limit errant programs. Because the purpose of these limits is to catch errant programs the code needs to be inexpensive to use as it always on, and the default limits need to be high enough that well behaved programs on well behaved systems don't encounter them. To this end, after some review I have implemented per user per user namespace limits, and use them to limit the number of namespaces. The limits being per user mean that one user can not exhause the limits of another user. The limits being per user namespace allow contexts where the limit is 0 and security conscious folks can remove from their threat anlysis the code used to manage namespaces (as they have historically done as it root only). At the same time the limits being per user namespace allow other parts of the system to use namespaces. Namespaces are increasingly being used in application sand boxing scenarios so an all or nothing disable for the entire system for the security conscious folks makes increasing use of these sandboxes impossible. There is also added a limit on the maximum number of mounts present in a single mount namespace. It is nontrivial to guess what a reasonable system wide limit on the number of mount structure in the kernel would be, especially as it various based on how a system is using containers. A limit on the number of mounts in a mount namespace however is much easier to understand and set. In most cases in practice only about 1000 mounts are used. Given that some autofs scenarious have the potential to be 30,000 to 50,000 mounts I have set the default limit for the number of mounts at 100,000 which is well above every known set of users but low enough that the mount hash tables don't degrade unreaonsably. These limits are a start. I expect this estabilishes a pattern that other limits for resources that namespaces use will follow. There has been interest in making inotify event limits per user per user namespace as well as interest expressed in making details about what is going on in the kernel more visible" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (28 commits) autofs: Fix automounts by using current_real_cred()->uid mnt: Add a per mount namespace limit on the number of mounts netns: move {inc,dec}_net_namespaces into #ifdef nsfs: Simplify __ns_get_path tools/testing: add a test to check nsfs ioctl-s nsfs: add ioctl to get a parent namespace nsfs: add ioctl to get an owning user namespace for ns file descriptor kernel: add a helper to get an owning user namespace for a namespace devpts: Change the owner of /dev/pts/ptmx to the mounter of /dev/pts devpts: Remove sync_filesystems devpts: Make devpts_kill_sb safe if fsi is NULL devpts: Simplify devpts_mount by using mount_nodev devpts: Move the creation of /dev/pts/ptmx into fill_super devpts: Move parse_mount_options into fill_super userns: When the per user per user namespace limit is reached return ENOSPC userns; Document per user per user namespace limits. mntns: Add a limit on the number of mount namespaces. netns: Add a limit on the number of net namespaces cgroupns: Add a limit on the number of cgroup namespaces ipcns: Add a limit on the number of ipc namespaces ...	2016-10-06 09:52:23 -07:00
Stephen Rothwell	a44c984f1e	netfilter: merge fixup for "nf_tables_netdev: remove redundant ip_hdr assignment" Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-05 20:25:48 -04:00
Johannes Berg	1e1430d528	Merge remote-tracking branch 'net-next/master' into mac80211-next Resolve the merge conflict between Felix's/my and Toke's patches coming into the tree through net and mac80211-next respectively. Most of Felix's changes go away due to Toke's new infrastructure work, my patch changes to "goto begin" (the label wasn't there before) instead of returning NULL so flow control towards drivers is preserved better. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-10-04 09:46:44 +02:00
Gavin Shan	c0cd1ba4f8	net/ncsi: Introduce ncsi_stop_dev() This introduces ncsi_stop_dev(), as counterpart to ncsi_start_dev(), to stop the NCSI device so that it can be reenabled in future. This API should be called when the network device driver is going to shutdown the device. There are 3 things done in the function: Stop the channel monitoring; Reset channels to inactive state; Report NCSI link down. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Reviewed-by: Joel Stanley <joel@jms.id.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-04 02:11:51 -04:00
Jiri Benc	85de4a2101	openvswitch: use mpls_hdr skb_mpls_header is equivalent to mpls_hdr now. Use the existing helper instead. Signed-off-by: Jiri Benc <jbenc@redhat.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-03 02:00:22 -04:00
Jiri Benc	9095e10edd	mpls: move mpls_hdr to a common location This will be also used by openvswitch. Signed-off-by: Jiri Benc <jbenc@redhat.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-03 02:00:21 -04:00
David S. Miller	b50afd203a	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Three sets of overlapping changes. Nothing serious. Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-02 22:20:41 -04:00
Toke Høiland-Jørgensen	bb42f2d13f	mac80211: Move reorder-sensitive TX handlers to after TXQ dequeue The TXQ intermediate queues can cause packet reordering when more than one flow is active to a single station. Since some of the wifi-specific packet handling (notably sequence number and encryption handling) is sensitive to re-ordering, things break if they are applied before the TXQ. This splits up the TX handlers and fast_xmit logic into two parts: An early part and a late part. The former is applied before TXQ enqueue, and the latter after dequeue. The non-TXQ path just applies both parts at once. Because fragments shouldn't be split up or reordered, the fragmentation handler is run after dequeue. Any fragments are then kept in the TXQ and on subsequent dequeues they take precedence over dequeueing from the FQ structure. This approach avoids having to scatter special cases all over the place for when TXQ is enabled, at the cost of making the fast_xmit and TX handler code slightly more complex. Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> [fix a few code-style nits, make ieee80211_xmit_fast_finish void, remove a useless txq->sta check] Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-30 14:46:57 +02:00
Pedersen, Thomas	354d381baf	mac80211: add offset_tsf driver op and use it for mesh This allows the mesh sync (and debugfs) code to make incremental TSF adjustments, avoiding any uncertainty introduced by delay in programming absolute TSF. Signed-off-by: Thomas Pedersen <twp@qca.qualcomm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-30 13:45:44 +02:00
Toke Høiland-Jørgensen	097b065b5c	fq.h: Port memory limit mechanism from fq_codel The reusable fairness queueing implementation (fq.h) lacks the memory usage limit that the fq_codel qdisc has. This means that small devices (e.g. WiFi routers) can run out of memory when flooded with a large number of packets. This ports the memory limit feature from fq_codel to fq.h. Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-30 13:29:21 +02:00
Ayala Beker	92bc43bce2	mac80211: Add API to report NAN function match Provide an API to report NAN function match. Mac80211 will lookup the corresponding cookie and report the match to cfg80211. Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com> Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-30 13:21:57 +02:00
Ayala Beker	167e33f4f6	mac80211: Implement add_nan_func and rm_nan_func Implement add/rm_nan_func functions and handle NAN function termination notifications. Handle instance_id allocation for NAN functions and implement the reconfig flow. Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com> Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-30 13:21:52 +02:00
Ayala Beker	5953ff6d6a	mac80211: implement nan_change_conf Implement nan_change_conf callback which allows to change current NAN configuration (master preference and dual band operation). Store the current NAN configuration in sdata, so it can be used both to provide the driver the updated configuration with changes and also it will be used in hw reconfig flows in next patches. Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com> Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-30 13:21:43 +02:00
Ayala Beker	368e5a7b4e	cfg80211: Provide an API to report NAN function termination Provide a function that reports NAN DE function termination. The function may be terminated due to one of the following reasons: user request, ttl expiration or failure. If the NAN instance is tied to the owner, the notification will be sent to the socket that started the NAN interface only Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com> Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-30 13:21:37 +02:00
Ayala Beker	50bcd31d99	cfg80211: provide a function to report a match for NAN Provide a function the driver can call to report a match. This will send the event to the user space. If the NAN instance is tied to the owner, the notifications will be sent to the socket that started the NAN interface only. Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com> Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-30 13:21:32 +02:00
Ayala Beker	a5a9dcf291	cfg80211: allow the user space to change current NAN configuration Some NAN configuration paramaters may change during the operation of the NAN device. For example, a user may want to update master preference value when the device gets plugged/unplugged to the power. Add API that allows to do so. Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com> Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-30 13:21:28 +02:00
Ayala Beker	a442b761b2	cfg80211: add add_nan_func / del_nan_func A NAN function can be either publish, subscribe or follow up. Make all the necessary verifications and just pass the request to the driver. Allow the user space application that starts NAN to forbid any other socket to add or remove functions. Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com> Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Ayala Beker <ayala.beker@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-30 13:21:23 +02:00
Ayala Beker	708d50edb1	mac80211: add boilerplate code for start / stop NAN This code doesn't do much besides allowing to start and stop the vif. Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com> Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Ayala Beker <ayala.beker@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-30 13:21:19 +02:00
Ayala Beker	cb3b7d8765	cfg80211: add start / stop NAN commands This allows user space to start/stop NAN interface. A NAN interface is like P2P device in a few aspects: it doesn't have a netdev associated to it. Add the new interface type and prevent operations that can't be executed on NAN interface like scan. Define several attributes that may be configured by user space when starting NAN functionality (master preference and dual band operation) Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com> Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-30 13:21:14 +02:00
David Spinadel	b8676221f0	cfg80211: Add support for static WEP in the driver Add support for drivers that implement static WEP internally, i.e. expose connection keys to the driver in connect flow and don't upload the keys after the connection. Signed-off-by: David Spinadel <david.spinadel@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-30 13:19:10 +02:00
Xin Long	0605483f6a	sctp: remove prsctp_param from sctp_chunk Now sctp uses chunk->prsctp_param to save the prsctp param for all the prsctp polices, we didn't need to introduce prsctp_param to sctp_chunk. We can just use chunk->sinfo.sinfo_timetolive for RTX and BUF polices, and reuse msg->expires_at for TTL policy, as the prsctp polices and old expires policy are mutual exclusive. This patch is to remove prsctp_param from sctp_chunk, and reuse msg's expires_at for TTL and chunk's sinfo.sinfo_timetolive for RTX and BUF polices. Note that sctp can't use chunk's sinfo.sinfo_timetolive for TTL policy, as it needs a u64 variables to save the expires_at time. This one also fixes the "netperf-Throughput_Mbps -37.2% regression" issue. Fixes: `a6c2f79287` ("sctp: implement prsctp TTL policy") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-30 02:07:05 -04:00
Xin Long	73dca124cd	sctp: move sent_count to the memory hole in sctp_chunk Now pahole sctp_chunk, it has 2 memory holes: struct sctp_chunk { struct list_head list; atomic_t refcnt; /* XXX 4 bytes hole, try to pack / ... long unsigned int prsctp_param; int sent_count; / XXX 4 bytes hole, try to pack */ This patch is to move up sent_count to fill the 1st one and eliminate the 2nd one. It's not just another struct compaction, it also fixes the "netperf- Throughput_Mbps -37.2% regression" issue when overloading the CPU. Fixes: `a6c2f79287` ("sctp: implement prsctp TTL policy") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-30 02:07:05 -04:00
Maciej Żenczykowski	bd11f0741f	ipv6 addrconf: implement RFC7559 router solicitation backoff This implements: https://tools.ietf.org/html/rfc7559 Backoff is performed according to RFC3315 section 14: https://tools.ietf.org/html/rfc3315#section-14 We allow setting /proc/sys/net/ipv6/conf//router_solicitations to a negative value meaning an unlimited number of retransmits, and we make this the new default (inline with the RFC). We also add a new setting: /proc/sys/net/ipv6/conf//router_solicitation_max_interval defaulting to 1 hour (per RFC recommendation). Signed-off-by: Maciej Żenczykowski <maze@google.com> Acked-by: Erik Kline <ek@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-30 01:54:28 -04:00
Jia He	6348ef2dbb	net:snmp: Introduce generic interfaces for snmp_get_cpu_field{, 64} This is to introduce the generic interfaces for snmp_get_cpu_field{,64}. It exchanges the two for-loops for collecting the percpu statistics data. This can aggregate the data by going through all the items of each cpu sequentially. Signed-off-by: Jia He <hejianet@gmail.com> Suggested-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-30 01:50:44 -04:00
Hadar Hen Zion	fa5effe766	net/sched: pkt_cls: change tc actions order to be as the user sets Currently the created tc actions list is reversed against the order set by the user. Change the actions list order to be the same as was set by the user. This patch doesn't affect dump actions behavior. For dumping, action->order parameter is used so the list order doesn't matter. Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-28 05:02:44 -04:00
Jiri Pirko	347e3b28c1	switchdev: remove FIB offload infrastructure Since this is now taken care of by FIB notifier, remove the code, with all unused dependencies. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-28 04:48:00 -04:00
Jiri Pirko	c98501879b	fib: introduce FIB info offload flag helpers These helpers are to be used in case someone offloads the FIB entry. The result is that if the entry is offloaded to at least one device, the offload flag is set. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-28 04:48:00 -04:00
Jiri Pirko	b90eb75494	fib: introduce FIB notification infrastructure This allows to pass information about added/deleted FIB entries/rules to whoever is interested. This is done in a very similar way as devinet notifies address additions/removals. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-28 04:48:00 -04:00
Al Viro	4ad41c1e26	bonding: quit messing with IOCTL The only remaining users are issuing SIOCGMIIPHY and SIOCGMIIREG, neither of which deals with userland pointers. Simply calling ->ndo_do_ioctl() is fine; no messing with set_fs() is needed. It used to mess with SIOCETHTOOL, which would've needed set_fs(), but that has been killed in "[NET] ethtool ops are the only way" 9 years ago... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-09-27 21:15:21 -04:00
Johannes Berg	8564e38206	cfg80211: add checks for beacon rate, extend to mesh The previous commit added support for specifying the beacon rate for AP mode. Add features checks to this, and extend it to also support the rate configuration for mesh networks. For IBSS it's not as simple due to joining etc., so that's not yet supported. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-26 10:23:48 +02:00
Purushottam Kushwaha	a7c7fbff6a	cfg80211: Add support to configure a beacon data rate This allows an option to configure a single beacon tx rate for an AP. Signed-off-by: Purushottam Kushwaha <pkushwah@qti.qualcomm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-26 10:23:48 +02:00
Pablo Neira Ayuso	f20fbc0717	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next Conflicts: net/netfilter/core.c net/netfilter/nf_tables_netdev.c Resolve two conflicts before pull request for David's net-next tree: 1) Between `c73c248490` ("netfilter: nf_tables_netdev: remove redundant ip_hdr assignment") from the net tree and commit `ddc8b6027a` ("netfilter: introduce nft_set_pktinfo_{ipv4, ipv6}_validate()"). 2) Between `e8bffe0cf9` ("net: Add _nf_(un)register_hooks symbols") and Aaron Conole's patches to replace list_head with single linked list. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-09-25 23:34:19 +02:00
Liping Zhang	ff107d2776	netfilter: nft_log: complete NFTA_LOG_FLAGS attr support NFTA_LOG_FLAGS attribute is already supported, but the related NF_LOG_XXX flags are not exposed to the userspace. So we cannot explicitly enable log flags to log uid, tcp sequence, ip options and so on, i.e. such rule "nft add rule filter output log uid" is not supported yet. So move NF_LOG_XXX macro definitions to the uapi/../nf_log.h. In order to keep consistent with other modules, change NF_LOG_MASK to refer to all supported log flags. On the other hand, add a new NF_LOG_DEFAULT_MASK to refer to the original default log flags. Finally, if user specify the unsupported log flags or NFTA_LOG_GROUP and NFTA_LOG_FLAGS are set at the same time, report EINVAL to the userspace. Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-09-25 23:16:43 +02:00
Pablo Neira Ayuso	0f3cd9b369	netfilter: nf_tables: add range expression Inverse ranges != [a,b] are not currently possible because rules are composites of && operations, and we need to express this: data < a \|\| data > b This patch adds a new range expression. Positive ranges can be already through two cmp expressions: cmp(sreg, data, >=) cmp(sreg, data, <=) This new range expression provides an alternative way to express this. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-09-25 23:16:42 +02:00
Aaron Conole	e3b37f11e6	netfilter: replace list_head with single linked list The netfilter hook list never uses the prev pointer, and so can be trimmed to be a simple singly-linked list. In addition to having a more light weight structure for hook traversal, struct net becomes 5568 bytes (down from 6400) and struct net_device becomes 2176 bytes (down from 2240). Signed-off-by: Aaron Conole <aconole@bytheb.org> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-09-25 14:38:48 +02:00
Aaron Conole	54f17bbc52	netfilter: nf_queue: whitespace cleanup A future patch will modify the hook drop and outfn functions. This will cause the line lengths to take up too much space. This is simply a readability change. Signed-off-by: Aaron Conole <aconole@bytheb.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-09-25 01:20:05 +02:00
Florian Westphal	c5136b15ea	netfilter: bridge: add and use br_nf_hook_thresh This replaces the last uses of NF_HOOK_THRESH(). Followup patch will remove it and rename nf_hook_thresh. The reason is that inet (non-bridge) netfilter no longer invokes the hooks from hooks, so we do no longer need the thresh value to skip hooks with a lower priority. The bridge netfilter however may need to do this. br_nf_hook_thresh is a wrapper that is supposed to do this, i.e. only call hooks with a priority that exceeds NF_BR_PRI_BRNF. It's used only in the recursion cases of br_netfilter. It invokes nf_hook_slow while holding an rcu read-side critical section to make a future cleanup simpler. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Aaron Conole <aconole@bytheb.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-09-24 21:25:48 +02:00
Vivien Didelot	732f794c1b	net: dsa: add port fast ageing Today the DSA drivers are in charge of flushing the MAC addresses associated to a port when its STP state changes from Learning or Forwarding, to Disabled or Blocking or Listening. This makes the drivers more complex and hides the generic switch logic. Introduce a new optional port_fast_age operation to dsa_switch_ops, to move this logic to the DSA layer and keep drivers simple. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-23 08:38:50 -04:00
Or Gerlitz	53e89941ba	net_sched: act_vlan: add helper inlines to access tcf_vlan info Needed e.g for offloading drivers to pick the relevant attributes. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-23 07:22:11 -04:00
Marcelo Ricardo Leitner	182691d099	sctp: improve how SSN, TSN and ASCONF serial are compared Make it similar to time_before() macros: - easier to understand - make use of typecheck() to avoid working on unexpected variable types (made the issue on previous patch visible) - for _[lg]te versions, slighly faster, as the compiler used to generate a sequence of cmp/je/cmp/js instructions and now it's sub/test/jle (for _lte): Before, for sctp_outq_sack: if (primary->cacc.changeover_active) { 1f01: 80 b9 84 02 00 00 00 cmpb $0x0,0x284(%rcx) 1f08: 74 6e je 1f78 <sctp_outq_sack+0xe8> u8 clear_cycling = 0; if (TSN_lte(primary->cacc.next_tsn_at_change, sack_ctsn)) { 1f0a: 8b 81 80 02 00 00 mov 0x280(%rcx),%eax return ((s) - (t)) & TSN_SIGN_BIT; } static inline int TSN_lte(__u32 s, __u32 t) { return ((s) == (t)) \|\| (((s) - (t)) & TSN_SIGN_BIT); 1f10: 8b 7d bc mov -0x44(%rbp),%edi 1f13: 39 c7 cmp %eax,%edi 1f15: 74 25 je 1f3c <sctp_outq_sack+0xac> 1f17: 39 f8 cmp %edi,%eax 1f19: 78 21 js 1f3c <sctp_outq_sack+0xac> primary->cacc.changeover_active = 0; After: if (primary->cacc.changeover_active) { 1ee7: 80 b9 84 02 00 00 00 cmpb $0x0,0x284(%rcx) 1eee: 74 73 je 1f63 <sctp_outq_sack+0xf3> u8 clear_cycling = 0; if (TSN_lte(primary->cacc.next_tsn_at_change, sack_ctsn)) { 1ef0: 8b 81 80 02 00 00 mov 0x280(%rcx),%eax 1ef6: 2b 45 b4 sub -0x4c(%rbp),%eax 1ef9: 85 c0 test %eax,%eax 1efb: 7e 26 jle 1f23 <sctp_outq_sack+0xb3> primary->cacc.changeover_active = 0; *_lt() generated pretty much the same code. Tested with gcc (GCC) 6.1.1 20160621. This patch also removes SSN_lte as it is not used and cleanups some comments. Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-23 06:54:58 -04:00
David S. Miller	d6989d4bbe	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	2016-09-23 06:46:57 -04:00
Liping Zhang	2462f3f4a7	netfilter: nf_queue: improve queue range support for bridge family After commit `ac28634456` ("netfilter: bridge: add nf_afinfo to enable queuing to userspace"), we can queue packets to the user space in bridge family. But when the user specify the queue range, packets will be only delivered to the first queue num. Because in nfqueue_hash, we only support ipv4 and ipv6 family. Now add support for bridge family too. Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-09-23 09:30:01 +02:00
Laura Garcia Liebana	36b701fae1	netfilter: nf_tables: validate maximum value of u32 netlink attributes Fetch value and validate u32 netlink attribute. This validation is usually required when the u32 netlink attributes are being stored in a field whose size is smaller. This patch revisits `4da449ae1d` ("netfilter: nft_exthdr: Add size check on u8 nft_exthdr attributes"). Fixes: `96518518cc` ("netfilter: add nftables") Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Laura Garcia Liebana <nevola@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-09-23 09:29:02 +02:00
Marcelo Ricardo Leitner	e2f036a972	sctp: rename WORD_TRUNC/ROUND macros To something more meaningful these days, specially because this is working on packet headers or lengths and which are not tied to any CPU arch but to the protocol itself. So, WORD_TRUNC becomes SCTP_TRUNC4 and WORD_ROUND becomes SCTP_PAD4. Reported-by: David Laight <David.Laight@ACULAB.COM> Reported-by: David Miller <davem@davemloft.net> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-22 03:13:26 -04:00
David S. Miller	ba1ba25d31	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2016-09-21 1) Propagate errors on security context allocation. From Mathias Krause. 2) Fix inbound policy checks for inter address family tunnels. From Thomas Zeitlhofer. 3) Fix an old memory leak on aead algorithm usage. From Ilan Tayari. 4) A recent patch fixed a possible NULL pointer dereference but broke the vti6 input path. Fix from Nicolas Dichtel. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-22 02:56:23 -04:00
Jakub Kicinski	68d640630d	net: cls_bpf: allow offloaded filters to update stats Call into offloaded filters to update stats. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-21 19:50:03 -04:00
Jakub Kicinski	0d01d45f1b	net: cls_bpf: limit hardware offload by software-only flag Add cls_bpf support for the TCA_CLS_FLAGS_SKIP_HW flag. Unlike U32 and flower cls_bpf already has some netlink flags defined. Create a new attribute to be able to use the same flag values as the above. Unlike U32 and flower reject unknown flags. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-21 19:50:02 -04:00
Jakub Kicinski	332ae8e2f6	net: cls_bpf: add hardware offload This patch adds hardware offload capability to cls_bpf classifier, similar to what have been done with U32 and flower. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-21 19:50:02 -04:00
Nicolas Dichtel	63c43787d3	vti6: fix input path Since commit `1625f45299`, vti6 is broken, all input packets are dropped (LINUX_MIB_XFRMINNOSTATES is incremented). XFRM_TUNNEL_SKB_CB(skb)->tunnel.ip6 is set by vti6_rcv() before calling xfrm6_rcv()/xfrm6_rcv_spi(), thus we cannot set to NULL that value in xfrm6_rcv_spi(). A new function xfrm6_rcv_tnl() that enables to pass a value to xfrm6_rcv_spi() is added, so that xfrm6_rcv() is not touched (this function is used in several handlers). CC: Alexey Kodanev <alexey.kodanev@oracle.com> Fixes: `1625f45299` ("net/xfrm_input: fix possible NULL deref of tunnel.ip6->parms.i_key") Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2016-09-21 10:09:14 +02:00
Neal Cardwell	7e74417138	tcp: increase ICSK_CA_PRIV_SIZE from 64 bytes to 88 The TCP CUBIC module already uses 64 bytes. The upcoming TCP BBR module uses 88 bytes. Signed-off-by: Van Jacobson <vanj@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-21 00:23:01 -04:00
Yuchung Cheng	c0402760f5	tcp: new CC hook to set sending rate with rate_sample in any CA state This commit introduces an optional new "omnipotent" hook, cong_control(), for congestion control modules. The cong_control() function is called at the end of processing an ACK (i.e., after updating sequence numbers, the SACK scoreboard, and loss detection). At that moment we have precise delivery rate information the congestion control module can use to control the sending behavior (using cwnd, TSO skb size, and pacing rate) in any CA state. This function can also be used by a congestion control that prefers not to use the default cwnd reduction approach (i.e., the PRR algorithm) during CA_Recovery to control the cwnd and sending rate during loss recovery. We take advantage of the fact that recent changes defer the retransmission or transmission of new data (e.g. by F-RTO) in recovery until the new tcp_cong_control() function is run. With this commit, we only run tcp_update_pacing_rate() if the congestion control is not using this new API. New congestion controls which use the new API do not want the TCP stack to run the default pacing rate calculation and overwrite whatever pacing rate they have chosen at initialization time. Signed-off-by: Van Jacobson <vanj@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-21 00:23:01 -04:00
Yuchung Cheng	77bfc174c3	tcp: allow congestion control to expand send buffer differently Currently the TCP send buffer expands to twice cwnd, in order to allow limited transmits in the CA_Recovery state. This assumes that cwnd does not increase in the CA_Recovery. For some congestion control algorithms, like the upcoming BBR module, if the losses in recovery do not indicate congestion then we may continue to raise cwnd multiplicatively in recovery. In such cases the current multiplier will falsely limit the sending rate, much as if it were limited by the application. This commit adds an optional congestion control callback to use a different multiplier to expand the TCP send buffer. For congestion control modules that do not specificy this callback, TCP continues to use the previous default of 2. Signed-off-by: Van Jacobson <vanj@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-21 00:23:01 -04:00
Neal Cardwell	1b3878ca15	tcp: export tcp_tso_autosize() and parameterize minimum number of TSO segments To allow congestion control modules to use the default TSO auto-sizing algorithm as one of the ingredients in their own decision about TSO sizing: 1) Export tcp_tso_autosize() so that CC modules can use it. 2) Change tcp_tso_autosize() to allow callers to specify a minimum number of segments per TSO skb, in case the congestion control module has a different notion of the best floor for TSO skbs for the connection right now. For very low-rate paths or policed connections it can be appropriate to use smaller TSO skbs. Signed-off-by: Van Jacobson <vanj@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-21 00:23:00 -04:00
Neal Cardwell	ed6e7268b9	tcp: allow congestion control module to request TSO skb segment count Add the tso_segs_goal() function in tcp_congestion_ops to allow the congestion control module to specify the number of segments that should be in a TSO skb sent by tcp_write_xmit() and tcp_xmit_retransmit_queue(). The congestion control module can either request a particular number of segments in TSO skb that we transmit, or return 0 if it doesn't care. This allows the upcoming BBR congestion control module to select small TSO skb sizes if the module detects that the bottleneck bandwidth is very low, or that the connection is policed to a low rate. Signed-off-by: Van Jacobson <vanj@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-21 00:23:00 -04:00
Soheil Hassas Yeganeh	d7722e8570	tcp: track application-limited rate samples This commit adds code to track whether the delivery rate represented by each rate_sample was limited by the application. Upon each transmit, we store in the is_app_limited field in the skb a boolean bit indicating whether there is a known "bubble in the pipe": a point in the rate sample interval where the sender was application-limited, and did not transmit even though the cwnd and pacing rate allowed it. This logic marks the flow app-limited on a write if all of the following are true: 1) There is less than 1 MSS of unsent data in the write queue available to transmit. 2) There is no packet in the sender's queues (e.g. in fq or the NIC tx queue). 3) The connection is not limited by cwnd. 4) There are no lost packets to retransmit. The tcp_rate_check_app_limited() code in tcp_rate.c determines whether the connection is application-limited at the moment. If the flow is application-limited, it sets the tp->app_limited field. If the flow is application-limited then that means there is effectively a "bubble" of silence in the pipe now, and this silence will be reflected in a lower bandwidth sample for any rate samples from now until we get an ACK indicating this bubble has exited the pipe: specifically, until we get an ACK for the next packet we transmit. When we send every skb we record in scb->tx.is_app_limited whether the resulting rate sample will be application-limited. The code in tcp_rate_gen() checks to see when it is safe to mark all known application-limited bubbles of silence as having exited the pipe. It does this by checking to see when the delivered count moves past the tp->app_limited marker. At this point it zeroes the tp->app_limited marker, as all known bubbles are out of the pipe. We make room for the tx.is_app_limited bit in the skb by borrowing a bit from the in_flight field used by NV to record the number of bytes in flight. The receive window in the TCP header is 16 bits, and the max receive window scaling shift factor is 14 (RFC 1323). So the max receive window offered by the TCP protocol is 2^(16+14) = 2^30. So we only need 30 bits for the tx.in_flight used by NV. Signed-off-by: Van Jacobson <vanj@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-21 00:23:00 -04:00
Yuchung Cheng	b9f64820fb	tcp: track data delivery rate for a TCP connection This patch generates data delivery rate (throughput) samples on a per-ACK basis. These rate samples can be used by congestion control modules, and specifically will be used by TCP BBR in later patches in this series. Key state: tp->delivered: Tracks the total number of data packets (original or not) delivered so far. This is an already-existing field. tp->delivered_mstamp: the last time tp->delivered was updated. Algorithm: A rate sample is calculated as (d1 - d0)/(t1 - t0) on a per-ACK basis: d1: the current tp->delivered after processing the ACK t1: the current time after processing the ACK d0: the prior tp->delivered when the acked skb was transmitted t0: the prior tp->delivered_mstamp when the acked skb was transmitted When an skb is transmitted, we snapshot d0 and t0 in its control block in tcp_rate_skb_sent(). When an ACK arrives, it may SACK and ACK some skbs. For each SACKed or ACKed skb, tcp_rate_skb_delivered() updates the rate_sample struct to reflect the latest (d0, t0). Finally, tcp_rate_gen() generates a rate sample by storing (d1 - d0) in rs->delivered and (t1 - t0) in rs->interval_us. One caveat: if an skb was sent with no packets in flight, then tp->delivered_mstamp may be either invalid (if the connection is starting) or outdated (if the connection was idle). In that case, we'll re-stamp tp->delivered_mstamp. At first glance it seems t0 should always be the time when an skb was transmitted, but actually this could over-estimate the rate due to phase mismatch between transmit and ACK events. To track the delivery rate, we ensure that if packets are in flight then t0 and and t1 are times at which packets were marked delivered. If the initial and final RTTs are different then one may be corrupted by some sort of noise. The noise we see most often is sending gaps caused by delayed, compressed, or stretched acks. This either affects both RTTs equally or artificially reduces the final RTT. We approach this by recording the info we need to compute the initial RTT (duration of the "send phase" of the window) when we recorded the associated inflight. Then, for a filter to avoid bandwidth overestimates, we generalize the per-sample bandwidth computation from: bw = delivered / ack_phase_rtt to the following: bw = delivered / max(send_phase_rtt, ack_phase_rtt) In large-scale experiments, this filtering approach incorporating send_phase_rtt is effective at avoiding bandwidth overestimates due to ACK compression or stretched ACKs. Signed-off-by: Van Jacobson <vanj@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-21 00:23:00 -04:00
Neal Cardwell	6403389211	tcp: use windowed min filter library for TCP min_rtt estimation Refactor the TCP min_rtt code to reuse the new win_minmax library in lib/win_minmax.c to simplify the TCP code. This is a pure refactor: the functionality is exactly the same. We just moved the windowed min code to make TCP easier to read and maintain, and to allow other parts of the kernel to use the windowed min/max filter code. Signed-off-by: Van Jacobson <vanj@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-21 00:22:59 -04:00
David S. Miller	204dfe1798	Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Johan Hedberg says: ==================== pull request: bluetooth-next 2016-09-19 Here's the main bluetooth-next pull request for the 4.9 kernel. - Added new messages for monitor sockets for better mgmt tracing - Added local name and appearance support in scan response - Added new Qualcomm WCNSS SMD based HCI driver - Minor fixes & cleanup to 802.15.4 code - New USB ID to btusb driver - Added Marvell support to HCI UART driver - Add combined LED trigger for controller power - Other minor fixes here and there Please let me know if there are any issues pulling. Thanks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-20 22:52:50 -04:00
Jamal Hadi Salim	6a5d58b67e	net sched ife action: add 16 bit helpers encoder and checker for 16 bits metadata Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-19 21:55:28 -04:00
Michał Narajowski	c4960ecf2b	Bluetooth: Add support for appearance in scan rsp This patch enables prepending appearance value to scan response data. It also adds support for setting appearance value through mgmt command. If currently advertised instance has apperance flag set it is expired immediately. Signed-off-by: Michał Narajowski <michal.narajowski@codecoup.pl> Signed-off-by: Szymon Janc <szymon.janc@codecoup.pl> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2016-09-19 20:19:34 +02:00
Marcel Holtmann	4037a7747d	Bluetooth: Increase the subsystem minor version number While the subsystem version information are purely informational, increase the minor number due to the addition of user channel and management control monitoring suppport. It is helpful for debugging purposes to see the version numbers change. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2016-09-19 20:19:34 +02:00
Marcel Holtmann	321c6feed2	Bluetooth: Add framework for Extended Controller Information This command is used to retrieve the current state and basic information of a controller. It is typically used right after getting the response to the Read Controller Index List command or an Index Added event (or its extended counterparts). When any of the values in the EIR_Data field changes, the event Extended Controller Information Changed will be used to inform clients about the updated information. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Michał Narajowski <michal.narajowski@codecoup.pl>	2016-09-19 20:19:34 +02:00
Marcel Holtmann	9e8305b39b	Bluetooth: Use numbers for subsystem version string Instead of keeping a version string around, use version and revision numbers and then stringify them for use as module parameter. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2016-09-19 20:19:34 +02:00
Marcel Holtmann	5504c3a310	Bluetooth: Use individual flags for certain management events Instead of hiding everything behind a general managment events flag, introduce indivdual flags that allow fine control over which events are send to a given management channel. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2016-09-19 20:19:34 +02:00

1 2 3 4 5 ...

10108 Commits