linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-27 06:31:52 +00:00

Author	SHA1	Message	Date
David S. Miller	107bc0aa95	Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Johan Hedberg says: ==================== pull request: bluetooth-next 2016-12-08 I didn't miss your "net-next is closed" email, but it did come as a bit of a surprise, and due to time-zone differences I didn't have a chance to react to it until now. We would have had a couple of patches in bluetooth-next that we'd still have wanted to get to 4.10. Out of these the most critical one is the H7/CT2 patch for Bluetooth Security Manager Protocol, something that couldn't be published before the Bluetooth 5.0 specification went public (yesterday). If these really can't go to net-next we'll likely be sending at least this patch through bluetooth.git to net.git for rc1 inclusion. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-08 14:33:17 -05:00
Martin KaFai Lau	17bedab272	bpf: xdp: Allow head adjustment in XDP prog This patch allows XDP prog to extend/remove the packet data at the head (like adding or removing header). It is done by adding a new XDP helper bpf_xdp_adjust_head(). It also renames bpf_helper_changes_skb_data() to bpf_helper_changes_pkt_data() to better reflect that XDP prog does not work on skb. This patch adds one "xdp_adjust_head" bit to bpf_prog for the XDP-capable driver to check if the XDP prog requires bpf_xdp_adjust_head() support. The driver can then decide to error out during XDP_SETUP_PROG. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-08 14:25:13 -05:00
Eric Dumazet	c8c8b12709	udp: under rx pressure, try to condense skbs Under UDP flood, many softirq producers try to add packets to UDP receive queue, and one user thread is burning one cpu trying to dequeue packets as fast as possible. Two parts of the per packet cost are : - copying payload from kernel space to user space, - freeing memory pieces associated with skb. If socket is under pressure, softirq handler(s) can try to pull in skb->head the payload of the packet if it fits. Meaning the softirq handler(s) can free/reuse the page fragment immediately, instead of letting udp_recvmsg() do this hundreds of usec later, possibly from another node. Additional gains : - We reduce skb->truesize and thus can store more packets per SO_RCVBUF - We avoid cache line misses at copyout() time and consume_skb() time, and avoid one put_page() with potential alien freeing on NUMA hosts. This comes at the cost of a copy, bounded to available tail room, which is usually small. (We might have to fix GRO_MAX_HEAD which looks bigger than necessary) This patch gave me about 5 % increase in throughput in my tests. skb_condense() helper could probably used in other contexts. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-08 13:25:07 -05:00
Eric Dumazet	13bfff25c0	net: rfs: add a jump label RFS is not commonly used, so add a jump label to avoid some conditionals in fast path. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-08 13:18:35 -05:00
Simon Horman	7b684884fb	net/sched: cls_flower: Support matching on ICMP type and code Support matching on ICMP type and code. Example usage: tc qdisc add dev eth0 ingress tc filter add dev eth0 protocol ip parent ffff: flower \ indev eth0 ip_proto icmp type 8 code 0 action drop tc filter add dev eth0 protocol ipv6 parent ffff: flower \ indev eth0 ip_proto icmpv6 type 128 code 0 action drop Signed-off-by: Simon Horman <simon.horman@netronome.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-08 11:47:08 -05:00
Simon Horman	972d3876fa	flow dissector: ICMP support Allow dissection of ICMP(V6) type and code. This should only occur if a packet is ICMP(V6) and the dissector has FLOW_DISSECTOR_KEY_ICMP set. There are currently no users of FLOW_DISSECTOR_KEY_ICMP. A follow-up patch will allow FLOW_DISSECTOR_KEY_ICMP to be used by the flower classifier. Signed-off-by: Simon Horman <simon.horman@netronome.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-08 11:45:21 -05:00
Or Gerlitz	faa3ffce78	net/sched: cls_flower: Add support for matching on flags Add UAPI to provide set of flags for matching, where the flags provided from user-space are mapped to flow-dissector flags. The 1st flag allows to match on whether the packet is an IP fragment and corresponds to the FLOW_DIS_IS_FRAGMENT flag. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Paul Blakey <paulb@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-08 11:32:50 -05:00
Zhang Shengju	f91c58d68b	icmp: correct return value of icmp_rcv() Currently, icmp_rcv() always return zero on a packet delivery upcall. To make its behavior more compliant with the way this API should be used, this patch changes this to let it return NET_RX_SUCCESS when the packet is proper handled, and NET_RX_DROP otherwise. Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-08 11:24:23 -05:00
Johan Hedberg	a62da6f14d	Bluetooth: SMP: Add support for H7 crypto function and CT2 auth flag Bluetooth 5.0 introduces a new H7 key generation function that's used when both sides of the pairing set the CT2 authentication flag to 1. Signed-off-by: Johan Hedberg <johan.hedberg@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2016-12-08 07:50:24 +01:00
David S. Miller	5fccd64aa4	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains a large Netfilter update for net-next, to summarise: 1) Add support for stateful objects. This series provides a nf_tables native alternative to the extended accounting infrastructure for nf_tables. Two initial stateful objects are supported: counters and quotas. Objects are identified by a user-defined name, you can fetch and reset them anytime. You can also use a maps to allow fast lookups using any arbitrary key combination. More info at: http://marc.info/?l=netfilter-devel&m=148029128323837&w=2 2) On-demand registration of nf_conntrack and defrag hooks per netns. Register nf_conntrack hooks if we have a stateful ruleset, ie. state-based filtering or NAT. The new nf_conntrack_default_on sysctl enables this from newly created netnamespaces. Default behaviour is not modified. Patches from Florian Westphal. 3) Allocate 4k chunks and then use these for x_tables counter allocation requests, this improves ruleset load time and also datapath ruleset evaluation, patches from Florian Westphal. 4) Add support for ebpf to the existing x_tables bpf extension. From Willem de Bruijn. 5) Update layer 4 checksum if any of the pseudoheader fields is updated. This provides a limited form of 1:1 stateless NAT that make sense in specific scenario, eg. load balancing. 6) Add support to flush sets in nf_tables. This series comes with a new set->ops->deactivate_one() indirection given that we have to walk over the list of set elements, then deactivate them one by one. The existing set->ops->deactivate() performs an element lookup that we don't need. 7) Two patches to avoid cloning packets, thus speed up packet forwarding via nft_fwd from ingress. From Florian Westphal. 8) Two IPVS patches via Simon Horman: Decrement ttl in all modes to prevent infinite loops, patch from Dwip Banerjee. And one minor refactoring from Gao feng. 9) Revisit recent log support for nf_tables netdev families: One patch to ensure that we correctly handle non-ethernet packets. Another patch to add missing logger definition for netdev. Patches from Liping Zhang. 10) Three patches for nft_fib, one to address insufficient register initialization and another to solve incorrect (although harmless) byteswap operation. Moreover update xt_rpfilter and nft_fib to match lbcast packets with zeronet as source, eg. DHCP Discover packets (0.0.0.0 -> 255.255.255.255). Also from Liping Zhang. 11) Built-in DCCP, SCTP and UDPlite conntrack and NAT support, from Davide Caratti. While DCCP is rather hopeless lately, and UDPlite has been broken in many-cast mode for some little time, let's give them a chance by placing them at the same level as other existing protocols. Thus, users don't explicitly have to modprobe support for this and NAT rules work for them. Some people point to the lack of support in SOHO Linux-based routers that make deployment of new protocols harder. I guess other middleboxes outthere on the Internet are also to blame. Anyway, let's see if this has any impact in the midrun. 12) Skip software SCTP software checksum calculation if the NIC comes with SCTP checksum offload support. From Davide Caratti. 13) Initial core factoring to prepare conversion to hook array. Three patches from Aaron Conole. 14) Gao Feng made a wrong conversion to switch in the xt_multiport extension in a patch coming in the previous batch. Fix it in this batch. 15) Get vmalloc call in sync with kmalloc flags to avoid a warning and likely OOM killer intervention from x_tables. From Marcelo Ricardo Leitner. 16) Update Arturo Borrero's email address in all source code headers. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-07 19:16:46 -05:00
Pablo Neira Ayuso	73c25fb139	netfilter: nft_quota: allow to restore consumed quota Allow to restore consumed quota, this is useful to restore the quota state across reboots. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-07 14:40:53 +01:00
Willem de Bruijn	2c16d60332	netfilter: xt_bpf: support ebpf Add support for attaching an eBPF object by file descriptor. The iptables binary can be called with a path to an elf object or a pinned bpf object. Also pass the mode and path to the kernel to be able to return it later for iptables dump and save. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-07 13:32:35 +01:00
Marcelo Ricardo Leitner	5bad87348c	netfilter: x_tables: avoid warn and OOM killer on vmalloc call Andrey Konovalov reported that this vmalloc call is based on an userspace request and that it's spewing traces, which may flood the logs and cause DoS if abused. Florian Westphal also mentioned that this call should not trigger OOM killer. This patch brings the vmalloc call in sync to kmalloc and disables the warn trace on allocation failure and also disable OOM killer invocation. Note, however, that under such stress situation, other places may trigger OOM killer invocation. Reported-by: Andrey Konovalov <andreyknvl@google.com> Cc: Florian Westphal <fw@strlen.de> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-07 13:31:41 +01:00
Pablo Neira Ayuso	8411b6442e	netfilter: nf_tables: support for set flushing This patch adds support for set flushing, that consists of walking over the set elements if the NFTA_SET_ELEM_LIST_ELEMENTS attribute is set. This patch requires the following changes: 1) Add set->ops->deactivate_one() operation: This allows us to deactivate an element from the set element walk path, given we can skip the lookup that happens in ->deactivate(). 2) Add a new nft_trans_alloc_gfp() function since we need to allocate transactions using GFP_ATOMIC given the set walk path happens with held rcu_read_lock. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-07 13:31:40 +01:00
Pablo Neira Ayuso	37df5301a3	netfilter: nft_set: introduce nft_{hash, rbtree}_deactivate_one() This new function allows us to deactivate one single element, this is required by the set flush command that comes in a follow up patch. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-07 13:31:02 +01:00
Pablo Neira Ayuso	1a37ef769d	netfilter: nf_tables: constify struct nft_ctx * parameter in nft_trans_alloc() Context is not modified by nft_trans_alloc(), so constify it. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-07 13:22:51 +01:00
Davide Caratti	3189a290f9	netfilter: nat: skip checksum on offload SCTP packets SCTP GSO and hardware can do CRC32c computation after netfilter processing, so we can avoid calling sctp_compute_checksum() on skb if skb->ip_summed is equal to CHECKSUM_PARTIAL. Moreover, set skb->ip_summed to CHECKSUM_NONE when the NAT code computes the CRC, to prevent offloaders from computing it again (on ixgbe this resulted in a transmission with wrong L4 checksum). Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-07 13:22:50 +01:00
Liping Zhang	3b760dcb0f	netfilter: rpfilter: bypass ipv4 lbcast packets with zeronet source Otherwise, DHCP Discover packets(0.0.0.0->255.255.255.255) may be dropped incorrectly. Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-07 13:22:50 +01:00
Pablo Neira Ayuso	a9fea2a3c3	netfilter: nf_tables: allow to filter stateful object dumps by type This patch adds the netlink code to filter out dump of stateful objects, through the NFTA_OBJ_TYPE netlink attribute. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-07 13:22:49 +01:00
Pablo Neira Ayuso	63aea29060	netfilter: nft_objref: support for stateful object maps This patch allows us to refer to stateful object dictionaries, the source register indicates the key data to be used to look up for the corresponding state object. We can refer to these maps through names or, alternatively, the map transaction id. This allows us to refer to both anonymous and named maps. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-07 13:22:48 +01:00
Pablo Neira Ayuso	8aeff920dc	netfilter: nf_tables: add stateful object reference to set elements This patch allows you to refer to stateful objects from set elements. This provides the infrastructure to create maps where the right hand side of the mapping is a stateful object. This allows us to build dictionaries of stateful objects, that you can use to perform fast lookups using any arbitrary key combination. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-07 13:22:47 +01:00
Pablo Neira Ayuso	1896531710	netfilter: nft_quota: add depleted flag for objects Notify on depleted quota objects. The NFT_QUOTA_F_DEPLETED flag indicates we have reached overquota. Add pointer to table from nft_object, so we can use it when sending the depletion notification to userspace. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-07 13:22:12 +01:00
Pablo Neira Ayuso	2599e98934	netfilter: nf_tables: notify internal updates of stateful objects Introduce nf_tables_obj_notify() to notify internal state changes in stateful objects. This is used by the quota object to report depletion in a follow up patch. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-07 12:57:20 +01:00
Pablo Neira Ayuso	43da04a593	netfilter: nf_tables: atomic dump and reset for stateful objects This patch adds a new NFT_MSG_GETOBJ_RESET command perform an atomic dump-and-reset of the stateful object. This also comes with add support for atomic dump and reset for counter and quota objects. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-07 12:56:57 +01:00
Pablo Neira Ayuso	795595f68d	netfilter: nft_quota: dump consumed quota Add a new attribute NFTA_QUOTA_CONSUMED that displays the amount of quota that has been already consumed. This allows us to restore the internal state of the quota object between reboots as well as to monitor how wasted it is. This patch changes the logic to account for the consumed bytes, instead of the bytes that remain to be consumed. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-07 12:54:22 +01:00
Marc Kleine-Budde	332b05ca7a	can: raw: raw_setsockopt: limit number of can_filter that can be set This patch adds a check to limit the number of can_filters that can be set via setsockopt on CAN_RAW sockets. Otherwise allocations > MAX_ORDER are not prevented resulting in a warning. Reference: https://lkml.org/lkml/2016/12/2/230 Reported-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Andrey Konovalov <andreyknvl@google.com> Cc: linux-stable <stable@vger.kernel.org> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>	2016-12-07 10:45:57 +01:00
David S. Miller	c63d352f05	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	2016-12-06 21:33:19 -05:00
Pablo Neira Ayuso	c97d22e68b	netfilter: nf_tables: add stateful object reference expression This new expression allows us to refer to existing stateful objects from rules. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-06 21:48:25 +01:00
Pablo Neira Ayuso	173705d9a2	netfilter: nft_quota: add stateful object type Register a new quota stateful object type into the new stateful object infrastructure. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-06 21:48:24 +01:00
Pablo Neira Ayuso	b1ce0ced10	netfilter: nft_counter: add stateful object type Register a new percpu counter stateful object type into the stateful object infrastructure. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-06 21:48:23 +01:00
Pablo Neira Ayuso	e50092404c	netfilter: nf_tables: add stateful objects This patch augments nf_tables to support stateful objects. This new infrastructure allows you to create, dump and delete stateful objects, that are identified by a user-defined name. This patch adds the generic infrastructure, follow up patches add support for two stateful objects: counters and quotas. This patch provides a native infrastructure for nf_tables to replace nfacct, the extended accounting infrastructure for iptables. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-06 21:48:22 +01:00
Florian Westphal	3bf3276119	netfilter: add and use nf_fwd_netdev_egress ... so we can use current skb instead of working with a clone. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-06 21:48:22 +01:00
Gao Feng	1ed9887ee3	netfilter: xt_multiport: Fix wrong unmatch result with multiple ports I lost one test case in the last commit for xt_multiport. For example, the rule is "-m multiport --dports 22,80,443". When first port is unmatched and the second is matched, the curent codes could not return the right result. It would return false directly when the first port is unmatched. Fixes: `dd2602d00f` ("netfilter: xt_multiport: Use switch case instead of multiple condition checks") Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-06 21:48:20 +01:00
Pablo Neira Ayuso	1814096980	netfilter: nft_payload: layer 4 checksum adjustment for pseudoheader fields This patch adds a new flag that signals the kernel to update layer 4 checksum if the packet field belongs to the layer 4 pseudoheader. This implicitly provides stateless NAT 1:1 that is useful under very specific usecases. Since rules mangling layer 3 fields that are part of the pseudoheader may potentially convey any layer 4 packet, we have to deal with the layer 4 checksum adjustment using protocol specific code. This patch adds support for TCP, UDP and ICMPv6, since they include the pseudoheader in the layer 4 checksum calculation. ICMP doesn't, so we can skip it. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-06 21:47:54 +01:00
Liping Zhang	e0ffdbc78d	netfilter: nft_fib_ipv4: initialize dest to zero Otherwise, if fib lookup fail, dest will be filled with garbage value, so reverse path filtering will not work properly: # nft add rule x prerouting fib saddr oif eq 0 drop Fixes: `f6d0cbcf09` ("netfilter: nf_tables: add fib expression") Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-06 21:42:21 +01:00
Liping Zhang	11583438b7	netfilter: nft_fib: convert htonl to ntohl properly Acctually ntohl and htonl are identical, so this doesn't affect anything, but it is conceptually wrong. Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-06 21:42:20 +01:00
Florian Westphal	ae0ac0ed6f	netfilter: x_tables: pack percpu counter allocations instead of allocating each xt_counter individually, allocate 4k chunks and then use these for counter allocation requests. This should speed up rule evaluation by increasing data locality, also speeds up ruleset loading because we reduce calls to the percpu allocator. As Eric points out we can't use PAGE_SIZE, page_allocator would fail on arches with 64k page size. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-06 21:42:19 +01:00
Florian Westphal	f28e15bace	netfilter: x_tables: pass xt_counters struct to counter allocator Keeps some noise away from a followup patch. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-06 21:42:18 +01:00
Florian Westphal	4d31eef517	netfilter: x_tables: pass xt_counters struct instead of packet counter On SMP we overload the packet counter (unsigned long) to contain percpu offset. Hide this from callers and pass xt_counters address instead. Preparation patch to allocate the percpu counters in page-sized batch chunks. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-06 21:42:17 +01:00
Aaron Conole	679972f3be	netfilter: convert while loops to for loops This is to facilitate converting from a singly-linked list to an array of elements. Signed-off-by: Aaron Conole <aconole@bytheb.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-06 21:42:16 +01:00
Aaron Conole	0aa8c57a04	netfilter: introduce accessor functions for hook entries This allows easier future refactoring. Signed-off-by: Aaron Conole <aconole@bytheb.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-06 21:42:15 +01:00
Florian Westphal	834184b1f3	netfilter: defrag: only register defrag functionality if needed nf_defrag modules for ipv4 and ipv6 export an empty stub function. Any module that needs the defragmentation hooks registered simply 'calls' this empty function to create a phony module dependency -- modprobe will then load the defrag module too. This extends netfilter ipv4/ipv6 defragmentation modules to delay the hook registration until the functionality is requested within a network namespace instead of module load time for all namespaces. Hooks are only un-registered on module unload or when a namespace that used such defrag functionality exits. We have to use struct net for this as the register hooks can be called before netns initialization here from the ipv4/ipv6 conntrack module init path. There is no unregister functionality support, defrag will always be active once it was requested inside a net namespace. The reason is that defrag has impact on nft and iptables rulesets (without defrag we might see framents). Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-06 21:42:00 +01:00
Fabian Frederick	3eb15f2828	sunrpc: use DEFINE_SPINLOCK() Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2016-12-06 14:18:30 -05:00
Florian Westphal	343dfaa198	Revert "dctcp: update cwnd on congestion event" Neal Cardwell says: If I am reading the code correctly, then I would have two concerns: 1) Has that been tested? That seems like an extremely dramatic decrease in cwnd. For example, if the cwnd is 80, and there are 40 ACKs, and half the ACKs are ECE marked, then my back-of-the-envelope calculations seem to suggest that after just 11 ACKs the cwnd would be down to a minimal value of 2 [..] 2) That seems to contradict another passage in the draft [..] where it sazs: Just as specified in [RFC3168], DCTCP does not react to congestion indications more than once for every window of data. Neal is right. Fortunately we don't have to complicate this by testing vs. current rtt estimate, we can just revert the patch. Normal stack already handles this for us: receiving ACKs with ECE set causes a call to tcp_enter_cwr(), from there on the ssthresh gets adjusted and prr will take care of cwnd adjustment. Fixes: `4780566784` ("dctcp: update cwnd on congestion event") Cc: Neal Cardwell <ncardwell@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-06 11:34:24 -05:00
Marcelo Ricardo Leitner	dcb17d22e1	tcp: warn on bogus MSS and try to amend it There have been some reports lately about TCP connection stalls caused by NIC drivers that aren't setting gso_size on aggregated packets on rx path. This causes TCP to assume that the MSS is actually the size of the aggregated packet, which is invalid. Although the proper fix is to be done at each driver, it's often hard and cumbersome for one to debug, come to such root cause and report/fix it. This patch amends this situation in two ways. First, it adds a warning on when this situation occurs, so it gives a hint to those trying to debug this. It also limit the maximum probed MSS to the adverised MSS, as it should never be any higher than that. The result is that the connection may not have the best performance ever but it shouldn't stall, and the admin will have a hint on what to look for. Tested with virtio by forcing gso_size to 0. v2: updated msg per David's suggestion v3: use skb_iif to find the interface and also log its name, per Eric Dumazet's suggestion. As the skb may be backlogged and the interface gone by then, we need to check if the number still has a meaning. v4: use helper tcp_gro_dev_warn() and avoid pr_warn_once inside __once, per David's suggestion Cc: Jonathan Maxwell <jmaxwell37@gmail.com> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-06 11:01:19 -05:00
Eric Dumazet	a297569fe0	net/udp: do not touch skb->peeked unless really needed In UDP recvmsg() path we currently access 3 cache lines from an skb while holding receive queue lock, plus another one if packet is dequeued, since we need to change skb->next->prev 1st cache line (contains ->next/prev pointers, offsets 0x00 and 0x08) 2nd cache line (skb->len & skb->peeked, offsets 0x80 and 0x8e) 3rd cache line (skb->truesize/users, offsets 0xe0 and 0xe4) skb->peeked is only needed to make sure 0-length packets are properly handled while MSG_PEEK is operated. I had first the intent to remove skb->peeked but the "MSG_PEEK at non-zero offset" support added by Sam Kumar makes this not possible. This patch avoids one cache line miss during the locked section, when skb->len and skb->peeked do not have to be read. It also avoids the skb_set_peeked() cost for non empty UDP datagrams. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-06 10:41:49 -05:00
Herbert Xu	ed5d7788a9	netlink: Do not schedule work from sk_destruct It is wrong to schedule a work from sk_destruct using the socket as the memory reserve because the socket will be freed immediately after the return from sk_destruct. Instead we should do the deferral prior to sk_free. This patch does just that. Fixes: `707693c8a4` ("netlink: Call cb->done from a worker thread") Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Tested-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-05 19:43:42 -05:00
Daniel Borkmann	7bd509e311	bpf: add prog_digest and expose it via fdinfo/netlink When loading a BPF program via bpf(2), calculate the digest over the program's instruction stream and store it in struct bpf_prog's digest member. This is done at a point in time before any instructions are rewritten by the verifier. Any unstable map file descriptor number part of the imm field will be zeroed for the hash. fdinfo example output for progs: # cat /proc/1590/fdinfo/5 pos: 0 flags: 02000002 mnt_id: 11 prog_type: 1 prog_jited: 1 prog_digest: b27e8b06da22707513aa97363dfb11c7c3675d28 memlock: 4096 When programs are pinned and retrieved by an ELF loader, the loader can check the program's digest through fdinfo and compare it against one that was generated over the ELF file's program section to see if the program needs to be reloaded. Furthermore, this can also be exposed through other means such as netlink in case of a tc cls/act dump (or xdp in future), but also through tracepoints or other facilities to identify the program. Other than that, the digest can also serve as a base name for the work in progress kallsyms support of programs. The digest doesn't depend/select the crypto layer, since we need to keep dependencies to a minimum. iproute2 will get support for this facility. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-05 15:33:11 -05:00
Daniel Borkmann	8d829bdb97	bpf, cls: consolidate prog deletion path Commit `18cdb37ebf` ("net: sched: do not use tcf_proto 'tp' argument from call_rcu") removed the last usage of tp from cls_bpf_delete_prog(), so also remove it from the function as argument to not give a wrong impression. tp is illegal to access from this callback, since it could already have been freed. Refactor the deletion code a bit, so that cls_bpf_destroy() can call into the same code for prog deletion as cls_bpf_delete() op, instead of having it unnecessarily duplicated. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-05 15:33:10 -05:00
Daniel Borkmann	1afaf661b2	bpf: remove type arg from __is_valid_{,xdp_}access Commit `d691f9e8d4` ("bpf: allow programs to write to certain skb fields") pushed access type check outside of __is_valid_access() to have different restrictions for socket filters and tc programs. type is thus not used anymore within __is_valid_access() and should be removed as a function argument. Same for __is_valid_xdp_access() introduced by `6a773a15a1` ("bpf: add XDP prog type for early driver filter"). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-05 15:33:10 -05:00
Eric Dumazet	1c0d32fde5	net_sched: gen_estimator: complete rewrite of rate estimators 1) Old code was hard to maintain, due to complex lock chains. (We probably will be able to remove some kfree_rcu() in callers) 2) Using a single timer to update all estimators does not scale. 3) Code was buggy on 32bit kernel (WRITE_ONCE() on 64bit quantity is not supposed to work well) In this rewrite : - I removed the RB tree that had to be scanned in gen_estimator_active(). qdisc dumps should be much faster. - Each estimator has its own timer. - Estimations are maintained in net_rate_estimator structure, instead of dirtying the qdisc. Minor, but part of the simplification. - Reading the estimator uses RCU and a seqcount to provide proper support for 32bit kernels. - We reduce memory need when estimators are not used, since we store a pointer, instead of the bytes/packets counters. - xt_rateest_mt() no longer has to grab a spinlock. (In the future, xt_rateest_tg() could be switched to per cpu counters) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-05 15:21:59 -05:00
Hadar Hen Zion	a6e1693129	net/sched: cls_flower: Set the filter Hardware device for all use-cases Check if the returned device from tcf_exts_get_dev function supports tc offload and in case the rule can't be offloaded, set the filter hw_dev parameter to the original device given by the user. The filter hw_device parameter should always be set by fl_hw_replace_filter function, since this pointer is used by dump stats and destroy filter for each flower rule (offloaded or not). Fixes: `7091d8c705` ('net/sched: cls_flower: Add offload support using egress Hardware device') Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Reported-by: Simon Horman <horms@verge.net.au> Tested-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-05 15:06:58 -05:00
Erik Nordmark	96d5822c1d	ipv6: Allow IPv4-mapped address as next-hop Made kernel accept IPv6 routes with IPv4-mapped address as next-hop. It is possible to configure IP interfaces with IPv4-mapped addresses, and one can add IPv6 routes for IPv4-mapped destinations/prefixes, yet prior to this fix the kernel returned an EINVAL when attempting to add an IPv6 route with an IPv4-mapped address as a nexthop/gateway. RFC 4798 (a proposed standard RFC) uses IPv4-mapped addresses as nexthops, thus in order to support that type of address configuration the kernel needs to allow IPv4-mapped addresses as nexthops. Signed-off-by: Erik Nordmark <nordmark@arista.com> Signed-off-by: Bob Gilligan <gilligan@arista.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-05 14:52:05 -05:00
Pan Bian	c79e167c3c	net: caif: remove ineffective check The check of the return value of sock_register() is ineffective. "if(!err)" seems to be a typo. It is better to propagate the error code to the callers of caif_sktinit_module(). This patch removes the check statment and directly returns the result of sock_register(). Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=188751 Signed-off-by: Pan Bian <bianpan2016@163.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-05 14:48:48 -05:00
Al Viro	0b62fca262	switch getfrag callbacks to ..._full() primitives Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-12-05 14:34:32 -05:00
Al Viro	cbbd26b8b1	[iov_iter] new primitives - copy_from_iter_full() and friends copy_from_iter_full(), copy_from_iter_full_nocache() and csum_and_copy_from_iter_full() - counterparts of copy_from_iter() et.al., advancing iterator only in case of successful full copy and returning whether it had been successful or not. Convert some obvious users. NOTE - do not blindly assume that something is a good candidate for those unless you are sure that not advancing iov_iter in failure case is the right thing in this case. Anything that does short read/short write kind of stuff (or is in a loop, etc.) is unlikely to be a good one. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-12-05 14:33:36 -05:00
David S. Miller	c3543688ab	Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Johan Hedberg says: ==================== pull request: bluetooth-next 2016-12-03 Here's a set of Bluetooth & 802.15.4 patches for net-next (i.e. 4.10 kernel): - Fix for a potential NULL deref in the ieee802154 netlink code - Fix for the ED values of the at86rf2xx driver - Documentation updates to ieee802154 - Cleanups to u8 vs __u8 usage - Timer API usage cleanups in HCI drivers Please let me know if there are any issues pulling. Thanks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-05 13:37:28 -05:00
Kees Cook	0eab121ef8	net: ping: check minimum size on ICMP header length Prior to commit `c0371da604` ("put iov_iter into msghdr") in v3.19, there was no check that the iovec contained enough bytes for an ICMP header, and the read loop would walk across neighboring stack contents. Since the iov_iter conversion, bad arguments are noticed, but the returned error is EFAULT. Returning EINVAL is a clearer error and also solves the problem prior to v3.19. This was found using trinity with KASAN on v3.18: BUG: KASAN: stack-out-of-bounds in memcpy_fromiovec+0x60/0x114 at addr ffffffc071077da0 Read of size 8 by task trinity-c2/9623 page:ffffffbe034b9a08 count:0 mapcount:0 mapping: (null) index:0x0 flags: 0x0() page dumped because: kasan: bad access detected CPU: 0 PID: 9623 Comm: trinity-c2 Tainted: G BU 3.18.0-dirty #15 Hardware name: Google Tegra210 Smaug Rev 1,3+ (DT) Call trace: [<ffffffc000209c98>] dump_backtrace+0x0/0x1ac arch/arm64/kernel/traps.c:90 [<ffffffc000209e54>] show_stack+0x10/0x1c arch/arm64/kernel/traps.c:171 [< inline >] __dump_stack lib/dump_stack.c:15 [<ffffffc000f18dc4>] dump_stack+0x7c/0xd0 lib/dump_stack.c:50 [< inline >] print_address_description mm/kasan/report.c:147 [< inline >] kasan_report_error mm/kasan/report.c:236 [<ffffffc000373dcc>] kasan_report+0x380/0x4b8 mm/kasan/report.c:259 [< inline >] check_memory_region mm/kasan/kasan.c:264 [<ffffffc00037352c>] __asan_load8+0x20/0x70 mm/kasan/kasan.c:507 [<ffffffc0005b9624>] memcpy_fromiovec+0x5c/0x114 lib/iovec.c:15 [< inline >] memcpy_from_msg include/linux/skbuff.h:2667 [<ffffffc000ddeba0>] ping_common_sendmsg+0x50/0x108 net/ipv4/ping.c:674 [<ffffffc000dded30>] ping_v4_sendmsg+0xd8/0x698 net/ipv4/ping.c:714 [<ffffffc000dc91dc>] inet_sendmsg+0xe0/0x12c net/ipv4/af_inet.c:749 [< inline >] __sock_sendmsg_nosec net/socket.c:624 [< inline >] __sock_sendmsg net/socket.c:632 [<ffffffc000cab61c>] sock_sendmsg+0x124/0x164 net/socket.c:643 [< inline >] SYSC_sendto net/socket.c:1797 [<ffffffc000cad270>] SyS_sendto+0x178/0x1d8 net/socket.c:1761 CVE-2016-8399 Reported-by: Qidan He <i@flanker017.me> Fixes: `c319b4d76b` ("net: ipv4: add IPPROTO_ICMP socket kind") Cc: stable@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-05 13:35:38 -05:00
Eric Dumazet	7aa5470c2c	tcp: tsq: move tsq_flags close to sk_wmem_alloc tsq_flags being in the same cache line than sk_wmem_alloc makes a lot of sense. Both fields are changed from tcp_wfree() and more generally by various TSQ related functions. Prior patch made room in struct sock and added sk_tsq_flags, this patch deletes tsq_flags from struct tcp_sock. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-05 13:32:24 -05:00
Eric Dumazet	12a59abc22	tcp: tcp_mtu_probe() is likely to exit early Adding a likely() in tcp_mtu_probe() moves its code which used to be inlined in front of tcp_write_xmit() We still have a cache line miss to access icsk->icsk_mtup.enabled, we will probably have to reorganize fields to help data locality. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-05 13:32:23 -05:00
Eric Dumazet	75eefc6c59	tcp: tsq: add a shortcut in tcp_small_queue_check() Always allow the two first skbs in write queue to be sent, regardless of sk_wmem_alloc/sk_pacing_rate values. This helps a lot in situations where TX completions are delayed either because of driver latencies or softirq latencies. Test is done with no cache line misses. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-05 13:32:23 -05:00
Eric Dumazet	a9b204d156	tcp: tsq: avoid one atomic in tcp_wfree() Under high load, tcp_wfree() has an atomic operation trying to schedule a tasklet over and over. We can schedule it only if our per cpu list was empty. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-05 13:32:23 -05:00
Eric Dumazet	b223feb9de	tcp: tsq: add shortcut in tcp_tasklet_func() Under high stress, I've seen tcp_tasklet_func() consuming ~700 usec, handling ~150 tcp sockets. By setting TCP_TSQ_DEFERRED in tcp_wfree(), we give a chance for other cpus/threads entering tcp_write_xmit() to grab it, allowing tcp_tasklet_func() to skip sockets that already did an xmit cycle. In the future, we might give to ACK processing an increased budget to reduce even more tcp_tasklet_func() amount of work. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-05 13:32:22 -05:00
Eric Dumazet	408f0a6c21	tcp: tsq: remove one locked operation in tcp_wfree() Instead of atomically clear TSQ_THROTTLED and atomically set TSQ_QUEUED bits, use one cmpxchg() to perform a single locked operation. Since the following patch will also set TCP_TSQ_DEFERRED here, this cmpxchg() will make this addition free. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-05 13:32:22 -05:00
Eric Dumazet	40fc3423b9	tcp: tsq: add tsq_flags / tsq_enum This is a cleanup, to ease code review of following patches. Old 'enum tsq_flags' is renamed, and a new enumeration is added with the flags used in cmpxchg() operations as opposed to single bit operations. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-05 13:32:22 -05:00
Pan Bian	b59589635f	net: bridge: set error code on failure Function br_sysfs_addbr() does not set error code when the call kobject_create_and_add() returns a NULL pointer. It may be better to return "-ENOMEM" when kobject_create_and_add() fails. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=188781 Signed-off-by: Pan Bian <bianpan2016@163.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-05 13:26:22 -05:00
Suraj Deshmukh	14dd3e1b97	net: af_mpls.c add space before open parenthesis Adding space after switch keyword before open parenthesis for readability purpose. This patch fixes the checkpatch.pl warning: space required before the open parenthesis '(' Signed-off-by: Suraj Deshmukh <surajssd009005@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-05 13:25:55 -05:00
Alexander Duyck	a52ca62c4a	ipv4: Drop suffix update from resize code It has been reported that update_suffix can be expensive when it is called on a large node in which most of the suffix lengths are the same. The time required to add 200K entries had increased from around 3 seconds to almost 49 seconds. In order to address this we need to move the code for updating the suffix out of resize and instead just have it handled in the cases where we are pushing a node that increases the suffix length, or will decrease the suffix length. Fixes: `5405afd1a3` ("fib_trie: Add tracking value for suffix length") Reported-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Reviewed-by: Robert Shearman <rshearma@brocade.com> Tested-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-05 13:15:58 -05:00
Alexander Duyck	1a239173cc	ipv4: Drop leaf from suffix pull/push functions It wasn't necessary to pass a leaf in when doing the suffix updates so just drop it. Instead just pass the suffix and work with that. Since we dropped the leaf there is no need to include that in the name so the names are updated to node_push_suffix and node_pull_suffix. Finally I noticed that the logic for pulling the suffix length back actually had some issues. Specifically it would stop prematurely if there was a longer suffix, but it was not as long as the original suffix. I updated the code to address that in node_pull_suffix. Fixes: `5405afd1a3` ("fib_trie: Add tracking value for suffix length") Suggested-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Reviewed-by: Robert Shearman <rshearma@brocade.com> Tested-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-05 13:15:58 -05:00
Florian Westphal	481fa37347	netfilter: conntrack: add nf_conntrack_default_on sysctl This switch (default on) can be used to disable automatic registration of connection tracking functionality in newly created network namespaces. This means that when net namespace goes down (or the tracker protocol module is unloaded) we might have to unregister the hooks. We can either add another per-netns variable that tells if the hooks got registered by default, or, alternatively, just call the protocol _put() function and have the callee deal with a possible 'extra' put() operation that doesn't pair with a get() one. This uses the latter approach, i.e. a put() without a get has no effect. Conntrack is still enabled automatically regardless of the new sysctl setting if the new net namespace requires connection tracking, e.g. when NAT rules are created. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-04 21:17:25 +01:00
Florian Westphal	0c66dc1ea3	netfilter: conntrack: register hooks in netns when needed by ruleset This makes use of nf_ct_netns_get/put added in previous patch. We add get/put functions to nf_conntrack_l3proto structure, ipv4 and ipv6 then implement use-count to track how many users (nft or xtables modules) have a dependency on ipv4 and/or ipv6 connection tracking functionality. When count reaches zero, the hooks are unregistered. This delays activation of connection tracking inside a namespace until stateful firewall rule or nat rule gets added. This patch breaks backwards compatibility in the sense that connection tracking won't be active anymore when the protocol tracker module is loaded. This breaks e.g. setups that ctnetlink for flow accounting and the like, without any '-m conntrack' packet filter rules. Followup patch restores old behavour and makes new delayed scheme optional via sysctl. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-04 21:17:24 +01:00
Florian Westphal	20afd42397	netfilter: nf_tables: add conntrack dependencies for nat/masq/redir expressions so that conntrack core will add the needed hooks in this namespace. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-04 21:17:16 +01:00
Florian Westphal	a357b3f80b	netfilter: nat: add dependencies on conntrack module MASQUERADE, S/DNAT and REDIRECT already call functions that depend on the conntrack module. However, since the conntrack hooks are now registered in a lazy fashion (i.e., only when needed) a symbol reference is not enough. Thus, when something is added to a nat table, make sure that it will see packets by calling nf_ct_netns_get() which will register the conntrack hooks in the current netns. An alternative would be to add these dependencies to the NAT table. However, that has problems when using non-modular builds -- we might register e.g. ipv6 conntrack before its initcall has run, leading to NULL deref crashes since its per-netns storage has not yet been allocated. Adding the dependency in the modules instead has the advantage that nat table also does not register its hooks until rules are added. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-04 21:16:51 +01:00
Florian Westphal	ecb2421b5d	netfilter: add and use nf_ct_netns_get/put currently aliased to try_module_get/_put. Will be changed in next patch when we add functions to make use of ->net argument to store usercount per l3proto tracker. This is needed to avoid registering the conntrack hooks in all netns and later only enable connection tracking in those that need conntrack. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-04 21:16:50 +01:00
Florian Westphal	a379854d91	netfilter: conntrack: remove unused init_net hook since `adf0516845` ("netfilter: remove ip_conntrack* sysctl compat code") the only user (ipv4 tracker) sets this to an empty stub function. After this change nf_ct_l3proto_pernet_register() is also empty, but this will change in a followup patch to add conditional register of the hooks. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-04 21:16:41 +01:00
Davide Caratti	9b91c96c5d	netfilter: conntrack: built-in support for UDPlite CONFIG_NF_CT_PROTO_UDPLITE is no more a tristate. When set to y, connection tracking support for UDPlite protocol is built-in into nf_conntrack.ko. footprint test: $ ls -l net/netfilter/nf_conntrack{_proto_udplite,}.ko \ net/ipv4/netfilter/nf_conntrack_ipv4.ko \ net/ipv6/netfilter/nf_conntrack_ipv6.ko (builtin)\|\| udplite\| ipv4 \| ipv6 \|nf_conntrack ---------++--------+--------+--------+-------------- none \|\| 432538 \| 828755 \| 828676 \| 6141434 UDPlite \|\| - \| 829649 \| 829362 \| 6498204 Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-04 20:57:36 +01:00
Davide Caratti	a85406afeb	netfilter: conntrack: built-in support for SCTP CONFIG_NF_CT_PROTO_SCTP is no more a tristate. When set to y, connection tracking support for SCTP protocol is built-in into nf_conntrack.ko. footprint test: $ ls -l net/netfilter/nf_conntrack{_proto_sctp,}.ko \ net/ipv4/netfilter/nf_conntrack_ipv4.ko \ net/ipv6/netfilter/nf_conntrack_ipv6.ko (builtin)\|\| sctp \| ipv4 \| ipv6 \| nf_conntrack ---------++--------+--------+--------+-------------- none \|\| 498243 \| 828755 \| 828676 \| 6141434 SCTP \|\| - \| 829254 \| 829175 \| 6547872 Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-04 20:55:37 +01:00
Davide Caratti	c51d39010a	netfilter: conntrack: built-in support for DCCP CONFIG_NF_CT_PROTO_DCCP is no more a tristate. When set to y, connection tracking support for DCCP protocol is built-in into nf_conntrack.ko. footprint test: $ ls -l net/netfilter/nf_conntrack{_proto_dccp,}.ko \ net/ipv4/netfilter/nf_conntrack_ipv4.ko \ net/ipv6/netfilter/nf_conntrack_ipv6.ko (builtin)\|\| dccp \| ipv4 \| ipv6 \| nf_conntrack ---------++--------+--------+--------+-------------- none \|\| 469140 \| 828755 \| 828676 \| 6141434 DCCP \|\| - \| 830566 \| 829935 \| 6533526 Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-04 20:53:15 +01:00
Pablo Neira Ayuso	f6b3ef5e38	Merge tag 'ipvs-for-v4.10' of https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next Simon Horman says: ==================== IPVS Updates for v4.10 please consider these enhancements to the IPVS for v4.10. * Decrement the IP ttl in all the modes in order to prevent infinite route loops. Thanks to Dwip Banerjee. * Use IS_ERR_OR_NULL macro. Clean-up from Gao Feng. ==================== Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-04 20:46:16 +01:00
Liping Zhang	a7647080d3	netfilter: nfnetlink_log: add "nf-logger-5-1" module alias name So we can autoload nfnetlink_log.ko when the user adding nft log group X rule in netdev family. Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-04 20:45:34 +01:00
Liping Zhang	673ab46f34	netfilter: nf_log: do not assume ethernet header in netdev family In netdev family, we will handle non ethernet packets, so using eth_hdr(skb)->h_proto is incorrect. Meanwhile, we can use socket(AF_PACKET...) to sending packets, so skb->protocol is not always set in bridge family. Add an extra parameter into nf_log_l2packet to solve this issue. Fixes: `1fddf4bad0` ("netfilter: nf_log: add packet logging for netdev family") Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-04 20:45:33 +01:00
Davide Caratti	b8ad652f97	netfilter: built-in NAT support for UDPlite CONFIG_NF_NAT_PROTO_UDPLITE is no more a tristate. When set to y, NAT support for UDPlite protocol is built-in into nf_nat.ko. footprint test: (nf_nat_proto_) \|udplite \|\| nf_nat --------------------------+--------++-------- no builtin \| 408048 \|\| 2241312 UDPLITE builtin \| - \|\| 2577256 Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-04 20:45:32 +01:00
Davide Caratti	7a2dd28c70	netfilter: built-in NAT support for SCTP CONFIG_NF_NAT_PROTO_SCTP is no more a tristate. When set to y, NAT support for SCTP protocol is built-in into nf_nat.ko. footprint test: (nf_nat_proto_) \| sctp \|\| nf_nat --------------------------+--------++-------- no builtin \| 428344 \|\| 2241312 SCTP builtin \| - \|\| 2597032 Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-04 20:45:31 +01:00
Davide Caratti	0c4e966eaf	netfilter: built-in NAT support for DCCP CONFIG_NF_NAT_PROTO_DCCP is no more a tristate. When set to y, NAT support for DCCP protocol is built-in into nf_nat.ko. footprint test: (nf_nat_proto_) \| dccp \|\| nf_nat --------------------------+--------++-------- no builtin \| 409800 \|\| 2241312 DCCP builtin \| - \|\| 2578968 Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-04 20:45:30 +01:00
Arturo Borrero Gonzalez	cd72751468	netfilter: update Arturo Borrero Gonzalez email address The email address has changed, let's update the copyright statements. Signed-off-by: Arturo Borrero Gonzalez <arturo@debian.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-04 20:45:25 +01:00
Pan Bian	c66ebf2db5	net: dcb: set error code on failures In function dcbnl_cee_fill(), returns the value of variable err on errors. However, on some error paths (e.g. nla put fails), its value may be 0. It may be better to explicitly set a negative errno to variable err before returning. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=188881 Signed-off-by: Pan Bian <bianpan2016@163.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-03 23:54:25 -05:00
Erik Nordmark	adc176c547	ipv6 addrconf: Implemented enhanced DAD (RFC7527) Implemented RFC7527 Enhanced DAD. IPv6 duplicate address detection can fail if there is some temporary loopback of Ethernet frames. RFC7527 solves this by including a random nonce in the NS messages used for DAD, and if an NS is received with the same nonce it is assumed to be a looped back DAD probe and is ignored. RFC7527 is enabled by default. Can be disabled by setting both of conf/{all,interface}/enhanced_dad to zero. Signed-off-by: Erik Nordmark <nordmark@arista.com> Signed-off-by: Bob Gilligan <gilligan@arista.com> Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-03 23:21:37 -05:00
Ido Schimmel	c3852ef7f2	ipv4: fib: Replay events when registering FIB notifier Commit `b90eb75494` ("fib: introduce FIB notification infrastructure") introduced a new notification chain to notify listeners (f.e., switchdev drivers) about addition and deletion of routes. However, upon registration to the chain the FIB tables can already be populated, which means potential listeners will have an incomplete view of the tables. Solve that by dumping the FIB tables and replaying the events to the passed notification block. The dump itself is done using RCU in order not to starve consumers that need RTNL to make progress. The integrity of the dump is ensured by reading the FIB change sequence counter before and after the dump under RTNL. This allows us to avoid the problematic situation in which the dumping process sends a ENTRY_ADD notification following ENTRY_DEL generated by another process holding RTNL. Callers of the registration function may pass a callback that is executed in case the dump was inconsistent with current FIB tables. The number of retries until a consistent dump is achieved is set to a fixed number to prevent callers from looping for long periods of time. In case current limit proves to be problematic in the future, it can be easily converted to be configurable using a sysctl. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-03 19:29:35 -05:00
Ido Schimmel	cacaad11f4	ipv4: fib: Allow for consistent FIB dumping The next patch will enable listeners of the FIB notification chain to request a dump of the FIB tables. However, since RTNL isn't taken during the dump, it's possible for the FIB tables to change mid-dump, which will result in inconsistency between the listener's table and the kernel's. Allow listeners to know about changes that occurred mid-dump, by adding a change sequence counter to each net namespace. The counter is incremented just before a notification is sent in the FIB chain. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-03 19:29:35 -05:00
Ido Schimmel	d3f706f68e	ipv4: fib: Convert FIB notification chain to be atomic In order not to hold RTNL for long periods of time we're going to dump the FIB tables using RCU. Convert the FIB notification chain to be atomic, as we can't block in RCU critical sections. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-03 19:29:35 -05:00
Ido Schimmel	b423cb1080	ipv4: fib: Export free_fib_info() The FIB notification chain is going to be converted to an atomic chain, which means switchdev drivers will have to offload FIB entries in deferred work, as hardware operations entail sleeping. However, while the work is queued fib info might be freed, so a reference must be taken. To release the reference (and potentially free the fib info) fib_info_put() will be called, which in turn calls free_fib_info(). Export free_fib_info() so that modules will be able to invoke fib_info_put(). Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-03 19:29:35 -05:00
WANG Cong	548ed72246	act_mirred: fix a typo in get_dev Fixes: `255cb30425` ("net/sched: act_mirred: Add new tc_action_ops get_dev()") Cc: Hadar Hen Zion <hadarh@mellanox.com> Cc: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-03 19:28:02 -05:00
Paolo Abeni	363dc73aca	udp: be less conservative with sock rmem accounting Before commit `850cbaddb5` ("udp: use it's own memory accounting schema"), the udp protocol allowed sk_rmem_alloc to grow beyond the rcvbuf by the whole current packet's truesize. After said commit we allow sk_rmem_alloc to exceed the rcvbuf only if the receive queue is empty. As reported by Jesper this cause a performance regression for some (small) values of rcvbuf. This commit is intended to fix the regression restoring the old handling of the rcvbuf limit. Reported-by: Jesper Dangaard Brouer <brouer@redhat.com> Fixes: `850cbaddb5` ("udp: use it's own memory accounting schema") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-03 16:14:48 -05:00
David S. Miller	a38b610094	Here is another batman-adv bugfix: - fix checking for failed allocation of TVLV blocks in TT local data, by Sven Eckelmann -----BEGIN PGP SIGNATURE----- iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAlhBnMAWHHN3QHNpbW9u d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoQjfD/4i/eHZjGNwS9aTxh0QCfzM45JP xq5DmvCZrJXzGIt/g3vDI9WqpdJrnJsfUhCwzcn2NwCqBt8BKlGFT/NDXv5rxjBu 5wHV9gZ5r6whMVyXiDgkYNW0B1M71rjgh9rxvH61Lyrdaq5RSwGVZb9dohRffGpo zTU+l+GiMJXzszA3Hi0ga+8/3xyEdwjMzSyBSK2xLdyHFBn8gg4EfAf1kJhS4rk8 dbJUHFEvtg0wB7LkMlXe3DHkX/bRAi+PGU/YNk1x0x877ejfejIn7pp/GoweFgAW 2cn4SxVIsN7sycZCQYar5iqjbQ/mO0v5fNTkxo8EGiNB7wHTr0RVFI/UPOP6KeS6 2JR9qubg4kd6nU9yWMC8osbzye+GWmbPkvkFBrcHRkZpZDwsuaLJH2SbmWGt62kT mjpF2Uhsz+hn6AF5YhdDPAnBNP0iJWu25jWymCtHqCjieJsK0QjvAbXLMz45VWoW 5rrvOBy1gI47/3gs2jitEXER9/6onai3vzoJ3cX7YYUh2ntiAuK3Lg5wGnrODm/W fyjpDcnwURBhiP4H+uoeiRvxFXpNeUBJiTolw+tl8n//adlLNepclA0OnGunx4GP qc5LPksYp1KWV6rIy3Nl+62XGLZkSGHcYzVb+uX/3Bng8HNmva7ceUggsVwIFsgi Isw24Dckx4bSB8P86w== =Y2ED -----END PGP SIGNATURE----- Merge tag 'batadv-net-for-davem-20161202' of git://git.open-mesh.org/linux-merge Simon Wunderlich says: ==================== Here is another batman-adv bugfix: - fix checking for failed allocation of TVLV blocks in TT local data, by Sven Eckelmann ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-03 16:13:27 -05:00
Eric Dumazet	12efa1fa43	net_sched: gen_estimator: account for timer drifts Under heavy stress, timer used in estimators tend to slowly be delayed by a few jiffies, leading to inaccuracies. Lets remember what was the last scheduled jiffies so that we get more precise estimations, without having to add a multiply/divide in the loop to account for the drifts. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-03 16:12:17 -05:00
Alexey Dobriyan	6af2d5fff2	netns: fix net_generic() "id - 1" bloat net_generic() function is both a) inline and b) used ~600 times. It has the following code inside ... ptr = ng->ptr[id - 1]; ... "id" is never compile time constant so compiler is forced to subtract 1. And those decrements or LEA [r32 - 1] instructions add up. We also start id'ing from 1 to catch bugs where pernet sybsystem id is not initialized and 0. This is quite pointless idea (nothing will work or immediate interference with first registered subsystem) in general but it hints what needs to be done for code size reduction. Namely, overlaying allocation of pointer array and fixed part of structure in the beginning and using usual base-0 addressing. Ids are just cookies, their exact values do not matter, so lets start with 3 on x86_64. Code size savings (oh boy): -4.2 KB As usual, ignore the initial compiler stupidity part of the table. add/remove: 0/0 grow/shrink: 12/670 up/down: 89/-4297 (-4208) function old new delta tipc_nametbl_insert_publ 1250 1270 +20 nlmclnt_lookup_host 686 703 +17 nfsd4_encode_fattr 5930 5941 +11 nfs_get_client 1050 1061 +11 register_pernet_operations 333 342 +9 tcf_mirred_init 843 849 +6 tcf_bpf_init 1143 1149 +6 gss_setup_upcall 990 994 +4 idmap_name_to_id 432 434 +2 ops_init 274 275 +1 nfsd_inject_forget_client 259 260 +1 nfs4_alloc_client 612 613 +1 tunnel_key_walker 164 163 -1 ... tipc_bcbase_select_primary 392 360 -32 mac80211_hwsim_new_radio 2808 2767 -41 ipip6_tunnel_ioctl 2228 2186 -42 tipc_bcast_rcv 715 672 -43 tipc_link_build_proto_msg 1140 1089 -51 nfsd4_lock 3851 3796 -55 tipc_mon_rcv 1012 956 -56 Total: Before=156643951, After=156639743, chg -0.00% Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-03 15:59:58 -05:00
Alexey Dobriyan	9bfc7b9969	netns: add dummy struct inside "struct net_generic" This is precursor to fixing "[id - 1]" bloat inside net_generic(). Name "s" is chosen to complement name "u" often used for dummy unions. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-03 15:59:58 -05:00
Alexey Dobriyan	1a9a059203	netns: publish net_generic correctly Publishing net_generic pointer is done with silly mistake: new array is published BEFORE setting freshly acquired pernet subsystem pointer. memcpy rcu_assign_pointer kfree_rcu ng->ptr[id - 1] = data; This bug was introduced with commit `dec827d174` ("[NETNS]: The generic per-net pointers.") in the glorious days of chopping networking stack into containers proper 8.5 years ago (whee...) How it didn't trigger for so long? Well, you need quite specific set of conditions: ) race window opens once per pernet subsystem addition (read: modprobe or boot) ) not every pernet subsystem is eligible (need ->id and ->size) ) not every pernet subsystem is vulnerable (need incorrect or absense of ordering of register_pernet_sybsys() and actually using net_generic()) ) to hide the bug even more, default is to preallocate 13 pointers which is actually quite a lot. You need IPv6, netfilter, bridging etc together loaded to trigger reallocation in the first place. Trimmed down config are OK. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-03 15:59:58 -05:00
David S. Miller	2745529ac7	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Couple conflicts resolved here: 1) In the MACB driver, a bug fix to properly initialize the RX tail pointer properly overlapped with some changes to support variable sized rings. 2) In XGBE we had a "CONFIG_PM" --> "CONFIG_PM_SLEEP" fix overlapping with a reorganization of the driver to support ACPI, OF, as well as PCI variants of the chip. 3) In 'net' we had several probe error path bug fixes to the stmmac driver, meanwhile a lot of this code was cleaned up and reorganized in 'net-next'. 4) The cls_flower classifier obtained a helper function in 'net-next' called __fl_delete() and this overlapped with Daniel Borkamann's bug fix to use RCU for object destruction in 'net'. It also overlapped with Jiri's change to guard the rhashtable_remove_fast() call with a check against tc_skip_sw(). 5) In mlx4, a revert bug fix in 'net' overlapped with some unrelated changes in 'net-next'. 6) In geneve, a stale header pointer after pskb_expand_head() bug fix in 'net' overlapped with a large reorganization of the same code in 'net-next'. Since the 'net-next' code no longer had the bug in question, there was nothing to do other than to simply take the 'net-next' hunks. Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-03 12:29:53 -05:00
Eric Dumazet	b98b0bc8c4	net: avoid signed overflows for SO_{SND\|RCV}BUFFORCE CAP_NET_ADMIN users should not be allowed to set negative sk_sndbuf or sk_rcvbuf values, as it can lead to various memory corruptions, crashes, OOM... Note that before commit `8298193012` ("net: cleanups in sock_setsockopt()"), the bug was even more serious, since SO_SNDBUF and SO_RCVBUF were vulnerable. This needs to be backported to all known linux kernels. Again, many thanks to syzkaller team for discovering this gem. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-02 14:10:14 -05:00
Michal Kubeček	3de81b7588	tipc: check minimum bearer MTU Qian Zhang (张谦) reported a potential socket buffer overflow in tipc_msg_build() which is also known as CVE-2016-8632: due to insufficient checks, a buffer overflow can occur if MTU is too short for even tipc headers. As anyone can set device MTU in a user/net namespace, this issue can be abused by a regular user. As agreed in the discussion on Ben Hutchings' original patch, we should check the MTU at the moment a bearer is attached rather than for each processed packet. We also need to repeat the check when bearer MTU is adjusted to new device MTU. UDP case also needs a check to avoid overflow when calculating bearer MTU. Fixes: `b97bf3fd8f` ("[TIPC] Initial merge") Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reported-by: Qian Zhang (张谦) <zhangqian-c@360.cn> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-02 14:03:20 -05:00
David Ahern	aa4c1037a3	bpf: Add support for reading socket family, type, protocol Add socket family, type and protocol to bpf_sock allowing bpf programs read-only access. Add __sk_flags_offset[0] to struct sock before the bitfield to programmtically determine the offset of the unsigned int containing protocol and type. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-02 13:46:09 -05:00
David Ahern	6102365876	bpf: Add new cgroup attach type to enable sock modifications Add new cgroup based program type, BPF_PROG_TYPE_CGROUP_SOCK. Similar to BPF_PROG_TYPE_CGROUP_SKB programs can be attached to a cgroup and run any time a process in the cgroup opens an AF_INET or AF_INET6 socket. Currently only sk_bound_dev_if is exported to userspace for modification by a bpf program. This allows a cgroup to be configured such that AF_INET{6} sockets opened by processes are automatically bound to a specific device. In turn, this enables the running of programs that do not support SO_BINDTODEVICE in a specific VRF context / L3 domain. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-02 13:46:08 -05:00
Artem Savkov	6b6ebb6b01	ip6_offload: check segs for NULL in ipv6_gso_segment. segs needs to be checked for being NULL in ipv6_gso_segment() before calling skb_shinfo(segs), otherwise kernel can run into a NULL-pointer dereference: [ 97.811262] BUG: unable to handle kernel NULL pointer dereference at 00000000000000cc [ 97.819112] IP: [<ffffffff816e52f9>] ipv6_gso_segment+0x119/0x2f0 [ 97.825214] PGD 0 [ 97.827047] [ 97.828540] Oops: 0000 [#1] SMP [ 97.831678] Modules linked in: vhost_net vhost macvtap macvlan nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bridge stp llc snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_generic snd_hda_intel snd_hda_codec edac_mce_amd snd_hda_core edac_core snd_hwdep kvm_amd snd_seq kvm snd_seq_device snd_pcm irqbypass snd_timer ppdev parport_serial snd parport_pc k10temp pcspkr soundcore parport sp5100_tco shpchp sg wmi i2c_piix4 acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sr_mod cdrom sd_mod ata_generic pata_acpi amdkfd amd_iommu_v2 radeon broadcom bcm_phy_lib i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci serio_raw tg3 firewire_ohci libahci pata_atiixp drm ptp libata firewire_core pps_core i2c_core crc_itu_t fjes dm_mirror dm_region_hash dm_log dm_mod [ 97.927721] CPU: 1 PID: 3504 Comm: vhost-3495 Not tainted 4.9.0-7.el7.test.x86_64 #1 [ 97.935457] Hardware name: AMD Snook/Snook, BIOS ESK0726A 07/26/2010 [ 97.941806] task: ffff880129a1c080 task.stack: ffffc90001bcc000 [ 97.947720] RIP: 0010:[<ffffffff816e52f9>] [<ffffffff816e52f9>] ipv6_gso_segment+0x119/0x2f0 [ 97.956251] RSP: 0018:ffff88012fc43a10 EFLAGS: 00010207 [ 97.961557] RAX: 0000000000000000 RBX: ffff8801292c8700 RCX: 0000000000000594 [ 97.968687] RDX: 0000000000000593 RSI: ffff880129a846c0 RDI: 0000000000240000 [ 97.975814] RBP: ffff88012fc43a68 R08: ffff880129a8404e R09: 0000000000000000 [ 97.982942] R10: 0000000000000000 R11: ffff880129a84076 R12: 00000020002949b3 [ 97.990070] R13: ffff88012a580000 R14: 0000000000000000 R15: ffff88012a580000 [ 97.997198] FS: 0000000000000000(0000) GS:ffff88012fc40000(0000) knlGS:0000000000000000 [ 98.005280] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 98.011021] CR2: 00000000000000cc CR3: 0000000126c5d000 CR4: 00000000000006e0 [ 98.018149] Stack: [ 98.020157] 00000000ffffffff ffff88012fc43ac8 ffffffffa017ad0a 000000000000000e [ 98.027584] 0000001300000000 0000000077d59998 ffff8801292c8700 00000020002949b3 [ 98.035010] ffff88012a580000 0000000000000000 ffff88012a580000 ffff88012fc43a98 [ 98.042437] Call Trace: [ 98.044879] <IRQ> [ 98.046803] [<ffffffffa017ad0a>] ? tg3_start_xmit+0x84a/0xd60 [tg3] [ 98.053156] [<ffffffff815eeee0>] skb_mac_gso_segment+0xb0/0x130 [ 98.059158] [<ffffffff815eefd3>] __skb_gso_segment+0x73/0x110 [ 98.064985] [<ffffffff815ef40d>] validate_xmit_skb+0x12d/0x2b0 [ 98.070899] [<ffffffff815ef5d2>] validate_xmit_skb_list+0x42/0x70 [ 98.077073] [<ffffffff81618560>] sch_direct_xmit+0xd0/0x1b0 [ 98.082726] [<ffffffff815efd86>] __dev_queue_xmit+0x486/0x690 [ 98.088554] [<ffffffff8135c135>] ? cpumask_next_and+0x35/0x50 [ 98.094380] [<ffffffff815effa0>] dev_queue_xmit+0x10/0x20 [ 98.099863] [<ffffffffa09ce057>] br_dev_queue_push_xmit+0xa7/0x170 [bridge] [ 98.106907] [<ffffffffa09ce161>] br_forward_finish+0x41/0xc0 [bridge] [ 98.113430] [<ffffffff81627cf2>] ? nf_iterate+0x52/0x60 [ 98.118735] [<ffffffff81627d6b>] ? nf_hook_slow+0x6b/0xc0 [ 98.124216] [<ffffffffa09ce32c>] __br_forward+0x14c/0x1e0 [bridge] [ 98.130480] [<ffffffffa09ce120>] ? br_dev_queue_push_xmit+0x170/0x170 [bridge] [ 98.137785] [<ffffffffa09ce4bd>] br_forward+0x9d/0xb0 [bridge] [ 98.143701] [<ffffffffa09cfbb7>] br_handle_frame_finish+0x267/0x560 [bridge] [ 98.150834] [<ffffffffa09d0064>] br_handle_frame+0x174/0x2f0 [bridge] [ 98.157355] [<ffffffff8102fb89>] ? sched_clock+0x9/0x10 [ 98.162662] [<ffffffff810b63b2>] ? sched_clock_cpu+0x72/0xa0 [ 98.168403] [<ffffffff815eccf5>] __netif_receive_skb_core+0x1e5/0xa20 [ 98.174926] [<ffffffff813659f9>] ? timerqueue_add+0x59/0xb0 [ 98.180580] [<ffffffff815ed548>] __netif_receive_skb+0x18/0x60 [ 98.186494] [<ffffffff815ee625>] process_backlog+0x95/0x140 [ 98.192145] [<ffffffff815edccd>] net_rx_action+0x16d/0x380 [ 98.197713] [<ffffffff8170cff1>] __do_softirq+0xd1/0x283 [ 98.203106] [<ffffffff8170b2bc>] do_softirq_own_stack+0x1c/0x30 [ 98.209107] <EOI> [ 98.211029] [<ffffffff8108a5c0>] do_softirq+0x50/0x60 [ 98.216166] [<ffffffff815ec853>] netif_rx_ni+0x33/0x80 [ 98.221386] [<ffffffffa09eeff7>] tun_get_user+0x487/0x7f0 [tun] [ 98.227388] [<ffffffffa09ef3ab>] tun_sendmsg+0x4b/0x60 [tun] [ 98.233129] [<ffffffffa0b68932>] handle_tx+0x282/0x540 [vhost_net] [ 98.239392] [<ffffffffa0b68c25>] handle_tx_kick+0x15/0x20 [vhost_net] [ 98.245916] [<ffffffffa0abacfe>] vhost_worker+0x9e/0xf0 [vhost] [ 98.251919] [<ffffffffa0abac60>] ? vhost_umem_alloc+0x40/0x40 [vhost] [ 98.258440] [<ffffffff81003a47>] ? do_syscall_64+0x67/0x180 [ 98.264094] [<ffffffff810a44d9>] kthread+0xd9/0xf0 [ 98.268965] [<ffffffff810a4400>] ? kthread_park+0x60/0x60 [ 98.274444] [<ffffffff8170a4d5>] ret_from_fork+0x25/0x30 [ 98.279836] Code: 8b 93 d8 00 00 00 48 2b 93 d0 00 00 00 4c 89 e6 48 89 df 66 89 93 c2 00 00 00 ff 10 48 3d 00 f0 ff ff 49 89 c2 0f 87 52 01 00 00 <41> 8b 92 cc 00 00 00 48 8b 80 d0 00 00 00 44 0f b7 74 10 06 66 [ 98.299425] RIP [<ffffffff816e52f9>] ipv6_gso_segment+0x119/0x2f0 [ 98.305612] RSP <ffff88012fc43a10> [ 98.309094] CR2: 00000000000000cc [ 98.312406] ---[ end trace 726a2c7a2d2d78d0 ]--- Signed-off-by: Artem Savkov <asavkov@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-02 13:34:58 -05:00
Sowmini Varadhan	721c7443dc	RDS: TCP: unregister_netdevice_notifier() in error path of rds_tcp_init_net If some error is encountered in rds_tcp_init_net, make sure to unregister_netdevice_notifier(), else we could trigger a panic later on, when the modprobe from a netns fails. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-02 13:29:26 -05:00
Hadar Hen Zion	7091d8c705	net/sched: cls_flower: Add offload support using egress Hardware device In order to support hardware offloading when the device given by the tc rule is different from the Hardware underline device, extract the mirred (egress) device from the tc action when a filter is added, using the new tc_action_ops, get_dev(). Flower caches the information about the mirred device and use it for calling ndo_setup_tc in filter change, update stats and delete. Calling ndo_setup_tc of the mirred (egress) device instead of the ingress device will allow a resolution between the software ingress device and the underline hardware device. The resolution will take place inside the offloading driver using 'egress_device' flag added to tc_to_netdev struct which is provided to the offloading driver. Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-02 13:28:37 -05:00
Hadar Hen Zion	255cb30425	net/sched: act_mirred: Add new tc_action_ops get_dev() Adding support to a new tc_action_ops. get_dev is a general option which allows to get the underline device when trying to offload a tc rule. In case of mirred action the returned device is the mirred (egress) device. Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-02 13:28:36 -05:00
Hadar Hen Zion	3036dab670	net/sched: cls_flower: Provide a filter to replace/destroy hardware filter functions Instead of providing many arguments to fl_hw_{replace/destroy}_filter functions, just provide cls_fl_filter struct that includes all the relevant args. This patches doesn't add any new functionality. Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-02 13:28:36 -05:00
Hadar Hen Zion	796852197c	net/sched: cls_flower: Try to offload only if skip_hw flag isn't set Check skip_hw flag isn't set before calling fl_hw_{replace/destroy}_filter and fl_hw_update_stats functions. Replace the call to tc_should_offload with tc_can_offload. tc_can_offload only checks if the device supports offloading, the check for skip_hw flag is done earlier in the flow. Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-02 13:28:36 -05:00
Florian Westphal	25429d7b7d	tcp: allow to turn tcp timestamp randomization off Eric says: "By looking at tcpdump, and TS val of xmit packets of multiple flows, we can deduct the relative qdisc delays (think of fq pacing). This should work even if we have one flow per remote peer." Having random per flow (or host) offsets doesn't allow that anymore so add a way to turn this off. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-02 12:49:59 -05:00
Florian Westphal	95a22caee3	tcp: randomize tcp timestamp offsets for each connection jiffies based timestamps allow for easy inference of number of devices behind NAT translators and also makes tracking of hosts simpler. commit `ceaa1fef65` ("tcp: adding a per-socket timestamp offset") added the main infrastructure that is needed for per-connection ts randomization, in particular writing/reading the on-wire tcp header format takes the offset into account so rest of stack can use normal tcp_time_stamp (jiffies). So only two items are left: - add a tsoffset for request sockets - extend the tcp isn generator to also return another 32bit number in addition to the ISN. Re-use of ISN generator also means timestamps are still monotonically increasing for same connection quadruple, i.e. PAWS will still work. Includes fixes from Eric Dumazet. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-02 12:49:59 -05:00
Eli Cooper	80d1106aea	Revert: "ip6_tunnel: Update skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit()" This reverts commit `ae148b0858` ("ip6_tunnel: Update skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit()"). skb->protocol is now set in __ip_local_out() and __ip6_local_out() before dst_output() is called. It is no longer necessary to do it for each tunnel. Cc: stable@vger.kernel.org Signed-off-by: Eli Cooper <elicooper@gmx.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-02 12:34:22 -05:00
Eli Cooper	b4e479a96f	ipv6: Set skb->protocol properly for local output When xfrm is applied to TSO/GSO packets, it follows this path: xfrm_output() -> xfrm_output_gso() -> skb_gso_segment() where skb_gso_segment() relies on skb->protocol to function properly. This patch sets skb->protocol to ETH_P_IPV6 before dst_output() is called, fixing a bug where GSO packets sent through an ipip6 tunnel are dropped when xfrm is involved. Cc: stable@vger.kernel.org Signed-off-by: Eli Cooper <elicooper@gmx.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-02 12:34:22 -05:00
Eli Cooper	f418043910	ipv4: Set skb->protocol properly for local output When xfrm is applied to TSO/GSO packets, it follows this path: xfrm_output() -> xfrm_output_gso() -> skb_gso_segment() where skb_gso_segment() relies on skb->protocol to function properly. This patch sets skb->protocol to ETH_P_IP before dst_output() is called, fixing a bug where GSO packets sent through a sit tunnel are dropped when xfrm is involved. Cc: stable@vger.kernel.org Signed-off-by: Eli Cooper <elicooper@gmx.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-02 12:34:22 -05:00
Philip Pettersson	84ac726023	packet: fix race condition in packet_set_ring When packet_set_ring creates a ring buffer it will initialize a struct timer_list if the packet version is TPACKET_V3. This value can then be raced by a different thread calling setsockopt to set the version to TPACKET_V1 before packet_set_ring has finished. This leads to a use-after-free on a function pointer in the struct timer_list when the socket is closed as the previously initialized timer will not be deleted. The bug is fixed by taking lock_sock(sk) in packet_setsockopt when changing the packet version while also taking the lock at the start of packet_set_ring. Fixes: `f6fb8f100b` ("af-packet: TPACKET_V3 flexible buffer implementation.") Signed-off-by: Philip Pettersson <philip.pettersson@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-02 12:16:49 -05:00
Soheil Hassas Yeganeh	83a1a1a70e	sock: reset sk_err for ICMP packets read from error queue Only when ICMP packets are enqueued onto the error queue, sk_err is also set. Before `f5f99309fa` (sock: do not set sk_err in sock_dequeue_err_skb), a subsequent error queue read would set sk_err to the next error on the queue, or 0 if empty. As no error types other than ICMP set this field, sk_err should not be modified upon dequeuing them. Only for ICMP errors, reset the (racy) sk_err. Some applications, like traceroute, rely on it and go into a futile busy POLLERR loop otherwise. In principle, sk_err has to be set while an ICMP error is queued. Testing is_icmp_err_skb(skb_next) approximates this without requiring a full queue walk. Applications that receive both ICMP and other errors cannot rely on this legacy behavior, as other errors do not set sk_err in the first place. Fixes: `f5f99309fa` (sock: do not set sk_err in sock_dequeue_err_skb) Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-02 10:55:39 -05:00
Thomas Graf	3a0af8fd61	bpf: BPF for lightweight tunnel infrastructure Registers new BPF program types which correspond to the LWT hooks: - BPF_PROG_TYPE_LWT_IN => dst_input() - BPF_PROG_TYPE_LWT_OUT => dst_output() - BPF_PROG_TYPE_LWT_XMIT => lwtunnel_xmit() The separate program types are required to differentiate between the capabilities each LWT hook allows: * Programs attached to dst_input() or dst_output() are restricted and may only read the data of an skb. This prevent modification and possible invalidation of already validated packet headers on receive and the construction of illegal headers while the IP headers are still being assembled. * Programs attached to lwtunnel_xmit() are allowed to modify packet content as well as prepending an L2 header via a newly introduced helper bpf_skb_change_head(). This is safe as lwtunnel_xmit() is invoked after the IP header has been assembled completely. All BPF programs receive an skb with L3 headers attached and may return one of the following error codes: BPF_OK - Continue routing as per nexthop BPF_DROP - Drop skb and return EPERM BPF_REDIRECT - Redirect skb to device as per redirect() helper. (Only valid in lwtunnel_xmit() context) The return codes are binary compatible with their TC_ACT_ relatives to ease compatibility. Signed-off-by: Thomas Graf <tgraf@suug.ch> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-02 10:51:49 -05:00
Thomas Graf	efd8570081	route: Set lwtstate for local traffic and cached input dsts A route on the output path hitting a RTN_LOCAL route will keep the dst associated on its way through the loopback device. On the receive path, the dst_input() call will thus invoke the input handler of the route created in the output path. Thus, lwt redirection for input must be done for dsts allocated in the otuput path as well. Also, if a route is cached in the input path, the allocated dst should respect lwtunnel configuration on the nexthop as well. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-02 10:51:49 -05:00
Thomas Graf	11b3d9c586	route: Set orig_output when redirecting to lwt on locally generated traffic orig_output for IPv4 was only set for dsts which hit an input route. Set it consistently for locally generated traffic as well to allow lwt to continue the dst_output() path as configured by the nexthop. Fixes: `2536862311` ("lwt: Add support to redirect dst.input") Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-02 10:51:49 -05:00
Tobias Klauser	6919756caa	net/rtnetlink: fix attribute name in nlmsg_size() comments Use the correct attribute constant names IFLA_GSO_MAX_{SEGS,SIZE} instead of IFLA_MAX_GSO_{SEGS,SIZE} for the comments int nlmsg_size(). Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-02 10:34:59 -05:00
Sven Eckelmann	c2d0f48a13	batman-adv: Check for alloc errors when preparing TT local data batadv_tt_prepare_tvlv_local_data can fail to allocate the memory for the new TVLV block. The caller is informed about this problem with the returned length of 0. Not checking this value results in an invalid memory access when either tt_data or tt_change is accessed. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Fixes: `7ea7b4a142` ("batman-adv: make the TT CRC logic VLAN specific") Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>	2016-12-02 10:46:59 +01:00
NeilBrown	2c2ee6d20b	sunrpc: Don't engage exponential backoff when connection attempt is rejected. xs_connect() contains an exponential backoff mechanism so the repeated connection attempts are delayed by longer and longer amounts. This is appropriate when the connection failed due to a timeout, but it not appropriate when a definitive "no" answer is received. In such cases, call_connect_status() imposes a minimum 3-second back-off, so not having the exponetial back-off will never result in immediate retries. The current situation is a problem when the NFS server tries to register with rpcbind but rpcbind isn't running. All connection attempts are made on the same "xprt" and as the connection is never "closed", the exponential back delays successive attempts to register, or de-register, different protocols. This results in a multi-minute delay with no benefit. So, when call_connect_status() receives a definitive "no", use xprt_conditional_disconnect() to cancel the previous connection attempt. This will set XPRT_CLOSE_WAIT so that xprt->ops->close() calls xs_close() which resets the reestablish_timeout. To ensure xprt_conditional_disconnect() does the right thing, we ensure that rq_connect_cookie is set before a connection attempt, and allow xprt_conditional_disconnect() to complete even when the transport is not fully connected. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2016-12-01 17:40:41 -05:00
Zhang Shengju	2934c9dbd3	rtnetlink: return the correct error code Before this patch, function ndo_dflt_fdb_dump() will always return code from uc fdb dump. The reture code of mc fdb dump is lost. Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-01 14:36:03 -05:00
David S. Miller	7bbf91ce27	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2016-12-01 1) Change the error value when someone tries to run 32bit userspace on a 64bit host from -ENOTSUPP to the userspace exported -EOPNOTSUPP. Fix from Yi Zhao. 2) On inbound, ESN sequence numbers are already in network byte order. So don't try to convert it again, this fixes integrity verification for ESN. Fixes from Tobias Brunner. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-01 11:35:49 -05:00
David S. Miller	3d2dd617fb	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net This is a large batch of Netfilter fixes for net, they are: 1) Three patches to fix NAT conversion to rhashtable: Switch to rhlist structure that allows to have several objects with the same key. Moreover, fix wrong comparison logic in nf_nat_bysource_cmp() as this is expecting a return value similar to memcmp(). Change location of the nat_bysource field in the nf_conn structure to avoid zeroing this as it breaks interaction with SLAB_DESTROY_BY_RCU and lead us to crashes. From Florian Westphal. 2) Don't allow malformed fragments go through in IPv6, drop them, otherwise we hit GPF, patch from Florian Westphal. 3) Fix crash if attributes are missing in nft_range, from Liping Zhang. 4) Fix arptables 32-bits userspace 64-bits kernel compat, from Hongxu Jia. 5) Two patches from David Ahern to fix netfilter interaction with vrf. From David Ahern. 6) Fix element timeout calculation in nf_tables, we take milliseconds from userspace, but we use jiffies from kernelspace. Patch from Anders K. Pedersen. 7) Missing validation length netlink attribute for nft_hash, from Laura Garcia. 8) Fix nf_conntrack_helper documentation, we don't default to off anymore for a bit of time so let's get this in sync with the code. I know is late but I think these are important, specifically the NAT bits, as they are mostly addressing fallout from recent changes. I also read there are chances to have -rc8, if that is the case, that would also give us a bit more time to test this. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-01 11:04:41 -05:00
Chuck Lever	fafedf8170	svcrdma: Further clean-up of svc_rdma_get_inv_rkey() No longer any need for the dprintk(). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2016-11-30 17:31:16 -05:00
Chuck Lever	0725745020	svcrdma: Break up dprintk format in svc_rdma_accept() The current code results in: Nov 7 14:50:19 klimt kernel: svcrdma: newxprt->sc_cm_id=ffff88085590c800, newxprt->sc_pd=ffff880852a7ce00#012 cm_id->device=ffff88084dd20000, sc_pd->device=ffff88084dd20000#012 cap.max_send_wr = 272#012 cap.max_recv_wr = 34#012 cap.max_send_sge = 32#012 cap.max_recv_sge = 32 Nov 7 14:50:19 klimt kernel: svcrdma: new connection ffff880855908000 accepted with the following attributes:#012 local_ip : 10.0.0.5#012 local_port#011 : 20049#012 remote_ip : 10.0.0.2#012 remote_port : 59909#012 max_sge : 32#012 max_sge_rd : 30#012 sq_depth : 272#012 max_requests : 32#012 ord : 16 Split up the output over multiple dprintks and take the opportunity to fix the display of IPv6 addresses. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2016-11-30 17:31:16 -05:00
Chuck Lever	f5426d37f6	svcrdma: Remove unused variable in rdma_copy_tail() Clean up. linux-2.6/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c: In function ‘rdma_copy_tail’: linux-2.6/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c:376:6: warning: variable ‘ret’ set but not used [-Wunused-but-set-variable] int ret; ^ Fixes: `a97c331f9a` ("svcrdma: Handle additional inline content") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2016-11-30 17:31:15 -05:00
Chuck Lever	9a16a34d21	svcrdma: Remove unused variables in xprt_rdma_bc_allocate() Clean up. /linux-2.6/net/sunrpc/xprtrdma/svc_rdma_backchannel.c: In function ‘xprt_rdma_bc_allocate’: linux-2.6/net/sunrpc/xprtrdma/svc_rdma_backchannel.c:169:23: warning: variable ‘rdma’ set but not used [-Wunused-but-set-variable] struct svcxprt_rdma *rdma; ^ Fixes: `5d252f90a8` ("svcrdma: Add class for RDMA backwards ...") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2016-11-30 17:31:14 -05:00
Chuck Lever	96a58f9c19	svcrdma: Remove svc_rdma_op_ctxt::wc_status Clean up: Completion status is already reported in the individual completion handlers. Save a few bytes in struct svc_rdma_op_ctxt. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2016-11-30 17:31:14 -05:00
Chuck Lever	dd6fd213b0	svcrdma: Remove DMA map accounting Clean up: sc_dma_used is not required for correct operation. It is simply a debugging tool to report when svcrdma has leaked DMA maps. However, manipulating an atomic has a measurable CPU cost, and DMA map accounting specific to svcrdma will be meaningless once svcrdma is converted to use the new generic r/w API. A similar kind of debug accounting can be done simply by enabling the IOMMU or by using CONFIG_DMA_API_DEBUG, CONFIG_IOMMU_DEBUG, and CONFIG_IOMMU_LEAK. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2016-11-30 17:31:13 -05:00
Chuck Lever	e4eb42cecc	svcrdma: Remove BH-disabled spin locking in svc_rdma_send() svcrdma's current SQ accounting algorithm takes sc_lock and disables bottom-halves while posting all RDMA Read, Write, and Send WRs. This is relatively heavyweight serialization. And note that Write and Send are already fully serialized by the xpt_mutex. Using a single atomic_t should be all that is necessary to guarantee that ib_post_send() is called only when there is enough space on the send queue. This is what the other RDMA-enabled storage targets do. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2016-11-30 17:31:13 -05:00
Chuck Lever	5fdca65314	svcrdma: Renovate sendto chunk list parsing The current sendto code appears to support clients that provide only one of a Read list, a Write list, or a Reply chunk. My reading of that code is that it doesn't support the following cases: - Read list + Write list - Read list + Reply chunk - Write list + Reply chunk - Read list + Write list + Reply chunk The protocol allows more than one Read or Write chunk in those lists. Some clients do send a Read list and Reply chunk simultaneously. NFSv4 WRITE uses a Read list for the data payload, and a Reply chunk because the GETATTR result in the reply can contain a large object like an ACL. Generalize one of the sendto code paths needed to support all of the above cases, and attempt to ensure that only one pass is done through the RPC Call's transport header to gather chunk list information for building the reply. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2016-11-30 17:31:12 -05:00
Chuck Lever	4d712ef1db	svcauth_gss: Close connection when dropping an incoming message S5.3.3.1 of RFC 2203 requires that an incoming GSS-wrapped message whose sequence number lies outside the current window is dropped. The rationale is: The reason for discarding requests silently is that the server is unable to determine if the duplicate or out of range request was due to a sequencing problem in the client, network, or the operating system, or due to some quirk in routing, or a replay attack by an intruder. Discarding the request allows the client to recover after timing out, if indeed the duplication was unintentional or well intended. However, clients may rely on the server dropping the connection to indicate that a retransmit is needed. Without a connection reset, a client can wait forever without retransmitting, and the workload just stops dead. I've reproduced this behavior by running xfstests generic/323 on an NFSv4.0 mount with proto=rdma and sec=krb5i. To address this issue, have the server close the connection when it silently discards an incoming message due to a GSS sequence number problem. There are a few other places where the server will never reply. Change those spots in a similar fashion. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2016-11-30 17:31:11 -05:00
Chuck Lever	1b9f700b8c	svcrdma: Clear xpt_bc_xps in xprt_setup_rdma_bc() error exit arm Logic copied from xs_setup_bc_tcp(). Fixes: `39a9beab5a` ('rpc: share one xps between all backchannels') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2016-11-30 17:31:11 -05:00
Lorenzo Colitti	d109e61bfe	net: ipv4: Don't crash if passing a null sk to ip_rt_update_pmtu. Commit `e2d118a1cb` ("net: inet: Support UID-based routing in IP protocols.") made __build_flow_key call sock_net(sk) to determine the network namespace of the passed-in socket. This crashes if sk is NULL. Fix this by getting the network namespace from the skb instead. Fixes: `e2d118a1cb` ("net: inet: Support UID-based routing in IP protocols.") Reported-by: Erez Shitrit <erezsh@dev.mellanox.co.il> Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-30 14:53:59 -05:00
Hongxu Jia	17a49cd549	netfilter: arp_tables: fix invoking 32bit "iptable -P INPUT ACCEPT" failed in 64bit kernel Since `09d9686047` ("netfilter: x_tables: do compat validation via translate_table"), it used compatr structure to assign newinfo structure. In translate_compat_table of ip_tables.c and ip6_tables.c, it used compatr->hook_entry to replace info->hook_entry and compatr->underflow to replace info->underflow, but not do the same replacement in arp_tables.c. It caused invoking 32-bit "arptbale -P INPUT ACCEPT" failed in 64bit kernel. -------------------------------------- root@qemux86-64:~# arptables -P INPUT ACCEPT root@qemux86-64:~# arptables -P INPUT ACCEPT ERROR: Policy for `INPUT' offset 448 != underflow 0 arptables: Incompatible with this kernel -------------------------------------- Fixes: `09d9686047` ("netfilter: x_tables: do compat validation via translate_table") Signed-off-by: Hongxu Jia <hongxu.jia@windriver.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-11-30 20:50:23 +01:00
Guillaume Nault	31e2f21fb3	l2tp: fix address test in __l2tp_ip6_bind_lookup() The '!(addr && ipv6_addr_equal(addr, laddr))' part of the conditional matches if addr is NULL or if addr != laddr. But the intend of __l2tp_ip6_bind_lookup() is to find a sockets with the same address, so the ipv6_addr_equal() condition needs to be inverted. For better clarity and consistency with the rest of the expression, the (!X \|\| X == Y) notation is used instead of !(X && X != Y). Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-30 14:14:08 -05:00
Guillaume Nault	df90e68861	l2tp: fix lookup for sockets not bound to a device in l2tp_ip When looking up an l2tp socket, we must consider a null netdevice id as wild card. There are currently two problems caused by __l2tp_ip_bind_lookup() not considering 'dif' as wild card when set to 0: * A socket bound to a device (i.e. with sk->sk_bound_dev_if != 0) never receives any packet. Since __l2tp_ip_bind_lookup() is called with dif == 0 in l2tp_ip_recv(), sk->sk_bound_dev_if is always different from 'dif' so the socket doesn't match. * Two sockets, one bound to a device but not the other, can be bound to the same address. If the first socket binding to the address is the one that is also bound to a device, the second socket can bind to the same address without __l2tp_ip_bind_lookup() noticing the overlap. To fix this issue, we need to consider that any null device index, be it 'sk->sk_bound_dev_if' or 'dif', matches with any other value. We also need to pass the input device index to __l2tp_ip_bind_lookup() on reception so that sockets bound to a device never receive packets from other devices. This patch fixes l2tp_ip6 in the same way. Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-30 14:14:08 -05:00
Guillaume Nault	d5e3a19093	l2tp: fix racy socket lookup in l2tp_ip and l2tp_ip6 bind() It's not enough to check for sockets bound to same address at the beginning of l2tp_ip{,6}_bind(): even if no socket is found at that time, a socket with the same address could be bound before we take the l2tp lock again. This patch moves the lookup right before inserting the new socket, so that no change can ever happen to the list between address lookup and socket insertion. Care is taken to avoid side effects on the socket in case of failure. That is, modifications of the socket are done after the lookup, when binding is guaranteed to succeed, and before releasing the l2tp lock, so that concurrent lookups will always see fully initialised sockets. For l2tp_ip, 'ret' is set to -EINVAL before checking the SOCK_ZAPPED bit. Error code was mistakenly set to -EADDRINUSE on error by commit `32c231164b` ("l2tp: fix racy SOCK_ZAPPED flag check in l2tp_ip{,6}_bind()"). Using -EINVAL restores original behaviour. For l2tp_ip6, the lookup is now always done with the correct bound device. Before this patch, when binding to a link-local address, the lookup was done with the original sk->sk_bound_dev_if, which was later overwritten with addr->l2tp_scope_id. Lookup is now performed with the final sk->sk_bound_dev_if value. Finally, the (addr_len >= sizeof(struct sockaddr_in6)) check has been dropped: addr is a sockaddr_l2tpip6 not sockaddr_in6 and addr_len has already been checked at this point (this part of the code seems to have been copy-pasted from net/ipv6/raw.c). Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-30 14:14:08 -05:00
Guillaume Nault	a3c18422a4	l2tp: hold socket before dropping lock in l2tp_ip{, 6}_recv() Socket must be held while under the protection of the l2tp lock; there is no guarantee that sk remains valid after the read_unlock_bh() call. Same issue for l2tp_ip and l2tp_ip6. Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-30 14:14:07 -05:00
Guillaume Nault	0382a25af3	l2tp: lock socket before checking flags in connect() Socket flags aren't updated atomically, so the socket must be locked while reading the SOCK_ZAPPED flag. This issue exists for both l2tp_ip and l2tp_ip6. For IPv6, this patch also brings error handling for __ip6_datagram_connect() failures. Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-30 14:14:07 -05:00
Zhang Shengju	18502acd9a	neigh: remove duplicate check for same neigh Currently loop index 'idx' is used as the index in the neigh list of interest. It's increased only when the neigh is dumped. It's not the absolute index in the list. Because there is no info to record which neigh has already be scanned by previous loop. This will cause the filtered out neighs to be scanned mulitple times. This patch make idx as the absolute index in the list, it will increase no matter whether the neigh is filtered. This will prevent the above problem. And this is in line with other dump functions. v2: - take David Ahern's advice to do simple change Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-30 13:46:16 -05:00
Daniele Di Proietto	f92a80a997	openvswitch: Fix skb leak in IPv6 reassembly. If nf_ct_frag6_gather() returns an error other than -EINPROGRESS, it means that we still have a reference to the skb. We should free it before returning from handle_fragments, as stated in the comment above. Fixes: `daaa7d647f` ("netfilter: ipv6: avoid nf_iterate recursion") CC: Florian Westphal <fw@strlen.de> CC: Pravin B Shelar <pshelar@ovn.org> CC: Joe Stringer <joe@ovn.org> Signed-off-by: Daniele Di Proietto <diproiettod@ovn.org> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-30 11:00:45 -05:00
Daniel Borkmann	85de8576a0	bpf, xdp: allow to pass flags to dev_change_xdp_fd Add an IFLA_XDP_FLAGS attribute that can be passed for setting up XDP along with IFLA_XDP_FD, which eventually allows user space to implement typical add/replace/delete logic for programs. Right now, calling into dev_change_xdp_fd() will always replace previous programs. When passed XDP_FLAGS_UPDATE_IF_NOEXIST, we can handle this more graceful when requested by returning -EBUSY in case we try to attach a new program, but we find that another one is already attached. This will be used by upcoming front-end for iproute2 as well. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-30 10:27:20 -05:00
Francis Yan	1c885808e4	tcp: SOF_TIMESTAMPING_OPT_STATS option for SO_TIMESTAMPING This patch exports the sender chronograph stats via the socket SO_TIMESTAMPING channel. Currently we can instrument how long a particular application unit of data was queued in TCP by tracking SOF_TIMESTAMPING_TX_SOFTWARE and SOF_TIMESTAMPING_TX_SCHED. Having these sender chronograph stats exported simultaneously along with these timestamps allow further breaking down the various sender limitation. For example, a video server can tell if a particular chunk of video on a connection takes a long time to deliver because TCP was experiencing small receive window. It is not possible to tell before this patch without packet traces. To prepare these stats, the user needs to set SOF_TIMESTAMPING_OPT_STATS and SOF_TIMESTAMPING_OPT_TSONLY flags while requesting other SOF_TIMESTAMPING TX timestamps. When the timestamps are available in the error queue, the stats are returned in a separate control message of type SCM_TIMESTAMPING_OPT_STATS, in a list of TLVs (struct nlattr) of types: TCP_NLA_BUSY_TIME, TCP_NLA_RWND_LIMITED, TCP_NLA_SNDBUF_LIMITED. Unit is microsecond. Signed-off-by: Francis Yan <francisyyan@gmail.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-30 10:04:25 -05:00
Francis Yan	efd9017416	tcp: export sender limits chronographs to TCP_INFO This patch exports all the sender chronograph measurements collected in the previous patches to TCP_INFO interface. Note that busy time exported includes all the other sending limits (rwnd-limited, sndbuf-limited). Internally the time unit is jiffy but externally the measurements are in microseconds for future extensions. Signed-off-by: Francis Yan <francisyyan@gmail.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-30 10:04:25 -05:00
Francis Yan	b0f71bd3e1	tcp: instrument how long TCP is limited by insufficient send buffer This patch measures the amount of time when TCP runs out of new data to send to the network due to insufficient send buffer, while TCP is still busy delivering (i.e. write queue is not empty). The goal is to indicate either the send buffer autotuning or user SO_SNDBUF setting has resulted network under-utilization. The measurement starts conservatively by checking various conditions to minimize false claims (i.e. under-estimation is more likely). The measurement stops when the SOCK_NOSPACE flag is cleared. But it does not account the time elapsed till the next application write. Also the measurement only starts if the sender is still busy sending data, s.t. the limit accounted is part of the total busy time. Signed-off-by: Francis Yan <francisyyan@gmail.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-30 10:04:24 -05:00
Francis Yan	5615f88614	tcp: instrument how long TCP is limited by receive window This patch measures the total time when the TCP stops sending because the receiver's advertised window is not large enough. Note that once the limit is lifted we are likely in the busy status if we have data pending. Signed-off-by: Francis Yan <francisyyan@gmail.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-30 10:04:24 -05:00
Francis Yan	0f87230d1a	tcp: instrument how long TCP is busy sending This patch measures TCP busy time, which is defined as the period of time when sender has data (or FIN) to send. The time starts when data is buffered and stops when the write queue is flushed by ACKs or error events. Note the busy time does not include SYN time, unless data is included in SYN (i.e. Fast Open). It does include FIN time even if the FIN carries no payload. Excluding pure FIN is possible but would incur one additional test in the fast path, which may not be worth it. Signed-off-by: Francis Yan <francisyyan@gmail.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-30 10:04:24 -05:00

1 2 3 4 5 ...

44643 Commits