linux

mirror of https://github.com/torvalds/linux.git synced 2024-12-29 22:31:32 +00:00

Author	SHA1	Message	Date
Daniel Borkmann	e28e87ed47	ip_tunnel, bpf: ip_tunnel_info_opts_{get, set} depends on CONFIG_INET Helpers like ip_tunnel_info_opts_{get,set}() are only available if CONFIG_INET is set, thus add an empty definition into the header for the !CONFIG_INET case, where already other empty inline helpers are defined. This avoids ifdef kludge inside filter.c, but also vxlan and geneve themself where this facility can only be used with, depend on INET being set. For the !INET case TUNNEL_OPTIONS_PRESENT would never be set in flags. Fixes: `14ca0751c9` ("bpf: support for access to tunnel options") Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-08 23:20:53 -05:00
David S. Miller	f14b488d50	Merge branch 'bpf-map-prealloc' Alexei Starovoitov says: ==================== bpf: map pre-alloc v1->v2: . fix few issues spotted by Daniel . converted stackmap into pre-allocation as well . added a workaround for lockdep false positive . added pcpu_freelist_populate to be used by hashmap and stackmap this path set switches bpf hash map to use pre-allocation by default and introduces BPF_F_NO_PREALLOC flag to keep old behavior for cases where full map pre-allocation is too memory expensive. Some time back Daniel Wagner reported crashes when bpf hash map is used to compute time intervals between preempt_disable->preempt_enable and recently Tom Zanussi reported a dead lock in iovisor/bcc/funccount tool if it's used to count the number of invocations of kernel 'spin' functions. Both problems are due to the recursive use of slub and can only be solved by pre-allocating all map elements. A lot of different solutions were considered. Many implemented, but at the end pre-allocation seems to be the only feasible answer. As far as pre-allocation goes it also was implemented 4 different ways: - simple free-list with single lock - percpu_ida with optimizations - blk-mq-tag variant customized for bpf use case - percpu_freelist For bpf style of alloc/free patterns percpu_freelist is the best and implemented in this patch set. Detailed performance numbers in patch 3. Patch 2 introduces percpu_freelist Patch 1 fixes simple deadlocks due to missing recursion checks Patch 5: converts stackmap to pre-allocation Patches 6-9: prepare test infra Patch 10: stress test for hash map infra. It attaches to spin_lock functions and bpf_map_update/delete are called from different contexts Patch 11: stress for bpf_get_stackid Patch 12: map performance test Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Reported-by: Tom Zanussi <tom.zanussi@linux.intel.com> ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-08 15:28:33 -05:00
Alexei Starovoitov	c3f85cffc5	samples/bpf: test both pre-alloc and normal maps extend test coveraged to include pre-allocated and run-time alloc maps Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-08 15:28:32 -05:00
Alexei Starovoitov	89b9760701	samples/bpf: add map_flags to bpf loader note old loader is compatible with new kernel. map_flags are optional Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-08 15:28:32 -05:00
Alexei Starovoitov	3622e7e493	samples/bpf: move ksym_search() into library move ksym search from offwaketime into library to be reused in other tests Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-08 15:28:32 -05:00
Alexei Starovoitov	618ec9a7b1	samples/bpf: make map creation more verbose map creation is typically the first one to fail when rlimits are too low, not enough memory, etc Make this failure scenario more verbose Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-08 15:28:32 -05:00
Alexei Starovoitov	557c0c6e7d	bpf: convert stackmap to pre-allocation It was observed that calling bpf_get_stackid() from a kprobe inside slub or from spin_unlock causes similar deadlock as with hashmap, therefore convert stackmap to use pre-allocated memory. The call_rcu is no longer feasible mechanism, since delayed freeing causes bpf_get_stackid() to fail unpredictably when number of actual stacks is significantly less than user requested max_entries. Since elements are no longer freed into slub, we can push elements into freelist immediately and let them be recycled. However the very unlikley race between user space map_lookup() and program-side recycling is possible: cpu0 cpu1 ---- ---- user does lookup(stackidX) starts copying ips into buffer delete(stackidX) calls bpf_get_stackid() which recyles the element and overwrites with new stack trace To avoid user space seeing a partial stack trace consisting of two merged stack traces, do bucket = xchg(, NULL); copy; xchg(,bucket); to preserve consistent stack trace delivery to user space. Now we can move memset(,0) of left-over element value from critical path of bpf_get_stackid() into slow-path of user space lookup. Also disallow lookup() from bpf program, since it's useless and program shouldn't be messing with collected stack trace. Note that similar race between user space lookup and kernel side updates is also present in hashmap, but it's not a new race. bpf programs were always allowed to modify hash and array map elements while user space is copying them. Fixes: `d5a3b1f691` ("bpf: introduce BPF_MAP_TYPE_STACK_TRACE") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-08 15:28:31 -05:00
Alexei Starovoitov	823707b68d	bpf: check for reserved flag bits in array and stack maps Suggested-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-08 15:28:31 -05:00
Alexei Starovoitov	6c90598174	bpf: pre-allocate hash map elements If kprobe is placed on spin_unlock then calling kmalloc/kfree from bpf programs is not safe, since the following dead lock is possible: kfree->spin_lock(kmem_cache_node->lock)...spin_unlock->kprobe-> bpf_prog->map_update->kmalloc->spin_lock(of the same kmem_cache_node->lock) and deadlocks. The following solutions were considered and some implemented, but eventually discarded - kmem_cache_create for every map - add recursion check to slow-path of slub - use reserved memory in bpf_map_update for in_irq or in preempt_disabled - kmalloc via irq_work At the end pre-allocation of all map elements turned out to be the simplest solution and since the user is charged upfront for all the memory, such pre-allocation doesn't affect the user space visible behavior. Since it's impossible to tell whether kprobe is triggered in a safe location from kmalloc point of view, use pre-allocation by default and introduce new BPF_F_NO_PREALLOC flag. While testing of per-cpu hash maps it was discovered that alloc_percpu(GFP_ATOMIC) has odd corner cases and often fails to allocate memory even when 90% of it is free. The pre-allocation of per-cpu hash elements solves this problem as well. Turned out that bpf_map_update() quickly followed by bpf_map_lookup()+bpf_map_delete() is very common pattern used in many of iovisor/bcc/tools, so there is additional benefit of pre-allocation, since such use cases are must faster. Since all hash map elements are now pre-allocated we can remove atomic increment of htab->count and save few more cycles. Also add bpf_map_precharge_memlock() to check rlimit_memlock early to avoid large malloc/free done by users who don't have sufficient limits. Pre-allocation is done with vmalloc and alloc/free is done via percpu_freelist. Here are performance numbers for different pre-allocation algorithms that were implemented, but discarded in favor of percpu_freelist: 1 cpu: pcpu_ida 2.1M pcpu_ida nolock 2.3M bt 2.4M kmalloc 1.8M hlist+spinlock 2.3M pcpu_freelist 2.6M 4 cpu: pcpu_ida 1.5M pcpu_ida nolock 1.8M bt w/smp_align 1.7M bt no/smp_align 1.1M kmalloc 0.7M hlist+spinlock 0.2M pcpu_freelist 2.0M 8 cpu: pcpu_ida 0.7M bt w/smp_align 0.8M kmalloc 0.4M pcpu_freelist 1.5M 32 cpu: kmalloc 0.13M pcpu_freelist 0.49M pcpu_ida nolock is a modified percpu_ida algorithm without percpu_ida_cpu locks and without cross-cpu tag stealing. It's faster than existing percpu_ida, but not as fast as pcpu_freelist. bt is a variant of block/blk-mq-tag.c simlified and customized for bpf use case. bt w/smp_align is using cache line for every 'long' (similar to blk-mq-tag). bt no/smp_align allocates 'long' bitmasks continuously to save memory. It's comparable to percpu_ida and in some cases faster, but slower than percpu_freelist hlist+spinlock is the simplest free list with single spinlock. As expeceted it has very bad scaling in SMP. kmalloc is existing implementation which is still available via BPF_F_NO_PREALLOC flag. It's significantly slower in single cpu and in 8 cpu setup it's 3 times slower than pre-allocation with pcpu_freelist, but saves memory, so in cases where map->max_entries can be large and number of map update/delete per second is low, it may make sense to use it. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-08 15:28:31 -05:00
Alexei Starovoitov	e19494edab	bpf: introduce percpu_freelist Introduce simple percpu_freelist to keep single list of elements spread across per-cpu singly linked lists. /* push element into the list / void pcpu_freelist_push(struct pcpu_freelist , struct pcpu_freelist_node ); / pop element from the list / struct pcpu_freelist_node pcpu_freelist_pop(struct pcpu_freelist *); The object is pushed to the current cpu list. Pop first trying to get the object from the current cpu list, if it's empty goes to the neigbour cpu list. For bpf program usage pattern the collision rate is very low, since programs push and pop the objects typically on the same cpu. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-08 15:28:31 -05:00
Alexei Starovoitov	b121d1e74d	bpf: prevent kprobe+bpf deadlocks if kprobe is placed within update or delete hash map helpers that hold bucket spin lock and triggered bpf program is trying to grab the spinlock for the same bucket on the same cpu, it will deadlock. Fix it by extending existing recursion prevention mechanism. Note, map_lookup and other tracing helpers don't have this problem, since they don't hold any locks and don't modify global data. bpf_trace_printk has its own recursive check and ok as well. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-08 15:28:30 -05:00
David S. Miller	8aba8b8312	Merge branch 'ipv6-per-netns-gc' Michal Kubecek says: ==================== ipv6: per netns FIB6 walkers and garbage collector Commit `2ac3ac8f86` ("ipv6: prevent fib6_run_gc() contention") reduced the risk of contention on FIB6 garbage collector lock on systems with many CPUs. However, one of our customers can still observe heavy contention on fib6_gc_lock which can even trigger the soft lockup detector. This is caused by garbage collector running in forced mode from a timer. While there is one timer per network namespace, the instances of fib6_run_gc() running from them are protected by one global spinlock so that only one garbage collector can run at any moment and other namespaces have to wait. As most relevant data structures are separated per netns, there is little reason for garbage collectors blocking each other. Similar problem exists for walkers: changes in one tree do not need to adjust (and block) walkers traversing FIB trees in other namespaces. This series separates both the walkers infrastructure and garbage collector so that they work independently in network namespaces. v2: get rid of ifdef in ipv6_route_seq_setup_walk(), pass net from callers instead ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-08 15:16:51 -05:00
Michal Kubeček	3dc94f93be	ipv6: per netns FIB garbage collection One of our customers observed issues with FIB6 garbage collectors running in different network namespaces blocking each other, resulting in soft lockups (fib6_run_gc() initiated from timer runs always in forced mode). Now that FIB6 walkers are separated per namespace, there is no more need for instances of fib6_run_gc() in different namespaces blocking each other. There is still a call to icmp6_dst_gc() which operates on shared data but this function is protected by its own shared lock. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-08 15:16:51 -05:00
Michal Kubeček	9a03cd8f38	ipv6: per netns fib6 walkers The IPv6 FIB data structures are separated per network namespace but there is still only one global walkers list and one global walker list lock. This means changes in one namespace unnecessarily interfere with walkers in other namespaces. Replace the global list with per-netns lists (and give each its own lock). Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-08 15:16:51 -05:00
Michal Kubeček	3570df914f	ipv6: replace global gc_args with local variable Global variable gc_args is only used in fib6_run_gc() and functions called from it. As fib6_run_gc() makes sure there is at most one instance of fib6_clean_all() running at any moment, we can replace gc_args with a local variable which will be needed once multiple instances (per netns) of garbage collector are allowed. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-08 15:16:50 -05:00
David S. Miller	02daec7c22	Merge branch 'bnxt_en-next' Michael Chan says: ==================== bnxt_en: Updates for net-next. Updates to support autoneg for all supported speeds, add PF port statistics, and Advanced Error Reporting. v2: Fixed patch 3 to not use parentheses on function return. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-08 14:51:44 -05:00
Satish Baddipadige	6316ea6db9	bnxt_en: Enable AER support. Add pci_error_handler callbacks to support for pcie advanced error recovery. Signed-off-by: Satish Baddipadige <sbaddipa@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-08 14:51:44 -05:00
Michael Chan	8ddc9aaa72	bnxt_en: Include hardware port statistics in ethtool -S. Include the more useful port statistics in ethtool -S for the PF device. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-08 14:51:44 -05:00
Michael Chan	9947f83fb7	bnxt_en: Include some hardware port statistics in ndo_get_stats64(). Include some of the port error counters (e.g. crc) in ->ndo_get_stats64() for the PF device. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-08 14:51:43 -05:00
Michael Chan	3bdf56c47d	bnxt_en: Add port statistics support. Gather periodic port statistics if the device is PF and link is up. This is triggered in bnxt_timer() every one second to request firmware to DMA the counters. Signed-off-by: Michael Chan <michael.chan@broadocm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-08 14:51:43 -05:00
Michael Chan	f1a082a6f7	bnxt_en: Extend autoneg to all speeds. Allow all autoneg speeds aupported by firmware to be advertised. If the advertising parameter is 0, then all supported speeds will be advertised. Remove BNXT_ALL_COPPER_ETHTOOL_SPEED which is no longer used as all supported speeds can be advertised. Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-08 14:51:43 -05:00
Michael Chan	4b32cacca2	bnxt_en: Use common function to get ethtool supported flags. The supported bits and advertising bits in ethtool have the same definitions. The same is true for the firmware bits. So use the common function to handle the conversion for both supported and advertising bits. v2: Don't use parentheses on function return. Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-08 14:51:43 -05:00
Michael Chan	3277360eb2	bnxt_en: Add reporting of link partner advertisement. And report actual pause settings to ETHTOOL_GPAUSEPARAM to let ethtool resolve the actual pause settings. Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-08 14:51:42 -05:00
Michael Chan	27c4d57860	bnxt_en: Refactor bnxt_fw_to_ethtool_advertised_spds(). Include the conversion of pause bits and add one extra call layer so that the same refactored function can be reused to get the link partner advertisement bits. Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-08 14:51:42 -05:00
Kyeong Yoo	f8b33d8e87	net_sched: dsmark: use qdisc_dequeue_peeked() This fix is for dsmark similar to commit `3557619f0f` ("net_sched: prio: use qdisc_dequeue_peeked") and makes use of qdisc_dequeue_peeked() instead of direct dequeue() call. First time, wrr peeks dsmark, which will then peek into sfq. sfq dequeues an skb and it's stored in sch->gso_skb. Next time, wrr tries to dequeue from dsmark, which will call sfq dequeue directly. This results skipping the previously peeked skb. So changed dsmark dequeue to call qdisc_dequeue_peeked() instead to use peeked skb if exists. Signed-off-by: Kyeong Yoo <kyeong.yoo@alliedtelesis.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-08 14:35:13 -05:00
David S. Miller	4c38cd61ae	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter updates for your net-next tree, they are: 1) Remove useless debug message when deleting IPVS service, from Yannick Brosseau. 2) Get rid of compilation warning when CONFIG_PROC_FS is unset in several spots of the IPVS code, from Arnd Bergmann. 3) Add prandom_u32 support to nft_meta, from Florian Westphal. 4) Remove unused variable in xt_osf, from Sudip Mukherjee. 5) Don't calculate IP checksum twice from netfilter ipv4 defrag hook since fixing af_packet defragmentation issues, from Joe Stringer. 6) On-demand hook registration for iptables from netns. Instead of registering the hooks for every available netns whenever we need one of the support tables, we register this on the specific netns that needs it, patchset from Florian Westphal. 7) Add missing port range selection to nf_tables masquerading support. BTW, just for the record, there is a typo in the description of `5f6c253ebe` ("netfilter: bridge: register hooks only when bridge interface is added") that refers to the cluster match as deprecated, but it is actually the CLUSTERIP target (which registers hooks inconditionally) the one that is scheduled for removal. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-08 14:25:20 -05:00
David S. Miller	d24ad3fc0e	Merge branch 'bpf-next' Daniel Borkmann says: ==================== BPF updates Couple of misc updates to BPF, besides others this series adds bpf_csum_diff() to be used with L3 csums, allows for managing tunnel options for collect meta data mode, and enabling ipv6 traffic class for collect meta data in vxlan specifically (geneve already supports it). For more details, please see individual patches. The series requires net to be merged into net-next first to avoid any further pending merge conflicts. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-08 13:58:52 -05:00
Daniel Borkmann	1400615d64	vxlan: allow setting ipv6 traffic class We can already do that for IPv4, but IPv6 support was missing. Add it for vxlan, so it can be used with collect metadata frontends. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-08 13:58:47 -05:00
Daniel Borkmann	db3c6139e6	bpf, vxlan, geneve, gre: fix usage of dst_cache on xmit The assumptions from commit `0c1d70af92` ("net: use dst_cache for vxlan device"), `468dfffcd7` ("geneve: add dst caching support") and `3c1cb4d260` ("net/ipv4: add dst cache support for gre lwtunnels") on dst_cache usage when ip_tunnel_info is used is unfortunately not always valid as assumed. While it seems correct for ip_tunnel_info front-ends such as OVS, eBPF however can fill in ip_tunnel_info for consumers like vxlan, geneve or gre with different remote dsts, tos, etc, therefore they cannot be assumed as packet independent. Right now vxlan, geneve, gre would cache the dst for eBPF and every packet would reuse the same entry that was first created on the initial route lookup. eBPF doesn't store/cache the ip_tunnel_info, so each skb may have a different one. Fix it by adding a flag that checks the ip_tunnel_info. Also the !tos test in vxlan needs to be handeled differently in this context as it is currently inferred from ip_tunnel_info as well if present. ip_tunnel_dst_cache_usable() helper is added for the three tunnel cases, which checks if we can use dst cache. Fixes: `0c1d70af92` ("net: use dst_cache for vxlan device") Fixes: `468dfffcd7` ("geneve: add dst caching support") Fixes: `3c1cb4d260` ("net/ipv4: add dst cache support for gre lwtunnels") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-08 13:58:47 -05:00
Daniel Borkmann	14ca0751c9	bpf: support for access to tunnel options After eBPF being able to programmatically access/manage tunnel key meta data via commit `d3aa45ce6b` ("bpf: add helpers to access tunnel metadata") and more recently also for IPv6 through `c6c3345407` ("bpf: support ipv6 for bpf_skb_{set,get}_tunnel_key"), this work adds two complementary helpers to generically access their auxiliary tunnel options. Geneve and vxlan support this facility. For geneve, TLVs can be pushed, and for the vxlan case its GBP extension. I.e. setting tunnel key for geneve case only makes sense, if we can also read/write TLVs into it. In the GBP case, it provides the flexibility to easily map the group policy ID in combination with other helpers or maps. I chose to model this as two separate helpers, bpf_skb_{set,get}_tunnel_opt(), for a couple of reasons. bpf_skb_{set,get}_tunnel_key() is already rather complex by itself, and there may be cases for tunnel key backends where tunnel options are not always needed. If we would have integrated this into bpf_skb_{set,get}_tunnel_key() nevertheless, we are very limited with remaining helper arguments, so keeping compatibility on structs in case of passing in a flat buffer gets more cumbersome. Separating both also allows for more flexibility and future extensibility, f.e. options could be fed directly from a map, etc. Moreover, change geneve's xmit path to test only for info->options_len instead of TUNNEL_GENEVE_OPT flag. This makes it more consistent with vxlan's xmit path and allows for avoiding to specify a protocol flag in the API on xmit, so it can be protocol agnostic. Having info->options_len is enough information that is needed. Tested with vxlan and geneve. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-08 13:58:46 -05:00
Daniel Borkmann	2208087061	bpf: allow to propagate df in bpf_skb_set_tunnel_key Added by `9a628224a6` ("ip_tunnel: Add dont fragment flag."), allow to feed df flag into tunneling facilities (currently supported on TX by vxlan, geneve and gre) as a hint from eBPF's bpf_skb_set_tunnel_key() helper. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-08 13:58:46 -05:00
Daniel Borkmann	577c50aade	bpf: make helper function protos static They are only used here, so there's no reason they should not be static. Only the vlan push/pop protos are used in the test_bpf suite. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-08 13:55:15 -05:00
Daniel Borkmann	8afd54c87a	bpf: add flags to bpf_skb_store_bytes for clearing hash When overwriting parts of the packet with bpf_skb_store_bytes() that were fed previously into skb->hash calculation, we should clear the current hash with skb_clear_hash(), so that a next skb_get_hash() call can determine the correct hash related to this skb. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-08 13:55:15 -05:00
Daniel Borkmann	8050c0f027	bpf: allow bpf_csum_diff to feed bpf_l3_csum_replace as well Commit `7d672345ed` ("bpf: add generic bpf_csum_diff helper") added a generic checksum diff helper that can feed bpf_l4_csum_replace() with a target __wsum diff that is to be applied to the L4 checksum. This facility is very flexible, can be cascaded, allows for adding, removing, or diffing data, or for calculating the pseudo header checksum from scratch, but it can also be reused for working with the IPv4 header checksum. Thus, analogous to bpf_l4_csum_replace(), add a case for header field value of 0 to change the checksum at a given offset through a new helper csum_replace_by_diff(). Also, in addition to that, this provides an easy to use interface for feeding precalculated diffs f.e. coming from a map. It nicely complements bpf_l3_csum_replace() that currently allows only for csum updates of 2 and 4 byte diffs. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-08 13:55:15 -05:00
David S. Miller	810813c47a	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Several cases of overlapping changes, as well as one instance (vxlan) of a bug fix in 'net' overlapping with code movement in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-08 12:34:12 -05:00
Linus Torvalds	e2857b8f11	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking fixes from David Miller: 1) Fix ordering of WEXT netlink messages so we don't see a newlink after a dellink, from Johannes Berg. 2) Out of bounds access in minstrel_ht_set_best_prob_rage, from Konstantin Khlebnikov. 3) Paging buffer memory leak in iwlwifi, from Matti Gottlieb. 4) Wrong units used to set initial TCP rto from cached metrics, also from Konstantin Khlebnikov. 5) Fix stale IP options data in the SKB control block from leaking through layers of encapsulation, from Bernie Harris. 6) Zero padding len miscalculated in bnxt_en, from Michael Chan. 7) Only CHECKSUM_PARTIAL packets should be passed down through GSO, fix from Hannes Frederic Sowa. 8) Fix suspend/resume with JME networking devices, from Diego Violat and Guo-Fu Tseng. 9) Checksums not validated properly in bridge multicast support due to the placement of the SKB header pointers at the time of the check, fix from Álvaro Fernández Rojas. 10) Fix hang/tiemout with r8169 if a stats fetch is done while the device is runtime suspended. From Chun-Hao Lin. 11) The forwarding database netlink dump facilities don't track the state of the dump properly, resulting in skipped/missed entries. From Minoura Makoto. 12) Fix regression from a recent 3c59x bug fix, from Neil Horman. 13) Fix list corruption in bna driver, from Ivan Vecera. 14) Big endian machines crash on vlan add in bnx2x, fix from Michal Schmidt. 15) Ethtool RSS configuration not propagated properly in mlx5 driver, from Tariq Toukan. 16) Fix regression in PHY probing in stmmac driver, from Gabriel Fernandez. 17) Fix SKB tailroom calculation in igmp/mld code, from Benjamin Poirier. 18) A past change to skip empty routing headers in ipv6 extention header parsing accidently caused fragment headers to not be matched any longer. Fix from Florian Westphal. 19) eTSEC-106 erratum needs to be applied to more gianfar chips, from Atsushi Nemoto. 20) Fix netdev reference after free via workqueues in usb networking drivers, from Oliver Neukum and Bjørn Mork. 21) mdio->irq is now an array rather than a pointer to dynamic memory, but several drivers were still trying to free it :-/ Fixes from Colin Ian King. 22) act_ipt iptables action forgets to set the family field, thus LOG netfilter targets don't work with it. Fix from Phil Sutter. 23) SKB leak in ibmveth when skb_linearize() fails, from Thomas Falcon. 24) pskb_may_pull() cannot be called with interrupts disabled, fix code that tries to do this in vmxnet3 driver, from Neil Horman. 25) be2net driver leaks iomap'd memory on removal, fix from Douglas Miller. 26) Forgotton RTNL mutex unlock in ppp_create_interface() error paths, from Guillaume Nault. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (97 commits) ppp: release rtnl mutex when interface creation fails cdc_ncm: do not call usbnet_link_change from cdc_ncm_bind tcp: fix tcpi_segs_in after connection establishment net: hns: fix the bug about loopback jme: Fix device PM wakeup API usage jme: Do not enable NIC WoL functions on S0 udp6: fix UDP/IPv6 encap resubmit path be2net: Don't leak iomapped memory on removal. vmxnet3: avoid calling pskb_may_pull with interrupts disabled net: ethernet: Add missing MFD_SYSCON dependency on HAS_IOMEM ibmveth: check return of skb_linearize in ibmveth_start_xmit cdc_ncm: toggle altsetting to force reset before setup usbnet: cleanup after bind() in probe() mlxsw: pci: Correctly determine if descriptor queue is full mlxsw: spectrum: Always decrement bridge's ref count tipc: fix nullptr crash during subscription cancel net: eth: altera: do not free array priv->mdio->irq net/ethoc: do not free array priv->mdio->irq net: sched: fix act_ipt for LOG target asix: do not free array priv->mdio->irq ...	2016-03-07 15:41:10 -08:00
Linus Torvalds	01ffa3df22	Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs fixes from Miklos Szeredi: "Overlayfs bug fixes. All marked as -stable material" * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: copy new uid/gid into overlayfs runtime inode ovl: ignore lower entries when checking purity of non-directory entries ovl: fix getcwd() failure after unsuccessful rmdir ovl: fix working on distributed fs as lower layer	2016-03-07 15:23:25 -08:00
Linus Torvalds	256faedcfd	Revert "drm/radeon: call hpd_irq_event on resume" This reverts commit `dbb17a21c1`. It turns out that commit can cause problems for systems with multiple GPUs, and causes X to hang on at least a HP Pavilion dv7 with hybrid graphics. This got noticed originally in 4.4.4, where this patch had already gotten back-ported, but 4.5-rc7 was verified to have the same problem. Alexander Deucher says: "It looks like you have a muxed system so I suspect what's happening is that one of the display is being reported as connected for both the IGP and the dGPU and then the desktop environment gets confused or there some sort problem in the detect functions since the mux is not switched to the dGPU. I don't see an easy fix unless Dave has any ideas. I'd say just revert for now" Reported-by: Jörg-Volker Peetz <jvpeetz@web.de> Acked-by: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Dave Airlie <airlied@gmail.com> Cc: stable@kernel.org # wherever `dbb17a21c1` got back-ported Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-03-07 13:15:09 -08:00
Guillaume Nault	6faac63a69	ppp: release rtnl mutex when interface creation fails Add missing rtnl_unlock() in the error path of ppp_create_interface(). Fixes: `58a89ecaca` ("ppp: fix lockdep splat in ppp_dev_uninit()") Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-07 16:11:31 -05:00
Bjørn Mork	4d06dd537f	cdc_ncm: do not call usbnet_link_change from cdc_ncm_bind usbnet_link_change will call schedule_work and should be avoided if bind is failing. Otherwise we will end up with scheduled work referring to a netdev which has gone away. Instead of making the call conditional, we can just defer it to usbnet_probe, using the driver_info flag made for this purpose. Fixes: `8a34b0ae87` ("usbnet: cdc_ncm: apply usbnet_link_change") Reported-by: Andrey Konovalov <andreyknvl@gmail.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Bjørn Mork <bjorn@mork.no> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-07 15:54:33 -05:00
Eric Dumazet	a9d99ce28e	tcp: fix tcpi_segs_in after connection establishment If final packet (ACK) of 3WHS is lost, it appears we do not properly account the following incoming segment into tcpi_segs_in While we are at it, starts segs_in with one, to count the SYN packet. We do not yet count number of SYN we received for a request sock, we might add this someday. packetdrill script showing proper behavior after fix : // Tests tcpi_segs_in when 3rd packet (ACK) of 3WHS is lost 0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 +0 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop> +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK> +.020 < P. 1:1001(1000) ack 1 win 32792 +0 accept(3, ..., ...) = 4 +.000 %{ assert tcpi_segs_in == 2, 'tcpi_segs_in=%d' % tcpi_segs_in }% Fixes: `2efd055c53` ("tcp: add tcpi_segs_in and tcpi_segs_out to tcp_info") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-07 15:47:13 -05:00
yankejian	68c222a6bc	net: hns: fix the bug about loopback It will always be passed if the soc is tested the loopback cases. This patch will fix this bug. Signed-off-by: Kejian Yan <yankejian@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-07 15:44:51 -05:00
Guo-Fu Tseng	81422e672f	jme: Fix device PM wakeup API usage According to Documentation/power/devices.txt The driver should not use device_set_wakeup_enable() which is the policy for user to decide. Using device_init_wakeup() to initialize dev->power.should_wakeup and dev->power.can_wakeup on driver initialization. And use device_may_wakeup() on suspend to decide if WoL function should be enabled on NIC. Reported-by: Diego Viola <diego.viola@gmail.com> Signed-off-by: Guo-Fu Tseng <cooldavid@cooldavid.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-07 15:39:45 -05:00
Guo-Fu Tseng	0772a99b81	jme: Do not enable NIC WoL functions on S0 Otherwise it might be back on resume right after going to suspend in some hardware. Reported-by: Diego Viola <diego.viola@gmail.com> Signed-off-by: Guo-Fu Tseng <cooldavid@cooldavid.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-07 15:39:45 -05:00
Haiyang Zhang	d66ab51442	hv_netvsc: Move subchannel waiting to rndis_filter_device_remove() During hot add, vmbus_device_register() is called from vmbus_onoffer(), on the same workqueue as the subchannel offer message work-queue, so subchannel offer won't be processed until the vmbus_device_register()/... /netvsc_probe() is done. Also, vmbus_device_register() is called with channel_mutex locked, which prevents subchannel processing too. So the "waiting for sub-channel processing" will not success in hot add case. But, in usual module loading, the netvsc_probe() is called from different code path, and doesn't fail. This patch resolves the deadlock during NIC hot-add, and speeds up NIC loading time. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: K. Y. Srinivasan <kys@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-07 15:37:50 -05:00
Bill Sommerfeld	59dca1d8a6	udp6: fix UDP/IPv6 encap resubmit path IPv4 interprets a negative return value from a protocol handler as a request to redispatch to a new protocol. In contrast, IPv6 interprets a negative value as an error, and interprets a positive value as a request for redispatch. UDP for IPv6 was unaware of this difference. Change __udp6_lib_rcv() to return a positive value for redispatch. Note that the socket's encap_rcv hook still needs to return a negative value to request dispatch, and in the case of IPv6 packets, adjust IP6CB(skb)->nhoff to identify the byte containing the next protocol. Signed-off-by: Bill Sommerfeld <wsommerfeld@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-07 15:23:12 -05:00
Douglas Miller	a69bf3c5b4	be2net: Don't leak iomapped memory on removal. The adapter->pcicfg resource is either mapped via pci_iomap() or derived from adapter->db. During be_remove() this resource was ignored and so could remain mapped after remove. Add a flag to track whether adapter->pcicfg was mapped or not, then use that flag in be_unmap_pci_bars() to unmap if required. Fixes: `25848c901` ("use PCI MMIO read instead of config read for errors") Signed-off-by: Douglas Miller <dougmill@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-07 15:18:31 -05:00
Neil Horman	cec05562fb	vmxnet3: avoid calling pskb_may_pull with interrupts disabled vmxnet3 has a function vmxnet3_parse_and_copy_hdr which, among other operations, uses pskb_may_pull to linearize the header portion of an skb. That operation eventually uses local_bh_disable/enable to ensure that it doesn't race with the drivers bottom half handler. Unfortunately, vmxnet3 preforms this parse_and_copy operation with a spinlock held and interrupts disabled. This causes us to run afoul of the WARN_ON_ONCE(irqs_disabled()) warning in local_bh_enable, resulting in this: WARNING: at kernel/softirq.c:159 local_bh_enable+0x59/0x90() (Not tainted) Hardware name: VMware Virtual Platform Modules linked in: ipv6 ppdev parport_pc parport microcode e1000 vmware_balloon vmxnet3 i2c_piix4 sg ext4 jbd2 mbcache sd_mod crc_t10dif sr_mod cdrom mptspi mptscsih mptbase scsi_transport_spi pata_acpi ata_generic ata_piix vmwgfx ttm drm_kms_helper drm i2c_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: mperf] Pid: 6229, comm: sshd Not tainted 2.6.32-616.el6.i686 #1 Call Trace: [<c04624d9>] ? warn_slowpath_common+0x89/0xe0 [<c0469e99>] ? local_bh_enable+0x59/0x90 [<c046254b>] ? warn_slowpath_null+0x1b/0x20 [<c0469e99>] ? local_bh_enable+0x59/0x90 [<c07bb936>] ? skb_copy_bits+0x126/0x210 [<f8d1d9fe>] ? ext4_ext_find_extent+0x24e/0x2d0 [ext4] [<c07bc49e>] ? __pskb_pull_tail+0x6e/0x2b0 [<f95a6164>] ? vmxnet3_xmit_frame+0xba4/0xef0 [vmxnet3] [<c05d15a6>] ? selinux_ip_postroute+0x56/0x320 [<c0615988>] ? cfq_add_rq_rb+0x98/0x110 [<c0852df8>] ? packet_rcv+0x48/0x350 [<c07c5839>] ? dev_queue_xmit_nit+0xc9/0x140 ... Fix it by splitting vmxnet3_parse_and_copy_hdr into two functions: vmxnet3_parse_hdr, which sets up the internal/on stack ctx datastructure, and pulls the skb (both of which can be done without holding the spinlock with irqs disabled and vmxnet3_copy_header, which just copies the skb to the tx ring under the lock safely. tested and shown to correct the described problem. Applies cleanly to the head of the net tree Signed-off-by: Neil Horman <nhorman@tuxdriver.com> CC: Shrikrishna Khare <skhare@vmware.com> CC: "VMware, Inc." <pv-drivers@vmware.com> CC: "David S. Miller" <davem@davemloft.net> Acked-by: Shrikrishna Khare <skhare@vmware.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-07 15:15:24 -05:00
David S. Miller	ab825adbaa	Merge branch 'qed_hw_gro' Manish Chopra says: ==================== qed/qede: Add hardware GRO support This patch series enables hardware GRO and add support for handling HW aggregated TCP packets in driver receive flow by skipping software GRO handling in stack. Please consider applying this series to net-next. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-07 15:01:40 -05:00
Manish Chopra	55482edc25	qede: Add slowpath/fastpath support and enable hardware GRO This patch configures hardware to use GRO and adds support for fastpath APIs to handle HW aggregated packets. Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: Manish Chopra <manish.chopra@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-07 15:01:40 -05:00

1 2 3 4 5 ...

576510 Commits