linux

Author	SHA1	Message	Date
Eliezer Tamir	2d48d67fa8	net: poll/select low latency socket support select/poll busy-poll support. Split sysctl value into two separate ones, one for read and one for poll. updated Documentation/sysctl/net.txt Add a new poll flag POLL_LL. When this flag is set, sock_poll will call sk_poll_ll if possible. sock_poll sets this flag in its return value to indicate to select/poll when a socket that can busy poll is found. When poll/select have nothing to report, call the low-level sock_poll again until we are out of time or we find something. Once the system call finds something, it stops setting POLL_LL, so it can return the result to the user ASAP. Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-06-25 16:35:52 -07:00
Eric Dumazet	bd8a7036c0	gre: fix a possible skb leak commit `68c3316311` ("v4 GRE: Add TCP segmentation offload for GRE") added a possible skb leak, because it frees only the head of segment list, in case a skb_linearize() call fails. This patch adds a kfree_skb_list() helper to fix the bug. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Pravin B Shelar <pshelar@nicira.com> Cc: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-06-25 16:07:44 -07:00
Mike Rapoport	f693dff710	rtnetlink: allow using zero MAC address in rtnl_fdb_{add,del} This is required for multiple default destinations management in VXLAN Signed-off-by: Mike Rapoport <mike.rapoport@ravellosystems.com> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>	2013-06-25 09:31:39 -07:00
Wedson Almeida Filho	aeb193ea6c	net: Unmap fragment page once iterator is done Callers of skb_seq_read() are currently forced to call skb_abort_seq_read() even when consuming all the data because the last call to skb_seq_read (the one that returns 0 to indicate the end) fails to unmap the last fragment page. With this patch callers will be allowed to traverse the SKB data by calling skb_prepare_seq_read() once and repeatedly calling skb_seq_read() as originally intended (and documented in the original commit `677e90eda`), that is, only call skb_abort_seq_read() if the sequential read is actually aborted. Signed-off-by: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-06-24 01:46:01 -07:00
Eric Dumazet	60877a32bc	net: allow large number of tx queues netif_alloc_netdev_queues() uses kcalloc() to allocate memory for the "struct netdev_queue *_tx" array. For large number of tx queues, kcalloc() might fail, so this patch does a fallback to vzalloc(). As vmalloc() adds overhead on a critical network path, add __GFP_REPEAT to kzalloc() flags to do this fallback only when really needed. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-06-23 23:56:55 -07:00
Gao feng	dc25c676f5	neigh: disallow un-init_net to change thresh of neigh thresh and interval are global resources, only init net can change them. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-06-19 21:13:24 -07:00
Gao feng	170d6f9954	neigh: only allow init_net to change the default neigh_parms Though we don't export the /proc/sys/net/ipv[4,6]/neigh/default/ directory to the un-init_net, but we can still use cmd such as "ip ntable change name arp_cache locktime 129" to change the locktime of default neigh_parms. This patch disallows the un-init_net to find out the neigh_table.parms. So the un-init_net will failed to influence the init_net. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-06-19 21:13:24 -07:00
Gao feng	cf89d6b280	neigh: no need to call lookup_neigh_parms in neigh_parms_alloc neigh_table.parms always exist and is initialized,kmemdup can use it to create new neigh_parms, actually lookup_neigh_parms here will return neigh_table.parms too. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-06-19 21:13:24 -07:00
David S. Miller	d98cae64e4	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/wireless/ath/ath9k/Kconfig drivers/net/xen-netback/netback.c net/batman-adv/bat_iv_ogm.c net/wireless/nl80211.c The ath9k Kconfig conflict was a change of a Kconfig option name right next to the deletion of another option. The xen-netback conflict was overlapping changes involving the handling of the notify list in xen_netbk_rx_action(). Batman conflict resolution provided by Antonio Quartulli, basically keep everything in both conflict hunks. The nl80211 conflict is a little more involved. In 'net' we added a dynamic memory allocation to nl80211_dump_wiphy() to fix a race that Linus reported. Meanwhile in 'net-next' the handlers were converted to use pre and post doit handlers which use a flag to determine whether to hold the RTNL mutex around the operation. However, the dump handlers to not use this logic. Instead they have to explicitly do the locking. There were apparent bugs in the conversion of nl80211_dump_wiphy() in that we were not dropping the RTNL mutex in all the return paths, and it seems we very much should be doing so. So I fixed that whilst handling the overlapping changes. To simplify the initial returns, I take the RTNL mutex after we try to allocate 'tb'. Signed-off-by: David S. Miller <davem@davemloft.net>	2013-06-19 16:49:39 -07:00
Fernando Luis Vazquez Cao	788dfcacac	vlan: restore ethtool ABI to control VLAN hardware acceleration As part of the push to add 802.1ad server provider tagging support to the kernel the VLAN features flags were renamed. Unfortunately the kernel name for the VLAN hardware acceleration features that the kernel shows user space was included in the rename, which broke ethtool (txvlan and rxvlan options do not work). This patch restores the original names, i.e. the original ABI. If we wanted to make clear to users that we are refering to CTAGs we can always change ethtool's short_name and long_name for these features (for example something along the lines of txvlan -> txvlan-ctag, tx-vlan-offload -> tx-vlan-ctag-offload). Cc: Patrick McHardy <kaber@trash.net> Cc: David S. Miller <davem@davemloft.net> Cc: netdev@vger.kernel.org Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp> Reviewed-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-06-17 17:09:35 -07:00
Eliezer Tamir	dafcc4380d	net: add socket option for low latency polling adds a socket option for low latency polling. This allows overriding the global sysctl value with a per-socket one. Unexport sysctl_net_ll_poll since for now it's not needed in modules. Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-06-17 15:48:14 -07:00
Eliezer Tamir	eb6db62282	net: change sysctl_net_ll_poll into an unsigned int There is no reason for sysctl_net_ll_poll to be an unsigned long. Change it into an unsigned int. Fix the proc handler. Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-06-17 15:48:13 -07:00
Greg Kroah-Hartman	0e496b8e84	Merge 3.10-rc6 into char-misc-next We want the fixes in here.	2013-06-17 11:54:25 -07:00
Rony Efraim	1d8faf48c7	net/core: Add VF link state control Add netlink directives and ndo entry to allow for controling VF link, which can be in one of three states: Auto - VF link state reflects the PF link state (default) Up - VF link state is up, traffic from VF to VF works even if the actual PF link is down Down - VF link state is down, no traffic from/to this VF, can be of use while configuring the VF Signed-off-by: Rony Efraim <ronye@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-06-13 17:51:04 -07:00
Willem de Bruijn	5f121b9a83	net-rps: fixes for rps flow limit Caught by sparse: - __rcu: missing annotation to sd->flow_limit - __user: direct access in cpumask_scnprintf Also - add endline character when printing bitmap if room in buffer - avoid bucket overflow by reducing FLOW_LIMIT_HISTORY The last item warrants some explanation. The hashtable buckets are subject to overflow if FLOW_LIMIT_HISTORY is larger than or equal to bucket size, since all packets may end up in a single bucket. The current (rather arbitrary) history value of 256 happens to match the buffer size (u8). As a result, with a single flow, the first 128 packets are accepted (correct), the second 128 packets dropped (correct) and then the history[] array has filled, so that each subsequent new packet causes an increment in the bucket for new_flow plus a decrement for old_flow: a steady state. This is fine if packets are dropped, as the steady state goes away as soon as a mix of traffic reappears. But, because the 256th packet overflowed the bucket to 0: no packets are dropped. Instead of explicitly adding an overflow check, this patch changes FLOW_LIMIT_HISTORY to never be able to overflow a single bucket. Reported-by: Fengguang Wu <fengguang.wu@intel.com> (first item) Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-06-13 17:13:05 -07:00
Joe Perches	fe2c6338fd	net: Convert uses of typedef ctl_table to struct ctl_table Reduce the uses of this unnecessary typedef. Done via perl script: $ git grep --name-only -w ctl_table net \| \ xargs perl -p -i -e '\ sub trim { my ($local) = @_; $local =~ s/(^\s+\|\s+$)//g; return $local; } \ s/\b(?<!struct\s)ctl_table\b(\s\\s*\|\s+\w+)/"struct ctl_table " . trim($1)/ge' Reflow the modified lines that now exceed 80 columns. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-06-13 02:36:09 -07:00
Flavio Leitner	194f4a6df2	net: make all team port device link events urgent Since team functionality relies heavily on userspace daemon, we need to deliver event to userspace via Netlink as quick as possible. So make all team port device link events urgent. Signed-off-by: Flavio Leitner <fbl@redhat.com> Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-06-13 02:31:41 -07:00
Daniel Borkmann	7a6e288d27	pktgen: ipv6: numa: consolidate skb allocation to pktgen_alloc_skb We currently allow for numa-node aware skb allocation only within the fill_packet_ipv4() path, but not in fill_packet_ipv6(). Consolidate that code to a common allocation helper to enable numa-node aware skb allocation for ipv6, and use it in both paths. This also makes both functions a bit more readable. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-06-12 00:47:25 -07:00
Eric Dumazet	45203a3b38	net_sched: add 64bit rate estimators struct gnet_stats_rate_est contains u32 fields, so the bytes per second field can wrap at 34360Mbit. Add a new gnet_stats_rate_est64 structure to get 64bit bps/pps fields, and switch the kernel to use this structure natively. This structure is dumped to user space as a new attribute : TCA_STATS_RATE_EST64 Old tc command will now display the capped bps (to 34360Mbit), instead of wrapped values, and updated tc command will display correct information. Old tc command output, after patch : eric:~# tc -s -d qd sh dev lo qdisc pfifo 8001: root refcnt 2 limit 1000p Sent 80868245400 bytes 1978837 pkt (dropped 0, overlimits 0 requeues 0) rate 34360Mbit 189696pps backlog 0b 0p requeues 0 This patch carefully reorganizes "struct Qdisc" layout to get optimal performance on SMP. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-06-11 02:51:03 -07:00
Peter Pan(潘卫平)	b41abb42bf	net: pass correct parameter to skb_headers_offset_update() Since commit 1a37e412a022(net: Use 16bits for _headers fields of struct skbuff), skb->_header are relative to skb->head, so copy_skb_header() should not call skb_headers_offset_update() now, and we should pass correct parameter to skb_headers_offset_update() in pskb_expand_head() and skb_copy_expand(). Signed-off-by: Weiping Pan <panweiping3@gmail.com> Reviewed-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-06-11 02:48:06 -07:00
Nicolas Dichtel	ed13998c31	sock_diag: fix filter code sent to userspace Filters need to be translated to real BPF code for userland, like SO_GETFILTER. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-06-10 22:23:32 -07:00
Eliezer Tamir	a5b50476f7	udp: add low latency socket poll support Add upport for busy-polling on UDP sockets. In __udp[46]_lib_rcv add a call to sk_mark_ll() to copy the napi_id from the skb into the sk. This is done at the earliest possible moment, right after we identify which socket this skb is for. In __skb_recv_datagram When there is no data and the user tries to read we busy poll. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Tested-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-06-10 21:22:36 -07:00
Eliezer Tamir	0602129286	net: add low latency socket poll Adds an ndo_ll_poll method and the code that supports it. This method can be used by low latency applications to busy-poll Ethernet device queues directly from the socket code. sysctl_net_ll_poll controls how many microseconds to poll. Default is zero (disabled). Individual protocol support will be added by subsequent patches. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Tested-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-06-10 21:22:35 -07:00
Eliezer Tamir	af12fa6e46	net: add napi_id and hash Adds a napi_id and a hashing mechanism to lookup a napi by id. This will be used by subsequent patches to implement low latency Ethernet device polling. Based on a code sample by Eric Dumazet. Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-06-10 21:22:35 -07:00
Greg Kroah-Hartman	38a4671cad	Merge 3.10-rc5 into char-misc-next	2013-06-08 22:34:53 -07:00
David S. Miller	6bc19fb82d	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Merge 'net' bug fixes into 'net-next' as we have patches that will build on top of them. This merge commit includes a change from Emil Goode (emilgoode@gmail.com) that fixes a warning that would have been introduced by this merge. Specifically it fixes the pingv6_ops method ipv6_chk_addr() to add a "const" to the "struct net_device *dev" argument and likewise update the dummy_ipv6_chk_addr() declaration. Signed-off-by: David S. Miller <davem@davemloft.net>	2013-06-05 16:37:30 -07:00
Andy Shevchenko	4cd5773a2a	net: core: move mac_pton() to lib/net_utils.c Since we have at least one user of this function outside of CONFIG_NET scope, we have to provide this function independently. The proposed solution is to move it under lib/net_utils.c with corresponding configuration variable and select wherever it is needed. Signed-off-by: Andy Shevchenko <andy.shevchenko@gmail.com> Reported-by: Arnd Bergmann <arnd@arndb.de> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-06-05 12:00:27 -07:00
Amerigo Wang	00f97da17a	netpoll: fix position of network header Similar to the problem in pktgen, netpoll uses skb_tail_offset() too, as the code is copied from pktgen. Also use return values of skb_put() directly, this will simiplify the code. Reported-by: Thomas Graf <tgraf@suug.ch> Cc: Thomas Graf <tgraf@suug.ch> Cc: Daniel Borkmann <dborkmann@redhat.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-06-04 17:37:04 -07:00
Thomas Graf	525cebedb3	pktgen: Fix position of ip and udp header skb_set_network_header() expects an offset based on the data pointer whereas skb_tail_offset() also includes the headroom. This resulted in the ip header being written in a wrong location. Use return values of skb_put() directly and rely on skb->len to set mac, network, and transport header. Cc: Simon Horman <horms@verge.net.au> Cc: Daniel Borkmann <dborkmann@redhat.com> Assisted-by: Daniel Borkmann <dborkmann@redhat.com> Signed-off-by: Thomas Graf <tgraf@suug.ch> Acked-by: Daniel Borkmann <dborkmann@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-06-04 17:35:45 -07:00
Pablo Neira	5e71d9d77c	net: fix sk_buff head without data area Eric Dumazet spotted that we have to check skb->head instead of skb->data as skb->head points to the beginning of the data area of the skbuff. Similarly, we have to initialize the skb->head pointer, not skb->data in __alloc_skb_head. After this fix, netlink crashes in the release path of the sk_buff, so let's fix that as well. This bug was introduced in (`0ebd0ac` net: add function to allocate sk_buff head without data area). Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-06-04 17:26:49 -07:00
Yan Burman	600fed5e97	net/ethtool: Fix comment regarding location of dev_ethtool() call Signed-off-by: Yan Burman <yanb@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-06-04 17:10:03 -07:00
Baruch Siach	430f03cde2	net: mark netdev_create_hash __net_init netdev_create_hash() is only called from netdev_init() which is marked __net_init. Signed-off-by: Baruch Siach <baruch@tkos.co.il> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-06-04 16:47:27 -07:00
Cong Wang	35d0461061	net: clean up skb headers code commit `1a37e412a0` (net: Use 16bits for _headers fields of struct skbuff) converts skb->_header to u16, some #if NET_SKBUFF_DATA_USES_OFFSET are now useless, and to be safe, we could just use "X = (typeof(X)) ~0U;" as suggested by David. Cc: David S. Miller <davem@davemloft.net> Cc: Simon Horman <horms@verge.net.au> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-05-31 17:02:47 -07:00
Jay Vosburgh	b190a50875	net/core: dev_mc_sync_multiple calls wrong helper The dev_mc_sync_multiple function is currently calling __hw_addr_sync, and not __hw_addr_sync_multiple. This will result in addresses only being synced to the first device from the set. Corrected by calling the _multiple variant. Signed-off-by: Jay Vosburgh <fubar@us.ibm.com> Reviewed-by: Vlad Yasevich <vyasevic@redhat.com> Tested-by: Shawn Bohrer <sbohrer@rgmadvisors.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-05-31 16:56:56 -07:00
Jay Vosburgh	29ca2f8fcc	net/core: __hw_addr_sync_one / _multiple broken Currently, __hw_addr_sync_one is called in a loop by __hw_addr_sync_multiple to sync each of a "from" device's hw addresses to a "to" device. __hw_addr_sync_one calls __hw_addr_add_ex to attempt to add each address. __hw_addr_add_ex is called with global=false, and sync=true. __hw_addr_add_ex checks to see if the new address matches an address already on the list. If so, it tests global and sync. In this case, sync=true, and it then checks if the address is already synced, and if so, returns 0. This 0 return causes __hw_addr_sync_one to increment the sync_cnt and refcount for the "from" list's address entry, even though the address is already synced and has a reference and sync_cnt. This will cause the sync_cnt and refcount to increment without bound every time an addresses is added to the "from" device and synced to the "to" device. The fix here has two parts: First, when __hw_addr_add_ex finds the address already exists and is synced, return -EEXIST instead of 0. Second, __hw_addr_sync_one checks the error return for -EEXIST, and if so, it (a) does not add a refcount/sync_cnt, and (b) returns 0 itself so that __hw_addr_sync_multiple will not return an error. Signed-off-by: Jay Vosburgh <fubar@us.ibm.com> Reviewed-by: Vlad Yasevich <vyasevic@redhat.com> Tested-by: Shawn Bohrer <sbohrer@rgmadvisors.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-05-31 16:56:56 -07:00
Jay Vosburgh	60ba834c2f	net/core: __hw_addr_unsync_one "from" address not marked synced When an address is added to a subordinate interface (the "to" list), the address entry in the "from" list is not marked "synced" as the entry added to the "to" list is. When performing the unsync operation (e.g., dev_mc_unsync), __hw_addr_unsync_one calls __hw_addr_del_entry with the "synced" parameter set to true for the case when the address reference is being released from the "from" list. This causes a test inside to fail, with the result being that the reference count on the "from" address is not properly decremeted and the address on the "from" list will never be freed. Correct this by having __hw_addr_unsync_one call the __hw_addr_del_entry function with the "sync" flag set to false for the "remove from the from list" case. Signed-off-by: Jay Vosburgh <fubar@us.ibm.com> Reviewed-by: Vlad Yasevich <vyasevic@redhat.com> Tested-by: Shawn Bohrer <sbohrer@rgmadvisors.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-05-31 16:56:56 -07:00
Jay Vosburgh	9747ba6636	net/core: __hw_addr_create_ex does not initialize sync_cnt The sync_cnt field is not being initialized, which can result in arbitrary values in the field. Fixed by initializing it to zero. Signed-off-by: Jay Vosburgh <fubar@us.ibm.com> Reviewed-by: Vlad Yasevich <vyasevic@redhat.com> Tested-by: Shawn Bohrer <sbohrer@rgmadvisors.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-05-31 16:56:56 -07:00
Federico Vaga	456db6a4d4	net/core/sock.c: add missing VSOCK string in af_family_*_key_strings The three arrays of strings: af_family_key_strings, af_family_slock_key_strings and af_family_clock_key_strings have not VSOCK's string Signed-off-by: Federico Vaga <federico.vaga@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-05-28 23:58:49 -07:00
Simon Horman	7cc4619005	net, ipv4, ipv6: Correct assignment of skb->network_header to skb->tail This corrects an regression introduced by "net: Use 16bits for *_headers fields of struct skbuff" when NET_SKBUFF_DATA_USES_OFFSET is not set. In that case skb->tail will be a pointer however skb->network_header is now an offset. This patch corrects the problem by adding a wrapper to return skb tail as an offset regardless of the value of NET_SKBUFF_DATA_USES_OFFSET. It seems that skb->tail that this offset may be more than 64k and some care has been taken to treat such cases as an error. Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-05-28 23:49:07 -07:00
Simon Horman	ced14f6804	net: Correct comparisons and calculations using skb->tail and skb-transport_header This corrects an regression introduced by "net: Use 16bits for *_headers fields of struct skbuff" when NET_SKBUFF_DATA_USES_OFFSET is not set. In that case skb->tail will be a pointer whereas skb->transport_header will be an offset from head. This is corrected by using wrappers that ensure that comparisons and calculations are always made using pointers. Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-05-28 23:49:07 -07:00
Cong Wang	75538c2b85	net: always pass struct netdev_notifier_info to netdevice notifiers commit `351638e7de` (net: pass info struct via netdevice notifier) breaks booting of my KVM guest, this is due to we still forget to pass struct netdev_notifier_info in several places. This patch completes it. Cc: Jiri Pirko <jiri@resnulli.us> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-05-28 21:58:54 -07:00
David S. Miller	06ecf24bdf	net: Fix build warnings after mac_header and transport_header became __u16. net/core/skbuff.c: In function ‘__alloc_skb_head’: net/core/skbuff.c:203:2: warning: large integer implicitly truncated to unsigned type [-Woverflow] net/core/skbuff.c: In function ‘__alloc_skb’: net/core/skbuff.c:279:2: warning: large integer implicitly truncated to unsigned type [-Woverflow] net/core/skbuff.c:280:2: warning: large integer implicitly truncated to unsigned type [-Woverflow] net/core/skbuff.c: In function ‘build_skb’: net/core/skbuff.c:348:2: warning: large integer implicitly truncated to unsigned type [-Woverflow] net/core/skbuff.c:349:2: warning: large integer implicitly truncated to unsigned type [-Woverflow] Signed-off-by: David S. Miller <davem@davemloft.net>	2013-05-28 13:15:50 -07:00
Jiri Pirko	be9efd3653	net: pass changed flags along with NETDEV_CHANGE event Use new netdevice notifier infrastructure to pass along changed flags. Signed-off-by: Timo Teräs <timo.teras@iki.fi> Signed-off-by: Jiri Pirko <jiri@resnulli.us> v2->v3: shortened notifier_info struct name Signed-off-by: David S. Miller <davem@davemloft.net>	2013-05-28 13:11:01 -07:00
Jiri Pirko	351638e7de	net: pass info struct via netdevice notifier So far, only net_device * could be passed along with netdevice notifier event. This patch provides a possibility to pass custom structure able to provide info that event listener needs to know. Signed-off-by: Jiri Pirko <jiri@resnulli.us> v2->v3: fix typo on simeth shortened dev_getter shortened notifier_info struct name v1->v2: fix notifier_call parameter in call_netdevice_notifier() Signed-off-by: David S. Miller <davem@davemloft.net>	2013-05-28 13:11:01 -07:00
dingtianhong	da6e378ba9	netpoll: remove return value from netpoll_rx_disable() The netpoll_rx_disable() will always return 0, it is no use and looks wordy, so remove the unnecessary code and get rid of it in _dev_open and _dev_close. Signed-off-by: Ding Tianhong <dingtianhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-05-27 23:18:50 -07:00
Simon Horman	0d89d2035f	MPLS: Add limited GSO support In the case where a non-MPLS packet is received and an MPLS stack is added it may well be the case that the original skb is GSO but the NIC used for transmit does not support GSO of MPLS packets. The aim of this code is to provide GSO in software for MPLS packets whose skbs are GSO. SKB Usage: When an implementation adds an MPLS stack to a non-MPLS packet it should do the following to skb metadata: * Set skb->inner_protocol to the old non-MPLS ethertype of the packet. skb->inner_protocol is added by this patch. * Set skb->protocol to the new MPLS ethertype of the packet. * Set skb->network_header to correspond to the end of the L3 header, including the MPLS label stack. I have posted a patch, "[PATCH v3.29] datapath: Add basic MPLS support to kernel" which adds MPLS support to the kernel datapath of Open vSwtich. That patch sets the above requirements in datapath/actions.c:push_mpls() and was used to exercise this code. The datapath patch is against the Open vSwtich tree but it is intended that it be added to the Open vSwtich code present in the mainline Linux kernel at some point. Features: I believe that the approach that I have taken is at least partially consistent with the handling of other protocols. Jesse, I understand that you have some ideas here. I am more than happy to change my implementation. This patch adds dev->mpls_features which may be used by devices to advertise features supported for MPLS packets. A new NETIF_F_MPLS_GSO feature is added for devices which support hardware MPLS GSO offload. Currently no devices support this and MPLS GSO always falls back to software. Alternate Implementation: One possible alternate implementation is to teach netif_skb_features() and skb_network_protocol() about MPLS, in a similar way to their understanding of VLANs. I believe this would avoid the need for net/mpls/mpls_gso.c and in particular the calls to __skb_push() and __skb_push() in mpls_gso_segment(). I have decided on the implementation in this patch as it should not introduce any overhead in the case where mpls_gso is not compiled into the kernel or inserted as a module. MPLS GSO suggested by Jesse Gross. Based in part on "v4 GRE: Add TCP segmentation offload for GRE" by Pravin B Shelar. Cc: Jesse Gross <jesse@nicira.com> Cc: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-05-27 22:50:59 -07:00
Jiri Pirko	42e52bf9e3	net: add netnotifier event for upper device change Now when upper device is changed, event is not propagated via RT Netlink to userspace. Userspace might never now about the change. Fix this by adding upper-device-change notifier event. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-05-25 23:12:19 -07:00
David S. Miller	e6ff4c75f9	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Merge net into net-next because some upcoming net-next changes build on top of bug fixes that went into net. Signed-off-by: David S. Miller <davem@davemloft.net>	2013-05-24 16:48:28 -07:00
Simon Horman	1cdbcb7957	net: Loosen constraints for recalculating checksum in skb_segment() This is a generic solution to resolve a specific problem that I have observed. If the encapsulation of an skb changes then ability to offload checksums may also change. In particular it may be necessary to perform checksumming in software. An example of such a case is where a non-GRE packet is received but is to be encapsulated and transmitted as GRE. Another example relates to my proposed support for for packets that are non-MPLS when received but MPLS when transmitted. The cost of this change is that the value of the csum variable may be checked when it previously was not. In the case where the csum variable is true this is pure overhead. In the case where the csum variable is false it leads to software checksumming, which I believe also leads to correct checksums in transmitted packets for the cases described above. Further analysis: This patch relies on the return value of can_checksum_protocol() being correct and in turn the return value of skb_network_protocol(), used to provide the protocol parameter of can_checksum_protocol(), being correct. It also relies on the features passed to skb_segment() and in turn to can_checksum_protocol() being correct. I believe that this problem has not been observed for VLANs because it appears that almost all drivers, the exception being xgbe, set vlan_features such that that the checksum offload support for VLAN packets is greater than or equal to that of non-VLAN packets. I wonder if the code in xgbe may be an oversight and the hardware does support checksumming of VLAN packets. If so it may be worth updating the vlan_features of the driver as this patch will force such checksums to be performed in software rather than hardware. Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-05-22 15:06:13 -07:00
Willem de Bruijn	99bbc70741	rps: selective flow shedding during softnet overflow A cpu executing the network receive path sheds packets when its input queue grows to netdev_max_backlog. A single high rate flow (such as a spoofed source DoS) can exceed a single cpu processing rate and will degrade throughput of other flows hashed onto the same cpu. This patch adds a more fine grained hashtable. If the netdev backlog is above a threshold, IRQ cpus track the ratio of total traffic of each flow (using 4096 buckets, configurable). The ratio is measured by counting the number of packets per flow over the last 256 packets from the source cpu. Any flow that occupies a large fraction of this (set at 50%) will see packet drop while above the threshold. Tested: Setup is a muli-threaded UDP echo server with network rx IRQ on cpu0, kernel receive (RPS) on cpu0 and application threads on cpus 2--7 each handling 20k req/s. Throughput halves when hit with a 400 kpps antagonist storm. With this patch applied, antagonist overload is dropped and the server processes its complete load. The patch is effective when kernel receive processing is the bottleneck. The above RPS scenario is a extreme, but the same is reached with RFS and sufficient kernel processing (iptables, packet socket tap, ..). Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-05-20 13:48:04 -07:00
Rusty Russell	d2f83e9078	Hoist memcpy_fromiovec/memcpy_toiovec into lib/ ERROR: "memcpy_fromiovec" [drivers/vhost/vhost_scsi.ko] undefined! That function is only present with CONFIG_NET. Turns out that crypto/algif_skcipher.c also uses that outside net, but it actually needs sockets anyway. In addition, commit `6d4f0139d6` added CONFIG_NET dependency to CONFIG_VMCI for memcpy_toiovec, so hoist that function and revert that commit too. socket.h already includes uio.h, so no callers need updating; trying only broke things fo x86_64 randconfig (thanks Fengguang!). Reported-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2013-05-20 10:24:22 +09:30
Nicolas Dichtel	57b354e66b	dev: remove duplicate 'skb->dev = dev' in dev_forward_skb() This was added by commit `59b9997bab` (Revert "net: maintain namespace isolation between vlan and real device"). In fact, before the initial commit - the one that is reverted -, this statement was not present. 'skb->dev = dev' is already done in eth_type_trans(), which is call just after. Spotted-by: Alain Ritoux <alain.ritoux@6wind.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-05-17 18:23:08 -07:00
Eric Dumazet	f77d602124	ipv6: do not clear pinet6 field We have seen multiple NULL dereferences in __inet6_lookup_established() After analysis, I found that inet6_sk() could be NULL while the check for sk_family == AF_INET6 was true. Bug was added in linux-2.6.29 when RCU lookups were introduced in UDP and TCP stacks. Once an IPv6 socket, using SLAB_DESTROY_BY_RCU is inserted in a hash table, we no longer can clear pinet6 field. This patch extends logic used in commit `fcbdf09d96` ("net: fix nulls list corruptions in sk_prot_alloc") TCP/UDP/UDPLite IPv6 protocols provide their own .clear_sk() method to make sure we do not clear pinet6 field. At socket clone phase, we do not really care, as cloning the parent (non NULL) pinet6 is not adding a fatal race. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-05-11 16:26:38 -07:00
Pravin B Shelar	19acc32725	gso: Handle Trans-Ether-Bridging protocol in skb_network_protocol() Rather than having logic to calculate inner protocol in every tunnel gso handler move it to gso code. This simplifies code. Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Cong Wang <amwang@redhat.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-05-08 13:13:30 -07:00
Dan Carpenter	a3dbbc2bab	netpoll: inverted down_trylock() test The return value is reversed from mutex_trylock(). Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-05-06 11:06:52 -04:00
Al Viro	243198d09f	rps_dev_flow_table_release(): no need to delay vfree() The same story as with fib_trie patch - vfree() from RCU callbacks is legitimate now. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-05-06 11:06:51 -04:00
Bjørn Mork	b29d314518	net: vlan,ethtool: netdev_features_t is more than 32 bit Signed-off-by: Bjørn Mork <bjorn@mork.no> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-05-02 13:58:12 -04:00
Patrick McHardy	6708c9e5cc	net: use netdev_features_t in skb_needs_linearize() Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-05-02 13:58:12 -04:00
Linus Torvalds	20b4fb4852	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull VFS updates from Al Viro, Misc cleanups all over the place, mainly wrt /proc interfaces (switch create_proc_entry to proc_create(), get rid of the deprecated create_proc_read_entry() in favor of using proc_create_data() and seq_file etc). 7kloc removed. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits) don't bother with deferred freeing of fdtables proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h proc: Make the PROC_I() and PDE() macros internal to procfs proc: Supply a function to remove a proc entry by PDE take cgroup_open() and cpuset_open() to fs/proc/base.c ppc: Clean up scanlog ppc: Clean up rtas_flash driver somewhat hostap: proc: Use remove_proc_subtree() drm: proc: Use remove_proc_subtree() drm: proc: Use minor->index to label things, not PDE->name drm: Constify drm_proc_list[] zoran: Don't print proc_dir_entry data in debug reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show() proc: Supply an accessor for getting the data from a PDE's parent airo: Use remove_proc_subtree() rtl8192u: Don't need to save device proc dir PDE rtl8187se: Use a dir under /proc/net/r8180/ proc: Add proc_mkdir_data() proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h} proc: Move PDE_NET() to fs/proc/proc_net.c ...	2013-05-01 17:51:54 -07:00
David Howells	a8ca16ea7b	proc: Supply a function to remove a proc entry by PDE Supply a function (proc_remove()) to remove a proc entry (and any subtree rooted there) by proc_dir_entry pointer rather than by name and (optionally) root dir entry pointer. This allows us to eliminate all remaining pde->name accesses outside of procfs. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Grant Likely <grant.likely@linaro.or> cc: linux-acpi@vger.kernel.org cc: openipmi-developer@lists.sourceforge.net cc: devicetree-discuss@lists.ozlabs.org cc: linux-pci@vger.kernel.org cc: netdev@vger.kernel.org cc: netfilter-devel@vger.kernel.org cc: alsa-devel@alsa-project.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-05-01 17:29:46 -04:00
David Howells	0bb80f2405	proc: Split the namespace stuff out into linux/proc_ns.h Split the proc namespace stuff out into linux/proc_ns.h. Signed-off-by: David Howells <dhowells@redhat.com> cc: netdev@vger.kernel.org cc: Serge E. Hallyn <serge.hallyn@ubuntu.com> cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-05-01 17:29:39 -04:00
Linus Torvalds	73287a43cc	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next Pull networking updates from David Miller: "Highlights (1721 non-merge commits, this has to be a record of some sort): 1) Add 'random' mode to team driver, from Jiri Pirko and Eric Dumazet. 2) Make it so that any driver that supports configuration of multiple MAC addresses can provide the forwarding database add and del calls by providing a default implementation and hooking that up if the driver doesn't have an explicit set of handlers. From Vlad Yasevich. 3) Support GSO segmentation over tunnels and other encapsulating devices such as VXLAN, from Pravin B Shelar. 4) Support L2 GRE tunnels in the flow dissector, from Michael Dalton. 5) Implement Tail Loss Probe (TLP) detection in TCP, from Nandita Dukkipati. 6) In the PHY layer, allow supporting wake-on-lan in situations where the PHY registers have to be written for it to be configured. Use it to support wake-on-lan in mv643xx_eth. From Michael Stapelberg. 7) Significantly improve firewire IPV6 support, from YOSHIFUJI Hideaki. 8) Allow multiple packets to be sent in a single transmission using network coding in batman-adv, from Martin Hundebøll. 9) Add support for T5 cxgb4 chips, from Santosh Rastapur. 10) Generalize the VXLAN forwarding tables so that there is more flexibility in configurating various aspects of the endpoints. From David Stevens. 11) Support RSS and TSO in hardware over GRE tunnels in bxn2x driver, from Dmitry Kravkov. 12) Zero copy support in nfnelink_queue, from Eric Dumazet and Pablo Neira Ayuso. 13) Start adding networking selftests. 14) In situations of overload on the same AF_PACKET fanout socket, or per-cpu packet receive queue, minimize drop by distributing the load to other cpus/fanouts. From Willem de Bruijn and Eric Dumazet. 15) Add support for new payload offset BPF instruction, from Daniel Borkmann. 16) Convert several drivers over to mdoule_platform_driver(), from Sachin Kamat. 17) Provide a minimal BPF JIT image disassembler userspace tool, from Daniel Borkmann. 18) Rewrite F-RTO implementation in TCP to match the final specification of it in RFC4138 and RFC5682. From Yuchung Cheng. 19) Provide netlink socket diag of netlink sockets ("Yo dawg, I hear you like netlink, so I implemented netlink dumping of netlink sockets.") From Andrey Vagin. 20) Remove ugly passing of rtnetlink attributes into rtnl_doit functions, from Thomas Graf. 21) Allow userspace to be able to see if a configuration change occurs in the middle of an address or device list dump, from Nicolas Dichtel. 22) Support RFC3168 ECN protection for ipv6 fragments, from Hannes Frederic Sowa. 23) Increase accuracy of packet length used by packet scheduler, from Jason Wang. 24) Beginning set of changes to make ipv4/ipv6 fragment handling more scalable and less susceptible to overload and locking contention, from Jesper Dangaard Brouer. 25) Get rid of using non-type-safe NLMSG_* macros and use nlmsg_() instead. From Hong Zhiguo. 26) Optimize route usage in IPVS by avoiding reference counting where possible, from Julian Anastasov. 27) Convert IPVS schedulers to RCU, also from Julian Anastasov. 28) Support cpu fanouts in xt_NFQUEUE netfilter target, from Holger Eitzenberger. 29) Network namespace support for nf_log, ebt_log, xt_LOG, ipt_ULOG, nfnetlink_log, and nfnetlink_queue. From Gao feng. 30) Implement RFC3168 ECN protection, from Hannes Frederic Sowa. 31) Support several new r8169 chips, from Hayes Wang. 32) Support tokenized interface identifiers in ipv6, from Daniel Borkmann. 33) Use usbnet_link_change() helper in USB net driver, from Ming Lei. 34) Add 802.1ad vlan offload support, from Patrick McHardy. 35) Support mmap() based netlink communication, also from Patrick McHardy. 36) Support HW timestamping in mlx4 driver, from Amir Vadai. 37) Rationalize AF_PACKET packet timestamping when transmitting, from Willem de Bruijn and Daniel Borkmann. 38) Bring parity to what's provided by /proc/net/packet socket dumping and the info provided by netlink socket dumping of AF_PACKET sockets. From Nicolas Dichtel. 39) Fix peeking beyond zero sized SKBs in AF_UNIX, from Benjamin Poirier" git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1722 commits) filter: fix va_list build error af_unix: fix a fatal race with bit fields bnx2x: Prevent memory leak when cnic is absent bnx2x: correct reading of speed capabilities net: sctp: attribute printl with __printf for gcc fmt checks netlink: kconfig: move mmap i/o into netlink kconfig netpoll: convert mutex into a semaphore netlink: Fix skb ref counting. net_sched: act_ipt forward compat with xtables mlx4_en: fix a build error on 32bit arches Revert "bnx2x: allow nvram test to run when device is down" bridge: avoid OOPS if root port not found drivers: net: cpsw: fix kernel warn on cpsw irq enable sh_eth: use random MAC address if no valid one supplied 3c509.c: call SET_NETDEV_DEV for all device types (ISA/ISAPnP/EISA) tg3: fix to append hardware time stamping flags unix/stream: fix peeking with an offset larger than data in queue unix/dgram: fix peeking with an offset larger than data in queue unix/dgram: peek beyond 0-sized skbs openvswitch: Remove unneeded ovs_netdev_get_ifindex() ...	2013-05-01 14:08:52 -07:00
Neil Horman	bd7c4b604a	netpoll: convert mutex into a semaphore Bart Van Assche recently reported a warning to me: <IRQ> [<ffffffff8103d79f>] warn_slowpath_common+0x7f/0xc0 [<ffffffff8103d7fa>] warn_slowpath_null+0x1a/0x20 [<ffffffff814761dd>] mutex_trylock+0x16d/0x180 [<ffffffff813968c9>] netpoll_poll_dev+0x49/0xc30 [<ffffffff8136a2d2>] ? __alloc_skb+0x82/0x2a0 [<ffffffff81397715>] netpoll_send_skb_on_dev+0x265/0x410 [<ffffffff81397c5a>] netpoll_send_udp+0x28a/0x3a0 [<ffffffffa0541843>] ? write_msg+0x53/0x110 [netconsole] [<ffffffffa05418bf>] write_msg+0xcf/0x110 [netconsole] [<ffffffff8103eba1>] call_console_drivers.constprop.17+0xa1/0x1c0 [<ffffffff8103fb76>] console_unlock+0x2d6/0x450 [<ffffffff8104011e>] vprintk_emit+0x1ee/0x510 [<ffffffff8146f9f6>] printk+0x4d/0x4f [<ffffffffa0004f1d>] scsi_print_command+0x7d/0xe0 [scsi_mod] This resulted from my commit `ca99ca14c` which introduced a mutex_trylock operation in a path that could execute in interrupt context. When mutex debugging is enabled, the above warns the user when we are in fact exectuting in interrupt context interrupt context. After some discussion, It seems that a semaphore is the proper mechanism to use here. While mutexes are defined to be unusable in interrupt context, no such condition exists for semaphores (save for the fact that the non blocking api calls, like up and down_trylock must be used when in irq context). Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Reported-by: Bart Van Assche <bvanassche@acm.org> CC: Bart Van Assche <bvanassche@acm.org> CC: David Miller <davem@davemloft.net> CC: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>	2013-05-01 15:00:24 -04:00
David S. Miller	58717686cf	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c drivers/net/ethernet/emulex/benet/be.h include/net/tcp.h net/mac802154/mac802154.h Most conflicts were minor overlapping stuff. The be2net driver brought in some fixes that added __vlan_put_tag calls, which in net-next take an additional argument. Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-30 03:55:20 -04:00
Benjamin Poirier	39cc86130b	unix/dgram: fix peeking with an offset larger than data in queue Currently, peeking on a unix datagram socket with an offset larger than len of the data in the sk receive queue returns immediately with bogus data. That's because *off is not reset between each skb_queue_walk(). This patch fixes this so that the behavior is the same as peeking with no offset on an empty queue: the caller blocks. Signed-off-by: Benjamin Poirier <bpoirier@suse.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-30 00:43:54 -04:00
Benjamin Poirier	add05ad4e9	unix/dgram: peek beyond 0-sized skbs "77c1090 net: fix infinite loop in __skb_recv_datagram()" (v3.8) introduced a regression: After that commit, recv can no longer peek beyond a 0-sized skb in the queue. __skb_recv_datagram() instead stops at the first skb with len == 0 and results in the system call failing with -EFAULT via skb_copy_datagram_iovec(). When peeking at an offset with 0-sized skb(s), each one of those is received only once, in sequence. The offset starts moving forward again after receiving datagrams with len > 0. Signed-off-by: Benjamin Poirier <bpoirier@suse.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-30 00:43:54 -04:00
Sridhar Samudrala	0c772159d1	net: Use consume_skb() to free gso segmented skb Use consume_skb() to free the original skb that is successfully transmitted as gso segmented skbs so that it is not treated as a drop due to an error. Signed-off-by: Sridhar Samudrala <sri@us.ibm.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-30 00:18:13 -04:00
Akinobu Mita	70e3ba72ba	net/core: remove duplicate statements by do-while loop Remove duplicate statements by using do-while loop instead of while loop. - A; - while (e) { + do { A; - } + } while (e); Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-29 18:28:43 -07:00
Akinobu Mita	33d7c5e539	net/core: rename random32() to prandom_u32() Use preferable function name which implies using a pseudo-random number generator. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-29 18:28:43 -07:00
Eric Dumazet	aebda156a5	net: defer net_secret[] initialization Instead of feeding net_secret[] at boot time, defer the init at the point first socket is created. This permits some platforms to use better entropy sources than the ones available at boot time. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-29 15:14:02 -04:00
Nicolas Dichtel	e8d9612c18	sock_diag: allow to dump bpf filters This patch allows to dump BPF filters attached to a socket with SO_ATTACH_FILTER. Note that we check CAP_SYS_ADMIN before allowing to dump this info. For now, only AF_PACKET sockets use this feature. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-29 13:21:30 -04:00
Vlad Yasevich	37fe066098	net: fix address check in rtnl_fdb_del Commit `6681712d67` vxlan: generalize forwarding tables relaxed the address checks in rtnl_fdb_del() to use is_zero_ether_addr(). This allows users to add multicast addresses using the fdb API. However, the check in rtnl_fdb_del() still uses a more strict is_valid_ether_addr() which rejects multicast addresses. Thus it is possible to add an fdb that can not be later removed. Relax the check in rtnl_fdb_del() as well. Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-25 04:14:08 -04:00
Eric Dumazet	e12472dc57	net: remove redundant code in dev_hard_start_xmit() This reverts commit `068a2de57d` (net: release dst entry while cache-hot for GSO case too) Before GSO packet segmentation, we already take care of skb->dst if it can be released. There is no point adding extra test for every segment in the gso loop. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Krishna Kumar <krkumar2@in.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-25 01:38:25 -04:00
Willem de Bruijn	2e31396fa1	packet: tx timestamping on tpacket ring When transmit timestamping is enabled at the socket level, record a timestamp on packets written to a PACKET_TX_RING. Tx timestamps are always looped to the application over the socket error queue. Software timestamps are also written back into the packet frame header in the packet ring. Reported-by: Paul Chavent <paul.chavent@onera.fr> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-25 01:22:22 -04:00
David S. Miller	6e0895c2ea	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/ethernet/emulex/benet/be_main.c drivers/net/ethernet/intel/igb/igb_main.c drivers/net/wireless/brcm80211/brcmsmac/mac80211_if.c include/net/scm.h net/batman-adv/routing.c net/ipv4/tcp_input.c The e{uid,gid} --> {uid,gid} credentials fix conflicted with the cleanup in net-next to now pass cred structs around. The be2net driver had a bug fix in 'net' that overlapped with the VLAN interface changes by Patrick McHardy in net-next. An IGB conflict existed because in 'net' the build_skb() support was reverted, and in 'net-next' there was a comment style fix within that code. Several batman-adv conflicts were resolved by making sure that all calls to batadv_is_my_mac() are changed to have a new bat_priv first argument. Eric Dumazet's TS ECR fix in TCP in 'net' conflicted with the F-RTO rewrite in 'net-next', mostly overlapping changes. Thanks to Stephen Rothwell and Antonio Quartulli for help with several of these merge resolutions. Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-22 20:32:51 -04:00
dingtianhong	53759be997	net: Remove return value from list_netdevice() The return value from list_netdevice() is not used and no need, so remove it. Signed-off-by: Ding Tianhong <dingtianhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-22 16:09:24 -04:00
Ben Greear	c846ad9b88	net: rate-limit warn-bad-offload splats. If one does do something unfortunate and allow a bad offload bug into the kernel, this the skb_warn_bad_offload can effectively live-lock the system, filling the logs with the same error over and over. Add rate limitation to this so that box remains otherwise functional in this case. Signed-off-by: Ben Greear <greearb@candelatech.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-19 17:57:49 -04:00
David S. Miller	2d6577f17b	net: Add missing netdev feature strings for NETIF_F_HW_VLAN_STAG_* Noticed by Ben Hutchings. Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-19 16:16:50 -04:00
Patrick McHardy	0ebd0ac5ff	net: add function to allocate sk_buff head without data area Add a function to allocate a sk_buff head without any data. This will be used by memory mapped netlink to attach data from the mmaped area to the skb. Additionally change skb_release_all() to check whether the skb has a data area to allow the skb destructor to clear the data pointer in case only a head has been allocated. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-19 14:57:57 -04:00
Patrick McHardy	8ad227ff89	net: vlan: add 802.1ad support Add support for 802.1ad VLAN devices. This mainly consists of checking for ETH_P_8021AD in addition to ETH_P_8021Q in a couple of places and check offloading capabilities based on the used protocol. Configuration is done using "ip link": # ip link add link eth0 eth0.1000 \ type vlan proto 802.1ad id 1000 # ip link add link eth0.1000 eth0.1000.1000 \ type vlan proto 802.1q id 1000 52:54:00:12:34:56 > 92:b1:54:28:e4:8c, ethertype 802.1Q (0x8100), length 106: vlan 1000, p 0, ethertype 802.1Q, vlan 1000, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto ICMP (1), length 84) 20.1.0.2 > 20.1.0.1: ICMP echo request, id 3003, seq 8, length 64 92:b1:54:28:e4:8c > 52:54:00:12:34:56, ethertype 802.1Q-QinQ (0x88a8), length 106: vlan 1000, p 0, ethertype 802.1Q, vlan 1000, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 47944, offset 0, flags [none], proto ICMP (1), length 84) 20.1.0.1 > 20.1.0.2: ICMP echo reply, id 3003, seq 8, length 64 Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-19 14:46:06 -04:00
Patrick McHardy	86a9bad3ab	net: vlan: add protocol argument to packet tagging functions Add a protocol argument to the VLAN packet tagging functions. In case of HW tagging, we need that protocol available in the ndo_start_xmit functions, so it is stored in a new field in the skb. The new field fits into a hole (on 64 bit) and doesn't increase the sks's size. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-19 14:46:06 -04:00
Patrick McHardy	f646968f8f	net: vlan: rename NETIF_F_HW_VLAN_* feature flags to NETIF_F_HW_VLAN_CTAG_* Rename the hardware VLAN acceleration features to include "CTAG" to indicate that they only support CTAGs. Follow up patches will introduce 802.1ad server provider tagging (STAGs) and require the distinction for hardware not supporting acclerating both. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-19 14:45:26 -04:00
Joe Perches	d5d427cdae	neighbour: Convert NEIGH_PRINTK to neigh_dbg Update debugging messages to a more current style. Emit these debugging messages at KERN_DEBUG instead of KERN_DEFAULT. Add and use neigh_dbg(level, fmt, ...) macro Add dynamic_debug capability via pr_debug Convert embedded function names to "%s: ", __func__ Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-16 16:34:08 -04:00
Vlad Yasevich	4cd729b042	net: add dev_uc_sync_multiple() and dev_mc_sync_multiple() api The current implementation of dev_uc_sync/unsync() assumes that there is a strict 1-to-1 relationship between the source and destination of the sync. In other words, once an address has been synced to a destination device, it will not be synced to any other device through the sync API. However, there are some virtual devices that aggreate a number of lower devices and need to sync addresses to all of them. The current API falls short there. This patch introduces a new dev_uc_sync_multiple() api that can be called in the above circumstances and allows sync to work for every invocation. CC: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-15 16:10:47 -04:00
David S. Miller	6c6779856a	Revert "netprio_cgroup: make local table static" This reverts commit `763eff57de`. It causes build regressions, as per Stephen Rothwell: ==================== After merging the final tree, today's linux-next build (powerpc allyesconfig) failed like this: net/core/netprio_cgroup.c:250:29: error: static declaration of 'net_prio_subsys' follows non-static declaration include/linux/cgroup_subsys.h:71:1: note: previous declaration of 'net_prio_subsys' was here ==================== Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-12 03:06:44 -04:00
stephen hemminger	763eff57de	netprio_cgroup: make local table static Minor sparse warning Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-10 23:23:43 -04:00
Al Viro	d9dda78bad	procfs: new helper - PDE_DATA(inode) The only part of proc_dir_entry the code outside of fs/proc really cares about is PDE(inode)->data. Provide a helper for that; static inline for now, eventually will be moved to fs/proc, along with the knowledge of struct proc_dir_entry layout. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-04-09 14:13:32 -04:00
Zefan Li	6ffd464102	netprio_cgroup: remove task_struct parameter from sock_update_netprio() The callers always pass current to sock_update_netprio(). Signed-off-by: Li Zefan <lizefan@huawei.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-09 13:19:37 -04:00
Zefan Li	211d2f97e9	cls_cgroup: remove task_struct parameter from sock_update_classid() The callers always pass current to sock_update_classid(). Signed-off-by: Li Zefan <lizefan@huawei.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-09 13:19:35 -04:00
Michael Riesch	88c5b5ce5c	rtnetlink: Call nlmsg_parse() with correct header length Signed-off-by: Michael Riesch <michael.riesch@omicron.at> Cc: "David S. Miller" <davem@davemloft.net> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jiri Benc <jbenc@redhat.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: linux-kernel@vger.kernel.org Acked-by: Mark Rustad <mark.d.rustad@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-08 17:12:26 -04:00
Eric W. Biederman	6b0ee8c036	scm: Stop passing struct cred Now that uids and gids are completely encapsulated in kuid_t and kgid_t we no longer need to pass struct cred which allowed us to test both the uid and the user namespace for equality. Passing struct cred potentially allows us to pass the entire group list as BSD does but I don't believe the cost of cache line misses justifies retaining code for a future potential application. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-07 18:58:55 -04:00
David S. Miller	d978a6361a	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/nfc/microread/mei.c net/netfilter/nfnetlink_queue_core.c Pull in 'net' to get Eric Biederman's AF_UNIX fix, upon which some cleanups are going to go on-top. Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-07 18:37:01 -04:00
David S. Miller	d16658206a	Merge branch 'master' of git://1984.lsi.us.es/nf-next Pablo Neira Ayuso says: ==================== The following patchset contains Netfilter and IPVS updates for your net-next tree, most relevantly they are: * Add net namespace support to NFLOG, ULOG and ebt_ulog and NFQUEUE. The LOG and ebt_log target has been also adapted, but they still depend on the syslog netnamespace that seems to be missing, from Gao Feng. * Don't lose indications of congestion in IPv6 fragmentation handling, from Hannes Frederic Sowa.i * IPVS conversion to use RCU, including some code consolidation patches and optimizations, also some from Julian Anastasov. * cpu fanout support for NFQUEUE, from Holger Eitzenberger. * Better error reporting to userspace when dropping packets from all our _*_[xfrm\|route]_me_harder functions, from Patrick McHardy. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-07 12:22:06 -04:00
Patrick McHardy	124dff01af	netfilter: don't reset nf_trace in nf_reset() Commit `130549fe` ("netfilter: reset nf_trace in nf_reset") added code to reset nf_trace in nf_reset(). This is wrong and unnecessary. nf_reset() is used in the following cases: - when passing packets up the the socket layer, at which point we want to release all netfilter references that might keep modules pinned while the packet is queued. nf_trace doesn't matter anymore at this point. - when encapsulating or decapsulating IPsec packets. We want to continue tracing these packets after IPsec processing. - when passing packets through virtual network devices. Only devices on that encapsulate in IPv4/v6 matter since otherwise nf_trace is not used anymore. Its not entirely clear whether those packets should be traced after that, however we've always done that. - when passing packets through virtual network devices that make the packet cross network namespace boundaries. This is the only cases where we clearly want to reset nf_trace and is also what the original patch intended to fix. Add a new function nf_reset_trace() and use it in dev_forward_skb() to fix this properly. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-05 15:38:10 -04:00
Vlad Yasevich	4543fbefe6	net: count hw_addr syncs so that unsync works properly. A few drivers use dev_uc_sync/unsync to synchronize the address lists from master down to slave/lower devices. In some cases (bond/team) a single address list is synched down to multiple devices. At the time of unsync, we have a leak in these lower devices, because "synced" is treated as a boolean and the address will not be unsynced for anything after the first device/call. Treat "synced" as a count (same as refcount) and allow all unsync calls to work. Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-05 00:18:46 -04:00
Jacob Keller	8facd5fb73	net: fix smatch warnings inside datagram_poll Commit `7d4c04fc17` ("net: add option to enable error queue packets waking select") has an issue due to operator precedence causing the bit-wise OR to bind to the sock_flags call instead of the result of the terniary conditional. This fixes the *_poll functions to work properly. The old code results in "mask \|= POLLPRI" instead of what was intended, which is to only include POLLPRI when the socket option is enabled. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-02 16:59:16 -04:00
Julian Anastasov	932bc4d7a5	net: add skb_dst_set_noref_force Rename skb_dst_set_noref to __skb_dst_set_noref and add force flag as suggested by David Miller. The new wrapper skb_dst_set_noref_force will force dst entries that are not cached to be attached as skb dst without taking reference as long as provided dst is reclaimed after RCU grace period. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off by: Hans Schillstrom <hans@schillstrom.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Simon Horman <horms@verge.net.au>	2013-04-02 00:22:53 +02:00
David S. Miller	a210576cf8	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: net/mac80211/sta_info.c net/wireless/core.h Two minor conflicts in wireless. Overlapping additions of extern declarations in net/wireless/core.h and a bug fix overlapping with the addition of a boolean parameter to __ieee80211_key_free(). Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-01 13:36:50 -04:00
Linus Torvalds	ff3421dee6	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking fixes from David Miller: 1) sadb_msg prepared for IPSEC userspace forgets to initialize the satype field, fix from Nicolas Dichtel. 2) Fix mac80211 synchronization during station removal, from Johannes Berg. 3) Fix IPSEC sequence number notifications when they wrap, from Steffen Klassert. 4) Fix cfg80211 wdev tracing crashes when add_virtual_intf() returns an error pointer, from Johannes Berg. 5) In mac80211, don't call into the channel context code with the interface list mutex held. From Johannes Berg. 6) In mac80211, if we don't actually associate, do not restart the STA timer, otherwise we can crash. From Ben Greear. 7) Missing dma_mapping_error() check in e1000, ixgb, and e1000e. From Christoph Paasch. 8) Fix sja1000 driver defines to not conflict with SH port, from Marc Kleine-Budde. 9) Don't call il4965_rs_use_green with a NULL station, from Colin Ian King. 10) Suspend/Resume in the FEC driver fail because the buffer descriptors are not initialized at all the moments in which they should. Fix from Frank Li. 11) cpsw and davinci_emac drivers both use the wrong interface to restart a stopped TX queue. Use netif_wake_queue not netif_start_queue, the latter is for initialization/bringup not active management of the queue. From Mugunthan V N. 12) Fix regression in rate calculations done by psched_ratecfg_precompute(), missing u64 type promotion. From Sergey Popovich. 13) Fix length overflow in tg3 VPD parsing, from Kees Cook. 14) AOE driver fails to allocate enough headroom, resulting in crashes. Fix from Eric Dumazet. 15) RX overflow happens too quickly in sky2 driver because pause packet thresholds are not programmed correctly. From Mirko Lindner. 16) Bonding driver manages arp_interval and miimon settings incorrectly, disabling one unintentionally disables both. Fix from Nikolay Aleksandrov. 17) smsc75xx drivers don't program the RX mac properly for jumbo frames. Fix from Steve Glendinning. 18) Fix off-by-one in Codel packet scheduler. From Vijay Subramanian. 19) Fix packet corruption in atl1c by disabling MSI support, from Hannes Frederic Sowa. 20) netdev_rx_handler_unregister() needs a synchronize_net() to fix crashes in bonding driver unload stress tests. From Eric Dumazet. 21) rxlen field of ks8851 RX packet descriptors not interpreted correctly (it is 12 bits not 16 bits, so needs to be masked after shifting the 32-bit value down 16 bits). Fix from Max Nekludov. 22) Fix missed RX/TX enable in sh_eth driver due to mishandling of link change indications. From Sergei Shtylyov. 23) Fix crashes during spurious ECI interrupts in sh_eth driver, also from Sergei Shtylyov. 24) dm9000 driver initialization is done wrong for revision B devices with DSP PHY, from Joseph CHANG. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (53 commits) DM9000B: driver initialization upgrade sh_eth: make 'link' field of 'struct sh_eth_private' int sh_eth: workaround for spurious ECI interrupt sh_eth: fix handling of no LINK signal ks8851: Fix interpretation of rxlen field. net: add a synchronize_net() in netdev_rx_handler_unregister() MAINTAINERS: Update netxen_nic maintainers list atl1e: drop pci-msi support because of packet corruption net: fq_codel: Fix off-by-one error net: calxedaxgmac: Wake-on-LAN fixes net: calxedaxgmac: fix rx ring handling when OOM net: core: Remove redundant call to 'nf_reset' in 'dev_forward_skb' smsc75xx: fix jumbo frame support net: fix the use of this_cpu_ptr bonding: fix disabling of arp_interval and miimon ipv6: don't accept node local multicast traffic from the wire sky2: Threshold for Pause Packet is set wrong sky2: Receive Overflows not counted aoe: reserve enough headroom on skbs line up comment for ndo_bridge_getlink ...	2013-04-01 08:06:30 -07:00
Keller, Jacob E	7d4c04fc17	net: add option to enable error queue packets waking select Currently, when a socket receives something on the error queue it only wakes up the socket on select if it is in the "read" list, that is the socket has something to read. It is useful also to wake the socket if it is in the error list, which would enable software to wait on error queue packets without waking up for regular data on the socket. The main use case is for receiving timestamped transmit packets which return the timestamp to the socket via the error queue. This enables an application to select on the socket for the error queue only instead of for the regular traffic. -v2- * Added the SO_SELECT_ERR_QUEUE socket option to every architechture specific file * Modified every socket poll function that checks error queue Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Cc: Jeffrey Kirsher <jeffrey.t.kirsher@intel.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Matthew Vick <matthew.vick@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-31 19:44:20 -04:00
Eric Dumazet	00cfec3748	net: add a synchronize_net() in netdev_rx_handler_unregister() commit `35d48903e9` (bonding: fix rx_handler locking) added a race in bonding driver, reported by Steven Rostedt who did a very good diagnosis : <quoting Steven> I'm currently debugging a crash in an old 3.0-rt kernel that one of our customers is seeing. The bug happens with a stress test that loads and unloads the bonding module in a loop (I don't know all the details as I'm not the one that is directly interacting with the customer). But the bug looks to be something that may still be present and possibly present in mainline too. It will just be much harder to trigger it in mainline. In -rt, interrupts are threads, and can schedule in and out just like any other thread. Note, mainline now supports interrupt threads so this may be easily reproducible in mainline as well. I don't have the ability to tell the customer to try mainline or other kernels, so my hands are somewhat tied to what I can do. But according to a core dump, I tracked down that the eth irq thread crashed in bond_handle_frame() here: slave = bond_slave_get_rcu(skb->dev); bond = slave->bond; <--- BUG the slave returned was NULL and accessing slave->bond caused a NULL pointer dereference. Looking at the code that unregisters the handler: void netdev_rx_handler_unregister(struct net_device *dev) { ASSERT_RTNL(); RCU_INIT_POINTER(dev->rx_handler, NULL); RCU_INIT_POINTER(dev->rx_handler_data, NULL); } Which is basically: dev->rx_handler = NULL; dev->rx_handler_data = NULL; And looking at __netif_receive_skb() we have: rx_handler = rcu_dereference(skb->dev->rx_handler); if (rx_handler) { if (pt_prev) { ret = deliver_skb(skb, pt_prev, orig_dev); pt_prev = NULL; } switch (rx_handler(&skb)) { My question to all of you is, what stops this interrupt from happening while the bonding module is unloading? What happens if the interrupt triggers and we have this: CPU0 CPU1 ---- ---- rx_handler = skb->dev->rx_handler netdev_rx_handler_unregister() { dev->rx_handler = NULL; dev->rx_handler_data = NULL; rx_handler() bond_handle_frame() { slave = skb->dev->rx_handler; bond = slave->bond; <-- NULL pointer dereference!!! What protection am I missing in the bond release handler that would prevent the above from happening? </quoting Steven> We can fix bug this in two ways. First is adding a test in bond_handle_frame() and others to check if rx_handler_data is NULL. A second way is adding a synchronize_net() in netdev_rx_handler_unregister() to make sure that a rcu protected reader has the guarantee to see a non NULL rx_handler_data. The second way is better as it avoids an extra test in fast path. Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jiri Pirko <jpirko@redhat.com> Cc: Paul E. McKenney <paulmck@us.ibm.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-29 15:38:15 -04:00
John Fastabend	91f3e7b174	net: rtnetlink: fdb dflt dump must set idx used for cb->arg[0] In rtnl_fdb_dump() when the fdb_dump ndo op is not populated we never set the idx value so that cb->arg[0] is always 0. Resulting in a endless loop of messages. Introduced with this commit, commit `090096bf3d` Author: Vlad Yasevich <vyasevic@redhat.com> Date: Wed Mar 6 15:39:42 2013 +0000 net: generic fdb support for drivers without ndo_fdb_<op> CC: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-29 15:26:27 -04:00
Shmulik Ladkani	a561cf7edf	net: core: Remove redundant call to 'nf_reset' in 'dev_forward_skb' 'nf_reset' is called just prior calling 'netif_rx'. No need to call it twice. Reported-by: Igor Michailov <rgohita@gmail.com> Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-29 15:25:28 -04:00
Li RongQing	278150321a	net: simplify the getting percpu of flow_cache replace per_cpu with per_cpu_ptr to save conversion between address and pointer Signed-off-by: Li RongQing <roy.qing.li@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-29 15:14:46 -04:00
Li RongQing	50eab0503a	net: fix the use of this_cpu_ptr flush_tasklet is not percpu var, and percpu is percpu var, and this_cpu_ptr(&info->cache->percpu->flush_tasklet) is not equal to &this_cpu_ptr(info->cache->percpu)->flush_tasklet 1f743b076(use this_cpu_ptr per-cpu helper) introduced this bug. Signed-off-by: Li RongQing <roy.qing.li@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-29 15:13:27 -04:00
Linus Torvalds	2c3de1c2d7	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull userns fixes from Eric W Biederman: "The bulk of the changes are fixing the worst consequences of the user namespace design oversight in not considering what happens when one namespace starts off as a clone of another namespace, as happens with the mount namespace. The rest of the changes are just plain bug fixes. Many thanks to Andy Lutomirski for pointing out many of these issues." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: userns: Restrict when proc and sysfs can be mounted ipc: Restrict mounting the mqueue filesystem vfs: Carefully propogate mounts across user namespaces vfs: Add a mount flag to lock read only bind mounts userns: Don't allow creation if the user is chrooted yama: Better permission check for ptraceme pid: Handle the exit of a multi-threaded init. scm: Require CAP_SYS_ADMIN over the current pidns to spoof pids.	2013-03-28 13:43:46 -07:00
Hong zhi guo	573ce260b3	net-next: replace obsolete NLMSG_* with type safe nlmsg_* Signed-off-by: Hong Zhiguo <honkiko@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-28 14:25:25 -04:00
Wei Yongjun	fcca143d69	rtnetlink: fix error return code in rtnl_link_fill() Fix to return a negative error code from the error handling case instead of 0(possible overwrite to 0 by ops->fill_xstats call), as returned elsewhere in this function. Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-27 14:06:40 -04:00
David S. Miller	e2a553dbf1	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: include/net/ipip.h The changes made to ipip.h in 'net' were already included in 'net-next' before that header was moved to another location. Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-27 13:52:49 -04:00
Andy Shevchenko	5a048e3b59	net: core: let's use native isxdigit instead of custom In kernel we have fast and pretty implementation of the isxdigit() function. Let's use it. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-27 12:48:32 -04:00
Jason Wang	e5d5decaed	net: core: let skb_partial_csum_set() set transport header For untrusted packets with partial checksum, we need to set the transport header for precise packet length estimation. We can just let skb_pratial_csum_set() to do this to avoid extra call to skb_flow_dissect() and simplify the caller. Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-27 12:48:31 -04:00
Jason Wang	15e5a03071	net_sched: better precise estimation on packet length for untrusted packets gso_segs were reset to zero when kernel receive packets from untrusted source. But we use this zero value to estimate precise packet len which is wrong. So this patch tries to estimate the correct gso_segs value before using it in qdisc_pkt_len_init(). Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-26 12:44:44 -04:00
David S. Miller	eaac5f3d3a	net: Print functions in /proc/net/ptype without the offset. It's always zero. Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-25 14:12:55 -04:00
Eric Dumazet	9979a55a83	net: remove a WARN_ON() in net_enable_timestamp() The WARN_ON(in_interrupt()) in net_enable_timestamp() can get false positive, in socket clone path, run from softirq context : [ 3641.624425] WARNING: at net/core/dev.c:1532 net_enable_timestamp+0x7b/0x80() [ 3641.668811] Call Trace: [ 3641.671254] <IRQ> [<ffffffff80286817>] warn_slowpath_common+0x87/0xc0 [ 3641.677871] [<ffffffff8028686a>] warn_slowpath_null+0x1a/0x20 [ 3641.683683] [<ffffffff80742f8b>] net_enable_timestamp+0x7b/0x80 [ 3641.689668] [<ffffffff80732ce5>] sk_clone_lock+0x425/0x450 [ 3641.695222] [<ffffffff8078db36>] inet_csk_clone_lock+0x16/0x170 [ 3641.701213] [<ffffffff807ae449>] tcp_create_openreq_child+0x29/0x820 [ 3641.707663] [<ffffffff807d62e2>] ? ipt_do_table+0x222/0x670 [ 3641.713354] [<ffffffff807aaf5b>] tcp_v4_syn_recv_sock+0xab/0x3d0 [ 3641.719425] [<ffffffff807af63a>] tcp_check_req+0x3da/0x530 [ 3641.724979] [<ffffffff8078b400>] ? inet_hashinfo_init+0x60/0x80 [ 3641.730964] [<ffffffff807ade6f>] ? tcp_v4_rcv+0x79f/0xbe0 [ 3641.736430] [<ffffffff807ab9bd>] tcp_v4_do_rcv+0x38d/0x4f0 [ 3641.741985] [<ffffffff807ae14a>] tcp_v4_rcv+0xa7a/0xbe0 Its safe at this point because the parent socket owns a reference on the netstamp_needed, so we cant have a 0 -> 1 transition, which requires to lock a mutex. Instead of refining the check, lets remove it, as all known callers are safe. If it ever changes in the future, static_key_slow_inc() will complain anyway. Reported-by: Laurent Chavey <chavey@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-24 17:27:27 -04:00
Nicolas Dichtel	0465277f6b	ipv4: provide addr and netconf dump consistency info This patch takes benefit of dev_addr_genid and dev_base_seq to check if a change occurs during a netlink dump. If a change is detected, the flag NLM_F_DUMP_INTR is set in the first message after the dump was interrupted. Note that seq and prev_seq must be reset between each family in rtnl_dump_all() because they are specific to each family. Reported-by: Junwei Zhang <junwei.zhang@6wind.com> Reported-by: Hongjun Li <hongjun.li@6wind.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-24 17:16:29 -04:00
Thomas Graf	661d2967b3	rtnetlink: Remove passing of attributes into rtnl_doit functions With decnet converted, we can finally get rid of rta_buf and its computations around it. It also gets rid of the minimal header length verification since all message handlers do that explicitly anyway. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-22 10:31:16 -04:00
Zefan Li	4021db9a0d	net: remove redundant ifdef CONFIG_CGROUPS The cgroup code has been surrounded by ifdef CONFIG_NET_CLS_CGROUP and CONFIG_NETPRIO_CGROUP. Signed-off-by: Li Zefan <lizefan@huawei.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-21 11:47:51 -04:00
Chris Metcalf	8fdc929f57	dynticks: avoid flow_cache_flush() interrupting every core Previously, if you did an "ifconfig down" or similar on one core, and the kernel had CONFIG_XFRM enabled, every core would be interrupted to check its percpu flow list for items that could be garbage collected. With this change, we generate a mask of cores that actually have any percpu items, and only interrupt those cores. When we are trying to isolate a set of cpus from interrupts, this is important to do. Signed-off-by: Chris Metcalf <cmetcalf@tilera.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-20 13:28:39 -04:00
Daniel Borkmann	3e5289d5e3	filter: add ANC_PAY_OFFSET instruction for loading payload start offset It is very useful to do dynamic truncation of packets. In particular, we're interested to push the necessary header bytes to the user space and cut off user payload that should probably not be transferred for some reasons (e.g. privacy, speed, or others). With the ancillary extension PAY_OFFSET, we can load it into the accumulator, and return it. E.g. in bpfc syntax ... ld #poff ; { 0x20, 0, 0, 0xfffff034 }, ret a ; { 0x16, 0, 0, 0x00000000 }, ... as a filter will accomplish this without having to do a big hackery in a BPF filter itself. Follow-up JIT implementations are welcome. Thanks to Eric Dumazet for suggesting and discussing this during the Netfilter Workshop in Copenhagen. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-20 13:15:45 -04:00
Daniel Borkmann	f77668dc25	net: flow_dissector: add __skb_get_poff to get a start offset to payload __skb_get_poff() returns the offset to the payload as far as it could be dissected. The main user is currently BPF, so that we can dynamically truncate packets without needing to push actual payload to the user space and instead can analyze headers only. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-20 13:15:45 -04:00
David S. Miller	61816596d1	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull in the 'net' tree to get Daniel Borkmann's flow dissector infrastructure change. Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-20 12:46:26 -04:00
Daniel Borkmann	8ed781668d	flow_keys: include thoff into flow_keys for later usage In skb_flow_dissect(), we perform a dissection of a skbuff. Since we're doing the work here anyway, also store thoff for a later usage, e.g. in the BPF filter. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-20 12:14:36 -04:00
Kusanagi Kouichi	166ec36968	net: Fix a comment typo Signed-off-by: Kusanagi Kouichi <slash@ac.auone-net.jp> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-18 13:06:09 -04:00
Eric W. Biederman	92f28d973c	scm: Require CAP_SYS_ADMIN over the current pidns to spoof pids. Don't allow spoofing pids over unix domain sockets in the corner cases where a user has created a user namespace but has not yet created a pid namespace. Cc: stable@vger.kernel.org Reported-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-03-17 17:16:16 -07:00
Lai Jiangshan	7f9421c264	netpoll: use DEFINE_STATIC_SRCU() to define netpoll_srcu DEFINE_STATIC_SRCU() defines srcu struct and do init at build time. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-17 12:42:43 -04:00
David Stevens	6681712d67	vxlan: generalize forwarding tables This patch generalizes VXLAN forwarding table entries allowing an administrator to: 1) specify multiple destinations for a given MAC 2) specify alternate vni's in the VXLAN header 3) specify alternate destination UDP ports 4) use multicast MAC addresses as fdb lookup keys 5) specify multicast destinations 6) specify the outgoing interface for forwarded packets The combination allows configuration of more complex topologies using VXLAN encapsulation. Changes since v1: rebase to 3.9.0-rc2 Signed-Off-By: David L Stevens <dlstevens@us.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-17 12:23:46 -04:00
Vlad Yasevich	a5b8db9144	rtnetlink: Mask the rta_type when range checking Range/validity checks on rta_type in rtnetlink_rcv_msg() do not account for flags that may be set. This causes the function to return -EINVAL when flags are set on the type (for example NLA_F_NESTED). Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-17 11:43:40 -04:00
Li RongQing	c80a8512ee	net/core: move vlan_depth out of while loop in skb_network_protocol() [ Bug added added in commit `05e8ef4ab2` (net: factor out skb_mac_gso_segment() from skb_gso_segment() ) ] move vlan_depth out of while loop, or else vlan_depth always is ETH_HLEN, can not be increased, and lead to infinite loop when frame has two vlan headers. Signed-off-by: Li RongQing <roy.qing.li@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-12 11:47:40 -04:00
David S. Miller	e5f2ef7ab4	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/ethernet/intel/e1000e/netdev.c Minor conflict in e1000e, a line that got fixed in 'net' has been removed in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-12 05:52:22 -04:00
Michael Dalton	e1733de224	flow_dissector: support L2 GRE Add support for L2 GRE tunnels, so that RPS can be more effective. Signed-off-by: Michael Dalton <mwdalton@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-12 05:14:06 -04:00
Mathias Krause	84d73cd3fb	rtnl: fix info leak on RTM_GETLINK request for VF devices Initialize the mac address buffer with 0 as the driver specific function will probably not fill the whole buffer. In fact, all in-kernel drivers fill only ETH_ALEN of the MAX_ADDR_LEN bytes, i.e. 6 of the 32 possible bytes. Therefore we currently leak 26 bytes of stack memory to userland via the netlink interface. Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-10 05:19:26 -04:00
Pravin B Shelar	7313626745	tunneling: Add generic Tunnel segmentation. Adds generic tunneling offloading support for IPv4-UDP based tunnels. GSO type is added to request this offload for a skb. netdev feature NETIF_F_UDP_TUNNEL is added for hardware offloaded udp-tunnel support. Currently no device supports this feature, software offload is used. This can be used by tunneling protocols like VXLAN. CC: Jesse Gross <jesse@nicira.com> Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-09 16:09:17 -05:00
Pravin B Shelar	aefbd2b3c2	tunneling: Capture inner mac header during encapsulation. This patch adds inner mac header. This will be used in next patch to find tunner header length. Header len is required to copy tunnel header to each gso segment. This patch does not change any functionality. Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-09 16:08:57 -05:00
Pravin B Shelar	f5b1729443	net: Add skb_headers_offset_update helper function. This function will be used in next VXLAN_GSO patch. This patch does not change any functionality. Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-09 16:08:57 -05:00
Pravin B Shelar	ee579677c2	tunnel: Inherit NETIF_F_SG for hw_enc_features. Inherit scatergather feature for tunnel devices to avoid copy for TSO packets of tunneling device like GRE. Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-09 16:08:57 -05:00
Pravin B Shelar	ec5f061564	net: Kill link between CSUM and SG features. Earlier SG was unset if CSUM was not available for given device to force skb copy to avoid sending inconsistent csum. Commit `c9af6db4c1` (net: Fix possible wrong checksum generation) added explicit flag to force copy to fix this issue. Therefore there is no need to link SG and CSUM, following patch kills this link between there two features. This patch is also required following patch in series. Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-09 16:08:57 -05:00
Cristian Bercaru	3bc1b1add7	bridging: fix rx_handlers return code The frames for which rx_handlers return RX_HANDLER_CONSUMED are no longer counted as dropped. They are counted as successfully received by 'netif_receive_skb'. This allows network interface drivers to correctly update their RX-OK and RX-DRP counters based on the result of 'netif_receive_skb'. Signed-off-by: Cristian Bercaru <B43982@freescale.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-08 12:19:59 -05:00
Vlad Yasevich	090096bf3d	net: generic fdb support for drivers without ndo_fdb_<op> If the driver does not support the ndo_op use the generic handler for it. This should work in the majority of cases. Eventually the fdb_dflt_add call gets translated into a __dev_set_rx_mode() call which should handle hardware support for filtering via the IFF_UNICAST_FLT flag. Namely IFF_UNICAST_FLT indicates if the hardware can do unicast address filtering. If no support is available the device is put into promisc mode. Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-07 15:29:45 -05:00
Eric Dumazet	d1f41b67ff	net: reduce net_rx_action() latency to 2 HZ We should use time_after_eq() to get maximum latency of two ticks, instead of three. Bug added in commit `24f8b2385` (net: increase receive packet quantum) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-06 02:47:06 -05:00
Randy Dunlap	691b3b7e13	net: fix new kernel-doc warnings in net core Fix new kernel-doc warnings in net/core/dev.c: Warning(net/core/dev.c:4788): No description found for parameter 'new_carrier' Warning(net/core/dev.c:4788): Excess function parameter 'new_carries' description in 'dev_change_carrier' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-06 02:47:06 -05:00
Eric Dumazet	82dc3c63c6	net: introduce NAPI_POLL_WEIGHT Some drivers use a too big NAPI poll weight. This patch adds a NAPI_POLL_WEIGHT default value and issues an error message if a driver attempts to use a bigger weight. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Eilon Greenstein <eilong@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-05 23:40:01 -05:00
Sasha Levin	b67bfe0d42	hlist: drop the node parameter from iterators I'm not sure why, but the hlist for each entry iterators were conceived list_for_each_entry(pos, head, member) The hlist ones were greedy and wanted an extra parameter: hlist_for_each_entry(tpos, pos, head, member) Why did they need an extra pos parameter? I'm not quite sure. Not only they don't really need it, it also prevents the iterator from looking exactly like the list iterator, which is unfortunate. Besides the semantic patch, there was some manual work required: - Fix up the actual hlist iterators in linux/list.h - Fix up the declaration of other iterators based on the hlist ones. - A very small amount of places were using the 'node' parameter, this was modified to use 'obj->member' instead. - Coccinelle didn't handle the hlist_for_each_entry_safe iterator properly, so those had to be fixed up manually. The semantic patch which is mostly the work of Peter Senna Tschudin is here: @@ iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host; type T; expression a,c,d,e; identifier b; statement S; @@ -T b; <+... when != b ( hlist_for_each_entry(a, - b, c, d) S \| hlist_for_each_entry_continue(a, - b, c) S \| hlist_for_each_entry_from(a, - b, c) S \| hlist_for_each_entry_rcu(a, - b, c, d) S \| hlist_for_each_entry_rcu_bh(a, - b, c, d) S \| hlist_for_each_entry_continue_rcu_bh(a, - b, c) S \| for_each_busy_worker(a, c, - b, d) S \| ax25_uid_for_each(a, - b, c) S \| ax25_for_each(a, - b, c) S \| inet_bind_bucket_for_each(a, - b, c) S \| sctp_for_each_hentry(a, - b, c) S \| sk_for_each(a, - b, c) S \| sk_for_each_rcu(a, - b, c) S \| sk_for_each_from -(a, b) +(a) S + sk_for_each_from(a) S \| sk_for_each_safe(a, - b, c, d) S \| sk_for_each_bound(a, - b, c) S \| hlist_for_each_entry_safe(a, - b, c, d, e) S \| hlist_for_each_entry_continue_rcu(a, - b, c) S \| nr_neigh_for_each(a, - b, c) S \| nr_neigh_for_each_safe(a, - b, c, d) S \| nr_node_for_each(a, - b, c) S \| nr_node_for_each_safe(a, - b, c, d) S \| - for_each_gfn_sp(a, c, d, b) S + for_each_gfn_sp(a, c, d) S \| - for_each_gfn_indirect_valid_sp(a, c, d, b) S + for_each_gfn_indirect_valid_sp(a, c, d) S \| for_each_host(a, - b, c) S \| for_each_host_safe(a, - b, c, d) S \| for_each_mesh_entry(a, - b, c, d) S ) ...+> [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c] [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c] [akpm@linux-foundation.org: checkpatch fixes] [akpm@linux-foundation.org: fix warnings] [akpm@linux-foudnation.org: redo intrusive kvm changes] Tested-by: Peter Senna Tschudin <peter.senna@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Gleb Natapov <gleb@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-02-27 19:10:24 -08:00
Linus Torvalds	d895cb1af1	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile (part one) from Al Viro: "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent locking violations, etc. The most visible changes here are death of FS_REVAL_DOT (replaced with "has ->d_weak_revalidate()") and a new helper getting from struct file to inode. Some bits of preparation to xattr method interface changes. Misc patches by various people sent this cycle and ocfs2 fixes from several cycles ago that should've been upstream right then. PS: the next vfs pile will be xattr stuff." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits) saner proc_get_inode() calling conventions proc: avoid extra pde_put() in proc_fill_super() fs: change return values from -EACCES to -EPERM fs/exec.c: make bprm_mm_init() static ocfs2/dlm: use GFP_ATOMIC inside a spin_lock ocfs2: fix possible use-after-free with AIO ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero target: writev() on single-element vector is pointless export kernel_write(), convert open-coded instances fs: encode_fh: return FILEID_INVALID if invalid fid_type kill f_vfsmnt vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op nfsd: handle vfs_getattr errors in acl protocol switch vfs_getattr() to struct path default SET_PERSONALITY() in linux/elf.h ceph: prepopulate inodes only when request is aborted d_hash_and_lookup(): export, switch open-coded instances 9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate() 9p: split dropping the acls from v9fs_set_create_acl() ...	2013-02-26 20:16:07 -08:00
Linus Torvalds	1cef9350cb	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking fixes from David Miller: 1) ping_err() ICMP error handler looks at wrong ICMP header, from Li Wei. 2) TCP socket hash function on ipv6 is too weak, from Eric Dumazet. 3) netif_set_xps_queue() forgets to drop mutex on errors, fix from Alexander Duyck. 4) sum_frag_mem_limit() can deadlock due to lack of BH disabling, fix from Eric Dumazet. 5) TCP SYN data is miscalculated in tcp_send_syn_data(), because the amount of TCP option space was not taken into account properly in this code path. Fix from yuchung Cheng. 6) MLX4 driver allocates device queues with the wrong size, from Kleber Sacilotto. 7) sock_diag can access past the end of the sock_diag_handlers[] array, from Mathias Krause. 8) vlan_set_encap_proto() makes incorrect assumptions about where skb->data points, rework the logic so that it works regardless of where skb->data happens to be. From Jesse Gross. 9) Fix gianfar build failure with NET_POLL enabled, from Paul Gortmaker. 10) Fix Ipv4 ID setting and checksum calculations in GRE driver, from Pravin B Shelar. 11) bgmac driver does: int i; for (i = 0; ...; ...) { ... for (i = 0; ...; ...) { effectively corrupting the outer loop index, use a seperate variable for the inner loops. From Rafał Miłecki. 12) Fix suspend bugs in smsc95xx driver, from Ming Lei. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (35 commits) usbnet: smsc95xx: rename FEATURE_AUTOSUSPEND usbnet: smsc95xx: fix broken runtime suspend usbnet: smsc95xx: fix suspend failure bgmac: fix indexing of 2nd level loops b43: Fix lockdep splat on module unload Revert "ip_gre: propogate target device GSO capability to the tunnel device" IP_GRE: Fix GRE_CSUM case. VXLAN: Use tunnel_ip_select_ident() for tunnel IP-Identification. IP_GRE: Fix IP-Identification. net/pasemi: Fix missing coding style vmxnet3: fix ethtool ring buffer size setting vmxnet3: make local function static bnx2x: remove dead code and make local funcs static gianfar: fix compile fail for NET_POLL=y due to struct packing vlan: adjust vlan_set_encap_proto() for its callers sock_diag: Simplify sock_diag_handlers[] handling in __sock_diag_rcv_msg sock_diag: Fix out-of-bounds access to sock_diag_handlers[] vxlan: remove depends on CONFIG_EXPERIMENTAL mlx4_en: fix allocation of CPU affinity reverse-map mlx4_en: fix allocation of device tx_cq ...	2013-02-26 11:44:11 -08:00
Ming Lei	9802c8e22f	net/core: apply pm_runtime_set_memalloc_noio on network devices Deadlock might be caused by allocating memory with GFP_KERNEL in runtime_resume and runtime_suspend callback of network devices in iSCSI situation, so mark network devices and its ancestor as 'memalloc_noio' with the introduced pm_runtime_set_memalloc_noio(). Signed-off-by: Ming Lei <ming.lei@canonical.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Decotigny <david.decotigny@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan@kernel.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Oliver Neukum <oneukum@suse.de> Cc: Jiri Kosina <jiri.kosina@suse.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-02-23 17:50:16 -08:00
Mathias Krause	8e904550d0	sock_diag: Simplify sock_diag_handlers[] handling in __sock_diag_rcv_msg The sock_diag_lock_handler() and sock_diag_unlock_handler() actually make the code less readable. Get rid of them and make the lock usage and access to sock_diag_handlers[] clear on the first sight. Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-02-23 13:51:54 -05:00
Mathias Krause	6e601a5356	sock_diag: Fix out-of-bounds access to sock_diag_handlers[] Userland can send a netlink message requesting SOCK_DIAG_BY_FAMILY with a family greater or equal then AF_MAX -- the array size of sock_diag_handlers[]. The current code does not test for this condition therefore is vulnerable to an out-of-bound access opening doors for a privilege escalation. Signed-off-by: Mathias Krause <minipli@googlemail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-02-23 13:51:54 -05:00
Al Viro	496ad9aa8e	new helper: file_inode(file) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-02-22 23:31:31 -05:00
stephen hemminger	cbda4eaffa	sock: only define socket limit if mem cgroup configured The mem cgroup socket limit is only used if the config option is enabled. Found with sparse Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-02-22 15:10:19 -05:00
Alexander Duyck	2bb60cb9b7	net: Fix locking bug in netif_set_xps_queue Smatch found a locking bug in netif_set_xps_queue in which we were not releasing the lock in the case of an allocation failure. This change corrects that so that we release the xps_map_mutex before returning -ENOMEM in the case of an allocation failure. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-02-22 15:10:19 -05:00
YOSHIFUJI Hideaki / 吉藤英明	ecd9883724	ipv6: fix race condition regarding dst->expires and dst->from. Eric Dumazet wrote: \| Some strange crashes happen in rt6_check_expired(), with access \| to random addresses. \| \| At first glance, it looks like the RTF_EXPIRES and \| stuff added in commit `1716a96101` \| (ipv6: fix problem with expired dst cache) \| are racy : same dst could be manipulated at the same time \| on different cpus. \| \| At some point, our stack believes rt->dst.from contains a dst pointer, \| while its really a jiffie value (as rt->dst.expires shares the same area \| of memory) \| \| rt6_update_expires() should be fixed, or am I missing something ? \| \| CC Neil because of https://bugzilla.redhat.com/show_bug.cgi?id=892060 Because we do not have any locks for dst_entry, we cannot change essential structure in the entry; e.g., we cannot change reference to other entity. To fix this issue, split 'from' and 'expires' field in dst_entry out of union. Once it is 'from' is assigned in the constructor, keep the reference until the very last stage of the life time of the object. Of course, it is unsafe to change 'from', so make rt6_set_from simple just for fresh entries. Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Reported-by: Neil Horman <nhorman@tuxdriver.com> CC: Gao Feng <gaofeng@cn.fujitsu.com> Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Reported-by: Steinar H. Gunderson <sesse@google.com> Reviewed-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-02-20 15:11:45 -05:00
Amerigo Wang	68534c682e	net: fix a wrong assignment in skb_split() commit `c9af6db4c1` (net: Fix possible wrong checksum generation) has a suspicous piece: - skb_shinfo(skb1)->gso_type = skb_shinfo(skb)->gso_type; - + skb_shinfo(skb)->tx_flags = skb_shinfo(skb1)->tx_flags & SKBTX_SHARED_FRAG; skb1 is the new skb, therefore should be on the left side of the assignment. This patch fixes it. Cc: Pravin B Shelar <pshelar@nicira.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-02-20 15:11:44 -05:00
Cong Wang	cd0615746b	net: fix a build failure when !CONFIG_PROC_FS When !CONFIG_PROC_FS dev_mcast_init() is not defined, actually we can just merge dev_mcast_init() into dev_proc_init(). Reported-by: Gao feng <gaofeng@cn.fujitsu.com> Cc: Gao feng <gaofeng@cn.fujitsu.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-02-19 13:18:13 -05:00
Cong Wang	900ff8c632	net: move procfs code to net/core/net-procfs.c Similar to net/core/net-sysfs.c, group procfs code to a single unit. Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-02-19 00:51:10 -05:00
Gao feng	ece31ffd53	net: proc: change proc_net_remove to remove_proc_entry proc_net_remove is only used to remove proc entries that under /proc/net,it's not a general function for removing proc entries of netns. if we want to remove some proc entries which under /proc/net/stat/, we still need to call remove_proc_entry. this patch use remove_proc_entry to replace proc_net_remove. we can remove proc_net_remove after this patch. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-02-18 14:53:08 -05:00
Gao feng	d4beaa66ad	net: proc: change proc_net_fops_create to proc_create Right now, some modules such as bonding use proc_create to create proc entries under /proc/net/, and other modules such as ipv4 use proc_net_fops_create. It looks a little chaos.this patch changes all of proc_net_fops_create to proc_create. we can remove proc_net_fops_create after this patch. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-02-18 14:53:08 -05:00
Cong Wang	96b45cbd95	net: move ioctl functions into a separated file They well deserve a separated unit. Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-02-18 12:27:32 -05:00
Eric Dumazet	efd9450e7e	net: use skb_reset_mac_len() in dev_gro_receive() We no longer need to use mac_len, lets cleanup things. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-02-15 15:36:39 -05:00
Pravin B Shelar	68c3316311	v4 GRE: Add TCP segmentation offload for GRE Following patch adds GRE protocol offload handler so that skb_gso_segment() can segment GRE packets. SKB GSO CB is added to keep track of total header length so that skb_segment can push entire header. e.g. in case of GRE, skb_segment need to push inner and outer headers to every segment. New NETIF_F_GRE_GSO feature is added for devices which support HW GRE TSO offload. Currently none of devices support it therefore GRE GSO always fall backs to software GSO. [ Compute pkt_len before ip_local_out() invocation. -DaveM ] Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-02-15 15:17:11 -05:00
Pravin B Shelar	05e8ef4ab2	net: factor out skb_mac_gso_segment() from skb_gso_segment() This function will be used in next GRE_GSO patch. This patch does not change any functionality. It only exports skb_mac_gso_segment() function. [ Use skb_reset_mac_len() -DaveM ] Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-02-15 15:16:03 -05:00
David S. Miller	9754e29349	net: Don't write to current task flags on every packet received. Even for non-pfmalloc SKBs, __netif_receive_skb() will do a tsk_restore_flags() on current unconditionally. Make __netif_receive_skb() a shim around the existing code, renamed to __netif_receive_skb_core(). Let __netif_receive_skb() wrap the __netif_receive_skb_core() call with the task flag modifications, if necessary. Signed-off-by: David S. Miller <davem@davemloft.net>	2013-02-14 15:57:38 -05:00
Vlad Yasevich	1690be63a2	bridge: Add vlan support to static neighbors When a user adds bridge neighbors, allow him to specify VLAN id. If the VLAN id is not specified, the neighbor will be added for VLANs currently in the ports filter list. If no VLANs are configured on the port, we use vlan 0 and only add 1 entry. Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Acked-by: Jitendra Kalsaria <jitendra.kalsaria@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-02-13 19:42:16 -05:00
Vlad Yasevich	6cbdceeb1c	bridge: Dump vlan information from a bridge port Using the RTM_GETLINK dump the vlan filter list of a given bridge port. The information depends on setting the filter flag similar to how nic VF info is dumped. Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-02-13 19:41:46 -05:00
Vlad Yasevich	407af3299e	bridge: Add netlink interface to configure vlans on bridge ports Add a netlink interface to add and remove vlan configuration on bridge port. The interface uses the RTM_SETLINK message and encodes the vlan configuration inside the IFLA_AF_SPEC. It is possble to include multiple vlans to either add or remove in a single message. Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-02-13 19:41:46 -05:00
Pravin B Shelar	c9af6db4c1	net: Fix possible wrong checksum generation. Patch `cef401de7b` (net: fix possible wrong checksum generation) fixed wrong checksum calculation but it broke TSO by defining new GSO type but not a netdev feature for that type. net_gso_ok() would not allow hardware checksum/segmentation offload of such packets without the feature. Following patch fixes TSO and wrong checksum. This patch uses same logic that Eric Dumazet used. Patch introduces new flag SKBTX_SHARED_FRAG if at least one frag can be modified by the user. but SKBTX_SHARED_FRAG flag is kept in skb shared info tx_flags rather than gso_type. tx_flags is better compared to gso_type since we can have skb with shared frag without gso packet. It does not link SHARED_FRAG to GSO, So there is no need to define netdev feature for this. Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-02-13 13:30:10 -05:00
Neil Horman	959d5fdee7	netpoll: fix smatch warnings in netpoll core code Dan Carpenter contacted me with some notes regarding some smatch warnings in the netpoll code, some of which I introduced with my recent netpoll locking fixes, some which were there prior. Specifically they were: net-next/net/core/netpoll.c:243 netpoll_poll_dev() warn: inconsistent returns mutex:&ni->dev_lock: locked (213,217) unlocked (210,243) net-next/net/core/netpoll.c:706 netpoll_neigh_reply() warn: potential pointer math issue ('skb_transport_header(send_skb)' is a 128 bit pointer) This patch corrects the locking imbalance (the first error), and adds some parenthesis to correct the second error. Tested by myself. Applies to net-next Signed-off-by: Neil Horman <nhorman@tuxdriver.com> CC: Dan Carpenter <dan.carpenter@oracle.com> CC: "David S. Miller" <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-02-13 11:56:46 -05:00
James Hogan	99d5851eef	net: skbuff: fix compile error in skb_panic() I get the following build error on next-20130213 due to the following commit: commit `f05de73bf8` ("skbuff: create skb_panic() function and its wrappers"). It adds an argument called panic to a function that uses the BUG() macro which tries to call panic, but the argument masks the panic() function declaration, resulting in the following error (gcc 4.2.4): net/core/skbuff.c In function 'skb_panic': net/core/skbuff.c +126 : error: called object 'panic' is not a function This is fixed by renaming the argument to msg. Signed-off-by: James Hogan <james.hogan@imgtec.com> Cc: Jean Sacren <sakiwit@gmail.com> Cc: Jiri Pirko <jiri@resnulli.us> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-02-13 11:52:04 -05:00
David S. Miller	9f6d98c298	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c The bnx2x gso_type setting bug fix in 'net' conflicted with changes in 'net-next' that broke the gso_* setting logic out into a seperate function, which also fixes the bug in question. Thus, use the 'net-next' version. Signed-off-by: David S. Miller <davem@davemloft.net>	2013-02-12 18:58:28 -05:00
Eric Dumazet	77c1090f94	net: fix infinite loop in __skb_recv_datagram() Tommi was fuzzing with trinity and reported the following problem : commit `3f518bf745` (datagram: Add offset argument to __skb_recv_datagram) missed that a raw socket receive queue can contain skbs with no payload. We can loop in __skb_recv_datagram() with MSG_PEEK mode, because wait_for_packet() is not prepared to skip these skbs. [ 83.541011] INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by 0, t=26002 jiffies, g=27673, c=27672, q=75) [ 83.541011] INFO: Stall ended before state dump start [ 108.067010] BUG: soft lockup - CPU#0 stuck for 22s! [trinity-child31:2847] ... [ 108.067010] Call Trace: [ 108.067010] [<ffffffff818cc103>] __skb_recv_datagram+0x1a3/0x3b0 [ 108.067010] [<ffffffff818cc33d>] skb_recv_datagram+0x2d/0x30 [ 108.067010] [<ffffffff819ed43d>] rawv6_recvmsg+0xad/0x240 [ 108.067010] [<ffffffff818c4b04>] sock_common_recvmsg+0x34/0x50 [ 108.067010] [<ffffffff818bc8ec>] sock_recvmsg+0xbc/0xf0 [ 108.067010] [<ffffffff818bf31e>] sys_recvfrom+0xde/0x150 [ 108.067010] [<ffffffff81ca4329>] system_call_fastpath+0x16/0x1b Reported-by: Tommi Rantala <tt.rantala@gmail.com> Tested-by: Tommi Rantala <tt.rantala@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-02-12 16:07:19 -05:00
Neil Horman	0790bbb68f	netpoll: cleanup sparse warnings With my recent commit I introduced two sparse warnings. Looking closer there were a few more in the same file, so I fixed them all up. Basic rcu pointer dereferencing suff. I've validated these changes using CONFIG_PROVE_RCU while starting and stopping netconsole repeatedly in bonded and non-bonded configurations Signed-off-by: Neil Horman <nhorman@tuxdriver.com> CC: fengguang.wu@intel.com CC: David Miller <davem@davemloft.net> CC: eric.dumazet@gmail.com Signed-off-by: David S. Miller <davem@davemloft.net>	2013-02-11 19:19:58 -05:00
Neil Horman	2cde6acd49	netpoll: Fix __netpoll_rcu_free so that it can hold the rtnl lock __netpoll_rcu_free is used to free netpoll structures when the rtnl_lock is already held. The mechanism is used to asynchronously call __netpoll_cleanup outside of the holding of the rtnl_lock, so as to avoid deadlock. Unfortunately, __netpoll_cleanup modifies pointers (dev->np), which means the rtnl_lock must be held while calling it. Further, it cannot be held, because rcu callbacks may be issued in softirq contexts, which cannot sleep. Fix this by converting the rcu callback to a work queue that is guaranteed to get scheduled in process context, so that we can hold the rtnl properly while calling __netpoll_cleanup Tested successfully by myself. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> CC: "David S. Miller" <davem@davemloft.net> CC: Cong Wang <amwang@redhat.com> CC: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-02-11 19:19:33 -05:00
Jean Sacren	f05de73bf8	skbuff: create skb_panic() function and its wrappers Create skb_panic() function in lieu of both skb_over_panic() and skb_under_panic() so that code duplication would be avoided. Update type and variable name where necessary. Jiri Pirko suggested using wrappers so that we would be able to keep the fruits of the original code. Signed-off-by: Jean Sacren <sakiwit@gmail.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-02-11 19:14:43 -05:00
Alexander Duyck	e5e6730588	skbuff: Move definition of NETDEV_FRAG_PAGE_MAX_SIZE In order to address the fact that some devices cannot support the full 32K frag size we need to have the value accessible somewhere so that we can use it to do comparisons against what the device can support. As such I am moving the values out of skbuff.c and into skbuff.h. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-02-08 17:33:35 -05:00
Eric Dumazet	6d1ccff627	net: reset mac header in dev_start_xmit() On 64 bit arches : There is a off-by-one error in qdisc_pkt_len_init() because mac_header is not set in xmit path. skb_mac_header() returns an out of bound value that was harmless because hdr_len is an 'unsigned int' On 32bit arches, the error is abysmal. This patch is also a prereq for "macvlan: add multicast filter" Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Ben Greear <greearb@candelatech.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-02-06 15:59:47 -05:00
Cong Wang	12b0004d1d	net: adjust skb_gso_segment() for calling in rx path skb_gso_segment() is almost always called in tx path, except for openvswitch. It calls this function when it receives the packet and tries to queue it to user-space. In this special case, the ->ip_summed check inside skb_gso_segment() is no longer true, as ->ip_summed value has different meanings on rx path. This patch adjusts skb_gso_segment() so that we can at least avoid such warnings on checksum. Cc: Jesse Gross <jesse@nicira.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-02-06 15:58:00 -05:00
Neil Horman	ca99ca14c9	netpoll: protect napi_poll and poll_controller during dev_[open\|close] Ivan Vercera was recently backporting commit `9c13cb8bb4` to a RHEL kernel, and I noticed that, while this patch protects the tg3 driver from having its ndo_poll_controller routine called during device initalization, it does nothing for the driver during shutdown. I.e. it would be entirely possible to have the ndo_poll_controller method (or subsequently the ndo_poll) routine called for a driver in the netpoll path on CPU A while in parallel on CPU B, the ndo_close or ndo_open routine could be called. Given that the two latter routines tend to initizlize and free many data structures that the former two rely on, the result can easily be data corruption or various other crashes. Furthermore, it seems that this is potentially a problem with all net drivers that support netpoll, and so this should ideally be fixed in a common path. As Ben H Pointed out to me, we can't preform dev_open/dev_close in atomic context, so I've come up with this solution. We can use a mutex to sleep in open/close paths and just do a mutex_trylock in the napi poll path and abandon the poll attempt if we're locked, as we'll just retry the poll on the next send anyway. I've tested this here by flooding netconsole with messages on a system whos nic driver I modfied to periodically return NETDEV_TX_BUSY, so that the netpoll tx workqueue would be forced to send frames and poll the device. While this was going on I rapidly ifdown/up'ed the interface and watched for any problems. I've not found any. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> CC: Ivan Vecera <ivecera@redhat.com> CC: "David S. Miller" <davem@davemloft.net> CC: Ben Hutchings <bhutchings@solarflare.com> CC: Francois Romieu <romieu@fr.zoreil.com> CC: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-02-06 15:45:03 -05:00
Joe Perches	62b5942aa5	net: core: Remove unnecessary alloc/OOM messages alloc failures already get standardized OOM messages and a dump_stack. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-02-06 14:58:52 -05:00
David S. Miller	188d1f76d0	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/ethernet/intel/e1000e/ethtool.c drivers/net/vmxnet3/vmxnet3_drv.c drivers/net/wireless/iwlwifi/dvm/tx.c net/ipv6/route.c The ipv6 route.c conflict is simple, just ignore the 'net' side change as we fixed the same problem in 'net-next' by eliminating cached neighbours from ipv6 routes. The e1000e conflict is an addition of a new statistic in the ethtool code, trivial. The vmxnet3 conflict is about one change in 'net' removing a guarding conditional, whilst in 'net-next' we had a netdev_info() conversion. The iwlwifi conflict is dealing with a WARN_ON() conversion in 'net-next' vs. a revert happening in 'net'. Signed-off-by: David S. Miller <davem@davemloft.net>	2013-02-05 14:12:20 -05:00
Ying Xue	25cc4ae913	net: remove redundant check for timer pending state before del_timer As in del_timer() there has already placed a timer_pending() function to check whether the timer to be deleted is pending or not, it's unnecessary to check timer pending state again before del_timer() is called. Signed-off-by: Ying Xue <ying.xue@windriver.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-02-04 13:26:49 -05:00
Gao feng	c5c351088a	netns: fdb: allow unprivileged users to add/del fdb entries Right now,only ixgdb,macvlan,vxlan and bridge implement fdb_add/fdb_del operations. these operations only operate the private data of net device. So allowing the unprivileged users who creates the userns and netns to add/del fdb entries will do no harm to other netns. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-02-04 13:12:16 -05:00
Pravin B Shelar	92df9b217e	net: Fix inner_network_header assignment in skb-copy. Use correct inner offset to set inner_network_offset. Found by inspection. Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-02-03 16:10:36 -05:00
Michał Mirosław	d2ed273d30	net: disallow drivers with buggy VLAN accel to register_netdevice() Instead of jumping aroung bugs that are easily fixed just don't let them in: affected drivers should be either fixed or have NETIF_F_HW_VLAN_FILTER removed from advertised features. Quick grep in drivers/net shows two drivers that have NETIF_F_HW_VLAN_FILTER but not ndo_vlan_rx_add/kill_vid(), but those are false-positives (features are commented out). OTOH two drivers have ndo_vlan_rx_add/kill_vid() implemented but don't advertise NETIF_F_HW_VLAN_FILTER. Those are: +ethernet/cisco/enic/enic_main.c +ethernet/qlogic/qlcnic/qlcnic_main.c Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-29 22:58:40 -05:00
Cong Wang	604dfd6efc	pktgen: correctly handle failures when adding a device The return value of pktgen_add_device() is not checked, so even if we fail to add some device, for example, non-exist one, we still see "OK:...". This patch fixes it. After this patch, I got: # echo "add_device non-exist" > /proc/net/pktgen/kpktgend_0 -bash: echo: write error: No such device # cat /proc/net/pktgen/kpktgend_0 Running: Stopped: Result: ERROR: can not add device non-exist # echo "add_device eth0" > /proc/net/pktgen/kpktgend_0 # cat /proc/net/pktgen/kpktgend_0 Running: Stopped: eth0 Result: OK: add_device=eth0 (Candidate for -stable) Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-29 15:43:03 -05:00
David S. Miller	f1e7b73acc	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Bring in the 'net' tree so that we can get some ipv4/ipv6 bug fixes that some net-next work will build upon. Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-29 15:32:13 -05:00
Cong Wang	4e58a02750	pktgen: support net namespace v3: make pktgen_threads list per-namespace v2: remove a useless check This patch add net namespace to pktgen, so that we can use pktgen in different namespaces. Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-29 15:17:46 -05:00
YOSHIFUJI Hideaki / 吉藤英明	08433eff2d	net neigh: Optimize neighbor entry size calculation. When allocating memory for neighbour cache entry, if tbl->entry_size is not set, we always calculate sizeof(struct neighbour) + tbl->key_len, which is common in the same table. With this change, set tbl->entry_size during the table initialization phase, if it was not set, and use it in neigh_alloc() and neighbour_priv(). This change also allow us to have both of protocol private data and device priate data at tha same time. Note that the only user of prototcol private is DECnet and the only user of device private is ATM CLIP. Since those are exclusive, we have not been facing issues here. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-28 23:17:51 -05:00
bingtian.ly@taobao.com	cdda88912d	net: avoid to hang up on sending due to sysctl configuration overflow. I found if we write a larger than 4GB value to some sysctl variables, the sending syscall will hang up forever, because these variables are 32 bits, such large values make them overflow to 0 or negative. This patch try to fix overflow or prevent from zero value setup of below sysctl variables: net.core.wmem_default net.core.rmem_default net.core.rmem_max net.core.wmem_max net.ipv4.udp_rmem_min net.ipv4.udp_wmem_min net.ipv4.tcp_wmem net.ipv4.tcp_rmem Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Li Yu <raise.sail@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-28 23:15:27 -05:00
Cong Wang	556e6256c8	netpoll: use the net namespace of current process instead of init_net This will allow us to setup netconsole in a different namespace rather than where init_net is. Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-28 18:32:55 -05:00
Cong Wang	faeed828f9	netpoll: use ipv6_addr_equal() to compare ipv6 addr ipv6_addr_equal() is faster. Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-28 18:32:55 -05:00
Eric Dumazet	cef401de7b	net: fix possible wrong checksum generation Pravin Shelar mentioned that GSO could potentially generate wrong TX checksum if skb has fragments that are overwritten by the user between the checksum computation and transmit. He suggested to linearize skbs but this extra copy can be avoided for normal tcp skbs cooked by tcp_sendmsg(). This patch introduces a new SKB_GSO_SHARED_FRAG flag, set in skb_shinfo(skb)->gso_type if at least one frag can be modified by the user. Typical sources of such possible overwrites are {vm}splice(), sendfile(), and macvtap/tun/virtio_net drivers. Tested: $ netperf -H 7.7.8.84 MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 10.00 3959.52 $ netperf -H 7.7.8.84 -t TCP_SENDFILE TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 10.00 3216.80 Performance of the SENDFILE is impacted by the extra allocation and copy, and because we use order-0 pages, while the TCP_STREAM uses bigger pages. Reported-by: Pravin Shelar <pshelar@nicira.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-28 00:27:15 -05:00
Tom Herbert	055dc21a1d	soreuseport: infrastructure Definitions and macros for implementing soreusport. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-23 13:44:00 -05:00
Cong Wang	e39363a9de	netpoll: fix an uninitialized variable Fengguang reported: net/core/netpoll.c: In function 'netpoll_setup': net/core/netpoll.c:1049:6: warning: 'err' may be used uninitialized in this function [-Wmaybe-uninitialized] in !CONFIG_IPV6 case, we may error out without initializing 'err'. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-22 23:18:59 -05:00
YOSHIFUJI Hideaki / 吉藤英明	8fbcec241d	net: Use IS_ERR_OR_NULL(). Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-22 14:28:28 -05:00
YOSHIFUJI Hideaki / 吉藤英明	2724680bce	neigh: Keep neighbour cache entries if number of them is small enough. Since we have removed NCE (Neighbour Cache Entry) reference from routing entries, the only refcnt holders of an NCE are its timer (if running) and its owner table, in usual cases. As a result, neigh_periodic_work() purges NCEs over and over again even for gateways. It does not make sense to purge entries, if number of them is very small, so keep them. The minimum number of entries to keep is specified by gc_thresh1. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-22 14:25:28 -05:00
Daniel Wagner	d84295067f	net: net_cls: fd passed in SCM_RIGHTS datagram not set correctly Commit `6a328d8c6f` changed the update logic for the socket but it does not update the SCM_RIGHTS update as well. This patch is based on the net_prio fix commit `48a87cc26c` net: netprio: fd passed in SCM_RIGHTS datagram not set correctly A socket fd passed in a SCM_RIGHTS datagram was not getting updated with the new tasks cgrp prioidx. This leaves IO on the socket tagged with the old tasks priority. To fix this add a check in the scm recvmsg path to update the sock cgrp prioidx with the new tasks value. Let's apply the same fix for net_cls. Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Reported-by: Li Zefan <lizefan@huawei.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: John Fastabend <john.r.fastabend@intel.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: netdev@vger.kernel.org Cc: cgroups@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-22 14:17:38 -05:00
Cong Wang	441d9d327f	net: move rx and tx hash functions to net/core/flow_dissector.c __skb_tx_hash() and __skb_get_rxhash() are all for calculating hash value based by some fields in skb, mostly used for selecting queues by device drivers. Meanwhile, net/core/dev.c is bloating. Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-21 14:26:17 -05:00
Eric Dumazet	bc9540c637	net: splice: fix __splice_segment() commit `9ca1b22d6d` (net: splice: avoid high order page splitting) forgot that skb->head could need a copy into several page frags. This could be the case for loopback traffic mostly. Also remove now useless skb argument from linear_to_page() and __splice_segment() prototypes. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-20 23:19:36 -05:00
Eric Dumazet	82bda61956	net: splice: avoid high order page splitting splice() can handle pages of any order, but network code tries hard to split them in PAGE_SIZE units. Not quite successfully anyway, as __splice_segment() assumed poff < PAGE_SIZE. This is true for the skb->data part, not necessarily for the fragments. This patch removes this logic to give the pages as they are in the skb. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-20 23:19:35 -05:00
Vincent Bernat	d59577b6ff	sk-filter: Add ability to lock a socket filter program While a privileged program can open a raw socket, attach some restrictive filter and drop its privileges (or send the socket to an unprivileged program through some Unix socket), the filter can still be removed or modified by the unprivileged program. This commit adds a socket option to lock the filter (SO_LOCK_FILTER) preventing any modification of a socket filter program. This is similar to OpenBSD BIOCLOCK ioctl on bpf sockets, except even root is not allowed change/drop the filter. The state of the lock can be read with getsockopt(). No error is triggered if the state is not changed. -EPERM is returned when a user tries to remove the lock or to change/remove the filter while the lock is active. The check is done directly in sk_attach_filter() and sk_detach_filter() and does not affect only setsockopt() syscall. Signed-off-by: Vincent Bernat <bernat@luffy.cx> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-17 03:21:25 -05:00
Cong Wang	5bd30d3987	netpoll: fix a missing dev refcounting __dev_get_by_name() doesn't refcount the network device, so we have to do this by ourselves. Noticed by Eric. Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Jiri Pirko <jiri@resnulli.us> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-16 23:33:06 -05:00

... 2 3 4 5 6 ...

3152 Commits