linux

Author	SHA1	Message	Date
Eric Dumazet	cef401de7b	net: fix possible wrong checksum generation Pravin Shelar mentioned that GSO could potentially generate wrong TX checksum if skb has fragments that are overwritten by the user between the checksum computation and transmit. He suggested to linearize skbs but this extra copy can be avoided for normal tcp skbs cooked by tcp_sendmsg(). This patch introduces a new SKB_GSO_SHARED_FRAG flag, set in skb_shinfo(skb)->gso_type if at least one frag can be modified by the user. Typical sources of such possible overwrites are {vm}splice(), sendfile(), and macvtap/tun/virtio_net drivers. Tested: $ netperf -H 7.7.8.84 MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 10.00 3959.52 $ netperf -H 7.7.8.84 -t TCP_SENDFILE TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 10.00 3216.80 Performance of the SENDFILE is impacted by the extra allocation and copy, and because we use order-0 pages, while the TCP_STREAM uses bigger pages. Reported-by: Pravin Shelar <pshelar@nicira.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-28 00:27:15 -05:00
Tom Herbert	055dc21a1d	soreuseport: infrastructure Definitions and macros for implementing soreusport. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-23 13:44:00 -05:00
Cong Wang	e39363a9de	netpoll: fix an uninitialized variable Fengguang reported: net/core/netpoll.c: In function 'netpoll_setup': net/core/netpoll.c:1049:6: warning: 'err' may be used uninitialized in this function [-Wmaybe-uninitialized] in !CONFIG_IPV6 case, we may error out without initializing 'err'. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-22 23:18:59 -05:00
YOSHIFUJI Hideaki / 吉藤英明	8fbcec241d	net: Use IS_ERR_OR_NULL(). Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-22 14:28:28 -05:00
YOSHIFUJI Hideaki / 吉藤英明	2724680bce	neigh: Keep neighbour cache entries if number of them is small enough. Since we have removed NCE (Neighbour Cache Entry) reference from routing entries, the only refcnt holders of an NCE are its timer (if running) and its owner table, in usual cases. As a result, neigh_periodic_work() purges NCEs over and over again even for gateways. It does not make sense to purge entries, if number of them is very small, so keep them. The minimum number of entries to keep is specified by gc_thresh1. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-22 14:25:28 -05:00
Daniel Wagner	d84295067f	net: net_cls: fd passed in SCM_RIGHTS datagram not set correctly Commit `6a328d8c6f` changed the update logic for the socket but it does not update the SCM_RIGHTS update as well. This patch is based on the net_prio fix commit `48a87cc26c` net: netprio: fd passed in SCM_RIGHTS datagram not set correctly A socket fd passed in a SCM_RIGHTS datagram was not getting updated with the new tasks cgrp prioidx. This leaves IO on the socket tagged with the old tasks priority. To fix this add a check in the scm recvmsg path to update the sock cgrp prioidx with the new tasks value. Let's apply the same fix for net_cls. Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Reported-by: Li Zefan <lizefan@huawei.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: John Fastabend <john.r.fastabend@intel.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: netdev@vger.kernel.org Cc: cgroups@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-22 14:17:38 -05:00
Cong Wang	441d9d327f	net: move rx and tx hash functions to net/core/flow_dissector.c __skb_tx_hash() and __skb_get_rxhash() are all for calculating hash value based by some fields in skb, mostly used for selecting queues by device drivers. Meanwhile, net/core/dev.c is bloating. Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-21 14:26:17 -05:00
Eric Dumazet	bc9540c637	net: splice: fix __splice_segment() commit `9ca1b22d6d` (net: splice: avoid high order page splitting) forgot that skb->head could need a copy into several page frags. This could be the case for loopback traffic mostly. Also remove now useless skb argument from linear_to_page() and __splice_segment() prototypes. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-20 23:19:36 -05:00
Eric Dumazet	82bda61956	net: splice: avoid high order page splitting splice() can handle pages of any order, but network code tries hard to split them in PAGE_SIZE units. Not quite successfully anyway, as __splice_segment() assumed poff < PAGE_SIZE. This is true for the skb->data part, not necessarily for the fragments. This patch removes this logic to give the pages as they are in the skb. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-20 23:19:35 -05:00
Vincent Bernat	d59577b6ff	sk-filter: Add ability to lock a socket filter program While a privileged program can open a raw socket, attach some restrictive filter and drop its privileges (or send the socket to an unprivileged program through some Unix socket), the filter can still be removed or modified by the unprivileged program. This commit adds a socket option to lock the filter (SO_LOCK_FILTER) preventing any modification of a socket filter program. This is similar to OpenBSD BIOCLOCK ioctl on bpf sockets, except even root is not allowed change/drop the filter. The state of the lock can be read with getsockopt(). No error is triggered if the state is not changed. -EPERM is returned when a user tries to remove the lock or to change/remove the filter while the lock is active. The check is done directly in sk_attach_filter() and sk_detach_filter() and does not affect only setsockopt() syscall. Signed-off-by: Vincent Bernat <bernat@luffy.cx> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-17 03:21:25 -05:00
Cong Wang	5bd30d3987	netpoll: fix a missing dev refcounting __dev_get_by_name() doesn't refcount the network device, so we have to do this by ourselves. Noticed by Eric. Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Jiri Pirko <jiri@resnulli.us> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-16 23:33:06 -05:00
Cong Wang	f92d318023	netpoll: fix a rtnl lock assertion failure v4: hold rtnl lock for the whole netpoll_setup() v3: remove the comment v2: use RCU read lock This patch fixes the following warning: [ 72.013864] RTNL: assertion failed at net/core/dev.c (4955) [ 72.017758] Pid: 668, comm: netpoll-prep-v6 Not tainted 3.8.0-rc1+ #474 [ 72.019582] Call Trace: [ 72.020295] [<ffffffff8176653d>] netdev_master_upper_dev_get+0x35/0x58 [ 72.022545] [<ffffffff81784edd>] netpoll_setup+0x61/0x340 [ 72.024846] [<ffffffff815d837e>] store_enabled+0x82/0xc3 [ 72.027466] [<ffffffff815d7e51>] netconsole_target_attr_store+0x35/0x37 [ 72.029348] [<ffffffff811c3479>] configfs_write_file+0xe2/0x10c [ 72.030959] [<ffffffff8115d239>] vfs_write+0xaf/0xf6 [ 72.032359] [<ffffffff81978a05>] ? sysret_check+0x22/0x5d [ 72.033824] [<ffffffff8115d453>] sys_write+0x5c/0x84 [ 72.035328] [<ffffffff819789d9>] system_call_fastpath+0x16/0x1b In case of other races, hold rtnl lock for the entire netpoll_setup() function. Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Jiri Pirko <jiri@resnulli.us> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-16 15:26:03 -05:00
Eric Dumazet	757b8b1d2b	net_sched: fix qdisc_pkt_len_init() commit `1def9238d4` (net_sched: more precise pkt_len computation) does a wrong computation of mac + network headers length, as it includes the padding before the frame. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuval Mintz <yuvalmin@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-16 00:41:19 -05:00
David S. Miller	4b87f92259	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: Documentation/networking/ip-sysctl.txt drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c Both conflicts were simply overlapping context. A build fix for qlcnic is in here too, simply removing the added devinit annotations which no longer exist. Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-15 15:05:59 -05:00
Eric Dumazet	cce894bb82	tcp: fix a panic on UP machines in reqsk_fastopen_remove spin_is_locked() on a non !SMP build is kind of useless. BUG_ON(!spin_is_locked(xx)) is guaranteed to crash. Just remove this check in reqsk_fastopen_remove() as the callers do hold the socket lock. Reported-by: Ketan Kulkarni <ketkulka@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jerry Chu <hkchu@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Dave Taht <dave.taht@gmail.com> Acked-by: H.K. Jerry Chu <hkchu@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-14 18:10:05 -05:00
Eric Dumazet	18aafc622a	net: splice: fix __splice_segment() commit `9ca1b22d6d` (net: splice: avoid high order page splitting) forgot that skb->head could need a copy into several page frags. This could be the case for loopback traffic mostly. Also remove now useless skb argument from linear_to_page() and __splice_segment() prototypes. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-11 16:48:08 -08:00
Stanislaw Gruszka	d07d7507bf	net, wireless: overwrite default_ethtool_ops Since: commit `2c60db0370` Author: Eric Dumazet <edumazet@google.com> Date: Sun Sep 16 09:17:26 2012 +0000 net: provide a default dev->ethtool_ops wireless core does not correctly assign ethtool_ops. After alloc_netdev*() call, some cfg80211 drivers provide they own ethtool_ops, but some do not. For them, wireless core provide generic cfg80211_ethtool_ops, which is assigned in NETDEV_REGISTER notify call: if (!dev->ethtool_ops) dev->ethtool_ops = &cfg80211_ethtool_ops; But after Eric's commit, dev->ethtool_ops is no longer NULL (on cfg80211 drivers without custom ethtool_ops), but points to &default_ethtool_ops. In order to fix the problem, provide function which will overwrite default_ethtool_ops and use it by wireless core. Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Acked-by: Johannes Berg <johannes@sipsolutions.net> Acked-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-11 15:55:48 -08:00
Alexander Duyck	87696f9234	net: Export __netdev_pick_tx so that it can be used in modules When testing with FCoE enabled we discovered that I had not exported __netdev_pick_tx. As a result ixgbe doesn't build with the RFC patches applied because ixgbe_select_queue was calling the function. This change corrects that build issue by correctly exporting __netdev_pick_tx so it can be used by modules. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-11 15:47:27 -08:00
Alexander Duyck	024e9679a2	net: Add support for XPS without sysfs being defined This patch makes it so that we can support transmit packet steering without sysfs needing to be enabled. The reason for making this change is to make it so that a driver can make use of the XPS even while the sysfs portion of the interface is not present. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-10 22:47:04 -08:00
Alexander Duyck	01c5f864e6	net: Rewrite netif_set_xps_queues to address several issues This change is meant to address several issues I found within the netif_set_xps_queues function. If the allocation of one of the maps to be assigned to new_dev_maps failed we could end up with the device map in an inconsistent state since we had already worked through a number of CPUs and removed or added the queue. To address that I split the process into several steps. The first of which is just the allocation of updated maps for CPUs that will need larger maps to store the queue. By doing this we can fail gracefully without actually altering the contents of the current device map. The second issue I found was the fact that we were always allocating a new device map even if we were not adding any queues. I have updated the code so that we only allocate a new device map if we are adding queues, otherwise if we are not adding any queues to CPUs we just skip to the removal process. The last change I made was to reuse the code from remove_xps_queue to remove the queue from the CPU. By making this change we can be consistent in how we go about adding and removing the queues from the CPUs. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-10 22:47:04 -08:00
Alexander Duyck	10cdc3f3cd	net: Rewrite netif_reset_xps_queue to allow for better code reuse This patch does a minor refactor on netif_reset_xps_queue to address a few items I noticed. First is the fact that we are doing removal of queues in both netif_reset_xps_queue and netif_set_xps_queue. Since there is no need to have the code in two places I am pushing it out into a separate function and will come back in another patch and reuse the code in netif_set_xps_queue. The second item this change addresses is the fact that the Tx queues were not getting their numa_node value cleared as a part of the XPS queue reset. This patch resolves that by resetting the numa_node value if the dev_maps value is set. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-10 22:47:04 -08:00
Alexander Duyck	537c00de1c	net: Add functions netif_reset_xps_queue and netif_set_xps_queue This patch adds two functions, netif_reset_xps_queue and netif_set_xps_queue. The main idea behind these two functions is to provide a mechanism through which drivers can update their defaults in regards to XPS. Currently no such mechanism exists and as a result we cannot use XPS for things such as ATR which would require a basic configuration to start in which the Tx queues are mapped to CPUs via a 1:1 mapping. With this change I am making it possible for drivers such as ixgbe to be able to use the XPS feature by controlling the default configuration. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-10 22:47:03 -08:00
Alexander Duyck	416186fbf8	net: Split core bits of netdev_pick_tx into __netdev_pick_tx This change splits the core bits of netdev_pick_tx into a separate function. The main idea behind this is to make this code accessible to select queue functions when they decide to process the standard path instead of their own custom path in their select queue routine. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-10 22:47:03 -08:00
Eric Dumazet	1def9238d4	net_sched: more precise pkt_len computation One long standing problem with TSO/GSO/GRO packets is that skb->len doesn't represent a precise amount of bytes on wire. Headers are only accounted for the first segment. For TCP, thats typically 66 bytes per 1448 bytes segment missing, an error of 4.5 % for normal MSS value. As consequences : 1) TBF/CBQ/HTB/NETEM/... can send more bytes than the assigned limits. 2) Device stats are slightly under estimated as well. Fix this by taking account of headers in qdisc_skb_cb(skb)->pkt_len computation. Packet schedulers should use qdisc pkt_len instead of skb->len for their bandwidth limitations, and TSO enabled devices drivers could use pkt_len if their statistics are not hardware assisted, and if they don't scratch skb->cb[] first word. Both egress and ingress paths work, thanks to commit `fda55eca5a` (net: introduce skb_transport_header_was_set()) : If GRO built a GSO packet, it also set the transport header for us. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Stephen Hemminger <shemminger@vyatta.com> Cc: Paolo Valente <paolo.valente@unimore.it> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-10 14:58:13 -08:00
Jiri Pirko	948b337e62	net: init perm_addr in register_netdevice() Benefit from the fact that dev->addr_assign_type is set to NET_ADDR_PERM in case the device has permanent address. This also fixes the problem that many drivers do not set perm_addr at all. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-08 18:00:47 -08:00
Cong Wang	b3d936f3ea	netpoll: add IPv6 support Currently, netpoll only supports IPv4. This patch adds IPv6 support to netpoll so that we can run netconsole over IPv6 network. Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-08 17:56:10 -08:00
Cong Wang	b7394d2429	netpoll: prepare for ipv6 This patch adjusts some struct and functions, to prepare for supporting IPv6. Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-08 17:56:09 -08:00
Eric Dumazet	fda55eca5a	net: introduce skb_transport_header_was_set() We have skb_mac_header_was_set() helper to tell if mac_header was set on a skb. We would like the same for transport_header. __netif_receive_skb() doesn't reset the transport header if already set by GRO layer. Note that network stacks usually reset the transport header anyway, after pulling the network header, so this change only allows a followup patch to have more precise qdisc pkt_len computation for GSO packets at ingress side. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-08 17:51:54 -08:00
Jiri Pirko	c03a14e8db	ethtool: consolidate work with ethtool_ops No need to check if ethtool_ops == NULL since it can't be. Use local variable "ops" in functions where it is present instead of dev->ethtool_ops Introduce local variable "ops" in functions where dev->ethtool_ops is used many times. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Reviewed-by: Ben Hutchings <bhutchings@solarflare.com> Reviewed-by: Flavio Leitner <fbl@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-07 19:54:19 -08:00
Eric Dumazet	9ca1b22d6d	net: splice: avoid high order page splitting splice() can handle pages of any order, but network code tries hard to split them in PAGE_SIZE units. Not quite successfully anyway, as __splice_segment() assumed poff < PAGE_SIZE. This is true for the skb->data part, not necessarily for the fragments. This patch removes this logic to give the pages as they are in the skb. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-06 21:07:24 -08:00
Jiri Pirko	2afb9b5334	ethtool: set addr_assign_type to NET_ADDR_SET when addr is passed on create In case user passed address via netlink during create, NET_ADDR_PERM was set. That is not correct so fix this by setting NET_ADDR_SET. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-06 21:05:02 -08:00
Jiri Pirko	8b98a70c28	net: remove no longer used netdev_set_bond_master() and netdev_set_master() Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-04 13:31:50 -08:00
Jiri Pirko	471cb5a33d	bonding: remove usage of dev->master Benefit from new upper dev list and free bonding from dev->master usage. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-04 13:31:50 -08:00
Jiri Pirko	49bd8fb0b1	netpoll: remove usage of dev->master Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-04 13:31:50 -08:00
Jiri Pirko	898e506171	rtnetlink: remove usage of dev->master Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-04 13:31:49 -08:00
Jiri Pirko	9ff162a8b9	net: introduce upper device lists This lists are supposed to serve for storing pointers to all upper devices. Eventually it will replace dev->master pointer which is used for bonding, bridge, team but it cannot be used for vlan, macvlan where there might be multiple upper present. In case the upper link is replacement for dev->master, it is marked with "master" flag. New upper device list resolves this limitation. Also, the information stored in lists is used for preventing looping setups like "bond->somethingelse->samebond" Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-04 13:31:49 -08:00
Jiri Pirko	fbdeca2d77	net: add address assign type "SET" This is the way to indicate that mac address of a device has been set by dev_set_mac_address() Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-03 22:37:36 -08:00
Jiri Pirko	f652151640	net: call add_device_randomness() only after successful mac change Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-03 22:37:35 -08:00
Jiri Pirko	e7c3273ec2	rtnl: use dev_set_mac_address() instead of plain ndo_ Benefit from existence of dev_set_mac_address() and remove duplicate code. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-03 22:37:35 -08:00
Daniel Borkmann	aa1113d9f8	net: filter: return -EINVAL if BPF_S_ANC* operation is not supported Currently, we return -EINVAL for malformed or wrong BPF filters. However, this is not done for BPF_S_ANC* operations, which makes it more difficult to detect if it's actually supported or not by the BPF machine. Therefore, we should also return -EINVAL if K is within the SKF_AD_OFF universe and the ancillary operation did not match. Why exactly is it needed? If tools such as libpcap/tcpdump want to make use of new ancillary operations (like filtering VLAN in kernel space), there is currently no sane way to test if this feature / BPF_S_ANC* op is present or not, since no error is returned. This patch will make life easier for that and allow for a proper usage for user space applications. There was concern, if this patch will break userland. Short answer: Yes and no. Long answer: It will "break" only for code that calls ... { BPF_LD \| BPF_(W\|H\|B) \| BPF_ABS, 0, 0, <K> }, ... where <K> is in [0xfffff000, 0xffffffff] _and_ <K> is not an ancillary. And here comes the BUT: assuming some old code will have such an instruction where <K> is between [0xfffff000, 0xffffffff] and it doesn't know ancillary operations, then this will give a non-expected / unwanted behavior as well (since we do not return the BPF machine with 0 after a failed load_pointer(), which was the case before introducing ancillary operations, but load sth. into the accumulator instead, and continue with the next instruction, for instance). Thus, user space code would already have been broken by introducing ancillary operations into the BPF machine per se. Code that does such a direct load, e.g. "load word at packet offset 0xffffffff into accumulator" ("ld [0xffffffff]") is quite broken, isn't it? The whole assumption of ancillary operations is that no-one intentionally calls things like "ld [0xffffffff]" and expect this word to be loaded from such a packet offset. Hence, we can also safely make use of this feature testing patch and facilitate application development. Therefore, at least from this patch onwards, we have for sure a check whether current or in future implemented BPF_S_ANC* ops are supported in the kernel. Patch was tested on x86_64. (Thanks to Eric for the previous review.) Cc: Eric Dumazet <eric.dumazet@gmail.com> Reported-by: Ani Sinha <ani@aristanetworks.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-12-30 02:30:28 -08:00
stephen hemminger	61c5e88aec	skbuff: make __kmalloc_reserve static Sparse detected case where this local function should be static. It may even allow some compiler optimizations. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-12-28 20:32:36 -08:00
Eric Dumazet	b2111724a6	net: use per task frag allocator in skb_append_datato_frags Use the new per task frag allocator in skb_append_datato_frags(), to reduce number of frags and page allocator overhead. Tested: ifconfig lo mtu 16436 perf record netperf -t UDP_STREAM ; perf report before : Throughput: 32928 Mbit/s 51.79% netperf [kernel.kallsyms] [k] copy_user_generic_string 5.98% netperf [kernel.kallsyms] [k] __alloc_pages_nodemask 5.58% netperf [kernel.kallsyms] [k] get_page_from_freelist 5.01% netperf [kernel.kallsyms] [k] __rmqueue 3.74% netperf [kernel.kallsyms] [k] skb_append_datato_frags 1.87% netperf [kernel.kallsyms] [k] prep_new_page 1.42% netperf [kernel.kallsyms] [k] next_zones_zonelist 1.28% netperf [kernel.kallsyms] [k] __inc_zone_state 1.26% netperf [kernel.kallsyms] [k] alloc_pages_current 0.78% netperf [kernel.kallsyms] [k] sock_alloc_send_pskb 0.74% netperf [kernel.kallsyms] [k] udp_sendmsg 0.72% netperf [kernel.kallsyms] [k] zone_watermark_ok 0.68% netperf [kernel.kallsyms] [k] __cpuset_node_allowed_softwall 0.67% netperf [kernel.kallsyms] [k] fib_table_lookup 0.60% netperf [kernel.kallsyms] [k] memcpy_fromiovecend 0.55% netperf [kernel.kallsyms] [k] __udp4_lib_lookup after: Throughput: 47185 Mbit/s 61.74% netperf [kernel.kallsyms] [k] copy_user_generic_string 2.07% netperf [kernel.kallsyms] [k] prep_new_page 1.98% netperf [kernel.kallsyms] [k] skb_append_datato_frags 1.02% netperf [kernel.kallsyms] [k] sock_alloc_send_pskb 0.97% netperf [kernel.kallsyms] [k] enqueue_task_fair 0.97% netperf [kernel.kallsyms] [k] udp_sendmsg 0.91% netperf [kernel.kallsyms] [k] __ip_route_output_key 0.88% netperf [kernel.kallsyms] [k] __netif_receive_skb 0.87% netperf [kernel.kallsyms] [k] fib_table_lookup 0.85% netperf [kernel.kallsyms] [k] resched_task 0.78% netperf [kernel.kallsyms] [k] __udp4_lib_lookup 0.77% netperf [kernel.kallsyms] [k] _raw_spin_lock_irqsave Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-12-28 15:25:19 -08:00
Jiri Pirko	9a57247f31	rtnl: expose carrier value with possibility to set it Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Flavio Leitner <fbl@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-12-28 15:24:18 -08:00
Jiri Pirko	fdae0fde53	net: allow to change carrier via sysfs Make carrier writable Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Flavio Leitner <fbl@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-12-28 15:24:18 -08:00
Jiri Pirko	4bf84c35c6	net: add change_carrier netdev op This allows a driver to register change_carrier callback which will be called whenever user will like to change carrier state. This is useful for devices like dummy, gre, team and so on. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Flavio Leitner <fbl@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-12-28 15:24:18 -08:00
Greg KH	8baf82b368	CONFIG_HOTPLUG removal from networking core CONFIG_HOTPLUG is always enabled now, so remove the unused code that was trying to be compiled out when this option was disabled, in the networking core. Cc: Bill Pemberton <wfp5p@virginia.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-12-22 00:03:00 -08:00
Eric Dumazet	30e6c9fa93	net: devnet_rename_seq should be a seqcount Using a seqlock for devnet_rename_seq is not a good idea, as device_rename() can sleep. As we hold RTNL, we dont need a protection for writers, and only need a seqcount so that readers can catch a change done by a writer. Bug added in commit `c91f6df2db` (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name) Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Brian Haley <brian.haley@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-12-21 13:14:01 -08:00
Linus Torvalds	a2faf2fc53	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull (again) user namespace infrastructure changes from Eric Biederman: "Those bugs, those darn embarrasing bugs just want don't want to get fixed. Linus I just updated my mirror of your kernel.org tree and it appears you successfully pulled everything except the last 4 commits that fix those embarrasing bugs. When you get a chance can you please repull my branch" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: userns: Fix typo in description of the limitation of userns_install userns: Add a more complete capability subset test to commit_creds userns: Require CAP_SYS_ADMIN for most uses of setns. Fix cap_capable to only allow owners in the parent user namespace to have caps.	2012-12-18 10:55:28 -08:00
Linus Torvalds	6a2b60b17b	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull user namespace changes from Eric Biederman: "While small this set of changes is very significant with respect to containers in general and user namespaces in particular. The user space interface is now complete. This set of changes adds support for unprivileged users to create user namespaces and as a user namespace root to create other namespaces. The tyranny of supporting suid root preventing unprivileged users from using cool new kernel features is broken. This set of changes completes the work on setns, adding support for the pid, user, mount namespaces. This set of changes includes a bunch of basic pid namespace cleanups/simplifications. Of particular significance is the rework of the pid namespace cleanup so it no longer requires sending out tendrils into all kinds of unexpected cleanup paths for operation. At least one case of broken error handling is fixed by this cleanup. The files under /proc/<pid>/ns/ have been converted from regular files to magic symlinks which prevents incorrect caching by the VFS, ensuring the files always refer to the namespace the process is currently using and ensuring that the ptrace_mayaccess permission checks are always applied. The files under /proc/<pid>/ns/ have been given stable inode numbers so it is now possible to see if different processes share the same namespaces. Through the David Miller's net tree are changes to relax many of the permission checks in the networking stack to allowing the user namespace root to usefully use the networking stack. Similar changes for the mount namespace and the pid namespace are coming through my tree. Two small changes to add user namespace support were commited here adn in David Miller's -net tree so that I could complete the work on the /proc/<pid>/ns/ files in this tree. Work remains to make it safe to build user namespaces and 9p, afs, ceph, cifs, coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so the Kconfig guard remains in place preventing that user namespaces from being built when any of those filesystems are enabled. Future design work remains to allow root users outside of the initial user namespace to mount more than just /proc and /sys." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (38 commits) proc: Usable inode numbers for the namespace file descriptors. proc: Fix the namespace inode permission checks. proc: Generalize proc inode allocation userns: Allow unprivilged mounts of proc and sysfs userns: For /proc/self/{uid,gid}_map derive the lower userns from the struct file procfs: Print task uids and gids in the userns that opened the proc file userns: Implement unshare of the user namespace userns: Implent proc namespace operations userns: Kill task_user_ns userns: Make create_new_namespaces take a user_ns parameter userns: Allow unprivileged use of setns. userns: Allow unprivileged users to create new namespaces userns: Allow setting a userns mapping to your current uid. userns: Allow chown and setgid preservation userns: Allow unprivileged users to create user namespaces. userns: Ignore suid and sgid on binaries if the uid or gid can not be mapped userns: fix return value on mntns_install() failure vfs: Allow unprivileged manipulation of the mount namespace. vfs: Only support slave subtrees across different user namespaces vfs: Add a user namespace reference from struct mnt_namespace ...	2012-12-17 15:44:47 -08:00
Eric W. Biederman	5e4a08476b	userns: Require CAP_SYS_ADMIN for most uses of setns. Andy Lutomirski <luto@amacapital.net> found a nasty little bug in the permissions of setns. With unprivileged user namespaces it became possible to create new namespaces without privilege. However the setns calls were relaxed to only require CAP_SYS_ADMIN in the user nameapce of the targed namespace. Which made the following nasty sequence possible. pid = clone(CLONE_NEWUSER \| CLONE_NEWNS); if (pid == 0) { /* child / system("mount --bind /home/me/passwd /etc/passwd"); } else if (pid != 0) { / parent */ char path[PATH_MAX]; snprintf(path, sizeof(path), "/proc/%u/ns/mnt"); fd = open(path, O_RDONLY); setns(fd, 0); system("su -"); } Prevent this possibility by requiring CAP_SYS_ADMIN in the current user namespace when joing all but the user namespace. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2012-12-14 16:12:03 -08:00
Linus Torvalds	6be35c700f	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next Pull networking changes from David Miller: 1) Allow to dump, monitor, and change the bridge multicast database using netlink. From Cong Wang. 2) RFC 5961 TCP blind data injection attack mitigation, from Eric Dumazet. 3) Networking user namespace support from Eric W. Biederman. 4) tuntap/virtio-net multiqueue support by Jason Wang. 5) Support for checksum offload of encapsulated packets (basically, tunneled traffic can still be checksummed by HW). From Joseph Gasparakis. 6) Allow BPF filter access to VLAN tags, from Eric Dumazet and Daniel Borkmann. 7) Bridge port parameters over netlink and BPDU blocking support from Stephen Hemminger. 8) Improve data access patterns during inet socket demux by rearranging socket layout, from Eric Dumazet. 9) TIPC protocol updates and cleanups from Ying Xue, Paul Gortmaker, and Jon Maloy. 10) Update TCP socket hash sizing to be more in line with current day realities. The existing heurstics were choosen a decade ago. From Eric Dumazet. 11) Fix races, queue bloat, and excessive wakeups in ATM and associated drivers, from Krzysztof Mazur and David Woodhouse. 12) Support DOVE (Distributed Overlay Virtual Ethernet) extensions in VXLAN driver, from David Stevens. 13) Add "oops_only" mode to netconsole, from Amerigo Wang. 14) Support set and query of VEB/VEPA bridge mode via PF_BRIDGE, also allow DCB netlink to work on namespaces other than the initial namespace. From John Fastabend. 15) Support PTP in the Tigon3 driver, from Matt Carlson. 16) tun/vhost zero copy fixes and improvements, plus turn it on by default, from Michael S. Tsirkin. 17) Support per-association statistics in SCTP, from Michele Baldessari. And many, many, driver updates, cleanups, and improvements. Too numerous to mention individually. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1722 commits) net/mlx4_en: Add support for destination MAC in steering rules net/mlx4_en: Use generic etherdevice.h functions. net: ethtool: Add destination MAC address to flow steering API bridge: add support of adding and deleting mdb entries bridge: notify mdb changes via netlink ndisc: Unexport ndisc_{build,send}_skb(). uapi: add missing netconf.h to export list pkt_sched: avoid requeues if possible solos-pci: fix double-free of TX skb in DMA mode bnx2: Fix accidental reversions. bna: Driver Version Updated to 3.1.2.1 bna: Firmware update bna: Add RX State bna: Rx Page Based Allocation bna: TX Intr Coalescing Fix bna: Tx and Rx Optimizations bna: Code Cleanup and Enhancements ath9k: check pdata variable before dereferencing it ath5k: RX timestamp is reported at end of frame ath9k_htc: RX timestamp is reported at end of frame ...	2012-12-12 18:07:07 -08:00
Linus Torvalds	d206e09036	Merge branch 'for-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup changes from Tejun Heo: "A lot of activities on cgroup side. The big changes are focused on making cgroup hierarchy handling saner. - cgroup_rmdir() had peculiar semantics - it allowed cgroup destruction to be vetoed by individual controllers and tried to drain refcnt synchronously. The vetoing never worked properly and caused good deal of contortions in cgroup. memcg was the last reamining user. Michal Hocko removed the usage and cgroup_rmdir() path has been simplified significantly. This was done in a separate branch so that the memcg people can base further memcg changes on top. - The above allowed cleaning up cgroup lifecycle management and implementation of generic cgroup iterators which are used to improve hierarchy support. - cgroup_freezer updated to allow migration in and out of a frozen cgroup and handle hierarchy. If a cgroup is frozen, all descendant cgroups are frozen. - netcls_cgroup and netprio_cgroup updated to handle hierarchy properly. - Various fixes and cleanups. - Two merge commits. One to pull in memcg and rmdir cleanups (needed to build iterators). The other pulled in cgroup/for-3.7-fixes for device_cgroup fixes so that further device_cgroup patches can be stacked on top." Fixed up a trivial conflict in mm/memcontrol.c as per Tejun (due to commit `bea8c150a7` ("memcg: fix hotplugged memory zone oops") in master touching code close to commit `2ef37d3fe4` ("memcg: Simplify mem_cgroup_force_empty_list error handling") in for-3.8) * 'for-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (65 commits) cgroup: update Documentation/cgroups/00-INDEX cgroup_rm_file: don't delete the uncreated files cgroup: remove subsystem files when remounting cgroup cgroup: use cgroup_addrm_files() in cgroup_clear_directory() cgroup: warn about broken hierarchies only after css_online cgroup: list_del_init() on removed events cgroup: fix lockdep warning for event_control cgroup: move list add after list head initilization netprio_cgroup: allow nesting and inherit config on cgroup creation netprio_cgroup: implement netprio[_set]_prio() helpers netprio_cgroup: use cgroup->id instead of cgroup_netprio_state->prioidx netprio_cgroup: reimplement priomap expansion netprio_cgroup: shorten variable names in extend_netdev_table() netprio_cgroup: simplify write_priomap() netcls_cgroup: move config inheritance to ->css_online() and remove .broken_hierarchy marking cgroup: remove obsolete guarantee from cgroup_task_migrate. cgroup: add cgroup->id cgroup, cpuset: remove cgroup_subsys->post_clone() cgroup: s/CGRP_CLONE_CHILDREN/CGRP_CPUSET_CLONE_CHILDREN/ cgroup: rename ->create/post_create/pre_destroy/destroy() to ->css_alloc/online/offline/free() ...	2012-12-12 08:18:24 -08:00
Eric Dumazet	75be437230	net: gro: avoid double copy in skb_gro_receive() __copy_skb_header(nskb, p) already copied p->cb[], no need to copy it again. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-12-11 13:44:09 -05:00
Abhijit Pawar	a71258d79e	net: remove obsolete simple_strto<foo> This patch removes the redundant occurences of simple_strto<foo> Signed-off-by: Abhijit Pawar <abhi.c.pawar@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-12-11 12:49:53 -05:00
Eric Dumazet	89c5fa3369	net: gro: dev_gro_receive() cleanup __napi_gro_receive() is inlined from two call sites for no good reason. Lets move the prep stuff in a function of its own, called only if/when needed. This saves 300 bytes on x86 : # size net/core/dev.o.after net/core/dev.o.before text data bss dec hex filename 51968 1238 1040 54246 d3e6 net/core/dev.o.before 51664 1238 1040 53942 d2b6 net/core/dev.o.after Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-12-11 12:49:53 -05:00
Abhijit Pawar	4b5511ebc7	net: remove obsolete simple_strto<foo> This patch replace the obsolete simple_strto<foo> with kstrto<foo> Signed-off-by: Abhijit Pawar <abhi.c.pawar@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-12-10 14:09:00 -05:00
Alexander Duyck	fc70fb640b	net: Handle encapsulated offloads before fragmentation or handing to lower dev This change allows the VXLAN to enable Tx checksum offloading even on devices that do not support encapsulated checksum offloads. The advantage to this is that it allows for the lower device to change due to routing table changes without impacting features on the VXLAN itself. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-12-09 00:20:28 -05:00
Joseph Gasparakis	6a674e9c75	net: Add support for hardware-offloaded encapsulation This patch adds support in the kernel for offloading in the NIC Tx and Rx checksumming for encapsulated packets (such as VXLAN and IP GRE). For Tx encapsulation offload, the driver will need to set the right bits in netdev->hw_enc_features. The protocol driver will have to set the skb->encapsulation bit and populate the inner headers, so the NIC driver will use those inner headers to calculate the csum in hardware. For Rx encapsulation offload, the driver will need to set again the skb->encapsulation flag and the skb->ip_csum to CHECKSUM_UNNECESSARY. In that case the protocol driver should push the decapsulated packet up to the stack, again with CHECKSUM_UNNECESSARY. In ether case, the protocol driver should set the skb->encapsulation flag back to zero. Finally the protocol driver should have NETIF_F_RXCSUM flag set in its features. Signed-off-by: Joseph Gasparakis <joseph.gasparakis@intel.com> Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-12-09 00:20:28 -05:00
Eric Dumazet	c3c7c254b2	net: gro: fix possible panic in skb_gro_receive() commit `2e71a6f808` (net: gro: selective flush of packets) added a bug for skbs using frag_list. This part of the GRO stack is rarely used, as it needs skb not using a page fragment for their skb->head. Most drivers do use a page fragment, but some of them use GFP_KERNEL allocations for the initial fill of their RX ring buffer. napi_gro_flush() overwrite skb->prev that was used for these skb to point to the last skb in frag_list. Fix this using a separate field in struct napi_gro_cb to point to the last fragment. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-12-07 14:39:29 -05:00
Jiri Pirko	e3d8fabee3	net: call notifiers for mtu change even if iface is not up Do the same thing as in set mac. Call notifiers every time. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-12-07 12:22:30 -05:00
Cong Wang	b93196dc5a	net: fix some compiler warning in net/core/neighbour.c net/core/neighbour.c:65:12: warning: 'zero' defined but not used [-Wunused-variable] net/core/neighbour.c:66:12: warning: 'unres_qlen_max' defined but not used [-Wunused-variable] These variables are only used when CONFIG_SYSCTL is defined, so move them under #ifdef CONFIG_SYSCTL. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Cong Wang <amwang@redhat.com> Acked-by: Shan Wei <davidshan@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-12-05 21:50:37 -05:00
Shan Wei	ce46cc64d4	net: neighbour: prohibit negative value for unres_qlen_bytes parameter unres_qlen_bytes and unres_qlen are int type. But multiple relation(unres_qlen_bytes = unres_qlen * SKB_TRUESIZE(ETH_FRAME_LEN)) will cause type overflow when seting unres_qlen. e.g. $ echo 1027506 > /proc/sys/net/ipv4/neigh/eth1/unres_qlen $ cat /proc/sys/net/ipv4/neigh/eth1/unres_qlen 1182657265 $ cat /proc/sys/net/ipv4/neigh/eth1/unres_qlen_bytes -2147479756 The gutted value is not that we setting。 But user/administrator don't know this is caused by int type overflow. what's more, it is meaningless and even dangerous that unres_qlen_bytes is set with negative number. Because, for unresolved neighbour address, kernel will cache packets without limit in __neigh_event_send()(e.g. (u32)-1 = 2GB). Signed-off-by: Shan Wei <davidshan@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-12-05 16:01:28 -05:00
Serge Hallyn	4e66ae2ea3	net: dev_change_net_namespace: send a KOBJ_REMOVED/KOBJ_ADD When a new nic is created in namespace ns1, the kernel sends a KOBJ_ADD uevent to ns1. When the nic is moved to ns2, we only send a KOBJ_MOVE to ns2, and nothing to ns1. This patch changes that behavior so that when moving a nic from ns1 to ns2, we send a KOBJ_REMOVED to ns1 and KOBJ_ADD to ns2. (The KOBJ_MOVE is still sent to ns2). The effects of this can be seen when starting and stopping containers in an upstart based host. Lxc will create a pair of veth nics, the kernel sends KOBJ_ADD, and upstart starts network-instance jobs for each. When one nic is moved to the container, because no KOBJ_REMOVED event is received, the network-instance job for that veth never goes away. This was reported at https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1065589 With this patch the networ-instance jobs properly go away. The other oddness solved here is that if a nic is passed into a running upstart-based container, without this patch no network-instance job is started in the container. But when the container creates a new nic itself (ip link add new type veth) then network-interface jobs are created. With this patch, behavior comes in line with a regular host. v2: also send KOBJ_ADD to new netns. There will then be a _MOVE event from the device_rename() call, but that should be innocuous. Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Daniel Lezcano <daniel.lezcano@free.fr> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-12-04 13:25:57 -05:00
Rami Rosen	c07135633b	rtnelink: remove unused parameter from rtnl_create_link(). This patch removes an unused parameter (src_net) from rtnl_create_link() method and from the method single invocation, in veth. This parameter was used in the past when calling ops->get_tx_queues(src_net, tb) in rtnl_create_link(). The get_tx_queues() member of rtnl_link_ops was replaced by two methods, get_num_tx_queues() and get_num_rx_queues(), which do not get any parameter. This was done in commit `d40156aa5e` by Jiri Pirko ("rtnl: allow to specify different num for rx and tx queue count"). Signed-off-by: Rami Rosen <ramirose@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-11-30 12:24:40 -05:00
Rami Rosen	bb728820fe	core: make GRO methods static. This patch changes three methods to be static and removes their EXPORT_SYMBOLs in core/dev.c and their external declaration in netdevice.h. The methods, dev_gro_receive(), napi_frags_finish() and napi_skb_finish(), which are in the GRO rx path, are not used outside core/dev.c. Signed-off-by: Rami Rosen <ramirose@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-11-29 13:18:32 -05:00
Brian Haley	c91f6df2db	sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name Instead of having the getsockopt() of SO_BINDTODEVICE return an index, which will then require another call like if_indextoname() to get the actual interface name, have it return the name directly. This also matches the existing man page description on socket(7) which mentions the argument being an interface name. If the value has not been set, zero is returned and optlen will be set to zero to indicate there is no interface name present. Added a seqlock to protect this code path, and dev_ifname(), from someone changing the device name via dev_change_name(). v2: Added seqlock protection while copying device name. v3: Fixed word wrap in patch. Signed-off-by: Brian Haley <brian.haley@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-11-26 17:22:14 -05:00
David S. Miller	24bc518a68	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/wireless/iwlwifi/pcie/tx.c Minor iwlwifi conflict in TX queue disabling between 'net', which removed a bogus warning, and 'net-next' which added some status register poking code. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-11-25 12:49:17 -05:00
Tejun Heo	811d8d6ff5	netprio_cgroup: allow nesting and inherit config on cgroup creation Inherit netprio configuration from ->css_online(), allow nesting and remove .broken_hierarchy marking. This makes netprio_cgroup's behavior match netcls_cgroup's. Note that this patch changes userland-visible behavior. Nesting is allowed and the first level cgroups below the root cgroup behave differently - they inherit priorities from the root cgroup on creation instead of starting with 0. This is unfortunate but not doing so is much crazier. Signed-off-by: Tejun Heo <tj@kernel.org> Tested-and-Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Acked-by: David S. Miller <davem@davemloft.net>	2012-11-22 07:32:47 -08:00
Tejun Heo	666b0ebe2b	netprio_cgroup: implement netprio[_set]_prio() helpers Introduce two helpers - netprio_prio() and netprio_set_prio() - which hide the details of priomap access and expansion. This will help implementing hierarchy support. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Neil Horman <nhorman@tuxdriver.com> Tested-and-Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Acked-by: David S. Miller <davem@davemloft.net>	2012-11-22 07:32:47 -08:00
Tejun Heo	88d642fa2c	netprio_cgroup: use cgroup->id instead of cgroup_netprio_state->prioidx With priomap expansion no longer depending on knowing max id allocated, netprio_cgroup can use cgroup->id insted of cs->prioidx. Drop prioidx alloc/free logic and convert all uses to cgroup->id. * In cgrp_css_alloc(), parent->id test is moved above @cs allocation to simplify error path. * In cgrp_css_free(), @cs assignment is made initialization. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Neil Horman <nhorman@tuxdriver.com> Tested-and-Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Acked-by: David S. Miller <davem@davemloft.net>	2012-11-22 07:32:47 -08:00
Tejun Heo	4a6ee25c7e	netprio_cgroup: reimplement priomap expansion netprio kept track of the highest prioidx allocated and resized priomaps accordingly when necessary. This makes it necessary to keep track of prioidx allocation and may end up resizing on every new prioidx. Update extend_netdev_table() such that it takes @target_idx which the priomap should be able to accomodate. If the priomap is large enough, nothing happens; otherwise, the size is doubled until @target_idx can be accomodated. This makes max_prioidx and write_update_netdev_table() unnecessary. write_priomap() now calls extend_netdev_table() directly. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Neil Horman <nhorman@tuxdriver.com> Tested-and-Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Acked-by: David S. Miller <davem@davemloft.net>	2012-11-22 07:32:46 -08:00
Tejun Heo	52bca930c9	netprio_cgroup: shorten variable names in extend_netdev_table() The function is about to go through a rewrite. In preparation, shorten the variable names so that we don't repeat "priomap" so often. This patch is cosmetic. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Neil Horman <nhorman@tuxdriver.com> Tested-and-Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Acked-by: David S. Miller <davem@davemloft.net>	2012-11-22 07:32:46 -08:00
Tejun Heo	6d5759dd02	netprio_cgroup: simplify write_priomap() sscanf() doesn't bite. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Neil Horman <nhorman@tuxdriver.com> Tested-and-Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Acked-by: David S. Miller <davem@davemloft.net>	2012-11-22 07:32:46 -08:00
John W. Linville	f30a944392	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless into for-davem John W. Linville says: ==================== This is a batch of fixes intended for 3.7... Included are two pulls. Regarding the mac80211 tree, Johannes says: "Please pull my mac80211.git tree (see below) to get two more fixes for 3.7. Both fix regressions introduced before this cycle that weren't noticed until now, one for IBSS not cleaning up properly and the other to add back the "wireless" sysfs directory for Fedora's startup scripts." Regarding the iwlwifi tree, Johannes says: "Please also pull my iwlwifi.git tree, I have two fixes: one to remove a spurious warning that can actually trigger in legitimate situations, and the other to fix a regression from when monitor mode was changed to use the "sniffer" firmware mode." Also included is an nfc tree pull. Samuel says: "We mostly have pn533 fixes here, 2 memory leaks and an early unlocking fix. Moreover, we also have an LLCP adapter linked list insertion fix." On top of that, a few more bits... Albert Pool adds a USB ID to rtlwifi. Bing Zhao provides two mwifiex fixes -- one to fix a system hang during a command timeout, and the other to properly report a suspend error to the MMC core. Finally, Sujith Manoharan fixes a thinko that would trigger an ath9k hang during device reset. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2012-11-21 11:48:31 -05:00
Sachin Kamat	388dfc2d2d	net: Remove redundant null check before kfree in dev.c kfree on a null pointer is a no-op. Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-11-20 13:48:09 -05:00
Eric W. Biederman	98f842e675	proc: Usable inode numbers for the namespace file descriptors. Assign a unique proc inode to each namespace, and use that inode number to ensure we only allocate at most one proc inode for every namespace in proc. A single proc inode per namespace allows userspace to test to see if two processes are in the same namespace. This has been a long requested feature and only blocked because a naive implementation would put the id in a global space and would ultimately require having a namespace for the names of namespaces, making migration and certain virtualization tricks impossible. We still don't have per superblock inode numbers for proc, which appears necessary for application unaware checkpoint/restart and migrations (if the application is using namespace file descriptors) but that is now allowd by the design if it becomes important. I have preallocated the ipc and uts initial proc inode numbers so their structures can be statically initialized. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2012-11-20 04:19:49 -08:00
Eric W. Biederman	142e1d1d5f	userns: Allow unprivileged use of setns. - Push the permission check from the core setns syscall into the setns install methods where the user namespace of the target namespace can be determined, and used in a ns_capable call. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2012-11-20 04:17:42 -08:00
Johannes Berg	01f1c6b994	net: remove unnecessary wireless includes The wireless and wext includes in net-sysfs.c aren't needed, so remove them. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-11-19 19:10:36 -05:00
Shan Wei	1f743b0765	net: core: use this_cpu_ptr per-cpu helper flush_tasklet is a struct, not a pointer in percpu var. so use this_cpu_ptr to get the member pointer. Signed-off-by: Shan Wei <davidshan@tencent.com> Reviewed-by: Christoph Lameter <cl@linux.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-11-19 18:59:44 -05:00
John W. Linville	e56108d927	Merge branch 'for-john' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211	2012-11-19 14:37:43 -05:00
Tejun Heo	92fb97487a	cgroup: rename ->create/post_create/pre_destroy/destroy() to ->css_alloc/online/offline/free() Rename cgroup_subsys css lifetime related callbacks to better describe what their roles are. Also, update documentation. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2012-11-19 08:13:38 -08:00
Eric W. Biederman	038e7332b8	userns: make each net (net_ns) belong to a user_ns The user namespace which creates a new network namespace owns that namespace and all resources created in it. This way we can target capability checks for privileged operations against network resources to the user_ns which created the network namespace in which the resource lives. Privilege to the user namespace which owns the network namespace, or any parent user namespace thereof, provides the same privilege to the network resource. This patch is reworked from a version originally by Serge E. Hallyn <serge.hallyn@canonical.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2012-11-18 22:46:23 -08:00
Eric W. Biederman	d727abcb23	netns: Deduplicate and fix copy_net_ns when !CONFIG_NET_NS The copy of copy_net_ns used when the network stack is not built is broken as it does not return -EINVAL when attempting to create a new network namespace. We don't even have a previous network namespace. Since we need a copy of copy_net_ns in net/net_namespace.h that is available when the networking stack is not built at all move the correct version of copy_net_ns from net_namespace.c into net_namespace.h Leaving us with just 2 versions of copy_net_ns. One version for when we compile in network namespace suport and another stub for all other occasions. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2012-11-18 22:46:19 -08:00
Eric W. Biederman	b51642f6d7	net: Enable a userns root rtnl calls that are safe for unprivilged users - Only allow moving network devices to network namespaces you have CAP_NET_ADMIN privileges over. - Enable creating/deleting/modifying interfaces - Enable adding/deleting addresses - Enable adding/setting/deleting neighbour entries - Enable adding/removing routes - Enable adding/removing fib rules - Enable setting the forwarding state - Enable adding/removing ipv6 address labels - Enable setting bridge parameter Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-11-18 20:33:36 -05:00
Eric W. Biederman	5e1fccc0bf	net: Allow userns root control of the core of the network stack. Allow an unpriviled user who has created a user namespace, and then created a network namespace to effectively use the new network namespace, by reducing capable(CAP_NET_ADMIN) and capable(CAP_NET_RAW) calls to be ns_capable(net->user_ns, CAP_NET_ADMIN), or capable(net->user_ns, CAP_NET_RAW) calls. Settings that merely control a single network device are allowed. Either the network device is a logical network device where restrictions make no difference or the network device is hardware NIC that has been explicity moved from the initial network namespace. In general policy and network stack state changes are allowed while resource control is left unchanged. Allow ethtool ioctls. Allow binding to network devices. Allow setting the socket mark. Allow setting the socket priority. Allow setting the network device alias via sysfs. Allow setting the mtu via sysfs. Allow changing the network device flags via sysfs. Allow setting the network device group via sysfs. Allow the following network device ioctls. SIOCGMIIPHY SIOCGMIIREG SIOCSIFNAME SIOCSIFFLAGS SIOCSIFMETRIC SIOCSIFMTU SIOCSIFHWADDR SIOCSIFSLAVE SIOCADDMULTI SIOCDELMULTI SIOCSIFHWBROADCAST SIOCSMIIREG SIOCBONDENSLAVE SIOCBONDRELEASE SIOCBONDSETHWADDR SIOCBONDCHANGEACTIVE SIOCBRADDIF SIOCBRDELIF SIOCSHWTSTAMP Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-11-18 20:32:45 -05:00
Eric W. Biederman	00f70de09c	net: Allow userns root to force the scm creds If the user calling sendmsg has the appropriate privieleges in their user namespace allow them to set the uid, gid, and pid in the SCM_CREDENTIALS control message to any valid value. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-11-18 20:32:45 -05:00
Eric W. Biederman	dfc47ef863	net: Push capable(CAP_NET_ADMIN) into the rtnl methods - In rtnetlink_rcv_msg convert the capable(CAP_NET_ADMIN) check to ns_capable(net->user-ns, CAP_NET_ADMIN). Allowing unprivileged users to make netlink calls to modify their local network namespace. - In the rtnetlink doit methods add capable(CAP_NET_ADMIN) so that calls that are not safe for unprivileged users are still protected. Later patches will remove the extra capable calls from methods that are safe for unprivilged users. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-11-18 20:32:44 -05:00
Eric W. Biederman	464dc801c7	net: Don't export sysctls to unprivileged users In preparation for supporting the creation of network namespaces by unprivileged users, modify all of the per net sysctl exports and refuse to allow them to unprivileged users. This makes it safe for unprivileged users in general to access per net sysctls, and allows sysctls to be exported to unprivileged users on an individual basis as they are deemed safe. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-11-18 20:30:55 -05:00
Eric W. Biederman	d328b83682	userns: make each net (net_ns) belong to a user_ns The user namespace which creates a new network namespace owns that namespace and all resources created in it. This way we can target capability checks for privileged operations against network resources to the user_ns which created the network namespace in which the resource lives. Privilege to the user namespace which owns the network namespace, or any parent user namespace thereof, provides the same privilege to the network resource. This patch is reworked from a version originally by Serge E. Hallyn <serge.hallyn@canonical.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-11-18 20:30:55 -05:00
Eric W. Biederman	2407dc25f7	netns: Deduplicate and fix copy_net_ns when !CONFIG_NET_NS The copy of copy_net_ns used when the network stack is not built is broken as it does not return -EINVAL when attempting to create a new network namespace. We don't even have a previous network namespace. Since we need a copy of copy_net_ns in net/net_namespace.h that is available when the networking stack is not built at all move the correct version of copy_net_ns from net_namespace.c into net_namespace.h Leaving us with just 2 versions of copy_net_ns. One version for when we compile in network namespace suport and another stub for all other occasions. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-11-18 20:30:55 -05:00
David S. Miller	67f4efdce7	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Minor line offset auto-merges. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-11-17 22:00:43 -05:00
Johannes Berg	38c1a01cf1	wireless: add back sysfs directory commit `35b2a113cb` broke (at least) Fedora's networking scripts, they check for the existence of the wireless directory. As the files aren't used, add the directory back and not the files. Also do it for both drivers based on the old wireless extensions and cfg80211, regardless of whether the compat code for wext is built into cfg80211 or not. Cc: stable@vger.kernel.org [3.6] Reported-by: Dave Airlie <airlied@gmail.com> Reported-by: Bill Nottingham <notting@redhat.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2012-11-16 21:37:52 +01:00
Tom Herbert	baefa31db2	net-rps: Fix brokeness causing OOO packets In commit `c445477d74` which adds aRFS to the kernel, the CPU selected for RFS is not set correctly when CPU is changing. This is causing OOO packets and probably other issues. Signed-off-by: Tom Herbert <therbert@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-11-16 14:35:56 -05:00
Eric Dumazet	c53aa5058a	net: use right lock in __dev_remove_offload offload_base is protected by offload_lock, not ptype_lock Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Vlad Yasevich <vyasevic@redhat.com> Acked-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-11-16 13:41:08 -05:00
Jiri Pirko	a652208e0b	net: correct check in dev_addr_del() Check (ha->addr == dev->dev_addr) is always true because dev_addr_init() sets this. Correct the check to behave properly on addr removal. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-11-15 17:57:53 -05:00
Vlad Yasevich	f191a1d17f	net: Remove code duplication between offload structures Move the offload callbacks into its own structure. Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-11-15 17:39:51 -05:00
Vlad Yasevich	22061d8014	net: Switch to using the new packet offload infrustructure Convert to using the new GSO/GRO registration mechanism and new packet offload structure. Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-11-15 17:36:17 -05:00
Vlad Yasevich	62532da9d5	net: Add generic packet offload infrastructure. Create a new data structure to contain the GRO/GSO callbacks and add a new registration mechanism. Singed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-11-15 17:36:16 -05:00
David S. Miller	d4185bbf62	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c Minor conflict between the BCM_CNIC define removal in net-next and a bug fix added to net. Based upon a conflict resolution patch posted by Stephen Rothwell. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-11-10 18:32:51 -05:00
Eric Leblond	a3d744e995	af-packet: fix oops when socket is not present Due to a NULL dereference, the following patch is causing oops in normal trafic condition: commit `c0de08d042` Author: Eric Leblond <eric@regit.org> Date: Thu Aug 16 22:02:58 2012 +0000 af_packet: don't emit packet on orig fanout group This buggy patch was a feature fix and has reached most stable branches. When skb->sk is NULL and when packet fanout is used, there is a crash in match_fanout_group where skb->sk is accessed. This patch fixes the issue by returning false as soon as the socket is NULL: this correspond to the wanted behavior because the kernel as to resend the skb to all the listening socket in this case. Signed-off-by: Eric Leblond <eric@regit.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-11-07 15:40:14 -05:00

1 2 3 4 5 ...

2863 Commits