"77c1090 net: fix infinite loop in __skb_recv_datagram()" (v3.8) introduced a
regression:
After that commit, recv can no longer peek beyond a 0-sized skb in the queue.
__skb_recv_datagram() instead stops at the first skb with len == 0 and results
in the system call failing with -EFAULT via skb_copy_datagram_iovec().
When peeking at an offset with 0-sized skb(s), each one of those is received
only once, in sequence. The offset starts moving forward again after receiving
datagrams with len > 0.
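For reference, a minimal userspace sketch of the peek-at-offset pattern this
restores (fd is an already-bound datagram socket, and SO_PEEK_OFF support for
its family is assumed):
char buf[2048];
int off = 0;

/* enable peeking at an offset; the kernel advances it per peek */
setsockopt(fd, SOL_SOCKET, SO_PEEK_OFF, &off, sizeof(off));
/* successive MSG_PEEK calls now walk forward through the queue,
   including past 0-sized skbs, instead of re-reading the same skb */
recv(fd, buf, sizeof(buf), MSG_PEEK);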
Signed-off-by: Benjamin Poirier <bpoirier@suse.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use consume_skb() to free the original skb that is successfully transmitted
as gso segmented skbs so that it is not treated as a drop due to an error.
Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove duplicate statements by using do-while loop instead of while loop.
- A;
- while (e) {
+ do {
A;
- }
+ } while (e);
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the preferable function name, which implies use of a pseudo-random
number generator.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Instead of feeding net_secret[] at boot time, defer the init
to the point the first socket is created.
This permits some platforms to use better entropy sources than
the ones available at boot time.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows to dump BPF filters attached to a socket with
SO_ATTACH_FILTER.
Note that we check CAP_SYS_ADMIN before allowing to dump this info.
For now, only AF_PACKET sockets use this feature.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 6681712d67
vxlan: generalize forwarding tables
relaxed the address checks in rtnl_fdb_add() to use is_zero_ether_addr().
This allows users to add multicast addresses using the fdb API. However,
the check in rtnl_fdb_del() still uses a more strict
is_valid_ether_addr() which rejects multicast addresses. Thus it
is possible to add an fdb that can not be later removed.
Relax the check in rtnl_fdb_del() as well.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 068a2de57d (net: release dst entry while
cache-hot for GSO case too)
Before GSO packet segmentation, we already take care of skb->dst if it
can be released.
There is no point adding an extra test for every segment in the gso loop.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When transmit timestamping is enabled at the socket level, record a
timestamp on packets written to a PACKET_TX_RING. Tx timestamps are
always looped to the application over the socket error queue. Software
timestamps are also written back into the packet frame header in the
packet ring.
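For illustration, a minimal userspace sketch (fd is the packet socket backing
the TX ring) of enabling the socket-level timestamping this patch honours:
int val = SOF_TIMESTAMPING_TX_SOFTWARE | SOF_TIMESTAMPING_SOFTWARE;

/* software tx timestamps: looped on the error queue and, with this
   patch, also written back into the tx ring frame header */
setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val));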
Reported-by: Paul Chavent <paul.chavent@onera.fr>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/emulex/benet/be_main.c
drivers/net/ethernet/intel/igb/igb_main.c
drivers/net/wireless/brcm80211/brcmsmac/mac80211_if.c
include/net/scm.h
net/batman-adv/routing.c
net/ipv4/tcp_input.c
The e{uid,gid} --> {uid,gid} credentials fix conflicted with the
cleanup in net-next to now pass cred structs around.
The be2net driver had a bug fix in 'net' that overlapped with the VLAN
interface changes by Patrick McHardy in net-next.
An IGB conflict existed because in 'net' the build_skb() support was
reverted, and in 'net-next' there was a comment style fix within that
code.
Several batman-adv conflicts were resolved by making sure that all
calls to batadv_is_my_mac() are changed to have a new bat_priv first
argument.
Eric Dumazet's TS ECR fix in TCP in 'net' conflicted with the F-RTO
rewrite in 'net-next', mostly overlapping changes.
Thanks to Stephen Rothwell and Antonio Quartulli for help with several
of these merge resolutions.
Signed-off-by: David S. Miller <davem@davemloft.net>
The return value from list_netdevice() is not used and not needed, so remove it.
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If one does something unfortunate and allows a
bad offload bug into the kernel, skb_warn_bad_offload
can effectively live-lock the system, filling the logs
with the same error over and over.
Add rate limiting to this so that the box remains otherwise
functional in this case.
Signed-off-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a function to allocate a sk_buff head without any data. This will
be used by memory mapped netlink to attach data from the mmaped area
to the skb.
Additionally change skb_release_all() to check whether the skb has a
data area to allow the skb destructor to clear the data pointer in case
only a head has been allocated.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for 802.1ad VLAN devices. This mainly consists of checking for
ETH_P_8021AD in addition to ETH_P_8021Q in a couple of places and checking
offloading capabilities based on the used protocol.
Configuration is done using "ip link":
# ip link add link eth0 eth0.1000 \
type vlan proto 802.1ad id 1000
# ip link add link eth0.1000 eth0.1000.1000 \
type vlan proto 802.1q id 1000
52:54:00:12:34:56 > 92:b1:54:28:e4:8c, ethertype 802.1Q (0x8100), length 106: vlan 1000, p 0, ethertype 802.1Q, vlan 1000, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto ICMP (1), length 84)
20.1.0.2 > 20.1.0.1: ICMP echo request, id 3003, seq 8, length 64
92:b1:54:28:e4:8c > 52:54:00:12:34:56, ethertype 802.1Q-QinQ (0x88a8), length 106: vlan 1000, p 0, ethertype 802.1Q, vlan 1000, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 47944, offset 0, flags [none], proto ICMP (1), length 84)
20.1.0.1 > 20.1.0.2: ICMP echo reply, id 3003, seq 8, length 64
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a protocol argument to the VLAN packet tagging functions. In case of HW
tagging, we need that protocol available in the ndo_start_xmit functions,
so it is stored in a new field in the skb. The new field fits into a hole
(on 64 bit) and doesn't increase the skb's size.
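Illustrative call site after this change (a hedged sketch; vlan_tci is
assumed, exact signature per the patched headers):
/* the tagging protocol now travels with the tag itself */
skb = __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), vlan_tci);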
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename the hardware VLAN acceleration features to include "CTAG" to indicate
that they only support CTAGs. Follow up patches will introduce 802.1ad
service provider tagging (STAGs) and require the distinction for hardware not
capable of accelerating both.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Update debugging messages to a more current style.
Emit these debugging messages at KERN_DEBUG instead
of KERN_DEFAULT.
Add and use neigh_dbg(level, fmt, ...) macro (sketched below)
Add dynamic_debug capability via pr_debug
Convert embedded function names to "%s: ", __func__
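The neigh_dbg() macro added above is roughly of this shape (a sketch; the
level test follows the existing NEIGH_DEBUG convention):
#define neigh_dbg(level, fmt, ...)		\
do {						\
	if (level <= NEIGH_DEBUG)		\
		pr_debug(fmt, ##__VA_ARGS__);	\
} while (0)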
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current implementation of dev_uc_sync/unsync() assumes that there is
a strict 1-to-1 relationship between the source and destination of the sync.
In other words, once an address has been synced to a destination device, it
will not be synced to any other device through the sync API.
However, there are some virtual devices that aggregate a number of lower
devices and need to sync addresses to all of them. The current
API falls short there.
This patch introduces a new dev_uc_sync_multiple() API that can be called
in the above circumstances and allows sync to work for every invocation.
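A hedged usage sketch for the bond/team case (device names assumed): the
master syncs its unicast list down to each slave, and repeated calls for
different slaves all take effect:
/* in the master's ndo_set_rx_mode, once per lower device */
dev_uc_sync_multiple(slave_dev, master_dev);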
CC: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 763eff57de.
It causes build regressions, as per Stephen Rothwell:
====================
After merging the final tree, today's linux-next build (powerpc
allyesconfig) failed like this:
net/core/netprio_cgroup.c:250:29: error: static declaration of 'net_prio_subsys' follows non-static declaration
include/linux/cgroup_subsys.h:71:1: note: previous declaration of 'net_prio_subsys' was here
====================
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The only part of proc_dir_entry the code outside of fs/proc
really cares about is PDE(inode)->data. Provide a helper
for that; static inline for now, eventually will be moved
to fs/proc, along with the knowledge of struct proc_dir_entry
layout.
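A sketch of such a helper (the name here is an assumption; the point is
that callers stop poking at proc_dir_entry directly):
static inline void *PDE_DATA(const struct inode *inode)
{
	return PDE(inode)->data;
}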
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The callers always pass current to sock_update_netprio().
Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The callers always pass current to sock_update_classid().
Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that uids and gids are completely encapsulated in kuid_t
and kgid_t we no longer need to pass struct cred which allowed
us to test both the uid and the user namespace for equality.
Passing struct cred potentially allows us to pass the entire group
list as BSD does but I don't believe the cost of cache line misses
justifies retaining code for a future potential application.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/nfc/microread/mei.c
net/netfilter/nfnetlink_queue_core.c
Pull in 'net' to get Eric Biederman's AF_UNIX fix, upon which
some cleanups are going to go on-top.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
The following patchset contains Netfilter and IPVS updates for
your net-next tree, most relevantly they are:
* Add net namespace support to NFLOG, ULOG and ebt_ulog and NFQUEUE.
The LOG and ebt_log targets have also been adapted, but they still
depend on the syslog netnamespace, which seems to be missing. From
Gao Feng.
* Don't lose indications of congestion in IPv6 fragmentation handling,
from Hannes Frederic Sowa.
* IPVS conversion to use RCU, including some code consolidation patches
and optimizations, also some from Julian Anastasov.
* cpu fanout support for NFQUEUE, from Holger Eitzenberger.
* Better error reporting to userspace when dropping packets from
all our _*_[xfrm|route]_me_harder functions, from Patrick McHardy.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 130549fe ("netfilter: reset nf_trace in nf_reset") added code
to reset nf_trace in nf_reset(). This is wrong and unnecessary.
nf_reset() is used in the following cases:
- when passing packets up to the socket layer, at which point we want to
release all netfilter references that might keep modules pinned while
the packet is queued. nf_trace doesn't matter anymore at this point.
- when encapsulating or decapsulating IPsec packets. We want to continue
tracing these packets after IPsec processing.
- when passing packets through virtual network devices. Only devices
that encapsulate in IPv4/v6 matter since otherwise nf_trace is not
used anymore. It's not entirely clear whether those packets should
be traced after that, however we've always done that.
- when passing packets through virtual network devices that make the
packet cross network namespace boundaries. This is the only case
where we clearly want to reset nf_trace and is also what the
original patch intended to fix.
Add a new function nf_reset_trace() and use it in dev_forward_skb() to
fix this properly.
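A minimal sketch of the new helper, assuming the usual config guard around
nf_trace:
static inline void nf_reset_trace(struct sk_buff *skb)
{
#if IS_ENABLED(CONFIG_NETFILTER_XT_TARGET_TRACE)
	skb->nf_trace = 0;
#endif
}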
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
A few drivers use dev_uc_sync/unsync to synchronize the
address lists from master down to slave/lower devices. In
some cases (bond/team) a single address list is synched down
to multiple devices. At the time of unsync, we have a leak
in these lower devices, because "synced" is treated as a
boolean and the address will not be unsynced for anything after
the first device/call.
Treat "synced" as a count (same as refcount) and allow all
unsync calls to work.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 7d4c04fc17 ("net: add option to enable
error queue packets waking select") has an issue due to operator precedence
causing the bit-wise OR to bind to the sock_flags call instead of the result of
the ternary conditional. This fixes the *_poll functions to work properly. The
old code results in "mask |= POLLPRI" instead of what was intended, which is to
only include POLLPRI when the socket option is enabled.
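Illustrative sketch of the precedence problem (not the exact hunk): in C,
| binds tighter than ?:, so the unparenthesized form always ORs in POLLPRI:
/* broken: parsed as (POLLERR | sock_flag(...)) ? POLLPRI : 0 */
mask |= POLLERR | sock_flag(sk, SOCK_SELECT_ERR_QUEUE) ? POLLPRI : 0;

/* intended: POLLPRI only when the socket option is enabled */
mask |= POLLERR | (sock_flag(sk, SOCK_SELECT_ERR_QUEUE) ? POLLPRI : 0);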
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename skb_dst_set_noref to __skb_dst_set_noref and
add force flag as suggested by David Miller. The new wrapper
skb_dst_set_noref_force will force dst entries that are not
cached to be attached as skb dst without taking reference
as long as provided dst is reclaimed after RCU grace period.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off by: Hans Schillstrom <hans@schillstrom.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Simon Horman <horms@verge.net.au>
Conflicts:
net/mac80211/sta_info.c
net/wireless/core.h
Two minor conflicts in wireless. Overlapping additions of extern
declarations in net/wireless/core.h and a bug fix overlapping with
the addition of a boolean parameter to __ieee80211_key_free().
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) sadb_msg prepared for IPSEC userspace forgets to initialize the
satype field, fix from Nicolas Dichtel.
2) Fix mac80211 synchronization during station removal, from Johannes
Berg.
3) Fix IPSEC sequence number notifications when they wrap, from Steffen
Klassert.
4) Fix cfg80211 wdev tracing crashes when add_virtual_intf() returns an
error pointer, from Johannes Berg.
5) In mac80211, don't call into the channel context code with the
interface list mutex held. From Johannes Berg.
6) In mac80211, if we don't actually associate, do not restart the STA
timer, otherwise we can crash. From Ben Greear.
7) Missing dma_mapping_error() check in e1000, ixgb, and e1000e. From
Christoph Paasch.
8) Fix sja1000 driver defines to not conflict with SH port, from Marc
Kleine-Budde.
9) Don't call il4965_rs_use_green with a NULL station, from Colin Ian
King.
10) Suspend/Resume in the FEC driver fail because the buffer descriptors
are not initialized at all the moments at which they should be. Fix
from Frank Li.
11) cpsw and davinci_emac drivers both use the wrong interface to
restart a stopped TX queue. Use netif_wake_queue not
netif_start_queue, the latter is for initialization/bringup not
active management of the queue. From Mugunthan V N.
12) Fix regression in rate calculations done by
psched_ratecfg_precompute(), missing u64 type promotion. From
Sergey Popovich.
13) Fix length overflow in tg3 VPD parsing, from Kees Cook.
14) AOE driver fails to allocate enough headroom, resulting in crashes.
Fix from Eric Dumazet.
15) RX overflow happens too quickly in sky2 driver because pause packet
thresholds are not programmed correctly. From Mirko Lindner.
16) Bonding driver manages arp_interval and miimon settings incorrectly,
disabling one unintentionally disables both. Fix from Nikolay
Aleksandrov.
17) smsc75xx drivers don't program the RX mac properly for jumbo frames.
Fix from Steve Glendinning.
18) Fix off-by-one in Codel packet scheduler. From Vijay Subramanian.
19) Fix packet corruption in atl1c by disabling MSI support, from Hannes
Frederic Sowa.
20) netdev_rx_handler_unregister() needs a synchronize_net() to fix
crashes in bonding driver unload stress tests. From Eric Dumazet.
21) rxlen field of ks8851 RX packet descriptors not interpreted
correctly (it is 12 bits not 16 bits, so needs to be masked after
shifting the 32-bit value down 16 bits). Fix from Max Nekludov.
22) Fix missed RX/TX enable in sh_eth driver due to mishandling of link
change indications. From Sergei Shtylyov.
23) Fix crashes during spurious ECI interrupts in sh_eth driver, also
from Sergei Shtylyov.
24) dm9000 driver initialization is done wrong for revision B devices
with DSP PHY, from Joseph CHANG.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (53 commits)
DM9000B: driver initialization upgrade
sh_eth: make 'link' field of 'struct sh_eth_private' *int*
sh_eth: workaround for spurious ECI interrupt
sh_eth: fix handling of no LINK signal
ks8851: Fix interpretation of rxlen field.
net: add a synchronize_net() in netdev_rx_handler_unregister()
MAINTAINERS: Update netxen_nic maintainers list
atl1e: drop pci-msi support because of packet corruption
net: fq_codel: Fix off-by-one error
net: calxedaxgmac: Wake-on-LAN fixes
net: calxedaxgmac: fix rx ring handling when OOM
net: core: Remove redundant call to 'nf_reset' in 'dev_forward_skb'
smsc75xx: fix jumbo frame support
net: fix the use of this_cpu_ptr
bonding: fix disabling of arp_interval and miimon
ipv6: don't accept node local multicast traffic from the wire
sky2: Threshold for Pause Packet is set wrong
sky2: Receive Overflows not counted
aoe: reserve enough headroom on skbs
line up comment for ndo_bridge_getlink
...
Currently, when a socket receives something on the error queue it only wakes up
the socket on select if it is in the "read" list, that is the socket has
something to read. It is useful also to wake the socket if it is in the error
list, which would enable software to wait on error queue packets without waking
up for regular data on the socket. The main use case is for receiving
timestamped transmit packets which return the timestamp to the socket via the
error queue. This enables an application to select on the socket for the error
queue only instead of for the regular traffic.
-v2-
* Added the SO_SELECT_ERR_QUEUE socket option to every architecture specific file
* Modified every socket poll function that checks error queue
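A minimal userspace sketch (fd and a prepared msghdr msg assumed) of waiting
on the error queue alone:
int on = 1;
struct pollfd pfd = { .fd = fd, .events = POLLPRI };

setsockopt(fd, SOL_SOCKET, SO_SELECT_ERR_QUEUE, &on, sizeof(on));
poll(&pfd, 1, -1);                  /* wakes when the error queue fills */
recvmsg(fd, &msg, MSG_ERRQUEUE);    /* fetch the queued tx timestamp */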
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Cc: Jeffrey Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Matthew Vick <matthew.vick@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 35d48903e9 (bonding: fix rx_handler locking) added a race
in bonding driver, reported by Steven Rostedt who did a very good
diagnosis :
<quoting Steven>
I'm currently debugging a crash in an old 3.0-rt kernel that one of our
customers is seeing. The bug happens with a stress test that loads and
unloads the bonding module in a loop (I don't know all the details as
I'm not the one that is directly interacting with the customer). But the
bug looks to be something that may still be present and possibly present
in mainline too. It will just be much harder to trigger it in mainline.
In -rt, interrupts are threads, and can schedule in and out just like
any other thread. Note, mainline now supports interrupt threads so this
may be easily reproducible in mainline as well. I don't have the ability
to tell the customer to try mainline or other kernels, so my hands are
somewhat tied to what I can do.
But according to a core dump, I tracked down that the eth irq thread
crashed in bond_handle_frame() here:
slave = bond_slave_get_rcu(skb->dev);
bond = slave->bond; <--- BUG
the slave returned was NULL and accessing slave->bond caused a NULL
pointer dereference.
Looking at the code that unregisters the handler:
void netdev_rx_handler_unregister(struct net_device *dev)
{
ASSERT_RTNL();
RCU_INIT_POINTER(dev->rx_handler, NULL);
RCU_INIT_POINTER(dev->rx_handler_data, NULL);
}
Which is basically:
dev->rx_handler = NULL;
dev->rx_handler_data = NULL;
And looking at __netif_receive_skb() we have:
rx_handler = rcu_dereference(skb->dev->rx_handler);
if (rx_handler) {
if (pt_prev) {
ret = deliver_skb(skb, pt_prev, orig_dev);
pt_prev = NULL;
}
switch (rx_handler(&skb)) {
My question to all of you is, what stops this interrupt from happening
while the bonding module is unloading? What happens if the interrupt
triggers and we have this:
CPU0 CPU1
---- ----
rx_handler = skb->dev->rx_handler
netdev_rx_handler_unregister() {
dev->rx_handler = NULL;
dev->rx_handler_data = NULL;
rx_handler()
bond_handle_frame() {
slave = skb->dev->rx_handler;
bond = slave->bond; <-- NULL pointer dereference!!!
What protection am I missing in the bond release handler that would
prevent the above from happening?
</quoting Steven>
We can fix this bug in two ways. First is adding a test in
bond_handle_frame() and others to check if rx_handler_data is NULL.
A second way is adding a synchronize_net() in
netdev_rx_handler_unregister() to make sure that an rcu protected reader
is guaranteed to see a non-NULL rx_handler_data.
The second way is better as it avoids an extra test in fast path.
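A sketch of the chosen fix, mirroring the unregister path quoted above:
void netdev_rx_handler_unregister(struct net_device *dev)
{
	ASSERT_RTNL();
	RCU_INIT_POINTER(dev->rx_handler, NULL);
	/* a reader that already fetched rx_handler is now guaranteed to
	   still see a non-NULL rx_handler_data */
	synchronize_net();
	RCU_INIT_POINTER(dev->rx_handler_data, NULL);
}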
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jiri Pirko <jpirko@redhat.com>
Cc: Paul E. McKenney <paulmck@us.ibm.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In rtnl_fdb_dump() when the fdb_dump ndo op is not populated
we never set the idx value, so cb->args[0] is always 0,
resulting in an endless loop of messages.
Introduced with this commit,
commit 090096bf3d
Author: Vlad Yasevich <vyasevic@redhat.com>
Date: Wed Mar 6 15:39:42 2013 +0000
net: generic fdb support for drivers without ndo_fdb_<op>
CC: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
'nf_reset' is called just prior to calling 'netif_rx'.
No need to call it twice.
Reported-by: Igor Michailov <rgohita@gmail.com>
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace per_cpu with per_cpu_ptr to save a conversion between address and pointer.
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
flush_tasklet is not a percpu variable, but percpu is, so
this_cpu_ptr(&info->cache->percpu->flush_tasklet)
is not equal to
&this_cpu_ptr(info->cache->percpu)->flush_tasklet
Commit 1f743b076 (use this_cpu_ptr per-cpu helper) introduced this bug.
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull userns fixes from Eric W Biederman:
"The bulk of the changes are fixing the worst consequences of the user
namespace design oversight in not considering what happens when one
namespace starts off as a clone of another namespace, as happens with
the mount namespace.
The rest of the changes are just plain bug fixes.
Many thanks to Andy Lutomirski for pointing out many of these issues."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
userns: Restrict when proc and sysfs can be mounted
ipc: Restrict mounting the mqueue filesystem
vfs: Carefully propogate mounts across user namespaces
vfs: Add a mount flag to lock read only bind mounts
userns: Don't allow creation if the user is chrooted
yama: Better permission check for ptraceme
pid: Handle the exit of a multi-threaded init.
scm: Require CAP_SYS_ADMIN over the current pidns to spoof pids.
Fix to return a negative error code from the error handling case
instead of 0 (the error code may have been overwritten to 0 by the
ops->fill_xstats call), as returned elsewhere in this function.
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
include/net/ipip.h
The changes made to ipip.h in 'net' were already included
in 'net-next' before that header was moved to another location.
Signed-off-by: David S. Miller <davem@davemloft.net>
In the kernel we have a fast and pretty implementation of the isxdigit() function.
Let's use it.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For untrusted packets with partial checksum, we need to set the transport header
for precise packet length estimation. We can just let skb_partial_csum_set()
do this to avoid an extra call to skb_flow_dissect() and simplify the caller.
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
gso_segs is reset to zero when the kernel receives packets from an untrusted
source, but we use this zero value to estimate the precise packet len, which is
wrong. So this patch tries to estimate the correct gso_segs value before using
it in qdisc_pkt_len_init().
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The WARN_ON(in_interrupt()) in net_enable_timestamp() can get false
positive, in socket clone path, run from softirq context :
[ 3641.624425] WARNING: at net/core/dev.c:1532 net_enable_timestamp+0x7b/0x80()
[ 3641.668811] Call Trace:
[ 3641.671254] <IRQ> [<ffffffff80286817>] warn_slowpath_common+0x87/0xc0
[ 3641.677871] [<ffffffff8028686a>] warn_slowpath_null+0x1a/0x20
[ 3641.683683] [<ffffffff80742f8b>] net_enable_timestamp+0x7b/0x80
[ 3641.689668] [<ffffffff80732ce5>] sk_clone_lock+0x425/0x450
[ 3641.695222] [<ffffffff8078db36>] inet_csk_clone_lock+0x16/0x170
[ 3641.701213] [<ffffffff807ae449>] tcp_create_openreq_child+0x29/0x820
[ 3641.707663] [<ffffffff807d62e2>] ? ipt_do_table+0x222/0x670
[ 3641.713354] [<ffffffff807aaf5b>] tcp_v4_syn_recv_sock+0xab/0x3d0
[ 3641.719425] [<ffffffff807af63a>] tcp_check_req+0x3da/0x530
[ 3641.724979] [<ffffffff8078b400>] ? inet_hashinfo_init+0x60/0x80
[ 3641.730964] [<ffffffff807ade6f>] ? tcp_v4_rcv+0x79f/0xbe0
[ 3641.736430] [<ffffffff807ab9bd>] tcp_v4_do_rcv+0x38d/0x4f0
[ 3641.741985] [<ffffffff807ae14a>] tcp_v4_rcv+0xa7a/0xbe0
It's safe at this point because the parent socket owns a reference
on netstamp_needed, so we can't have a 0 -> 1 transition, which
requires locking a mutex.
Instead of refining the check, let's remove it, as all known callers
are safe. If it ever changes in the future, static_key_slow_inc()
will complain anyway.
Reported-by: Laurent Chavey <chavey@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch takes advantage of dev_addr_genid and dev_base_seq to check if a change
occurs during a netlink dump. If a change is detected, the flag NLM_F_DUMP_INTR
is set in the first message after the dump was interrupted.
Note that seq and prev_seq must be reset between each family in rtnl_dump_all()
because they are specific to each family.
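The mechanism, roughly, inside a dump callback (a sketch; the exact seed
per family may combine dev_addr_genid and dev_base_seq):
/* remember the generation when the dump starts */
cb->seq = net->dev_base_seq;

/* for each emitted message: flag the dump if the generation moved */
nl_dump_check_consistent(cb, nlh);	/* sets NLM_F_DUMP_INTR */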
Reported-by: Junwei Zhang <junwei.zhang@6wind.com>
Reported-by: Hongjun Li <hongjun.li@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With decnet converted, we can finally get rid of rta_buf and the
computations around it. It also gets rid of the minimal header
length verification since all message handlers do that explicitly
anyway.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The cgroup code has been surrounded by ifdef CONFIG_NET_CLS_CGROUP
and CONFIG_NETPRIO_CGROUP.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previously, if you did an "ifconfig down" or similar on one core, and
the kernel had CONFIG_XFRM enabled, every core would be interrupted to
check its percpu flow list for items that could be garbage collected.
With this change, we generate a mask of cores that actually have any
percpu items, and only interrupt those cores. When we are trying to
isolate a set of cpus from interrupts, this is important to do.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is very useful to do dynamic truncation of packets. In particular,
we're interested in pushing the necessary header bytes to user space and
cutting off user payload that should probably not be transferred, for various
reasons (e.g. privacy or speed). With the ancillary extension PAY_OFFSET,
we can load it into the accumulator, and return it. E.g. in bpfc syntax ...
ld #poff ; { 0x20, 0, 0, 0xfffff034 },
ret a ; { 0x16, 0, 0, 0x00000000 },
... as a filter will accomplish this without having to do a big hackery in
a BPF filter itself. Follow-up JIT implementations are welcome.
Thanks to Eric Dumazet for suggesting and discussing this during the
Netfilter Workshop in Copenhagen.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__skb_get_poff() returns the offset to the payload as far as it could
be dissected. The main user is currently BPF, so that we can dynamically
truncate packets without needing to push actual payload to the user
space and instead can analyze headers only.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In skb_flow_dissect(), we perform a dissection of a skbuff. Since we're
doing the work here anyway, also store thoff for a later usage, e.g. in
the BPF filter.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Don't allow spoofing pids over unix domain sockets in the corner
cases where a user has created a user namespace but has not yet
created a pid namespace.
Cc: stable@vger.kernel.org
Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
DEFINE_STATIC_SRCU() defines an srcu struct and does its initialization at build time.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch generalizes VXLAN forwarding table entries allowing an administrator
to:
1) specify multiple destinations for a given MAC
2) specify alternate vni's in the VXLAN header
3) specify alternate destination UDP ports
4) use multicast MAC addresses as fdb lookup keys
5) specify multicast destinations
6) specify the outgoing interface for forwarded packets
The combination allows configuration of more complex topologies using VXLAN
encapsulation.
Changes since v1: rebase to 3.9.0-rc2
Signed-Off-By: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Range/validity checks on rta_type in rtnetlink_rcv_msg() do
not account for flags that may be set. This causes the function
to return -EINVAL when flags are set on the type (for example
NLA_F_NESTED).
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
[ Bug added in commit 05e8ef4ab2 (net: factor out
skb_mac_gso_segment() from skb_gso_segment() ) ]
Move vlan_depth out of the while loop, or else vlan_depth is always ETH_HLEN,
cannot be increased, and leads to an infinite loop when the frame has two vlan headers.
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/intel/e1000e/netdev.c
Minor conflict in e1000e, a line that got fixed in 'net'
has been removed in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for L2 GRE tunnels, so that RPS can be more effective.
Signed-off-by: Michael Dalton <mwdalton@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Initialize the mac address buffer with 0 as the driver specific function
will probably not fill the whole buffer. In fact, all in-kernel drivers
fill only ETH_ALEN of the MAX_ADDR_LEN bytes, i.e. 6 of the 32 possible
bytes. Therefore we currently leak 26 bytes of stack memory to userland
via the netlink interface.
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adds generic tunneling offloading support for IPv4-UDP based
tunnels.
GSO type is added to request this offload for a skb.
netdev feature NETIF_F_UDP_TUNNEL is added for hardware offloaded
udp-tunnel support. Currently no device supports this feature, so
software offload is used.
This can be used by tunneling protocols like VXLAN.
CC: Jesse Gross <jesse@nicira.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds an inner mac header. This will be used in the next patch
to find the tunnel header length. The header length is required to copy the
tunnel header to each gso segment.
This patch does not change any functionality.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This function will be used in next VXLAN_GSO patch. This patch does
not change any functionality.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Inherit the scatter-gather feature for tunnel devices to avoid
a copy for TSO packets of tunneling devices like GRE.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Earlier, SG was unset if CSUM was not available for a given device, to
force an skb copy and avoid sending an inconsistent csum.
Commit c9af6db4c1 (net: Fix possible wrong checksum generation)
added an explicit flag to force the copy to fix this issue. Therefore
there is no need to link SG and CSUM; the following patch kills this
link between the two features.
This patch is also required by a following patch in the series.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The frames for which rx_handlers return RX_HANDLER_CONSUMED are no longer
counted as dropped. They are counted as successfully received by
'netif_receive_skb'.
This allows network interface drivers to correctly update their RX-OK and
RX-DRP counters based on the result of 'netif_receive_skb'.
Signed-off-by: Cristian Bercaru <B43982@freescale.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the driver does not support the ndo_op use the generic
handler for it. This should work in the majority of cases.
Eventually the fdb_dflt_add call gets translated into a
__dev_set_rx_mode() call which should handle hardware
support for filtering via the IFF_UNICAST_FLT flag.
Namely IFF_UNICAST_FLT indicates if the hardware can do
unicast address filtering. If no support is available
the device is put into promisc mode.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We should use time_after_eq() to get maximum latency of two ticks,
instead of three.
Bug added in commit 24f8b2385 (net: increase receive packet quantum)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix new kernel-doc warnings in net/core/dev.c:
Warning(net/core/dev.c:4788): No description found for parameter 'new_carrier'
Warning(net/core/dev.c:4788): Excess function parameter 'new_carries' description in 'dev_change_carrier'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some drivers use a too big NAPI poll weight.
This patch adds a NAPI_POLL_WEIGHT default value
and issues an error message if a driver attempts
to use a bigger weight.
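For drivers, registration then typically looks like this (a sketch; priv and
my_poll are placeholders):
netif_napi_add(netdev, &priv->napi, my_poll, NAPI_POLL_WEIGHT);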
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I'm not sure why, but the hlist for each entry iterators were conceived
differently from the list ones, which are formed as
list_for_each_entry(pos, head, member)
The hlist ones were greedy and wanted an extra parameter:
hlist_for_each_entry(tpos, pos, head, member)
Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.
Besides the semantic patch, there was some manual work required:
- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small amount of places were using the 'node' parameter, this
was modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.
The semantic patch which is mostly the work of Peter Senna Tschudin is here:
@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
type T;
expression a,c,d,e;
identifier b;
statement S;
@@
-T b;
<+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
...+>
[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foundation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull vfs pile (part one) from Al Viro:
"Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
locking violations, etc.
The most visible changes here are death of FS_REVAL_DOT (replaced with
"has ->d_weak_revalidate()") and a new helper getting from struct file
to inode. Some bits of preparation to xattr method interface changes.
Misc patches by various people sent this cycle *and* ocfs2 fixes from
several cycles ago that should've been upstream right then.
PS: the next vfs pile will be xattr stuff."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
saner proc_get_inode() calling conventions
proc: avoid extra pde_put() in proc_fill_super()
fs: change return values from -EACCES to -EPERM
fs/exec.c: make bprm_mm_init() static
ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
ocfs2: fix possible use-after-free with AIO
ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
target: writev() on single-element vector is pointless
export kernel_write(), convert open-coded instances
fs: encode_fh: return FILEID_INVALID if invalid fid_type
kill f_vfsmnt
vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
nfsd: handle vfs_getattr errors in acl protocol
switch vfs_getattr() to struct path
default SET_PERSONALITY() in linux/elf.h
ceph: prepopulate inodes only when request is aborted
d_hash_and_lookup(): export, switch open-coded instances
9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
9p: split dropping the acls from v9fs_set_create_acl()
...
Pull networking fixes from David Miller:
1) ping_err() ICMP error handler looks at wrong ICMP header, from Li
Wei.
2) TCP socket hash function on ipv6 is too weak, from Eric Dumazet.
3) netif_set_xps_queue() forgets to drop mutex on errors, fix from
Alexander Duyck.
4) sum_frag_mem_limit() can deadlock due to lack of BH disabling, fix
from Eric Dumazet.
5) TCP SYN data is miscalculated in tcp_send_syn_data(), because the
amount of TCP option space was not taken into account properly in
this code path. Fix from Yuchung Cheng.
6) MLX4 driver allocates device queues with the wrong size, from Kleber
Sacilotto.
7) sock_diag can access past the end of the sock_diag_handlers[] array,
from Mathias Krause.
8) vlan_set_encap_proto() makes incorrect assumptions about where
skb->data points, rework the logic so that it works regardless of
where skb->data happens to be. From Jesse Gross.
9) Fix gianfar build failure with NET_POLL enabled, from Paul
Gortmaker.
10) Fix Ipv4 ID setting and checksum calculations in GRE driver, from
Pravin B Shelar.
11) bgmac driver does:
int i;
for (i = 0; ...; ...) {
...
for (i = 0; ...; ...) {
effectively corrupting the outer loop index, use a separate
variable for the inner loops. From Rafał Miłecki.
12) Fix suspend bugs in smsc95xx driver, from Ming Lei.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (35 commits)
usbnet: smsc95xx: rename FEATURE_AUTOSUSPEND
usbnet: smsc95xx: fix broken runtime suspend
usbnet: smsc95xx: fix suspend failure
bgmac: fix indexing of 2nd level loops
b43: Fix lockdep splat on module unload
Revert "ip_gre: propogate target device GSO capability to the tunnel device"
IP_GRE: Fix GRE_CSUM case.
VXLAN: Use tunnel_ip_select_ident() for tunnel IP-Identification.
IP_GRE: Fix IP-Identification.
net/pasemi: Fix missing coding style
vmxnet3: fix ethtool ring buffer size setting
vmxnet3: make local function static
bnx2x: remove dead code and make local funcs static
gianfar: fix compile fail for NET_POLL=y due to struct packing
vlan: adjust vlan_set_encap_proto() for its callers
sock_diag: Simplify sock_diag_handlers[] handling in __sock_diag_rcv_msg
sock_diag: Fix out-of-bounds access to sock_diag_handlers[]
vxlan: remove depends on CONFIG_EXPERIMENTAL
mlx4_en: fix allocation of CPU affinity reverse-map
mlx4_en: fix allocation of device tx_cq
...
Deadlock might be caused by allocating memory with GFP_KERNEL in the
runtime_resume and runtime_suspend callbacks of network devices in the iSCSI
case, so mark network devices and their ancestors as 'memalloc_noio'
with the introduced pm_runtime_set_memalloc_noio().
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Decotigny <david.decotigny@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Oliver Neukum <oneukum@suse.de>
Cc: Jiri Kosina <jiri.kosina@suse.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The sock_diag_lock_handler() and sock_diag_unlock_handler() actually
make the code less readable. Get rid of them and make the lock usage
and access to sock_diag_handlers[] clear on the first sight.
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Userland can send a netlink message requesting SOCK_DIAG_BY_FAMILY
with a family greater than or equal to AF_MAX -- the array size of
sock_diag_handlers[]. The current code does not test for this
condition and is therefore vulnerable to an out-of-bounds access,
opening the door to privilege escalation.
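A minimal sketch of the required bound check before indexing the handler
array:
struct sock_diag_req *req = nlmsg_data(nlh);

if (req->sdiag_family >= AF_MAX)
	return -EINVAL;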
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The mem cgroup socket limit is only used if the config option is
enabled. Found with sparse.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Smatch found a locking bug in netif_set_xps_queue in which we were not
releasing the lock in the case of an allocation failure.
This change corrects that so that we release the xps_map_mutex before
returning -ENOMEM in the case of an allocation failure.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet wrote:
| Some strange crashes happen in rt6_check_expired(), with access
| to random addresses.
|
| At first glance, it looks like the RTF_EXPIRES and
| stuff added in commit 1716a96101
| (ipv6: fix problem with expired dst cache)
| are racy : same dst could be manipulated at the same time
| on different cpus.
|
| At some point, our stack believes rt->dst.from contains a dst pointer,
| while its really a jiffie value (as rt->dst.expires shares the same area
| of memory)
|
| rt6_update_expires() should be fixed, or am I missing something ?
|
| CC Neil because of https://bugzilla.redhat.com/show_bug.cgi?id=892060
Because we do not have any locks for dst_entry, we cannot change
essential structures in the entry; e.g., we cannot change the reference
to another entity.
To fix this issue, split the 'from' and 'expires' fields in dst_entry
out of the union. Once 'from' is assigned in the constructor,
keep the reference until the very last stage of the lifetime of
the object.
Of course, it is unsafe to change 'from', so make rt6_set_from simple
just for fresh entries.
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Reported-by: Neil Horman <nhorman@tuxdriver.com>
CC: Gao Feng <gaofeng@cn.fujitsu.com>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reported-by: Steinar H. Gunderson <sesse@google.com>
Reviewed-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit c9af6db4c1 (net: Fix possible wrong checksum generation)
has a suspicious piece:
- skb_shinfo(skb1)->gso_type = skb_shinfo(skb)->gso_type;
-
+ skb_shinfo(skb)->tx_flags = skb_shinfo(skb1)->tx_flags & SKBTX_SHARED_FRAG;
skb1 is the new skb, therefore should be on the left side of the assignment.
This patch fixes it.
Cc: Pravin B Shelar <pshelar@nicira.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When !CONFIG_PROC_FS, dev_mcast_init() is not defined, so
we can just merge dev_mcast_init() into
dev_proc_init().
Reported-by: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to net/core/net-sysfs.c, group procfs code to
a single unit.
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
proc_net_remove is only used to remove proc entries
that live under /proc/net; it's not a general function for
removing proc entries of a netns. If we want to remove
some proc entries under /proc/net/stat/, we still
need to call remove_proc_entry.
This patch uses remove_proc_entry to replace proc_net_remove.
We can remove proc_net_remove after this patch.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Right now, some modules such as bonding use proc_create
to create proc entries under /proc/net/, and other modules
such as ipv4 use proc_net_fops_create.
It looks a little chaotic. This patch changes all uses of
proc_net_fops_create to proc_create. We can remove
proc_net_fops_create after this patch.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
They well deserve a separate unit.
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We no longer need to use mac_len; let's clean things up.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Following patch adds GRE protocol offload handler so that
skb_gso_segment() can segment GRE packets.
SKB GSO CB is added to keep track of total header length so that
skb_segment can push entire header. e.g. in case of GRE, skb_segment
need to push inner and outer headers to every segment.
New NETIF_F_GRE_GSO feature is added for devices which support HW
GRE TSO offload. Currently no device supports it, therefore GRE GSO
always falls back to software GSO.
[ Compute pkt_len before ip_local_out() invocation. -DaveM ]
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This function will be used in next GRE_GSO patch. This patch does
not change any functionality. It only exports skb_mac_gso_segment()
function.
[ Use skb_reset_mac_len() -DaveM ]
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Even for non-pfmalloc SKBs, __netif_receive_skb() will do a
tsk_restore_flags() on current unconditionally.
Make __netif_receive_skb() a shim around the existing code, renamed to
__netif_receive_skb_core(). Let __netif_receive_skb() wrap the
__netif_receive_skb_core() call with the task flag modifications, if
necessary.
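The resulting shim looks roughly like this (a sketch of the structure
described above):
static int __netif_receive_skb(struct sk_buff *skb)
{
	int ret;

	if (sk_memalloc_socks() && skb_pfmemalloc(skb)) {
		unsigned long pflags = current->flags;

		/* only pfmemalloc skbs pay for the task-flag save/restore */
		current->flags |= PF_MEMALLOC;
		ret = __netif_receive_skb_core(skb, true);
		tsk_restore_flags(current, pflags, PF_MEMALLOC);
	} else {
		ret = __netif_receive_skb_core(skb, false);
	}

	return ret;
}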
Signed-off-by: David S. Miller <davem@davemloft.net>
When a user adds bridge neighbors, allow him to specify VLAN id.
If the VLAN id is not specified, the neighbor will be added
for VLANs currently in the ports filter list. If no VLANs are
configured on the port, we use vlan 0 and only add 1 entry.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Acked-by: Jitendra Kalsaria <jitendra.kalsaria@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using the RTM_GETLINK dump the vlan filter list of a given
bridge port. The information depends on setting the filter
flag similar to how nic VF info is dumped.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a netlink interface to add and remove vlan configuration on bridge port.
The interface uses the RTM_SETLINK message and encodes the vlan
configuration inside the IFLA_AF_SPEC. It is possible to include multiple
vlans to either add or remove in a single message.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Patch cef401de7b (net: fix possible wrong checksum
generation) fixed wrong checksum calculation but it broke TSO by
defining new GSO type but not a netdev feature for that type.
net_gso_ok() would not allow hardware checksum/segmentation
offload of such packets without the feature.
The following patch fixes TSO and the wrong checksum. This patch uses the
same logic that Eric Dumazet used. It introduces a new flag,
SKBTX_SHARED_FRAG, set if at least one frag can be modified by
the user, but the SKBTX_SHARED_FRAG flag is kept in the skb shared
info tx_flags rather than in gso_type.
tx_flags is better compared to gso_type since we can have an skb with a
shared frag that is not a gso packet. It does not link SHARED_FRAG to
GSO, so there is no need to define a netdev feature for this.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dan Carpenter contacted me with some notes regarding some smatch warnings in the
netpoll code, some of which I introduced with my recent netpoll locking fixes,
some which were there prior. Specifically they were:
net-next/net/core/netpoll.c:243 netpoll_poll_dev() warn: inconsistent
returns mutex:&ni->dev_lock: locked (213,217) unlocked (210,243)
net-next/net/core/netpoll.c:706 netpoll_neigh_reply() warn: potential
pointer math issue ('skb_transport_header(send_skb)' is a 128 bit pointer)
This patch corrects the locking imbalance (the first error), and adds some
parentheses to correct the second error. Tested by myself. Applies to net-next.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Dan Carpenter <dan.carpenter@oracle.com>
CC: "David S. Miller" <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
I get the following build error on next-20130213 due to the following
commit:
commit f05de73bf8 ("skbuff: create
skb_panic() function and its wrappers").
It adds an argument called panic to a function that uses the BUG() macro
which tries to call panic, but the argument masks the panic() function
declaration, resulting in the following error (gcc 4.2.4):
net/core/skbuff.c In function 'skb_panic':
net/core/skbuff.c +126 : error: called object 'panic' is not a function
This is fixed by renaming the argument to msg.
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Jean Sacren <sakiwit@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
The bnx2x gso_type setting bug fix in 'net' conflicted with
changes in 'net-next' that broke the gso_* setting logic
out into a seperate function, which also fixes the bug in
question. Thus, use the 'net-next' version.
Signed-off-by: David S. Miller <davem@davemloft.net>
Tommi was fuzzing with trinity and reported the following problem :
commit 3f518bf745 (datagram: Add offset argument to __skb_recv_datagram)
missed that a raw socket receive queue can contain skbs with no payload.
We can loop in __skb_recv_datagram() with MSG_PEEK mode, because
wait_for_packet() is not prepared to skip these skbs.
[ 83.541011] INFO: rcu_sched detected stalls on CPUs/tasks: {}
(detected by 0, t=26002 jiffies, g=27673, c=27672, q=75)
[ 83.541011] INFO: Stall ended before state dump start
[ 108.067010] BUG: soft lockup - CPU#0 stuck for 22s! [trinity-child31:2847]
...
[ 108.067010] Call Trace:
[ 108.067010] [<ffffffff818cc103>] __skb_recv_datagram+0x1a3/0x3b0
[ 108.067010] [<ffffffff818cc33d>] skb_recv_datagram+0x2d/0x30
[ 108.067010] [<ffffffff819ed43d>] rawv6_recvmsg+0xad/0x240
[ 108.067010] [<ffffffff818c4b04>] sock_common_recvmsg+0x34/0x50
[ 108.067010] [<ffffffff818bc8ec>] sock_recvmsg+0xbc/0xf0
[ 108.067010] [<ffffffff818bf31e>] sys_recvfrom+0xde/0x150
[ 108.067010] [<ffffffff81ca4329>] system_call_fastpath+0x16/0x1b
Reported-by: Tommi Rantala <tt.rantala@gmail.com>
Tested-by: Tommi Rantala <tt.rantala@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With my recent commit I introduced two sparse warnings. Looking closer there
were a few more in the same file, so I fixed them all up. Basic RCU pointer
dereferencing stuff.
I've validated these changes using CONFIG_PROVE_RCU while starting and stopping
netconsole repeatedly in bonded and non-bonded configurations.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: fengguang.wu@intel.com
CC: David Miller <davem@davemloft.net>
CC: eric.dumazet@gmail.com
Signed-off-by: David S. Miller <davem@davemloft.net>
__netpoll_rcu_free is used to free netpoll structures when the rtnl_lock is
already held. The mechanism is used to asynchronously call __netpoll_cleanup
outside of the holding of the rtnl_lock, so as to avoid deadlock.
Unfortunately, __netpoll_cleanup modifies pointers (dev->np), which means the
rtnl_lock must be held while calling it. Further, it cannot be held, because
rcu callbacks may be issued in softirq contexts, which cannot sleep.
Fix this by converting the rcu callback to a work queue that is guaranteed to
get scheduled in process context, so that we can hold the rtnl properly while
calling __netpoll_cleanup.
Tested successfully by myself.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: Cong Wang <amwang@redhat.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create a skb_panic() function in lieu of both skb_over_panic() and
skb_under_panic() to avoid code duplication. Update type
and variable name where necessary.
Jiri Pirko suggested using wrappers so that we would be able to keep the
fruits of the original code.
Signed-off-by: Jean Sacren <sakiwit@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to address the fact that some devices cannot support the full 32K
frag size we need to have the value accessible somewhere so that we can use it
to do comparisons against what the device can support. As such I am moving
the values out of skbuff.c and into skbuff.h.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On 64 bit arches:
There is an off-by-one error in qdisc_pkt_len_init() because
mac_header is not set in xmit path.
skb_mac_header() returns an out of bound value that was
harmless because hdr_len is an 'unsigned int'
On 32bit arches, the error is abysmal.
This patch is also a prereq for "macvlan: add multicast filter"
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Greear <greearb@candelatech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_gso_segment() is almost always called in tx path,
except for openvswitch. It calls this function when
it receives the packet and tries to queue it to user-space.
In this special case, the ->ip_summed check inside
skb_gso_segment() is no longer true, as ->ip_summed value
has different meanings on rx path.
This patch adjusts skb_gso_segment() so that we can at least
avoid such warnings on checksum.
Cc: Jesse Gross <jesse@nicira.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ivan Vecera was recently backporting commit
9c13cb8bb4 to a RHEL kernel, and I noticed that,
while this patch protects the tg3 driver from having its ndo_poll_controller
routine called during device initialization, it does nothing for the driver
during shutdown. I.e. it would be entirely possible to have the
ndo_poll_controller method (or subsequently the ndo_poll) routine called for a
driver in the netpoll path on CPU A while in parallel on CPU B, the ndo_close or
ndo_open routine could be called. Given that the two latter routines tend to
initialize and free many data structures that the former two rely on, the result
can easily be data corruption or various other crashes. Furthermore, it seems
that this is potentially a problem with all net drivers that support netpoll,
and so this should ideally be fixed in a common path.
As Ben H. pointed out to me, we can't perform dev_open/dev_close in atomic
context, so I've come up with this solution. We can use a mutex to sleep in
open/close paths and just do a mutex_trylock in the napi poll path and abandon
the poll attempt if we're locked, as we'll just retry the poll on the next send
anyway.
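A minimal sketch of the trylock pattern described above (names follow the
smatch warning earlier in this series; illustrative only, not the exact patch):

    static void netpoll_poll_dev(struct net_device *dev)
    {
            struct netpoll_info *ni = rcu_dereference_bh(dev->npinfo);

            /* The poll path runs in atomic context: if ndo_open/ndo_close
             * currently holds dev_lock, just skip this poll attempt; the
             * next transmit will retry anyway. */
            if (!ni || !mutex_trylock(&ni->dev_lock))
                    return;

            poll_napi(dev);
            mutex_unlock(&ni->dev_lock);
    }

The open/close paths, which may sleep, take the same mutex unconditionally
before touching the structures the poll path relies on.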
I've tested this here by flooding netconsole with messages on a system whose NIC
driver I modified to periodically return NETDEV_TX_BUSY, so that the netpoll tx
workqueue would be forced to send frames and poll the device. While this was
going on I rapidly ifdown/up'ed the interface and watched for any problems.
I've not found any.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Ivan Vecera <ivecera@redhat.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: Ben Hutchings <bhutchings@solarflare.com>
CC: Francois Romieu <romieu@fr.zoreil.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
alloc failures already get standardized OOM
messages and a dump_stack.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/intel/e1000e/ethtool.c
drivers/net/vmxnet3/vmxnet3_drv.c
drivers/net/wireless/iwlwifi/dvm/tx.c
net/ipv6/route.c
The ipv6 route.c conflict is simple, just ignore the 'net' side change
as we fixed the same problem in 'net-next' by eliminating cached
neighbours from ipv6 routes.
The e1000e conflict is an addition of a new statistic in the ethtool
code, trivial.
The vmxnet3 conflict is about one change in 'net' removing a guarding
conditional, whilst in 'net-next' we had a netdev_info() conversion.
The iwlwifi conflict is dealing with a WARN_ON() conversion in
'net-next' vs. a revert happening in 'net'.
Signed-off-by: David S. Miller <davem@davemloft.net>
As del_timer() already calls timer_pending() to check whether the timer
to be deleted is pending or not, it's unnecessary to check the timer's
pending state again before del_timer() is called.
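In diff form the simplification is essentially:
-	if (timer_pending(&timer))
-		del_timer(&timer);
+	/* del_timer() already checks timer_pending() internally */
+	del_timer(&timer);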
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Right now, only ixgbe, macvlan, vxlan and bridge implement
fdb_add/fdb_del operations.
These operations only operate on the private data of the net
device, so allowing the unprivileged user who created
the userns and netns to add/del fdb entries will do no
harm to other netns.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use correct inner offset to set inner_network_offset.
Found by inspection.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of jumping around bugs that are easily fixed, just don't let them in:
affected drivers should be either fixed or have NETIF_F_HW_VLAN_FILTER
removed from advertised features.
Quick grep in drivers/net shows two drivers that have NETIF_F_HW_VLAN_FILTER
but not ndo_vlan_rx_add/kill_vid(), but those are false-positives (features
are commented out).
OTOH two drivers have ndo_vlan_rx_add/kill_vid() implemented but don't
advertise NETIF_F_HW_VLAN_FILTER. Those are:
+ethernet/cisco/enic/enic_main.c
+ethernet/qlogic/qlcnic/qlcnic_main.c
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
The return value of pktgen_add_device() is not checked, so
even if we fail to add some device, for example a non-existent one,
we still see "OK:...". This patch fixes it.
After this patch, I got:
# echo "add_device non-exist" > /proc/net/pktgen/kpktgend_0
-bash: echo: write error: No such device
# cat /proc/net/pktgen/kpktgend_0
Running:
Stopped:
Result: ERROR: can not add device non-exist
# echo "add_device eth0" > /proc/net/pktgen/kpktgend_0
# cat /proc/net/pktgen/kpktgend_0
Running:
Stopped: eth0
Result: OK: add_device=eth0
(Candidate for -stable)
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bring in the 'net' tree so that we can get some ipv4/ipv6 bug
fixes that some net-next work will build upon.
Signed-off-by: David S. Miller <davem@davemloft.net>
v3: make pktgen_threads list per-namespace
v2: remove a useless check
This patch adds net namespace support to pktgen, so that
we can use pktgen in different namespaces.
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When allocating memory for neighbour cache entry, if
tbl->entry_size is not set, we always calculate
sizeof(struct neighbour) + tbl->key_len, which is common
in the same table.
With this change, set tbl->entry_size during the table
initialization phase, if it was not set, and use it in
neigh_alloc() and neighbour_priv().
This change also allows us to have both protocol private
data and device private data at the same time.
Note that the only user of protocol private data is DECnet
and the only user of device private data is ATM CLIP.
Since those are exclusive, we have not been facing issues
here.
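A sketch of the initialization-time computation described (close to, but not
necessarily identical with, the actual patch):

    if (!tbl->entry_size)
            tbl->entry_size = ALIGN(offsetof(struct neighbour, primary_key) +
                                    tbl->key_len, NEIGH_PRIV_ALIGN);

neigh_alloc() and neighbour_priv() then simply read tbl->entry_size instead
of recomputing the sum for every entry.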
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
I found that if we write a value larger than 4GB to some sysctl
variables, the sending syscall will hang forever, because these
variables are 32 bits and such large values make them overflow to 0 or
a negative number.
This patch tries to fix the overflow and prevent zero values from being
set for the sysctl variables below:
net.core.wmem_default
net.core.rmem_default
net.core.rmem_max
net.core.wmem_max
net.ipv4.udp_rmem_min
net.ipv4.udp_wmem_min
net.ipv4.tcp_wmem
net.ipv4.tcp_rmem
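A minimal sketch of how the zero/overflow protection can look in the
ctl_table (one entry shown for illustration; the shared minimum variable
and its name are assumptions, not the exact patch):

    static int one = 1;

    static struct ctl_table netcore_table[] = {
            {
                    .procname       = "wmem_default",
                    .data           = &sysctl_wmem_default,
                    .maxlen         = sizeof(int),
                    .mode           = 0644,
                    /* reject 0 and clamp bogus writes instead of letting
                     * the 32 bit backing variable silently overflow */
                    .proc_handler   = proc_dointvec_minmax,
                    .extra1         = &one,
            },
            { }
    };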
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Li Yu <raise.sail@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This will allow us to setup netconsole in a different namespace
rather than where init_net is.
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipv6_addr_equal() is faster.
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pravin Shelar mentioned that GSO could potentially generate
wrong TX checksum if skb has fragments that are overwritten
by the user between the checksum computation and transmit.
He suggested linearizing skbs, but this extra copy can be
avoided for normal tcp skbs cooked by tcp_sendmsg().
This patch introduces a new SKB_GSO_SHARED_FRAG flag, set
in skb_shinfo(skb)->gso_type if at least one frag can be
modified by the user.
Typical sources of such possible overwrites are {vm}splice(),
sendfile(), and macvtap/tun/virtio_net drivers.
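Roughly, the flag is set where such user-visible pages enter the skb, along
these lines (illustrative placement in a sendpage()-style path, not the exact
diff):

    /* right after attaching the caller's page as a fragment: the page is
     * still mapped by the sender, so GSO must not compute checksums in
     * place over it */
    get_page(page);
    skb_fill_page_desc(skb, i, page, offset, copy);
    skb_shinfo(skb)->gso_type |= SKB_GSO_SHARED_FRAG;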
Tested:
$ netperf -H 7.7.8.84
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
7.7.8.84 () port 0 AF_INET
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
87380 16384 16384 10.00 3959.52
$ netperf -H 7.7.8.84 -t TCP_SENDFILE
TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 ()
port 0 AF_INET
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
87380 16384 16384 10.00 3216.80
Performance of the SENDFILE test is impacted by the extra allocation and
copy, and by the fact that we use order-0 pages, while the TCP_STREAM test
uses bigger pages.
Reported-by: Pravin Shelar <pshelar@nicira.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Definitions and macros for implementing soreusport.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fengguang reported:
net/core/netpoll.c: In function 'netpoll_setup':
net/core/netpoll.c:1049:6: warning: 'err' may be used uninitialized in this function [-Wmaybe-uninitialized]
In the !CONFIG_IPV6 case, we may error out without initializing
'err'.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since we have removed NCE (Neighbour Cache Entry) reference from
routing entries, the only refcnt holders of an NCE are its timer
(if running) and its owner table, in usual cases. As a result,
neigh_periodic_work() purges NCEs over and over again even for
gateways.
It does not make sense to purge entries if the number of them is
very small, so keep them. The minimum number of entries to keep
is specified by gc_thresh1.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 6a328d8c6f changed the update
logic for the socket but did not apply the same update to the
SCM_RIGHTS path. This patch is based on the net_prio fix commit
48a87cc26c
net: netprio: fd passed in SCM_RIGHTS datagram not set correctly
A socket fd passed in a SCM_RIGHTS datagram was not getting
updated with the new task's cgrp prioidx. This leaves IO on
the socket tagged with the old task's priority.
To fix this, add a check in the scm recvmsg path to update the
sock cgrp prioidx with the new task's value.
Let's apply the same fix for net_cls.
Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Reported-by: Li Zefan <lizefan@huawei.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: netdev@vger.kernel.org
Cc: cgroups@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
__skb_tx_hash() and __skb_get_rxhash() are both for calculating a hash
value based on some fields in the skb, and are mostly used by device
drivers for selecting queues.
Meanwhile, net/core/dev.c is bloating.
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 9ca1b22d6d (net: splice: avoid high order page splitting)
forgot that skb->head could need a copy into several page frags.
This could be the case for loopback traffic mostly.
Also remove now useless skb argument from linear_to_page()
and __splice_segment() prototypes.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
splice() can handle pages of any order, but network code tries hard to
split them in PAGE_SIZE units. Not quite successfully anyway, as
__splice_segment() assumed poff < PAGE_SIZE. This is true for
the skb->data part, not necessarily for the fragments.
This patch removes this logic to give the pages as they are in the skb.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
While a privileged program can open a raw socket, attach some
restrictive filter and drop its privileges (or send the socket to an
unprivileged program through some Unix socket), the filter can still
be removed or modified by the unprivileged program. This commit adds a
socket option to lock the filter (SO_LOCK_FILTER) preventing any
modification of a socket filter program.
This is similar to OpenBSD BIOCLOCK ioctl on bpf sockets, except even
root is not allowed to change/drop the filter.
The state of the lock can be read with getsockopt(). No error is
triggered if the state is not changed. -EPERM is returned when a user
tries to remove the lock or to change/remove the filter while the lock
is active. The check is done directly in sk_attach_filter() and
sk_detach_filter() and does not affect only setsockopt() syscall.
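From user space the new option is used like any other boolean socket
option; a hedged usage sketch:

    int one = 1;

    /* attach a restrictive filter first (not shown), then lock it */
    if (setsockopt(fd, SOL_SOCKET, SO_LOCK_FILTER, &one, sizeof(one)) < 0)
            perror("SO_LOCK_FILTER");

    /* from now on SO_ATTACH_FILTER, SO_DETACH_FILTER and clearing the
     * lock itself all fail with EPERM, even for root */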
Signed-off-by: Vincent Bernat <bernat@luffy.cx>
Signed-off-by: David S. Miller <davem@davemloft.net>
__dev_get_by_name() doesn't refcount the network device,
so we have to do this by ourselves. Noticed by Eric.
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
v4: hold rtnl lock for the whole netpoll_setup()
v3: remove the comment
v2: use RCU read lock
This patch fixes the following warning:
[ 72.013864] RTNL: assertion failed at net/core/dev.c (4955)
[ 72.017758] Pid: 668, comm: netpoll-prep-v6 Not tainted 3.8.0-rc1+ #474
[ 72.019582] Call Trace:
[ 72.020295] [<ffffffff8176653d>] netdev_master_upper_dev_get+0x35/0x58
[ 72.022545] [<ffffffff81784edd>] netpoll_setup+0x61/0x340
[ 72.024846] [<ffffffff815d837e>] store_enabled+0x82/0xc3
[ 72.027466] [<ffffffff815d7e51>] netconsole_target_attr_store+0x35/0x37
[ 72.029348] [<ffffffff811c3479>] configfs_write_file+0xe2/0x10c
[ 72.030959] [<ffffffff8115d239>] vfs_write+0xaf/0xf6
[ 72.032359] [<ffffffff81978a05>] ? sysret_check+0x22/0x5d
[ 72.033824] [<ffffffff8115d453>] sys_write+0x5c/0x84
[ 72.035328] [<ffffffff819789d9>] system_call_fastpath+0x16/0x1b
In case of other races, hold rtnl lock for the entire netpoll_setup() function.
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 1def9238d4 (net_sched: more precise pkt_len computation)
does a wrong computation of mac + network headers length, as it includes
the padding before the frame.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
Documentation/networking/ip-sysctl.txt
drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
Both conflicts were simply overlapping context.
A build fix for qlcnic is in here too, simply removing the added
devinit annotations which no longer exist.
Signed-off-by: David S. Miller <davem@davemloft.net>
spin_is_locked() on a non-SMP (UP) build is kind of useless:
BUG_ON(!spin_is_locked(xx)) is guaranteed to crash there.
Just remove this check in reqsk_fastopen_remove() as
the callers do hold the socket lock.
Reported-by: Ketan Kulkarni <ketkulka@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jerry Chu <hkchu@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Dave Taht <dave.taht@gmail.com>
Acked-by: H.K. Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 9ca1b22d6d (net: splice: avoid high order page splitting)
forgot that skb->head could need a copy into several page frags.
This could be the case for loopback traffic mostly.
Also remove now useless skb argument from linear_to_page()
and __splice_segment() prototypes.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since:
commit 2c60db0370
Author: Eric Dumazet <edumazet@google.com>
Date: Sun Sep 16 09:17:26 2012 +0000
net: provide a default dev->ethtool_ops
the wireless core does not correctly assign ethtool_ops.
After the alloc_netdev*() call, some cfg80211 drivers provide their own
ethtool_ops, but some do not. For them, the wireless core provides the generic
cfg80211_ethtool_ops, which is assigned in the NETDEV_REGISTER notifier call:
if (!dev->ethtool_ops)
dev->ethtool_ops = &cfg80211_ethtool_ops;
But after Eric's commit, dev->ethtool_ops is no longer NULL (on cfg80211
drivers without custom ethtool_ops), but points to &default_ethtool_ops.
In order to fix the problem, provide a function which overwrites
default_ethtool_ops and use it in the wireless core.
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When testing with FCoE enabled we discovered that I had not exported
__netdev_pick_tx. As a result ixgbe doesn't build with the RFC patches
applied because ixgbe_select_queue was calling the function. This change
corrects that build issue by correctly exporting __netdev_pick_tx so it
can be used by modules.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes it so that we can support transmit packet steering without
sysfs needing to be enabled. The reason for making this change is to make
it so that a driver can make use of XPS even while the sysfs portion of
the interface is not present.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change is meant to address several issues I found within the
netif_set_xps_queues function.
If the allocation of one of the maps to be assigned to new_dev_maps failed
we could end up with the device map in an inconsistent state since we had
already worked through a number of CPUs and removed or added the queue. To
address that I split the process into several steps. The first of which is
just the allocation of updated maps for CPUs that will need larger maps to
store the queue. By doing this we can fail gracefully without actually
altering the contents of the current device map.
The second issue I found was the fact that we were always allocating a new
device map even if we were not adding any queues. I have updated the code
so that we only allocate a new device map if we are adding queues,
otherwise if we are not adding any queues to CPUs we just skip to the
removal process.
The last change I made was to reuse the code from remove_xps_queue to remove
the queue from the CPU. By making this change we can be consistent in how
we go about adding and removing the queues from the CPUs.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch does a minor refactor on netif_reset_xps_queue to address a few
items I noticed.
First is the fact that we are doing removal of queues in both
netif_reset_xps_queue and netif_set_xps_queue. Since there is no need to
have the code in two places I am pushing it out into a separate function
and will come back in another patch and reuse the code in
netif_set_xps_queue.
The second item this change addresses is the fact that the Tx queues were
not getting their numa_node value cleared as a part of the XPS queue reset.
This patch resolves that by resetting the numa_node value if the dev_maps
value is set.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds two functions, netif_reset_xps_queue and
netif_set_xps_queue. The main idea behind these two functions is to
provide a mechanism through which drivers can update their defaults in
regards to XPS.
Currently no such mechanism exists and as a result we cannot use XPS for
things such as ATR which would require a basic configuration to start in
which the Tx queues are mapped to CPUs via a 1:1 mapping. With this change
I am making it possible for drivers such as ixgbe to be able to use the XPS
feature by controlling the default configuration.
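A driver could then seed a 1:1 queue-to-CPU mapping at init time, roughly like
this (the adapter structure and loop are illustrative, not taken from ixgbe):

    int i;

    for (i = 0; i < adapter->num_tx_queues; i++) {
            cpumask_t mask;

            cpumask_clear(&mask);
            cpumask_set_cpu(i % num_online_cpus(), &mask);
            /* netif_set_xps_queue(dev, cpumask, tx queue index) */
            netif_set_xps_queue(adapter->netdev, &mask, i);
    }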
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change splits the core bits of netdev_pick_tx into a separate function.
The main idea behind this is to make this code accessible to select queue
functions when they decide to process the standard path instead of their
own custom path in their select queue routine.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
One long standing problem with TSO/GSO/GRO packets is that skb->len
doesn't represent a precise amount of bytes on wire.
Headers are only accounted for the first segment.
For TCP, that's typically 66 bytes missing per 1448-byte segment,
an error of 4.5 % for a normal MSS value.
As consequences:
1) TBF/CBQ/HTB/NETEM/... can send more bytes than the assigned limits.
2) Device stats are slightly underestimated as well.
Fix this by taking headers into account in the qdisc_skb_cb(skb)->pkt_len
computation.
Packet schedulers should use qdisc pkt_len instead of skb->len for their
bandwidth limitations, and TSO enabled devices drivers could use pkt_len
if their statistics are not hardware assisted, and if they don't scratch
skb->cb[] first word.
Both egress and ingress paths work, thanks to commit fda55eca5a
(net: introduce skb_transport_header_was_set()) : If GRO built
a GSO packet, it also set the transport header for us.
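The accounting boils down to something like the following simplified sketch
of the described computation (not the exact helper as merged):

    static void qdisc_pkt_len_init(struct sk_buff *skb)
    {
            const struct skb_shared_info *shinfo = skb_shinfo(skb);

            qdisc_skb_cb(skb)->pkt_len = skb->len;

            /* add the headers duplicated on every segment after the first,
             * so schedulers see the real amount of bytes on the wire */
            if (shinfo->gso_size) {
                    unsigned int hdr_len;

                    hdr_len = skb_transport_header(skb) - skb_mac_header(skb);
                    if (shinfo->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6))
                            hdr_len += tcp_hdrlen(skb);
                    else
                            hdr_len += sizeof(struct udphdr);
                    qdisc_skb_cb(skb)->pkt_len +=
                            (shinfo->gso_segs - 1) * hdr_len;
            }
    }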
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Paolo Valente <paolo.valente@unimore.it>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Benefit from the fact that dev->addr_assign_type is set to NET_ADDR_PERM
in case the device has permanent address.
This also fixes the problem that many drivers do not set perm_addr at
all.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, netpoll only supports IPv4. This patch adds IPv6
support to netpoll so that we can run netconsole over IPv6 network.
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adjusts some structs and functions to prepare
for supporting IPv6.
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We have skb_mac_header_was_set() helper to tell if mac_header
was set on a skb. We would like the same for transport_header.
__netif_receive_skb() doesn't reset the transport header if it was already
set by the GRO layer.
Note that network stacks usually reset the transport header anyway,
after pulling the network header, so this change only allows
a followup patch to have more precise qdisc pkt_len computation
for GSO packets at ingress side.
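The helper itself is trivial, mirroring skb_mac_header_was_set(); roughly:

    static inline bool skb_transport_header_was_set(const struct sk_buff *skb)
    {
            /* ~0 is the "unset" marker used for sk_buff header offsets */
            return skb->transport_header !=
                   (typeof(skb->transport_header))~0U;
    }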
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
No need to check if ethtool_ops == NULL since it can't be.
Use local variable "ops" in functions where it is present
instead of dev->ethtool_ops.
Introduce local variable "ops" in functions where dev->ethtool_ops is used
many times.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>
Reviewed-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
splice() can handle pages of any order, but network code tries hard to
split them in PAGE_SIZE units. Not quite successfully anyway, as
__splice_segment() assumed poff < PAGE_SIZE. This is true for
the skb->data part, not necessarily for the fragments.
This patch removes this logic to give the pages as they are in the skb.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case user passed address via netlink during create, NET_ADDR_PERM was set.
That is not correct so fix this by setting NET_ADDR_SET.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Benefit from new upper dev list and free bonding from dev->master usage.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
These lists are supposed to serve for storing pointers to all upper devices.
Eventually they will replace the dev->master pointer, which is used for
bonding, bridge and team, but cannot be used for vlan or macvlan where
there might be multiple upper devices present. In case the upper link is
a replacement for dev->master, it is marked with the "master" flag.
The new upper device list resolves this limitation. Also, the information
stored in the lists is used for preventing looping setups like
"bond->somethingelse->samebond".
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the way to indicate that mac address of a device has been set by
dev_set_mac_address()
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Benefit from existence of dev_set_mac_address() and remove duplicate
code.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, we return -EINVAL for malformed or wrong BPF filters.
However, this is not done for BPF_S_ANC* operations, which makes it
more difficult to detect if it's actually supported or not by the
BPF machine. Therefore, we should also return -EINVAL if K is within
the SKF_AD_OFF universe and the ancillary operation did not match.
Why exactly is it needed? If tools such as libpcap/tcpdump want to
make use of new ancillary operations (like filtering VLAN in kernel
space), there is currently no sane way to test if this feature /
BPF_S_ANC* op is present or not, since no error is returned. This
patch will make life easier for that and allow for a proper usage
for user space applications.
There was concern whether this patch would break userland. Short answer: Yes
and no. Long answer: It will "break" only for code that calls ...
{ BPF_LD | BPF_(W|H|B) | BPF_ABS, 0, 0, <K> },
... where <K> is in [0xfffff000, 0xffffffff] _and_ <K> is *not* an
ancillary. And here comes the BUT: assuming some *old* code will have
such an instruction where <K> is between [0xfffff000, 0xffffffff] and
it doesn't know ancillary operations, then this will give a
non-expected / unwanted behavior as well (since we do not return the
BPF machine with 0 after a failed load_pointer(), which was the case
before introducing ancillary operations, but load sth. into the
accumulator instead, and continue with the next instruction, for
instance). Thus, user space code would already have been broken by
introducing ancillary operations into the BPF machine per se. Code
that does such a direct load, e.g. "load word at packet offset
0xffffffff into accumulator" ("ld [0xffffffff]") is quite broken,
isn't it? The whole assumption of ancillary operations is that no-one
intentionally calls things like "ld [0xffffffff]" and expect this
word to be loaded from such a packet offset. Hence, we can also safely
make use of this feature testing patch and facilitate application
development. Therefore, at least from this patch onwards, we have
*for sure* a check whether current or in future implemented BPF_S_ANC*
ops are supported in the kernel. Patch was tested on x86_64.
(Thanks to Eric for the previous review.)
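A simplified sketch of the check added in the filter validator (the exact
surrounding switch in sk_chk_filter() lists many more ancillary ops; it is
abbreviated here):

    case BPF_S_LD_W_ABS:
    case BPF_S_LD_H_ABS:
    case BPF_S_LD_B_ABS:
            switch (ftest->k) {
            case SKF_AD_OFF + SKF_AD_PROTOCOL:
            case SKF_AD_OFF + SKF_AD_PKTTYPE:
            case SKF_AD_OFF + SKF_AD_IFINDEX:
                    /* known ancillary loads: accept (more cases in the
                     * real code) */
                    break;
            default:
                    /* K inside the SKF_AD_OFF universe but no matching
                     * ancillary operation: reject the filter */
                    if (ftest->k >= (unsigned int)SKF_AD_OFF)
                            return -EINVAL;
                    break;
            }
            break;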
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Reported-by: Ani Sinha <ani@aristanetworks.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sparse detected case where this local function should be static.
It may even allow some compiler optimizations.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make carrier writable
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This allows a driver to register change_carrier callback which will be
called whenever user will like to change carrier state. This is useful
for devices like dummy, gre, team and so on.
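For a purely software device the callback can be as simple as the following
sketch (along the lines of what a dummy-like driver could do):

    static int dummy_change_carrier(struct net_device *dev, bool new_carrier)
    {
            if (new_carrier)
                    netif_carrier_on(dev);
            else
                    netif_carrier_off(dev);
            return 0;
    }

It is then hooked up via .ndo_change_carrier in the device's net_device_ops.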
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
CONFIG_HOTPLUG is always enabled now, so remove the unused code that was
trying to be compiled out when this option was disabled, in the
networking core.
Cc: Bill Pemberton <wfp5p@virginia.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using a seqlock for devnet_rename_seq is not a good idea,
as device_rename() can sleep.
As we hold RTNL, we don't need protection for writers,
and only need a seqcount so that readers can catch a change done
by a writer.
Bug added in commit c91f6df2db (sockopt: Change getsockopt() of
SO_BINDTODEVICE to return an interface name)
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull (again) user namespace infrastructure changes from Eric Biederman:
"Those bugs, those darn embarrasing bugs just want don't want to get
fixed.
Linus I just updated my mirror of your kernel.org tree and it appears
you successfully pulled everything except the last 4 commits that fix
those embarrasing bugs.
When you get a chance can you please repull my branch"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
userns: Fix typo in description of the limitation of userns_install
userns: Add a more complete capability subset test to commit_creds
userns: Require CAP_SYS_ADMIN for most uses of setns.
Fix cap_capable to only allow owners in the parent user namespace to have caps.
Pull user namespace changes from Eric Biederman:
"While small this set of changes is very significant with respect to
containers in general and user namespaces in particular. The user
space interface is now complete.
This set of changes adds support for unprivileged users to create user
namespaces and as a user namespace root to create other namespaces.
The tyranny of supporting suid root preventing unprivileged users from
using cool new kernel features is broken.
This set of changes completes the work on setns, adding support for
the pid, user, mount namespaces.
This set of changes includes a bunch of basic pid namespace
cleanups/simplifications. Of particular significance is the rework of
the pid namespace cleanup so it no longer requires sending out
tendrils into all kinds of unexpected cleanup paths for operation. At
least one case of broken error handling is fixed by this cleanup.
The files under /proc/<pid>/ns/ have been converted from regular files
to magic symlinks which prevents incorrect caching by the VFS,
ensuring the files always refer to the namespace the process is
currently using and ensuring that the ptrace_mayaccess permission
checks are always applied.
The files under /proc/<pid>/ns/ have been given stable inode numbers
so it is now possible to see if different processes share the same
namespaces.
Through David Miller's net tree are changes to relax many of the
permission checks in the networking stack to allow the user
namespace root to usefully use the networking stack. Similar changes
for the mount namespace and the pid namespace are coming through my
tree.
Two small changes to add user namespace support were committed here and
in David Miller's -net tree so that I could complete the work on the
/proc/<pid>/ns/ files in this tree.
Work remains to make it safe to build user namespaces and 9p, afs,
ceph, cifs, coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so the
Kconfig guard remains in place preventing user namespaces from
being built when any of those filesystems are enabled.
Future design work remains to allow root users outside of the initial
user namespace to mount more than just /proc and /sys."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (38 commits)
proc: Usable inode numbers for the namespace file descriptors.
proc: Fix the namespace inode permission checks.
proc: Generalize proc inode allocation
userns: Allow unprivilged mounts of proc and sysfs
userns: For /proc/self/{uid,gid}_map derive the lower userns from the struct file
procfs: Print task uids and gids in the userns that opened the proc file
userns: Implement unshare of the user namespace
userns: Implent proc namespace operations
userns: Kill task_user_ns
userns: Make create_new_namespaces take a user_ns parameter
userns: Allow unprivileged use of setns.
userns: Allow unprivileged users to create new namespaces
userns: Allow setting a userns mapping to your current uid.
userns: Allow chown and setgid preservation
userns: Allow unprivileged users to create user namespaces.
userns: Ignore suid and sgid on binaries if the uid or gid can not be mapped
userns: fix return value on mntns_install() failure
vfs: Allow unprivileged manipulation of the mount namespace.
vfs: Only support slave subtrees across different user namespaces
vfs: Add a user namespace reference from struct mnt_namespace
...
Andy Lutomirski <luto@amacapital.net> found a nasty little bug in
the permissions of setns. With unprivileged user namespaces it
became possible to create new namespaces without privilege.
However the setns calls were relaxed to only require CAP_SYS_ADMIN in
the user namespace of the target namespace,
which made the following nasty sequence possible.
pid = clone(CLONE_NEWUSER | CLONE_NEWNS);
if (pid == 0) { /* child */
system("mount --bind /home/me/passwd /etc/passwd");
}
else if (pid != 0) { /* parent */
char path[PATH_MAX];
snprintf(path, sizeof(path), "/proc/%u/ns/mnt", pid);
fd = open(path, O_RDONLY);
setns(fd, 0);
system("su -");
}
Prevent this possibility by requiring CAP_SYS_ADMIN
in the current user namespace when joining all but the user namespace.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Pull networking changes from David Miller:
1) Allow to dump, monitor, and change the bridge multicast database
using netlink. From Cong Wang.
2) RFC 5961 TCP blind data injection attack mitigation, from Eric
Dumazet.
3) Networking user namespace support from Eric W. Biederman.
4) tuntap/virtio-net multiqueue support by Jason Wang.
5) Support for checksum offload of encapsulated packets (basically,
tunneled traffic can still be checksummed by HW). From Joseph
Gasparakis.
6) Allow BPF filter access to VLAN tags, from Eric Dumazet and
Daniel Borkmann.
7) Bridge port parameters over netlink and BPDU blocking support
from Stephen Hemminger.
8) Improve data access patterns during inet socket demux by rearranging
socket layout, from Eric Dumazet.
9) TIPC protocol updates and cleanups from Ying Xue, Paul Gortmaker, and
Jon Maloy.
10) Update TCP socket hash sizing to be more in line with current day
realities. The existing heuristics were chosen a decade ago.
From Eric Dumazet.
11) Fix races, queue bloat, and excessive wakeups in ATM and
associated drivers, from Krzysztof Mazur and David Woodhouse.
12) Support DOVE (Distributed Overlay Virtual Ethernet) extensions
in VXLAN driver, from David Stevens.
13) Add "oops_only" mode to netconsole, from Amerigo Wang.
14) Support set and query of VEB/VEPA bridge mode via PF_BRIDGE, also
allow DCB netlink to work on namespaces other than the initial
namespace. From John Fastabend.
15) Support PTP in the Tigon3 driver, from Matt Carlson.
16) tun/vhost zero copy fixes and improvements, plus turn it on
by default, from Michael S. Tsirkin.
17) Support per-association statistics in SCTP, from Michele
Baldessari.
And many, many, driver updates, cleanups, and improvements. Too
numerous to mention individually.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1722 commits)
net/mlx4_en: Add support for destination MAC in steering rules
net/mlx4_en: Use generic etherdevice.h functions.
net: ethtool: Add destination MAC address to flow steering API
bridge: add support of adding and deleting mdb entries
bridge: notify mdb changes via netlink
ndisc: Unexport ndisc_{build,send}_skb().
uapi: add missing netconf.h to export list
pkt_sched: avoid requeues if possible
solos-pci: fix double-free of TX skb in DMA mode
bnx2: Fix accidental reversions.
bna: Driver Version Updated to 3.1.2.1
bna: Firmware update
bna: Add RX State
bna: Rx Page Based Allocation
bna: TX Intr Coalescing Fix
bna: Tx and Rx Optimizations
bna: Code Cleanup and Enhancements
ath9k: check pdata variable before dereferencing it
ath5k: RX timestamp is reported at end of frame
ath9k_htc: RX timestamp is reported at end of frame
...
Pull cgroup changes from Tejun Heo:
"A lot of activities on cgroup side. The big changes are focused on
making cgroup hierarchy handling saner.
- cgroup_rmdir() had peculiar semantics - it allowed cgroup
destruction to be vetoed by individual controllers and tried to
drain refcnt synchronously. The vetoing never worked properly and
caused good deal of contortions in cgroup. memcg was the last
reamining user. Michal Hocko removed the usage and cgroup_rmdir()
path has been simplified significantly. This was done in a
separate branch so that the memcg people can base further memcg
changes on top.
- The above allowed cleaning up cgroup lifecycle management and
implementation of generic cgroup iterators which are used to
improve hierarchy support.
- cgroup_freezer updated to allow migration in and out of a frozen
cgroup and handle hierarchy. If a cgroup is frozen, all descendant
cgroups are frozen.
- netcls_cgroup and netprio_cgroup updated to handle hierarchy
properly.
- Various fixes and cleanups.
- Two merge commits. One to pull in memcg and rmdir cleanups (needed
to build iterators). The other pulled in cgroup/for-3.7-fixes for
device_cgroup fixes so that further device_cgroup patches can be
stacked on top."
Fixed up a trivial conflict in mm/memcontrol.c as per Tejun (due to
commit bea8c150a7 ("memcg: fix hotplugged memory zone oops") in master
touching code close to commit 2ef37d3fe4 ("memcg: Simplify
mem_cgroup_force_empty_list error handling") in for-3.8)
* 'for-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (65 commits)
cgroup: update Documentation/cgroups/00-INDEX
cgroup_rm_file: don't delete the uncreated files
cgroup: remove subsystem files when remounting cgroup
cgroup: use cgroup_addrm_files() in cgroup_clear_directory()
cgroup: warn about broken hierarchies only after css_online
cgroup: list_del_init() on removed events
cgroup: fix lockdep warning for event_control
cgroup: move list add after list head initilization
netprio_cgroup: allow nesting and inherit config on cgroup creation
netprio_cgroup: implement netprio[_set]_prio() helpers
netprio_cgroup: use cgroup->id instead of cgroup_netprio_state->prioidx
netprio_cgroup: reimplement priomap expansion
netprio_cgroup: shorten variable names in extend_netdev_table()
netprio_cgroup: simplify write_priomap()
netcls_cgroup: move config inheritance to ->css_online() and remove .broken_hierarchy marking
cgroup: remove obsolete guarantee from cgroup_task_migrate.
cgroup: add cgroup->id
cgroup, cpuset: remove cgroup_subsys->post_clone()
cgroup: s/CGRP_CLONE_CHILDREN/CGRP_CPUSET_CLONE_CHILDREN/
cgroup: rename ->create/post_create/pre_destroy/destroy() to ->css_alloc/online/offline/free()
...
__copy_skb_header(nskb, p) already copied p->cb[], no need to copy
it again.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes the redundant occurrences of simple_strto<foo>.
Signed-off-by: Abhijit Pawar <abhi.c.pawar@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__napi_gro_receive() is inlined from two call sites for no good reason.
Let's move the prep stuff into a function of its own, called only if/when
needed. This saves 300 bytes on x86 :
# size net/core/dev.o.after net/core/dev.o.before
text data bss dec hex filename
51968 1238 1040 54246 d3e6 net/core/dev.o.before
51664 1238 1040 53942 d2b6 net/core/dev.o.after
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch replaces the obsolete simple_strto<foo> with kstrto<foo>.
Signed-off-by: Abhijit Pawar <abhi.c.pawar@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change allows the VXLAN to enable Tx checksum offloading even on
devices that do not support encapsulated checksum offloads. The
advantage to this is that it allows for the lower device to change due
to routing table changes without impacting features on the VXLAN itself.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support in the kernel for offloading in the NIC Tx and Rx
checksumming for encapsulated packets (such as VXLAN and IP GRE).
For Tx encapsulation offload, the driver will need to set the right bits
in netdev->hw_enc_features. The protocol driver will have to set the
skb->encapsulation bit and populate the inner headers, so the NIC driver will
use those inner headers to calculate the csum in hardware.
For Rx encapsulation offload, the driver will need to set again the
skb->encapsulation flag and skb->ip_summed to CHECKSUM_UNNECESSARY.
In that case the protocol driver should push the decapsulated packet up
to the stack, again with CHECKSUM_UNNECESSARY. In either case, the protocol
driver should set the skb->encapsulation flag back to zero. Finally the
protocol driver should have NETIF_F_RXCSUM flag set in its features.
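A hedged sketch of the Tx side in a tunnel driver (the offset variables are
illustrative placeholders, not the exact patch):

    /* tunnel xmit: inner headers are already built at this point */
    skb->encapsulation = 1;
    skb_set_inner_network_header(skb, inner_nh_off);
    skb_set_inner_transport_header(skb, inner_th_off);

    /* a NIC advertising the capability in netdev->hw_enc_features can now
     * checksum the inner packet from these offsets; otherwise fall back
     * to software checksumming */
    if (!(dev->hw_enc_features & NETIF_F_IP_CSUM))
            skb_checksum_help(skb);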
Signed-off-by: Joseph Gasparakis <joseph.gasparakis@intel.com>
Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 2e71a6f808 (net: gro: selective flush of packets) added
a bug for skbs using frag_list. This part of the GRO stack is rarely
used, as it needs skbs not using a page fragment for their skb->head.
Most drivers do use a page fragment, but some of them use GFP_KERNEL
allocations for the initial fill of their RX ring buffer.
napi_gro_flush() overwrites skb->prev, which was used for these skbs to
point to the last skb in frag_list.
Fix this using a separate field in struct napi_gro_cb to point to the
last fragment.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Do the same thing as in set mac. Call notifiers every time.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/core/neighbour.c:65:12: warning: 'zero' defined but not used [-Wunused-variable]
net/core/neighbour.c:66:12: warning: 'unres_qlen_max' defined but not used [-Wunused-variable]
These variables are only used when CONFIG_SYSCTL is defined,
so move them under #ifdef CONFIG_SYSCTL.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
Acked-by: Shan Wei <davidshan@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
unres_qlen_bytes and unres_qlen are of int type.
But the multiplicative relation (unres_qlen_bytes = unres_qlen * SKB_TRUESIZE(ETH_FRAME_LEN))
will cause a type overflow when setting unres_qlen, e.g.
$ echo 1027506 > /proc/sys/net/ipv4/neigh/eth1/unres_qlen
$ cat /proc/sys/net/ipv4/neigh/eth1/unres_qlen
1182657265
$ cat /proc/sys/net/ipv4/neigh/eth1/unres_qlen_bytes
-2147479756
The resulting value is not what we set,
but the user/administrator doesn't know this is caused by an int type overflow.
What's more, it is meaningless and even dangerous for unres_qlen_bytes to be
set to a negative number, because for an unresolved neighbour address the kernel
will cache packets without limit in __neigh_event_send() (e.g. (u32)-1 = 2GB).
Signed-off-by: Shan Wei <davidshan@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a new nic is created in namespace ns1, the kernel sends a KOBJ_ADD uevent
to ns1. When the nic is moved to ns2, we only send a KOBJ_MOVE to ns2, and
nothing to ns1.
This patch changes that behavior so that when moving a nic from ns1 to ns2, we
send a KOBJ_REMOVED to ns1 and KOBJ_ADD to ns2. (The KOBJ_MOVE is still
sent to ns2).
The effects of this can be seen when starting and stopping containers in
an upstart based host. Lxc will create a pair of veth nics, the kernel
sends KOBJ_ADD, and upstart starts network-instance jobs for each. When
one nic is moved to the container, because no KOBJ_REMOVED event is
received, the network-instance job for that veth never goes away. This
was reported at https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1065589
With this patch the network-instance jobs properly go away.
The other oddness solved here is that if a nic is passed into a running
upstart-based container, without this patch no network-instance job is
started in the container. But when the container creates a new nic
itself (ip link add new type veth) then network-interface jobs are
created. With this patch, behavior comes in line with a regular host.
v2: also send KOBJ_ADD to new netns. There will then be a
_MOVE event from the device_rename() call, but that should
be innocuous.
Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes an unused parameter (src_net) from rtnl_create_link()
method and from the method single invocation, in veth.
This parameter was used in the past when calling
ops->get_tx_queues(src_net, tb) in rtnl_create_link().
The get_tx_queues() member of rtnl_link_ops was replaced by two methods,
get_num_tx_queues() and get_num_rx_queues(), which do not get any
parameter. This was done in commit d40156aa5e by
Jiri Pirko ("rtnl: allow to specify different num for rx and tx queue count").
Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch changes three methods to be static and removes their
EXPORT_SYMBOLs in core/dev.c and their external declaration in
netdevice.h. The methods, dev_gro_receive(), napi_frags_finish() and
napi_skb_finish(), which are in the GRO rx path, are not used
outside core/dev.c.
Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of having the getsockopt() of SO_BINDTODEVICE return an index, which
will then require another call like if_indextoname() to get the actual interface
name, have it return the name directly.
This also matches the existing man page description on socket(7) which mentions
the argument being an interface name.
If the value has not been set, zero is returned and optlen will be set to zero
to indicate there is no interface name present.
Added a seqlock to protect this code path, and dev_ifname(), from someone
changing the device name via dev_change_name().
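User-space usage then becomes straightforward (sketch):

    char ifname[IFNAMSIZ] = "";
    socklen_t len = sizeof(ifname);

    if (getsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE, ifname, &len) == 0) {
            if (len == 0)
                    printf("socket is not bound to a device\n");
            else
                    printf("socket is bound to %s\n", ifname);
    }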
v2: Added seqlock protection while copying device name.
v3: Fixed word wrap in patch.
Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/wireless/iwlwifi/pcie/tx.c
Minor iwlwifi conflict in TX queue disabling between 'net', which
removed a bogus warning, and 'net-next' which added some status
register poking code.
Signed-off-by: David S. Miller <davem@davemloft.net>
Inherit netprio configuration from ->css_online(), allow nesting and
remove .broken_hierarchy marking. This makes netprio_cgroup's
behavior match netcls_cgroup's.
Note that this patch changes userland-visible behavior. Nesting is
allowed and the first level cgroups below the root cgroup behave
differently - they inherit priorities from the root cgroup on creation
instead of starting with 0. This is unfortunate but not doing so is
much crazier.
Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-and-Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Acked-by: David S. Miller <davem@davemloft.net>
Introduce two helpers - netprio_prio() and netprio_set_prio() - which
hide the details of priomap access and expansion. This will help
implementing hierarchy support.
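The helpers are small; a sketch consistent with the description (details may
differ from the actual patch):

    static u32 netprio_prio(struct cgroup *cgrp, struct net_device *dev)
    {
            struct netprio_map *map = rcu_dereference_rtnl(dev->priomap);

            /* ids beyond the current map size simply read as priority 0 */
            if (map && cgrp->id < map->priomap_len)
                    return map->priomap[cgrp->id];
            return 0;
    }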
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Tested-and-Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Acked-by: David S. Miller <davem@davemloft.net>
With priomap expansion no longer depending on knowing max id
allocated, netprio_cgroup can use cgroup->id instead of cs->prioidx.
Drop prioidx alloc/free logic and convert all uses to cgroup->id.
* In cgrp_css_alloc(), parent->id test is moved above @cs allocation
to simplify error path.
* In cgrp_css_free(), @cs assignment is made initialization.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Tested-and-Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Acked-by: David S. Miller <davem@davemloft.net>
netprio kept track of the highest prioidx allocated and resized
priomaps accordingly when necessary. This makes it necessary to keep
track of prioidx allocation and may end up resizing on every new
prioidx.
Update extend_netdev_table() such that it takes @target_idx which the
priomap should be able to accommodate. If the priomap is large enough,
nothing happens; otherwise, the size is doubled until @target_idx can
be accommodated.
This makes max_prioidx and write_update_netdev_table() unnecessary.
write_priomap() now calls extend_netdev_table() directly.
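The growth policy is simply "double until it fits"; a sketch (the helper
name and the starting size are assumptions for illustration):

    static u32 grow_priomap_len(u32 cur_len, u32 target_idx)
    {
            u32 new_len = cur_len ? cur_len : 16;  /* start size assumed */

            /* double until target_idx fits in the map */
            while (new_len <= target_idx)
                    new_len *= 2;
            return new_len;
    }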
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Tested-and-Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Acked-by: David S. Miller <davem@davemloft.net>
The function is about to go through a rewrite. In preparation,
shorten the variable names so that we don't repeat "priomap" so often.
This patch is cosmetic.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Tested-and-Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Acked-by: David S. Miller <davem@davemloft.net>
sscanf() doesn't bite.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Tested-and-Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Acked-by: David S. Miller <davem@davemloft.net>
John W. Linville says:
====================
This is a batch of fixes intended for 3.7...
Included are two pulls. Regarding the mac80211 tree, Johannes says:
"Please pull my mac80211.git tree (see below) to get two more fixes for
3.7. Both fix regressions introduced *before* this cycle that weren't
noticed until now, one for IBSS not cleaning up properly and the other
to add back the "wireless" sysfs directory for Fedora's startup scripts."
Regarding the iwlwifi tree, Johannes says:
"Please also pull my iwlwifi.git tree, I have two fixes: one to remove a
spurious warning that can actually trigger in legitimate situations, and
the other to fix a regression from when monitor mode was changed to use
the "sniffer" firmware mode."
Also included is an nfc tree pull. Samuel says:
"We mostly have pn533 fixes here, 2 memory leaks and an early unlocking fix.
Moreover, we also have an LLCP adapter linked list insertion fix."
On top of that, a few more bits... Albert Pool adds a USB ID
to rtlwifi. Bing Zhao provides two mwifiex fixes -- one to fix
a system hang during a command timeout, and the other to properly
report a suspend error to the MMC core. Finally, Sujith Manoharan
fixes a thinko that would trigger an ath9k hang during device reset.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Assign a unique proc inode to each namespace, and use that
inode number to ensure we only allocate at most one proc
inode for every namespace in proc.
A single proc inode per namespace allows userspace to test
to see if two processes are in the same namespace.
This has been a long requested feature and only blocked because
a naive implementation would put the id in a global space and
would ultimately require having a namespace for the names of
namespaces, making migration and certain virtualization tricks
impossible.
We still don't have per superblock inode numbers for proc, which
appears necessary for application unaware checkpoint/restart and
migrations (if the application is using namespace file descriptors)
but that is now allowed by the design if it becomes important.
I have preallocated the ipc and uts initial proc inode numbers so
their structures can be statically initialized.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
- Push the permission check from the core setns syscall into
the setns install methods where the user namespace of the
target namespace can be determined, and used in a ns_capable
call.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The wireless and wext includes in net-sysfs.c aren't
needed, so remove them.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
flush_tasklet is a struct, not a pointer, in the percpu variable,
so use this_cpu_ptr() to get a pointer to the member.
Signed-off-by: Shan Wei <davidshan@tencent.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename cgroup_subsys css lifetime related callbacks to better describe
what their roles are. Also, update documentation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
The user namespace which creates a new network namespace owns that
namespace and all resources created in it. This way we can target
capability checks for privileged operations against network resources to
the user_ns which created the network namespace in which the resource
lives. Privilege to the user namespace which owns the network
namespace, or any parent user namespace thereof, provides the same
privilege to the network resource.
This patch is reworked from a version originally by
Serge E. Hallyn <serge.hallyn@canonical.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
The copy of copy_net_ns used when the network stack is not
built is broken as it does not return -EINVAL when attempting
to create a new network namespace. We don't even have
a previous network namespace.
Since we need a copy of copy_net_ns in net/net_namespace.h that is
available when the networking stack is not built at all move the
correct version of copy_net_ns from net_namespace.c into net_namespace.h
Leaving us with just two versions of copy_net_ns: one version for when
we compile in network namespace support and another stub for all other
occasions.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Allow an unprivileged user who has created a user namespace, and then
created a network namespace, to effectively use the new network
namespace, by reducing capable(CAP_NET_ADMIN) and
capable(CAP_NET_RAW) calls to ns_capable(net->user_ns,
CAP_NET_ADMIN) and ns_capable(net->user_ns, CAP_NET_RAW) calls.
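As an illustration only (not part of the patch; the setter below is
hypothetical), the conversion pattern looks roughly like this:
#include <linux/capability.h>
#include <net/sock.h>

/* hypothetical setter, shown only for the capable() -> ns_capable() pattern */
static int example_set_priority(struct sock *sk, u32 val)
{
    struct net *net = sock_net(sk);

    /* before: if (!capable(CAP_NET_ADMIN)) return -EPERM; */
    if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
        return -EPERM;

    sk->sk_priority = val;
    return 0;
}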
Settings that merely control a single network device are allowed.
Either the network device is a logical network device where
restrictions make no difference, or the network device is a hardware NIC
that has been explicitly moved from the initial network namespace.
In general policy and network stack state changes are allowed
while resource control is left unchanged.
Allow ethtool ioctls.
Allow binding to network devices.
Allow setting the socket mark.
Allow setting the socket priority.
Allow setting the network device alias via sysfs.
Allow setting the mtu via sysfs.
Allow changing the network device flags via sysfs.
Allow setting the network device group via sysfs.
Allow the following network device ioctls.
SIOCGMIIPHY
SIOCGMIIREG
SIOCSIFNAME
SIOCSIFFLAGS
SIOCSIFMETRIC
SIOCSIFMTU
SIOCSIFHWADDR
SIOCSIFSLAVE
SIOCADDMULTI
SIOCDELMULTI
SIOCSIFHWBROADCAST
SIOCSMIIREG
SIOCBONDENSLAVE
SIOCBONDRELEASE
SIOCBONDSETHWADDR
SIOCBONDCHANGEACTIVE
SIOCBRADDIF
SIOCBRDELIF
SIOCSHWTSTAMP
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the user calling sendmsg has the appropriate privileges
in their user namespace allow them to set the uid, gid, and
pid in the SCM_CREDENTIALS control message to any valid value.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- In rtnetlink_rcv_msg convert the capable(CAP_NET_ADMIN) check
to ns_capable(net->user_ns, CAP_NET_ADMIN), allowing unprivileged
users to make netlink calls to modify their local network
namespace.
- In the rtnetlink doit methods add capable(CAP_NET_ADMIN) so
that calls that are not safe for unprivileged users are still
protected.
Later patches will remove the extra capable calls from methods
that are safe for unprivileged users.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for supporting the creation of network namespaces
by unprivileged users, modify all of the per net sysctl exports
and refuse to allow them to unprivileged users.
This makes it safe for unprivileged users in general to access
per net sysctls, and allows sysctls to be exported to unprivileged
users on an individual basis as they are deemed safe.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 35b2a113cb broke (at least)
Fedora's networking scripts; they check for the existence of the
wireless directory. As the files aren't used, add the directory
back and not the files. Also do it for both drivers based on the
old wireless extensions and cfg80211, regardless of whether the
compat code for wext is built into cfg80211 or not.
Cc: stable@vger.kernel.org [3.6]
Reported-by: Dave Airlie <airlied@gmail.com>
Reported-by: Bill Nottingham <notting@redhat.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In commit c445477d74 which adds aRFS to the kernel, the CPU
selected for RFS is not set correctly when the CPU is changing.
This is causing out-of-order (OOO) packets and probably other issues.
Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
offload_base is protected by offload_lock, not ptype_lock
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Vlad Yasevich <vyasevic@redhat.com>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The check (ha->addr == dev->dev_addr) is always true because dev_addr_init()
sets this. Correct the check so that it behaves properly on address removal.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the offload callbacks into their own structure.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert to using the new GSO/GRO registration mechanism and new
packet offload structure.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create a new data structure to contain the GRO/GSO callbacks and add
a new registration mechanism.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
Minor conflict between the BCM_CNIC define removal in net-next
and a bug fix added to net. Based upon a conflict resolution
patch posted by Stephen Rothwell.
Signed-off-by: David S. Miller <davem@davemloft.net>
Due to a NULL dereference, the following patch is causing an oops
under normal traffic conditions:
commit c0de08d042
Author: Eric Leblond <eric@regit.org>
Date: Thu Aug 16 22:02:58 2012 +0000
af_packet: don't emit packet on orig fanout group
This buggy patch was a feature fix and has reached most stable
branches.
When skb->sk is NULL and when packet fanout is used, there is a
crash in match_fanout_group where skb->sk is accessed.
This patch fixes the issue by returning false as soon as the
socket is NULL: this corresponds to the wanted behavior because
the kernel has to resend the skb to all the listening sockets in
this case.
Signed-off-by: Eric Leblond <eric@regit.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bridge notify hook rtnl_bridge_notify() was not handling the
case where the master flag was set or both flags were set. First,
the flags were not being passed correctly, and second, the logic to
parse them was broken.
This patch passes the original flags value and fixes the
logic.
Reported-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change the dflt fdb dump handler to use RTM_NEWNEIGH to
be compatible with bridge dump routines.
The dump reply from the network driver handlers should
match the reply from the bridge handler. The fact that they did
not in the ixgbe case was effectively a bug. This patch
resolves it.
Applications that were not checking the nlmsg type will
continue to work. And now applications that do check
the type will work as expected.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some years ago, the ktime_t helper functions ktime_now() and ktime_lt()
were introduced. Instead of defining them inside pktgen.c, they
should either use ktime_t library functions or, if not available, they
should be defined in ktime.h, so that others can also benefit from them.
ktime_compare() is introduced with a similar notion as in timespec_compare().
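A minimal sketch of such a helper, mirroring timespec_compare() (the real
definition lives in ktime.h and may differ in detail):
/* sketch: returns <0 if cmp1 < cmp2, >0 if cmp1 > cmp2, 0 if equal */
static inline int ktime_compare(const ktime_t cmp1, const ktime_t cmp2)
{
    if (cmp1.tv64 < cmp2.tv64)
        return -1;
    if (cmp1.tv64 > cmp2.tv64)
        return 1;
    return 0;
}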
Signed-off-by: Daniel Borkmann <daniel.borkmann@tik.ee.ethz.ch>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit e5a55a8987 ('net: create generic
bridge ops') broke the handling of a non-zero starting index in
rtnl_bridge_getlink() (based on the old br_dump_ifinfo()).
When the starting index is non-zero, we need to increment the current
index for each entry that we are skipping. Also, we need to check the
index before both cases, since we may previously have stopped
iteration between getting information about a device from its master
and from itself.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Tested-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Orphaning frags for zero copy skbs needs to allocate data in atomic
context, so it has a chance to fail. If it does, we currently discard
the skb, which is safe, but we don't report anything to the caller,
so it cannot recover by e.g. disabling zero copy.
Add an API to free skb reporting such errors: this is used
by tun in case orphaning frags fails.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Even if an skb is marked for zero copy, the net core might still decide
to copy it later, which is somewhat slower than a copy in user context:
besides copying the data we need to pin/unpin the pages.
Add a parameter reporting such cases through the zero copy callback:
if this happens a lot, the device can take this into account
and switch to copying in user context.
This patch updates all users but ignores the passed value for now:
it will be used by follow-up patches.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The SO_ATTACH_FILTER option is set-only. I propose to add the get
ability by using SO_ATTACH_FILTER in getsockopt. To be easier on the
eyes, the SO_GET_FILTER alias to it is declared. This ability is
required by the checkpoint-restore project to be able to save the
full state of a socket.
There are two issues with getting filter back.
First, kernel modifies the sock_filter->code on filter load, thus in
order to return the filter element back to user we have to decode it
into user-visible constants. Fortunately the modification in question
is interconvertible.
Second, the BPF_S_ALU_DIV_K code modifies the command argument k to
speed up the run-time division by doing kernel_k = reciprocal(user_k).
The bad news is that different user_k may result in the same kernel_k, so we
can't get the original user_k back. The good news is that we don't have
to: what we need is to calculate a user2_k such that
reciprocal(user2_k) == reciprocal(user_k) == kernel_k
i.e. if it's re-loaded, the recompiled value will be exactly
the same as it was. That said, user2_k can be calculated like this
user2_k = reciprocal(kernel_k)
with an exception, that if kernel_k == 0, then user2_k == 1.
The optlen argument is treated like this: when zero, the kernel returns
the number of sock_fprog elements in the filter; otherwise it should be
large enough for the sock_fprog array.
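A rough userspace sketch of the intended two-call usage (illustrative only;
dump_filter() is a made-up helper and error handling is minimal):
#include <stdlib.h>
#include <sys/socket.h>
#include <linux/filter.h>

static struct sock_filter *dump_filter(int fd, socklen_t *len)
{
    struct sock_filter *insns;
    socklen_t optlen = 0;

    /* optlen == 0: the kernel reports the number of filter blocks */
    if (getsockopt(fd, SOL_SOCKET, SO_GET_FILTER, NULL, &optlen) < 0)
        return NULL;

    insns = calloc(optlen, sizeof(*insns));
    if (!insns)
        return NULL;

    /* second call: optlen must now be large enough for the whole program */
    if (getsockopt(fd, SOL_SOCKET, SO_GET_FILTER, insns, &optlen) < 0) {
        free(insns);
        return NULL;
    }

    *len = optlen;
    return insns;
}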
changes since v1:
* Declared SO_GET_FILTER in all arch headers
* Added decode of vlan-tag codes
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
BPF filters lack the ability to access skb->vlan_tci.
This patch adds two new ancillary accessors :
SKF_AD_VLAN_TAG (44) mapped to vlan_tx_tag_get(skb)
SKF_AD_VLAN_TAG_PRESENT (48) mapped to vlan_tx_tag_present(skb)
This allows libpcap/tcpdump to use a kernel filter instead of
having to fall back to accepting all packets and then filtering them in
user space.
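For illustration, a classic BPF snippet using the new ancillary loads could
look like this (accept only frames carrying a VLAN tag):
#include <linux/filter.h>

static struct sock_filter vlan_tagged_only[] = {
    /* A = vlan_tx_tag_present(skb), via the ancillary offset */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
             SKF_AD_OFF + SKF_AD_VLAN_TAG_PRESENT),
    /* if (A == 0) drop, otherwise accept the whole packet */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, 0),
    BPF_STMT(BPF_RET | BPF_K, 0xffffffff),
};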
Signed-off-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Ani Sinha <ani@aristanetworks.com>
Suggested-by: Daniel Borkmann <danborkmann@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds support for the net device ops to manage the embedded
hardware bridge on ixgbe devices. With this patch the bridge
mode can be toggled between VEB and VEPA to support stacking
macvlan devices or using the embedded switch without any SW
component in 802.1Qbg/br environments.
Additionally, this adds source address pruning to the ixgbevf
driver to prune any frames sent back from a reflective relay on
the switch. This is required because the existing hardware does
not support this. Without it, frames get pushed into the stack
with their own source MAC, which is invalid per the 802.1Qbg VEPA
definition.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hardware switches may support enabling and disabling the
loopback switch which puts the device in a VEPA mode defined
in the IEEE 802.1Qbg specification. In this mode frames are
not switched in the hardware but sent directly to the switch.
SR-IOV capable NICs will likely support this mode; I am
aware of at least two such devices. Also, I am told (but don't
have any of this hardware available) that there are devices
that only support VEPA modes. In these cases it is important,
at a minimum, to be able to query these attributes.
This patch adds an additional IFLA_BRIDGE_MODE attribute that can be
set and dumped via the PF_BRIDGE:{SET|GET}LINK operations. Also
anticipating bridge attributes that may be common for both embedded
bridges and software bridges this adds a flags attribute
IFLA_BRIDGE_FLAGS currently used to determine if the command or event
is being generated to/from an embedded bridge or software bridge.
Finally, the event generation is pulled out of the bridge module and
into rtnetlink proper.
For example using the macvlan driver in VEPA mode on top of
an embedded switch requires putting the embedded switch into
a VEPA mode to get the expected results.
-------- --------
| VEPA | | VEPA | <-- macvlan vepa edge relays
-------- --------
| |
| |
------------------
| VEPA | <-- embedded switch in NIC
------------------
|
|
-------------------
| external switch | <-- shiny new physical
------------------- switch with VEPA support
A packet sent from the macvlan VEPA at the top could be
looped back on the embedded switch and never seen by the
external switch. So in order for this to work the embedded
switch needs to be set in the VEPA state via the above
described commands.
By making these attributes nested in IFLA_AF_SPEC we allow
future extensions to be made as needed.
CC: Lennert Buytenhek <buytenh@wantstofly.org>
CC: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The PF_BRIDGE:RTM_{GET|SET}LINK nlmsg family and type are
currently embedded in the ./net/bridge module. This prohibits
them from being used by other bridging devices. One example
is hardware that has embedded bridging components.
In order to use these nlmsg types more generically, this patch
adds two net_device_ops hooks: one to set bridge link attributes
and another to dump the current bridge attributes.
ndo_bridge_setlink()
ndo_bridge_getlink()
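A driver-side sketch of wiring up the new hooks (the driver name and the
bodies are placeholders; the hook signatures shown are the ones from this
series and were extended in later kernels):
#include <linux/netdevice.h>
#include <linux/rtnetlink.h>

static int mydrv_bridge_setlink(struct net_device *dev, struct nlmsghdr *nlh)
{
    /* parse IFLA_AF_SPEC / IFLA_BRIDGE_MODE and program the embedded bridge */
    return 0;
}

static int mydrv_bridge_getlink(struct sk_buff *skb, u32 pid, u32 seq,
                                struct net_device *dev)
{
    /* fill a PF_BRIDGE RTM_NEWLINK message with the current bridge mode */
    return 0;
}

static const struct net_device_ops mydrv_netdev_ops = {
    /* ... regular ops ... */
    .ndo_bridge_setlink = mydrv_bridge_setlink,
    .ndo_bridge_getlink = mydrv_bridge_getlink,
};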
CC: Lennert Buytenhek <buytenh@wantstofly.org>
CC: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sock_update_classid() assumes that the update operation is always
applied to the current task. sock_update_classid() needs to know
which task to work on in order to be able to migrate tasks between
cgroups using the struct cgroup_subsys attach() callback.
Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Joe Perches <joe@perches.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: <netdev@vger.kernel.org>
Cc: <cgroups@vger.kernel.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As Eric pointed out:
"Hey task_cls_classid() has its own rcu protection since commit
3fb5a99191 (cls_cgroup: Fix rcu lockdep warning)
So we can safely revert Paul commit (1144182a87)
(We no longer need rcu_read_lock/unlock here)"
Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: netdev@vger.kernel.org
Cc: cgroups@vger.kernel.org
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net_prio_attach() is only accessed via cgroup_subsys callbacks,
so we can reduce the visibility of this function.
Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: <netdev@vger.kernel.org>
Cc: <cgroups@vger.kernel.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no need to check the return value of tab since the NULL situation
has been handled already, and rtnl_msg_handlers[PF_UNSPEC] has been
initialized as non-NULL during rtnetlink_init().
Signed-off-by: Hans Zhang <zhanghonghui@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mike Kazantsev found 3.5 kernels and beyond were leaking memory,
and tracked the faulty commit to a1c7fff7e1 ("net:
netdev_alloc_skb() use build_skb()")
While this commit seems fine, it uncovered a bug introduced
in commit bad43ca832 ("net: introduce skb_try_coalesce()"), in function
kfree_skb_partial():
If the head is stolen, we free the sk_buff
without removing references on the secpath (skb->sp).
So IPsec + IP defrag/reassembly (using skb coalescing), or
TCP coalescing could leak secpath objects.
Fix this bug by calling skb_release_head_state(skb) to properly
release all possible references to linked objects.
Reported-by: Mike Kazantsev <mk.fraggod@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Bisected-by: Mike Kazantsev <mk.fraggod@gmail.com>
Tested-by: Mike Kazantsev <mk.fraggod@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes double assignment of err to -EINVAL in dev_change_net_namespace().
Signed-off-by: Rami Rosen <ramirose@gmail.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The SO_BINDTODEVICE option is the only SOL_SOCKET one that can be set, but
cannot be read back via the sockopt API. The only way we can find the device id a
socket is bound to is via the sock-diag interface. But the diag works only on
hashed sockets, while the option in question can be set for a yet-unhashed one.
That said, in order to know what device a socket is bound to (we do want
to know this in checkpoint-restore project) I propose to make this option
getsockopt-able and report the respective device index.
Another solution to the problem might be to teach the sock-diag reporting
info on unhashed sockets. Should I go this way instead?
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is not easy to use in4_pton() correctly without reading
its definition, so add some doc for it.
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is not easy to use in6_pton() correctly without reading
its definition, so add some doc for it.
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is weird to display the IPv4 address in %x format; what's more,
the IPv6 address is already displayed in human-readable format. So,
make the IPv4 address human-readable too.
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ETH_ZLEN is too small for IPv6, so this default value is not
suitable.
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For IPv6, sizeof(struct ipv6hdr) = 40, thus the following
expression will turn out negative:
datalen = pkt_dev->cur_pkt_size - 14 -
sizeof(struct ipv6hdr) - sizeof(struct udphdr) -
pkt_dev->pkt_overhead;
And the check "if (datalen < sizeof(struct pktgen_hdr))" will
pass because "datalen" is promoted to unsigned, and will therefore
cause a crash later.
This is a quick fix that checks whether "datalen" is negative. The following
patch will increase the default value of 'min_pkt_size' for IPv6.
This bug has likely existed for a long time, so Cc -stable too.
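A standalone illustration of the promotion pitfall (plain userspace C, not
the pktgen code; the clamp at the end is only for the demo):
#include <stdio.h>

int main(void)
{
    /* mimics cur_pkt_size - 14 - sizeof(ipv6hdr) - sizeof(udphdr) - overhead */
    int datalen = 60 - 14 - 40 - 8;    /* -2 for a small IPv6 packet */

    /* the comparison promotes datalen to unsigned, so the
     * "is it too small?" check never fires */
    if ((unsigned long)datalen < sizeof(long))
        puts("too small, handled");
    else
        puts("check passed although datalen < 0 -> crash later");

    /* the quick fix: test the sign explicitly before the size check */
    if (datalen < 0)
        datalen = sizeof(long);        /* demo clamp only */

    printf("datalen = %d\n", datalen);
    return 0;
}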
Cc: <stable@vger.kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 6a32e4f9dd made the vlan code skip marking vlan-tagged frames
for vlans that are not locally configured as PACKET_OTHERHOST if
there was an rx_handler, as the rx_handler could cause the frame to be received
on a different (virtual) vlan-capable interface where that vlan might be
configured.
As rx_handlers do not necessarily return RX_HANDLER_ANOTHER, this could cause
frames for unknown vlans to be delivered to the protocol stack as if they had
been received untagged.
For example, if an ipv6 router advertisement that's tagged for a vlan that is
not locally configured is received on an interface with macvlan interfaces attached,
macvlan's rx_handler returns RX_HANDLER_PASS after delivering the frame to the
macvlan interfaces, which caused it to be passed to the protocol stack, leading
to ipv6 addresses for the announced prefix being configured even though those
are completely unusable on the underlying interface.
The fix moves marking as PACKET_OTHERHOST after the rx_handler so the
rx_handler, if there is one, sees the frame unchanged, but afterwards,
before the frame is delivered to the protocol stack, it gets marked whether
there is an rx_handler or not.
Signed-off-by: Florian Zumbiehl <florz@florz.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current GRO can hold packets in gro_list for almost unlimited
time, in case napi->poll() handler consumes its budget over and over.
In this case, napi_complete()/napi_gro_flush() are not called.
Another problem is that gro_list is flushed in a non-friendly way:
we scan the list and complete packets in reverse order
(youngest packets first, oldest packets last).
This defeats priorities that the sender could have cooked.
Since GRO currently only stores TCP packets, we don't really notice the
bug because of retransmits, but this behavior can add unexpected
latencies, particularly on mice flows clamped by elephant flows.
This patch makes sure no packet can stay more than 1 ms in the queue, and
only in stress situations.
It also completes packets in the right order to minimize latencies.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Jesse Gross <jesse@nicira.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before accessing the skb's first fragment, better make sure there
is one.
This is probably not needed for old kernels, since an ethernet frame
cannot contain only an ethernet header, but the recent GRO addition
to tunnels makes this patch needed.
Also, skb_gro_reset_offset() can be static, which actually allows the
compiler to inline it.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The retry loop in neigh_resolve_output() and neigh_connected_output()
calls dev_hard_header() without resetting the skb to the network_header.
This causes the retry to fail with skb_under_panic. The fix is to
reset the network_header within the retry loop.
Signed-off-by: Ramesh Nagappa <ramesh.nagappa@ericsson.com>
Reviewed-by: Shawn Lu <shawn.lu@ericsson.com>
Reviewed-by: Robert Coulson <robert.coulson@ericsson.com>
Reviewed-by: Billie Alsup <billie.alsup@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Over time, the skb recycling infrastructure got little interest and
many bugs. Generic rx path skb allocation is now using page
fragments for efficient GRO / TCP coalescing, and recycling
a tx skb for the rx path is not worth the pain.
The last identified bug is that fat skbs can be recycled
and this can end up using high-order pages after a few iterations.
With help from Maxime Bizon, who pointed out that commit
87151b8689 (net: allow pskb_expand_head() to get maximum tailroom)
introduced this regression for recycled skbs.
Instead of fixing this bug, let's remove skb recycling.
Drivers wanting really hot skbs should use build_skb() anyway,
to allocate/populate the sk_buff right before netif_receive_skb().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Maxime Bizon <mbizon@freebox.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull vfs update from Al Viro:
- big one - consolidation of descriptor-related logics; almost all of
that is moved to fs/file.c
(BTW, I'm seriously tempted to rename the result to fd.c. As it is,
we have a situation when file_table.c is about handling of struct
file and file.c is about handling of descriptor tables; the reasons
are historical - file_table.c used to be about a static array of
struct file we used to have way back).
A lot of stray ends got cleaned up and converted to saner primitives,
disgusting mess in android/binder.c is still disgusting, but at least
doesn't poke so much in descriptor table guts anymore. A bunch of
relatively minor races got fixed in process, plus an ext4 struct file
leak.
- related thing - fget_light() partially unuglified; see fdget() in
there (and yes, it generates the code as good as we used to have).
- also related - bits of Cyrill's procfs stuff that got entangled into
that work; _not_ all of it, just the initial move to fs/proc/fd.c and
switch of fdinfo to seq_file.
- Alex's fs/coredump.c split-off - the same story, had been easier to
take that commit than mess with conflicts. The rest is a separate
pile, this was just a mechanical code movement.
- a few misc patches all over the place. Not all for this cycle,
there'll be more (and quite a few currently sit in akpm's tree)."
Fix up trivial conflicts in the android binder driver, and some fairly
simple conflicts due to two different changes to the sock_alloc_file()
interface ("take descriptor handling from sock_alloc_file() to callers"
vs "net: Providing protocol type via system.sockprotoname xattr of
/proc/PID/fd entries" adding a dentry name to the socket)
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits)
MAX_LFS_FILESIZE should be a loff_t
compat: fs: Generic compat_sys_sendfile implementation
fs: push rcu_barrier() from deactivate_locked_super() to filesystems
btrfs: reada_extent doesn't need kref for refcount
coredump: move core dump functionality into its own file
coredump: prevent double-free on an error path in core dumper
usb/gadget: fix misannotations
fcntl: fix misannotations
ceph: don't abuse d_delete() on failure exits
hypfs: ->d_parent is never NULL or negative
vfs: delete surplus inode NULL check
switch simple cases of fget_light to fdget
new helpers: fdget()/fdput()
switch o2hb_region_dev_write() to fget_light()
proc_map_files_readdir(): don't bother with grabbing files
make get_file() return its argument
vhost_set_vring(): turn pollstart/pollstop into bool
switch prctl_set_mm_exe_file() to fget_light()
switch xfs_find_handle() to fget_light()
switch xfs_swapext() to fget_light()
...
Pull networking changes from David Miller:
1) GRE now works over ipv6, from Dmitry Kozlov.
2) Make SCTP more network namespace aware, from Eric Biederman.
3) TEAM driver now works with non-ethernet devices, from Jiri Pirko.
4) Make openvswitch network namespace aware, from Pravin B Shelar.
5) IPV6 NAT implementation, from Patrick McHardy.
6) Server side support for TCP Fast Open, from Jerry Chu and others.
7) Packet BPF filter supports MOD and XOR, from Eric Dumazet and Daniel
Borkmann.
8) Increase the loopback default MTU to 64K, from Eric Dumazet.
9) Use a per-task rather than per-socket page fragment allocator for
outgoing networking traffic. This benefits processes that have very
many mostly idle sockets, which is quite common.
From Eric Dumazet.
10) Use up to 32K for page fragment allocations, with fallbacks to
smaller sizes when higher order page allocations fail. Benefits are
a) less segments for driver to process b) less calls to page
allocator c) less waste of space.
From Eric Dumazet.
11) Allow GRO to be used on GRE tunnels, from Eric Dumazet.
12) VXLAN device driver, one way to handle VLAN issues such as the
limitation of 4096 VLAN IDs yet still have some level of isolation.
From Stephen Hemminger.
13) As usual there is a large boatload of driver changes, with the scale
perhaps tilted towards the wireless side this time around.
Fix up various fairly trivial conflicts, mostly caused by the user
namespace changes.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1012 commits)
hyperv: Add buffer for extended info after the RNDIS response message.
hyperv: Report actual status in receive completion packet
hyperv: Remove extra allocated space for recv_pkt_list elements
hyperv: Fix page buffer handling in rndis_filter_send_request()
hyperv: Fix the missing return value in rndis_filter_set_packet_filter()
hyperv: Fix the max_xfer_size in RNDIS initialization
vxlan: put UDP socket in correct namespace
vxlan: Depend on CONFIG_INET
sfc: Fix the reported priorities of different filter types
sfc: Remove EFX_FILTER_FLAG_RX_OVERRIDE_IP
sfc: Fix loopback self-test with separate_tx_channels=1
sfc: Fix MCDI structure field lookup
sfc: Add parentheses around use of bitfield macro arguments
sfc: Fix null function pointer in efx_sriov_channel_type
vxlan: virtual extensible lan
igmp: export symbol ip_mc_leave_group
netlink: add attributes to fdb interface
tg3: unconditionally select HWMON support when tg3 is enabled.
Revert "net: ti cpsw ethernet: allow reading phy interface mode from DT"
gre: fix sparse warning
...
Pull user namespace changes from Eric Biederman:
"This is a mostly modest set of changes to enable basic user namespace
support. This allows the code to compile with user namespaces
enabled and removes the assumption there is only the initial user
namespace. Everything is converted except for the most complex of the
filesystems: autofs4, 9p, afs, ceph, cifs, coda, fuse, gfs2, ncpfs,
nfs, ocfs2 and xfs as those patches need a bit more review.
The strategy is to push kuid_t and kgid_t values as far down into
subsystems and filesystems as reasonable. Leaving the make_kuid and
from_kuid operations to happen at the edge of userspace, as the values
come off the disk, and as the values come in from the network.
Letting the type-incompatible compile errors (present when user
namespaces are enabled) guide me to find the issues.
The most tricky areas have been the places where we had an implicit
union of uid and gid values and were storing them in an unsigned int.
Those places were converted into explicit unions. I made certain to
handle those places with simple trivial patches.
Out of that work I discovered we have generic interfaces for storing
quota by projid. I had never heard of the project identifiers before.
Adding full user namespace support for project identifiers accounts
for most of the code size growth in my git tree.
Ultimately there will be work to relax privilege checks from
"capable(FOO)" to "ns_capable(user_ns, FOO)" where it is safe, allowing
root in a user namespace to do those things that today we only forbid to
non-root users because it will confuse suid root applications.
While I was pushing kuid_t and kgid_t changes deep into the audit code
I made a few other cleanups. I capitalized on the fact we process
netlink messages in the context of the message sender. I removed
usage of NETLINK_CRED, and started directly using current->tty.
Some of these patches have also made it into maintainer trees, with no
problems from identical code from different trees showing up in
linux-next.
After reading through all of this code I feel like I might be able to
win a game of kernel trivial pursuit."
Fix up some fairly trivial conflicts in netfilter uid/git logging code.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (107 commits)
userns: Convert the ufs filesystem to use kuid/kgid where appropriate
userns: Convert the udf filesystem to use kuid/kgid where appropriate
userns: Convert ubifs to use kuid/kgid
userns: Convert squashfs to use kuid/kgid where appropriate
userns: Convert reiserfs to use kuid and kgid where appropriate
userns: Convert jfs to use kuid/kgid where appropriate
userns: Convert jffs2 to use kuid and kgid where appropriate
userns: Convert hpfs to use kuid and kgid where appropriate
userns: Convert btrfs to use kuid/kgid where appropriate
userns: Convert bfs to use kuid/kgid where appropriate
userns: Convert affs to use kuid/kgid wherwe appropriate
userns: On alpha modify linux_to_osf_stat to use convert from kuids and kgids
userns: On ia64 deal with current_uid and current_gid being kuid and kgid
userns: On ppc convert current_uid from a kuid before printing.
userns: Convert s390 getting uid and gid system calls to use kuid and kgid
userns: Convert s390 hypfs to use kuid and kgid where appropriate
userns: Convert binder ipc to use kuids
userns: Teach security_path_chown to take kuids and kgids
userns: Add user namespace support to IMA
userns: Convert EVM to deal with kuids and kgids in it's hmac computation
...
Pull cgroup hierarchy update from Tejun Heo:
"Currently, different cgroup subsystems handle nested cgroups
completely differently. There's no consistency among subsystems and
the behaviors often are outright broken.
People at least seem to agree that the broken hierarchy behaviors need
to be weeded out if any progress is gonna be made on this front and
that the fallouts from deprecating the broken behaviors should be
acceptable especially given that the current behaviors don't make much
sense when nested.
This patch makes cgroup emit warning messages if cgroups for
subsystems with broken hierarchy behavior are nested to prepare for
fixing them in the future. This was put in a separate branch because
more related changes were expected (didn't make it this round) and the
memory cgroup wanted to pull in this and make changes on top."
* 'for-3.7-hierarchy' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
Pull cgroup updates from Tejun Heo:
- xattr support added. The implementation is shared with tmpfs. The
usage is restricted and intended to be used to manage per-cgroup
metadata by system software. tmpfs changes are routed through this
branch with Hugh's permission.
- cgroup subsystem ID handling simplified.
* 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: Define CGROUP_SUBSYS_COUNT according the configuration
cgroup: Assign subsystem IDs during compile time
cgroup: Do not depend on a given order when populating the subsys array
cgroup: Wrap subsystem selection macro
cgroup: Remove CGROUP_BUILTIN_SUBSYS_COUNT
cgroup: net_prio: Do not define task_netpioidx() when not selected
cgroup: net_cls: Do not define task_cls_classid() when not selected
cgroup: net_cls: Move sock_update_classid() declaration to cls_cgroup.h
cgroup: trivial fixes for Documentation/cgroups/cgroups.txt
xattr: mark variable as uninitialized to make both gcc and smatch happy
fs: add missing documentation to simple_xattr functions
cgroup: add documentation on extended attributes usage
cgroup: rename subsys_bits to subsys_mask
cgroup: add xattr support
cgroup: revise how we re-populate root directory
xattr: extract simple_xattr code from tmpfs
Pull workqueue changes from Tejun Heo:
"This is workqueue updates for v3.7-rc1. A lot of activities this
round including considerable API and behavior cleanups.
* delayed_work combines a timer and a work item. The handling of the
timer part has always been a bit clunky leading to confusing
cancelation API with weird corner-case behaviors. delayed_work is
updated to use new IRQ safe timer and cancelation now works as
expected.
* Another deficiency of delayed_work was lack of the counterpart of
mod_timer() which led to cancel+queue combinations or open-coded
timer+work usages. mod_delayed_work[_on]() are added.
These two delayed_work changes make delayed_work provide interface
and behave like timer which is executed with process context.
* A work item could be executed concurrently on multiple CPUs, which
is rather unintuitive and made flush_work() behavior confusing and
half-broken under certain circumstances. This problem doesn't
exist for non-reentrant workqueues. While non-reentrancy check
isn't free, the overhead is incurred only when a work item bounces
across different CPUs and even in simulated pathological scenario
the overhead isn't too high.
All workqueues are made non-reentrant. This removes the
distinction between flush_[delayed_]work() and
flush_[delayed_]_work_sync(). The former is now as strong as the
latter and the specified work item is guaranteed to have finished
execution of any previous queueing on return.
* In addition to the various bug fixes, Lai redid and simplified CPU
hotplug handling significantly.
* Joonsoo introduced system_highpri_wq and used it during CPU
hotplug.
There are two merge commits - one to pull in IRQ safe timer from
tip/timers/core and the other to pull in CPU hotplug fixes from
wq/for-3.6-fixes as Lai's hotplug restructuring depended on them."
Fixed a number of trivial conflicts, but the more interesting conflicts
were silent ones where the deprecated interfaces had been used by new
code in the merge window, and thus didn't cause any real data conflicts.
Tejun pointed out a few of them, I fixed a couple more.
* 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits)
workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending()
workqueue: use cwq_set_max_active() helper for workqueue_set_max_active()
workqueue: introduce cwq_set_max_active() helper for thaw_workqueues()
workqueue: remove @delayed from cwq_dec_nr_in_flight()
workqueue: fix possible stall on try_to_grab_pending() of a delayed work item
workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback()
workqueue: use __cpuinit instead of __devinit for cpu callbacks
workqueue: rename manager_mutex to assoc_mutex
workqueue: WORKER_REBIND is no longer necessary for idle rebinding
workqueue: WORKER_REBIND is no longer necessary for busy rebinding
workqueue: reimplement idle worker rebinding
workqueue: deprecate __cancel_delayed_work()
workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()
workqueue: use mod_delayed_work() instead of __cancel + queue
workqueue: use irqsafe timer for delayed_work
workqueue: clean up delayed_work initializers and add missing one
workqueue: make deferrable delayed_work initializer names consistent
workqueue: cosmetic whitespace updates for macro definitions
workqueue: deprecate system_nrt[_freezable]_wq
workqueue: deprecate flush[_delayed]_work_sync()
...
Later changes need to be able to refer to neighbour attributes
when doing fdb_add.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds a new include file (include/net/gro_cells.h), to bring GRO
(Generic Receive Offload) capability to tunnels, in a modular way.
Because tunnels receive path is lockless, and GRO adds a serialization
using a napi_struct, I chose to add an array of up to
DEFAULT_MAX_NUM_RSS_QUEUES cells, so that multi-queue devices won't be
slowed down because of the GRO layer.
skb_get_rx_queue() is used as selector.
In the future, we might add optional fanout capabilities, using rxhash
for example.
With help from Ben Hutchings, who reminded me of the
netif_get_num_default_rss_queues() function.
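A tunnel-driver usage sketch, assuming the interface exposed by the new
header (gro_cells_init()/gro_cells_receive(); the driver names are made up):
#include <net/gro_cells.h>

struct mytunnel_priv {
    struct gro_cells gro_cells;
    /* ... */
};

static int mytunnel_open(struct net_device *dev)
{
    struct mytunnel_priv *priv = netdev_priv(dev);

    /* allocates up to netif_get_num_default_rss_queues() napi cells */
    return gro_cells_init(&priv->gro_cells, dev);
}

static void mytunnel_rx(struct net_device *dev, struct sk_buff *skb)
{
    struct mytunnel_priv *priv = netdev_priv(dev);

    /* instead of netif_rx()/netif_receive_skb(); the cell is picked
     * from skb_get_rx_queue() as described above */
    gro_cells_receive(&priv->gro_cells, skb);
}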
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit ec47ea824774 ("skb: Add inline helper for getting the skb end offset
from head") introduced the helper function skb_end_offset();
we should make use of it.
Signed-off-by: Weiping Pan <wpan@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Here is the big driver core update for 3.7-rc1.
A number of firmware_class.c updates (as you saw a month or so ago), and
some hyper-v updates and some printk fixes as well. All patches that
are outside of the drivers/base area have been acked by the respective
maintainers, and have all been in the linux-next tree for a while.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Merge tag 'driver-core-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core merge from Greg Kroah-Hartman:
"Here is the big driver core update for 3.7-rc1.
A number of firmware_class.c updates (as you saw a month or so ago),
and some hyper-v updates and some printk fixes as well. All patches
that are outside of the drivers/base area have been acked by the
respective maintainers, and have all been in the linux-next tree for a
while.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"
* tag 'driver-core-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (95 commits)
memory: tegra{20,30}-mc: Fix reading incorrect register in mc_readl()
device.h: Add missing inline to #ifndef CONFIG_PRINTK dev_vprintk_emit
memory: emif: Add ifdef CONFIG_DEBUG_FS guard for emif_debugfs_[init|exit]
Documentation: Fixes some translation error in Documentation/zh_CN/gpio.txt
Documentation: Remove 3 byte redundant code at the head of the Documentation/zh_CN/arm/booting
Documentation: Chinese translation of Documentation/video4linux/omap3isp.txt
device and dynamic_debug: Use dev_vprintk_emit and dev_printk_emit
dev: Add dev_vprintk_emit and dev_printk_emit
netdev_printk/netif_printk: Remove a superfluous logging colon
netdev_printk/dynamic_netdev_dbg: Directly call printk_emit
dev_dbg/dynamic_debug: Update to use printk_emit, optimize stack
driver-core: Shut up dev_dbg_reatelimited() without DEBUG
tools/hv: Parse /etc/os-release
tools/hv: Check for read/write errors
tools/hv: Fix exit() error code
tools/hv: Fix file handle leak
Tools: hv: Implement the KVP verb - KVP_OP_GET_IP_INFO
Tools: hv: Rename the function kvp_get_ip_address()
Tools: hv: Implement the KVP verb - KVP_OP_SET_IP_INFO
Tools: hv: Add an example script to configure an interface
...
Conflicts:
drivers/net/team/team.c
drivers/net/usb/qmi_wwan.c
net/batman-adv/bat_iv_ogm.c
net/ipv4/fib_frontend.c
net/ipv4/route.c
net/l2tp/l2tp_netlink.c
The team, fib_frontend, route, and l2tp_netlink conflicts were simply
overlapping changes.
qmi_wwan and bat_iv_ogm were of the "use HEAD" variety.
With help from Antonio Quartulli.
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently use percpu order-0 pages in __netdev_alloc_frag()
to deliver fragments used by __netdev_alloc_skb().
Depending on the NIC driver and the arch being 32 or 64 bit, it allows a page
to be split into several fragments (between 1 and 8), assuming PAGE_SIZE=4096.
Switching to bigger pages (32768 bytes for the PAGE_SIZE=4096 case) allows:
- Better filling of space (the ending hole overhead is less an issue)
- Less calls to page allocator or accesses to page->_count
- Could allow struct skb_shared_info future changes without major
performance impact.
This patch implements a transparent fallback to smaller
pages in case of memory pressure.
It also uses a standard "struct page_frag" instead of a custom one.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
It seems sk_init() has no value today and even does strange things :
# grep . /proc/sys/net/core/?mem_*
/proc/sys/net/core/rmem_default:212992
/proc/sys/net/core/rmem_max:131071
/proc/sys/net/core/wmem_default:212992
/proc/sys/net/core/wmem_max:131071
We can remove it completely.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Shan Wei <davidshan@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
iterates through the opened files in the given descriptor table,
calling a supplied function; we stop once non-zero is returned.
The callback gets the struct file *, the descriptor number and the
const void * argument passed to the iterator. It is called with files->file_lock
held, so it is not allowed to block.
tty_io, netprio_cgroup and selinux flush_unauthorized_files()
converted to its use.
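A usage sketch of the new helper (the callback and caller below are
hypothetical; only the callback shape and the locking rule come from the
description above):
#include <linux/fdtable.h>
#include <linux/sched.h>
#include <net/sock.h>

/* callback: (opaque argument, struct file *, descriptor number);
 * returning non-zero stops the walk. Runs under files->file_lock,
 * so it must not block. */
static int set_sock_prio(const void *v, struct file *file, unsigned int fd)
{
    int err;
    struct socket *sock = sock_from_file(file, &err);

    if (sock && sock->sk)
        sock->sk->sk_priority = *(const u32 *)v;
    return 0;
}

static void set_all_sock_prios(struct task_struct *task, u32 prio)
{
    task_lock(task);
    iterate_fd(task->files, 0, set_sock_prio, &prio);
    task_unlock(task);
}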
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
It's possible to use RAW sockets to get a crash in
tcp_set_keepalive() / sk_reset_timer().
The fix is to make sure the socket is a SOCK_STREAM one.
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SKF_AD_ALU_XOR_X was added a while ago, but as an 'ancillary'
operation that is invoked through a negative offset in K within BPF
load operations. Since BPF_MOD has recently been added, BPF_XOR should
also be part of the common ALU operations. Removing SKF_AD_ALU_XOR_X
might not be an option since this is exposed to user space.
Signed-off-by: Daniel Borkmann <daniel.borkmann@tik.ee.ethz.ch>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently use a per socket order-0 page cache for tcp_sendmsg()
operations.
This page is used to build fragments for skbs.
It's done to increase the probability of coalescing small write() calls into
single segments in skbs still in the write queue (not yet sent).
But it wastes a lot of memory for applications handling many mostly
idle sockets, since each socket holds one page in sk->sk_sndmsg_page.
It's also quite inefficient to build TSO 64KB packets, because we need
about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit the
page allocator more than wanted.
This patch adds a per task frag allocator and uses bigger pages,
if available. An automatic fallback is done in case of memory pressure.
(up to 32768 bytes per frag, that's order-3 pages on x86)
This increases TCP stream performance by 20% on loopback device,
but also benefits on other network devices, since 8x less frags are
mapped on transmit and unmapped on tx completion. Alexander Duyck
mentioned a probable performance win on systems with IOMMU enabled.
It's possible some SG-enabled hardware can't cope with bigger fragments,
but their ndo_start_xmit() should already handle this, splitting a
fragment into sub-fragments, since some arches have PAGE_SIZE=65536.
Successfully tested on various ethernet devices.
(ixgbe, igb, bnx2x, tg3, mellanox mlx4)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A change in a series of VLAN-related changes appears to have
inadvertently disabled the use of the scatter gather feature of
network cards for transmission of non-IP ethernet protocols like ATA
over Ethernet (AoE). Below is a reference to the commit that
introduces a "harmonize_features" function that turns off scatter
gather when the NIC does not support hardware checksumming for the
ethernet protocol of an sk buff.
commit f01a5236bd
Author: Jesse Gross <jesse@nicira.com>
Date: Sun Jan 9 06:23:31 2011 +0000
net offloading: Generalize netif_get_vlan_features().
The can_checksum_protocol function is not equipped to consider a
protocol that does not require checksumming. Calling it for a
protocol that requires no checksum is inappropriate.
The patch below has harmonize_features call can_checksum_protocol when
the protocol needs a checksum, so that the network layer is not forced
to perform unnecessary skb linearization on the transmission of AoE
packets. Unnecessary linearization results in decreased performance
and increased memory pressure, as reported here:
http://www.spinics.net/lists/linux-mm/msg15184.html
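A sketch of the adjusted logic (close to, but not guaranteed to match, the
actual patch; illegal_highdma() is the existing helper in net/core/dev.c):
static netdev_features_t harmonize_features(struct sk_buff *skb,
        __be16 protocol, netdev_features_t features)
{
    /* only ask about checksum offload when a checksum is actually
     * needed, so protocols like AoE keep scatter-gather */
    if (skb->ip_summed != CHECKSUM_NONE &&
        !can_checksum_protocol(features, protocol)) {
        features &= ~NETIF_F_ALL_CSUM;
        features &= ~NETIF_F_SG;
    } else if (illegal_highdma(skb->dev, skb)) {
        features &= ~NETIF_F_SG;
    }

    return features;
}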
The problem has probably not been widely experienced yet, because
only recently has the kernel.org-distributed aoe driver acquired the
ability to use payloads of over a page in size, with the patchset
recently included in the mm tree:
https://lkml.org/lkml/2012/8/28/140
The coraid.com-distributed aoe driver already could use payloads of
greater than a page in size, but its users generally do not use the
newest kernels.
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It should be the skb which is not cloned
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In netpoll tx path, we miss the chance of calling ->ndo_select_queue(),
thus could cause problems when bonding is involved.
This patch makes dev_pick_tx() extern (and renames it to netdev_pick_tx())
to let netpoll call it in netpoll_send_skb_on_dev().
Reported-by: Sylvain Munaut <s.munaut@whatever-company.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
Tested-by: Sylvain Munaut <s.munaut@whatever-company.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The internal functions for adding/deleting addresses don't change
their argument.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of forcing device drivers to provide empty ethtool_ops or tweak
net/core/ethtool.c again, we could provide a generic ethtool_ops.
This occurred to me when I wanted to add GSO support to GRE tunnels.
ethtool -k support should be generic for all drivers.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Maciej Żenczykowski <maze@google.com>
Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When moving a nic from net namespace A to net namespace B,
in dev_change_net_namespace() we call __dev_get_by_name() to
decide if netns B already has a device with the same name.
If netns B already has a device with that name, we call
dev_get_valid_name() to try to get a valid name for this nic in
netns B, but net_device->nd_net still points to netns A at that point.
This patch fixes it.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dev_queue_xmit_nit() should be called right before ndo_start_xmit()
calls or we might give wrong packet contents to taps users :
Packet checksum can be changed, or packet can be linearized or
segmented, and segments partially sent for the later case.
Also a memory allocation can fail and packet never really hit the
driver entry point.
Reported-by: Jamie Gloudon <jamie.gloudon@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If orphaning frags fails, we don't free the skb
on receive, which leaks the skb memory.
The return value was also wrong: netif_receive_skb
is supposed to return NET_RX_DROP, not ENOMEM.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Always store audit loginuids in type kuid_t.
Print loginuids by converting them into uids in the appropriate user
namespace, and then printing the resulting uid.
Modify audit_get_loginuid to return a kuid_t.
Modify audit_set_loginuid to take a kuid_t.
Modify /proc/<pid>/loginuid on read to convert the loginuid into the
user namespace of the opener of the file.
Modify /proc/<pid>/loginuid on write to convert the loginuid
from the user namespace of the opener of the file.
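A minimal sketch of the read-side conversion (the helper name is made up;
from_kuid() and file->f_cred are the pieces the description relies on):
#include <linux/cred.h>
#include <linux/fs.h>
#include <linux/uidgid.h>

static uid_t loginuid_for_display(const struct file *file, kuid_t loginuid)
{
    /* convert into the user namespace of whoever opened the proc file */
    return from_kuid(file->f_cred->user_ns, loginuid);
}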
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Cc: Paul Moore <paul@paul-moore.com> ?
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Convert direct calls of vprintk_emit and printk_emit to the
dev_ equivalents.
Make create_syslog_header static.
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: David S. Miller <davem@davemloft.net>
Tested-by: Jim Cromie <jim.cromie@gmail.com>
Acked-by: Jason Baron <jbaron@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
netdev_printk originally called dev_printk with %pV.
This style emitted the complete dev_printk header with
a colon followed by the netdev_name prefix followed
by a colon.
Now that netdev_printk does not call dev_printk, the
extra colon is superfluous. Remove it.
Example:
old: sky2 0000:02:00.0: eth0: Link is up at 100 Mbps, full duplex, flow control both
new: sky2 0000:02:00.0 eth0: Link is up at 100 Mbps, full duplex, flow control both
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: David S. Miller <davem@davemloft.net>
Tested-by: Jim Cromie <jim.cromie@gmail.com>
Acked-by: Jason Baron <jbaron@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
A lot of stack is used in recursive printks with %pV.
Using multiple levels of %pV (a logging function with %pV
that calls another logging function with %pV) can consume
more stack than necessary.
Avoid excessive stack use by not calling dev_printk from
netdev_printk and dynamic_netdev_dbg. Duplicate the logic
and form of dev_printk instead.
Make __netdev_printk static.
Remove EXPORT_SYMBOL(__netdev_printk)
Whitespace and brace style neatening.
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: David S. Miller <davem@davemloft.net>
Tested-by: Jim Cromie <jim.cromie@gmail.com>
Acked-by: Jason Baron <jbaron@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Conflicts:
net/netfilter/nfnetlink_log.c
net/netfilter/xt_LOG.c
Rather easy conflict resolution, the 'net' tree had bug fixes to make
sure we checked if a socket is a time-wait one or not and elide the
logging code if so.
Whereas on the 'net-next' side we are calculating the UID and GID from
the creds using different interfaces due to the user namespace changes
from Eric Biederman.
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, cgroup hierarchy support is a mess. cpu related subsystems
behave correctly - configuration, accounting and control on a parent
properly cover its children. blkio and freezer completely ignore
hierarchy and treat all cgroups as if they're directly under the root
cgroup. Others show yet different behaviors.
These differing interpretations of cgroup hierarchy make using cgroup
confusing and make it impossible to co-mount controllers into the same
hierarchy and obtain sane behavior.
Eventually, we want full hierarchy support from all subsystems and
probably a unified hierarchy. Users using separate hierarchies and
expecting completely different behaviors depending on the mounted
subsystem is detrimental to making any progress on this front.
This patch adds cgroup_subsys.broken_hierarchy and sets it to %true
for controllers which are lacking in hierarchy support. The goal of
this patch is two-fold.
* Move users away from using hierarchy on currently non-hierarchical
subsystems, so that implementing proper hierarchy support on those
doesn't surprise them.
* Keep track of which controllers are broken how and nudge the
subsystems to implement proper hierarchy support.
For now, start with a single warning message. We can whine louder
later on.
v2: Fixed a typo spotted by Michal. Warning message updated.
v3: Updated memcg part so that it doesn't generate warning in the
cases where .use_hierarchy=false doesn't make the behavior
different from root.use_hierarchy=true. Fixed a typo spotted by
Glauber.
v4: Check ->broken_hierarchy after cgroup creation is complete so that
->create() can affect the result per Michal. Dropped unnecessary
memcg root handling per Michal.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Turner <pjt@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
WARNING: With this change it is impossible to load externally built
controllers anymore.
In case where CONFIG_NETPRIO_CGROUP=m and CONFIG_NET_CLS_CGROUP=m is
set, corresponding subsys_id should also be a constant. Up to now,
net_prio_subsys_id and net_cls_subsys_id would be of the type int and
the value would be assigned during runtime.
By switching the macro definition IS_SUBSYS_ENABLED from IS_BUILTIN
to IS_ENABLED, all *_subsys_id will have constant value. That means we
need to remove all the code which assumes a value can be assigned to
net_prio_subsys_id and net_cls_subsys_id.
A close look is necessary at the RCU part, which was introduced by the
following patch:
commit f845172531
Author: Herbert Xu <herbert@gondor.apana.org.au> Mon May 24 09:12:34 2010
Committer: David S. Miller <davem@davemloft.net> Mon May 24 09:12:34 2010
cls_cgroup: Store classid in struct sock
This code was added to init_cgroup_cls()
/* We can't use rcu_assign_pointer because this is an int. */
smp_wmb();
net_cls_subsys_id = net_cls_subsys.subsys_id;
respectively to exit_cgroup_cls()
net_cls_subsys_id = -1;
synchronize_rcu();
and in module version of task_cls_classid()
rcu_read_lock();
id = rcu_dereference(net_cls_subsys_id);
if (id >= 0)
classid = container_of(task_subsys_state(p, id),
struct cgroup_cls_state, css)->classid;
rcu_read_unlock();
Without an explicit explanation why the RCU part is needed. (The
rcu_dereference was fixed by changing it to rcu_dereference_index_check()
in a later commit, but that is a minor detail.)
So here is my pondering on why it was introduced and why it is safe to
remove it now. Note that this code was copied over to net_prio, so the
reasoning holds for that subsystem too.
The idea behind the RCU use for net_cls_subsys_id is to make sure we
get a valid pointer back from task_subsys_state(). task_subsys_state()
is just blindly accessing the subsys array and returning the
pointer. Obviously, passing in -1 as id into task_subsys_state()
returns an invalid value (out of lower bound).
So this code makes sure that the id is assigned only after the module is
loaded and the subsystem registered.
Before unregistering the module, all old readers must have left the
critical section. This is done by assigning -1 to the id and issuing a
synchronize_rcu(). Any new readers won't call task_subsys_state()
anymore and therefore it is safe to unregister the subsystem.
The new code relies on the same trick, but it looks at the subsys
pointer returned by task_subsys_state() (remember the id is constant
and therefore we always have a valid index into the subsys
array).
No precautions need to be taken during module loading.
Eventually, all CPUs will get a valid pointer back from
task_subsys_state() because rebind_subsystems(), which is called after
the module init() function, will assign the newly loaded module's
subsystem pointer to subsys[net_cls_subsys_id].
When the subsystem is about to be removed, rebind_subsystems() will be
called before the module exit() function. In this case,
rebind_subsystems() will assign subsys[net_cls_subsys_id] a NULL pointer
and then call synchronize_rcu(). All old readers will have left the
critical section by then. Any new reader won't access the subsystem
anymore. At this point we are safe to unregister the subsystem. No
extra synchronize_rcu() call is needed.
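A minimal sketch of the module-side reader after the change (illustrative,
assuming the helpers and structures quoted above; the NULL check on the
returned css pointer stands in for "looking at the subsys pointer"):

	rcu_read_lock();
	css = task_subsys_state(p, net_cls_subsys_id);	/* id is now a constant */
	if (css)
		classid = container_of(css,
				       struct cgroup_cls_state, css)->classid;
	rcu_read_unlock();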
Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: netdev@vger.kernel.org
Cc: cgroups@vger.kernel.org
task_netprioidx() should not be defined in case the configuration is
CONFIG_NETPRIO_CGROUP=n. The reason is that in a following patch the
net_prio_subsys_id will only be defined if CONFIG_NETPRIO_CGROUP!=n.
When net_prio is not built at all any callee should only get an empty
task_netprioidx() without any references to net_prio_subsys_id.
Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: netdev@vger.kernel.org
Cc: cgroups@vger.kernel.org
task_cls_classid() should not be defined in case the configuration is
CONFIG_NET_CLS_CGROUP=n. The reason is that in a following patch the
net_cls_subsys_id will only be defined if CONFIG_NET_CLS_CGROUP!=n.
When net_cls is not built at all a callee should only get an empty
task_cls_classid() without any references to net_cls_subsys_id.
Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: netdev@vger.kernel.org
Cc: cgroups@vger.kernel.org
If the vlan option is specified in pktgen and the requested packet size
is less than 46 bytes, pktgen should not crash the kernel, however
illogical the request.
BUG: unable to handle kernel paging request at ffff88021fb82000
Process kpktgend_0 (pid: 1184, threadinfo ffff880215f1a000, task ffff880218544530)
Call Trace:
[<ffffffffa0637cd2>] ? pktgen_finalize_skb+0x222/0x300 [pktgen]
[<ffffffff814f0084>] ? build_skb+0x34/0x1c0
[<ffffffffa0639b11>] pktgen_thread_worker+0x5d1/0x1790 [pktgen]
[<ffffffffa03ffb10>] ? igb_xmit_frame_ring+0xa30/0xa30 [igb]
[<ffffffff8107ba20>] ? wake_up_bit+0x40/0x40
[<ffffffff8107ba20>] ? wake_up_bit+0x40/0x40
[<ffffffffa0639540>] ? spin+0x240/0x240 [pktgen]
[<ffffffff8107b4e3>] kthread+0x93/0xa0
[<ffffffff81615de4>] kernel_thread_helper+0x4/0x10
[<ffffffff8107b450>] ? flush_kthread_worker+0x80/0x80
[<ffffffff81615de0>] ? gs_change+0x13/0x13
The root cause of pktgen not being able to handle this case is the
comparison of signed (datalen) and unsigned (sizeof) data, which
eventually passes a huge number to skb_put().
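A small user-space demonstration of the signed/unsigned trap described
above (the struct and sizes are made up; only the integer promotion
matters):

	#include <stdio.h>

	struct fake_hdr { unsigned char bytes[40]; };	/* stand-in header */

	int main(void)
	{
		int datalen = 30 - (int)sizeof(struct fake_hdr);	/* -10 */

		if (datalen < sizeof(struct fake_hdr))	/* promoted to size_t */
			printf("looks small\n");
		else
			printf("compares as huge: %zu\n", (size_t)datalen);
		return 0;
	}

The second branch is taken: the negative datalen is converted to an
unsigned type for the comparison, which is how a huge length can end up
being passed to skb_put().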
Signed-off-by: Nishank Trivedi <nistrive@cisco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace the current (inefficient) for-loop with memcpy, to copy priomap.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The update_netdev_tables() function appears to be unnecessary, since the
write_update_netdev_table() function will adjust the priomaps as and when
required anyway. So drop the usage of update_netdev_tables() entirely.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix net/core/sock.c build error when CONFIG_INET is not enabled:
net/built-in.o: In function `sock_edemux':
(.text+0xd396): undefined reference to `inet_twsk_put'
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new ALU opcode to compute a modulus.
Commit ffe06c17af used an ancillary to implement XOR_X,
but here we reserve one of the available ALU opcodes to implement both
MOD_X and MOD_K
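As an illustration, a classic BPF filter using the new opcode might look
like this (a sketch; it assumes headers that already define BPF_MOD, and
it accepts only even-length packets):

	#include <linux/filter.h>

	struct sock_filter even_len_only[] = {
		BPF_STMT(BPF_LD  | BPF_W   | BPF_LEN, 0),	/* A = skb->len */
		BPF_STMT(BPF_ALU | BPF_MOD | BPF_K,   2),	/* A = A % 2    */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,   0, 0, 1),
		BPF_STMT(BPF_RET | BPF_K, 0xffffffff),		/* even: accept */
		BPF_STMT(BPF_RET | BPF_K, 0),			/* odd: drop    */
	};

	struct sock_fprog prog = {
		.len	= sizeof(even_len_only) / sizeof(even_len_only[0]),
		.filter	= even_len_only,
	};
	/* attach with setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER, &prog, sizeof(prog)) */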
Signed-off-by: Eric Dumazet <edumazet@google.com>
Suggested-by: George Bakos <gbakos@alpinista.org>
Cc: Jay Schulist <jschlst@samba.org>
Cc: Jiri Pirko <jpirko@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is a frequent mistake to confuse the netlink port identifier with a
process identifier. Try to reduce this confusion by renaming fields
that hold port identifiers portid instead of pid.
I have carefully avoided changing the structures exported to
userspace to avoid changing the userspace API.
I have successfully built an allyesconfig kernel with this change.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch defines netlink_kernel_create as a wrapper function of
__netlink_kernel_create to hide the struct module *me parameter
(which seems to be THIS_MODULE in all existing netlink subsystems).
Suggested by David S. Miller.
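Based on the description above, the wrapper is roughly (a sketch, not the
verbatim patch):

	static inline struct sock *
	netlink_kernel_create(struct net *net, int unit,
			      struct netlink_kernel_cfg *cfg)
	{
		return __netlink_kernel_create(net, unit, THIS_MODULE, cfg);
	}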
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace netlink_set_nonroot by one new field `flags' in
struct netlink_kernel_cfg that is passed to netlink_kernel_create.
This patch also renames NL_NONROOT_* to NL_CFG_F_NONROOT_* since
now the flags field in nl_table is generic (so we can add more
flags if needed in the future).
Also adjust all callers in the net-next tree to use these flags
instead of netlink_set_nonroot.
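A usage sketch after the conversion (the NL_CFG_F_NONROOT_RECV spelling of
the flag is an assumption based on the NL_CFG_F_NONROOT_* naming above):

	struct netlink_kernel_cfg cfg = {
		.groups	= 1,
		.flags	= NL_CFG_F_NONROOT_RECV,	/* non-root may listen */
	};

	sk = netlink_kernel_create(net, NETLINK_USERSOCK, &cfg);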
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the current rxhash calculation function, while the
sorting of the ports/addrs is coherent (you get the
same rxhash for packets sharing the same 4-tuple, in
both directions), ports and addrs are sorted
independently. This implies that packets from connections
between the same addresses but with crossed ports hash to
the same rxhash.
For example, traffic between A=S:l and B=L:s is hashed
(in both directions) from {L, S, {s, l}}. The same
rxhash is obtained for packets between C=S:s and D=L:l.
This patch ensures that you either swap both addrs and ports,
or you swap none. Traffic between A and B, and traffic
between C and D, get their rxhash from different sources
({L, S, {l, s}} for A<->B, and {L, S, {s, l}} for C<->D)
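A sketch of the ordering rule described above (illustrative, not the exact
kernel code; the variable names and hashrnd are placeholders): addresses
and ports are swapped together or not at all, so crossed-port flows feed
different inputs into the hash:

	if (addr2 < addr1 || (addr1 == addr2 && port2 < port1)) {
		swap(addr1, addr2);
		swap(port1, port2);
	}
	hash = jhash_3words(addr1, addr2, (port1 << 16) | port2, hashrnd);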
The patch is co-written with Eric Dumazet <edumazet@google.com>
Signed-off-by: Chema Gonzalez <chema@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Passing uids and gids on NETLINK_CB from a process in one user
namespace to a process in another user namespace can result in the
wrong uid or gid being presented to userspace. Avoid that problem by
passing kuids and kgids instead.
- define struct scm_creds for use in scm_cookie and netlink_skb_parms
that holds uid and gid information in kuid_t and kgid_t (see the sketch
after this list).
- Modify scm_set_cred to fill out scm_creds by hand instead of using
cred_to_ucred to fill out struct ucred. This conversion ensures
userspace does not get incorrect uid or gid values to look at.
- Modify scm_recv to convert from struct scm_creds to struct ucred
before copying credential values to userspace.
- Modify __scm_send to populate struct scm_creds in the scm_cookie,
instead of just copying struct ucred from userspace.
- Modify netlink_sendmsg to copy scm_creds instead of struct ucred
into the NETLINK_CB.
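A minimal sketch of the structure described in the first item (the pid
member is an assumption here; uid/gid use the kernel-internal types):

	struct scm_creds {
		u32	pid;
		kuid_t	uid;
		kgid_t	gid;
	};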
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, when the NIC duplex state is DUPLEX_UNKNOWN it is exported as
full through sysfs; this patch adds support for DUPLEX_UNKNOWN. It is
handled the same way as in ethtool.
Signed-off-by: Nikolay Aleksandrov <naleksan@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sock_edemux() can handle either a regular socket or a timewait socket
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch builds on top of the previous patch to add the support
for TFO listeners. This includes -
1. allocating, properly initializing, and managing the per listener
fastopen_queue structure when TFO is enabled
2. changes to the inet_csk_accept code to support TFO. E.g., the
request_sock can no longer be freed upon accept(), not until 3WHS
finishes
3. allowing a TCP_SYN_RECV socket to properly poll() and sendmsg()
if it's a TFO socket
4. properly closing a TFO listener, and a TFO socket before 3WHS
finishes
5. supporting the TCP_FASTOPEN socket option (see the usage sketch below)
6. modifying tcp_check_req() to check a TFO socket as well
as a request_sock
7. supporting TCP's TFO cookie option
8. adding a new SYN-ACK retransmit handler to use the timer directly
off the TFO socket rather than the listener socket. Note that TFO
server side will not retransmit anything other than SYN-ACK until
the 3WHS is completed.
The patch also contains an important function
"reqsk_fastopen_remove()" to manage the somewhat complex relation
between a listener, its request_sock, and the corresponding child
socket. See the comment above the function for the detail.
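A usage sketch for item 5, on the server side (listen_fd is assumed to be
an already bound TCP listening socket; older userspace headers may not
define TCP_FASTOPEN, in which case its value would have to be supplied by
hand):

	#include <netinet/tcp.h>
	#include <sys/socket.h>

	int qlen = 5;	/* max number of pending TFO requests */

	if (setsockopt(listen_fd, IPPROTO_TCP, TCP_FASTOPEN,
		       &qlen, sizeof(qlen)) < 0)
		perror("setsockopt(TCP_FASTOPEN)");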
Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_needs_linearize() does not check highmem DMA as it does not call
illegal_highdma() anymore, so there is no need to mention highmem DMA here.
(Indeed, the NETIF_F_SG flag, which is checked in skb_needs_linearize(),
can be cleared when illegal_highdma() returns true, and we are assured
that illegal_highdma() is invoked prior to skb_needs_linearize(), as
skb_needs_linearize() is a static function called only once.
But NETIF_F_SG can be cleared not only there on this same invocation path;
it can also be cleared when can_checksum_protocol() returns false.)
see commit 02932ce9e2,
Convert skb_need_linearize() to use precomputed features.
Signed-off-by: Rami Rosen <rosenr@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge the 'net' tree to get the recent set of netfilter bug fixes in
order to assist with some merge hassles Pablo is going to have to deal
with for upcoming changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Let's fill the IP header ident field with a meaningful value;
it might help some setups.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When moving a net device from one net namespace to another
net namespace, dev_change_net_namespace fires the NETDEV_DOWN
event, so the original net namespace's dst entries which
belonged to this net device will be put on the dst_garbage
list.
Then dev_change_net_namespace will set this net device's
net to the new net namespace.
If we later unregister this net device's driver, this will trigger
the NETDEV_UNREGISTER_FINAL event, dst_ifdown will be called,
which takes this net device's dst entries from the dst_garbage list
and points these entries' dev to the new net namespace's lo device.
That is not what we want; these dst entries actually need to hold
the original net namespace's lo device. This incorrect device
reference triggers messages like the one below:
unregister_netdevice: waiting for lo to become free. Usage count = 1
So we should call the NETDEV_UNREGISTER_FINAL event in
dev_change_net_namespace too. In order to make sure the dst entries are
already on the dst_garbage list, we need an rcu_barrier before we
call the NETDEV_UNREGISTER_FINAL event.
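A minimal sketch of that ordering inside dev_change_net_namespace()
(illustrative; the surrounding code is omitted):

	/* make sure the rt_free() call_rcu() callbacks have run, so the
	 * dst entries are already on the dst_garbage list ... */
	rcu_barrier();
	/* ... then let dst_ifdown() rebind them while dev->nd_net still
	 * points to the original netns */
	call_netdevice_notifiers(NETDEV_UNREGISTER_FINAL, dev);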
With help from Eric Dumazet.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add inet_proto_csum_replace16 for incrementally updating IPv6 pseudo header
checksums for IPv6 NAT.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: David S. Miller <davem@davemloft.net>
Against -net.
In the patch "netpoll: re-enable irq in poll_napi()", I tried to
fix the following warning:
[100718.051041] ------------[ cut here ]------------
[100718.051048] WARNING: at kernel/softirq.c:159 local_bh_enable_ip+0x7d/0xb0()
(Not tainted)
[100718.051049] Hardware name: ProLiant BL460c G7
...
[100718.051068] Call Trace:
[100718.051073] [<ffffffff8106b747>] ? warn_slowpath_common+0x87/0xc0
[100718.051075] [<ffffffff8106b79a>] ? warn_slowpath_null+0x1a/0x20
[100718.051077] [<ffffffff810747ed>] ? local_bh_enable_ip+0x7d/0xb0
[100718.051080] [<ffffffff8150041b>] ? _spin_unlock_bh+0x1b/0x20
[100718.051085] [<ffffffffa00ee974>] ? be_process_mcc+0x74/0x230 [be2net]
[100718.051088] [<ffffffffa00ea68c>] ? be_poll_tx_mcc+0x16c/0x290 [be2net]
[100718.051090] [<ffffffff8144fe76>] ? netpoll_poll_dev+0xd6/0x490
[100718.051095] [<ffffffffa01d24a5>] ? bond_poll_controller+0x75/0x80 [bonding]
[100718.051097] [<ffffffff8144fde5>] ? netpoll_poll_dev+0x45/0x490
[100718.051100] [<ffffffff81161b19>] ? ksize+0x19/0x80
[100718.051102] [<ffffffff81450437>] ? netpoll_send_skb_on_dev+0x157/0x240
by re-enabling IRQs before calling ->poll(), but it seems more
problems were introduced after that patch:
http://ozlabs.org/~akpm/stuff/IMG_20120824_122054.jpg
http://marc.info/?l=linux-netdev&m=134563282530588&w=2
So it is safe to fix the be2net driver code directly.
This patch reverts the offending commit and fixes be_poll() by
avoiding disabling BH there; this is okay because be_poll()
can be called either by poll_napi(), which already disables
IRQs, or by net_rx_action(), which already disables BH.
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Reported-by: Sylvain Munaut <s.munaut@whatever-company.com>
Cc: Sylvain Munaut <s.munaut@whatever-company.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Miller <davem@davemloft.net>
Cc: Sathya Perla <sathya.perla@emulex.com>
Cc: Subbu Seetharaman <subbu.seetharaman@emulex.com>
Cc: Ajit Khaparde <ajit.khaparde@emulex.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
Tested-by: Sylvain Munaut <s.munaut@whatever-company.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is an initial merge in of Eric Biederman's work to start adding
user namespace support to the networking.
Signed-off-by: David S. Miller <davem@davemloft.net>
The operstate of a device is initially IF_OPER_UNKNOWN and is updated
asynchronously by linkwatch after each change of carrier state
reported by the driver. The default carrier state of a net device is
on, and this will never be changed on drivers that do not support
carrier detection, thus the operstate remains IF_OPER_UNKNOWN.
For devices that do support carrier detection, the driver must set the
carrier state to off initially, then poll the hardware state when the
device is opened. However, we must not activate linkwatch for a
unregistered device, and commit b473001 ('net: Do not fire linkwatch
events until the device is registered.') ensured that we don't. But
this means that the operstate for many devices that support carrier
detection remains IF_OPER_UNKNOWN when it should be IF_OPER_DOWN.
The same issue exists with the dormant state.
The proper initialisation sequence, avoiding a race with opening of
the device, is:
	rtnl_lock();
	rc = register_netdevice(dev);
	if (rc)
		goto out_unlock;
	netif_carrier_off(dev); /* or netif_dormant_on(dev) */
	rtnl_unlock();
but it seems silly that this should have to be repeated in so many
drivers. Further, the operstate seen immediately after opening the
device may still be IF_OPER_UNKNOWN due to the asynchronous nature of
linkwatch.
Commit 22604c8 ('net: Fix for initial link state in 2.6.28') attempted
to fix this by setting the operstate synchronously, but it was
reverted as it could lead to deadlock.
This initialises the operstate synchronously at registration time
only.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The network classifier cgroup initializes each cgroup instance's classid value to
0. However, the sock_update_classid function only updates classids in sockets
if the task's cgroup classid is not zero, and if it differs from the current
classid. The latter check is to prevent cache line dirtying, but the former is
detrimental, as it prevents resetting a classid for a cgroup to 0. While this
is not a common action, it has administrative usefulness (if the admin wants to
disable classification of a certain group temporarily, for instance).
Easy fix, just remove the zero check. Tested successfully by myself
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: "David S. Miller" <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Biederman pointed out that not holding RTNL while calling
call_netdevice_notifiers() was racy.
This patch is a direct transcription of his feedback
against commit 0115e8e30d (net: remove delay at device dismantle)
Thanks, Eric!
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I noticed an extra one-second delay in device dismantle, tracked down to
a call to dst_dev_event() while some call_rcu() callbacks were still in RCU queues.
These call_rcu() callbacks were posted by rt_free(struct rtable *rt) calls.
We then wait a little (one second) in netdev_wait_allrefs() before
kicking NETDEV_UNREGISTER again.
As the call_rcu() callbacks are by then completed, dst_dev_event() can do
the needed device swap on busy dst entries.
To solve this problem, add a new NETDEV_UNREGISTER_FINAL, called
after an rcu_barrier(), but outside of the RTNL lock.
Use NETDEV_UNREGISTER_FINAL with care!
Change the dst_dev_event() handler to react to NETDEV_UNREGISTER_FINAL.
Also remove NETDEV_UNREGISTER_BATCH, as it's not used anymore after
IP cache removal.
With help from Gao feng
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that mod_delayed_work() is safe to call from IRQ handlers,
__cancel_delayed_work() followed by queue_delayed_work() can be
replaced with mod_delayed_work().
Most conversions are straight-forward except for the following.
* net/core/link_watch.c: linkwatch_schedule_work() was doing quite an
elaborate dance around its delayed_work. Collapse it such that
linkwatch_work is queued for immediate execution if LW_URGENT and the
existing timer is kept otherwise.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
Initializers for deferrable delayed_work are confused.
* __DEFERRED_WORK_INITIALIZER()
* DECLARE_DEFERRED_WORK()
* INIT_DELAYED_WORK_DEFERRABLE()
Rename them to
* __DEFERRABLE_WORK_INITIALIZER()
* DECLARE_DEFERRABLE_WORK()
* INIT_DEFERRABLE_WORK()
This patch doesn't cause any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fix kernel-doc warning:
Warning(net/core/dev.c:5745): No description found for parameter 'dev'
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
If a packet is emitted on one socket in one group of fanout sockets,
it is transmitted again. It is thus read again on one of the sockets
of the fanout group. This results in a loop for software which
generates packets when receiving one.
This retransmission is not the intended behavior: a fanout group
must behave like a single socket. The packet should not be
transmitted on a socket if it originates from a socket belonging
to the same fanout group.
This patch fixes the issue by changing the transmission check to
take the fanout group info into account.
Reported-by: Aleksandr Kotov <a1k@mail.ru>
Signed-off-by: Eric Leblond <eric@regit.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
A race exists where creating cgroups and also updating the priomap
may result in losing a priomap update. This is because priomap
writers are not protected by rtnl_lock.
Move priority writer into rtnl_lock()/rtnl_unlock().
CC: Neil Horman <nhorman@tuxdriver.com>
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A socket fd passed in a SCM_RIGHTS datagram was not getting
updated with the new task's cgroup prioidx. This leaves IO on
the socket tagged with the old task's priority.
To fix this, add a check in the scm recvmsg path to update the
sock cgroup prioidx with the new task's value.
Thanks to Al Viro for catching this.
CC: Neil Horman <nhorman@tuxdriver.com>
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add lock to prevent a race with a file closing and also remove
useless and ugly sscanf code. The extra code was never needed
and the case it supposedly protected against is in fact handled
correctly by sock_from_file as pointed out by Al Viro.
CC: Neil Horman <nhorman@tuxdriver.com>
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Klaus Heinrich Kiwi <klausk@br.ibm.com>
Cc: Eric Paris <eparis@redhat.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
With the existence of kuid_t and kgid_t we can take this further
and remove the usage of struct cred altogether, ensuring we
don't get cache line misses from reference counts. For now
however, start simply and do a straightforward conversion
I can be certain is correct.
In cred_to_ucred use from_kuid_munged and from_kgid_munged
as these values are going directly to userspace and we want to use
the userspace safe values not -1 when reporting a value that does not
map. The earlier conversion that used from_kuid was buggy in that
respect. Oops.
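A minimal sketch of the munged conversion mentioned above (assuming
current_user_ns() is the namespace the values are reported in and the
usual euid/egid members of struct cred):

	ucred->uid = from_kuid_munged(current_user_ns(), cred->euid);
	ucred->gid = from_kgid_munged(current_user_ns(), cred->egid);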
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
napi->poll() needs IRQ enabled, so we have to re-enable IRQ before
calling it.
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Without this patch, I can't get netconsole logs remotely over
vlan. The reason is probably that we don't handle vlan tags in either
the netpoll tx or rx path.
I am not sure if I use these vlan functions correctly, but at
least this patch works.
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes several problems in the call path of
netpoll_send_skb_on_dev():
1. Disable IRQ's before calling netpoll_send_skb_on_dev().
2. All the callees of netpoll_send_skb_on_dev() should use
rcu_dereference_bh() to dereference ->npinfo.
3. Rename arp_reply() to netpoll_arp_reply(), the former is too generic.
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In __netpoll_rx(), it dereferences ->npinfo without rcu_dereference_bh(),
this patch fixes it by using the 'npinfo' passed from netpoll_rx()
where it is already dereferenced with rcu_dereference_bh().
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Like the previous patch, slave_disable_netpoll() and __netpoll_cleanup()
may be called with read_lock() held too, so we should make them
non-blocking, by moving the cleanup and kfree() to call_rcu_bh() callbacks.
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
slave_enable_netpoll() and __netpoll_setup() may be called
with read_lock() held, so should use GFP_ATOMIC to allocate
memory. Eric suggested passing gfp flags to __netpoll_setup().
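A sketch of what the resulting interface might look like (the exact
parameter order is an assumption, as is the slave_dev caller):

	int __netpoll_setup(struct netpoll *np, struct net_device *ndev,
			    gfp_t gfp);

	/* callers holding read_lock() pass GFP_ATOMIC, e.g. */
	err = __netpoll_setup(np, slave_dev, GFP_ATOMIC);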
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I don't see any benefits to using netdev_bonding_change() rather than
calling call_netdevice_notifiers() directly.
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I believe net/core/dev.c is a better place for netif_notify_peers(),
because other net event notify functions also stay in this file.
And rename it to netdev_notify_peers().
Cc: David S. Miller <davem@davemloft.net>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert delayed_work users doing cancel_delayed_work() followed by
queue_delayed_work() to mod_delayed_work().
Most conversions are straight-forward. Ones worth mentioning are,
* drivers/edac/edac_mc.c: edac_mc_workq_setup() converted to always
use mod_delayed_work() and cancel loop in
edac_mc_reset_delay_period() is dropped.
* drivers/platform/x86/thinkpad_acpi.c: No need to remember whether
watchdog is active or not. @fan_watchdog_active and related code
dropped.
* drivers/power/charger-manager.c: Seemingly a lot of
delayed_work_pending() abuse going on here.
[delayed_]work_pending() are unsynchronized and racy when used like
this. I converted one instance in fullbatt_handler(). Please
conver the rest so that it invokes workqueue APIs for the intended
target state rather than trying to game work item pending state
transitions. e.g. if timer should be modified - call
mod_delayed_work(), canceled - call cancel_delayed_work[_sync]().
* drivers/thermal/thermal_sys.c: thermal_zone_device_set_polling()
simplified. Note that round_jiffies() calls in this function are
meaningless. round_jiffies() work on absolute jiffies not delta
delay used by delayed_work.
v2: Tomi pointed out that __cancel_delayed_work() users can't be
safely converted to mod_delayed_work(). They could be calling it
from irq context and if that happens while delayed_work_timer_fn()
is running, it could deadlock. __cancel_delayed_work() users are
dropped.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Acked-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Acked-by: Anton Vorontsov <cbouatmailru@gmail.com>
Acked-by: David Howells <dhowells@redhat.com>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Doug Thompson <dougthompson@xmission.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Roland Dreier <roland@kernel.org>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Len Brown <len.brown@intel.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Johannes Berg <johannes@sipsolutions.net>
Strictly speaking this is only _really_ required for checkpoint-restore to
make loopback device always have the same index.
This change appears to be safe wrt the "ifindex should be unique per-system"
concept, as all the ifindex usage is either already made per net namespace
or is explicitly limited to init_net only.
There are two cool side effects of this. The first one is that ifindices of
devices in a container are always small, regardless of how many containers
we've started (and re-started) so far. The second one is that we can speed
up the loopback ifindex access as shown in the next patch.
v2: Place ifindex right after dev_base_seq : avoid two holes and use the
same cache line, dirtied in list_netdevice()/unlist_netdevice()
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, RTM_NEWLINK results in -EOPNOTSUPP if ifinfomsg->ifi_index
is not zero. I propose allowing ifindices to be requested on link creation.
This is required by checkpoint-restore to correctly restore a net namespace
(i.e. a container).
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Various /proc/net files sometimes report crazy timer values, expressed
in clock_t units.
This happens when an expired timer delta (expires - jiffies) is passed
to jiffies_to_clock_t().
This function has an overflow in:
return div_u64((u64)x * TICK_NSEC, NSEC_PER_SEC / USER_HZ);
commit cbbc719fcc (time: Change jiffies_to_clock_t() argument type
to unsigned long) only worked around the problem.
As we can't output negative values in /proc/net/tcp without breaking
various tools, I suggest adding a jiffies_delta_to_clock_t() wrapper
that clamps a negative delta to a 0 value.
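A minimal sketch of such a wrapper (assuming the existing
jiffies_to_clock_t() helper):

	static inline clock_t jiffies_delta_to_clock_t(long delta)
	{
		return jiffies_to_clock_t(max(0L, delta));
	}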
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Maciej Żenczykowski <maze@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: hank <pyu@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Do not leak memory by updating the pointer with a potentially NULL realloc return value.
Found by Linux Driver Verification project (linuxtesting.org).
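The general pattern being fixed looks like this (a sketch with made-up
variable names, not the code touched by this patch):

	new_buf = krealloc(buf, new_len, GFP_KERNEL);
	if (!new_buf) {
		kfree(buf);	/* old buffer is still valid and must be freed */
		return -ENOMEM;
	}
	buf = new_buf;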
Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
While investigating network performance problems, I found this little
gem:
$ nm -v vmlinux | grep -1 dst_default_metrics
ffffffff82736540 b busy.46605
ffffffff82736560 B dst_default_metrics
ffffffff82736598 b dst_busy_list
Apparently, declaring a const array without an initializer puts it in the
(writable) bss section, in the middle of possibly often-dirtied cache
lines.
Since we really want dst_default_metrics to be const to avoid any possible
false sharing and to catch any buggy writes, I force a null initializer.
ffffffff818a4c20 R dst_default_metrics
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cache the device gso_max_segs in sock::sk_gso_max_segs and use it to
limit the size of TSO skbs. This avoids the need to fall back to
software GSO for local TCP senders.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A peer (or local user) may cause TCP to use a nominal MSS of as little
as 88 (actual MSS of 76 with timestamps). Given that we have a
sufficiently prodigious local sender and the peer ACKs quickly enough,
it is nevertheless possible to grow the window for such a connection
to the point that we will try to send just under 64K at once. This
results in a single skb that expands to 861 segments.
In some drivers with TSO support, such an skb will require hundreds of
DMA descriptors; a substantial fraction of a TX ring or even more than
a full ring. The TX queue selected for the skb may stall and trigger
the TX watchdog repeatedly (since the problem skb will be retried
after the TX reset). This particularly affects sfc, for which the
issue is designated as CVE-2012-3412.
Therefore:
1. Add the field net_device::gso_max_segs holding the device-specific
limit.
2. In netif_skb_features(), if the number of segments is too high then
mask out GSO features to force a fall back to software GSO (see the
sketch below).
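A minimal sketch of the check described in point 2 (illustrative;
NETIF_F_GSO_MASK clears all GSO feature bits):

	if (skb_shinfo(skb)->gso_segs > skb->dev->gso_max_segs)
		features &= ~NETIF_F_GSO_MASK;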
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge Andrew's second set of patches:
- MM
- a few random fixes
- a couple of RTC leftovers
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (120 commits)
rtc/rtc-88pm80x: remove unneed devm_kfree
rtc/rtc-88pm80x: assign ret only when rtc_register_driver fails
mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables
tmpfs: distribute interleave better across nodes
mm: remove redundant initialization
mm: warn if pg_data_t isn't initialized with zero
mips: zero out pg_data_t when it's allocated
memcg: gix memory accounting scalability in shrink_page_list
mm/sparse: remove index_init_lock
mm/sparse: more checks on mem_section number
mm/sparse: optimize sparse_index_alloc
memcg: add mem_cgroup_from_css() helper
memcg: further prevent OOM with too many dirty pages
memcg: prevent OOM with too many dirty pages
mm: mmu_notifier: fix freed page still mapped in secondary MMU
mm: memcg: only check anon swapin page charges for swap cache
mm: memcg: only check swap cache pages for repeated charging
mm: memcg: split swapin charge function into private and public part
mm: memcg: remove needless !mm fixup to init_mm when charging
mm: memcg: remove unneeded shmem charge type
...
Merge tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random
Pull random subsystem patches from Ted Ts'o:
"This patch series contains a major revamp of how we collect entropy
from interrupts for /dev/random and /dev/urandom.
The goal is to address weaknesses discussed in the paper "Mining
your Ps and Qs: Detection of Widespread Weak Keys in Network Devices",
by Nadia Heninger, Zakir Durumeric, Eric Wustrow, J. Alex Halderman,
which will be published in the Proceedings of the 21st Usenix Security
Symposium, August 2012. (See https://factorable.net for more
information and an extended version of the paper.)"
Fix up trivial conflicts due to nearby changes in
drivers/{mfd/ab3100-core.c, usb/gadget/omap_udc.c}
* tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random: (33 commits)
random: mix in architectural randomness in extract_buf()
dmi: Feed DMI table to /dev/random driver
random: Add comment to random_initialize()
random: final removal of IRQF_SAMPLE_RANDOM
um: remove IRQF_SAMPLE_RANDOM which is now a no-op
sparc/ldc: remove IRQF_SAMPLE_RANDOM which is now a no-op
[ARM] pxa: remove IRQF_SAMPLE_RANDOM which is now a no-op
board-palmz71: remove IRQF_SAMPLE_RANDOM which is now a no-op
isp1301_omap: remove IRQF_SAMPLE_RANDOM which is now a no-op
pxa25x_udc: remove IRQF_SAMPLE_RANDOM which is now a no-op
omap_udc: remove IRQF_SAMPLE_RANDOM which is now a no-op
goku_udc: remove IRQF_SAMPLE_RANDOM which was commented out
uartlite: remove IRQF_SAMPLE_RANDOM which is now a no-op
drivers: hv: remove IRQF_SAMPLE_RANDOM which is now a no-op
xen-blkfront: remove IRQF_SAMPLE_RANDOM which is now a no-op
n2_crypto: remove IRQF_SAMPLE_RANDOM which is now a no-op
pda_power: remove IRQF_SAMPLE_RANDOM which is now a no-op
i2c-pmcmsp: remove IRQF_SAMPLE_RANDOM which is now a no-op
input/serio/hp_sdc.c: remove IRQF_SAMPLE_RANDOM which is now a no-op
mfd: remove IRQF_SAMPLE_RANDOM which is now a no-op
...
This patch series is based on top of "Swap-over-NBD without deadlocking
v15" as it depends on the same reservation of PF_MEMALLOC reserves logic.
When a user or administrator requires swap for their application, they
create a swap partition and file, format it with mkswap and activate it
with swapon. In diskless systems this is not an option, so if swap is
required then swapping over the network is considered. The two likely
scenarios are when blade servers are used as part of a cluster where the
form factor or maintenance costs do not allow the use of disks and thin
clients.
The Linux Terminal Server Project recommends the use of the Network Block
Device (NBD) for swap but this is not always an option. There is no
guarantee that the network attached storage (NAS) device is running Linux
or supports NBD. However, it is likely that it supports NFS so there are
users that want support for swapping over NFS despite any performance
concern. Some distributions currently carry patches that support swapping
over NFS but it would be preferable to support it in the mainline kernel.
Patch 1 avoids a stream-specific deadlock that potentially affects TCP.
Patch 2 is a small modification to SELinux to avoid using PFMEMALLOC
reserves.
Patch 3 adds three helpers for filesystems to handle swap cache pages.
For example, page_file_mapping() returns page->mapping for
file-backed pages and the address_space of the underlying
swap file for swap cache pages.
Patch 4 adds two address_space_operations to allow a filesystem
to pin all metadata relevant to a swapfile in memory. Upon
successful activation, the swapfile is marked SWP_FILE and
the address space operation ->direct_IO is used for writing
and ->readpage for reading in swap pages.
Patch 5 notes that patch 3 is bolting
filesystem-specific-swapfile-support onto the side and that
the default handlers have different information to what
is available to the filesystem. This patch refactors the
code so that there are generic handlers for each of the new
address_space operations.
Patch 6 adds an API to allow a vector of kernel addresses to be
translated to struct pages and pinned for IO.
Patch 7 adds support for using highmem pages for swap by kmapping
the pages before calling the direct_IO handler.
Patch 8 updates NFS to use the helpers from patch 3 where necessary.
Patch 9 avoids setting PG_private on PG_swapcache pages within NFS.
Patch 10 implements the new swapfile-related address_space operations
for NFS and teaches the direct IO handler how to manage
kernel addresses.
Patch 11 prevents page allocator recursions in NFS by using GFP_NOIO
where appropriate.
Patch 12 fixes a NULL pointer dereference that occurs when using
swap-over-NFS.
With the patches applied, it is possible to mount a swapfile that is on an
NFS filesystem. Swap performance is not great with a swap stress test
taking roughly twice as long to complete than if the swap device was
backed by NBD.
This patch: netvm: prevent a stream-specific deadlock
It could happen that all !SOCK_MEMALLOC sockets have buffered so much data
that we're over the global rmem limit. This will prevent SOCK_MEMALLOC
buffers from receiving data, which will prevent userspace from running,
which is needed to reduce the buffered data.
Fix this by exempting the SOCK_MEMALLOC sockets from the rmem limit. Once
this change is applied, it is important that sockets that set
SOCK_MEMALLOC do not clear the flag until the socket is being torn down.
If this happens, a warning is generated and the tokens reclaimed to avoid
accounting errors until the bug is fixed.
[davem@davemloft.net: Warning about clearing SOCK_MEMALLOC]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In order to make sure pfmemalloc packets receive all memory needed to
proceed, ensure processing of pfmemalloc SKBs happens under PF_MEMALLOC.
This is limited to a subset of protocols that are expected to be used for
writing to swap. Taps are not allowed to use PF_MEMALLOC as these are
expected to communicate with userspace processes which could be paged out.
[a.p.zijlstra@chello.nl: Ideas taken from various patches]
[jslaby@suse.cz: Lock imbalance fix]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change the skb allocation API to indicate RX usage and use this to fall
back to the PFMEMALLOC reserve when needed. SKBs allocated from the
reserve are tagged in skb->pfmemalloc. If an SKB is allocated from the
reserve and the socket is later found to be unrelated to page reclaim, the
packet is dropped so that the memory remains available for page reclaim.
Network protocols are expected to recover from this packet loss.
[a.p.zijlstra@chello.nl: Ideas taken from various patches]
[davem@davemloft.net: Use static branches, coding style corrections]
[sebastian@breakpoint.cc: Avoid unnecessary cast, fix !CONFIG_NET build]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Allow specific sockets to be tagged SOCK_MEMALLOC and use __GFP_MEMALLOC
for their allocations. These sockets will be able to go below watermarks
and allocate from the emergency reserve. Such sockets are to be used to
service the VM (iow. to swap over). They must be handled kernel side,
exposing such a socket to user-space is a bug.
There is a risk that the reserves be depleted so for now, the
administrator is responsible for increasing min_free_kbytes as necessary
to prevent deadlock for their workloads.
[a.p.zijlstra@chello.nl: Original patches]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
commit 404e0a8b6a (net: ipv4: fix RCU races on dst refcounts) tried
to solve a race but added a problem at device/fib dismantle time :
We really want to call dst_free() as soon as possible, even if sockets
still have dst in their cache.
dst_release() calls in free_fib_info_rcu() are not welcome.
The root of the problem is that, now that we also cache output routes (in
nh_rth_output), we must use call_rcu() instead of call_rcu_bh() in
rt_free(), because output route lookups are done in process context.
Based on feedback and initial patch from David Miller (adding another
call_rcu_bh() call in fib, but it appears it was not the right fix)
I left the inet_sk_rx_dst_set() helper and added __rcu attributes
to nh_rth_output and nh_rth_input to better document what is going on in
this code.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit c6cffba4ff (ipv4: Fix input route performance regression.)
added various fatal races with dst refcounts.
Crashes happen on TCP workloads if routes are added/deleted at the same
time.
The dst_free() calls from free_fib_info_rcu() are clearly racy.
We need instead regular dst refcounting (dst_release()) and make
sure dst_release() is aware of RCU grace periods :
Add DST_RCU_FREE flag so that dst_release() respects an RCU grace period
before dst destruction for cached dst
Introduce a new inet_sk_rx_dst_set() helper, using atomic_inc_not_zero()
to make sure we don't increase a zero refcount (on a dst currently
waiting for an RCU grace period before destruction)
rt_cache_route() must take a reference on the new cached route, and
release it if it was not able to install it.
With this patch, my machines survive various benchmarks.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When userspace uses RTM_GETROUTE to dump the route table, with an already
expired route entry, we always get an 'expires' value (2147157)
calculated based on INT_MAX.
The reason for this problem is the following statement:
rt->dst.expires - jiffies < INT_MAX
gcc promotes the type of both sides of '<' to unsigned long, thus
a small negative value is considered greater than INT_MAX.
With the help of Eric Dumazet, do the out-of-bound checks in
rtnl_put_cacheinfo(), _after_ conversion to clock_t.
Signed-off-by: Li Wei <lw@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When device flags are set using rtnetlink, IFF_PROMISC and IFF_ALLMULTI
flags are handled specially. Function dev_change_flags sets IFF_PROMISC and
IFF_ALLMULTI bits in dev->gflags according to the passed value but
do_setlink passes a result of rtnl_dev_combine_flags which takes those bits
from dev->flags.
This can be easily triggered by doing:
tcpdump -i eth0 &
ip l s up eth0
ip sets IFF_UP flag in ifi_flags and ifi_change, which is combined with
IFF_PROMISC by rtnl_dev_combine_flags, causing __dev_change_flags to set
IFF_PROMISC in gflags.
Reported-by: Max Matveev <makc@redhat.com>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull trivial tree from Jiri Kosina:
"Trivial updates all over the place as usual."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (29 commits)
Fix typo in include/linux/clk.h .
pci: hotplug: Fix typo in pci
iommu: Fix typo in iommu
video: Fix typo in drivers/video
Documentation: Add newline at end-of-file to files lacking one
arm,unicore32: Remove obsolete "select MISC_DEVICES"
module.c: spelling s/postition/position/g
cpufreq: Fix typo in cpufreq driver
trivial: typo in comment in mksysmap
mach-omap2: Fix typo in debug message and comment
scsi: aha152x: Fix sparse warning and make printing pointer address more portable.
Change email address for Steve Glendinning
Btrfs: fix typo in convert_extent_bit
via: Remove bogus if check
netprio_cgroup.c: fix comment typo
backlight: fix memory leak on obscure error path
Documentation: asus-laptop.txt references an obsolete Kconfig item
Documentation: ManagementStyle: fixed typo
mm/vmscan: cleanup comment error in balance_pgdat
mm: cleanup on the comments of zone_reclaim_stat
...
Pull networking changes from David S Miller:
1) Remove the ipv4 routing cache. Now lookups go directly into the FIB
trie and use prebuilt routes cached there.
No more garbage collection, no more rDOS attacks on the routing
cache. Instead we now get predictable and consistent performance,
no matter what the pattern of traffic we service.
This has been almost 2 years in the making. Special thanks to
Julian Anastasov, Eric Dumazet, Steffen Klassert, and others who
have helped along the way.
I'm sure that with a change of this magnitude there will be some
kind of fallout, but such things ought to be simple to fix at this
point. Luckily I'm not European so I'll be around all of August to
fix things :-)
The major stages of this work here are each fronted by a forced
merge commit whose commit message contains a top-level description
of the motivations and implementation issues.
2) Pre-demux of established ipv4 TCP sockets, saves a route demux on
input.
3) TCP SYN/ACK performance tweaks from Eric Dumazet.
4) Add namespace support for netfilter L4 conntrack helpers, from Gao
Feng.
5) Add config mechanism for Energy Efficient Ethernet to ethtool, from
Yuval Mintz.
6) Remove quadratic behavior from /proc/net/unix, from Eric Dumazet.
7) Support for connection tracker helpers in userspace, from Pablo
Neira Ayuso.
8) Allow userspace driven TX load balancing functions in TEAM driver,
from Jiri Pirko.
9) Kill off NLMSG_PUT and RTA_PUT macros, more gross stuff with
embedded gotos.
10) TCP Small Queues, essentially minimize the amount of TCP data queued
up in the packet scheduler layer. Whereas the existing BQL (Byte
Queue Limits) limits the pkt_sched --> netdevice queuing levels,
this controls the TCP --> pkt_sched queueing levels.
From Eric Dumazet.
11) Reduce the number of get_page/put_page ops done on SKB fragments,
from Alexander Duyck.
12) Implement protection against blind resets in TCP (RFC 5961), from
Eric Dumazet.
13) Support the client side of TCP Fast Open, basically the ability to
send data in the SYN exchange, from Yuchung Cheng.
Basically, the sender queues up data with a sendmsg() call using
MSG_FASTOPEN, then they do the connect() which emits the queued up
fastopen data.
14) Avoid all the problems we get into in TCP when timers or PMTU events
hit a locked socket. The TCP Small Queues changes added a
tcp_release_cb() that allows us to queue work up to the
release_sock() caller, and that's what we use here too. From Eric
Dumazet.
15) Zero copy on TX support for TUN driver, from Michael S. Tsirkin.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1870 commits)
genetlink: define lockdep_genl_is_held() when CONFIG_LOCKDEP
r8169: revert "add byte queue limit support".
ipv4: Change rt->rt_iif encoding.
net: Make skb->skb_iif always track skb->dev
ipv4: Prepare for change of rt->rt_iif encoding.
ipv4: Remove all RTCF_DIRECTSRC handliing.
ipv4: Really ignore ICMP address requests/replies.
decnet: Don't set RTCF_DIRECTSRC.
net/ipv4/ip_vti.c: Fix __rcu warnings detected by sparse.
ipv4: Remove redundant assignment
rds: set correct msg_namelen
openvswitch: potential NULL deref in sample()
tcp: dont drop MTU reduction indications
bnx2x: Add new 57840 device IDs
tcp: avoid oops in tcp_metrics and reset tcpm_stamp
niu: Change niu_rbr_fill() to use unlikely() to check niu_rbr_add_page() return value
niu: Fix to check for dma mapping errors.
net: Fix references to out-of-scope variables in put_cmsg_compat()
net: ethernet: davinci_emac: add pm_runtime support
net: ethernet: davinci_emac: Remove unnecessary #include
...
Make it follow device decapsulation, from things such as VLAN and
bonding.
The stuff that actually cares about pre-demuxed device pointers is
handled by the "orig_dev" variable in __netif_receive_skb(). And
the only consumer of that is the po->origdev feature of AF_PACKET
sockets.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull the big VFS changes from Al Viro:
"This one is *big* and changes quite a few things around VFS. What's in there:
- the first of two really major architecture changes - death to open
intents.
The former is finally there; it was very long in making, but with
Miklos getting through really hard and messy final push in
fs/namei.c, we finally have it. Unlike his variant, this one
doesn't introduce struct opendata; what we have instead is
->atomic_open() taking preallocated struct file * and passing
everything via its fields.
Instead of returning struct file *, it returns -E... on error, 0
on success and 1 in "deal with it yourself" case (e.g. symlink
found on server, etc.).
See comments before fs/namei.c:atomic_open(). That made a lot of
goodies finally possible and quite a few are in that pile:
->lookup(), ->d_revalidate() and ->create() do not get struct
nameidata * anymore; ->lookup() and ->d_revalidate() get lookup
flags instead, ->create() gets "do we want it exclusive" flag.
With the introduction of new helper (kern_path_locked()) we are rid
of all struct nameidata instances outside of fs/namei.c; it's still
visible in namei.h, but not for long. Come the next cycle,
declaration will move either to fs/internal.h or to fs/namei.c
itself. [me, miklos, hch]
- The second major change: behaviour of final fput(). Now we have
__fput() done without any locks held by caller *and* not from deep
in call stack.
That obviously lifts a lot of constraints on the locking in there.
Moreover, it's legal now to call fput() from atomic contexts (which
has immediately simplified life for aio.c). We also don't need
anti-recursion logics in __scm_destroy() anymore.
There is a price, though - the damn thing has become partially
asynchronous. For fput() from normal process we are guaranteed
that pending __fput() will be done before the caller returns to
userland, exits or gets stopped for ptrace.
For kernel threads and atomic contexts it's done via
schedule_work(), so theoretically we might need a way to make sure
it's finished; so far only one such place had been found, but there
might be more.
There's flush_delayed_fput() (do all pending __fput()) and there's
__fput_sync() (fput() analog doing __fput() immediately). I hope
we won't need them often; see warnings in fs/file_table.c for
details. [me, based on task_work series from Oleg merged last
cycle]
- sync series from Jan
- large part of "death to sync_supers()" work from Artem; the only
bits missing here are exofs and ext4 ones. As far as I understand,
those are going via the exofs and ext4 trees resp.; once they are
in, we can put ->write_super() to the rest, along with the thread
calling it.
- preparatory bits from unionmount series (from dhowells).
- assorted cleanups and fixes all over the place, as usual.
This is not the last pile for this cycle; there's at least jlayton's
ESTALE work and fsfreeze series (the latter - in dire need of fixes,
so I'm not sure it'll make the cut this cycle). I'll probably throw
symlink/hardlink restrictions stuff from Kees into the next pile, too.
Plus there's a lot of misc patches I hadn't thrown into that one -
it's large enough as it is..."
* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (127 commits)
ext4: switch EXT4_IOC_RESIZE_FS to mnt_want_write_file()
btrfs: switch btrfs_ioctl_balance() to mnt_want_write_file()
switch dentry_open() to struct path, make it grab references itself
spufs: shift dget/mntget towards dentry_open()
zoran: don't bother with struct file * in zoran_map
ecryptfs: don't reinvent the wheels, please - use struct completion
don't expose I_NEW inodes via dentry->d_inode
tidy up namei.c a bit
unobfuscate follow_up() a bit
ext3: pass custom EOF to generic_file_llseek_size()
ext4: use core vfs llseek code for dir seeks
vfs: allow custom EOF in generic_file_llseek code
vfs: Avoid unnecessary WB_SYNC_NONE writeback during sys_sync and reorder sync passes
vfs: Remove unnecessary flushing of block devices
vfs: Make sys_sync writeout also block device inodes
vfs: Create function for iterating over block devices
vfs: Reorder operations during sys_sync
quota: Move quota syncing to ->sync_fs method
quota: Split dquot_quota_sync() to writeback and cache flushing part
vfs: Move noop_backing_dev_info check from sync into writeback
...
The ipv4 routing cache is non-deterministic, performance wise, and is
subject to reasonably easy to launch denial of service attacks.
The routing cache works great for well behaved traffic, and the world
was a much friendlier place when the tradeoffs that led to the routing
cache's design were considered.
What it boils down to is that the performance of the routing cache is
a product of the traffic patterns seen by a system rather than being a
product of the contents of the routing tables. The former is
controllable by external entities.
Even for "well behaved" legitimate traffic, high volume sites can see
hit rates in the routing cache of only ~10%.
The general flow of this patch series is that first the routing cache
is removed. We build a completely new rtable entry every lookup
request.
Next we make some simplifications due to the fact that removing the
routing cache causes several members of struct rtable to become no
longer necessary.
Then we need to make some amends such that we can legally cache
pre-constructed routes in the FIB nexthops. Firstly, we need to
invalidate routes which are hit with nexthop exceptions. Secondly we
have to change the semantics of rt->rt_gateway such that zero means
that the destination is on-link and non-zero otherwise.
Now that the preparations are ready, we start caching precomputed
routes in the FIB nexthops. Output and input routes need different
kinds of care when determining if we can legally do such caching or
not. The details are in the commit log messages for those changes.
The patch series then winds down with some more struct rtable
simplifications and other tidy ups that remove unnecessary overhead.
On a SPARC-T3 output route lookups are ~876 cycles. Input route
lookups are ~1169 cycles with rpfilter disabled, and about ~1468
cycles with rpfilter enabled.
These measurements were taken with the kbench_mod test module in the
net_test_tools GIT tree:
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net_test_tools.git
That GIT tree also includes a udpflood tester tool and stresses
route lookups on packet output.
For example, on the same SPARC-T3 system we can run:
time ./udpflood -l 10000000 10.2.2.11
with routing cache:
real 1m21.955s user 0m6.530s sys 1m15.390s
without routing cache:
real 1m31.678s user 0m6.520s sys 1m25.140s
Performance undoubtedly can easily be improved further.
For example fib_table_lookup() performs a lot of excessive
computations with all the masking and shifting, some of it
conditionalized to deal with edge cases.
Also, Eric's no-ref optimization for input route lookups can be
re-instated for the FIB nexthop caching code path. I would be really
pleased if someone would work on that.
In fact anyone suitably motivated can just fire up perf on the loading
of the net_test_tools benchmark kernel module. I spend much of
my time going:
bash# perf record insmod ./kbench_mod.ko dst=172.30.42.22 src=74.128.0.1 iif=2
bash# perf report
Thanks to helpful feedback from Joe Perches, Eric Dumazet, Ben
Hutchings, and others.
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of updating the sk_cgrp_prioidx struct field on every send,
this only updates the field when a task is moved via the cgroup
infrastructure.
This allows sockets that may be used by a kernel worker thread
to be managed. For example, in the iscsi case today a user can
put iscsid in a netprio cgroup and control traffic will be sent
with the correct sk_cgrp_prioidx value set, but as soon as data
is sent the kernel worker thread issues a send and sk_cgrp_prioidx
is updated with the kernel worker thread's value, which is the
default case.
It seems more correct to only update the field when the user
explicitly sets it via control group infrastructure. This allows
the users to manage sockets that may be used with other threads.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Export skb_copy_ubufs so that modules can orphan frags.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
zero copy packets are normally sent to the outside
network, but bridging, tun etc might loop them
back to host networking stack. If this happens
destructors will never be called, so orphan
the frags immediately on receive.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reduce code duplication a bit using the new helper.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 76ff5cc919
(rtnl: allow to specify number of rx and tx queues
on device creation) added a reference to the net_device
structure's 'num_rx_queues' member in
net/core/rtnetlink.c:rtnl_fill_ifinfo()
However, the definition for 'num_rx_queues' is surrounded
by an '#ifdef CONFIG_RPS' while the new reference to it is
not. This causes a compile error when CONFIG_RPS is not
defined.
Fix the compile error by surrounding the new reference to
'num_rx_queues' by an '#ifdef CONFIG_RPS'.
CC: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Mark A. Greer <mgreer@animalcreek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a big comment explaining how the field works, and use defines
instead of magic constants for the values assigned to it.
Suggested by Joe Perches.
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces IFLA_NUM_TX_QUEUES and IFLA_NUM_RX_QUEUES by
which userspace can set the number of rx and/or tx queues to be allocated
for a newly created netdevice.
This overrides ops->get_num_[tr]x_queues()
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Also cut out unused function parameters and possible err in return
value.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change eliminates an initialization-order hazard most
recently seen when netprio_cgroup is built into the kernel.
With thanks to Eric Dumazet for catching a bug.
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce ipv6_addr_hash() helper doing an XOR on all bits
of an IPv6 address, with an optimized x86_64 version.
Use it in flow dissector, as suggested by Andrew McGregor,
to reduce hash collision probabilities in fq_codel (and other
users of flow dissector)
Use it in ip6_tunnel.c and use more bit shuffling, as suggested
by David Laight, as the existing hash was ignoring most of them.
Use it in sunrpc and use more bit shuffling, using hash_32().
Use it in net/ipv6/addrconf.c, using hash_32() as well.
As a cleanup, use it in net/ipv4/tcp_metrics.c
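As a rough, kernel-style sketch of the helper described above (assuming
the usual four 32-bit word layout of struct in6_addr; the in-tree
version and its optimized x86_64 variant may differ):

static inline u32 ipv6_addr_hash_sketch(const struct in6_addr *a)
{
        /* fold all 128 bits into 32 bits with a plain XOR */
        return a->s6_addr32[0] ^ a->s6_addr32[1] ^
               a->s6_addr32[2] ^ a->s6_addr32[3];
}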
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrew McGregor <andrewmcgr@gmail.com>
Cc: Dave Taht <dave.taht@gmail.com>
Cc: Tom Herbert <therbert@google.com>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use correct allocation flags during copy of user space fragments
to the kernel. Also "improve" couple of for loops.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are some out-of-bounds accesses in the netprio cgroup.
Currently, before accessing the dev->priomap.priomap array, we only
check whether dev->priomap exists. Because we don't want additional
bounds checks in the fast path, we must make sure that dev->priomap
is either null or that the array size of dev->priomap.priomap is
max_prioidx + 1.
So in the write_priomap logic, we should call extend_netdev_table when
dev->priomap is null and dev->priomap.priomap_len < max_len,
and in the cgrp_create->update_netdev_tables logic, we should call
extend_netdev_table only when dev->priomap exists and
dev->priomap.priomap_len < max_len.
It is also not needed to call update_netdev_tables in write_priomap;
we only need to allocate the priomap of the net device that we change
through net_prio.ifpriomap.
This patch also adds a return value to update_netdev_tables and
extend_netdev_table, so that when allocation of a new priomap fails,
write_priomap stops accessing the priomap and returns -ENOMEM
back to userspace to tell the user what happened.
Changes from v3:
1. Add rtnl protection when reading max_prioidx in write_priomap.
2. Only call extend_netdev_table when map->priomap_len < max_len;
this will make sure the array size of dev->map->priomap is always
bigger than any prioidx.
3. Add a function write_update_netdev_table to make the code clearer.
Changes from v2:
1. Protect extend_netdev_table by RTNL.
2. When extend_netdev_table fails, call dev_put to reduce the device's refcount.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Eric Dumazet <edumazet@google.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before this patch sock_diag works for init_net only and dumps
information about sockets from all namespaces.
This patch expands sock_diag for all namespaces.
It creates a netlink kernel socket for each netns and filters
data during dumping.
v2: filter according to netns in all places
remove an unused variable.
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Pavel Emelyanov <xemul@parallels.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
Cc: linux-kernel@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Few drivers use GFP_DMA allocations, and netdev_alloc_frag()
doesn't allocate pages in DMA zone.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is meant to help improve performance by reducing the number of
locked operations required to allocate a frag on x86 and other platforms.
This is accomplished by using atomic_set operations on the page count
instead of calling get_page and put_page. It is based on work originally
provided by Eric Dumazet.
In addition it also helps to reduce memory overhead when using TCP. This
is done by recycling the page if the only holder of the frame is the
netdev_alloc_frag call itself. This can occur when skb heads are stolen by
either GRO or TCP and the driver providing the packets is using paged frags
to store all of the data for the packets.
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This introduces TSQ (TCP Small Queues).
TSQ's goal is to reduce the number of TCP packets in xmit queues (qdisc &
device queues), to reduce RTT and cwnd bias, part of the bufferbloat
problem.
sk->sk_wmem_alloc is not allowed to grow above a given limit,
allowing no more than ~128KB [1] per tcp socket in qdisc/dev layers at a
given time.
TSO packets are sized/capped to half the limit, so that we have two
TSO packets in flight, allowing better bandwidth use.
As a side effect, setting the limit to 40000 automatically reduces the
standard gso max limit (65536) to 40000/2: it can help to reduce
latencies of high-prio packets, having smaller TSO packets.
This means we divert sock_wfree() to a tcp_wfree() handler, to
queue/send following frames when skb_orphan() [2] is called for the
already queued skbs.
Results on my dev machines (tg3/ixgbe nics) are really impressive,
using standard pfifo_fast, and with or without TSO/GSO.
Without reduction of nominal bandwidth, we have reduction of buffering
per bulk sender :
< 1ms on Gbit (instead of 50ms with TSO)
< 8ms on 100Mbit (instead of 132 ms)
I no longer have 4 MBytes backlogged in qdisc by a single netperf
session, and both sides' socket autotuning no longer uses 4 MBytes.
As the skb destructor cannot restart xmit itself (as the qdisc lock might be
taken at this point), we delegate the work to a tasklet. We use one
tasklet per cpu for performance reasons.
If the tasklet finds a socket owned by the user, it sets the TSQ_OWNED flag.
This flag is tested in a new protocol method called from release_sock(),
to eventually send new segments.
[1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
[2] skb_orphan() is usually called at TX completion time,
but some drivers call it in their start_xmit() handler.
These drivers should at least use BQL, or else a single TCP
session can still fill the whole NIC TX ring, since TSQ will
have no effect.
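A minimal sketch of the per-socket check described above (the constant
and helper names are illustrative assumptions, not the actual kernel
fields):

#define TSQ_LIMIT_BYTES (128 * 1024)    /* ~128KB per tcp socket */

/* true if this socket already has enough bytes sitting in qdisc/device
 * queues; further transmits are deferred until tcp_wfree() runs for an
 * already queued skb */
static bool tsq_throttled_sketch(unsigned int wmem_alloc_bytes)
{
        return wmem_alloc_bytes >= TSQ_LIMIT_BYTES;
}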
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Dave Taht <dave.taht@bufferbloat.net>
Cc: Tom Herbert <therbert@google.com>
Cc: Matt Mathis <mattmathis@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
net/batman-adv/bridge_loop_avoidance.c
net/batman-adv/bridge_loop_avoidance.h
net/batman-adv/soft-interface.c
net/mac80211/mlme.c
With merge help from Antonio Quartulli (batman-adv) and
Stephen Rothwell (drivers/net/usb/qmi_wwan.c).
The net/mac80211/mlme.c conflict seemed easy enough, accounting for a
conversion to some new tracing macros.
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix incorrect start markers, wrapped summary lines, missing section
breaks, incorrect separators, and some name mismatches.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Defining a function with no parameters as 'T foo()' is the deprecated
K&R style, and is not strictly equivalent to defining it as 'T foo(void)'.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dev->priomap is allocated by extend_netdev_table() called from
update_netdev_tables().
And this is only called if write_priomap() is called.
But if write_priomap() is not called, it seems we can have out of bounds
accesses in cgrp_destroy(), read_priomap() & skb_update_prio()
With help from Gao Feng
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Acked-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We set max_prioidx to the index of the first zero bit of prioidx_map in
the function get_prioidx.
So when we delete a low-index netprio cgroup and then add a new
netprio cgroup, max_prioidx will be set back to the low index.
When we then set the high-index cgroup's net_prio.ifpriomap, the function
write_priomap will call update_netdev_tables to allocate memory of
size sizeof(struct netprio_map) + sizeof(u32) * (max_prioidx + 1),
so the size of the array that map->priomap points to is max_prioidx + 1,
which is lower than what we actually need.
Fix this by adding a check in get_prioidx: only update max_prioidx when
max_prioidx is lower than the new prioidx.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Most multi-queue networking drivers consider the number of online cpus when
configuring RSS queues.
This patch adds a wrapper around the number of cpus, setting an upper limit on
the number of cpus a driver should consider (by default) when allocating
resources for its queues.
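A hedged sketch of such a wrapper (the in-tree helper name and default
cap may differ):

#define DEFAULT_MAX_RSS_QUEUES 8        /* assumed default upper limit */

static unsigned int default_rss_queues_sketch(unsigned int online_cpus)
{
        /* one queue per online cpu, but never more than the cap */
        return online_cpus < DEFAULT_MAX_RSS_QUEUES ?
               online_cpus : DEFAULT_MAX_RSS_QUEUES;
}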
Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a dst_confirm() happens, mark the confirmation as pending in the
dst. Then on the next packet out, when we have the neigh in-hand, do
the update.
This removes the dependency in dst_confirm() of dst's having an
attached neigh.
While we're here, remove the explicit 'dst' NULL check, all except 2
or 3 call sites ensure it's not NULL. So just fix those cases up.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking update from David Miller:
1) Fix RX sequence number handling in mwifiex, from Stone Piao.
2) Netfilter ipset mis-compares device names, fix from Florian
Westphal.
3) Fix route leak in ipv6 IPVS, from Eric Dumazet.
4) NFC fixes. Several buffer overflows in the NCI layer from Dan
Rosenberg, and release sock OOPS'er fix from Eric Dumazet.
5) Fix WEP handling ath9k, we started using a bit the chip provides to
indicate undecrypted packets but that bit turns out to be unreliable
in certain configurations. Fix from Felix Fietkau.
6) Fix Kconfig dependency bug in wlcore, from Randy Dunlap.
7) New USB IDs for rtlwifi driver from Larry Finger.
8) Fix crashes in qmi_wwan usbnet driver when disconnecting, from Bjørn
Mork.
9) Gianfar driver programs coalescing settings properly in single queue
mode, but does not do so in multi-queue mode. Fix from Claudiu
Manoil.
10) Missing module.h include in davinci_cpdma.c, from Daniel Mack.
11) Need dummy handler for IPSET_CMD_NONE otherwise we crash in ipset if
we get this via nfnetlink, fix from Tomasz Bursztyka.
12) Missing RCU unlock in nfnetlink error path, also from Tomasz.
13) Fix divide by zero in igbvf when the user tries to set an RX
coalescing value of 0 usecs, from Mitch A Williams.
14) We can process SCTP sacks for the wrong transport, oops. Fix from
Neil Horman.
15) Remove hw IP payload checksumming from e1000e driver. This has zero
value in our stack, and turning it on creates a very unintuitive
restriction for users when using jumbo MTUs.
Specifically, when IP payload checksums are on you cannot use both
receive hashing offload and jumbo MTU. Fix from Bruce Allan.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (27 commits)
e1000e: remove use of IP payload checksum
sctp: be more restrictive in transport selection on bundled sacks
igbvf: fix divide by zero
netfilter: nfnetlink: fix missing rcu_read_unlock in nfnetlink_rcv_msg
netfilter: ipset: fix crash if IPSET_CMD_NONE command is sent
davinci_cpdma: include linux/module.h
gianfar: Fix RXICr/TXICr programming for multi-queue mode
net: Downgrade CAP_SYS_MODULE deprecated message from error to warning.
net: qmi_wwan: fix Oops while disconnecting
mwifiex: fix memory leak associated with IE manamgement
ath9k: fix panic caused by returning a descriptor we have queued for reuse
mac80211: correct behaviour on unrecognised action frames
ath9k: enable serialize_regmode for non-PCIE AR9287
rtlwifi: rtl8192cu: New USB IDs
NFC: Return from rawsock_release when sk is NULL
iwlwifi: fix activating inactive stations
wlcore: drop INET dependency
ath9k: fix dynamic WEP related regression
NFC: Prevent multiple buffer overflows in NCI
netfilter: update location of my trees
...
Pull block bits from Jens Axboe:
"As vacation is coming up, thought I'd better get rid of my pending
changes in my for-linus branch for this iteration. It contains:
- Two patches for mtip32xx. Killing a non-compliant sysfs interface
and moving it to debugfs, where it belongs.
- A few patches from Asias. Two legit bug fixes, and one killing an
interface that is no longer in use.
- A patch from Jan, making the annoying partition ioctl warning a bit
less annoying, by restricting it to !CAP_SYS_RAWIO only.
- Three bug fixes for drbd from Lars Ellenberg.
- A fix for an old regression for umem, it hasn't really worked since
the plugging scheme was changed in 3.0.
- A few fixes from Tejun.
- A splice fix from Eric Dumazet, fixing an issue with pipe
resizing."
* 'for-linus' of git://git.kernel.dk/linux-block:
scsi: Silence unnecessary warnings about ioctl to partition
block: Drop dead function blk_abort_queue()
block: Mitigate lock unbalance caused by lock switching
block: Avoid missed wakeup in request waitqueue
umem: fix up unplugging
splice: fix racy pipe->buffers uses
drbd: fix null pointer dereference with on-congestion policy when diskless
drbd: fix list corruption by failing but already aborted reads
drbd: fix access of unallocated pages and kernel panic
xen/blkfront: Add WARN to deal with misbehaving backends.
blkcg: drop local variable @q from blkg_destroy()
mtip32xx: Create debugfs entries for troubleshooting
mtip32xx: Remove 'registers' and 'flags' from sysfs
blkcg: fix blkg_alloc() failure path
block: blkcg_policy_cfq shouldn't be used if !CONFIG_CFQ_GROUP_IOSCHED
block: fix return value on cfq_init() failure
mtip32xx: Remove version.h header file inclusion
xen/blkback: Copy id field when doing BLKIF_DISCARD.
This patch adds the following structure:
struct netlink_kernel_cfg {
unsigned int groups;
void (*input)(struct sk_buff *skb);
struct mutex *cb_mutex;
};
That can be passed to netlink_kernel_create to set optional configurations
for netlink kernel sockets.
I've populated this structure by looking for NULL and zero parameters in the
existing code. The remaining parameters that always need to be set are still
left in the original interface.
That includes optional parameters for the netlink socket creation. This allows
easy extensibility of this interface in the future.
This patch also adapts all callers to use this new interface.
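A hypothetical caller after the conversion could look like this (the
callback name and group count are illustrative; the mandatory arguments
to netlink_kernel_create stay in the function signature):

static void my_input_handler(struct sk_buff *skb);     /* illustrative */

struct netlink_kernel_cfg cfg = {
        .groups = RTNLGRP_MAX,          /* illustrative value */
        .input  = my_input_handler,
        /* .cb_mutex left NULL: use the default callback mutex */
};
/* cfg is then passed as the final argument to netlink_kernel_create() */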
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
If rpfilter is off (or the SKB has an IPSEC path) and there are no
tclassid users, we don't have to do anything at all when
fib_validate_source() is invoked besides setting the itag to zero.
We monitor tclassid uses with a counter (modified only under RTNL and
marked __read_mostly) and we protect the fib_validate_source() real
work with a test against this counter and whether rpfilter is to be
done.
Having a way to know whether we need no tclassid processing or not
also opens the door for future optimized rpfilter algorithms that do
not perform full FIB lookups.
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/caif/caif_hsi.c
drivers/net/usb/qmi_wwan.c
The qmi_wwan merge was trivial.
The caif_hsi.c, on the other hand, was not. It's a conflict between
1c385f1fdf ("caif-hsi: Replace platform
device with ops structure.") in the net-next tree and commit
39abbaef19 ("caif-hsi: Postpone init of
HIS until open()") in the net tree.
I did my best with that one and will ask Sjur to check it out.
Signed-off-by: David S. Miller <davem@davemloft.net>
Make logging level consistent with other deprecation messages in net
subsystem.
Signed-off-by: Vinson Lee <vlee@twitter.com>
Cc: David Mackey <tdmackey@twitter.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dropwatch wrongly diagnoses all received UDP packets as drops.
This patch removes trace_kfree_skb() done in skb_free_datagram_locked().
Locations calling skb_free_datagram_locked() should do it on their own.
As a result, drops are accounted on the right function.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Removes all RTA_GET*() and RTA_PUT*() variations, as well as the
unused rtattr_strcmp(). Get rid of rtm_get_table() by moving
it to its only user, decnet.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Input packet processing for local sockets involves two major demuxes.
One for the route and one for the socket.
But we can optimize this down to one demux for certain kinds of local
sockets.
Currently we only do this for established TCP sockets, but it could
at least in theory be expanded to other kinds of connections.
If a TCP socket is established then its identity is fully specified.
This means that whatever input route was used during the three-way
handshake must work equally well for the rest of the connection since
the keys will not change.
Once we move to established state, we cache the receive packet's input
route to use later.
Like the existing cached route in sk->sk_dst_cache used for output
packets, we have to check for route invalidations using dst->obsolete
and dst->ops->check().
Early demux occurs outside of a socket locked section, so when a route
invalidation occurs we defer the fixup of sk->sk_rx_dst until we are
actually inside of established state packet processing and thus have
the socket locked.
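A hedged sketch of that validation step (simplified; the real code also
has to handle reference counting and the locked-socket context described
above):

struct dst_entry *dst = sk->sk_rx_dst;

/* reuse the cached input route only while it is still valid */
if (dst && dst->obsolete && dst->ops->check(dst, 0) == NULL) {
        dst_release(dst);
        sk->sk_rx_dst = NULL;   /* fall back to a full route lookup */
}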
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
net/ipv6/route.c
This deals with a merge conflict between the net-next addition of the
inetpeer network namespace ops, and Thomas Graf's bug fix in
2a0c451ade which makes sure we don't
register /proc/net/ipv6_route before it is actually safe to do so.
Signed-off-by: David S. Miller <davem@davemloft.net>
Orphaning skb in dev_hard_start_xmit() makes bonding behavior
unfriendly for applications sending big UDP bursts: once packets
pass the bonding device and come to the real device, they might hit a full
qdisc and be dropped. Without orphaning, the sender is automatically
throttled because sk->sk_wmem_alloc reaches sk->sk_sndbuf (assuming
sk_sndbuf is not too big).
We could try to defer the orphaning by adding another test in
dev_hard_start_xmit(), but all this seems of little gain,
now that BQL tends to make packets more likely to be parked
in Qdisc queues instead of NIC TX ring, in cases where performance
matters.
Reverts commits :
fc6055a5ba net: Introduce skb_orphan_try()
87fd308cfc net: skb_tx_hash() fix relative to skb_orphan_try()
and removes SKBTX_DRV_NEEDS_SK_REF flag
Reported-and-bisected-by: Jean-Michel Hautbois <jhautbois@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Oliver Hartkopp <socketcan@hartkopp.net>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bogdan Hamciuc diagnosed and fixed the following bug in netpoll_send_udp():
"skb->len += len;" instead of "skb_put(skb, len);"
Meaning that _if_ a network driver needs to call skb_realloc_headroom(),
only packet headers would be copied, leaving garbage in the payload.
However the skb_realloc_headroom() must be avoided as much as possible
since it requires memory and netpoll tries hard to work even if memory
is exhausted (using a pool of preallocated skbs)
It appears netpoll_send_udp() reserved 16 bytes for the ethernet header,
which happens to work for typical drivers but not all.
The right thing is to use LL_RESERVED_SPACE(dev)
(and also add dev->needed_tailroom of tailroom).
This patch combines both fixes.
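A hedged sketch of the allocation pattern being described (illustrative
only; the actual netpoll code draws skbs from its preallocated pool):

skb = alloc_skb(len + LL_RESERVED_SPACE(dev) + dev->needed_tailroom,
                GFP_ATOMIC);
if (skb) {
        /* reserve device-specific headroom instead of a fixed 16 bytes */
        skb_reserve(skb, LL_RESERVED_SPACE(dev));
        /* account the payload properly, rather than skb->len += len */
        skb_put(skb, len);
}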
Many thanks to Bogdan for raising this issue.
Reported-by: Bogdan Hamciuc <bogdan.hamciuc@freescale.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Bogdan Hamciuc <bogdan.hamciuc@freescale.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Neil Horman <nhorman@tuxdriver.com>
Reviewed-by: Neil Horman <nhorman@tuxdriver.com>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dave Jones reported a kernel BUG at mm/slub.c:3474! triggered
by splice_shrink_spd() called from vmsplice_to_pipe()
commit 35f3d14dbb (pipe: add support for shrinking and growing pipes)
added capability to adjust pipe->buffers.
The problem is that some paths don't hold the pipe mutex and assume
pipe->buffers doesn't change for their duration.
Fix this by adding nr_pages_max field in struct splice_pipe_desc, and
use it in place of pipe->buffers where appropriate.
splice_shrink_spd() loses its struct pipe_inode_info argument.
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Tom Herbert <therbert@google.com>
Cc: stable <stable@vger.kernel.org> # 2.6.35
Tested-by: Dave Jones <davej@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Conflicts:
MAINTAINERS
drivers/net/wireless/iwlwifi/pcie/trans.c
The iwlwifi conflict was resolved by keeping the code added
in 'net' that turns off the buggy chip feature.
The MAINTAINERS conflict was merely overlapping changes, one
change updated all the wireless web site URLs and the other
changed some GIT trees to be Johannes's instead of John's.
Signed-off-by: David S. Miller <davem@davemloft.net>
'Get' commands should generally not require CAP_NET_ADMIN, with
the exception of those that expose internal state.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add dev_loopback_xmit() in order to deduplicate functions
ip_dev_loopback_xmit() (in net/ipv4/ip_output.c) and
ip6_dev_loopback_xmit() (in net/ipv6/ip6_output.c).
I was about to reinvent the wheel when I noticed that
ip_dev_loopback_xmit() and ip6_dev_loopback_xmit() do exactly what I
need and are not IP-only functions, but they were not available to reuse
elsewhere.
ip6_dev_loopback_xmit() does not have line "skb_dst_force(skb);", but I
understand that this is harmless, and should be in dev_loopback_xmit().
Signed-off-by: Michel Machado <michel@digirati.com.br>
CC: "David S. Miller" <davem@davemloft.net>
CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
CC: James Morris <jmorris@namei.org>
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
CC: Patrick McHardy <kaber@trash.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jpirko@redhat.com>
CC: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
CC: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix kernel-doc warnings in net/core:
Warning(net/core/skbuff.c:3368): No description found for parameter 'delta_truesize'
Warning(net/core/filter.c:628): No description found for parameter 'pfp'
Warning(net/core/filter.c:628): Excess function parameter 'sk' description in 'sk_unattached_filter_create'
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch extends the kernel's ethtool interface by adding support
for 2 new EEE commands - get_eee and set_eee.
Thanks go to Giuseppe Cavallaro for his original patch adding this support.
Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__alloc_skb() now extends tailroom to allow the use of padding added
by the heap allocator.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Denys found out "ip neigh" output was truncated to
about 54 neighbours.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Denys Fedoryshchenko <denys@visp.net.lb>
Signed-off-by: David S. Miller <davem@davemloft.net>
The only user of this was hal prior to its 0.5.12
release which happened over two years ago, so I'm
sure this can be removed without issues.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
drop_monitor calls several sleeping functions while in atomic context.
BUG: sleeping function called from invalid context at mm/slub.c:943
in_atomic(): 1, irqs_disabled(): 0, pid: 2103, name: kworker/0:2
Pid: 2103, comm: kworker/0:2 Not tainted 3.5.0-rc1+ #55
Call Trace:
[<ffffffff810697ca>] __might_sleep+0xca/0xf0
[<ffffffff811345a3>] kmem_cache_alloc_node+0x1b3/0x1c0
[<ffffffff8105578c>] ? queue_delayed_work_on+0x11c/0x130
[<ffffffff815343fb>] __alloc_skb+0x4b/0x230
[<ffffffffa00b0360>] ? reset_per_cpu_data+0x160/0x160 [drop_monitor]
[<ffffffffa00b022f>] reset_per_cpu_data+0x2f/0x160 [drop_monitor]
[<ffffffffa00b03ab>] send_dm_alert+0x4b/0xb0 [drop_monitor]
[<ffffffff810568e0>] process_one_work+0x130/0x4c0
[<ffffffff81058249>] worker_thread+0x159/0x360
[<ffffffff810580f0>] ? manage_workers.isra.27+0x240/0x240
[<ffffffff8105d403>] kthread+0x93/0xa0
[<ffffffff816be6d4>] kernel_thread_helper+0x4/0x10
[<ffffffff8105d370>] ? kthread_freezable_should_stop+0x80/0x80
[<ffffffff816be6d0>] ? gs_change+0xb/0xb
Rework the logic to call the sleeping functions in the right context.
Use the standard timer/workqueue api to let the system choose any cpu to
perform the allocation and netlink send.
Also avoid a loop if reset_per_cpu_data() cannot allocate memory:
use mod_timer() to wait 1/10 second before the next try.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Reviewed-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adding socket backlog len in INET_DIAG_SKMEMINFO is really useful to
diagnose various TCP problems.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to validate the number of pages consumed by data_len, otherwise the
frags array could be overflowed by userspace. So this patch validates data_len
and returns -EMSGSIZE when data_len would occupy more frags than MAX_SKB_FRAGS.
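One way to express the bound being enforced (a sketch under the
assumption that each frag holds at most one page; the exact call site is
not shown here):

/* data_len is spread over page-sized frags; anything needing more than
 * MAX_SKB_FRAGS entries cannot be represented in the skb */
if (data_len > MAX_SKB_FRAGS * PAGE_SIZE)
        return -EMSGSIZE;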
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that we have module alias macros for generic netlink families, let's use
those to mark modules with the appropriate family names for loading.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: David Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull user namespace enhancements from Eric Biederman:
"This is a course correction for the user namespace, so that we can
reach an inexpensive, maintainable, and reasonably complete
implementation.
Highlights:
- Config guards make it impossible to enable the user namespace and
code that has not been converted to be user namespace safe.
- Use of the new kuid_t type ensures that if you somehow get past the
config guards, the kernel will encounter type errors if you enable
user namespaces and attempt to compile in code whose permission
checks have not been updated to be user namespace safe.
- All uids from child user namespaces are mapped into the initial
user namespace before they are processed. Removing the need to add
an additional check to see if the user namespace of the compared
uids remains the same.
- With the user namespaces compiled out the performance is as good or
better than it is today.
- For most operations absolutely nothing changes performance or
operationally with the user namespace enabled.
- The worst case performance I could come up with was timing 1
billion cache cold stat operations with the user namespace code
enabled. This went from 156s to 164s on my laptop (or 156ns to
164ns per stat operation).
- (uid_t)-1 and (gid_t)-1 are reserved as an internal error value.
Most uid/gid setting system calls treat these values specially
anyway so attempting to use -1 as a uid would likely cause
entertaining failures in userspace.
- If setuid is called with a uid that can not be mapped setuid fails.
I have looked at sendmail, login, ssh and every other program I
could think of that would call setuid and they all check for and
handle the case where setuid fails.
- If stat or a similar system call is called from a context in which
we can not map a uid we lie and return overflowuid. The LFS
experience suggests not lying and returning an error code might be
better, but the historical precedent with uids is different and I
can not think of anything that would break by lying about a uid we
can't map.
- Capabilities are localized to the current user namespace making it
safe to give the initial user in a user namespace all capabilities.
My git tree covers all of the modifications needed to convert the core
kernel and enough changes to make a system bootable to runlevel 1."
Fix up trivial conflicts due to nearby independent changes in fs/stat.c
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits)
userns: Silence silly gcc warning.
cred: use correct cred accessor with regards to rcu read lock
userns: Convert the move_pages, and migrate_pages permission checks to use uid_eq
userns: Convert cgroup permission checks to use uid_eq
userns: Convert tmpfs to use kuid and kgid where appropriate
userns: Convert sysfs to use kgid/kuid where appropriate
userns: Convert sysctl permission checks to use kuid and kgids.
userns: Convert proc to use kuid/kgid where appropriate
userns: Convert ext4 to user kuid/kgid where appropriate
userns: Convert ext3 to use kuid/kgid where appropriate
userns: Convert ext2 to use kuid/kgid where appropriate.
userns: Convert devpts to use kuid/kgid where appropriate
userns: Convert binary formats to use kuid/kgid where appropriate
userns: Add negative depends on entries to avoid building code that is userns unsafe
userns: signal remove unnecessary map_cred_ns
userns: Teach inode_capable to understand inodes whose uids map to other namespaces.
userns: Fail exec for suid and sgid binaries with ids outside our user namespace.
userns: Convert stat to return values mapped from kuids and kgids
userns: Convert user specfied uids and gids in chown into kuids and kgid
userns: Use uid_eq gid_eq helpers when comparing kuids and kgids in the vfs
...
Pull cgroup updates from Tejun Heo:
"cgroup file type addition / removal is updated so that file types are
added and removed instead of individual files so that dynamic file
type addition / removal can be implemented by cgroup and used by
controllers. blkio controller changes which will come through block
tree are dependent on this. Other changes include res_counter cleanup
and disallowing kthread / PF_THREAD_BOUND threads to be attached to
non-root cgroups.
There's a reported bug with the file type addition / removal handling
which can lead to oops on cgroup umount. The issue is being looked
into. It shouldn't cause problems for most setups and isn't a
security concern."
Fix up trivial conflict in Documentation/feature-removal-schedule.txt
* 'for-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
res_counter: Account max_usage when calling res_counter_charge_nofail()
res_counter: Merge res_counter_charge and res_counter_charge_nofail
cgroups: disallow attaching kthreadd or PF_THREAD_BOUND threads
cgroup: remove cgroup_subsys->populate()
cgroup: get rid of populate for memcg
cgroup: pass struct mem_cgroup instead of struct cgroup to socket memcg
cgroup: make css->refcnt clearing on cgroup removal optional
cgroup: use negative bias on css->refcnt to block css_tryget()
cgroup: implement cgroup_rm_cftypes()
cgroup: introduce struct cfent
cgroup: relocate __d_cgrp() and __d_cft()
cgroup: remove cgroup_add_file[s]()
cgroup: convert memcg controller to the new cftype interface
memcg: always create memsw files if CONFIG_CGROUP_MEM_RES_CTLR_SWAP
cgroup: convert all non-memcg controllers to the new cftype interface
cgroup: relocate cftype and cgroup_subsys definitions in controllers
cgroup: merge cft_release_agent cftype array into the base files array
cgroup: implement cgroup_add_cftypes() and friends
cgroup: build list of all cgroups under a given cgroupfs_root
cgroup: move cgroup_clear_directory() call out of cgroup_populate_dir()
...
Pull security subsystem updates from James Morris:
"New notable features:
- The seccomp work from Will Drewry
- PR_{GET,SET}_NO_NEW_PRIVS from Andy Lutomirski
- Longer security labels for Smack from Casey Schaufler
- Additional ptrace restriction modes for Yama by Kees Cook"
Fix up trivial context conflicts in arch/x86/Kconfig and include/linux/filter.h
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (65 commits)
apparmor: fix long path failure due to disconnected path
apparmor: fix profile lookup for unconfined
ima: fix filename hint to reflect script interpreter name
KEYS: Don't check for NULL key pointer in key_validate()
Smack: allow for significantly longer Smack labels v4
gfp flags for security_inode_alloc()?
Smack: recursive tramsmute
Yama: replace capable() with ns_capable()
TOMOYO: Accept manager programs which do not start with / .
KEYS: Add invalidation support
KEYS: Do LRU discard in full keyrings
KEYS: Permit in-place link replacement in keyring list
KEYS: Perform RCU synchronisation on keys prior to key destruction
KEYS: Announce key type (un)registration
KEYS: Reorganise keys Makefile
KEYS: Move the key config into security/keys/Kconfig
KEYS: Use the compat keyctl() syscall wrapper on Sparc64 for Sparc32 compat
Yama: remove an unused variable
samples/seccomp: fix dependencies on arch macros
Yama: add additional ptrace scopes
...
Move tcp_try_coalesce() protocol independent part to
skb_try_coalesce().
skb_try_coalesce() can be used in IPv4 defrag and IPv6 reassembly,
to build optimized skbs (less sk_buff, and possibly less 'headers')
skb_try_coalesce() is zero copy, unless the copy can fit in destination
header (it's a rare case).
kfree_skb_partial() is also moved to net/core/skbuff.c and exported,
because IPv6 will need it in patch (ipv6: use skb coalescing in
reassembly).
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit c57b546840 (pktgen: fix crash at module unload) did a very poor
job with list primitives.
1) list_splice() arguments were in the wrong order
2) list_splice(list, head) has undefined behavior if head is not
initialized.
3) We should use the list_splice_init() variant to clear pktgen_threads
list.
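A sketch of the corrected pattern using the standard list.h primitives
(the surrounding pktgen teardown code is omitted):

LIST_HEAD(list);        /* local head, properly initialised */

/* move every entry from pktgen_threads onto the local list and leave
 * pktgen_threads empty, in one step */
list_splice_init(&pktgen_threads, &list);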
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix two issues introduced in commit a1c7fff7e1
( net: netdev_alloc_skb() use build_skb() )
- Must be IRQ safe (non NAPI drivers can use it)
- Must not leak the frag if build_skb() fails to allocate sk_buff
This patch introduces netdev_alloc_frag() for drivers willing to
use build_skb() instead of __netdev_alloc_skb() variants.
Factorize code so that :
__dev_alloc_skb() is a wrapper around __netdev_alloc_skb(), and
dev_alloc_skb() a wrapper around netdev_alloc_skb()
Use __GFP_COLD flag.
Almost all network drivers now benefit from skb->head_frag
infrastructure.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When I first wrote drop monitor I wrote it to just build monolithically. There
is no reason it can't be built modularly as well, so let's give it that
flexibility.
I've tested this by building it as both a module and monolithically, and it
seems to work quite well.
Change notes:
v2)
* fixed for_each_present_cpu loops to be more correct as per Eric D.
* Converted exit path failures to BUG_ON as per Ben H.
v3)
* Converted del_timer to del_timer_sync to close race noted by Ben H.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Ben Hutchings <bhutchings@solarflare.com>
Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
netdev_alloc_skb() is used by network drivers in their RX path to
allocate an skb to receive an incoming frame.
With recent skb->head_frag infrastructure, it makes sense to change
netdev_alloc_skb() to use build_skb() and a frag allocator.
This permits a zero copy splice(socket->pipe), and better GRO or TCP
coalescing.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the current logging style.
This enables use of dynamic debugging as well.
Convert printk(KERN_<LEVEL> to pr_<level>.
Add pr_fmt. Remove embedded prefixes, use
%s, __func__ instead.
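For example, the usual pr_fmt idiom that supplies the prefix removed
from the individual messages (shown as an illustration of the pattern,
not a quote of the patch):

/* must be defined before printk.h is pulled in by other headers */
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

pr_info("%s: link is up\n", __func__);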
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert printk(KERN_DEBUG to pr_debug which can
enable dynamic debugging.
Remove embedded prefixes from the conversions as
pr_fmt adds them.
Align arguments.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- sock_flag() accepts a const pointer
- sock_flag() returns a boolean
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We are going to delete the Token ring support. This removes any
special processing in the core networking for token ring, (aside
from net/tr.c itself), leaving the drivers and remaining tokenring
support present but inert.
The mass removal of the drivers and net/tr.c will be in a separate
commit, so that the history of these files that we still care
about won't have the giant deletion tied into their history.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Standardize the net core ratelimited logging functions.
Coalesce formats, align arguments.
Change a printk then vprintk sequence to use printf extension %pV.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 8a83a00b07.
It causes regressions for S390 devices, because it does an
unconditional DST drop on SKBs for vlans and the QETH device
needs the neighbour entry hung off the DST for certain things
on transmit.
Arnd can't remember exactly why he even needed this change.
Conflicts:
drivers/net/macvlan.c
net/8021q/vlan_dev.c
net/core/dev.c
Signed-off-by: David S. Miller <davem@davemloft.net>
ETHTOOL_GMODULEINFO returns a new struct ethtool_modinfo that will return the
type and size of plug-in module eeprom (such as SFP+) for parsing
by a userland program.
ETHTOOL_GMODULEEEPROM returns the raw eeprom information
using the existing ethtool_eeprom structure to return the data.
Signed-off-by: Stuart Hodgson <smhodgson@solarflare.com>
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
We want to support reading module (SFP+, XFP, ...) EEPROMs as well as
NIC EEPROMs. They will need a different command number and driver
operation, but the structure and arguments will be the same and so we
can share most of the code here.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
To build ip_vs as a module sysctl_rmem_max and sysctl_wmem_max
need to be exported.
The dependency was added by "ipvs: wakeup master thread" patch.
Signed-off-by: Hans Schillstrom <hans.schillstrom@ericsson.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Conflicts:
drivers/net/ethernet/intel/e1000e/param.c
drivers/net/wireless/iwlwifi/iwl-agn-rx.c
drivers/net/wireless/iwlwifi/iwl-trans-pcie-rx.c
drivers/net/wireless/iwlwifi/iwl-trans.h
Resolved the iwlwifi conflict with mainline using 3-way diff posted
by John Linville and Stephen Rothwell. In 'net' we added a bug
fix to make iwlwifi report a more accurate skb->truesize but this
conflicted with RX path changes that happened meanwhile in net-next.
In e1000e a conflict arose in the validation code for settings of
adapter->itr. 'net-next' had more sophisticated logic so that
logic was used.
Signed-off-by: David S. Miller <davem@davemloft.net>
With the recent changes for how we compute the skb truesize, it occurs to me
we are probably going to have a lot of calls to skb_end_pointer(skb) -
skb->head. Instead of running all over the place doing that, it would make
more sense to just make it a separate inline, skb_end_offset(skb); that way
we can return the correct value without gcc having to do all the
optimization to cancel out skb->head - skb->head.
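A hedged sketch of such a helper (the in-tree version differs depending
on whether skb->end is stored as a pointer or as an offset):

static inline unsigned int skb_end_offset_sketch(const struct sk_buff *skb)
{
        return skb_end_pointer(skb) - skb->head;
}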
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since there is now only one spot that actually uses "fastpath" there isn't
much point in carrying it. Instead we can just use a check for skb_cloned
to verify if we can perform the fast-path free for the head or not.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The fast-path for pskb_expand_head contains a check where the size plus the
unaligned size of skb_shared_info is compared against the size of the data
buffer. This code path has two issues. First is the fact that after the
recent changes by Eric Dumazet to __alloc_skb and build_skb the shared info
is always placed in the optimal spot for a buffer size making this check
unnecessary. The second issue is the fact that the check doesn't take into
account the aligned size of shared info. As a result the code burns cycles
doing a memcpy with nothing actually being shifted.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge tag 'v3.4-rc5' into next
Linux 3.4-rc5
Merge to pull in prerequisite change for Smack:
86812bb0de
Requested by Casey.
This patch adds support for a skb_head_is_locked helper function. It is
meant to be used any time we are considering transferring the head from
skb->head to a paged frag. If the head is locked it means we cannot remove
the head from the skb so it must be copied or we must take the skb as a
whole.
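A hedged sketch of what "locked" means here (assumed logic; the actual
helper may differ):

static inline bool skb_head_is_locked_sketch(const struct sk_buff *skb)
{
        /* the head cannot be moved into a page frag if it was
         * kmalloc()ed (no backing page) or if clones still share it */
        return !skb->head_frag || skb_cloned(skb);
}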
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
GRO is very optimistic in skb truesize estimates, only taking into
account the used part of fragments.
Be conservative, and use more precise computation, so that bloated GRO
skbs can be collapsed eventually.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These functions are no longer needed; replace them with their more useful equivalents.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
This change is meant to prevent stealing the skb->head to use as a page in
the event that the skb->head was cloned. This allows the other clones to
track each other via shinfo->dataref.
Without this we break down to two methods for tracking the reference count,
one being dataref, the other being the page count. As a result it becomes
difficult to track how many references there are to skb->head.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I just noticed, after some recent updates, that the init path for the drop
monitor protocol has a minor error. Drop monitor maintains a per-cpu structure
that gets initialized from a single cpu. Normally this is fine, as the protocol
isn't in use yet, but I recently made a change that causes a failed skb
allocation to reschedule itself. Given the current code, the implication is
that this workqueue reschedule will take place on the wrong cpu. If drop
monitor is used early during the boot process, it's possible that two cpus will
access a single per-cpu structure in parallel, possibly leading to data
corruption.
This patch fixes the situation by storing the cpu number that a given instance
of this per-cpu data should be accessed from. In the case of a need for a
reschedule, the reschedule is assigned to the cpu stored in the struct, rather
than the currently executing cpu.
Tested successfully by myself.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: David Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP or UDP stacks have big enough latencies that prefetching next
pointer is worth it.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__skb_splice_bits() can check if skb to be spliced has its skb->head
mapped to a page fragment, instead of a kmalloc() area.
If so we can avoid a copy of the skb head and get a reference on
underlying page.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Matt Carlson <mcarlson@broadcom.com>
Cc: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
GRO can check if skb to be merged has its skb->head mapped to a page
fragment, instead of a kmalloc() area.
We 'upgrade' skb->head as a fragment in itself.
This avoids the frag_list fallback, and permits building true GRO skbs
(one sk_buff and up to 16 fragments), using less memory.
This reduces the number of cache misses when the user makes its copy, since a
single sk_buff is fetched.
This is a followup of patch "net: allow skb->head to be a page fragment"
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Matt Carlson <mcarlson@broadcom.com>
Cc: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb->head is currently allocated from kmalloc(). This is convenient but
has the drawback the data cannot be converted to a page fragment if
needed.
We have three spots where it hurts:
1) GRO aggregation
When a linear skb must be appended to another skb, GRO uses the
frag_list fallback, which is very inefficient since we keep all struct sk_buff
around. So drivers enabling GRO but delivering linear skbs to the network
stack aren't getting full GRO power.
2) splice(socket -> pipe).
We must copy the linear part to a page fragment.
This kind of defeats the splice() purpose (zero copy claim).
3) TCP coalescing.
Recently introduced, this permits grouping several contiguous segments
into a single skb. This shortens queue lengths, saves kernel memory,
and greatly reduces the probability of TCP collapses. This coalescing
doesn't work on linear skbs (or we would need to copy data, which would be
too slow).
Given all these issues, the following patch introduces the possibility
of having skb->head be a fragment in itself. We use a new skb flag,
skb->head_frag to carry this information.
build_skb() is changed to accept a frag_size argument. Drivers willing
to provide a page fragment instead of kmalloc() data will set a non-zero
value equal to the fragment size.
Then, in situations where we need to convert the skb head to a frag in itself,
we can check if skb->head_frag is set and avoid the copies or various
fallbacks we have.
This means drivers currently using frags could be updated to avoid the
current skb->head allocation and reduce their memory footprint (aka skb
truesize); that's 512 or 1024 bytes saved per skb. This also makes
bpf/netfilter faster since the 'first frag' will be part of the skb linear
part, with no need to copy data.
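For illustration, a hedged sketch of a driver handing a page fragment to the
stack through the extended build_skb() (the local names data/fragsz are
assumptions):

	/* data points inside a page owned by the driver, fragsz is the
	 * fragment size; frag_size != 0 makes the skb carry head_frag = 1
	 */
	skb = build_skb(data, fragsz);
	if (!skb) {
		put_page(virt_to_head_page(data));
		return NULL;
	}
	skb_reserve(skb, NET_SKB_PAD);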
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Matt Carlson <mcarlson@broadcom.com>
Cc: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixed a coding style issue relating to spaces
in net/core/sock.c
Signed-off-by: Jeffrin Jose <ahiliation@yahoo.co.in>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet pointed out to me that the drop_monitor protocol has some holes in
its smp protections. Specifically, it's possible to replace data->skb while it's
being written. This patch corrects that by making data->skb an rcu protected
variable. That will prevent it from being overwritten while a tracepoint is
modifying it.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: David Miller <davem@davemloft.net>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet pointed out this warning in the drop_monitor protocol to me:
[ 38.352571] BUG: sleeping function called from invalid context at kernel/mutex.c:85
[ 38.352576] in_atomic(): 1, irqs_disabled(): 0, pid: 4415, name: dropwatch
[ 38.352580] Pid: 4415, comm: dropwatch Not tainted 3.4.0-rc2+ #71
[ 38.352582] Call Trace:
[ 38.352592] [<ffffffff8153aaf0>] ? trace_napi_poll_hit+0xd0/0xd0
[ 38.352599] [<ffffffff81063f2a>] __might_sleep+0xca/0xf0
[ 38.352606] [<ffffffff81655b16>] mutex_lock+0x26/0x50
[ 38.352610] [<ffffffff8153aaf0>] ? trace_napi_poll_hit+0xd0/0xd0
[ 38.352616] [<ffffffff810b72d9>] tracepoint_probe_register+0x29/0x90
[ 38.352621] [<ffffffff8153a585>] set_all_monitor_traces+0x105/0x170
[ 38.352625] [<ffffffff8153a8ca>] net_dm_cmd_trace+0x2a/0x40
[ 38.352630] [<ffffffff8154a81a>] genl_rcv_msg+0x21a/0x2b0
[ 38.352636] [<ffffffff810f8029>] ? zone_statistics+0x99/0xc0
[ 38.352640] [<ffffffff8154a600>] ? genl_rcv+0x30/0x30
[ 38.352645] [<ffffffff8154a059>] netlink_rcv_skb+0xa9/0xd0
[ 38.352649] [<ffffffff8154a5f0>] genl_rcv+0x20/0x30
[ 38.352653] [<ffffffff81549a7e>] netlink_unicast+0x1ae/0x1f0
[ 38.352658] [<ffffffff81549d76>] netlink_sendmsg+0x2b6/0x310
[ 38.352663] [<ffffffff8150824f>] sock_sendmsg+0x10f/0x130
[ 38.352668] [<ffffffff8150abe0>] ? move_addr_to_kernel+0x60/0xb0
[ 38.352673] [<ffffffff81515f04>] ? verify_iovec+0x64/0xe0
[ 38.352677] [<ffffffff81509c46>] __sys_sendmsg+0x386/0x390
[ 38.352682] [<ffffffff810ffaf9>] ? handle_mm_fault+0x139/0x210
[ 38.352687] [<ffffffff8165b5bc>] ? do_page_fault+0x1ec/0x4f0
[ 38.352693] [<ffffffff8106ba4d>] ? set_next_entity+0x9d/0xb0
[ 38.352699] [<ffffffff81310b49>] ? tty_ldisc_deref+0x9/0x10
[ 38.352703] [<ffffffff8106d363>] ? pick_next_task_fair+0x63/0x140
[ 38.352708] [<ffffffff8150b8d4>] sys_sendmsg+0x44/0x80
[ 38.352713] [<ffffffff8165f8e2>] system_call_fastpath+0x16/0x1b
It stems from holding a spinlock (trace_state_lock) while attempting to register
or unregister tracepoint hooks, making in_atomic() true in this context, leading
to the warning when the tracepoint calls might_sleep() while it's taking a mutex.
Since we only use the trace_state_lock to prevent trace protocol state races, as
well as hardware stat list updates on an rcu write side, we can just convert the
spinlock to a mutex to avoid this problem.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: David Miller <davem@davemloft.net>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use min_t()/max_t() macros, reformat two comments, use !!test_bit() to
match !!sock_flag()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
read only, so change it to const.
Signed-off-by: Shan Wei <davidshan@tencent.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix merge between commit 3adadc08cc ("net ax25: Reorder ax25_exit to
remove races") and commit 0ca7a4c87d ("net ax25: Simplify and
cleanup the ax25 sysctl handling")
The former moved around the sysctl register/unregister calls, the
later simply removed them.
With help from Stephen Rothwell.
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 35f3d14db (pipe: add support for shrinking and growing pipes)
added a slowdown for splice(socket -> pipe), as we might grow the spd
used in skb_splice_bits() for each skb we process in splice() syscall.
It's not needed since skb lengths are capped. The default on-stack arrays
are more than enough.
Use MAX_SKB_FRAGS instead of PIPE_DEF_BUFFERS to describe the reasonable
limit per skb.
Add coalescing support to help splicing of GRO skbs built from linear
skbs (linked into frag_list)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_add_backlog() & sk_rcvqueues_full() hard coded sk_rcvbuf as the
memory limit. We need to make this limit a parameter for TCP use.
No functional change is expected in this patch; all callers still use the
old sk_rcvbuf limit.
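A hedged sketch of the resulting interface and a typical call site:

bool sk_rcvqueues_full(const struct sock *sk, const struct sk_buff *skb,
		       unsigned int limit);
int sk_add_backlog(struct sock *sk, struct sk_buff *skb, unsigned int limit);

	/* existing callers simply keep passing the old limit */
	if (sk_add_backlog(sk, skb, sk->sk_rcvbuf))
		goto discard;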
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Rick Jones <rick.jones2@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
splice() from socket to pipe needs the linear_to_page() helper to transfer
the skb header to part of a page.
We can reset the offset in the current sk->sk_sndmsg_page if we are the
last user of the page.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It seems there is a logic error in trace_drop_common(), since we store
only 64 drops, even if they are from the same location.
This fix is a one-liner, but we probably need more work to avoid useless
atomic dec/inc operations.
Now I can watch 1 Mpps drops through dropwatch...
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Name them in a "backward compatible" manner, i.e. reuse or not
are still 1 and 0 respectively. The reuse value of 2 means that
the socket with it will forcibly reuse everyone else's port.
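A hedged sketch of the resulting named constants:

enum {
	SK_NO_REUSE	= 0,	/* do not reuse ports */
	SK_CAN_REUSE	= 1,	/* reuse allowed, as before */
	SK_FORCE_REUSE	= 2,	/* forcibly reuse everyone else's port */
};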
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We don't use struct ctl_path anymore so delete the exported constants.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This results in code with less boilerplate that is a bit easier
to read.
It additionally stops us from using compatibility code in the sysctl
core, hastening the day when the compatibility code can be removed.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using an ascii path to register_net_sysctl as opposed to the slightly
awkward ctl_path allows for much simpler code.
We no longer need to malloc dev_name to keep it alive for the length of our
sysctl registration; instead we can use a small temporary buffer on the
stack.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On the next line we register the net_core_table in net/core which
creates the directory and ensures it exists.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This makes it clearer which sysctls are relative to your current network
namespace.
This makes it a little less error prone by not exposing sysctls for the
initial network namespace in other namespaces.
This is the same way we handle all of our other network interfaces to
userspace and I can't honestly remember why we didn't do this for
sysctls right from the start.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
register_sysctl_rotable never caught on as an interesting way to
register sysctls. My take on the situation is that what we want are
sysctls that we can only see in the initial network namespace. What we
have implemented with register_sysctl_rotable are sysctls that we can
see in all of the network namespaces and can only change in the initial
network namespace.
That is a very silly way to go. Just register the network sysctls
in the initial network namespace and we don't have any weird special
cases to deal with.
The sysctls affected are:
/proc/sys/net/ipv4/ipfrag_secret_interval
/proc/sys/net/ipv4/ipfrag_max_dist
/proc/sys/net/ipv6/ip6frag_secret_interval
/proc/sys/net/ipv6/mld_max_msf
I really don't expect anyone will miss them if they can't read them in a
child user namespace.
CC: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As part of GRO processing, merged skbs should be consumed, not freed, to
not confuse dropwatch/drop_monitor.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we need to clone an skb, we don't drop a packet.
Call consume_skb() to not confuse dropwatch.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/core/sysctl_net_core.c: In function ‘sysctl_core_init’:
net/core/sysctl_net_core.c:259: error: implicit declaration of function ‘kmemleak_not_leak’
with same error in net/ipv4/route.c
Signed-off-by: Shan Wei <davidshan@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ops_init should free the net_generic data on
init failure and __register_pernet_operations should not
call ops_free when NET_NS is not enabled.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is useful to be able to monitor for FDB events in user space.
This patch adds support to generate netlink events when a change
is made to a device supporting the FDB ops.
This brings embedded switches in line with the SW net/bridge, which
triggers events on FDB updates as well.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds a generic dump routine drivers can call. It
should be sufficient to handle any bridging model that
uses the unicast address list. This should be most SR-IOV
enabled NICs.
v2: return error on nlmsg_put and use -EMSGSIZE instead
of -ENOMEM; this is in line with other usages.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds dev_uc_add_excl() and dev_mc_add_excl() calls,
similar to the original dev_{uc|mc}_add() except that they set
the global bit and return -EEXIST for duplicate entries.
This is useful for drivers that support SR-IOV, macvlan
devices and any other devices that need to manage the
unicast and multicast lists.
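The new entry points, roughly (hedged sketch of the prototypes):

/* like dev_uc_add()/dev_mc_add(), but mark the entry global and return
 * -EEXIST for a duplicate address instead of bumping its refcount
 */
int dev_uc_add_excl(struct net_device *dev, const unsigned char *addr);
int dev_mc_add_excl(struct net_device *dev, const unsigned char *addr);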
v2: fix typo UNICAST should be MULTICAST in dev_mc_add_excl()
CC: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds two new flags NTF_MASTER and NTF_SELF that can
now be used to specify where PF_BRIDGE netlink commands should
be sent. NTF_MASTER sends the commands to the 'dev->master'
device for parsing. Typically this will be the linux net/bridge,
or open-vswitch devices. Also without any flags set the command
will be handled by the master device as well so that current user
space tools continue to work as expected.
The NTF_SELF flag will push the PF_BRIDGE commands to the
device. In the basic example below the commands are then parsed
and programmed in the embedded bridge.
Note that if both the NTF_SELF and NTF_MASTER bits are set then the
command will be sent to both 'dev->master' and 'dev'; this allows
user space to easily keep the embedded bridge and software bridge
in sync.
There is a slight complication in the case with both flags set
when an error occurs. To resolve this the rtnl handler clears
the NTF_ flag in the netlink ack to indicate which sets completed
successfully. The add/del handlers will abort as soon as any
error occurs.
To support this new net device ops were added to call into
the device and the existing bridging code was refactored
to use these. There should be no required changes in user space
to support the current bridge behavior.
A basic setup with a SR-IOV enabled NIC looks like this,
      veth0          veth2
        |              |
      --------------------
      |     bridge0      |      <---- software bridging
      --------------------
              /
             /
      ethx.y        ethx
        VF           PF
         \            \         <---- propagate FDB entries to HW
          \            \
      --------------------
      |  Embedded Bridge |      <---- hardware offloaded switching
      --------------------
In this case the embedded bridge must be managed to allow 'veth0'
to communicate with 'ethx.y' correctly. At present drivers managing
the embedded bridge either send frames onto the network which
then get dropped by the switch OR the embedded bridge will flood
these frames. With this patch we have a mechanism to manage the
embedded bridge correctly from user space. This example is specific
to SR-IOV but replacing the VF with another PF or dropping this
into the DSA framework generates similar management issues.
Examples session using the 'br'[1] tool to add, dump and then
delete a mac address with a new "embedded" option and enabled
ixgbe driver:
# br fdb add 22:35:19:ac:60:59 dev eth3
# br fdb
port mac addr flags
veth0 22:35:19:ac:60:58 static
veth0 9a:5f:81:f7:f6:ec local
eth3 00:1b:21:55:23:59 local
eth3 22:35:19:ac:60:59 static
veth0 22:35:19:ac:60:57 static
#br fdb add 22:35:19:ac:60:59 embedded dev eth3
#br fdb
port mac addr flags
veth0 22:35:19:ac:60:58 static
veth0 9a:5f:81:f7:f6:ec local
eth3 00:1b:21:55:23:59 local
eth3 22:35:19:ac:60:59 static
veth0 22:35:19:ac:60:57 static
eth3 22:35:19:ac:60:59 local embedded
#br fdb del 22:35:19:ac:60:59 embedded dev eth3
All I added was a couple of lines to 'br' to set the flags correctly. In my
opinion, the merit of this patch is that embedded and SW bridges can now both
be modeled correctly in user space using very nearly the same message passing.
[1] 'br' tool was published as an RFC here and will be renamed 'bridge'
http://patchwork.ozlabs.org/patch/117664/
Thanks to Jamal Hadi Salim, Stephen Hemminger and Ben Hutchings for
valuable feedback, suggestions, and review.
v2: fixed api descriptions and error case with both NTF_SELF and
NTF_MASTER set plus updated patch description.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use of "unsigned int" is preferred to bare "unsigned" in net tree.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduces a new BPF ancillary instruction that all LD calls will be
mapped through when sk_run_filter() is being used for seccomp BPF. The
rewriting will be done using a secondary chk_filter function that is run
after sk_chk_filter.
The code change is guarded by CONFIG_SECCOMP_FILTER which is added,
along with the seccomp_bpf_load() function later in this series.
This is based on http://lkml.org/lkml/2012/3/2/141
Suggested-by: Indan Zupancic <indan@nul.nu>
Signed-off-by: Will Drewry <wad@chromium.org>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Eric Paris <eparis@redhat.com>
v18: rebase
...
v15: include seccomp.h explicitly for when seccomp_bpf_load exists.
v14: First cut using a single additional instruction
...
v13: made bpf functions generic.
Signed-off-by: James Morris <james.l.morris@oracle.com>
neigh_table_init_no_netlink() is only used in the net/core/neighbour.c file.
Signed-off-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change get_tx_queues, drop the unused arg/return value real_tx_queues,
and use return by value (with error) rather than call by reference.
Probably bonding should just change to LLTX and the whole get_tx_queues
API could disappear!
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull in the 'net' tree to get CAIF bug fixes upon which
the following set of CAIF feature patches depend.
Signed-off-by: David S. Miller <davem@davemloft.net>
We already synthesize events in register_netdevice_notifier, and synthesizing
events in unregister_netdevice_notifier allows us to remove the need for
special case cleanup code.
This change should be safe as it adds no new cases for existing callers
of unregister_netdevice_notifier to handle.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixed coding style issues in net/core/utils.c
in relation with braces placement.
Signed-off-by: Jeffrin Jose <ahiliation@yahoo.co.in>
Signed-off-by: David S. Miller <davem@davemloft.net>
Changed net/core/net-sysfs.c: netdev_store() to use kstrtoul()
instead of obsolete simple_strtoul().
Signed-off-by: Shuah Khan <shuahkhan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Marc Merlin reported many order-1 allocation failures in the TX path on his
wireless setup, which don't make any sense with an MTU=1500 network and non
SG capable hardware.
It turns out part of the problem comes from pskb_expand_head() not using
ksize() to get the exact head size given by kmalloc(). Doing the same thing
as __alloc_skb() allows more tailroom in the skb and can prevent future
reallocations.
As a bonus, struct skb_shared_info becomes cache line aligned.
Reported-by: Marc MERLIN <marc@merlins.org>
Tested-by: Marc MERLIN <marc@merlins.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The only reason cgroup was used was to be consistent with the populate()
interface. Now that we're getting rid of it, not only do we no longer need
it, but we also *can't* call it this way.
Since we will no longer rely on populate(), this will be called from
create(). During create, the association between struct mem_cgroup
and struct cgroup does not yet exist, since cgroup internals haven't
yet initialized their bookkeeping. This means we would not be able
to draw the memcg pointer from the cgroup pointer in these
functions, which is highly undesirable.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
CC: Li Zefan <lizefan@huawei.com>
CC: Johannes Weiner <hannes@cmpxchg.org>
CC: Michal Hocko <mhocko@suse.cz>
As soon as an skb is queued into socket error queue, another thread
can consume it, so we are not allowed to reference skb anymore, or risk
use after free.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit 3e4d3af501 (mm: stack based kmap_atomic()) we don't have
to disable BH anymore while mapping skb frags.
We can remove the kmap_skb_frag() / kunmap_skb_frag() helpers and use
kmap_atomic() / kunmap_atomic() instead.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, most drivers do not support transmit SO_TIMESTAMPING. For those
that do support it, there is one appropriate response to the get_ts_info
query. This patch adds a common function providing this response.
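A hedged sketch of what the common response amounts to (software
timestamping only, no PTP hardware clock):

int ethtool_op_get_ts_info(struct net_device *dev, struct ethtool_ts_info *info)
{
	info->so_timestamping = SOF_TIMESTAMPING_TX_SOFTWARE |
				SOF_TIMESTAMPING_RX_SOFTWARE |
				SOF_TIMESTAMPING_SOFTWARE;
	info->phc_index = -1;	/* no PHC associated with this device */
	return 0;
}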
Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit adds a new ethtool ioctl that exposes the SO_TIMESTAMPING
capabilities of a network interface. In addition, user space programs
can use this ioctl to discover the PTP Hardware Clock (PHC) device
associated with the interface.
Since software receive time stamps are handled by the stack, the generic
ethtool code can answer the query correctly in case the MAC or PHY
drivers lack special time stamping features.
Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add XOR instruction to the BPF machine. Needed for computing packet hashes.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Today, BPF filters are bound to sockets. Since the BPF machine is becoming
handy for other purposes, this patch allows creating an unattached filter.
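A hedged usage sketch (the trivial accept-all program and the skb being
filtered are assumptions for illustration):

struct sock_filter insns[] = {
	BPF_STMT(BPF_RET | BPF_K, 0xffff),	/* accept the whole packet */
};
struct sock_fprog fprog = { .len = 1, .filter = insns };
struct sk_filter *fp;

if (!sk_unattached_filter_create(&fp, &fprog)) {
	unsigned int keep = SK_RUN_FILTER(fp, skb);	/* bytes to keep, 0 = drop */
	sk_unattached_filter_destroy(fp);
}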
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function is renamed to make it a little more clear what it does.
It is not added to any .h because it is not for general consumption, only for
bpf internal use (and so by the jits).
Signed-off-by: Jan Seiffert <kaffeemonster@googlemail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit f04565ddf5 (dev: use name hash for dev_seq_ops) added a second
regression, as some devices are missing from /proc/net/dev if many
devices are defined.
When the seq_file buffer is filled, the last ->next/show() method is
canceled (the pos value is reverted to its value prior to the ->next() call).
The problem is that after the above commit, we don't restart the lookup at the
right position in the ->start() method.
Fix this by removing the internal 'pos' pointer added in that commit, since
we need to use the 'loff_t *pos' provided by the seq_file layer.
This also reverts commit 5cac98dd0 (net: Fix corruption
in /proc/*/net/dev_mcast), since it's not needed anymore.
Reported-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Mihai Maruseac <mmaruseac@ixiacom.com>
Tested-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Provide device string properly for USB i2400m wimax devices, also
don't OOPS when providing firmware string. From Phil Sutter.
2) Add support for sh_eth SH7734 chips, from Nobuhiro Iwamatsu.
3) Add another device ID to USB zaurus driver, from Guan Xin.
4) Loop index start in pool vector iterator is wrong causing MAC to not
get configured in bnx2x driver, fix from Dmitry Kravkov.
5) EQL driver assumes HZ=100, fix from Eric Dumazet.
6) Now that skb_add_rx_frag() can specify the truesize increment
separately, do so in f_phonet and cdc_phonet, also from Eric
Dumazet.
7) virtio_net accidentally uses net_ratelimit() not only on the kernel
warning but also the statistic bump, fix from Rick Jones.
8) ip_route_input_mc() uses fixed init_net namespace, oops, use
dev_net(dev) instead. Fix from Benjamin LaHaise.
9) dev_forward_skb() needs to clear the incoming interface index of the
SKB so that it looks like a new incoming packet, also from Benjamin
LaHaise.
10) iwlwifi mistakenly initializes a channel entry as 2GHZ instead of
5GHZ, fix from Stanislav Yakovlev.
11) Missing kmalloc() return value checks in orinoco, from Santosh
Nayak.
12) ath9k doesn't check for HT capabilities in the right way, it is
checking ht_supported instead of the ATH9K_HW_CAP_HT flag. Fix from
Sujith Manoharan.
13) Fix x86 BPF JIT emission of 16-bit immediate field of AND
instructions, from Feiran Zhuang.
14) Avoid infinite loop in GARP code when registering sysfs entries.
From David Ward.
15) rose protocol uses memcpy instead of memcmp in a device address
comparison, oops. Fix from Daniel Borkmann.
16) Fix build of lpc_eth due to the dev_hw_addr_random() interface being
renamed to eth_hw_addr_random(). From Roland Stigge.
17) Make ipv6 RTM_GETROUTE interpret RTA_IIF attribute the same way
that ipv4 does. Fix from Shmulik Ladkani.
18) via-rhine has an inverted bit test, causing suspend/resume
regressions. Fix from Andreas Mohr.
19) RIONET assumes 4K page size, fix from Akinobu Mita.
20) Initialization of imask register in sky2 is buggy, because bits are
"or'd" into an uninitialized local variable. Fix from Lino
Sanfilippo.
21) Fix FCOE checksum offload handling, from Yi Zou.
22) Fix VLAN processing regression in e1000, from Jiri Pirko.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (52 commits)
sky2: dont overwrite settings for PHY Quick link
tg3: Fix 5717 serdes powerdown problem
net: usb: cdc_eem: fix mtu
net: sh_eth: fix endian check for architecture independent
usb/rtl8150 : Remove duplicated definitions
rionet: fix page allocation order of rionet_active
via-rhine: fix wait-bit inversion.
ipv6: Fix RTM_GETROUTE's interpretation of RTA_IIF to be consistent with ipv4
net: lpc_eth: Fix rename of dev_hw_addr_random
net/netfilter/nfnetlink_acct.c: use linux/atomic.h
rose_dev: fix memcpy-bug in rose_set_mac_address
Fix non TBI PHY access; a bad merge undid bug fix in a previous commit.
net/garp: avoid infinite loop if attribute already exists
x86 bpf_jit: fix a bug in emitting the 16-bit immediate operand of AND
bonding: emit event when bonding changes MAC
mac80211: fix oper channel timestamp updation
ath9k: Use HW HT capabilites properly
MAINTAINERS: adding maintainer for ipw2x00
net: orinoco: add error handling for failed kmalloc().
net/wireless: ipw2x00: fix a typo in wiphy struct initilization
...
The standard ways of probing a device's promiscuity
(ifi_flags, for instance) do not report the actual
state of the device. This patch adds dev->promiscuity
to the netlink netdevice report so that users can know
for certain if the device is acting PROMISC or not.
Signed-off-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These macros contain a hidden goto, and are thus extremely error
prone and make code hard to audit.
Signed-off-by: David S. Miller <davem@davemloft.net>
These macros contain a hidden goto, and are thus extremely error
prone and make code hard to audit.
Signed-off-by: David S. Miller <davem@davemloft.net>
These macros contain a hidden goto, and are thus extremely error
prone and make code hard to audit.
Signed-off-by: David S. Miller <davem@davemloft.net>
These macros contain a hidden goto, and are thus extremely error
prone and make code hard to audit.
Signed-off-by: David S. Miller <davem@davemloft.net>
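For illustration (hedged: assuming these refer to the NLA_PUT*()/RTA_PUT*()
macro family), the conversion looks like:

	/* before: failure takes a hidden goto nla_put_failure */
	NLA_PUT_U32(skb, IFLA_MTU, dev->mtu);

	/* after: the error path is explicit at every call site */
	if (nla_put_u32(skb, IFLA_MTU, dev->mtu))
		goto nla_put_failure;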
Convert debug, freezer, cpuset, cpu_cgroup, cpuacct, net_prio, blkio,
net_cls and device controllers to use the new cftype based interface.
Termination entry is added to cftype arrays and populate callbacks are
replaced with cgroup_subsys->base_cftypes initializations.
This is functionally identical transformation. There shouldn't be any
visible behavior change.
memcg is rather special and will be converted separately.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Vivek Goyal <vgoyal@redhat.com>
blk-cgroup, netprio_cgroup, cls_cgroup and tcp_memcontrol
unnecessarily define cftype array and cgroup_subsys structures at the
top of the file, which is unconventional and necessiates forward
declaration of methods.
This patch relocates those below the definitions of the methods and
removes the forward declarations. Note that forward declaration of
tcp_files[] is added in tcp_memcontrol.c for tcp_init_cgroup(). This
will be removed soon by another patch.
This patch doesn't introduce any functional change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Merge tag 'split-asm_system_h-for-linus-20120328' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-asm_system
Pull "Disintegrate and delete asm/system.h" from David Howells:
"Here are a bunch of patches to disintegrate asm/system.h into a set of
separate bits to relieve the problem of circular inclusion
dependencies.
I've built all the working defconfigs from all the arches that I can
and made sure that they don't break.
The reason for these patches is that I recently encountered a circular
dependency problem that came about when I produced some patches to
optimise get_order() by rewriting it to use ilog2().
This uses bitops - and on the SH arch asm/bitops.h drags in
asm-generic/get_order.h by a circuituous route involving asm/system.h.
The main difficulty seems to be asm/system.h. It holds a number of
low level bits with no/few dependencies that are commonly used (eg.
memory barriers) and a number of bits with more dependencies that
aren't used in many places (eg. switch_to()).
These patches break asm/system.h up into the following core pieces:
(1) asm/barrier.h
Move memory barriers here. This already done for MIPS and Alpha.
(2) asm/switch_to.h
Move switch_to() and related stuff here.
(3) asm/exec.h
Move arch_align_stack() here. Other process execution related bits
could perhaps go here from asm/processor.h.
(4) asm/cmpxchg.h
Move xchg() and cmpxchg() here as they're full word atomic ops and
frequently used by atomic_xchg() and atomic_cmpxchg().
(5) asm/bug.h
Move die() and related bits.
(6) asm/auxvec.h
Move AT_VECTOR_SIZE_ARCH here.
Other arch headers are created as needed on a per-arch basis."
Fixed up some conflicts from other header file cleanups and moving code
around that has happened in the meantime, so David's testing is somewhat
weakened by that. We'll find out anything that got broken and fix it..
* tag 'split-asm_system_h-for-linus-20120328' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-asm_system: (38 commits)
Delete all instances of asm/system.h
Remove all #inclusions of asm/system.h
Add #includes needed to permit the removal of asm/system.h
Move all declarations of free_initmem() to linux/mm.h
Disintegrate asm/system.h for OpenRISC
Split arch_align_stack() out from asm-generic/system.h
Split the switch_to() wrapper out of asm-generic/system.h
Move the asm-generic/system.h xchg() implementation to asm-generic/cmpxchg.h
Create asm-generic/barrier.h
Make asm-generic/cmpxchg.h #include asm-generic/cmpxchg-local.h
Disintegrate asm/system.h for Xtensa
Disintegrate asm/system.h for Unicore32 [based on ver #3, changed by gxt]
Disintegrate asm/system.h for Tile
Disintegrate asm/system.h for Sparc
Disintegrate asm/system.h for SH
Disintegrate asm/system.h for Score
Disintegrate asm/system.h for S390
Disintegrate asm/system.h for PowerPC
Disintegrate asm/system.h for PA-RISC
Disintegrate asm/system.h for MN10300
...
Remove all #inclusions of asm/system.h preparatory to splitting and killing
it. Performed with the following command:
perl -p -i -e 's!^#\s*include\s*<asm/system[.]h>.*\n!!' `grep -Irl '^#\s*include\s*<asm/system[.]h>' *`
Signed-off-by: David Howells <dhowells@redhat.com>
While investigating another bug, I found that the code on the incoming path
in __netif_receive_skb will only set skb->skb_iif if it is already 0. When
dev_forward_skb() is used in the case of interfaces like veth, skb_iif may
already have been set. Making dev_forward_skb() cause the packet to look
like a newly received packet would seem to be the correct behaviour here,
as otherwise the wrong incoming interface can be reported for such a packet.
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_add_rx_frag() API is misleading.
Network skbs built with this helper can use uncharged kernel memory and
eventually stress or crash the machine under OOM.
Add a 'truesize' parameter and then fix drivers in followup patches.
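A hedged sketch of the extended helper and a typical driver call (pkt_len is
an assumed local):

void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page,
		     int off, int size, unsigned int truesize);

	/* charge the full buffer handed to the hardware, not just pkt_len */
	skb_add_rx_frag(skb, 0, page, 0, pkt_len, PAGE_SIZE);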
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Wey-Yi Guy <wey-yi.w.guy@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
napi->skb is allocated in napi_get_frags() using
netdev_alloc_skb_ip_align(), with a reserve of NET_SKB_PAD +
NET_IP_ALIGN bytes.
However, when such an skb is recycled in napi_reuse_skb(), it ends up with a
reserve of NET_IP_ALIGN only, which is suboptimal.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull kmap_atomic cleanup from Cong Wang.
It's been in -next for a long time, and it gets rid of the (no longer
used) second argument to k[un]map_atomic().
Fix up a few trivial conflicts in various drivers, and do an "evil
merge" to catch some new uses that have come in since Cong's tree.
* 'kmap_atomic' of git://github.com/congwang/linux: (59 commits)
feature-removal-schedule.txt: schedule the deprecated form of kmap_atomic() for removal
highmem: kill all __kmap_atomic() [swarren@nvidia.com: highmem: Fix ARM build break due to __kmap_atomic rename]
drbd: remove the second argument of k[un]map_atomic()
zcache: remove the second argument of k[un]map_atomic()
gma500: remove the second argument of k[un]map_atomic()
dm: remove the second argument of k[un]map_atomic()
tomoyo: remove the second argument of k[un]map_atomic()
sunrpc: remove the second argument of k[un]map_atomic()
rds: remove the second argument of k[un]map_atomic()
net: remove the second argument of k[un]map_atomic()
mm: remove the second argument of k[un]map_atomic()
lib: remove the second argument of k[un]map_atomic()
power: remove the second argument of k[un]map_atomic()
kdb: remove the second argument of k[un]map_atomic()
udf: remove the second argument of k[un]map_atomic()
ubifs: remove the second argument of k[un]map_atomic()
squashfs: remove the second argument of k[un]map_atomic()
reiserfs: remove the second argument of k[un]map_atomic()
ocfs2: remove the second argument of k[un]map_atomic()
ntfs: remove the second argument of k[un]map_atomic()
...
Pull networking merge from David Miller:
"1) Move ixgbe driver over to purely page based buffering on receive.
From Alexander Duyck.
2) Add receive packet steering support to e1000e, from Bruce Allan.
3) Convert TCP MD5 support over to RCU, from Eric Dumazet.
4) Reduce cpu usage in handling out-of-order TCP packets on modern
systems, also from Eric Dumazet.
5) Support the IP{,V6}_UNICAST_IF socket options, making the wine
folks happy, from Erich Hoover.
6) Support VLAN trunking from guests in hyperv driver, from Haiyang
Zhang.
7) Support byte-queue-limtis in r8169, from Igor Maravic.
8) Outline code intended for IP_RECVTOS in IP_PKTOPTIONS existed but
was never properly implemented, Jiri Benc fixed that.
9) 64-bit statistics support in r8169 and 8139too, from Junchang Wang.
10) Support kernel side dump filtering by ctmark in netfilter
ctnetlink, from Pablo Neira Ayuso.
11) Support byte-queue-limits in gianfar driver, from Paul Gortmaker.
12) Add new peek socket options to assist with socket migration, from
Pavel Emelyanov.
13) Add sch_plug packet scheduler whose queue is controlled by
userland daemons using explicit freeze and release commands. From
Shriram Rajagopalan.
14) Fix FCOE checksum offload handling on transmit, from Yi Zou."
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1846 commits)
Fix pppol2tp getsockname()
Remove printk from rds_sendmsg
ipv6: fix incorrent ipv6 ipsec packet fragment
cpsw: Hook up default ndo_change_mtu.
net: qmi_wwan: fix build error due to cdc-wdm dependecy
netdev: driver: ethernet: Add TI CPSW driver
netdev: driver: ethernet: add cpsw address lookup engine support
phy: add am79c874 PHY support
mlx4_core: fix race on comm channel
bonding: send igmp report for its master
fs_enet: Add MPC5125 FEC support and PHY interface selection
net: bpf_jit: fix BPF_S_LDX_B_MSH compilation
net: update the usage of CHECKSUM_UNNECESSARY
fcoe: use CHECKSUM_UNNECESSARY instead of CHECKSUM_PARTIAL on tx
net: do not do gso for CHECKSUM_UNNECESSARY in netif_needs_gso
ixgbe: Fix issues with SR-IOV loopback when flow control is disabled
net/hyperv: Fix the code handling tx busy
ixgbe: fix namespace issues when FCoE/DCB is not enabled
rtlwifi: Remove unused ETH_ADDR_LEN defines
igbvf: Use ETH_ALEN
...
Fix up fairly trivial conflicts in drivers/isdn/gigaset/interface.c and
drivers/net/usb/{Kconfig,qmi_wwan.c} as per David.
Pull cgroup changes from Tejun Heo:
"Out of the 8 commits, one fixes a long-standing locking issue around
tasklist walking and others are cleanups."
* 'for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: Walk task list under tasklist_lock in cgroup_enable_task_cg_list
cgroup: Remove wrong comment on cgroup_enable_task_cg_list()
cgroup: remove cgroup_subsys argument from callbacks
cgroup: remove extra calls to find_existing_css_set
cgroup: replace tasklist_lock with rcu_read_lock
cgroup: simplify double-check locking in cgroup_attach_proc
cgroup: move struct cgroup_pidlist out from the header file
cgroup: remove cgroup_attach_task_current_cg()
The following 4 functions:
move_addr_to_kernel
move_addr_to_user
verify_iovec
verify_compat_iovec
are always effectively called with a sockaddr_storage.
Make this explicit by changing their signature.
This removes a large number of casts from sockaddr_storage to sockaddr.
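The resulting prototypes, roughly (hedged sketch):

int move_addr_to_kernel(void __user *uaddr, int ulen,
			struct sockaddr_storage *kaddr);
int move_addr_to_user(struct sockaddr_storage *kaddr, int klen,
		      void __user *uaddr, int __user *ulen);
int verify_iovec(struct msghdr *m, struct iovec *iov,
		 struct sockaddr_storage *address, int mode);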
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/vmxnet3/vmxnet3_drv.c
Small vmxnet3 conflict with header size bug fix in 'net'.
Signed-off-by: David S. Miller <davem@davemloft.net>
Some drivers use the internal netdev stats member to store part of their
stats, yet advertise ndo_get_stats64() to implement some 64bit fields.
Allow them to use netdev_stats_to_stats64() helper to make the copy of
netdev stats before they compute their 64bit counters.
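A hedged sketch of the intended use in a driver's ndo_get_stats64()
(foo_priv and its 64bit counters are assumptions):

static struct rtnl_link_stats64 *foo_get_stats64(struct net_device *dev,
						 struct rtnl_link_stats64 *stats)
{
	struct foo_priv *priv = netdev_priv(dev);

	netdev_stats_to_stats64(stats, &dev->stats);	/* legacy counters first */
	stats->rx_bytes = priv->rx_bytes64;		/* then the true 64bit ones */
	stats->tx_bytes = priv->tx_bytes64;
	return stats;
}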
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
nlmsg_parse() might return an error, so test its return value before
potential random memory accesses.
The errors were introduced in commit 115c9b8192 (rtnetlink: Fix problem with
buffer allocation).
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Greg Rose <gregory.v.rose@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add VF spoof check to IFLA policy. The original patch I submitted to
add the spoof checking feature to rtnl failed to add the proper policy
rule that identifies the data type and len. This patch corrects that
oversight. No bugs have been reported against this, but it may cause
some problems for the netlink message parsing that uses the policy
table.
CC: stable@vger.kernel.org
Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Tested-by: Sibai Li <sibai.li@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Conflicts:
drivers/net/ethernet/sfc/rx.c
Overlapping changes in drivers/net/ethernet/sfc/rx.c, one to change
the rx_buf->is_page boolean into a set of u16 flags, and another to
adjust how ->ip_summed is initialized.
Signed-off-by: David S. Miller <davem@davemloft.net>
Davem considers that the argument list of this interface is getting
out of control. This patch tries to address this issue following
his proposal:
struct netlink_dump_control c = { .dump = dump, .done = done, ... };
netlink_dump_start(..., &c);
Suggested by David S. Miller.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This flag requests that network devices pass all
received frames up the stack, even ones with errors
such as an invalid FCS (frame check sequence). This will
allow sniffers to see bad packets and perhaps
give the user some idea how to fix the problem.
Signed-off-by: Ben Greear <greearb@candelatech.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
This is useful for testing RX handling of frames with bad
CRCs.
Requires driver support to actually put the packet on the
wire properly.
Signed-off-by: Ben Greear <greearb@candelatech.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
When set on hardware that supports the feature,
this causes the Ethernet FCS to be appended
to the end of the skb.
Useful for sniffing packets.
Signed-off-by: Ben Greear <greearb@candelatech.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
So here's a boot tested patch on top of Jason's series that does
all the cleanups I talked about and turns jump labels into a
more intuitive to use facility. It should also address the
various misconceptions and confusions that surround jump labels.
Typical usage scenarios:
#include <linux/static_key.h>
struct static_key key = STATIC_KEY_INIT_TRUE;
if (static_key_false(&key))
do unlikely code
else
do likely code
Or:
if (static_key_true(&key))
do likely code
else
do unlikely code
The static key is modified via:
static_key_slow_inc(&key);
...
static_key_slow_dec(&key);
The 'slow' prefix makes it abundantly clear that this is an
expensive operation.
I've updated all in-kernel code to use this everywhere. Note
that I (intentionally) have not pushed through the rename
blindly through to the lowest levels: the actual jump-label
patching arch facility should be named like that, so we want to
decouple jump labels from the static-key facility a bit.
On non-jump-label enabled architectures static keys default to
likely()/unlikely() branches.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Jason Baron <jbaron@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: a.p.zijlstra@chello.nl
Cc: mathieu.desnoyers@efficios.com
Cc: davem@davemloft.net
Cc: ddaney.cavm@gmail.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20120222085809.GA26397@elte.hu
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Implement a new netlink attribute type IFLA_EXT_MASK. The mask
is a 32 bit value that can be used to indicate to the kernel that
certain extended ifinfo values are requested by the user application.
At this time the only mask value defined is RTEXT_FILTER_VF to
indicate that the user wants the ifinfo dump to send information
about the VFs belonging to the interface.
This patch fixes a bug in which certain applications do not have
large enough buffers to accommodate the extra information returned
by the kernel with large numbers of SR-IOV virtual functions.
Those applications will not send the new netlink attribute with
the interface info dump request netlink messages so they will
not get unexpectedly large request buffers returned by the kernel.
Modifies the rtnl_calcit function to traverse the list of net
devices and compute the minimum buffer size that can hold the
info dumps of all matching devices based upon the filter passed
in via the new netlink attribute filter mask. If no filter
mask is sent then the buffer allocation defaults to NLMSG_GOODSIZE.
With this change it is possible to add yet to be defined netlink
attributes to the dump request which should make it fairly extensible
in the future.
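A hedged userspace sketch of attaching the new attribute to an RTM_GETLINK
dump request (buffer layout only, socket handling omitted):

struct {
	struct nlmsghdr  nlh;
	struct ifinfomsg ifm;
	char             buf[64];
} req = {
	.nlh.nlmsg_len   = NLMSG_LENGTH(sizeof(struct ifinfomsg)),
	.nlh.nlmsg_type  = RTM_GETLINK,
	.nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP,
};
__u32 mask = RTEXT_FILTER_VF;	/* ask for VF details in the dump */
struct rtattr *rta = (struct rtattr *)((char *)&req + NLMSG_ALIGN(req.nlh.nlmsg_len));

rta->rta_type = IFLA_EXT_MASK;
rta->rta_len  = RTA_LENGTH(sizeof(mask));
memcpy(RTA_DATA(rta), &mask, sizeof(mask));
req.nlh.nlmsg_len = NLMSG_ALIGN(req.nlh.nlmsg_len) + RTA_ALIGN(rta->rta_len);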
Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the fixed race condition happens:
1. While function neigh_periodic_work scans the neighbor hash table
pointed by field tbl->nht, it unlocks and locks tbl->lock between
buckets in order to call cond_resched.
2. Assume that function neigh_periodic_work calls cond_resched, that is,
the lock tbl->lock is available, and function neigh_hash_grow runs.
3. Once function neigh_hash_grow finishes, and RCU calls
neigh_hash_free_rcu, the original struct neigh_hash_table that function
neigh_periodic_work was using doesn't exist anymore.
4. Once back at neigh_periodic_work, whenever the old struct
neigh_hash_table is accessed, things can go badly.
Signed-off-by: Michel Machado <michel@digirati.com.br>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This one specifies where to start MSG_PEEK-ing queue data from. When
set to a negative value it means that MSG_PEEK works as usual -- it always
peeks from the head of the queue.
When some bytes are peeked from the queue and the peeking offset is
non-negative, it is moved forward so that the next peek will return the next
portion of data.
When a non-peeking recvmsg occurs and the peeking offset is non-negative,
it is moved backward so that the next peek will still peek the proper
data (i.e. the data that would have been picked if there were no non-peeking
recv in between).
The offset is set using a per-proto operation to let the protocol handle
the locking issues and to check whether the peeking offset feature is
supported by the protocol the socket belongs to.
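A hedged userspace usage sketch:

int off = 0;
/* enable the peek offset; a negative value restores the usual MSG_PEEK */
setsockopt(fd, SOL_SOCKET, SO_PEEK_OFF, &off, sizeof(off));

/* each MSG_PEEK read now returns the next chunk instead of re-reading */
recv(fd, buf, sizeof(buf), MSG_PEEK);
recv(fd, buf, sizeof(buf), MSG_PEEK);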
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This one is only considered for the MSG_PEEK flag, and the value pointed to by
it specifies where to start peeking bytes from. If the offset happens to
point into the middle of the returned skb, the offset within this skb is
put back into this very argument.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This makes lines shorter and simplifies further patching.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/broadcom/bnx2x/bnx2x_stats.c
Small minor conflict in bnx2x, wherein one commit changed how
statistics were stored in software, and another commit
fixed endianness bugs wrt. reading the values provided by
the chip in memory.
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 5a698af53f (bond: service netpoll arp queue on master device)
tested IFF_SLAVE flag against dev->priv_flags instead of dev->flags
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: WANG Cong <amwang@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_gro_receive() doesn't update truesize properly when adding one skb to a
frag_list.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the netprio_cgroup module is not loaded, net_prio_subsys_id
is -1, and so sock_update_prioidx() accesses the cgroup_subsys array
with negative index subsys[-1].
Make the code resemble the cls_cgroup code, which is bug free.
Originally-authored-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: "David S. Miller" <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
So we delay the allocation until the priority is set through cgroup,
and this makes skb_update_priority() faster when it's not set.
This also eliminates an off-by-one bug similar to the one fixed
in the previous patch.
Originally-authored-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: "David S. Miller" <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
# mount -t cgroup xxx /mnt
# mkdir /mnt/tmp
# cat /mnt/tmp/net_prio.ifpriomap
lo 0
eth0 0
virbr0 0
# echo 'lo 999' > /mnt/tmp/net_prio.ifpriomap
# cat /mnt/tmp/net_prio.ifpriomap
lo 999
eth0 0
virbr0 4101267344
We got weird output because we exceeded the boundary of the array.
We may even crash the kernel.
Originally-authored-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: "David S. Miller" <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shlomo Pongratz reported that the GRO L2 header check was suited for Ethernet
only, and failed on IB/ipoib traffic.
He provided a patch faking a zeroed header to let GRO aggregate frames.
Roland Dreier, Herbert Xu, and others suggested we change the GRO L2 header
check to be more generic, i.e. not assuming the L2 header is 14 bytes, but
taking into account hard_header_len.
__napi_gro_receive() has special handling for the common case (Ethernet)
to avoid a memcmp() call and use an inline optimized function instead.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Reported-by: Shlomo Pongratz <shlomop@mellanox.com>
Cc: Roland Dreier <roland@kernel.org>
Cc: Or Gerlitz <ogerlitz@mellanox.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Tested-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It was recently pointed out to me that the get_prioidx function sets a bit in
the prioidx map prior to checking to see if the index being set is out of
bounds. This patch corrects that, avoiding the possibility of us writing beyond
the end of the array.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: Stanislaw Gruszka <sgruszka@redhat.com>
CC: Stanislaw Gruszka <sgruszka@redhat.com>
CC: "David S. Miller" <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The argument is not used at all, and it's not necessary, because
a specific callback handler of course knows which subsys it
belongs to.
Now only ->populate() takes this argument, because the handlers of
this callback always call cgroup_add_file()/cgroup_add_files().
So we reduce a few lines of code, though the shrinking of object size
is minimal.
16 files changed, 113 insertions(+), 162 deletions(-)
text data bss dec hex filename
5486240 656987 7039960 13183187 c928d3 vmlinux.o.orig
5486170 656987 7039960 13183117 c9288d vmlinux.o
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Use the current logging style.
Coalesce formats where appropriate.
Update grammar where appropriate.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The parameters for ETHTOOL_FLASHDEV include a filename, which ought to
be null-terminated. Currently the only driver that implements
ethtool_ops::flash_device attempts to add a null terminator if
necessary, but does it wrongly. Do it in the ethtool core instead.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the types in the packet layout order.
Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use a more current message logging style.
Add pr_fmt to prefix dmesg output with "netpoll: "
Add macros to print np->name.
Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add ability to return neighbour proxies list to caller if
it sent full ndmsg structure and has NTF_PROXY flag set.
Before this patch (and before iproute2 patches):
$ ip neigh add proxy 2001::1 dev eth0
$ ip -6 neigh show
$
After it and with applied iproute2 patches:
$ ip neigh add proxy 2001::1 dev eth0
$ ip -6 neigh show
2001::1 dev eth0 proxy
$
Compatibility with old versions of iproute2 is not broken: the kernel
checks the incoming structure size and properly handles the case where
an old structure is received.
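A minimal sketch (not the exact kernel code; helpers from the netlink API) of
the compatibility check described above: the proxy dump is only honoured when
the payload carries a full ndmsg and NTF_PROXY is set.
	struct ndmsg *ndm = nlmsg_data(nlh);
	bool dump_proxy = false;

	/* old iproute2 sends a shorter message; fall back to a normal dump */
	if (nlmsg_len(nlh) >= sizeof(struct ndmsg) &&
	    (ndm->ndm_flags & NTF_PROXY))
		dump_proxy = true;	/* walk the proxy (pneigh) entries instead */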
[v2]
* changed comments style.
* removed useless line with continue and curly bracket.
* changed incoming message size check from equal to more or
equal.
CC: davem@davemloft.net
CC: kuznet@ms2.inr.ac.ru
CC: netdev@vger.kernel.org
CC: xemul@parallels.com
Signed-off-by: Tony Zelenoff <antonz@parallels.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Setting link parameters on a netdevice changes the value
of if_nlmsg_size(), therefore it is necessary to recalculate
min_ifinfo_dump_size.
Signed-off-by: Stefan Gula <steweg@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a new net namespace is created, we should attach to it a "struct
net_generic" with enough slots (even empty), or we can hit the following
BUG_ON() :
[ 200.752016] kernel BUG at include/net/netns/generic.h:40!
...
[ 200.752016] [<ffffffff825c3cea>] ? get_cfcnfg+0x3a/0x180
[ 200.752016] [<ffffffff821cf0b0>] ? lockdep_rtnl_is_held+0x10/0x20
[ 200.752016] [<ffffffff825c41be>] caif_device_notify+0x2e/0x530
[ 200.752016] [<ffffffff810d61b7>] notifier_call_chain+0x67/0x110
[ 200.752016] [<ffffffff810d67c1>] raw_notifier_call_chain+0x11/0x20
[ 200.752016] [<ffffffff821bae82>] call_netdevice_notifiers+0x32/0x60
[ 200.752016] [<ffffffff821c2b26>] register_netdevice+0x196/0x300
[ 200.752016] [<ffffffff821c2ca9>] register_netdev+0x19/0x30
[ 200.752016] [<ffffffff81c1c67a>] loopback_net_init+0x4a/0xa0
[ 200.752016] [<ffffffff821b5e62>] ops_init+0x42/0x180
[ 200.752016] [<ffffffff821b600b>] setup_net+0x6b/0x100
[ 200.752016] [<ffffffff821b6466>] copy_net_ns+0x86/0x110
[ 200.752016] [<ffffffff810d5789>] create_new_namespaces+0xd9/0x190
net_alloc_generic() should take into account the maximum index into the
ptr array, as a subsystem might use net_generic() anytime.
This also reduces the number of reallocations in net_assign_generic().
Reported-by: Sasha Levin <levinsasha928@gmail.com>
Tested-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sjur Brændeland <sjur.brandeland@stericsson.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The file net/core/flow_dissector.c seems to be missing an include of
linux/export.h.
Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow ETHTOOL_GSSET_INFO ethtool ioctl() for unprivileged users.
ETHTOOL_GSTRINGS is already allowed, but is unusable without this one.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Acked-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a case in __sk_mem_schedule(), where an allocation
is beyond the maximum, but yet we are allowed to proceed.
It happens under the following condition:
sk->sk_wmem_queued + size >= sk->sk_sndbuf
The network code won't revert the allocation in this case,
meaning that at some point later it'll try to do it. Since
this is never communicated to the underlying res_counter
code, there is an imbalance in the res_counter uncharge operation.
I see two ways of fixing this:
1) storing the information about those allocations somewhere
in memcg, and then deducting from that first, before
we start draining the res_counter,
2) providing a slightly different allocation function for
the res_counter, that matches the original behavior of
the network code more closely.
I decided to go for #2 here, believing it to be more elegant,
since #1 would require us to do basically that, but in a more
obscure way.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
CC: Tejun Heo <tj@kernel.org>
CC: Li Zefan <lizf@cn.fujitsu.com>
CC: Laurent Chavey <chavey@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Every call to num_args() immediately checks the return value for
less than zero, as it will return -EFAULT for a failed get_user()
call. So it makes no sense for the function to be declared as an
unsigned long.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (47 commits)
tg3: Fix single-vector MSI-X code
openvswitch: Fix multipart datapath dumps.
ipv6: fix per device IP snmp counters
inetpeer: initialize ->redirect_genid in inet_getpeer()
net: fix NULL-deref in WARN() in skb_gso_segment()
net: WARN if skb_checksum_help() is called on skb requiring segmentation
caif: Remove bad WARN_ON in caif_dev
caif: Fix typo in Vendor/Product-ID for CAIF modems
bnx2x: Disable AN KR work-around for BCM57810
bnx2x: Remove AutoGrEEEn for BCM84833
bnx2x: Remove 100Mb force speed for BCM84833
bnx2x: Fix PFC setting on BCM57840
bnx2x: Fix Super-Isolate mode for BCM84833
net: fix some sparse errors
net: kill duplicate included header
net: sh-eth: Fix build error by the value which is not defined
net: Use device model to get driver name in skb_gso_segment()
bridge: BH already disabled in br_fdb_cleanup()
net: move sock_update_memcg outside of CONFIG_INET
mwl8k: Fixing Sparse ENDIAN CHECK warning
...
skb_checksum_help() has never done anything useful with skbs that
require segmentation. Setting skb->ip_summed = CHECKSUM_NONE makes
them invalid and provokes a later WARNing in skb_gso_segment().
Passing such an skb to skb_checksum_help() indicates a bug, so we
should warn about it immediately. Move the warning from
skb_gso_segment() into a shared function, and add gso_type and
gso_size to it.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
make C=2 CF="-D__CHECK_ENDIAN__" M=net
And fix flowi4_init_output() prototype for sport
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ethtool operations generally require the caller to hold RTNL and are
not safe to call in atomic context. The device model provides this
information for most devices; we'll only lose it for some old ISA
drivers.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no store() method for inflight attribute in the
tx-<n>/byte_queue_limits sysfs directory.
So remove S_IWUSR bit.
Signed-off-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'for-linus' of git://selinuxproject.org/~jmorris/linux-security:
capabilities: remove __cap_full_set definition
security: remove the security_netlink_recv hook as it is equivalent to capable()
ptrace: do not audit capability check when outputing /proc/pid/stat
capabilities: remove task_ns_* functions
capabitlies: ns_capable can use the cap helpers rather than lsm call
capabilities: style only - move capable below ns_capable
capabilites: introduce new has_ns_capabilities_noaudit
capabilities: call has_ns_capability from has_capability
capabilities: remove all _real_ interfaces
capabilities: introduce security_capable_noaudit
capabilities: reverse arguments to security_capable
capabilities: remove the task from capable LSM hook entirely
selinux: sparse fix: fix several warnings in the security server cod
selinux: sparse fix: fix warnings in netlink code
selinux: sparse fix: eliminate warnings for selinuxfs
selinux: sparse fix: declare selinux_disable() in security.h
selinux: sparse fix: move selinux_complete_init
selinux: sparse fix: make selinux_secmark_refcount static
SELinux: Fix RCU deref check warning in sel_netport_insert()
Manually fix up a semantic mis-merge wrt security_netlink_recv():
- the interface was removed in commit fd77846152 ("security: remove
the security_netlink_recv hook as it is equivalent to capable()")
- a new user of it appeared in commit a38f7907b9 ("crypto: Add
userspace configuration API")
causing no automatic merge conflict, but Eric Paris pointed out the
issue.
commit a9b3cd7f32 (rcu: convert uses of rcu_assign_pointer(x, NULL) to
RCU_INIT_POINTER) did a lot of incorrect changes, since it did a
complete conversion of rcu_assign_pointer(x, y) to RCU_INIT_POINTER(x,
y).
We miss needed barriers, even on x86, when y is not NULL.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Stephen Hemminger <shemminger@vyatta.com>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
> net/core/sock.c: In function 'sk_update_clone':
> net/core/sock.c:1278:3: error: implicit declaration of function 'sock_update_memcg'
Reported-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
dev_uc_sync() and dev_mc_sync() are acquiring netif_addr_lock for
destination device of synchronization. Since netif_addr_lock is
already held at the time for source device, this triggers lockdep
deadlock warning.
There's no way this deadlock can happen so use spin_lock_nested() to
silence the warning.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
so move it there. Fixes build errors when CONFIG_INET is not defined:
In file included from include/linux/tcp.h:211:0,
from include/linux/ipv6.h:221,
from include/net/ipv6.h:16,
from include/linux/sunrpc/clnt.h:26,
from include/linux/nfs_fs.h:50,
from init/do_mounts.c:20:
include/net/sock.h: In function 'sk_update_clone':
include/net/sock.h:1109:3: error: implicit declaration of function 'sock_update_memcg' [-Werror=implicit-function-declaration]
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
In 882716604e "pktgen: fix multiple queue warning" we added special
logic to handle the case where ntxq is zero. It's not clear to me that
ntxq can actually be zero. But if it were, we would set
->queue_map_min and ->queue_map_max to USHRT_MAX when we probably want
to set them to zero.
Signed-off-by: David S. Miller <davem@davemloft.net>
Sockets can also be created through sock_clone. Because it copies
all data in the sock structure, it also copies the memcg-related pointer,
and all should be fine. However, since we now use reference counts in
socket creation, we are left with some sockets that have no reference
counts. It matters when we destroy them, since it leads to a mismatch.
Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: David S. Miller <davem@davemloft.net>
CC: Greg Thelen <gthelen@google.com>
CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: Laurent Chavey <chavey@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Once upon a time netlink was not synchronous and we had to get the effective
capabilities from the skb that was being received. Today we instead get
the capabilities from the current task. This has rendered the entire
purpose of the hook moot as it is now functionally equivalent to the
capable() call.
Signed-off-by: Eric Paris <eparis@redhat.com>
All implementations have been converted to implement set_rxnfc
instead.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Define special location values for RX NFC that request the driver to
select the actual rule location. This allows for implementation on
devices that use hash-based filter lookup, whereas currently the API is
more suited to devices with TCAM lookup or linear search.
In ethtool_set_rxnfc() and the compat wrapper ethtool_ioctl(), copy
the structure back to user-space after insertion so that the actual
location is returned.
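A hedged userspace sketch of the new behaviour; fd is assumed to be an open
AF_INET socket, and "eth0" and ring 1 are placeholders:
	struct ethtool_rxnfc nfc = {
		.cmd = ETHTOOL_SRXCLSRLINS,
		.fs = {
			.flow_type	= TCP_V4_FLOW,
			.ring_cookie	= 1,			/* steer matches to RX ring 1 */
			.location	= RX_CLS_LOC_ANY,	/* let the driver pick the slot */
		},
	};
	struct ifreq ifr = { 0 };

	strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
	ifr.ifr_data = (void *)&nfc;
	if (ioctl(fd, SIOCETHTOOL, &ifr) == 0)
		/* the structure is copied back, so this is the chosen slot */
		printf("rule inserted at location %u\n", nfc.fs.location);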
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a routine that dumps memory-related values of a socket.
It's made as an array to make it possible to add more stuff
here later without breaking compatibility.
Since v1: The SK_MEMINFO_ constants are in userspace
visible part of sock_diag.h, the rest is under __KERNEL__.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to perform a proper universal hash on a vector of integers,
we have to use different universal hashes on each vector element.
Which means we need 4 different hash randoms for ipv6.
Signed-off-by: David S. Miller <davem@davemloft.net>
Aim of this patch is to provide full range of rps_flow_cnt on 64bit arches.
The theoretical limit on the number of flows is 2^32.
Fix some buggy RPS/RFS macros as well.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Tom Herbert <therbert@google.com>
CC: Xi Wang <xi.wang@gmail.com>
CC: Laurent Chavey <chavey@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
net/bluetooth/l2cap_core.c
Just two overlapping changes, one added an initialization of
a local variable, and another change added a new local variable.
Signed-off-by: David S. Miller <davem@davemloft.net>
skb->truesize might be big even for a small packet.
It's even bigger after commit 87fb4b7b53 (net: more accurate skb
truesize) and with a big MTU.
We should allow queueing at least one packet per receiver, even with a
low RCVBUF setting.
Reported-by: Michal Simek <monstr@monstr.eu>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Setting a large rps_flow_cnt like (1 << 30) on 32-bit platform will
cause a kernel oops due to insufficient bounds checking.
	if (count > 1<<30) {
		/* Enforce a limit to prevent overflow */
		return -EINVAL;
	}
	count = roundup_pow_of_two(count);
	table = vmalloc(RPS_DEV_FLOW_TABLE_SIZE(count));
Note that the macro RPS_DEV_FLOW_TABLE_SIZE(count) is defined as:
... + (count * sizeof(struct rps_dev_flow))
where sizeof(struct rps_dev_flow) is 8. (1 << 30) * 8 will overflow
32 bits.
This patch replaces the magic number (1 << 30) with a symbolic bound.
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
flow_cache_flush() might sleep but can be called from
atomic context via the xfrm garbage collector. So add
a flow_cache_flush_deferred() function and use this if
the xfrm garbage collector is invoked from within the
packet path.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Acked-by: Timo Teräs <timo.teras@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 5c3ddec73d.
S390 qeth driver actually still uses the setup ops.
Reported-by: Frank Blaschka <blaschka@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use IS_ENABLED(CONFIG_FOO)
instead of defined(CONFIG_FOO) || defined (CONFIG_FOO_MODULE)
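For illustration, with a hypothetical symbol CONFIG_FOO the conversion looks
like this (IS_ENABLED() is true for both built-in and modular configs):
	#if defined(CONFIG_FOO) || defined(CONFIG_FOO_MODULE)	/* before */
	#if IS_ENABLED(CONFIG_FOO)				/* after  */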
Signed-off-by: Igor Maravić <igorm@etf.rs>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can't scan the proto_list to initialize sock cgroups, as it
holds a rwlock, and we also want to keep the code generic enough to
avoid calling the initialization functions of protocols directly.
Convert proto_list_lock into a mutex, so we can sleep and do the
necessary allocations. This lock is seldom taken, so there shouldn't
be any performance penalty associated with this change.
Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: David S. Miller <davem@davemloft.net>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Stephen Rothwell <sfr@canb.auug.org.au>
CC: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
All drivers that support modification of the RX flow hash indirection
table initialise it in the same way: RX rings are assigned to table
entries in rotation. Make that default policy explicit by having them
call an ethtool_rxfh_indir_default() function.
In the ethtool core, add support for a zero size value for
ETHTOOL_SRXFHINDIR, which resets the table to this default.
Partly-suggested-by: Matt Carlson <mcarlson@broadcom.com>
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Acked-by: Shreyas N Bhatewara <sbhatewara@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new ethtool operation (get_rxfh_indir_size) to get the
indirection table size. Use this to validate the user buffer size
before calling get_rxfh_indir or set_rxfh_indir. Use get_rxnfc to get
the number of RX rings, and validate the contents of the new
indirection table before calling set_rxfh_indir. Remove this
validation from drivers.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Acked-by: Dimitris Michailidis <dm@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The sk address is used as a cookie between dump/get_exact calls.
It will be required for unix socket dumping, so move it from
inet_diag to sock_diag.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I've made a mistake when fixing the sock_/inet_diag aliases :(
1. The sock_diag layer should request the family-based alias,
not just the IPPROTO_IP one;
2. The inet_diag layer should request the AF_INET+protocol alias,
not just the protocol one.
Thus fix this.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before adding a struct rtnl_link_ops into the link_ops list, check it doesn't
clash with a prior one.
Based on a previous patch from Alexander Smirnov
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Alexander Smirnov <alex.bluesman.smirnov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's simpler to just keep these things out until there is a real user
of them, so we can see what the needs actually are, rather than keep
these things around as useless overhead.
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces memory pressure controls for the tcp
protocol. It uses the generic socket memory pressure code
introduced in earlier patches, and fills in the
necessary data in cg_proto struct.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The goal of this work is to move the memory pressure tcp
controls to a cgroup, instead of just relying on global
conditions.
To avoid excessive overhead in the network fast paths,
the code that accounts allocated memory to a cgroup is
hidden inside a static_branch(). This branch is patched out
until the first non-root cgroup is created. So when nobody
is using cgroups, even if it is mounted, no significant performance
penalty should be seen.
This patch handles the generic part of the code, and has nothing
tcp-specific.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtsu.com>
CC: Kirill A. Shutemov <kirill@shutemov.name>
CC: David S. Miller <davem@davemloft.net>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch replaces all uses of struct sock fields' memory_pressure,
memory_allocated, sockets_allocated, and sysctl_mem with accessor
macros. Those macros can either receive a socket argument, or a mem_cgroup
argument, depending on the context they live in.
Since we're only doing a macro wrapping here, no performance impact at all is
expected in the case where we don't have cgroups disabled.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: David S. Miller <davem@davemloft.net>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of testing defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 865d9f9f74.
This commit breaks the build with CONFIG_NETPRIO_CGROUP=y so
revert it. It does build as a module though. The SUBSYS macro
in the cgroup core code automatically defines a subsys structure
as extern. Long term we should fix the macro. And I need to
fully build test things.
Tested with CONFIG_NETPRIO_CGROUP={y|m|n} with and without
CONFIG_CGROUPS defined.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
CC: Neil Horman <nhorman@tuxdriver.com>
Reported-By: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These tests are off by one because sock_diag_handlers[] only has AF_MAX
elements.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net_prio_subsys can be made static; this removes the sparse
warning it was throwing.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On a CONFIG_NET=y build
net/core/secure_seq.c:22: warning: 'seq_scale' defined but not
used
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch moves the sock_ code from inet_diag.c to generic sock_diag.c
file and provides necessary request_module-s calls and a pointer on
inet_diag_compat dumping routine.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit c5ed63d66f24 (tcp: fix three tcp sysctls tuning),
sysctl_max_syn_backlog is determined by tcp_hashinfo->ehash_mask;
the minimal value is 128 and it increases in proportion to the
memory of the machine.
The original descriptions of tcp_max_syn_backlog and sysctl_max_syn_backlog
are out of date.
Changelog:
V2: update description for sysctl_max_syn_backlog
Signed-off-by: Weiping Pan <panweiping3@gmail.com>
Reviewed-by: Shan Wei <shanwei88@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
netdev_queue_release() should be called even if CONFIG_XPS=n
to properly release device reference.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To reflect the fact that a reference is not obtained to the
resulting neighbour entry.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Roland Dreier <roland@purestorage.com>
We discovered that the TCP stack could retransmit misaligned skbs if a
malicious peer acknowledged a sub-MSS frame. This currently can happen
only if the output interface is not SG-enabled: if SG is enabled, tcp
builds headless skbs (all payload is included in fragments), so the tcp
trimming process only removes parts of skb fragments and headers stay
aligned.
Some arches can't handle misalignments, so force a head reallocation and
shrink headroom to MAX_TCP_HEADER.
Don't care about misalignments on x86 and PPC (or other arches setting
NET_IP_ALIGN to 0).
This patch introduces __pskb_copy() which can specify the headroom of the
new head, and pskb_copy() becomes a wrapper on top of __pskb_copy().
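A minimal sketch (assumed, not quoted from the patch) of the relationship
between the two helpers: pskb_copy() keeps its old behaviour by delegating to
__pskb_copy() with the skb's current headroom.
	static inline struct sk_buff *pskb_copy(struct sk_buff *skb, gfp_t gfp_mask)
	{
		return __pskb_copy(skb, skb_headroom(skb), gfp_mask);
	}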
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit b00055aacd ([NET] core: add RFC2863 operstate) changed
net_device flags from unsigned short to unsigned int.
Some core functions still assume it's an unsigned short.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Within nested statements, the break statement terminates only the
do, for, switch, or while statement that immediately encloses it,
so replace the break with a goto.
Signed-off-by: RongQing.Li <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
netdev->neigh_priv_len records the private area length.
This will trigger for neigh_table objects which set tbl->entry_size
to zero, and the first instances of this will be forthcoming.
Signed-off-by: David S. Miller <davem@davemloft.net>
We are going to allocate device-specific private areas for
neighbour entries, and in order to do that we have to move
away from the fixed allocation size enforced by using
neigh_table->kmem_cachep.
As a nice side effect we can now use kfree_rcu().
Signed-off-by: David S. Miller <davem@davemloft.net>
Change function rcu_dereference to rcu_dereference_bh to avoid warning
[ INFO: suspicious RCU usage. ]
-------------------------------
net/core/dev.c:2459 suspicious rcu_dereference_check() usage!
because we are locking with
rcu_read_lock_bh();
in function dev_queue_xmit(struct sk_buff *skb)
Signed-off-by: Igor Maravic <igorm@etf.rs>
Signed-off-by: David S. Miller <davem@davemloft.net>
Le lundi 28 novembre 2011 à 19:06 -0500, David Miller a écrit :
> From: Dimitris Michailidis <dm@chelsio.com>
> Date: Mon, 28 Nov 2011 08:25:39 -0800
>
> >> +bool skb_flow_dissect(const struct sk_buff *skb, struct flow_keys
> >> *flow)
> >> +{
> >> + int poff, nhoff = skb_network_offset(skb);
> >> + u8 ip_proto;
> >> + u16 proto = skb->protocol;
> >
> > __be16 instead of u16 for proto?
>
> I'll take care of this when I apply these patches.
( CC trimmed )
Thanks David !
Here is a small patch to use one 64bit load/store on x86_64 instead of
two 32bit load/stores.
[PATCH net-next] flow_dissector: use a 64bit load/store
gcc compiler is smart enough to use a single load/store if we
memcpy(dptr, sptr, 8) on x86_64, regardless of
CONFIG_CC_OPTIMIZE_FOR_SIZE
In the IP header, daddr immediately follows saddr; this won't change in the
future. We only need to make sure our flow_keys (src, dst) fields won't
break the rule.
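A hedged illustration of the trick: since daddr directly follows saddr in
struct iphdr, and dst follows src in struct flow_keys, a single memcpy copies
both addresses and the compiler emits one 64-bit load/store on x86_64.
	const struct iphdr *iph = ip_hdr(skb);

	memcpy(&flow->src, &iph->saddr, sizeof(iph->saddr) + sizeof(iph->daddr));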
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Networking stack support for byte queue limits, uses dynamic queue
limits library. Byte queue limits are maintained per transmit queue,
and a dql structure has been added to netdev_queue structure for this
purpose.
Configuration of bql is in the tx-<n> sysfs directory for the queue
under the byte_queue_limits directory. Configuration includes:
limit_min, bql minimum limit
limit_max, bql maximum limit
hold_time, bql slack hold time
Also under the directory are:
limit, current byte limit
inflight, current number of bytes on the queue
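A hedged driver-side sketch (not taken from any real driver) of how a TX path
feeds the dql behind these limits, using the netdev_tx_* helpers; txq, pkts
and bytes are placeholders.
	static void example_tx_post(struct netdev_queue *txq, struct sk_buff *skb)
	{
		/* after the skb has been posted to the hardware ring */
		netdev_tx_sent_queue(txq, skb->len);
	}

	static void example_tx_clean(struct netdev_queue *txq,
				     unsigned int pkts, unsigned int bytes)
	{
		/* after reclaiming completed descriptors */
		netdev_tx_completed_queue(txq, pkts, bytes);
	}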
Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch moves the xps specific parts in netdev_queue_release into
its own function which netdev_queue_release can call. This allows
netdev_queue_release to be more generic (for adding new attributes
to tx queues).
Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create separate queue state flags so that either the stack or drivers
can turn on XOFF. Added a set of functions used in the stack to determine
if a queue is really stopped (either by stack or driver)
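A hedged sketch of the stack-side check, assuming the netif_xmit_stopped()
helper that tests both the driver and stack XOFF bits of the queue state:
	static bool example_queue_stopped(struct net_device *dev, unsigned int idx)
	{
		/* true if either the driver or the stack set an XOFF bit */
		return netif_xmit_stopped(netdev_get_tx_queue(dev, idx));
	}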
Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can test/set multiple bits from sk_flags at once, to shorten a bit
socket setup/dismantle phase.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Igor Maravic reported an error caused by jump_label_dec() being called
from IRQ context :
BUG: sleeping function called from invalid context at kernel/mutex.c:271
in_atomic(): 1, irqs_disabled(): 0, pid: 0, name: swapper
1 lock held by swapper/0:
#0: (&n->timer){+.-...}, at: [<ffffffff8107ce90>] call_timer_fn+0x0/0x340
Pid: 0, comm: swapper Not tainted 3.2.0-rc2-net-next-mpls+ #1
Call Trace:
<IRQ> [<ffffffff8104f417>] __might_sleep+0x137/0x1f0
[<ffffffff816b9a2f>] mutex_lock_nested+0x2f/0x370
[<ffffffff810a89fd>] ? trace_hardirqs_off+0xd/0x10
[<ffffffff8109a37f>] ? local_clock+0x6f/0x80
[<ffffffff810a90a5>] ? lock_release_holdtime.part.22+0x15/0x1a0
[<ffffffff81557929>] ? sock_def_write_space+0x59/0x160
[<ffffffff815e936e>] ? arp_error_report+0x3e/0x90
[<ffffffff810969cd>] atomic_dec_and_mutex_lock+0x5d/0x80
[<ffffffff8112fc1d>] jump_label_dec+0x1d/0x50
[<ffffffff81566525>] net_disable_timestamp+0x15/0x20
[<ffffffff81557a75>] sock_disable_timestamp+0x45/0x50
[<ffffffff81557b00>] __sk_free+0x80/0x200
[<ffffffff815578d0>] ? sk_send_sigurg+0x70/0x70
[<ffffffff815e936e>] ? arp_error_report+0x3e/0x90
[<ffffffff81557cba>] sock_wfree+0x3a/0x70
[<ffffffff8155c2b0>] skb_release_head_state+0x70/0x120
[<ffffffff8155c0b6>] __kfree_skb+0x16/0x30
[<ffffffff8155c119>] kfree_skb+0x49/0x170
[<ffffffff815e936e>] arp_error_report+0x3e/0x90
[<ffffffff81575bd9>] neigh_invalidate+0x89/0xc0
[<ffffffff81578dbe>] neigh_timer_handler+0x9e/0x2a0
[<ffffffff81578d20>] ? neigh_update+0x640/0x640
[<ffffffff81073558>] __do_softirq+0xc8/0x3a0
Since jump_label_{inc|dec} must be called from process context only,
we must defer jump_label_dec() if net_disable_timestamp() is called
from interrupt context.
Reported-by: Igor Maravic <igorm@etf.rs>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
No functional changes.
This uses the code we factorized in skb_flow_dissect()
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We use at least two flow dissectors in network stack, with known
limitations and code duplication.
Introduce skb_flow_dissect() to factorize this, highly inspired from
existing dissector from __skb_get_rxhash()
Note : We extensively use skb_header_pointer(), this permits us to not
touch skb at all.
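A hedged caller sketch; the flow_keys field names (src, dst, ports) are
assumed from the dissector of that era, and hashrnd is a placeholder seed.
	static u32 example_rxhash(const struct sk_buff *skb, u32 hashrnd)
	{
		struct flow_keys keys;

		if (!skb_flow_dissect(skb, &keys))
			return 0;	/* headers could not be parsed */

		/* hash the addresses and ports extracted by the dissector */
		return jhash_3words((__force u32)keys.dst,
				    (__force u32)keys.src,
				    (__force u32)keys.ports, hashrnd);
	}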
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I just hit this during my testing. Isn't there another bug lurking?
BUG kmalloc-8: Redzone overwritten
INFO: 0xc0000000de9dec48-0xc0000000de9dec4b. First byte 0x0 instead of 0xcc
INFO: Allocated in .__seq_open_private+0x30/0xa0 age=0 cpu=5 pid=3896
.__kmalloc+0x1e0/0x2d0
.__seq_open_private+0x30/0xa0
.seq_open_net+0x60/0xe0
.dev_mc_seq_open+0x4c/0x70
.proc_reg_open+0xd8/0x260
.__dentry_open.clone.11+0x2b8/0x400
.do_last+0xf4/0x950
.path_openat+0xf8/0x480
.do_filp_open+0x48/0xc0
.do_sys_open+0x140/0x250
syscall_exit+0x0/0x40
dev_mc_seq_ops uses dev_seq_start/next/stop but only allocates
sizeof(struct seq_net_private) of private data, whereas it expects
sizeof(struct dev_iter_state):
struct dev_iter_state {
	struct seq_net_private p;
	unsigned int pos;	/* bucket << BUCKET_SPACE + offset */
};
Create dev_seq_open_ops and use it so we don't have to expose
struct dev_iter_state.
[ Problem added by commit f04565ddf5 (dev: use name hash for
dev_seq_ops) -Eric ]
Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Skip entries from foreign network namespaces.
Signed-off-by: Jorge Boncompte [DTI2] <jorge@dti2.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
rcu_assign_pointer(ptr, NULL) can be safely replaced by
RCU_INIT_POINTER(ptr, NULL)
(old rcu_assign_pointer() macro was testing the NULL value and could
omit the smp_wmb(), but this had to be removed because of compiler
warnings)
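A one-line illustration of the only conversion that remains valid, i.e. the
NULL case, next to a non-NULL publication which must keep the barrier:
	RCU_INIT_POINTER(ptr, NULL);	/* ok: no barrier needed for NULL */
	rcu_assign_pointer(ptr, obj);	/* non-NULL: smp_wmb() still required */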
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
C assignment can handle struct in6_addr copying.
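A tiny illustration of the point, using two local addresses a and b:
	struct in6_addr a, b = IN6ADDR_LOOPBACK_INIT;

	a = b;		/* plain struct assignment instead of ipv6_addr_copy(&a, &b) */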
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When skb_shift() runs, we want to shift paged data from skb to the tgt
frag area. The original comment reversed the shift order.
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds in the infrastructure code to create the network priority
cgroup. The cgroup, in addition to the standard processes file creates two
control files:
1) prioidx - This is a read-only file that exports the index of this cgroup.
This is a value that is both arbitrary and unique to a cgroup in this subsystem,
and is used to index the per-device priority map
2) priomap - This is a writeable file. On read it reports a table of 2-tuples
<name:priority> where name is the name of a network interface and priority
indicates the priority assigned to frames egressing on the named interface and
originating from a pid in this cgroup
This cgroup allows for skb priority to be set prior to a root qdisc getting
selected. This is beneficial for DCB-enabled systems, in that it allows any
application to use DCB-configured priorities without application modification
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
CC: Robert Love <robert.w.love@intel.com>
CC: "David S. Miller" <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
net: Remove all uses of LL_ALLOCATED_SPACE
The macro LL_ALLOCATED_SPACE was ill-conceived. It applies the
alignment to the sum of needed_headroom and needed_tailroom. As
the amount that is then reserved for head room is needed_headroom
with alignment, this means that the tail room left may be too small.
This patch replaces all uses of LL_ALLOCATED_SPACE with the macro
LL_RESERVED_SPACE and direct reference to needed_tailroom.
This also fixes the problem with needed_headroom changing between
allocating the skb and reserving the head room.
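A hedged sketch of the replacement pattern for a caller building an skb to
transmit on dev (len is the payload size, error handling trimmed):
	skb = alloc_skb(LL_RESERVED_SPACE(dev) + len + dev->needed_tailroom,
			GFP_ATOMIC);
	if (!skb)
		return -ENOMEM;
	skb_reserve(skb, LL_RESERVED_SPACE(dev));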
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Most machines dont use RPS/RFS, and pay a fair amount of instructions in
netif_receive_skb() / netif_rx() / get_rps_cpu() just to discover
RPS/RFS is not setup.
Add a jump_label named rps_needed.
If no device rps_map or global rps_sock_flow_table is setup,
netif_receive_skb() / netif_rx() do a single instruction instead of many
ones, including conditional jumps.
jmp +0 (if CONFIG_JUMP_LABEL=y)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds the /sys/class/net/DEV/queues/Q/tx_timeout attribute
containing the total number of timeout events on the given queue. It
is always available with CONFIG_SYSFS, independently of
CONFIG_RPS/XPS.
Credits to Stephen Hemminger for a preliminary version of this patch.
Tested:
without CONFIG_SYSFS (compilation only)
with sysfs and without CONFIG_RPS & CONFIG_XPS
with sysfs and without CONFIG_RPS
with sysfs and without CONFIG_XPS
with defaults
Signed-off-by: David Decotigny <david.decotigny@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit fixes following warning:
net/core/net-sysfs.c:921:6: warning: symbol 'numa_node' shadows an earlier one
include/linux/topology.h:222:1: originally declared here
Signed-off-by: David Decotigny <david.decotigny@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add missing spaces around multiplication operator.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
The only distinct use is checking whether NETIF_F_NOCACHE_COPY should be
enabled by default. The check heuristic is altered a bit here,
so it affects different setups than before. The default shouldn't be
trusted for performance-critical cases anyway.
For all other uses NETIF_F_NO_CSUM is equivalent to NETIF_F_HW_CSUM.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
v2: changed loop in ethtool_set_features() per Ben's suggestion
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
v2: add couple missing conversions in drivers
split unexporting netdev_fix_features()
implemented %pNF
convert sock::sk_route_(no?)caps
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the only place left where dev->features are directly
exposed to userspace.
I know checkpatch.pl complains about __ethtool_{get,set}_flags(), but
the code is easier to read this way.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Acked-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As all drivers are converted, we may now remove discrete offload setting
callback handling.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Acked-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
netstamp_needed seems a good candidate to jump_label conversion.
This avoids 3 conditional branches per incoming packet in fast path.
No measurable difference, given that these conditional branches are
predicted on modern cpus. Only a small icache reduction, thanks to the
unlikely() stuff.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
One of the thing we discussed during netdev 2011 conference was the idea
to change some network drivers to allocate/populate their skb at RX
completion time, right before feeding the skb to network stack.
In old days, we allocated skbs when populating the RX ring.
This means bringing into cpu cache sk_buff and skb_shared_info cache
lines (since we clear/initialize them), then 'queue' skb->data to NIC.
By the time the NIC fills a frame in the skb->data buffer and the host can
process it, the cpu has probably evicted those cache lines, because a lot
of things happened between the allocation and the final use.
So the deal would be to allocate only the data buffer for the NIC to
populate its RX ring buffer. And use build_skb() at RX completion to
attach a data buffer (now filled with an ethernet frame) to a new skb,
initialize the skb_shared_info portion, and give the hot skb to network
stack.
build_skb() is the function to allocate an skb, caller providing the
data buffer that should be attached to it. Drivers are expected to call
skb_reserve() right after build_skb() to adjust skb->data to the
Ethernet frame (usually skipping NET_SKB_PAD and NET_IP_ALIGN, but some
drivers might add a hardware provided alignment)
Data provided to build_skb() MUST have been allocated by a prior
kmalloc() call, with enough room to add SKB_DATA_ALIGN(sizeof(struct
skb_shared_info)) bytes at the end of the data without corrupting
incoming frame.
	data = kmalloc(NET_SKB_PAD + NET_IP_ALIGN + 1536 +
		       SKB_DATA_ALIGN(sizeof(struct skb_shared_info)),
		       GFP_ATOMIC);
	...
	skb = build_skb(data);
	if (!skb) {
		recycle_data(data);
	} else {
		skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN);
		...
	}
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Eilon Greenstein <eilong@broadcom.com>
CC: Ben Hutchings <bhutchings@solarflare.com>
CC: Tom Herbert <therbert@google.com>
CC: Jamal Hadi Salim <hadi@mojatatu.com>
CC: Stephen Hemminger <shemminger@vyatta.com>
CC: Thomas Graf <tgraf@infradead.org>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Le mercredi 09 novembre 2011 à 16:21 -0500, David Miller a écrit :
> From: David Miller <davem@davemloft.net>
> Date: Wed, 09 Nov 2011 16:16:44 -0500 (EST)
>
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Wed, 09 Nov 2011 12:14:09 +0100
> >
> >> unres_qlen is the number of frames we are able to queue per unresolved
> >> neighbour. Its default value (3) was never changed and is responsible
> >> for strange drops, especially if IP fragments are used, or multiple
> >> sessions start in parallel. Even a single tcp flow can hit this limit.
> > ...
> >
> > Ok, I've applied this, let's see what happens :-)
>
> Early answer, build fails.
>
> Please test build this patch with DECNET enabled and resubmit. The
> decnet neigh layer still refers to the removed ->queue_len member.
>
> Thanks.
Ouch, this was fixed on one machine yesterday, but not the other one I
used this morning, sorry.
[PATCH V5 net-next] neigh: new unresolved queue limits
unres_qlen is the number of frames we are able to queue per unresolved
neighbour. Its default value (3) was never changed and is responsible
for strange drops, especially if IP fragments are used, or multiple
sessions start in parallel. Even a single tcp flow can hit this limit.
$ arp -d 192.168.20.108 ; ping -c 2 -s 8000 192.168.20.108
PING 192.168.20.108 (192.168.20.108) 8000(8028) bytes of data.
8008 bytes from 192.168.20.108: icmp_seq=2 ttl=64 time=0.322 ms
Signed-off-by: David S. Miller <davem@davemloft.net>
The 802.1X EAPOL handshake hostapd does requires
knowing whether the frame was ack'ed by the peer.
Currently, we fudge this pretty badly by not even
transmitting the frame as a normal data frame but
injecting it with radiotap and getting the status
out of radiotap monitor as well. This is rather
complex, confuses users (mon.wlan0 presence) and
doesn't work with all hardware.
To get rid of that hack, introduce a real wifi TX
status option for data frame transmissions.
This works similar to the existing TX timestamping
in that it reflects the SKB back to the socket's
error queue with a SCM_WIFI_STATUS cmsg that has
an int indicating ACK status (0/1).
Since it is possible that at some point we will
want to have TX timestamping and wifi status in a
single errqueue SKB (there's little point in not
doing that), redefine SO_EE_ORIGIN_TIMESTAMPING
to SO_EE_ORIGIN_TXSTATUS which can collect more
than just the timestamp; keep the old constant
as an alias of course. Currently the internal APIs
don't make that possible, but it wouldn't be hard
to split them up in a way that makes it possible.
Thanks to Neil Horman for helping me figure out
the functions that add the control messages.
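A hedged userspace sketch of consuming the new status; sock is assumed to be
a socket with SO_WIFI_STATUS enabled, and buffer sizes are arbitrary:
	char buf[256], ctrl[256];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = ctrl,
		.msg_controllen = sizeof(ctrl),
	};
	struct cmsghdr *cmsg;

	if (recvmsg(sock, &msg, MSG_ERRQUEUE) < 0)
		return;
	for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
		if (cmsg->cmsg_level == SOL_SOCKET &&
		    cmsg->cmsg_type == SCM_WIFI_STATUS)
			printf("frame %sacked\n",
			       *(int *)CMSG_DATA(cmsg) ? "" : "not ");
	}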
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Make clear that sk_clone() and inet_csk_clone() return a locked socket.
Add a _lock() suffix and kerneldoc.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
Revert "tracing: Include module.h in define_trace.h"
irq: don't put module.h into irq.h for tracking irqgen modules.
bluetooth: macroize two small inlines to avoid module.h
ip_vs.h: fix implicit use of module_get/module_put from module.h
nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
include: replace linux/module.h with "struct module" wherever possible
include: convert various register fcns to macros to avoid include chaining
crypto.h: remove unused crypto_tfm_alg_modname() inline
uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
pm_runtime.h: explicitly requires notifier.h
linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
miscdevice.h: fix up implicit use of lists and types
stop_machine.h: fix implicit use of smp.h for smp_processor_id
of: fix implicit use of errno.h in include/linux/of.h
of_platform.h: delete needless include <linux/module.h>
acpi: remove module.h include from platform/aclinux.h
miscdevice.h: delete unnecessary inclusion of module.h
device_cgroup.h: delete needless include <linux/module.h>
net: sch_generic remove redundant use of <linux/module.h>
net: inet_timewait_sock doesnt need <linux/module.h>
...
Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
- drivers/media/dvb/frontends/dibx000_common.c
- drivers/media/video/{mt9m111.c,ov6650.c}
- drivers/mfd/ab3550-core.c
- include/linux/dmaengine.h
Whatever situations make this state legitimate when SMP
would also be legitimate when !SMP and, for example, preemption is
enabled.
This is dubious enough that we should just delete it entirely. If we
want to add debugging for neigh timer races, better more thorough
mechanisms are needed.
Signed-off-by: David S. Miller <davem@davemloft.net>
These files are non modular, but need to export symbols using
the macros now living in export.h -- call out the include so
that things won't break when we remove the implicit presence
of module.h from everywhere.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
With calls to modular infrastructure, these files really
needs the full module.h header. Call it out so some of the
cleanups of implicit and unrequired includes elsewhere can be
cleaned up.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
commit 2425717b27 (net: allow vlan traffic to be received under bond)
broke ARP processing on vlan on top of bonding.
+-------+
eth0 --| bond0 |---bond0.103
eth1 --| |
+-------+
52870.115435: skb_gro_reset_offset <-napi_gro_receive
52870.115435: dev_gro_receive <-napi_gro_receive
52870.115435: napi_skb_finish <-napi_gro_receive
52870.115435: netif_receive_skb <-napi_skb_finish
52870.115435: get_rps_cpu <-netif_receive_skb
52870.115435: __netif_receive_skb <-netif_receive_skb
52870.115436: vlan_do_receive <-__netif_receive_skb
52870.115436: bond_handle_frame <-__netif_receive_skb
52870.115436: vlan_do_receive <-__netif_receive_skb
52870.115436: arp_rcv <-__netif_receive_skb
52870.115436: kfree_skb <-arp_rcv
Packet is dropped in arp_rcv() because its pkt_type was set to
PACKET_OTHERHOST in the first vlan_do_receive() call, since no eth0.103
exists.
We really need to change pkt_type only if no more rx_handler is about to
be called for the packet.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Reviewed-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1745 commits)
dp83640: free packet queues on remove
dp83640: use proper function to free transmit time stamping packets
ipv6: Do not use routes from locally generated RAs
|PATCH net-next] tg3: add tx_dropped counter
be2net: don't create multiple RX/TX rings in multi channel mode
be2net: don't create multiple TXQs in BE2
be2net: refactor VF setup/teardown code into be_vf_setup/clear()
be2net: add vlan/rx-mode/flow-control config to be_setup()
net_sched: cls_flow: use skb_header_pointer()
ipv4: avoid useless call of the function check_peer_pmtu
TCP: remove TCP_DEBUG
net: Fix driver name for mdio-gpio.c
ipv4: tcp: fix TOS value in ACK messages sent from TIME_WAIT
rtnetlink: Add missing manual netlink notification in dev_change_net_namespaces
ipv4: fix ipsec forward performance regression
jme: fix irq storm after suspend/resume
route: fix ICMP redirect validation
net: hold sock reference while processing tx timestamps
tcp: md5: add more const attributes
Add ethtool -g support to virtio_net
...
Fix up conflicts in:
- drivers/net/Kconfig:
The split-up generated a trivial conflict with removal of a
stale reference to Documentation/networking/net-modules.txt.
Remove it from the new location instead.
- fs/sysfs/dir.c:
Fairly nasty conflicts with the sysfs rb-tree usage, conflicting
with Eric Biederman's changes for tagged directories.
* 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (38 commits)
mm: memory hotplug: Check if pages are correctly reserved on a per-section basis
Revert "memory hotplug: Correct page reservation checking"
Update email address for stable patch submission
dynamic_debug: fix undefined reference to `__netdev_printk'
dynamic_debug: use a single printk() to emit messages
dynamic_debug: remove num_enabled accounting
dynamic_debug: consolidate repetitive struct _ddebug descriptor definitions
uio: Support physical addresses >32 bits on 32-bit systems
sysfs: add unsigned long cast to prevent compile warning
drivers: base: print rejected matches with DEBUG_DRIVER
memory hotplug: Correct page reservation checking
memory hotplug: Refuse to add unaligned memory regions
remove the messy code file Documentation/zh_CN/SubmitChecklist
ARM: mxc: convert device creation to use platform_device_register_full
new helper to create platform devices with dma mask
docs/driver-model: Update device class docs
docs/driver-model: Document device.groups
kobj_uevent: Ignore if some listeners cannot handle message
dynamic_debug: make netif_dbg() call __netdev_printk()
dynamic_debug: make netdev_dbg() call __netdev_printk()
...
Renato Westphal noticed that since commit a2835763e1
"rtnetlink: handle rtnl_link netlink notifications manually" was merged
we no longer send a netlink message when a networking device is moved
from one network namespace to another.
Fix this by adding the missing manual notification in dev_change_net_namespaces.
Since all network devices that are processed by dev_change_net_namespace are
in the initialized state, the complicated tests that guard the manual
rtmsg_ifinfo calls in rollback_registered and register_netdevice are
unnecessary and we can just perform a plain notification.
Cc: stable@kernel.org
Tested-by: Renato Westphal <renatowestphal@gmail.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The pair of functions,
* skb_clone_tx_timestamp()
* skb_complete_tx_timestamp()
were designed to allow timestamping in PHY devices. The first
function, called during the MAC driver's hard_xmit method, identifies
PTP protocol packets, clones them, and gives them to the PHY device
driver. The PHY driver may hold onto the packet and deliver it at a
later time using the second function, which adds the packet to the
socket's error queue.
As pointed out by Johannes, nothing prevents the socket from
disappearing while the cloned packet is sitting in the PHY driver
awaiting a timestamp. This patch fixes the issue by taking a reference
on the socket for each such packet. In addition, the comments
regarding the usage of these function are expanded to highlight the
rule that PHY drivers must use skb_complete_tx_timestamp() to release
the packet, in order to release the socket reference, too.
These functions first appeared in v2.6.36.
Reported-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Richard Cochran <richard.cochran@omicron.at>
Cc: <stable@vger.kernel.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adding const qualifiers to pointers can ease code review and spot some
bugs. It might allow the compiler to optimize code further.
For example, is it legal to temporarily write a null cksum into the tcphdr
in tcp_md5_hash_header()? I am afraid a sniffer could catch the
temporary null value...
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of using the dev->next chain and trying to resync at each call to
dev_seq_start, use the name hash, keeping the bucket and the offset in
seq->private field.
Tests revealed the following results for ifconfig > /dev/null
* 1000 interfaces:
  * 0.114s without patch
  * 0.089s with patch
* 3000 interfaces:
  * 0.489s without patch
  * 0.110s with patch
* 5000 interfaces:
  * 1.363s without patch
  * 0.250s with patch
* 128000 interfaces (other setup):
  * ~100s without patch
  * ~30s with patch
Signed-off-by: Mihai Maruseac <mmaruseac@ixiacom.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I've split this bit out of the skb frag destructor patch since it helps enforce
the use of the fragment API.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Turull reported inaccuracies in pktgen when using low packet
rates, because we call ndelay(val) with values bigger than 20000.
Instead of calling ndelay() for delays < 100us, we can instead loop
calling ktime_now() only.
Reported-by: Daniel Turull <daniel.turull@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I audited all of the callers in the tree and only one of them (pktgen) expects
it to do so. Taking this reference is pretty obviously confusing and error
prone.
In particular I looked at the following commits which switched callers of
(__)skb_frag_set_page to the skb paged fragment api:
6a930b9f16 cxgb3: convert to SKB paged frag API.
5dc3e196ea myri10ge: convert to SKB paged frag API.
0e0634d20d vmxnet3: convert to SKB paged frag API.
86ee8130a4 virtionet: convert to SKB paged frag API.
4a22c4c919 sfc: convert to SKB paged frag API.
18324d690d cassini: convert to SKB paged frag API.
b061b39e3a benet: convert to SKB paged frag API.
b7b6a688d2 bnx2: convert to SKB paged frag API.
804cf14ea5 net: xfrm: convert to SKB frag APIs
ea2ab69379 net: convert core to skb paged frag APIs
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is just a cleanup.
My testing version of Smatch warns about this:
net/core/filter.c +380 check_load_and_stores(6)
warn: check 'flen' for negative values
flen comes from the user. We try to clamp the values here between 1
and BPF_MAXINSNS but the clamp doesn't work because it could be
negative. This is a bug, but it's not exploitable.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
we should decrease ops->unresolved_rules when deleting an unresolved rule.
Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a sanity check on the values provided by user space for
the hardware time stamping configuration. If the values lie outside of
the absolute limits, then the ioctl request will be denied.
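A hedged sketch of the kind of range check described, using the fields and
constants of struct hwtstamp_config from <linux/net_tstamp.h> (the exact
upper bounds used here are assumptions):
	if (cfg.flags)		/* no flags are defined right now */
		return -EINVAL;
	if (cfg.tx_type < HWTSTAMP_TX_OFF ||
	    cfg.tx_type > HWTSTAMP_TX_ONESTEP_SYNC)
		return -ERANGE;
	if (cfg.rx_filter < HWTSTAMP_FILTER_NONE ||
	    cfg.rx_filter > HWTSTAMP_FILTER_PTP_V2_DELAY_REQ)
		return -ERANGE;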
Signed-off-by: Richard Cochran <richard.cochran@omicron.at>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch moves the rcu_barrier from rollback_registered_many
(inside the rtnl_lock) into netdev_run_todo (just outside the rtnl_lock).
This allows us to gain the full benefit of synchronize_net calling
synchronize_rcu_expedited when the rtnl_lock is held.
The rcu_barrier in rollback_registered_many was originally a synchronize_net
but was promoted to be a rcu_barrier() when it was found that people were
unnecessarily hitting the 250ms wait in netdev_wait_allrefs(). Changing
the rcu_barrier back to a synchronize_net is therefore safe.
Since we only care about waiting for the rcu callbacks before we get
to netdev_wait_allrefs() it is also safe to move the wait into
netdev_run_todo.
This was tested by creating and destroying 1000 tap devices and observing
/proc/lock_stat. /proc/lock_stat reports this change reduces the hold
times of the rtnl_lock by a factor of 10. There was no observable
difference in the amount of time it takes to destroy a network device.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_recycle_check resets the skb if it's eligible for recycling.
However, there are times when a driver might want to optionally
manipulate the skb data with the skb before resetting the skb,
but after it has determined eligibility. We do this by splitting the
eligibility check from the skb reset, creating two inline functions to
accomplish that task.
Signed-off-by: Andy Fleming <afleming@freescale.com>
Acked-by: David Daney <david.daney@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To ease skb->truesize sanitization, it's better to be able to localize
all references to skb frags size.
Define accessors : skb_frag_size() to fetch frag size, and
skb_frag_size_{set|add|sub}() to manipulate it.
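A small illustration of the new accessors on the first fragment of an skb:
	skb_frag_t *frag = &skb_shinfo(skb)->frags[0];
	unsigned int len = skb_frag_size(frag);	/* read the fragment length */

	skb_frag_size_set(frag, len);	/* store an absolute size */
	skb_frag_size_add(frag, 16);	/* grow the fragment */
	skb_frag_size_sub(frag, 16);	/* and shrink it back */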
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The following configuration used to work as I expected. At least
we could use the fcoe interfaces to do MPIO and the bond0 iface
to do load balancing or failover.
---eth2.228-fcoe
|
eth2 -----|
|
|---- bond0
|
eth3 -----|
|
---eth3.228-fcoe
This worked because of a change we added to allow inactive slaves
to rx 'exact' matches. This functionality was kept intact with the
rx_handler mechanism. However now the vlan interface attached to the
active slave never receives traffic because the bonding rx_handler
updates the skb->dev and goto's another_round. Previously, the
vlan_do_receive() logic was called before the bonding rx_handler.
Now by the time vlan_do_receive calls vlan_find_dev() the
skb->dev is set to bond0 and it is clear no vlan is attached
to this iface. The vlan lookup fails.
This patch moves the VLAN check above the rx_handler. A VLAN
tagged frame is now routed to the eth2.228-fcoe iface in the
above schematic. Untagged frames continue to the bond0 as
normal. This case also remains intact,
eth2 --> bond0 --> vlan.228
Here the skb is VLAN tagged but the vlan lookup fails on eth2
causing the bonding rx_handler to be called. On the second
pass the vlan lookup is on the bond0 iface and completes as
expected.
Putting a VLAN.228 on both the bond0 and eth2 device will
result in eth2.228 receiving the skb. I don't think this is
completely unexpected, and it was the behavior prior to the rx_handler
change.
Note, the same setup is also used for other storage traffic that
MPIO is used with, e.g. iSCSI, and similar setups can be contrived
without storage protocols.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Reviewed-by: Jiri Pirko <jpirko@redhat.com>
Tested-by: Hans Schillstrom <hams.schillstrom@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While preparing the net flow caches, a failure may cause a potential
memory leak; fix it.
Signed-off-by: Huajun Li <huajun.li.lee@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add configuration setting for drivers to turn spoof checking on or off
for discrete VFs.
v2 - Fix indentation problem, wrap the ifla_vf_info structure in
#ifdef __KERNEL__ to prevent user space from accessing and
change function parameter for the spoof check setting netdev
op from u8 to bool.
v3 - Preset spoof check setting to -1 so that user space tools such
as ip can detect that the driver didn't report a spoofcheck
setting. Prevents incorrect display of spoof check settings
for drivers that don't report it.
Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
skb truesize currently accounts for sk_buff struct and part of skb head.
kmalloc() roundings are also ignored.
Considering that skb_shared_info is larger than sk_buff, it's time to
take it into account for better memory accounting.
This patch introduces the SKB_TRUESIZE(X) macro to centralize various
assumptions into a single place.
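Roughly, the macro charges the aligned sk_buff and skb_shared_info
overhead on top of the data size X (sketch):

#define SKB_TRUESIZE(X) ((X) +						\
			 SKB_DATA_ALIGN(sizeof(struct sk_buff)) +	\
			 SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))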
At skb alloc phase, we put skb_shared_info struct at the exact end of
skb head, to allow a better use of memory (lowering number of
reallocations), since kmalloc() gives us power-of-two memory blocks.
Unless SLUB/SLUB debug is active, both skb->head and skb_shared_info are
aligned to cache lines, as before.
Note: This patch might trigger performance regressions because of
misconfigured protocol stacks, hitting per socket or global memory
limits that were previously not reached. But it's a necessary step for a
more accurate memory accounting.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Andi Kleen <ak@linux.intel.com>
CC: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There's no point in open-coding sock_valbool_flag().
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Amir Vadai wrote:
> When a stream is paused, and its rule is expired while it is paused,
> no new rule will be configured to the HW when traffic resume.
[...]
> - When stream was resumed, traffic was steered again by RSS, and
> because current-cpu was equal to desired-cpu, ndo_rx_flow_steer
> wasn't called and no rule was configured to the HW.
Fix this by setting the flow's current CPU only in the table for the
newly selected RX queue.
Reported-and-tested-by: Amir Vadai <amirv@dev.mellanox.co.il>
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The upper protocol numbers of PPPOE are different, and should be treated
specially.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit 7361c36c52 (af_unix: Allow credentials to work across
user and pid namespaces) af_unix performance dropped a lot.
This is because we now take a reference on pid and cred in each write(),
and release them in read(), usually done from another process,
eventually from another cpu. This triggers false sharing.
# Events: 154K cycles
#
# Overhead Command Shared Object Symbol
# ........ ....... .................. .........................
#
10.40% hackbench [kernel.kallsyms] [k] put_pid
8.60% hackbench [kernel.kallsyms] [k] unix_stream_recvmsg
7.87% hackbench [kernel.kallsyms] [k] unix_stream_sendmsg
6.11% hackbench [kernel.kallsyms] [k] do_raw_spin_lock
4.95% hackbench [kernel.kallsyms] [k] unix_scm_to_skb
4.87% hackbench [kernel.kallsyms] [k] pid_nr_ns
4.34% hackbench [kernel.kallsyms] [k] cred_to_ucred
2.39% hackbench [kernel.kallsyms] [k] unix_destruct_scm
2.24% hackbench [kernel.kallsyms] [k] sub_preempt_count
1.75% hackbench [kernel.kallsyms] [k] fget_light
1.51% hackbench [kernel.kallsyms] [k] __mutex_lock_interruptible_slowpath
1.42% hackbench [kernel.kallsyms] [k] sock_alloc_send_pskb
This patch includes SCM_CREDENTIALS information in an af_unix
message/skb only if requested by the sender (see man 7 unix for details
on how to include ancillary data using the sendmsg() system call).
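For reference, a minimal userspace sketch of explicitly attaching
SCM_CREDENTIALS as ancillary data (plain sendmsg()/cmsg usage, not part
of this patch; needs <sys/socket.h>, <unistd.h> and _GNU_SOURCE on glibc):

	struct iovec iov = { .iov_base = "x", .iov_len = 1 };
	char cbuf[CMSG_SPACE(sizeof(struct ucred))];
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = cbuf, .msg_controllen = sizeof(cbuf),
	};
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
	struct ucred *uc;

	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_CREDENTIALS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(struct ucred));
	uc = (struct ucred *)CMSG_DATA(cmsg);
	uc->pid = getpid();
	uc->uid = getuid();
	uc->gid = getgid();
	sendmsg(fd, &msg, 0);	/* fd is a connected AF_UNIX socket */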
Note: This might break buggy applications that expected SCM_CREDENTIALS
from an unaware write() system call with a receiver not using the
SO_PASSCRED socket option.
If SOCK_PASSCRED is set on source or destination socket, we still
include credentials for mere write() syscalls.
Performance boost in hackbench : more than 50% gain on a 16 thread
machine (2 quad-core cpus, 2 threads per core)
hackbench 20 thread 2000
4.228 sec instead of 9.102 sec
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adding a new fib rule can cause a BUG_ON to trigger.
The shell commands to reproduce it are:
ip rule add pref 38
ip rule add pref 38
ip rule add to 192.168.3.0/24 goto 38
ip rule del pref 38
ip rule add to 192.168.3.0/24 goto 38
ip rule add pref 38
then the BUG_ON will trigger.
Delete the BUG_ON and use (ctarget == NULL) to identify whether this rule is unresolved.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With the conversion of struct flowi to a union of AF-specific structs, some
operations on the flow cache need to account for the exact size of the key.
Signed-off-by: David Ward <david.ward@ll.mit.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch does several things:
- introduces __ethtool_get_settings which is called from ethtool code and
from drivers as well. Put ASSERT_RTNL there.
- dev_ethtool_get_settings() is replaced by __ethtool_get_settings()
- changes calling in drivers so rtnl locking is respected. In
iboe_get_rate, ->get_settings() was previously called unlocked; this
fixes it. Also prb_calc_retire_blk_tmo() in af_packet.c had the same
problem. Also fixed by calling __dev_get_by_index() instead of
dev_get_by_index() and holding rtnl_lock for both calls.
- introduces rtnl_lock in bnx2fc_vport_create() and fcoe_vport_create()
so bnx2fc_if_create() and fcoe_if_create() are called locked as they
are from other places.
- use __ethtool_get_settings() in bonding code
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
v2->v3:
-removed dev_ethtool_get_settings()
-added ASSERT_RTNL into __ethtool_get_settings()
-prb_calc_retire_blk_tmo - use __dev_get_by_index() and lock
around it and __ethtool_get_settings() call
v1->v2:
add missing export_symbol
Reviewed-by: Ben Hutchings <bhutchings@solarflare.com> [except FCoE bits]
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a time lag in IFF_RUNNING flag consistency between vlan and
real devices when the real device has a problem such as a broken link
or cable.
This leads to a degradation of availability, such as delayed failover
in HA systems using vlan, since detection of the problem at the real
device is delayed.
We can avoid the linkwatch delay (~1 sec) for devices linked to other
ones, since the delay is already applied for the realdev.
Based on a previous patch from Mitsuo Hayasaka
Reported-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Patrick McHardy <kaber@trash.net>
Cc: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
Cc: Tom Herbert <therbert@google.com>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Jesse Gross <jesse@nicira.com>
Tested-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dev_forward_skb loops an skb back into host networking
stack which might hang on the memory indefinitely.
In particular, this can happen in macvtap in bridged mode.
Copy the userspace fragments to avoid blocking the
sender in that case.
As this patch makes skb_copy_ubufs extern now,
I also added some documentation and made skb_copy_ubufs
clear the SKBTX_DEV_ZEROCOPY flag automatically instead
of doing it in all callers. This can be made into a separate
patch if people feel it's worth it.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
flow_cache_lookup will return a cached object (or null pointer) that the
resolver (i.e. xfrm_policy_lookup) previously found for another namespace
using the same key/family/dir. Instead, make the namespace part of what
identifies entries in the cache.
As before, flow_entry_valid will return 0 for entries where the namespace
has been deleted, and they will be removed from the cache the next time
flow_cache_gc_task is run.
Reported-by: Andrew Dickinson <whydna@whydna.net>
Signed-off-by: David Ward <david.ward@ll.mit.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
__netpoll_rx() doesn't properly handle skbs with a small header.
pskb_may_pull() or pskb_trim_rcsum() can change skb->data, so we must
reload it.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
Skip IPIP header to get proper layer-4 information.
Like GRE tunnels, this only works if rxhash is not already provided by
the device itself (ethtool -K ethX rxhash off), to allow the kernel to
compute a software rxhash.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previously, if dynamic debug was enabled netdev_dbg() was using
dynamic_dev_dbg() to print out the underlying msg. Fix this by making
sure netdev_dbg() uses __netdev_printk().
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Jason Baron <jbaron@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Now, when a vlan tag in the non-accelerated path is stripped from the
skb, headers are reset right away. Benefit from that and avoid calling
__netif_receive_skb recursively; just use another_round.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Inspect the payload of PPPOE session messages for the 4 tuples to generate
skb->rxhash.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For the 802.1Q packets, if the NIC doesn't support hw-accel-vlan-rx, RPS
won't inspect the internal 4 tuples to generate skb->rxhash, so this kind
of traffic can't get any benefit from RPS.
This patch adds the support for 802.1Q to RPS.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use IFF_UNICAST_FLT to find out if a driver handles unicast address
filtering. In case it does not, promisc mode is entered.
Patch also fixes following drivers:
stmmac, niu: support uc filtering and yet it propagated
ndo_set_multicast_list
bna, benet, pxa168_eth, ks8851, ks8851_mll, ksz884x: have set
ndo_set_rx_mode but do not support uc filtering
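For the generic rx_mode path described above, the fallback looks roughly
like this (sketch; assumes the flag lives in dev->priv_flags and that
dev->uc_promisc tracks the forced promisc state):

	/* __dev_set_rx_mode(): no unicast filtering in hw -> use promisc */
	if (!(dev->priv_flags & IFF_UNICAST_FLT)) {
		if (!netdev_uc_empty(dev) && !dev->uc_promisc) {
			__dev_set_promiscuity(dev, 1);
			dev->uc_promisc = true;
		} else if (netdev_uc_empty(dev) && dev->uc_promisc) {
			__dev_set_promiscuity(dev, -1);
			dev->uc_promisc = false;
		}
	}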
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Crack open GRE packets in __skb_get_rxhash to compute the 4-tuple hash
on the encapsulated packet. Note that this is used only when the
__skb_get_rxhash path is taken, in particular only when the device does
not provide the rxhash (i.e. the feature is disabled).
This was tested by creating a single GRE tunnel between two 16 core
AMD machines. 200 netperf TCP_RR streams were run with 1 byte
request and response size.
Without patch: 157497 tps, 50/90/99% latencies 1250/1292/1364 usecs
With patch: 325896 tps, 50/90/99% latencies 603/848/1169
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Basics for looking for ports in encapsulated packets in tunnels.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The l4_rxhash flag was added to the skb structure to indicate
that the rxhash value was computed over the 4 tuple for the
packet which includes the port information in the encapsulated
transport packet. This is used by the stack to preserve the
rxhash value in __skb_rx_tunnel.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use some variables for clarity and extensibility.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The RCU API has been completed, and rcu_access_pointer() or
rcu_dereference_protected() are better than the generic
rcu_dereference_raw().
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the artificial HZ latency on arp resolution.
Instead of firing a timer in one jiffy (up to 10 ms if HZ=100), let's
send the ARP message immediately.
Before patch :
# arp -d 192.168.20.108 ; ping -c 3 192.168.20.108
PING 192.168.20.108 (192.168.20.108) 56(84) bytes of data.
64 bytes from 192.168.20.108: icmp_seq=1 ttl=64 time=9.91 ms
64 bytes from 192.168.20.108: icmp_seq=2 ttl=64 time=0.065 ms
64 bytes from 192.168.20.108: icmp_seq=3 ttl=64 time=0.061 ms
After patch :
$ arp -d 192.168.20.108 ; ping -c 3 192.168.20.108
PING 192.168.20.108 (192.168.20.108) 56(84) bytes of data.
64 bytes from 192.168.20.108: icmp_seq=1 ttl=64 time=0.152 ms
64 bytes from 192.168.20.108: icmp_seq=2 ttl=64 time=0.064 ms
64 bytes from 192.168.20.108: icmp_seq=3 ttl=64 time=0.074 ms
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dev->real_num_tx_queues is correctly set already in alloc_netdev_mqs.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch corrects an erroneous update of the credential's gid with
the uid, introduced in commit 257b5358b3 since 2.6.36.
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Reviewed-by: James Morris <jmorris@namei.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Followup of commit f2c31e32b3 (fix NULL dereferences in
check_peer_redir()).
We need to make sure the dst neighbour doesn't change in dst_ifdown().
Fix some sparse errors.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Computers have become a lot faster since we compromised on the
partial MD4 hash which we use currently for performance reasons.
MD5 is a much safer choice, and is in line with both RFC1948 and
other ISS generators (OpenBSD, Solaris, etc.)
Furthermore, only having 24-bits of the sequence number be truly
unpredictable is a very serious limitation. So the periodic
regeneration and 8-bit counter have been removed. We compute and
use a full 32-bit sequence number.
For ipv6, DCCP was found to use a 32-bit truncated initial sequence
number (it needs 43-bits) and that is fixed here as well.
Reported-by: Dan Kaminsky <dan@doxpara.com>
Tested-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
When assigning a NULL value to an RCU protected pointer, no barrier
is needed. rcu_assign_pointer() used to handle that, but it will soon
change to not handle the special case.
Convert all rcu_assign_pointer of NULL value.
//smpl
@@ expression P; @@
- rcu_assign_pointer(P, NULL)
+ RCU_INIT_POINTER(P, NULL)
// </smpl>
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since skb_copy_bits() is called from assembly, add a fat comment to make
clear we should think twice before changing its prototype.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (32 commits)
tg3: Remove 5719 jumbo frames and TSO blocks
tg3: Break larger frags into 4k chunks for 5719
tg3: Add tx BD budgeting code
tg3: Consolidate code that calls tg3_tx_set_bd()
tg3: Add partial fragment unmapping code
tg3: Generalize tg3_skb_error_unmap()
tg3: Remove short DMA check for 1st fragment
tg3: Simplify tx bd assignments
tg3: Reintroduce tg3_tx_ring_info
ASIX: Use only 11 bits of header for data size
ASIX: Simplify condition in rx_fixup()
Fix cdc-phonet build
bonding: reduce noise during init
bonding: fix string comparison errors
net: Audit drivers to identify those needing IFF_TX_SKB_SHARING cleared
net: add IFF_SKB_TX_SHARED flag to priv_flags
net: sock_sendmsg_nosec() is static
forcedeth: fix vlans
gianfar: fix bug caused by 87c288c6e9
gro: Only reset frag0 when skb can be pulled
...
Pktgen attempts to transmit shared skbs to net devices, which can't be used by
some drivers as they keep state information in skbs. This patch adds a flag
marking drivers as being able to handle shared skbs in their tx path. Drivers
are defaulted to being unable to do so, but calling ether_setup enables this
flag, as 90% of the drivers calling ether_setup touch real hardware and can
handle shared skbs. A subsequent patch will audit drivers to ensure that the
flag is set properly.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: Jiri Pirko <jpirko@redhat.com>
CC: Robert Olsson <robert.olsson@its.uu.se>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Alexey Dobriyan <adobriyan@gmail.com>
CC: David S. Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This allows us to move duplicated code in <asm/atomic.h>
(atomic_inc_not_zero() for now) to <linux/atomic.h>
Signed-off-by: Arun Sharma <asharma@fb.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
No need to use int, its uses are boolean.
May save a few bytes one day.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As reported by Ben Greear and Francois Romieu, the code path in
the netif_carrier code leads it to try and disable
a late workqueue only to reenable it immediately:
netif_carrier_on
-> linkwatch_fire_event
-> linkwatch_schedule_work
-> cancel_delayed_work
-> del_timer_sync
If __cancel_delayed_work is used instead then there is no
problem of waiting for running linkwatch_event.
There is a race between linkwatch_event running and re-scheduling,
but it is harmless to schedule an extra scan of the linkwatch queue.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some drivers (ab)use the ethtool_ops::get_regs operation to expose
only a hardware revision ID. Commit
a77f5db361 ('ethtool: Allocate register
dump buffer with vmalloc()') had the side-effect of breaking these, as
vmalloc() returns a null pointer for size=0 whereas kmalloc() did not.
For backward-compatibility, allow zero-length dumps again.
Reported-by: Kalle Valo <kvalo@qca.qualcomm.com>
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Cc: stable@kernel.org [2.6.37+]
Signed-off-by: David S. Miller <davem@davemloft.net>
There are two problems:
1) "n" was allocated with alloc_skb() so we should free it with
kfree_skb() instead of regular kfree().
2) We return the freed pointer instead of NULL.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This will get us closer to being able to do "neigh stuff"
completely independent of the underlying dst_entry for
protocols (ipv4/ipv6) that wish to do so.
We will also be able to make dst entries neigh-less.
Signed-off-by: David S. Miller <davem@davemloft.net>
It's just taking on one of two possible values, either
neigh_ops->output or dev_queue_xmit(). And this is purely depending
upon whether nud_state has NUD_CONNECTED set or not.
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that hh_cache entries are embedded inside of neighbour
entries, their lifetimes and accesses are now synchronous
to that of the encompassing neighbour object.
Therefore we don't need to hook up the blackhole op to
hh_output on destroy.
Signed-off-by: David S. Miller <davem@davemloft.net>
The same information and more can be obtained by using ethtool
with ETHTOOL_GFEATURES.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is not used anywhere except net/core/dev.c now.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
vlan_features contains features inherited from underlying device.
NETIF_SOFT_FEATURES are not inherited but belong to the vlan device
itself (ensured in vlan_dev_fix_features()).
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that there is a one-to-one correspondence between neighbour
and hh_cache entries, we no longer need:
1) dynamic allocation
2) attachment to dst->hh
3) refcounting
Initialization of the hh_cache entry is indicated by hh_len
being non-zero, and such initialization is always done with
the neighbour's lock held as a writer.
Signed-off-by: David S. Miller <davem@davemloft.net>
This never, ever, happens.
Neighbour entries are always tied to one address family, and therefore
one set of dst_ops, and therefore one dst_ops->protocol "hh_type"
value.
This capability was blindly imported by Alexey Kuznetsov when he wrote
the neighbour layer.
Signed-off-by: David S. Miller <davem@davemloft.net>
And mask the hash function result by simply shifting
down the "->hash_shift" most significant bits.
Currently which bits we use is arbitrary since jhash
produces entropy evenly across the whole hash function
result.
But soon we'll be using universal hashing functions,
and in those cases more entropy exists in the higher
bits than the lower bits, because they use multiplies.
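Illustrative sketch of the resulting indexing (names approximate):

	/* keep the hash_shift most significant bits of the 32-bit hash */
	hash_val = tbl->hash(pkey, dev, nht->hash_rnd) >> (32 - nht->hash_shift);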
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds userspace buffers support in skb shared info. A new
struct skb_ubuf_info is needed to maintain the userspace buffers
argument and index, a callback is used to notify userspace to release
the buffers once lower device has done DMA (Last reference to that skb
has gone).
If any userspace app still references these userspace buffers,
then they will be copied into the kernel. This way we
can prevent userspace apps from holding these userspace buffers too long.
Use destructor_arg to point to the userspace buffer info; a new tx flag
SKBTX_DEV_ZEROCOPY is added for the zero-copy buffer check.
Signed-off-by: Shirley Ma <xma@...ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Just add GSO to vlan_features initialization, and update comments.
When we set offload features, vlan_dev_fix_features() will do more checks.
In vlan_dev_fix_features(), final features is decided by
features of real device and vlan_features of real device.
Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Lauro Ramos Venancio <lauro.venancio@openbossa.org>
Signed-off-by: Aloisio Almeida Jr <aloisio.almeida@openbossa.org>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Too trivial to live.
cc: WANG Cong <amwang@redhat.com>
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Unused symbols waste space.
Commit 0e34e93177
"(netpoll: add generic support for bridge and bonding devices)"
added the symbol more than a year ago with the promise of "future use".
Because it is so far unused, remove it for now.
It can be easily readded if or when it actually needs to be used.
cc: WANG Cong <amwang@redhat.com>
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPV6, unlike IPV4, doesn't have a routing cache.
Routing table entries, as well as clones made in response
to route lookup requests, all live in the same table. And
all of these things are together collected in the destination
cache table for ipv6.
This means that routing table entries count against the garbage
collection limits, even though such entries cannot ever be reclaimed
and are added explicitly by the administrator (rather than being
created in response to lookups).
Therefore it makes no sense to count ipv6 routing table entries
against the GC limits.
Add a DST_NOCOUNT destination cache entry flag, and skip the counting
if it is set. Use this flag bit in ipv6 when adding routing table
entries.
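Conceptually, the allocation path then skips the GC accounting for such
entries (sketch):

	/* dst_alloc(): DST_NOCOUNT entries are invisible to the GC limits */
	if (!(flags & DST_NOCOUNT))
		dst_entries_add(ops, 1);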
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a change sequence counter to each net namespace
which is bumped whenever a netdevice is added or removed from
the list. If such a change occurred while a link dump took place,
the dump will have the NLM_F_DUMP_INTR flag set in the first
message which has been interrupted and in all subsequent messages
of the same dump.
Note that links may still be modified or renamed while a dump is
taking place, but we can guarantee that userspace receives a
complete list of links and does not miss any.
Testing:
I have added 500 VLAN netdevices to make sure the dump is split
over multiple messages. Then while continuously dumping links in
one process I also continuously deleted and re-added a dummy
netdevice in another process. Multiple dumps per second have
had the NLM_F_DUMP_INTR flag set.
I guess we can wait for Johannes patch to hit net-next via the
wireless tree. I just wanted to give this some testing right away.
Signed-off-by: Thomas Graf <tgraf@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are enough instances of this:
iph->frag_off & htons(IP_MF | IP_OFFSET)
that a helper function is probably warranted.
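Such a helper can be a trivial inline (sketch, assuming the name
ip_is_fragment()):

static inline bool ip_is_fragment(const struct iphdr *iph)
{
	/* true for any fragment: MF set or a non-zero fragment offset */
	return (iph->frag_off & htons(IP_MF | IP_OFFSET)) != 0;
}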
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds 2 tracepoints to get a status of a socket receive queue
and related parameter.
One tracepoint is added to sock_queue_rcv_skb. It records rcvbuf size
and its usage. The other tracepoint is added to __sk_mem_schedule and
it records limitations of memory for sockets and current usage.
By using these tracepoints we're able to know the detailed reason why
the kernel drops a packet.
Signed-off-by: Satoru Moriya <satoru.moriya@hds.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a tracepoint to __udp_queue_rcv_skb to get the
return value of ip_queue_rcv_skb. It indicates why the kernel drops
a packet at this point.
ip_queue_rcv_skb returns the following values in the packet drop case:
rcvbuf is full : -ENOMEM
sk_filter returns error : -EINVAL, -EACCES, -ENOMEM, etc.
__sk_mem_schedule returns error: -ENOBUFS
Signed-off-by: Satoru Moriya <satoru.moriya@hds.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ethernet MAC drivers based on phylib (but not using NAPI) can
enable hardware time stamping in phy devices by calling netif_rx()
conditionally based on a call to skb_defer_rx_timestamp().
This commit exports that function so that drivers calling it may
be compiled as modules.
Signed-off-by: Richard Cochran <richard.cochran@omicron.at>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
AFS: Use i_generation not i_version for the vnode uniquifier
AFS: Set s_id in the superblock to the volume name
vfs: Fix data corruption after failed write in __block_write_begin()
afs: afs_fill_page reads too much, or wrong data
VFS: Fix vfsmount overput on simultaneous automount
fix wrong iput on d_inode introduced by e6bc45d65d
Delay struct net freeing while there's a sysfs instance refering to it
afs: fix sget() races, close leak on umount
ubifs: fix sget races
ubifs: split allocation of ubifs_info into a separate function
fix leak in proc_set_super()
The network stack provides the function, skb_clone_tx_timestamp().
Ethernet MAC drivers can call this via the transmit time stamping
hook, skb_tx_timestamp(). This commit exports the clone function so
that drivers using it can be compiled as modules.
Signed-off-by: Richard Cochran <richard.cochran@omicron.at>
Signed-off-by: David S. Miller <davem@conan.davemloft.net>
* new refcount in struct net, controlling actual freeing of the memory
* new method in kobj_ns_type_operations (->drop_ns())
* ->current_ns() semantics change - it's supposed to be followed by
corresponding ->drop_ns(). For struct net in case of CONFIG_NET_NS it bumps
the new refcount; net_drop_ns() decrements it and calls net_free() if the
last reference has been dropped. Method renamed to ->grab_current_ns().
* old net_free() callers call net_drop_ns() instead.
* sysfs_exit_ns() is gone, along with a large part of callchain
leading to it; now that the references stored in ->ns[...] stay valid we
do not need to hunt them down and replace them with NULL. That fixes
problems in sysfs_lookup() and sysfs_readdir(), along with getting rid
of sb->s_instances abuse.
Note that struct net *shutdown* logic has not changed - net_cleanup()
is called exactly when it used to be called. The only thing postponed by
having a sysfs instance referring to that struct net is actual freeing of
memory occupied by struct net.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There is a dev_put(ndev) missing on an error path. This was
introduced in 0c1ad04aec "netpoll: prevent netpoll setup on slave
devices".
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Testing of VLAN_FLAG_REORDER_HDR does not belong in vlan_untag
but rather in vlan_do_receive. Otherwise the vlan header
will not be properly put on the packet in the case of
vlan header acceleration.
As we remove the check from vlan_check_reorder_header
rename it vlan_reorder_header to keep the naming clean.
Fix up the skb->pkt_type early so we don't look at the packet
after adding the vlan tag, which guarantees we don't goof
and look at the wrong field.
Use a simple if statement instead of a complicated switch
statement to decide that we need to increment rx_stats
for a multicast packet.
Hopefully at some point we will just declare the case where
VLAN_FLAG_REORDER_HDR is cleared as unsupported and remove
the code. Until then this keeps it working correctly.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Acked-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The message size allocated for rtnl ifinfo dumps was limited to
a single page. This is not enough for additional interface info
available with devices that support SR-IOV and caused a bug in
which VF info would not be displayed if more than approximately
40 VFs were created per interface.
Implement a new function pointer for the rtnl_register service that will
calculate the amount of data required for the ifinfo dump and allocate
enough data to satisfy the request.
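Assuming the new callback is passed as an extra rtnl_register() argument,
the GETLINK registration might end up looking like this (illustrative
only; the calcit name and signature are assumptions):

	rtnl_register(PF_UNSPEC, RTM_GETLINK, rtnl_getlink,
		      rtnl_dump_ifinfo, rtnl_calcit);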
Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
In commit 8d8fc29d02
(netpoll: disable netpoll when enslave a device), we automatically
disable netpoll when the underlying device is being enslaved,
we also need to prevent people from setting up netpoll on
devices that are already enslaved.
Signed-off-by: WANG Cong <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change is meant to remove all support for displaying an ntuple as
strings via ETHTOOL_GRXNTUPLE. The reason for this change is due to the
fact that multiple issues have been found including:
- Multiple buffer overruns for strings being displayed.
- Incorrect filters displayed, cleared filters with ring of -2 are displayed
- Setting get_rx_ntuple displays no rules if defined.
- Endianness wrong on displayed values.
- Hard limit of 1024 filters makes display functionality extremely limited
The only driver that had supported this interface was ixgbe. Since it no
longer uses the interface and due to the issues mentioned above I am
submitting this patch to remove it.
v2:
Updated based on comments from Ben Hutchings
- Left ETH_SS_NTUPLE_FILTERS in code but commented on it being deprecated
- Removed ethtool_rx_ntuple_list and ethtool_rx_ntuple_flow_spec_container
- Left ETHTOOL_GRXNTUPLE but commented it as deprecated
Also cleaned up set_rx_ntuple since there is no flow spec container to
maintain we can drop all the code for the alloc and free of it and just
return ops->set_rx_ntuple().
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Frank Blaschka reported :
<quote>
During heavy network load we turn off/on cpus.
Sometimes this causes a stall on the network device.
Digging into the dump I found out following:
napi is scheduled but does not run. From the I/O buffers
and the napi state I see napi/rx_softirq processing has stopped
because the budget was reached. napi stays in the
softnet_data poll_list and the rx_softirq was raised again.
I assume at this time the cpu offline comes in,
the rx softirq is raised/moved to another cpu but napi stays in the
poll_list of the softnet_data of the now offline cpu.
Reviewing dev_cpu_callback (net/core/dev.c) I did not find the
poll_list is transfered to the new cpu.
</quote>
This patch is a straightforward implementation of Frank's suggestion:
Transfer poll_list and trigger NET_RX_SOFTIRQ on the new cpu.
Reported-by: Frank Blaschka <blaschka@linux.vnet.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This interface uses a temporary buffer, but for no real reason.
And now can generate warnings like:
net/sched/sch_generic.c: In function dev_watchdog
net/sched/sch_generic.c:254:10: warning: unused variable drivername
Just return driver->name directly or "".
Reported-by: Connor Hansen <cmdkhh@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
BTW, looking through the code related to struct net lifetime rules has
caught something else:
struct net *get_net_ns_by_fd(int fd)
{
...
file = proc_ns_fget(fd);
if (!file)
goto out;
ei = PROC_I(file->f_dentry->d_inode);
while in proc_ns_fget() we have two return ERR_PTR(...) and not a single
path that would return NULL. The other caller of proc_ns_fget() treats
ERR_PTR() correctly...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (40 commits)
tg3: Fix tg3_skb_error_unmap()
net: tracepoint of net_dev_xmit sees freed skb and causes panic
drivers/net/can/flexcan.c: add missing clk_put
net: dm9000: Get the chip in a known good state before enabling interrupts
drivers/net/davinci_emac.c: add missing clk_put
af-packet: Add flag to distinguish VID 0 from no-vlan.
caif: Fix race when conditionally taking rtnl lock
usbnet/cdc_ncm: add missing .reset_resume hook
vlan: fix typo in vlan_dev_hard_start_xmit()
net/ipv4: Check for mistakenly passed in non-IPv4 address
iwl4965: correctly validate temperature value
bluetooth l2cap: fix locking in l2cap_global_chan_by_psm
ath9k: fix two more bugs in tx power
cfg80211: don't drop p2p probe responses
Revert "net: fix section mismatches"
drivers/net/usb/catc.c: Fix potential deadlock in catc_ctrl_run()
sctp: stop pending timers and purge queues when peer restart asoc
drivers/net: ks8842 Fix crash on received packet when in PIO mode.
ip_options_compile: properly handle unaligned pointer
iwlagn: fix incorrect PCI subsystem id for 6150 devices
...
Because there is a possibility that skb is kfree_skb()ed and zero cleared
after ndo_start_xmit, we should not see the contents of skb like skb->len and
skb->dev->name after ndo_start_xmit. But trace_net_dev_xmit does that
and causes a panic by NULL pointer dereference.
This patch fixes trace_net_dev_xmit not to see the contents of skb directly.
If you want to reproduce this panic,
1. Get tracepoint of net_dev_xmit on
2. Create 2 guests on KVM
3. Make 2 guests use virtio_net
4. Execute netperf from one to another for a long time as a network burden
5. The host will panic (it takes about 30 minutes)
Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
net: Kill ratelimit.h dependency in linux/net.h
net: Add linux/sysctl.h includes where needed.
net: Kill ether_table[] declaration.
inetpeer: fix race in unused_list manipulations
atm: expose ATM device index in sysfs
IPVS: bug in ip_vs_ftp, same list heaad used in all netns.
bug.h: Move ratelimit warn interfaces to ratelimit.h
bonding: cleanup module option descriptions
net:8021q:vlan.c Fix pr_info to just give the vlan fullname and version.
net: davinci_emac: fix dev_err use at probe
can: convert to %pK for kptr_restrict support
net: fix ETHTOOL_SFEATURES compatibility with old ethtool_ops.set_flags
netfilter: Fix several warnings in compat_mtw_from_user().
netfilter: ipset: fix ip_set_flush return code
netfilter: ipset: remove unused variable from type_pf_tdel()
netfilter: ipset: Use proper timeout value to jiffies conversion
Ingo Molnar noticed that we have this unnecessary ratelimit.h
dependency in linux/net.h, which hid compilation problems from
people doing builds only with CONFIG_NET enabled.
Move this stuff out to a separate net/net_ratelimit.h file and
include that in the only two places where this thing is needed.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Ingo Molnar <mingo@elte.hu>
As reported by Ingo Molnar, we still have configuration combinations
where use of the WARN_RATELIMIT interfaces break the build because
dependencies don't get met.
Instead of going down the long road of trying to make it so that
ratelimit.h can get included by kernel.h or asm-generic/bug.h,
just move the interface into ratelimit.h and make users have
to include that.
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Current code squashes flags to bool - this makes set_flags fail whenever
some ETH_FLAG_* equivalent features are set. Fix this.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/linux-2.6-nsfd:
net: fix get_net_ns_by_fd for !CONFIG_NET_NS
ns proc: Return -ENOENT for a nonexistent /proc/self/ns/ entry.
ns: Declare sys_setns in syscalls.h
net: Allow setting the network namespace by fd
ns proc: Add support for the ipc namespace
ns proc: Add support for the uts namespace
ns proc: Add support for the network namespace.
ns: Introduce the setns syscall
ns: proc files for namespace naming policy.
Commit e67f88dd12 (dont hold rtnl mutex during netlink dump callbacks)
missed the fact that rtnl_fill_ifinfo() must be called with rtnl held.
Because of possible deadlocks between two mutexes (cb_mutex and rtnl),
it's not easy to solve this problem, so revert this part of the patch.
It also forgot one rcu_read_unlock() in FIB dump_rules()
Add one ASSERT_RTNL() in rtnl_fill_ifinfo() to remind us the rule.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Patrick McHardy <kaber@trash.net>
CC: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the device passed into dev_disable_lro is a vlan, then repoint the dev
pointer so that we actually modify the underlying physical device.
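A minimal sketch of the repointing, using the standard vlan helpers:

	/* dev_disable_lro(): operate on the real device behind a vlan */
	if (is_vlan_dev(dev))
		dev = vlan_dev_real_dev(dev);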
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: davem@davemloft.net
CC: bhutchings@solarflare.com
Signed-off-by: David S. Miller <davem@davemloft.net>
After merging the final tree, today's linux-next build (powerpc
ppc44x_defconfig) failed like this:
net/built-in.o: In function `get_net_ns_by_fd':
(.text+0x11976): undefined reference to `netns_operations'
net/built-in.o: In function `get_net_ns_by_fd':
(.text+0x1197a): undefined reference to `netns_operations'
netns_operations is only available if CONFIG_NET_NS is set ...
Caused by commit f063052947 ("net: Allow setting the network namespace
by fd").
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
dst_default_metrics is readonly, we don't want to kfree() it later.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
synchronize_rcu() is very slow in various situations (HZ=100,
CONFIG_NO_HZ=y, CONFIG_PREEMPT=n)
Extract from my (mostly idle) 8 core machine :
synchronize_rcu() in 99985 us
synchronize_rcu() in 79982 us
synchronize_rcu() in 87612 us
synchronize_rcu() in 79827 us
synchronize_rcu() in 109860 us
synchronize_rcu() in 98039 us
synchronize_rcu() in 89841 us
synchronize_rcu() in 79842 us
synchronize_rcu() in 80151 us
synchronize_rcu() in 119833 us
synchronize_rcu() in 99858 us
synchronize_rcu() in 73999 us
synchronize_rcu() in 79855 us
synchronize_rcu() in 79853 us
When we hold RTNL mutex, we would like to spend some cpu cycles but not
block too long other processes waiting for this mutex.
We also want to setup/dismantle network features as fast as possible at
boot/shutdown time.
This patch makes synchronize_net() call the expedited version if RTNL is
locked.
synchronize_rcu_expedited() typical delay is about 20 us on my machine.
synchronize_rcu_expedited() in 18 us
synchronize_rcu_expedited() in 18 us
synchronize_rcu_expedited() in 18 us
synchronize_rcu_expedited() in 18 us
synchronize_rcu_expedited() in 20 us
synchronize_rcu_expedited() in 16 us
synchronize_rcu_expedited() in 20 us
synchronize_rcu_expedited() in 18 us
synchronize_rcu_expedited() in 18 us
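The resulting logic is roughly (sketch):

void synchronize_net(void)
{
	might_sleep();
	/* spend cpu cycles instead of blocking other RTNL waiters */
	if (rtnl_is_locked())
		synchronize_rcu_expedited();
	else
		synchronize_rcu();
}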
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Ben Greear <greearb@candelatech.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A mis-configured filter can spam the logs with lots of stack traces.
Rate-limit the warnings and add printout of the bogus filter information.
Original-patch-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This also shrinks the object size a little.
text data bss dec hex filename
28834 186 8 29028 7164 net/core/pktgen.o
28816 186 8 29010 7152 net/core/pktgen.o.AFTER
Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
Cc: "David Miller" <davem@davemloft.net>,
Signed-off-by: David S. Miller <davem@davemloft.net>
In the old days, we used to access dev->master in __netif_receive_skb()
in a rcu_read_lock section.
So one synchronize_net() call was needed in netdev_set_master() to make
sure another cpu could not use old master while/after we release it.
We now use netdev_rx_handler infrastructure and added one
synchronize_net() call in bond_release()/bond_release_all()
Remove the obsolete synchronize_net() from netdev_set_master() and add
one in bridge del_nbp() after its netdev_rx_handler_unregister() call.
This makes enslave -d a bit faster.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Jiri Pirko <jpirko@redhat.com>
CC: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These two events are not expected to be caught by userspace.
Signed-off-by: WANG Cong <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1446 commits)
macvlan: fix panic if lowerdev in a bond
tg3: Add braces around 5906 workaround.
tg3: Fix NETIF_F_LOOPBACK error
macvlan: remove one synchronize_rcu() call
networking: NET_CLS_ROUTE4 depends on INET
irda: Fix error propagation in ircomm_lmp_connect_response()
irda: Kill set but unused variable 'bytes' in irlan_check_command_param()
irda: Kill set but unused variable 'clen' in ircomm_connect_indication()
rxrpc: Fix set but unused variable 'usage' in rxrpc_get_transport()
be2net: Kill set but unused variable 'req' in lancer_fw_download()
irda: Kill set but unused vars 'saddr' and 'daddr' in irlan_provider_connect_indication()
atl1c: atl1c_resume() is only used when CONFIG_PM_SLEEP is defined.
rxrpc: Fix set but unused variable 'usage' in rxrpc_get_peer().
rxrpc: Kill set but unused variable 'local' in rxrpc_UDP_error_handler()
rxrpc: Kill set but unused variable 'sp' in rxrpc_process_connection()
rxrpc: Kill set but unused variable 'sp' in rxrpc_rotate_tx_window()
pkt_sched: Kill set but unused variable 'protocol' in tc_classify()
isdn: capi: Use pr_debug() instead of ifdefs.
tg3: Update version to 3.119
tg3: Apply rx_discards fix to 5719/5720
...
Fix up trivial conflicts in arch/x86/Kconfig and net/mac80211/agg-tx.c
as per Davem.
Commit e66eed651f ("list: remove prefetching from regular list
iterators") removed the include of prefetch.h from list.h, which
uncovered several cases that had apparently relied on that rather
obscure header file dependency.
So this fixes things up a bit, using
grep -L linux/prefetch.h $(git grep -l '[^a-z_]prefetchw*(' -- '*.[ch]')
grep -L 'prefetchw*(' $(git grep -l 'linux/prefetch.h' -- '*.[ch]')
to guide us in finding files that either need <linux/prefetch.h>
inclusion, or have it despite not needing it.
There are more of them around (mostly network drivers), but this gets
many core ones.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When one macvlan device is dismantled, we can avoid one
synchronize_rcu() call done after deletion from hash list, since caller
will perform a synchronize_net() call after its ndo_stop() call.
Add a new netdev->dismantle field to signal this dismantle intent.
Reduces RTNL hold time.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Patrick McHardy <kaber@trash.net>
CC: Ben Greear <greearb@candelatech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's way past its usefulness. And this gets rid of a bunch
of stray ->rt_{dst,src} references.
Even the comment documenting the macro was inaccurate (stated
default was 1 when it's 0).
If reintroduced, it should be done properly, with dynamic debug
facilities.
Signed-off-by: David S. Miller <davem@davemloft.net>
Cool, how about we make 'Features changed' debug as well?
This way userspace can't fill up the log just by tweaking tun features
with an ioctl.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using plain hlist_del() in dev_change_name() is wrong since a
concurrent reader can crash trying to dereference LIST_POISON1.
Bug introduced in commit 72c9528bab (net: Introduce
dev_get_by_name_rcu())
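The fix is essentially to switch to the RCU-aware variant, so the
unlinked node stays safe for concurrent readers (illustrative diff):

- hlist_del(&dev->name_hlist);
+ hlist_del_rcu(&dev->name_hlist);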
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Those reduced to DEBUG can possibly be triggered by unprivileged processes
and are nothing exceptional. Illegal checksum combinations can only be
caused by driver bug, so promote those messages to WARN.
Since GSO without SG will now only cause DEBUG message from
netdev_fix_features(), remove the workaround from register_netdevice().
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
We plan to remove the old cpu_xx() api later. Thus this patch
converts it.
This patch has no functional change.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Added code to take FW dump via ethtool. Dump level can be controlled via setting the
dump flag. A get function is provided to query the current setting of the dump flag.
Dump data is obtained from the driver via a separate get function.
Changes from v3:
Fixed buffer length issue in ethtool_get_dump_data function.
Updated kernel doc for ethtool_dump struct and get_dump_flag function.
Changes from v2:
Provided separate commands for get flag and data.
Check for the minimum of the two buffer lengths obtained via ethtool and the driver and
use that for the dump buffer.
Pass the driver return error codes up to the caller.
Added kernel doc comments.
Signed-off-by: Anirban Chakraborty <anirban.chakraborty@qlogic.com>
Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It will be needed by bonding and other drivers changing vlan_features
after ndo_init callback.
As a bonus, this includes kernel-doc for netdev_update_features().
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
The issue was introduced in commit eed2a12f1e.
Signed-off-by: Franco Fichtner <franco@lastsummer.de>
Acked-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Acked-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 443457242b (factorize sync-rcu call in
unregister_netdevice_many) mistakenly removed one test from dev_close()
Following actions trigger a BUG :
modprobe bonding
modprobe dummy
ifconfig bond0 up
ifenslave bond0 dummy0
rmmod dummy
dev_close() must not close a non IFF_UP device.
With help from Frank Blaschka and Einar EL Lueck
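The restored check in dev_close() is essentially (illustrative sketch):

	if (dev->flags & IFF_UP) {
		LIST_HEAD(single);

		list_add(&dev->unreg_list, &single);
		dev_close_many(&single);
		list_del(&single);
	}
	return 0;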
Reported-by: Frank Blaschka <blaschka@linux.vnet.ibm.com>
Reported-by: Einar EL Lueck <ELELUECK@de.ibm.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Take advantage of the new abstraction and allow network devices
to be placed in any network namespace that we have a fd to talk
about.
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Implementing file descriptors for the network namespace
is simple and straight forward.
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
mac_pton() parses MAC address in form XX:XX:XX:XX:XX:XX and only in that form.
mac_pton() doesn't dirty result until it's sure string representation is valid.
mac_pton() doesn't care about characters _after_ last octet,
it's up to caller to deal with it.
mac_pton() diverges from 0/-E return value convention.
Target usage:
if (!mac_pton(str, whatever->mac))
return -EINVAL;
/* ->mac being u8 [ETH_ALEN] is filled at this point. */
/* optionally check str[3 * ETH_ALEN - 1] for termination */
Use mac_pton() in pktgen and netconsole for start.
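A sketch of an implementation consistent with those rules, using the
kernel's isxdigit()/hex_to_bin() helpers:

bool mac_pton(const char *s, u8 *mac)
{
	int i;

	/* XX:XX:XX:XX:XX:XX */
	if (strlen(s) < 3 * ETH_ALEN - 1)
		return false;

	/* Don't dirty result unless string is valid MAC. */
	for (i = 0; i < ETH_ALEN; i++) {
		if (!isxdigit(s[i * 3]) || !isxdigit(s[i * 3 + 1]))
			return false;
		if (i != ETH_ALEN - 1 && s[i * 3 + 2] != ':')
			return false;
	}
	for (i = 0; i < ETH_ALEN; i++)
		mac[i] = (hex_to_bin(s[i * 3]) << 4) | hex_to_bin(s[i * 3 + 1]);

	return true;
}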
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
veth devices don't use the batched device unregisters yet.
Since veth devices come in pairs, it makes sense to use a batch of two
unregisters; this roughly divides dismantle time by two.
Fix this by changing dellink() callers to always provide a non NULL
head. (Idea from Michał Mirosław)
This patch also handles macvlan case : We now dismantle all macvlans on
top of a lower dev at once.
Reported-by: Alex Bligh <alex@alex.org.uk>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Michał Mirosław <mirqus@gmail.com>
Cc: Jesse Gross <jesse@nicira.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ben Greear <greearb@candelatech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch enables ethtool to set the loopback mode on a given interface.
By configuring the interface in loopback mode in conjunction with a policy
route / rule, a userland application can stress the egress / ingress path
exposing the flows of the change in progress and potentially help developer(s)
understand the impact of those changes without even sending a packet out
on the network.
Following set of commands illustrates one such example -
a) ip -4 addr add 192.168.1.1/24 dev eth1
b) ip -4 rule add from all iif eth1 lookup 250
c) ip -4 route add local 0/0 dev lo proto kernel scope host table 250
d) arp -Ds 192.168.1.100 eth1
e) arp -Ds 192.168.1.200 eth1
f) sysctl -w net.ipv4.ip_nonlocal_bind=1
g) sysctl -w net.ipv4.conf.all.accept_local=1
# Assuming that the machine has 8 cores
h) taskset 000f netserver -L 192.168.1.200
i) taskset 00f0 netperf -t TCP_CRR -L 192.168.1.100 -H 192.168.1.200 -l 30
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Acked-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I don't know why %pI6 doesn't compress, but the format specifier is
kernel-standard, so use it.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After that all the upstream kernel drivers now use set_phys_id,
and the old ethtool_ops interface (phys_id) can be removed.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The rcu callback net_generic_release() just calls a kfree(),
so we use kfree_rcu() instead of the call_rcu(net_generic_release).
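The conversion pattern is the same throughout this series (illustrative
diff; the variable name is assumed):

- call_rcu(&ng->rcu, net_generic_release);
+ kfree_rcu(ng, rcu);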
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The rcu callback xps_dev_maps_release() just calls a kfree(),
so we use kfree_rcu() instead of the call_rcu(xps_dev_maps_release).
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The rcu callback xps_map_release() just calls a kfree(),
so we use kfree_rcu() instead of the call_rcu(xps_map_release).
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The rcu callback rps_map_release() just calls a kfree(),
so we use kfree_rcu() instead of the call_rcu(rps_map_release).
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The rcu callback free_dm_hw_stat() just calls a kfree(),
so we use kfree_rcu() instead of the call_rcu(free_dm_hw_stat).
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The rcu callback __gen_kill_estimator() just calls kfree(),
so we use kfree_rcu() instead of call_rcu(__gen_kill_estimator).
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The rcu callback ha_rcu_free() just calls kfree(),
so we use kfree_rcu() instead of call_rcu(ha_rcu_free).
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Force dev_alloc_name() to be called from register_netdevice() by
dev_get_valid_name(). That allows removing multiple explicit
dev_alloc_name() calls.
The possibility to call dev_alloc_name() in advance remains.
This also fixes a veth creation regression caused by
84c49d8c3e
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ioctl() calls against a socket with an inappropriate ioctl operation
are incorrectly returning EINVAL rather than ENOTTY:
[ENOTTY]
Inappropriate I/O control operation.
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=33992
Signed-off-by: Lifeng Sun <lifongsun@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Four years ago, Patrick made a change to hold rtnl mutex during netlink
dump callbacks.
I believe it was a wrong move. This slows down concurrent dumps, making
good old /proc/net/ files faster than rtnetlink in some situations.
This occurred to me because one "ip link show dev ..." was _very_ slow
on a workload adding/removing network devices in background.
All dump callbacks are able to use RCU locking now, so this patch does
roughly a revert of commits :
1c2d670f36 : [RTNETLINK]: Hold rtnl_mutex during netlink dump callbacks
6313c1e099 : [RTNETLINK]: Remove unnecessary locking in dump callbacks
This lets writers fight for the rtnl mutex while readers go full speed.
It also takes care of phonet: phonet_route_get() is now called from an rcu
read section. I renamed it to phonet_route_get_rcu().
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Remi Denis-Courmont <remi.denis-courmont@nokia.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This makes sure that when a driver calls the ethtool's
get/set_settings() callback of another driver, the data passed to it
is clean. This guarantees that speed_hi will be zeroed correctly if
the called callback doesn't explicitly set it: we are sure we don't
get a corrupted speed from the underlying driver. We also take care of
setting the cmd field appropriately (ETHTOOL_GSET/SSET).
This applies to dev_ethtool_get_settings(), which now makes sure it
sets up that ethtool command parameter correctly before passing it to
drivers. This also means that whoever calls dev_ethtool_get_settings()
does not have to clean the ethtool command parameter. This function
also becomes an exported symbol instead of an inline.
All drivers visible to make allyesconfig under x86_64 have been
updated.
Signed-off-by: David Decotigny <decot@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pktgen doesn't generate the number of frags requested.
Divide the packet size by the number of frags and fill that into every frag.
Example:
With a packet size of 1470, it generates only 11 frags. The initial frags
get lengths 706, 353, 177... and so on, each half of the previous; the last
frag also gets divided by 2.
Now with this fix, each frag will get 78 bytes.
Signed-off-by: Amit Kumar Salecha <amit.salecha@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make dst_alloc() and its users explicitly initialize the entire
entry.
The zero'ing done by kmem_cache_zalloc() was almost entirely
redundant.
Signed-off-by: David S. Miller <davem@davemloft.net>
Simplify and fix netdev_increment_features() to conform to what is
stated in netdevice.h comments about NETIF_F_ONE_FOR_ALL.
Include FCoE segmentation and VLAN-challenged flags in the computation.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since bonding now uses rx_handler, all traffic going into the bond
device goes through bond_handle_frame, so there's no need to go back into
the bonding code later via ptype handlers. This patch converts the
original ptype handlers into "bonding receive probes". These functions
are called from bond_handle_frame and they are registered per-mode.
Note that vlan packets are also handled because they are always untagged
thanks to vlan_untag().
Note that this also allows arpmon for eth-bond-bridge-vlan topology.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add const qualifiers to structs iphdr, ipv6hdr and in6_addr pointers
where possible, to make code intention more obvious.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__ethtool_set_flags() was not taking into account features set but not
user-toggleable.
Since GFLAGS returns masked dev->features, EINVAL is returned when the
passed flags differ from it, rather than from wanted_features.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
Inline a small static function that's only ever called from one place.
Signed-off-by: Rob Landley <rlandley@parallels.com>
Reviewed-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When physical identification of an adapter is done by toggling the
mechanism on and off through software utilizing the set_phys_id operation,
it is done with a fixed duration for both on and off states. Some drivers
may want to set a custom duration for the on/off intervals. This patch
changes the API so the return code from the driver's entry point when it
is called with ETHTOOL_ID_ACTIVE can specify the frequency at which to
cycle the on/off states, and updates the drivers that have already been
converted to use the new set_phys_id and use the synchronous method for
identifying an adapter.
The physical identification frequency set in the updated drivers is based
on how it was done prior to the introduction of set_phys_id.
Compile tested only. Also fixes a compiler warning in sfc.
v2: drivers do not return -EINVAL for ETHTOOL_ID_ACTIVE
v3: fold patchset into single patch and cleanup per Ben's feedback
Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Sathya Perla <sathya.perla@emulex.com>
Cc: Subbu Seetharaman <subbu.seetharaman@emulex.com>
Cc: Ajit Khaparde <ajit.khaparde@emulex.com>
Cc: Michael Chan <mchan@broadcom.com>
Cc: Eilon Greenstein <eilong@broadcom.com>
Cc: Divy Le Ray <divy@chelsio.com>
Cc: Don Fry <pcnet32@frontier.com>
Cc: Jon Mason <jdmason@kudzu.us>
Cc: Solarflare linux maintainers <linux-net-drivers@solarflare.com>
Cc: Steve Hodgson <shodgson@solarflare.com>
Cc: Stephen Hemminger <shemminger@linux-foundation.org>
Cc: Matt Carlson <mcarlson@broadcom.com>
Acked-by: Jon Mason <jdmason@kudzu.us>
Acked-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ethtool support to configure RX, TX and other channels. A combined field
in struct ethtool_channels reflects a set of channels (RX, TX or other).
Other channels can be link interrupts, SR-IOV coordination, etc.
ETHTOOL_GCHANNELS will report the max and current number of RX channels,
the max and current number of TX channels, the max and current number of
other channels, and the max and current number of combined channels.
The number of channels can be modified up to the max number of channels
through the ETHTOOL_SCHANNELS command.
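As a rough sketch (field order paraphrased from the uapi structure this
patch adds; treat it as illustrative rather than authoritative), the
structure carries a max and a current count for each channel type:

struct ethtool_channels {
        __u32   cmd;            /* ETHTOOL_GCHANNELS or ETHTOOL_SCHANNELS */
        __u32   max_rx;
        __u32   max_tx;
        __u32   max_other;
        __u32   max_combined;
        __u32   rx_count;
        __u32   tx_count;
        __u32   other_count;
        __u32   combined_count;
};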
Ben Hutchings:
o define 'combined' and 'other' types. Most multiqueue drivers pair up RX and TX
queues so that most channels combine RX and TX work.
o Please could you use a kernel-doc comment to describe the structure.
Signed-off-by: Amit Kumar Salecha <amit.salecha@qlogic.com>
Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
NETIF_F_TSO_ECN has no effect when TSO is disabled; this just means
that feature state will be accurately reported to user-space.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The feature flags NETIF_F_TSO and NETIF_F_TSO6 independently enable
TSO for IPv4 and IPv6 respectively. However, the test in
netdev_fix_features() and its predecessor functions was never updated
to check for NETIF_F_TSO6, possibly because it was originally proposed
that TSO for IPv6 would be dependent on both feature flags.
Now that these feature flags can be changed independently from
user-space and we depend on netdev_fix_features() to fix invalid
feature combinations, it's important to disable them both if
scatter-gather is disabled. Also disable NETIF_F_TSO_ECN so
user-space sees all TSO features as disabled.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now there are 2 paths for rx vlan frames. When rx-vlan-hw-accel is
enabled, skb is untagged by NIC, vlan_tci is set and the skb gets into
vlan code in __netif_receive_skb - vlan_hwaccel_do_receive.
For non-rx-vlan-hw-accel however, a tagged skb goes through the whole
__netif_receive_skb, is untagged in a ptype_base handler and reinjected.
This inconsistency is fixed by this patch. Vlan untagging happens early in
__netif_receive_skb so the rest of the code (ptype_all handlers, rx_handlers)
sees the skb as if it was untagged by hw.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
v1->v2:
remove "inline" from vlan_core.c functions
Signed-off-by: David S. Miller <davem@davemloft.net>
When blinking for a duration set by the user, the value specified is in
seconds but it is used as the number of jiffies in the timeout after which
the Physical ID indicator is deactivated. Fix by converting the timeout
from seconds to jiffies.
Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Acked-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change is meant to prevent a possible null pointer dereference if
NETIF_F_NTUPLE is defined but the set_rx_ntuple function pointer is not.
The main motivation behind this patch is to eventually replace the ntuple
interfaces entirely with the network flow classifier interfaces. This
allows the device drivers to maintain the ntuple check internally while
using the network flow classifier interface for setting up and displaying
rules.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ethtool ETHTOOL_PHYS_ID command runs for an arbitrarily long
period of time, holding the RTNL lock. This blocks routing updates,
device enumeration, and various important operations that one might
want to keep running while hunting for the flashing LED.
We need to drop the RTNL lock during this operation, but currently the
core implementation is a thin wrapper around a driver operation and
drivers may well depend upon holding the lock.
Define a new driver operation 'set_phys_id' with an argument that sets
the ID indicator on/off/inactive/active (the last optional, for any
driver or firmware that prefers to handle blinking asynchronously).
When this is defined, the ethtool core drops the lock while waiting
and only acquires it around calls to this operation.
Deprecate the 'phys_id' operation in favour of this. It can be
removed once all in-tree drivers are converted.
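A minimal driver-side sketch of the new operation (the example_led_*()
helpers are hypothetical; the cycle-frequency return value used for
ETHTOOL_ID_ACTIVE follows the convention of the follow-up change covered
earlier in this log):

static int example_set_phys_id(struct net_device *dev,
                               enum ethtool_phys_id_state state)
{
        struct example_priv *priv = netdev_priv(dev);

        switch (state) {
        case ETHTOOL_ID_ACTIVE:
                return 2;                  /* cycle on/off ~2 times per second */
        case ETHTOOL_ID_ON:
                example_led_on(priv);      /* hypothetical helper */
                break;
        case ETHTOOL_ID_OFF:
                example_led_off(priv);     /* hypothetical helper */
                break;
        case ETHTOOL_ID_INACTIVE:
                example_led_restore(priv); /* hypothetical helper */
                break;
        }
        return 0;
}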
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
This patch uses __copy_from_user_nocache on transmit to bypass data
cache for a performance improvement. skb_add_data_nocache and
skb_copy_to_page_nocache can be called by sendmsg functions to use
this feature, initial support is in tcp_sendmsg. This functionality is
configurable per device using ethtool.
Presumably, this feature would only be useful when the driver does
not touch the data. The feature is turned on by default if a device
indicates that it does some form of checksum offload; it is off by
default for devices that do no checksum offload or indicate no checksum
is necessary. For the former case copy-checksum is probably done
anyway, in the latter case the device is likely loopback in which case
the no cache copy is probably not beneficial.
This patch was tested using 200 instances of netperf TCP_RR with
1400 byte request and one byte reply. Platform is 16 core AMD x86.
No-cache copy disabled:
672703 tps, 97.13% utilization
50/90/99% latency:244.31 484.205 1028.41
No-cache copy enabled:
702113 tps, 96.16% utilization,
50/90/99% latency 238.56 467.56 956.955
Using 14000 byte request and response sizes demonstrate the
effects more dramatically:
No-cache copy disabled:
79571 tps, 34.34% utilization
50/90/95% latency 1584.46 2319.59 5001.76
No-cache copy enabled:
83856 tps, 34.81% utilization
50/90/95% latency 2508.42 2622.62 2735.88
Note especially the effect on latency tail (95th percentile).
This seems to provide a nice performance improvement and is
consistent in the tests I ran. Presumably, this would provide
the greatest benefits in the presence of an application workload
stressing the cache and a lot of transmit data happening.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Issue FEAT_CHANGE notification when features are changed by
netdev_update_features(). This will allow changes made by extra constraints
on e.g. MTU change to be properly propagated like changes via ethtool.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case the device the packet is coming from has TSO enabled,
we should not check the mtu size value as this one could be bigger
than the expected value.
This is the case for the macvlan driver when the lower device has
TSO enabled. The macvlan device inherits this feature and forwards the
packets without fragmenting them. Then the packets go through
dev_forward_skb and are dropped. This patch fixes this by checking that
TSO is not enabled when we want to check the mtu size.
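A sketch of the idea (not necessarily the exact hunk): skip the MTU
comparison in dev_forward_skb() when the skb is a GSO packet, since the
lower device will segment it itself:

        /* Illustrative only: GSO packets may legitimately exceed dev->mtu. */
        if (skb->len > (dev->mtu + dev->hard_header_len) && !skb_is_gso(skb)) {
                kfree_skb(skb);
                return NET_RX_DROP;
        }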
Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (30 commits)
xfrm: Restrict extended sequence numbers to esp
xfrm: Check for esn buffer len in xfrm_new_ae
xfrm: Assign esn pointers when cloning a state
xfrm: Move the test on replay window size into the replay check functions
netdev: bfin_mac: document TE setting in RMII modes
drivers net: Fix declaration ordering in inline functions.
cxgb3: Apply interrupt coalescing settings to all queues
net: Always allocate at least 16 skb frags regardless of page size
ipv4: Don't ip_rt_put() an error pointer in RAW sockets.
net: fix ethtool->set_flags not intended -EINVAL return value
mlx4_en: Fix loss of promiscuity
tg3: Fix inline keyword usage
tg3: use <linux/io.h> and <linux/uaccess.h> instead <asm/io.h> and <asm/uaccess.h>
net: use CHECKSUM_NONE instead of magic number
Net / jme: Do not use legacy PCI power management
myri10ge: small rx_done refactoring
bridge: notify applications if address of bridge device changes
ipv4: Fix IP timestamp option (IPOPT_TS_PRESPEC) handling in ip_options_echo()
can: c_can: Fix tx_bytes accounting
can: c_can_platform: fix irq check in probe
...
After commit d5dbda2380 "ethtool: Add
support for vlan accleration.", drivers that have NETIF_F_HW_VLAN_TX,
and/or NETIF_F_HW_VLAN_RX feature, but do not allow enable/disable vlan
acceleration via ethtool set_flags, always return -EINVAL from that
function. Fix by returning -EINVAL only if requested features do not
match current settings and can not be changed by driver.
Change any driver that define ethtool->set_flags to use
ethtool_invalid_flags() to avoid similar problems in the future
(also on drivers that do not have the problem).
Tested with modified (to reproduce this bug) myri10ge driver.
Cc: stable@kernel.org # 2.6.37+
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The mac address of the bridge device may be changed when a new interface
is added to the bridge. If this happens, then the bridge needs to call
the network notifiers to tickle any other systems that care. Since bridge
can be a module, this also means exporting the notifier function.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The code itself can explain what it is doing; no need for these comments.
Signed-off-by: WANG Cong <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (56 commits)
route: Take the right src and dst addresses in ip_route_newports
ipv4: Fix nexthop caching wrt. scoping.
ipv4: Invalidate nexthop cache nh_saddr more correctly.
net: fix pch_gbe section mismatch warning
ipv4: fix fib metrics
mlx4_en: Removing HW info from ethtool -i report.
net_sched: fix THROTTLED/RUNNING race
drivers/net/a2065.c: Convert release_resource to release_region/release_mem_region
drivers/net/ariadne.c: Convert release_resource to release_region/release_mem_region
bonding: fix rx_handler locking
myri10ge: fix rmmod crash
mlx4_en: updated driver version to 1.5.4.1
mlx4_en: Using blue flame support
mlx4_core: reserve UARs for userspace consumers
mlx4_core: maintain available field in bitmap allocator
mlx4: Add blue flame support for kernel consumers
mlx4_en: Enabling new steering
mlx4: Add support for promiscuous mode in the new steering model.
mlx4: generalization of multicast steering.
mlx4_en: Reporting HW revision in ethtool -i
...
ksoftirqd, kworker, migration, and pktgend kthreads can be created with
kthread_create_on_node(), to get proper NUMA affinities for their stack and
task_struct.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: David Howells <dhowells@redhat.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Implement compatibility with new hw_features for dev_disable_lro().
This is a transition path - dev_disable_lro() should be later
integrated into netdev_fix_features() after all drivers are converted.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
It was pointed out to me recently that my spelling could be better :)
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__ethtool_set_sg does not check if dev->ethtool_ops->set_sg is defined
which can result in a NULL pointer dereference when ethtool is used to
change SG settings for drivers without SG support.
Signed-off-by: Roger Luethi <rl@hellgate.ch>
Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (47 commits)
doc: CONFIG_UNEVICTABLE_LRU doesn't exist anymore
Update cpuset info & webiste for cgroups
dcdbas: force SMI to happen when expected
arch/arm/Kconfig: remove one to many l's in the word.
asm-generic/user.h: Fix spelling in comment
drm: fix printk typo 'sracth'
Remove one to many n's in a word
Documentation/filesystems/romfs.txt: fixing link to genromfs
drivers:scsi Change printk typo initate -> initiate
serial, pch uart: Remove duplicate inclusion of linux/pci.h header
fs/eventpoll.c: fix spelling
mm: Fix out-of-date comments which refers non-existent functions
drm: Fix printk typo 'failled'
coh901318.c: Change initate to initiate.
mbox-db5500.c Change initate to initiate.
edac: correct i82975x error-info reported
edac: correct i82975x mci initialisation
edac: correct commented info
fs: update comments to point correct document
target: remove duplicate include of target/target_core_device.h from drivers/target/target_core_hba.c
...
Trivial conflict in fs/eventpoll.c (spelling vs addition)
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1480 commits)
bonding: enable netpoll without checking link status
xfrm: Refcount destination entry on xfrm_lookup
net: introduce rx_handler results and logic around that
bonding: get rid of IFF_SLAVE_INACTIVE netdev->priv_flag
bonding: wrap slave state work
net: get rid of multiple bond-related netdevice->priv_flags
bonding: register slave pointer for rx_handler
be2net: Bump up the version number
be2net: Copyright notice change. Update to Emulex instead of ServerEngines
e1000e: fix kconfig for crc32 dependency
netfilter ebtables: fix xt_AUDIT to work with ebtables
xen network backend driver
bonding: Improve syslog message at device creation time
bonding: Call netif_carrier_off after register_netdevice
bonding: Incorrect TX queue offset
net_sched: fix ip_tos2prio
xfrm: fix __xfrm_route_forward()
be2net: Fix UDP packet detected status in RX compl
Phonet: fix aligned-mode pipe socket buffer header reserve
netxen: support for GbE port settings
...
Fix up conflicts in drivers/staging/brcm80211/brcmsmac/wl_mac80211.c
with the staging updates.
This patch allows rx_handlers to better signal to their caller what to do
next. That makes skb->deliver_no_wcard no longer needed.
kernel-doc for rx_handler_result is taken from Nicolas' patch.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Reviewed-by: Nicolas de Pesloüan <nicolas.2p.debian@free.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (57 commits)
tidy the trailing symlinks traversal up
Turn resolution of trailing symlinks iterative everywhere
simplify link_path_walk() tail
Make trailing symlink resolution in path_lookupat() iterative
update nd->inode in __do_follow_link() instead of after do_follow_link()
pull handling of one pathname component into a helper
fs: allow AT_EMPTY_PATH in linkat(), limit that to CAP_DAC_READ_SEARCH
Allow passing O_PATH descriptors via SCM_RIGHTS datagrams
readlinkat(), fchownat() and fstatat() with empty relative pathnames
Allow O_PATH for symlinks
New kind of open files - "location only".
ext4: Copy fs UUID to superblock
ext3: Copy fs UUID to superblock.
vfs: Export file system uuid via /proc/<pid>/mountinfo
unistd.h: Add new syscalls numbers to asm-generic
x86: Add new syscalls for x86_64
x86: Add new syscalls for x86_32
fs: Remove i_nlink check from file system link callback
fs: Don't allow to create hardlink for deleted file
vfs: Add open by file handle support
...
Just need to make sure that the AF_UNIX garbage collector won't
confuse an O_PATHed socket on a filesystem with a real opened
AF_UNIX socket.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
(bug introduced by commit 26ad787962
(pktgen: speedup fragmented skbs))
The headers of pktgen were incorrectly added in a pktgen packet
without frags (frags=0). There was an offset in the pktgen headers.
The cause was in reusing the pgh variable as a return variable in skb_put
when adding the payload to the skb.
Signed-off-by: Daniel Turull <daniel.turull@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
I intend to turn struct flowi into a union of AF specific flowi
structs. There will be a common structure that each variant includes
first, much like struct sock_common.
This is the first step to move in that direction.
Signed-off-by: David S. Miller <davem@davemloft.net>
Since a8f80e8ff9 any process with
CAP_NET_ADMIN may load any module from /lib/modules/. This doesn't mean
that CAP_NET_ADMIN is a superset of CAP_SYS_MODULE as modules are
limited to /lib/modules/**. However, the CAP_NET_ADMIN capability shouldn't
allow anybody to load any module not related to networking.
This patch restricts the ability to autoload modules to netdev modules
with explicit aliases. This fixes CVE-2011-1019.
Arnd Bergmann suggested to leave untouched the old pre-v2.6.32 behavior
of loading netdev modules by name (without any prefix) for processes
with CAP_SYS_MODULE to maintain the compatibility with network scripts
that use autoloading netdev modules by aliases like "eth0", "wlan0".
Currently there are only three users of the feature in the upstream
kernel: ipip, ip_gre and sit.
root@albatros:~# capsh --drop=$(seq -s, 0 11),$(seq -s, 13 34) --
root@albatros:~# grep Cap /proc/$$/status
CapInh: 0000000000000000
CapPrm: fffffff800001000
CapEff: fffffff800001000
CapBnd: fffffff800001000
root@albatros:~# modprobe xfs
FATAL: Error inserting xfs
(/lib/modules/2.6.38-rc6-00001-g2bf4ca3/kernel/fs/xfs/xfs.ko): Operation not permitted
root@albatros:~# lsmod | grep xfs
root@albatros:~# ifconfig xfs
xfs: error fetching interface information: Device not found
root@albatros:~# lsmod | grep xfs
root@albatros:~# lsmod | grep sit
root@albatros:~# ifconfig sit
sit: error fetching interface information: Device not found
root@albatros:~# lsmod | grep sit
root@albatros:~# ifconfig sit0
sit0 Link encap:IPv6-in-IPv4
NOARP MTU:1480 Metric:1
root@albatros:~# lsmod | grep sit
sit 10457 0
tunnel4 2957 1 sit
For CAP_SYS_MODULE module loading is still relaxed:
root@albatros:~# grep Cap /proc/$$/status
CapInh: 0000000000000000
CapPrm: ffffffffffffffff
CapEff: ffffffffffffffff
CapBnd: ffffffffffffffff
root@albatros:~# ifconfig xfs
xfs: error fetching interface information: Device not found
root@albatros:~# lsmod | grep xfs
xfs 745319 0
Reference: https://lkml.org/lkml/2011/2/24/203
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Kees Cook <kees.cook@canonical.com>
Signed-off-by: James Morris <jmorris@namei.org>
The units in show_results in pktgen were not correct.
The results are in usec but were displayed as nsec.
Reported-by: Jong-won Lee <ljw@handong.edu>
Signed-off-by: Daniel Turull <daniel.turull@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This was there before, I forgot about this. Allows deliveries to
ptype_base handlers registered for orig_dev. I presume this is still
desired.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Reviewed-by: Nicolas de Pesloüan <nicolas.2p.debian@free.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
UFO doesn't really use the sk_sndmsg_* parameters so touching
them is pointless. It can't use them anyway since the whole
point of UFO is to use the original pages without copying.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Neil pointed out that we can't send ARP replies on behalf of slaves;
we need to move the arp queue to their bond device.
Signed-off-by: WANG Cong <amwang@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
V4: rebase to net-next-2.6
This patch removes the flag IFF_IN_NETPOLL, we don't need it any more since
we have netpoll_tx_running() now.
Signed-off-by: WANG Cong <amwang@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
addr_type of 0 means that the type should be adopted from from_dev and
not from __hw_addr_del_multiple(). Unfortunately it isn't so and
addr_type will always be considered. Fix this by implementing the
considered and documented behavior.
Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use discrete setting ops for drivers that have not been updated. This will
not make them conform to full G/SFEATURES semantics, though.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement getting rx checksum state for drivers that have not been updated.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
Avoid "Features changed" message and ndo_set_features call on device
registration caused by automatic enabling of GSO and GRO. The driver should
have enabled the hardware offloads it set in features, so the
ndo_set_features() call is not needed at registration time.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix netdev_update_features() messages on register time by moving
the call further in register_netdevice(). When
netdev->reg_state != NETREG_REGISTERED, netdev_name() returns
"(unregistered netdevice)" even if the dev's name is already filled.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 9b5e383c11 (net: Introduce
unregister_netdevice_many()) left an active LIST_HEAD() in
rollback_registered(), with possible memory corruption.
Even if the device is freed without touching its unreg_list (and therefore
touching the previous memory location holding LIST_HEAD(single)), better
close the bug for good, since it's really subtle.
(Same fix for default_device_exit_batch() for completeness)
Reported-by: Michal Hocko <mhocko@suse.cz>
Tested-by: Michal Hocko <mhocko@suse.cz>
Reported-by: Eric W. Biderman <ebiderman@xmission.com>
Tested-by: Eric W. Biderman <ebiderman@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Ingo Molnar <mingo@elte.hu>
CC: Octavian Purdila <opurdila@ixiacom.com>
CC: stable <stable@kernel.org> [.33+]
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric W. Biderman and Michal Hocko reported various memory corruptions
that we suspected to be related to a LIST head located on stack, that
was manipulated after the thread left the function frame (and eventually exited,
so its stack was freed and reused).
Eric Dumazet suggested the problem was probably coming from commit
443457242b (net: factorize
sync-rcu call in unregister_netdevice_many)
This patch fixes __dev_close() and dev_close() to properly deinit their
respective LIST_HEAD(single) before exiting.
References: https://lkml.org/lkml/2011/2/16/304
References: https://lkml.org/lkml/2011/2/14/223
Reported-by: Michal Hocko <mhocko@suse.cz>
Tested-by: Michal Hocko <mhocko@suse.cz>
Reported-by: Eric W. Biderman <ebiderman@xmission.com>
Tested-by: Eric W. Biderman <ebiderman@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Ingo Molnar <mingo@elte.hu>
CC: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This allows avoiding multiple writes to the initial __refcnt.
The simplest cases of wanting an initial reference of "1"
in ipv4 and ipv6 have been converted; the rest have been left
alone and kept at the existing "0".
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce NETIF_F_RXCSUM to replace device-private flags for RX checksum
offload. Integrate it with ndo_fix_features.
ethtool_op_get_rx_csum() is removed altogether as nothing in-tree uses it.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
This introduces a new framework to handle device features setting.
It consists of:
- new fields in struct net_device:
+ hw_features - features that hw/driver supports toggling
+ wanted_features - features that user wants enabled, when possible
- new netdev_ops:
+ feat = ndo_fix_features(dev, feat) - API checking constraints for
enabling features or their combinations
+ ndo_set_features(dev) - API updating hardware state to match
changed dev->features
- new ethtool commands:
+ ETHTOOL_GFEATURES/ETHTOOL_SFEATURES: get/set dev->wanted_features
and trigger device reconfiguration if resulting dev->features
changed
+ ETHTOOL_GSTRINGS(ETH_SS_FEATURES): get feature bits names (meaning)
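A rough driver-side sketch of the two new ndo hooks (features are plain u32
bitmasks at this point; the SG/TSO constraint and the
example_hw_set_rx_csum() helper are illustrative, not requirements of the
framework):

static u32 example_fix_features(struct net_device *dev, u32 features)
{
        /* Example constraint: TSO only makes sense with SG enabled. */
        if (!(features & NETIF_F_SG))
                features &= ~(NETIF_F_TSO | NETIF_F_TSO6);
        return features;
}

static int example_set_features(struct net_device *dev, u32 features)
{
        u32 changed = dev->features ^ features;

        /* Push only the bits that actually changed down to hardware;
         * example_hw_set_rx_csum() stands in for a register write. */
        if (changed & NETIF_F_RXCSUM)
                example_hw_set_rx_csum(dev, !!(features & NETIF_F_RXCSUM));
        return 0;
}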
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
This allows to enable GRO even if RX csum is disabled. GRO will not
be used for packets without hardware checksum anyway.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is needed for unified offloads patch.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
For testing and debugging purposes it is useful to be able to disable
hardware acceleration of RFS without disabling RFS altogether. Since
this is a similar feature to 'n-tuple' flow steering through the
ethtool API, test the same feature flag that controls that.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
If the root qdisc for a net device is mqprio, and the driver's
ndo_setup_tc() operation dynamically adds and removes TX queues,
netif_set_real_num_tx_queues() will be called during device
unregistration to remove the extra TX queues when the qdisc is
destroyed. Currently this causes the corresponding kobjects
to be leaked, and the device's reference count never drops to 0.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
This patch allows userspace to enslave/release slave devices via the
netlink interface using IFLA_MASTER. This introduces a generic way to
add/remove underlying devices.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
dev->master is now tightly connected to bonding driver. This patch makes
this pointer more general and ready to be used by others.
- netdev_set_master() - bond specifics moved to new function
netdev_set_bond_master()
- introduced netif_is_bond_slave() to check if device is a bonding slave
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
No need to check (master) twice or to bounce in and out of the header file.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Reviewed-by: Nicolas de Pesloüan <nicolas.2p.debian@free.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit a512b92 adds a sysfs entry for the net device group, but
before this commit, tun already used a group sysfs entry, so after this
commit went in, the kernel warns like this:
sysfs: cannot create duplicate filename '/devices/virtual/net/vnet0/group'
Since tun has used this for years, renaming the sysfs entry under tun might
break existing userspace, so renaming the sysfs entry for the net device
group is the better choice.
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit aa94210411 ("net: init ingress
queue") we moved the allocation and lock initialization of the queues
into alloc_netdev_mq() since register_netdevice() is way too late.
The problem is that dev->type is not setup until the setup()
callback is invoked by alloc_netdev_mq(), and the dev->type is
what determines the lockdep class to use for the locks in the
queues.
Fix this by doing the queue allocation after the setup() callback
runs.
This is safe because the setup() callback is not allowed to make any
state changes that need to be undone on error (memory allocations,
etc.). It may, however, make state changes that are undone by
free_netdev() (such as netif_napi_add(), which is done by the
ipoib driver's setup routine).
The previous code also leaked a reference to the &init_net namespace
object on RX/TX queue allocation failures.
Signed-off-by: David S. Miller <davem@davemloft.net>
Like Herbert's change from a few days ago:
66c46d741e gro: Reset dev pointer on reuse
this may not be necessary at this point, but we should still clean up
the skb->skb_iif. If not, we may end up with an invalid value for
skb->skb_iif when the skb is reused and the check is done in
__netif_receive_skb.
Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
In get_rps_cpu, add a check that the rps_flow_table for the device is
NULL when trying to take the fast path when the RPS map length is one.
Without this, RFS is effectively disabled if the map length is one, which
is not correct.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On older kernels the VLAN code may zero skb->dev before dropping
it and causing it to be reused by GRO.
Unfortunately we didn't reset skb->dev in that case which causes
the next GRO user to get a bogus skb->dev pointer.
This particular problem no longer happens with the current upstream
kernel due to changes in VLAN processing.
However, for correctness we should still reset the skb->dev pointer
in the GRO reuse function in case a future user does the same thing.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
If there are no explicit metrics attached to a route, hook
fi->fib_info up to dst_default_metrics.
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit c6d14c8456 (net: Introduce for_each_netdev_rcu() iterator)
added a race in dev_seq_next().
The rcu_dereference() call should be done _before_ testing the end of
list, or we might return a wrong net_device if a concurrent thread
changes net_device list under us.
Note : discovered thanks to a sparse warning :
net/core/dev.c:3919:9: error: incompatible types in comparison expression
(different address spaces)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
pskb_expand_head() triggers a kmemcheck warning when copy of
skb_shared_info is done in pskb_expand_head()
This is because destructor_arg field is not necessarily initialized at
this point. Add kmemcheck_annotate_variable() call in __alloc_skb() to
instruct kmemcheck this is a normal situation.
Resolves bugzilla.kernel.org 27212
Reference: https://bugzilla.kernel.org/show_bug.cgi?id=27212
Reported-by: Christian Casteyde <casteyde.christian@free.fr>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
I'm testing an API that uses the IFLA_AF_SPEC attribute.
In the rtnetlink core, the set_link_af() member
of the rtnl_af_ops struct receives the nested attribute
(as I expected), but the validate_link_af() member
receives the parent attribute.
IMO, this patch fixes this.
Signed-off-by: Kurt Van Dijck <kurt.van.dijck@eia.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
Routing metrics are now copy-on-write.
Initially a route entry points its metrics at a read-only location.
If a routing table entry exists, it will point there. Else it will
point at the all zero metric place-holder called 'dst_default_metrics'.
The writeability state of the metrics is stored in the low bits of the
metrics pointer, we have two bits left to spare if we want to store
more states.
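As a simplified sketch of the low-bit trick (names and types are close to,
but not exactly, the in-tree accessors):

#define DST_METRICS_READ_ONLY   0x1UL
#define DST_METRICS_FLAGS       0x3UL

/* The metrics array is at least 4-byte aligned, so the two low bits of
 * the stored pointer value are free to carry state. */
static inline bool metrics_read_only(unsigned long metrics)
{
        return metrics & DST_METRICS_READ_ONLY;
}

static inline u32 *metrics_ptr(unsigned long metrics)
{
        return (u32 *)(metrics & ~DST_METRICS_FLAGS);
}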
For the initial implementation, COW is implemented simply via kmalloc.
However future enhancements will change this to place the writable
metrics somewhere else, in order to increase sharing. Very likely
this "somewhere else" will be the inetpeer cache.
Note also that this means that metrics updates may transiently fail
if we cannot COW the metrics successfully.
But even by itself, this patch should decrease memory usage and
increase cache locality especially for routing workloads. In those
cases the read-only metric copies stay in place and never get written
to.
TCP workloads where metrics get updated, and those rare cases where
PMTU triggers occur, will take a very slight performance hit. But
that hit will be alleviated when the long-term writable metrics
move to a more sharable location.
Since the metrics storage went from a u32 array of RTAX_MAX entries to
what is essentially a pointer, some retooling of the dst_entry layout
was necessary.
Most importantly, we need to preserve the alignment of the reference
count so that it doesn't share cache lines with the read-mostly state,
as per Eric Dumazet's alignment assertion checks.
The only non-trivial bit here is the move of the 'flags' member into
the writeable cacheline. This is OK since we are always accessing the
flags around the same moment when we made a modification to the
reference count.
Signed-off-by: David S. Miller <davem@davemloft.net>
We spend a lot of time clearing pages in pktgen
(or not clearing them on ipv6 and leaking kernel memory).
Since we don't modify them, we can use one zeroed page, and get
references on it. This page can use NUMA affinity as well.
Define pktgen_finalize_skb() helper, used both in ipv4 and ipv6
Results using skbs with one frag :
Before patch :
Result: OK: 608980458(c608978520+d1938) nsec, 1000000000
(100byte,1frags)
1642088pps 1313Mb/sec (1313670400bps) errors: 0
After patch :
Result: OK: 345285014(c345283891+d1123) nsec, 1000000000
(100byte,1frags)
2896158pps 2316Mb/sec (2316926400bps) errors: 0
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The group of a network device can be queried or changed from userspace
using sysfs.
For example, considering sysfs mounted in /sys, one can change the group
that interface lo belongs to:
echo 1 > /sys/class/net/lo/group
Signed-off-by: Vlad Dogaru <ddvlad@rosedu.org>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a conflict between commit b00916b1 and a77f5db3. This patch resolves
the conflict by clearing the heap allocation in ethtool_get_regs().
Cc: stable@kernel.org
Signed-off-by: Eugene Teo <eugeneteo@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reduce printk() levels to KERN_INFO in netdev_fix_features() as this will
be used by ethtool and might spam dmesg unnecessarily.
This converts the function to use netdev_info() instead of plain printk().
As a side effect, bonding and bridge devices will now log dropped features
on every slave device change.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
Quoting Ben Hutchings: we presumably won't be defining features that
can only be enabled on 64-bit architectures.
Occurrences found by `grep -r` on net/, drivers/net, include/
[ Move features and vlan_features next to each other in
struct netdev, as per Eric Dumazet's suggestion -DaveM ]
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow drivers for multiqueue hardware with flow filter tables to
accelerate RFS. The driver must:
1. Set net_device::rx_cpu_rmap to a cpu_rmap of the RX completion
IRQs (in queue order). This will provide a mapping from CPUs to the
queues for which completions are handled nearest to them.
2. Implement net_device_ops::ndo_rx_flow_steer. This operation adds
or replaces a filter steering the given flow to the given RX queue, if
possible.
3. Periodically remove filters for which rps_may_expire_flow() returns
true.
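A sketch of the driver-facing pieces described above (signatures
paraphrased from the description; error handling and locking omitted):

/* Step 2: the new ndo.  Returns the hardware filter ID on success, or a
 * negative error if the flow could not be steered. */
int (*ndo_rx_flow_steer)(struct net_device *dev, const struct sk_buff *skb,
                         u16 rxq_index, u32 flow_id);

/* Step 3: called by the driver while scanning its filter table; returns
 * true when the stack no longer tracks the flow, so the hardware filter
 * may be removed. */
bool rps_may_expire_flow(struct net_device *dev, u16 rxq_index,
                         u32 flow_id, u16 filter_id);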
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Suppose that several linear skbs of the same flow were received by GRO. They
were thus merged into one skb with a frag_list. Then a new skb of the same flow
arrives, but it is a paged skb with data starting in its frags[].
Before adding the skb to the frag_list skb_gro_receive() will of course adjust
the skb to throw away the headers. It correctly modifies the page_offset and
size of the frag, but it leaves incorrect information in the skb:
->data_len is not decreased at all.
->len is decreased only by headlen, as if no change were done to the frag.
Later in a receiving process this causes skb_copy_datagram_iovec() to return
-EFAULT and this is seen in userspace as the result of the recv() syscall.
In practice the bug can be reproduced with the sfc driver. By default the
driver uses an adaptive scheme when it switches between using
napi_gro_receive() (with skbs) and napi_gro_frags() (with pages). The bug is
reproduced when under rx load with enough successful GRO merging the driver
decides to switch from the former to the latter.
Manual control is also possible, so reproducing this is easy with netcat:
- on machine1 (with sfc): nc -l 12345 > /dev/null
- on machine2: nc machine1 12345 < /dev/zero
- on machine1:
echo 1 > /sys/module/sfc/parameters/rx_alloc_method # use skbs
echo 2 > /sys/module/sfc/parameters/rx_alloc_method # use pages
- See that nc has quit suddenly.
[v2: Modified by Eric Dumazet to avoid advancing skb->data past the end
and to use a temporary variable.]
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 941666c2e3 "net: RCU conversion of dev_getbyhwaddr() and
arp_ioctl()" introduced a regression, reported by Jamie Heilman.
"arp -Ds 192.168.2.41 eth0 pub" triggered the ASSERT_RTNL() assert
in pneigh_lookup()
Removing RTNL requirement from arp_ioctl() was a mistake, just revert
that part.
Reported-by: Jamie Heilman <jamie@audible.transient.net>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rtnl_group_changelink() is invoked by rtnl_newlink() before the link
attributes have been validated. Additionally the group changes are
performed even if NLM_F_CREATE is specified and a new link is
created, while more reasonable semantics would be to set the group
value on the newly created link.
Fix both problems by moving the rtnl_group_changelink() invocation
down to the handling of non-existent links without NLM_F_CREATE()
and add a dev_set_group() call to rtnl_create_link().
Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: Vlad Dogaru <ddvlad@rosedu.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
fix some minor issues and sparse (__rcu) warnings
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch converts stab qdisc management to RCU, so that we can perform
the qdisc_calculate_pkt_len() call before getting qdisc lock.
This shortens the lock's held time in __dev_xmit_skb().
This permits more qdiscs to get TCQ_F_CAN_BYPASS status, avoiding lot of
cache misses and so reducing latencies.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Patrick McHardy <kaber@trash.net>
CC: Jesper Dangaard Brouer <hawk@diku.dk>
CC: Jarek Poplawski <jarkao2@gmail.com>
CC: Jamal Hadi Salim <hadi@cyberus.ca>
CC: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Octavian Purdila <opurdila@ixiacom.com>
Reviewed-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch provides a mechanism for lower layer devices to
steer traffic using skb->priority to tx queues. This allows
for hardware based QOS schemes to use the default qdisc without
incurring the penalties related to global state and the qdisc
lock, while reliably receiving skbs on the correct tx ring
to avoid head of line blocking resulting from shuffling in
the LLD. Finally, all the goodness from txq caching and xps/rps
can still be leveraged.
Many drivers and hardware exist with the ability to implement
QOS schemes in the hardware but currently these drivers tend
to rely on firmware to reroute specific traffic, a driver
specific select_queue or the queue_mapping action in the
qdisc.
By using select_queue for this, drivers need to be updated for
each and every traffic type and we lose the goodness of much
of the upstream work. Firmware solutions are inherently
inflexible. And finally if admins are expected to build a
qdisc and filter rules to steer traffic this requires knowledge
of how the hardware is currently configured. The number of tx
queues and the queue offsets may change depending on resources.
Also this approach incurs all the overhead of a qdisc with filters.
With the mechanism in this patch users can set skb priority using
expected methods, i.e. setsockopt(), or the stack can set the priority
directly. Then the skb will be steered to the correct tx queues
aligned with hardware QOS traffic classes. In the normal case with
single traffic class and all queues in this class everything
works as is until the LLD enables multiple tcs.
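For example, the expected userspace method is simply a socket option
(sketch):

        /* Ask the stack to tag this socket's skbs with priority 5; the
         * patch then maps that priority to a hardware traffic class. */
        int prio = 5;

        if (setsockopt(fd, SOL_SOCKET, SO_PRIORITY, &prio, sizeof(prio)) < 0)
                perror("setsockopt(SO_PRIORITY)");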
To steer the skb we mask out the lower 4 bits of the priority
and allow the hardware to configure up to 15 distinct classes
of traffic. This is expected to be sufficient for most applications;
at any rate it is more than the 8021Q spec designates and is
equal to the number of prio bands currently implemented in
the default qdisc.
This, in conjunction with a userspace application such as
lldpad, can be used to implement 8021Q transmission selection
algorithms, one of these being the extended transmission
selection algorithm currently being used for DCB.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If an rtnetlink request specifies a negative or zero ifindex and has no
interface name attribute, but has a group attribute, then the changes
are made to all the interfaces belonging to the specified group.
Signed-off-by: Vlad Dogaru <ddvlad@rosedu.org>
Acked-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
Net devices can now be grouped, enabling simpler manipulation from
userspace. This patch adds a group field to the net_device structure, as
well as rtnetlink support to query and modify it.
Signed-off-by: Vlad Dogaru <ddvlad@rosedu.org>
Acked-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (41 commits)
sctp: user perfect name for Delayed SACK Timer option
net: fix can_checksum_protocol() arguments swap
Revert "netlink: test for all flags of the NLM_F_DUMP composite"
gianfar: Fix misleading indentation in startup_gfar()
net/irda/sh_irda: return to RX mode when TX error
net offloading: Do not mask out NETIF_F_HW_VLAN_TX for vlan.
USB CDC NCM: tx_fixup() race condition fix
ns83820: Avoid bad pointer deref in ns83820_init_one().
ipv6: Silence privacy extensions initialization
bnx2x: Update bnx2x version to 1.62.00-4
bnx2x: Fix AER setting for BCM57712
bnx2x: Fix BCM84823 LED behavior
bnx2x: Mark full duplex on some external PHYs
bnx2x: Fix BCM8073/BCM8727 microcode loading
bnx2x: LED fix for BCM8727 over BCM57712
bnx2x: Common init will be executed only once after POR
bnx2x: Swap BCM8073 PHY polarity if required
iwlwifi: fix valid chain reading from EEPROM
ath5k: fix locking in tx_complete_poll_work
ath9k_hw: do PA offset calibration only on longcal interval
...
commit 0363466866 (net offloading: Convert checksums to use
centrally computed features.) mistakenly swapped can_checksum_protocol()
arguments.
This broke IPv6 on bnx2 for instance, on NIC without TCPv6 checksum
offloads.
Reported-by: Hans de Bruin <jmdebruin@xmsnet.nl>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Packet filter (BPF) doesn't need to disable softirqs, being fully
re-entrant and lock-less.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In netif_skb_features() we return only the features that are valid for vlans
if we have a vlan packet. However, we should not mask out NETIF_F_HW_VLAN_TX
since it enables transmission of vlan tags and is obviously valid.
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (47 commits)
GRETH: resolve SMP issues and other problems
GRETH: handle frame error interrupts
GRETH: avoid writing bad speed/duplex when setting transfer mode
GRETH: fixed skb buffer memory leak on frame errors
GRETH: GBit transmit descriptor handling optimization
GRETH: fix opening/closing
GRETH: added raw AMBA vendor/device number to match against.
cassini: Fix build bustage on x86.
e1000e: consistent use of Rx/Tx vs. RX/TX/rx/tx in comments/logs
e1000e: update Copyright for 2011
e1000: Avoid unhandled IRQ
r8169: keep firmware in memory.
netdev: tilepro: Use is_unicast_ether_addr helper
etherdevice.h: Add is_unicast_ether_addr function
ks8695net: Use default implementation of ethtool_ops::get_link
ks8695net: Disable non-working ethtool operations
USB CDC NCM: Don't deref NULL in cdc_ncm_rx_fixup() and don't use uninitialized variable.
vxge: Remember to release firmware after upgrading firmware
netdev: bfin_mac: Remove is_multicast_ether_addr use in netdev_for_each_mc_addr
ipsec: update MAX_AH_AUTH_LEN to support sha512
...
After recent changes (percpu stats on vlan/tunnels...), we no longer
need the per struct netdev_queue tx_bytes/tx_packets/tx_dropped counters.
Only remaining users are ixgbe, sch_teql, gianfar & macvlan :
1) ixgbe can be converted to use existing tx_ring counters.
2) macvlan incremented txq->tx_dropped, it can use the
dev->stats.tx_dropped counter.
3) sch_teql : almost revert ab35cd4b8f (Use net_device internal stats)
Now we have ndo_get_stats64(), use it, even for "unsigned long"
fields (No need to bring back a struct net_device_stats)
4) gianfar adds a stats structure per tx queue to hold
tx_bytes/tx_packets
This removes a lockdep warning (and possible lockup) in rndis gadget,
calling dev_get_stats() from hard IRQ context.
Ref: http://www.spinics.net/lists/netdev/msg149202.html
Reported-by: Neil Jones <neiljay@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Jarek Poplawski <jarkao2@gmail.com>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
CC: Sandeep Gopalpet <sandeep.kumar@freescale.com>
CC: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The IPv6 tproxy patches split IPv6 defragmentation off of conntrack, but
failed to update the #ifdef stanzas guarding the defragmentation related
fields and code in skbuff and conntrack related code in nf_defrag_ipv6.c.
This patch adds the required #ifdefs so that IPv6 tproxy can truly be used
without connection tracking.
Original report:
http://marc.info/?l=linux-netdev&m=129010118516341&w=2
Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
HTB takes into account that an skb is segmented in its stats updates.
Generalize this to all schedulers.
They should use qdisc_bstats_update() helper instead of manipulating
bstats.bytes and bstats.packets
Add bstats_update() helper too for classes that use
gnet_stats_basic_packed fields.
Note: Right now, the TCQ_F_CAN_BYPASS shortcut can be taken only if no
stab is setup on qdisc.
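A sketch of the helpers being introduced (close to, though not necessarily
identical with, the final code):

static inline void bstats_update(struct gnet_stats_basic_packed *bstats,
                                 const struct sk_buff *skb)
{
        bstats->bytes += qdisc_pkt_len(skb);
        bstats->packets += skb_is_gso(skb) ? skb_shinfo(skb)->gso_segs : 1;
}

static inline void qdisc_bstats_update(struct Qdisc *sch,
                                       const struct sk_buff *skb)
{
        bstats_update(&sch->bstats, skb);
}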
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Added alloc_netdev_mqs function which allows the number of transmit and
receive queues to be specified independently. alloc_netdev_mq was
changed to a macro to call the new function. Also added
alloc_etherdev_mqs with same purpose.
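A sketch of the resulting interface (as described here; later kernels grew
additional arguments):

struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
                                    void (*setup)(struct net_device *),
                                    unsigned int txqs, unsigned int rxqs);

/* The old single-count variant becomes a thin wrapper, e.g.: */
#define alloc_netdev_mq(sizeof_priv, name, setup, count) \
        alloc_netdev_mqs(sizeof_priv, name, setup, count, count)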
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to compute the features for other offloads (primarily
scatter/gather), we need to first check the ability of the NIC to
offload the checksum for the packet. Since we have already computed
this, we can directly use the result instead of figuring it out
again.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This switches skb_need_linearize() to use the features that have
been centrally computed. In doing so, this fixes a problem where
scatter/gather should not be used because the card does not support
checksum offloading on that type of packet. On device registration
we only check that some form of checksum offloading is available if
scatter/gather is enabled, but we must also check at transmission
time. Examples of this include IPv6 or vlan packets on a NIC that
only supports IPv4 offloading.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This switches dev_gso_segment() to use the device features computed
by the centralized routine. In doing so, it fixes a problem where
it would always use dev->features, instead of those appropriate
to the number of vlan tags if any are present.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that there is a single function that can compute the device
features relevant to a packet, we don't want to run it for each
offload. This converts netif_needs_gso() to take the features
of the device, rather than computing them itself.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
netif_get_vlan_features() is currently only used by netif_needs_gso(),
so it only concerns itself with GSO features. However, several other
places also should take into account the contents of the packet when
deciding whether to offload to hardware. This generalizes the function
to return features about all of the various forms of offloading. Since
offloads tend to be linked together, this avoids duplicating the logic
in each location (i.e. the scatter/gather code also needs the checksum
logic).
Suggested-by: Michał Mirosław <mirqus@gmail.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently only have software fallback for one type of checksum: the
TCP/UDP one's complement. This means that a protocol that uses hardware
offloading for a different type of checksum (FCoE, SCTP) must directly
check the device's features and do the right thing ahead of time. By
the time we get to dev_can_checksum(), we're only deciding whether to
apply the one algorithm in software or hardware. NETIF_F_HW_CSUM has the
same capabilities as the software version, so we should always use it if
present. The primary advantage of this is multiply tagged vlans can use
hardware checksumming.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix new kernel-doc notation warning in net/core/filter.c:
Warning(net/core/filter.c:172): No description found for parameter 'fentry'
Warning(net/core/filter.c:172): Excess function parameter 'filter' description in 'sk_run_filter'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Because NLM_F_DUMP is composed of two bits, NLM_F_ROOT | NLM_F_MATCH,
doing "if (x & NLM_F_DUMP)" tests for _either_ of the bits
being set. Because NLM_F_MATCH's value overlaps with NLM_F_EXCL,
non-dump requests with NLM_F_EXCL set are mistaken as dump requests.
Change the condition to test for _all_ bits being set.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
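A sketch of the kind of test the fix implies; the helper name nlmsg_is_dump is hypothetical and only for illustration:
/* NLM_F_DUMP == NLM_F_ROOT | NLM_F_MATCH, and NLM_F_MATCH shares its
 * value with NLM_F_EXCL, so a plain '&' test also matches non-dump
 * requests that carry NLM_F_EXCL.  Require all bits to be set. */
static bool nlmsg_is_dump(const struct nlmsghdr *nlh)
{
	return (nlh->nlmsg_flags & NLM_F_DUMP) == NLM_F_DUMP;
}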
* 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (33 commits)
usb: don't use flush_scheduled_work()
speedtch: don't abuse struct delayed_work
media/video: don't use flush_scheduled_work()
media/video: explicitly flush request_module work
ioc4: use static work_struct for ioc4_load_modules()
init: don't call flush_scheduled_work() from do_initcalls()
s390: don't use flush_scheduled_work()
rtc: don't use flush_scheduled_work()
mmc: update workqueue usages
mfd: update workqueue usages
dvb: don't use flush_scheduled_work()
leds-wm8350: don't use flush_scheduled_work()
mISDN: don't use flush_scheduled_work()
macintosh/ams: don't use flush_scheduled_work()
vmwgfx: don't use flush_scheduled_work()
tpm: don't use flush_scheduled_work()
sonypi: don't use flush_scheduled_work()
hvsi: don't use flush_scheduled_work()
xen: don't use flush_scheduled_work()
gdrom: don't use flush_scheduled_work()
...
Fixed up trivial conflict in drivers/media/video/bt8xx/bttv-input.c
as per Tejun.
Leonardo Chiquitto found poll() could block forever on tcp sockets when
urgent data was received, if the event flags only contain POLLPRI.
He did a bisection and found commit 4938d7e023 (poll: avoid extra
wakeups in select/poll) was the source of the problem.
The problem is that TCP sockets use the standard sock_def_readable() function
for their sk_data_ready() handler, and sock_def_readable() doesn't signal
POLLPRI.
Only TCP is affected by the problem. Adding POLLPRI to the list of flags
might trigger unnecessary schedules, but URGENT handling is such a
seldom-used feature that this seems a good compromise.
Thanks a lot to Leonardo for providing the bisection result and a test
program as well.
Reference : http://www.spinics.net/lists/netdev/msg151793.html
Reported-and-bisected-by: Leonardo Chiquitto <leonardo.lists@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 4465b46900.
Conflicts:
net/ipv4/fib_frontend.c
As reported by Ben Greear, this causes regressions:
> Change 4465b46900 caused rules
> to stop matching the input device properly because the
> FLOWI_FLAG_MATCH_ANY_IIF is always defined in ip_dev_find().
>
> This breaks rules such as:
>
> ip rule add pref 512 lookup local
> ip rule del pref 0 lookup local
> ip link set eth2 up
> ip -4 addr add 172.16.0.102/24 broadcast 172.16.0.255 dev eth2
> ip rule add to 172.16.0.102 iif eth2 lookup local pref 10
> ip rule add iif eth2 lookup 10001 pref 20
> ip route add 172.16.0.0/24 dev eth2 table 10001
> ip route add unreachable 0/0 table 10001
>
> If you had a second interface 'eth0' that was on a different
> subnet, pinging a system on that interface would fail:
>
> [root@ct503-60 ~]# ping 192.168.100.1
> connect: Invalid argument
Reported-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
MAINTAINERS
arch/arm/mach-omap2/pm24xx.c
drivers/scsi/bfa/bfa_fcpim.c
Needed to update to apply fixes for which the old branch was too
outdated.
We can translate pseudo load instructions at filter check time to
dedicated instructions to speed up filtering and avoid one switch().
libpcap currently uses SKF_AD_PROTOCOL, but custom filters probably use
other ancillary accesses.
Note : I made the assumption that ancillary data was always accessed with
BPF_LD|BPF_?|BPF_ABS instructions, not with BPF_LD|BPF_?|BPF_IND ones
(offset given by the K constant, not by K + X register).
On x86_64, this saves a few bytes of text :
# size net/core/filter.o.*
text data bss dec hex filename
4864 0 0 4864 1300 net/core/filter.o.new
4944 0 0 4944 1350 net/core/filter.o.old
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On Friday, 17 December 2010 at 10:26 +0100, Eric Dumazet wrote:
>
> I think we can add this after latest Changli patch :
>
> He does one skb_clone() before calling the sniffers.
> We could set timestamp on this clone, instead of original skb.
>
> Problem solved.
>
[PATCH net-next-2.6] net: timestamp cloned packet in dev_queue_xmit_nit
Now that we do one clone of the skb if at least one sniffer might take the packet,
we can also do the skb timestamping on the clone and leave the original packet
unchanged.
This is a generalization of commit 8caf153974 (net: sch_netem: Fix an
inconsistency in ingress netem timestamps.)
This way, we can have a good idea when packets are delivered to our
stack (tcpdump -i ifb0), while a tcpdump on the original device gives
timestamps right before ingressing.
This also speeds up our stack, avoiding taking timestamps if not needed.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Changli Gao <xiaosuo@gmail.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Jarek Poplawski <jarkao2@gmail.com>
Acked-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove "pktgen: " prefix string from one pr_info.
pr_fmt adds it, so this is a duplicate.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
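For context, a pr_fmt definition of the kind pktgen.c is assumed to carry, which already prepends the module name to every pr_info():
/* every pr_info()/pr_err() in this file gets this prefix automatically,
 * so adding "pktgen: " in the format string would print it twice */
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt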
In dev_queue_xmit_nit(), we have to clone skbs since we need to mangle them;
however, we don't need to clone skbs for all the packet_types.
Except for the first packet_type, we increase skb->users instead of calling
skb_clone().
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace skb->csum_start - skb_headroom(skb) with skb_checksum_start_offset().
Note for usb/smsc95xx: skb->data - skb->head == skb_headroom(skb).
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
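A minimal sketch of what the helper could look like, matching the open-coded expression it replaces:
static inline int skb_checksum_start_offset(const struct sk_buff *skb)
{
	/* offset of the checksum start relative to skb->data */
	return skb->csum_start - skb_headroom(skb);
}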
Special care is taken inside sk_prot_alloc to avoid overwriting
skc_node/skc_nulls_node. We should also avoid overwriting
skc_bind_node/skc_portaddr_node.
The patch fixes the following crash:
BUG: unable to handle kernel paging request at fffffffffffffff0
IP: [<ffffffff812ec6dd>] udp4_lib_lookup2+0xad/0x370
[<ffffffff812ecc22>] __udp4_lib_lookup+0x282/0x360
[<ffffffff812ed63e>] __udp4_lib_rcv+0x31e/0x700
[<ffffffff812bba45>] ? ip_local_deliver_finish+0x65/0x190
[<ffffffff812bbbf8>] ? ip_local_deliver+0x88/0xa0
[<ffffffff812eda35>] udp_rcv+0x15/0x20
[<ffffffff812bba45>] ip_local_deliver_finish+0x65/0x190
[<ffffffff812bbbf8>] ip_local_deliver+0x88/0xa0
[<ffffffff812bb2cd>] ip_rcv_finish+0x32d/0x6f0
[<ffffffff8128c14c>] ? netif_receive_skb+0x99c/0x11c0
[<ffffffff812bb94b>] ip_rcv+0x2bb/0x350
[<ffffffff8128c14c>] netif_receive_skb+0x99c/0x11c0
Signed-off-by: Leonard Crestez <lcrestez@ixiacom.com>
Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add dev_close_many and dev_deactivate_many to factorize another
sync-rcu operation on the netdevice unregister path.
$ modprobe dummy numdummies=10000
$ ip link set dev dummy* up
$ time rmmod dummy
Without the patch With the patch
real 0m 24.63s real 0m 5.15s
user 0m 0.00s user 0m 0.00s
sys 0m 6.05s sys 0m 5.14s
Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the calculation of the Tx hash for a given hash range into a separate
function and define skb_tx_hash(), which calculates a Tx hash for the
[0; dev->real_num_tx_queues - 1] hash value range, using this
function (__skb_tx_hash()).
Signed-off-by: Vladislav Zolotarov <vladz@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
cancel_rearming_delayed_work[queue]() has been superseded by
cancel_delayed_work_sync() quite some time ago. Convert all the
in-kernel users. The conversions are completely equivalent and
trivial.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: "David S. Miller" <davem@davemloft.net>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
Cc: Jeff Garzik <jgarzik@pobox.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Cc: netdev@vger.kernel.org
Cc: Anton Vorontsov <cbou@mail.ru>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Alex Elder <aelder@sgi.com>
Cc: xfs-masters@oss.sgi.com
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: netfilter-devel@vger.kernel.org
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: linux-nfs@vger.kernel.org
After commit c1f19b51d1 (net: support time stamping in phy devices.),
kernel might crash if CONFIG_NETWORK_PHY_TIMESTAMPING=y and
skb_defer_rx_timestamp() handles a packet without an ethernet header.
Fixes kernel bugzilla #24102
Reference: https://bugzilla.kernel.org/show_bug.cgi?id=24102
Reported-and-tested-by: Andrew Watts <akwatts@ymail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While an interface is down, many implementations of
ethtool_ops::get_link, including the default, ethtool_op_get_link(),
will report the last link state seen while the interface was up. In
general the current physical link state is not available if the
interface is down.
Define ETHTOOL_GLINK to reflect whether the interface *and* any
physical port have a working link, and consistently return 0 when the
interface is down.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We know for sure pktgen is going to write skb->data right after
*_alloc_skb, causing unnecessary cache misses.
The idea is to add a prefetchw() call to prefetch the first cache line
indicated by skb->data. On systems with Adjacent Cache Line Prefetch,
two cache lines are probably prefetched.
With this patch, pktgen on an Intel SR1625 server with two E5530
quad-core processors and a single ixgbe-based NIC went from 8.63 Mpps
to 9.03 Mpps, a 4.6% improvement.
Signed-off-by: Junchang Wang <junchangwang@gmail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__load_pointer() checks that the data we fetch from the skb is within the head
portion, but assumes we fetch one byte, instead of up to four.
This won't crash because we have extra bytes (struct skb_shared_info)
after the head, but this can read uninitialized bytes.
Fix this by using the size of the data (1, 2, 4 bytes) in the test.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Followup of commit b178bb3dfc (net: reorder struct sock fields)
Optimize the INET input path a bit further, by :
1) moving sk_refcnt close to sk_lock.
This reduces the number of dirtied cache lines by one on 64bit arches (with
64 byte cache lines).
2) moving inet_daddr & inet_rcv_saddr to the beginning of sk
(same cache line as hash / family / bound_dev_if / nulls_node)
This reduces the number of accessed cache lines in lookups by one, and doesn't
increase the size of inet and timewait socks.
inet and tw sockets now share the same place-holder for these fields.
Before patch :
offsetof(struct sock, sk_refcnt) = 0x10
offsetof(struct sock, sk_lock) = 0x40
offsetof(struct sock, sk_receive_queue) = 0x60
offsetof(struct inet_sock, inet_daddr) = 0x270
offsetof(struct inet_sock, inet_rcv_saddr) = 0x274
After patch :
offsetof(struct sock, sk_refcnt) = 0x44
offsetof(struct sock, sk_lock) = 0x48
offsetof(struct sock, sk_receive_queue) = 0x68
offsetof(struct inet_sock, inet_daddr) = 0x0
offsetof(struct inet_sock, inet_rcv_saddr) = 0x4
compute_score() (udp or tcp) now uses a single cache line per ignored
item, instead of two.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Avoid some atomic ops on dst refcount, calling dev_queue_xmit_nit()
after skb_dst_drop() in dev_hard_start_xmit().
When queueing a packet into af_packet socket, we drop dst anyway.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_run_filter() doesn't write to the skb; change its prototype to reflect
this.
Fix two af_packet comments.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On Sunday, 05 December 2010 at 09:19 +0100, Eric Dumazet wrote:
> Hmm..
>
> If somebody can explain why RTNL is held in arp_ioctl() (and therefore
> in arp_req_delete()), we might first remove RTNL use in arp_ioctl() so
> that your patch can be applied.
>
> Right now it is not good, because RTNL won't necessarily be held when you
> are going to call arp_invalidate() ?
While doing this analysis, I found a refcount bug in llc, I'll send a
patch for net-2.6
Meanwhile, here is the patch for net-next-2.6
Your patch then can be applied after mine.
Thanks
[PATCH] net: RCU conversion of dev_getbyhwaddr() and arp_ioctl()
dev_getbyhwaddr() was called under RTNL.
Rename it to dev_getbyhwaddr_rcu() and change all its caller to now use
RCU locking instead of RTNL.
Change arp_ioctl() to use RCU instead of RTNL locking.
Note: this fixes a dev refcount bug in llc
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The dev field of the ingress queue was not initialized, so a NULL
pointer dereference happens in qdisc_alloc().
Move the initialization of tx queues to netif_alloc_netdev_queues().
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hi,
This patch fixes a typo in net/core/datagram.c and in net/sctp/socket.c
Regards,
David Shwartz
Signed-off-by: David Shwartz <dshwatrz@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We added some security checks in commit 57fe93b374
(filter: make sure filters dont read uninitialized memory) to close a
potential leak of kernel information to user.
This added a potential extra cost at run time, while we can perform a
check of the filter itself, to make sure a malicious user doesn't try to
abuse us.
This patch adds a check_loads() function, whose sole purpose is to
make this check, allocating a temporary array of masks. We scan the
filter and propagate bitmask information, telling us if a load M(K) is
allowed because a previous store M(K) is guaranteed. (So that
sk_run_filter() can possibly not read uninitialized memory)
Note: this can uncover application bugs, denying a filter attach that
was previously allowed.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Dan Rosenberg <drosenberg@vsecurity.com>
Cc: Changli Gao <xiaosuo@gmail.com>
Acked-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add SKF_AD_RXHASH and SKF_AD_CPU to filter ancillary mechanism,
to be able to build advanced filters.
This can help spreading packets on several sockets with a fast
selection, after RPS dispatch to N cpus for example, or to catch a
percentage of flows in one queue.
tcpdump -s 500 "cpu = 1" :
[0] ld CPU
[1] jeq #1 jt 2 jf 3
[2] ret #500
[3] ret #0
# take 12.5 % of flows (average)
tcpdump -s 1000 "rxhash & 7 = 2" :
[0] ld RXHASH
[1] and #7
[2] jeq #2 jt 3 jf 4
[3] ret #1000
[4] ret #0
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Rui <wirelesser@gmail.com>
Acked-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
NETIF_F_HW_CSUM is a superset of NETIF_F_IP_CSUM+NETIF_F_IPV6_CSUM, but
some drivers miss the difference. Fix this and also fix UFO dependency
on checksumming offload as it makes the same mistake in assumptions.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Acked-by: Jon Mason <jon.mason@exar.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pavel Emelyanov tried to fix a race between sk_filter_(de|at)tach and
sk_clone() in commit 47e958eac2
The problem is we can have several clones sharing a common sk_filter, and
these clones might want to sk_filter_attach() their own filters at the
same time, and can overwrite old_filter->rcu, corrupting RCU queues.
We cannot use filter->rcu without being sure no other thread could do
the same thing.
Switch the code to a more conventional ref-counting technique : do the
atomic decrement immediately and queue one RCU callback when the last
reference is released.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the skb head is allocated by kmalloc(), it might be larger than what
was actually requested because of discrete kmem cache sizes. Before
reallocating a new skb head, check if the current one has the needed
extra size.
Do this check only if the skb head is not shared.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This will also improve handling of ipv6 tcp socket request
backlog when syncookies are not enabled. When backlog
becomes very deep, last quarter of backlog is limited to
validated destinations. Previously only ipv4 implemented
this logic, but now ipv6 does too.
Now we are only one step away from enabling timewait
recycling for ipv6, and that step is simply filling in
the implementation of tcp_v6_get_peer() and
tcp_v6_tw_get_peer().
Signed-off-by: David S. Miller <davem@davemloft.net>
Allocate qdisc memory according to NUMA properties of cpus included in
xps map.
To be effective, qdisc should be (re)setup after changes
of /sys/class/net/eth<n>/queues/tx-<n>/xps_cpus
I added a numa_node field in struct netdev_queue, containing the NUMA node
if all cpus included in xps_cpus share the same node, else -1.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Avoid sparse warnings : add __rcu annotations and use
rcu_dereference_protected() where necessary.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
store_xps_map() allocates maps that are used by a single cpu, so it makes
sense to use NUMA allocations.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds an XPS_CONFIG option to enable and disable XPS. This is
done in the same manner as RPS_CONFIG. This also fixes a build
failure in the XPS code when SMP is not enabled.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When testing struct netdev_queue state against the FROZEN bit, we also test
the XOFF bit. We can test both bits at once and save some cycles.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
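A sketch of the combined test; the bit and helper names below are illustrative, assuming the usual __QUEUE_STATE_XOFF / __QUEUE_STATE_FROZEN bits:
#define QUEUE_STATE_XOFF_OR_FROZEN \
	((1 << __QUEUE_STATE_XOFF) | (1 << __QUEUE_STATE_FROZEN))

static inline int netif_tx_queue_frozen_or_stopped(const struct netdev_queue *q)
{
	/* one load and one test instead of two separate bit tests */
	return q->state & QUEUE_STATE_XOFF_OR_FROZEN;
}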
As David pointed out correctly, updates to af-specific attributes
are currently not atomic. If multiple changes are requested and
one of them fails, previous updates may have been applied already,
leaving the link behind in an undefined state.
This patch splits the function parse_link_af() into two functions,
validate_link_af() and set_link_af(). validate_link_af() is placed in
validate_linkmsg() to check for errors as early as possible, before
any changes to the link have been made. set_link_af() is called to
commit the changes later.
This method is not fail-proof; while it is currently sufficient
to make set_link_af() unable to fail and thus 100% atomic, the
validation-function method will not be able to detect all error
scenarios in the future. There will likely always be errors
depending on state which is e.g. not protected by rtnl_mutex
and thus may change between validation and setting.
Also, instead of silently ignoring unknown address families and
config blocks for address families which did not register a set
function, the errors EAFNOSUPPORT and EOPNOSUPPORT respectively are
returned to avoid committing 4 out of 5 update requests without
notifying the user.
Signed-off-by: Thomas Graf <tgraf@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements transmit packet steering (XPS) for multiqueue
devices. XPS selects a transmit queue during packet transmission based
on configuration. This is done by mapping the CPU transmitting the
packet to a queue. This is the transmit-side analogue to RPS: where
RPS selects a CPU based on the receive queue, XPS selects a queue
based on the CPU (previously there was an XPS patch from Eric
Dumazet, but that might more appropriately be called transmit completion
steering).
Each transmit queue can be associated with a number of CPUs which will
use the queue to send packets. This is configured as a CPU mask on a
per queue basis in:
/sys/class/net/eth<n>/queues/tx-<n>/xps_cpus
The mappings are stored per device in an inverted data structure that
maps CPUs to queues. In the netdevice structure this is an array of
num_possible_cpu structures where each structure holds an array of
queue_indexes for queues which that CPU can use.
The benefits of XPS are improved locality in the per queue data
structures. Also, transmit completions are more likely to be done
nearer to the sending thread, so this should promote locality back
to the socket on free (e.g. UDP). The benefits of XPS are dependent on
cache hierarchy, application load, and other factors. XPS would
nominally be configured so that a queue would only be shared by CPUs
which are sharing a cache; the degenerate configuration would be that
each CPU has its own queue.
Below are some benchmark results which show the potential benefit of
this patch. The netperf test has 500 instances of the netperf TCP_RR test
with 1 byte req. and resp.
bnx2x on 16 core AMD
XPS (16 queues, 1 TX queue per CPU) 1234K at 100% CPU
No XPS (16 queues) 996K at 100% CPU
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In dev_pick_tx, don't do the work of calculating the queue index or setting
the index in the sock unless the device has more than one queue. This
allows the sock to be set only with a queue index of a multi-queue
device, which is desirable if devices are stacked like in a tunnel.
We also allow the mapping of a socket to queue to be changed. To
maintain in order packet transmission a flag (ooo_okay) has been
added to the sk_buff structure. If a transport layer sets this flag
on a packet, the transmit queue can be changed for the socket.
Presumably, the transport would set this if there was no possibility
of creating OOO packets (for instance, there are no packets in flight
for the socket). This patch includes the modification in TCP output
for setting this flag.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lower SCM_MAX_FD from 255 to 253 so that allocations for scm_fp_list are
halved. (commit f8d570a4 added two pointers to this structure)
scm_fp_dup() should not copy the whole structure (and trigger kmemcheck
warnings), but only the used part. While we are at it, only allocate
the needed size.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Unloading the pktgen module takes ~6 seconds on a 64 cpu machine, to stop
64 kthreads.
Add a pktgen_exiting variable to let kernel threads die faster, so that
kthread_stop() doesn't have to wait too long for them. This variable is
not tested in the fast path.
Note : before exiting from pktgen_thread_worker(), we must make sure
kthread_stop() is waiting for this thread to be stopped, as is done
in kernel/softirq.c
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We forgot to use __GFP_HIGHMEM in several __vmalloc() calls.
In ceph, add the missing flag.
In fib_trie.c, xfrm_hash.c and request_sock.c, using vzalloc() is
cleaner and allows using HIGHMEM pages as well.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
At compile time, we can replace the DIV_K instruction (divide by a
constant value) by a reciprocal divide.
At exec time, the expensive divide is replaced by a multiply, a less
expensive operation on most processors.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
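A sketch of the idea, using the reciprocal_value()/reciprocal_divide() helpers from lib/reciprocal_div.c; the wrapper function below is illustrative, not part of the patch:
/* replace a run-time divide by constant 'k' (k != 0) with a multiply +
 * shift: precompute the reciprocal once, then reuse it in the fast path */
static u32 divide_by_constant(u32 a, u32 k)
{
	u32 rec = reciprocal_value(k);	/* done once, e.g. at filter check time */

	return reciprocal_divide(a, rec);	/* cheap: ((u64)a * rec) >> 32 */
}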
Starting the translated instructions at 1 instead of 0 allows us to
remove one decrement at check time and makes the codes[] array init
cleaner.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the pc variable to avoid arithmetic to compute fentry at each filter
instruction. Jumps directly manipulate the fentry pointer.
As the last instruction of filter[] is guaranteed to be a RETURN, and
all jumps are before the last instruction, we don't need to check filter
bounds (number of instructions in the filter array) at each iteration, so we
remove it from the sk_run_filter() params.
On x86_32 remove f_k var introduced in commit 57fe93b374
(filter: make sure filters dont read uninitialized memory)
Note : We could use a CONFIG_ARCH_HAS_{FEW|MANY}_REGISTERS in order to
avoid too many ifdefs in this code.
This helps the compiler use cpu registers to hold fentry and the A
accumulator.
On x86_32, this saves 401 bytes, and more importantly, sk_run_filter()
runs much faster because of less register pressure (one less conditional
branch per BPF instruction).
# size net/core/filter.o net/core/filter_pre.o
text data bss dec hex filename
2948 0 0 2948 b84 net/core/filter.o
3349 0 0 3349 d15 net/core/filter_pre.o
on x86_64 :
# size net/core/filter.o net/core/filter_pre.o
text data bss dec hex filename
5173 0 0 5173 1435 net/core/filter.o
5224 0 0 5224 1468 net/core/filter_pre.o
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix kernel-doc warning for sk_filter_rcu_release():
Warning(net/core/filter.c:586): missing initial short description on line:
* sk_filter_rcu_release: Release a socket filter by rcu_head
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
BPF_S_* values are used internally and should not be exposed to others.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since repeating the u16-to-u8 value conversion using switch() case
statements is wasteful, this patch introduces a u16-to-u8 mapping table
and removes most of the case statements. As a result, the size of net/core/filter.o
is reduced by about 29% on x86.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add an option to set skb priority in pktgen. Useful for testing
QoS features. Also, by running pktgen on the vlan device, the
qdisc on the real device can be tested.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
netif_set_real_num_rx_queues() can decrement and increment
the number of rx queues. For example ixgbe does this as
features and offloads are toggled. Presumably this could
also happen across down/up on most devices if the available
resources changed (cpu offlined).
The kobject needs to be zero'd in this case so that the
state is not preserved across kobject_put()/kobject_init_and_add().
This resolves the following error report.
ixgbe 0000:03:00.0: eth2: NIC Link is Up 10 Gbps, Flow Control: RX/TX
kobject (ffff880324b83210): tried to init an initialized object, something is seriously wrong.
Pid: 1972, comm: lldpad Not tainted 2.6.37-rc18021qaz+ #169
Call Trace:
[<ffffffff8121c940>] kobject_init+0x3a/0x83
[<ffffffff8121cf77>] kobject_init_and_add+0x23/0x57
[<ffffffff8107b800>] ? mark_lock+0x21/0x267
[<ffffffff813c6d11>] net_rx_queue_update_kobjects+0x63/0xc6
[<ffffffff813b5e0e>] netif_set_real_num_rx_queues+0x5f/0x78
[<ffffffffa0261d49>] ixgbe_set_num_queues+0x1c6/0x1ca [ixgbe]
[<ffffffffa0262509>] ixgbe_init_interrupt_scheme+0x1e/0x79c [ixgbe]
[<ffffffffa0274596>] ixgbe_dcbnl_set_state+0x167/0x189 [ixgbe]
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Each net_device contains address family specific data such as
per device settings and statistics. We already expose this data
via procfs/sysfs and partially netlink.
The netlink method requires the requester to send one RTM_GETLINK
request for each address family it wishes to receive data for,
and then merge this data itself.
This patch implements a new API which combines all address family
specific link data in a new netlink attribute IFLA_AF_SPEC.
IFLA_AF_SPEC contains a sequence of nested attributes, one for each
address family which in turn defines the structure of its own
attribute. Example:
[IFLA_AF_SPEC] = {
[AF_INET] = {
[IFLA_INET_CONF] = ...,
},
[AF_INET6] = {
[IFLA_INET6_FLAGS] = ...,
[IFLA_INET6_CONF] = ...,
}
}
The API also allows for address families to implement a function
which parses the IFLA_AF_SPEC attribute sent by userspace to
implement address family specific link options.
Signed-off-by: Thomas Graf <tgraf@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
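As an illustration, the kernel side could emit the nested attribute roughly as sketched below, assuming an rtnl_af_ops registration list with a fill_link_af() callback; this is a fragment of a link fill routine written per the description above, not verified against the exact patch:
struct nlattr *af_spec, *af;
struct rtnl_af_ops *af_ops;

af_spec = nla_nest_start(skb, IFLA_AF_SPEC);
if (!af_spec)
	goto nla_put_failure;

list_for_each_entry(af_ops, &rtnl_af_ops, list) {
	if (!af_ops->fill_link_af)
		continue;

	af = nla_nest_start(skb, af_ops->family);	/* e.g. AF_INET */
	if (!af || af_ops->fill_link_af(skb, dev) < 0)
		goto nla_put_failure;

	nla_nest_end(skb, af);			/* close the per-family nest */
}
nla_nest_end(skb, af_spec);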
ERROR: "netif_get_vlan_features" [drivers/net/xen-netfront.ko] undefined!
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch moves RX queue allocation to alloc_netdev_mq and freeing of
the queues to free_netdev (symmetric to TX queue allocation). Each
kobject RX queue takes a reference to the queue's device so that the
device can't be freed before all the kobjects have been released; this
obviates the need for reference counts specific to RX queues.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TX queues are now allocated in alloc_netdev_mq and freed in
free_netdev.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently use vlan_features to check for TSO support if there is
a vlan tag. However, it's quite likely that the NIC is not able to
do TSO when there is an arbitrary number of tags. Therefore if there
is more than one tag (in-band or out-of-band), fall back to software
emulation.
Signed-off-by: Jesse Gross <jesse@nicira.com>
CC: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We assume that hardware TSO can't support multiple levels of vlan tags
but we allow it to be done. Therefore, enable GSO to parse these tags
so we can fallback to software.
Signed-off-by: Jesse Gross <jesse@nicira.com>
CC: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When checking if it is necessary to linearize a packet, we currently
use vlan_features if the packet contains either an in-band or out-
of-band vlan tag. However, in-band tags aren't special in any way
for scatter/gather since they are part of the packet buffer and are
simply more data to DMA. Therefore, only use vlan_features for out-
of-band tags, which could potentially have some interaction with
scatter/gather.
Signed-off-by: Jesse Gross <jesse@nicira.com>
CC: Ben Hutchings <bhutchings@solarflare.com>
Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
nlmsg_total_size() calculates the length of a netlink message
including header and alignment. nla_total_size() calculates the
space an individual attribute consumes which was meant to be used
in this context.
Also, ensure to account for the attribute header for the
IFLA_INFO_XSTATS attribute as implementations of get_xstats_size()
seem to assume that we do so.
The addition of two message headers minus the missing attribute
header resulted in a calculated message size that was larger than
required. Therefore we never risked running out of skb tailroom.
Signed-off-by: Thomas Graf <tgraf@infradead.org>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Robin Holt tried to boot a 16TB machine and found some limits were
reached : sysctl_tcp_mem[2], sysctl_udp_mem[2]
We can switch the infrastructure to use "long" instead of "int", now that
atomic_long_t primitives are available for free.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Reported-by: Robin Holt <holt@sgi.com>
Reviewed-by: Robin Holt <holt@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a possibility malicious users can get limited information about
the uninitialized stack mem[] array. Even if the sk_run_filter() result is
bound to the packet length (0 .. 65535), we could imagine this being used by
a hostile user.
Initializing the mem[] array, as Dan Rosenberg suggested in his patch, is
expensive since most filters don't even use this array.
It's hard to do this filter validation in sk_chk_filter(), because of
the jumps. This might be done later.
In this patch, I use a bitmap (a single long var) so that only filters
using mem[] loads/stores pay the price of added security checks.
For other filters, the additional cost is a single instruction.
[ Since we access fentry->k a lot now, cache it in a local variable
and mark filter entry pointer as const. -DaveM ]
Reported-by: Dan Rosenberg <drosenberg@vsecurity.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
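A sketch of the bitmap idea inside the interpreter loop; this is a fragment, and the variable names follow the description above rather than the exact patch:
	u32 mem[BPF_MEMWORDS];		/* scratch memory, left uninitialized */
	unsigned long memvalid = 0;	/* one bit per mem[] slot written so far */

	/* inside the instruction switch of sk_run_filter(): */
	case BPF_S_ST:
		memvalid |= 1UL << fentry->k;
		mem[fentry->k] = A;
		continue;
	case BPF_S_LD_MEM:
		/* only read back slots a previous store made valid; else load 0 */
		A = (memvalid & (1UL << fentry->k)) ? mem[fentry->k] : 0;
		continue;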
Followup of commit ef885afbf8 (net: use rcu_barrier() in
rollback_registered_many)
dst_dev_event() scans a garbage dst list that might be fed by various
network notifiers at device dismantle time.
It's important to call dst_dev_event() after other notifiers, or we might
enter the infamous msleep(250) in netdev_wait_allrefs(), and wait one
second before calling call_netdevice_notifiers(NETDEV_UNREGISTER,
dev) again to properly remove the last device references.
Use priority -10 to let dst_dev_notifier be called after other network
notifiers (they have the default 0 priority).
Reported-by: Ben Greear <greearb@candelatech.com>
Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reported-by: Octavian Purdila <opurdila@ixiacom.com>
Reported-by: Benjamin LaHaise <bcrl@kvack.org>
Tested-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes a bug reported by backyes.
The first time through, pktgen uses a queue_map that has not been initialized
by set_cur_queue_map(pkt_dev);
Signed-off-by: Junchang Wang <junchangwang@gmail.com>
Signed-off-by: Backyes <backyes@mail.ustc.edu.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
This should fix the following warning:
net/core/pktgen.c: In function ‘pktgen_if_write’:
net/core/pktgen.c:890: warning: comparison of distinct pointer types lacks a cast
Signed-off-by: Dmitry Torokhov <dtor@mail.ru>
Reviewed-by: Nelson Elhage <nelhage@ksplice.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In dev_pick_tx recompute the queue index if the value stored in the
socket is greater than or equal to the number of real queues for the
device. The saved index in the sock structure is not guaranteed to
be appropriate for the egress device (this could happen on a route
change or in the presence of tunnelling). The result of the queue index
being bad would be to return a bogus queue (a crash could presumably
follow).
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A program that accidentally writes too much data to the pktgen file can overflow
the kernel stack and oops the machine. This is only triggerable by root, so
there's no security issue, but it's still an unfortunate bug.
printk() won't print more than 1024 bytes in a single call anyway, so let's
just never copy more than that much data. We're on a fairly shallow stack, so
that should be safe even with CONFIG_4KSTACKS.
Signed-off-by: Nelson Elhage <nelhage@ksplice.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This helps protect us from overflow issues down in the
individual protocol sendmsg/recvmsg handlers. Once
we hit INT_MAX we truncate out the rest of the iovec
by setting the iov_len members to zero.
This works because:
1) For SOCK_STREAM and SOCK_SEQPACKET sockets, partial
writes are allowed and the application will just continue
with another write to send the rest of the data.
2) For datagram oriented sockets, where there must be a
one-to-one correspondence between write() calls and
packets on the wire, INT_MAX is going to be far larger
than the packet size limit the protocol is going to
check for and signal with -EMSGSIZE.
Based upon a patch by Linus Torvalds.
Signed-off-by: David S. Miller <davem@davemloft.net>
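A sketch of the clamping idea; the helper name clamp_iov_len is hypothetical, written per the description above rather than the exact upstream diff:
/* cap the total iovec length at INT_MAX; truncate what doesn't fit */
static int clamp_iov_len(struct iovec *iov, unsigned long nr_segs)
{
	int tot_len = 0;
	unsigned long seg;

	for (seg = 0; seg < nr_segs; seg++) {
		size_t len = iov[seg].iov_len;

		if (len > INT_MAX - tot_len) {
			len = INT_MAX - tot_len;
			iov[seg].iov_len = len;	/* zero out the remainder */
		}
		tot_len += len;
	}
	return tot_len;
}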
Adds __rcu annotation to (struct fib_rule)->ctarget
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
NETIF_F_HW_CSUM indicates the ability to update a TCP/IP-style 16-bit
checksum with the checksum of an arbitrary part of the packet data,
whereas the FCoE CRC is something entirely different.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Cc: stable@kernel.org [2.6.32+]
Signed-off-by: David S. Miller <davem@davemloft.net>
dev_can_checksum() incorrectly returns true in these cases:
1. The skb has both out-of-band and in-band VLAN tags and the device
supports checksum offload for the encapsulated protocol but only with
one layer of encapsulation.
2. The skb has a VLAN tag and the device supports generic checksumming
but not in conjunction with VLAN encapsulation.
Rearrange the VLAN tag checks to avoid these.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some panic reports in fib_rules_lookup() show a rule could have a NULL
pointer as the next pointer in the rules_list.
This can actually happen because of a bug in fib_nl_newrule() : it
checks if the current rule is the destination of unresolved gotos (other
rules have gotos to this about-to-be-inserted rule).
The problem is it does the resolution of the gotos before the rule is
inserted in the rules_list (and has a valid next pointer).
Fix this by moving the rules_list insertion before the changes on gotos.
A lockless reader can no longer follow a ctarget pointer, unless the
destination is ready (has a valid next pointer).
Reported-by: Oleg A. Arkhangelsky <sysoleg@yandex.ru>
Reported-by: Joe Buehler <aspam@cox.net>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add __rcu annotation to :
(struct sock)->sk_filter
And use appropriate rcu primitives to reduce sparse warnings if
CONFIG_SPARSE_RCU_POINTER=y
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
add __rcu annotation to (struct net)->gen, and use
rcu_dereference_protected() in net_assign_generic()
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add __rcu annotations to :
(struct netdev_rx_queue)->rps_map
(struct netdev_rx_queue)->rps_flow_table
struct rps_sock_flow_table *rps_sock_flow_table;
And use appropriate rcu primitives.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(struct net_device)->ip6_ptr is rcu protected :
add __rcu annotation and proper rcu primitives.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Three is definitely too low, and we know from reports that GRE tunnels
stacked as deeply as 37 levels cause stack overflows, so pick some
reasonable value between those two.
Signed-off-by: David S. Miller <davem@davemloft.net>
The temporary variable "i" is needlessly initialized to zero
in two distinct cases in this file:
1) where it is set to zero and then used as an argument in an addition
before being assigned a non-zero value.
2) where it is only used in a standard/typical loop counter
For (1), simply delete the assignment to zero and the uses while it is still
zero; for (2) simply make the loop start at zero as per standard
practice, as seen everywhere else in the same file.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1699 commits)
bnx2/bnx2x: Unsupported Ethtool operations should return -EINVAL.
vlan: Calling vlan_hwaccel_do_receive() is always valid.
tproxy: use the interface primary IP address as a default value for --on-ip
tproxy: added IPv6 support to the socket match
cxgb3: function namespace cleanup
tproxy: added IPv6 support to the TPROXY target
tproxy: added IPv6 socket lookup function to nf_tproxy_core
be2net: Changes to use only priority codes allowed by f/w
tproxy: allow non-local binds of IPv6 sockets if IP_TRANSPARENT is enabled
tproxy: added tproxy sockopt interface in the IPV6 layer
tproxy: added udp6_lib_lookup function
tproxy: added const specifiers to udp lookup functions
tproxy: split off ipv6 defragmentation to a separate module
l2tp: small cleanup
nf_nat: restrict ICMP translation for embedded header
can: mcp251x: fix generation of error frames
can: mcp251x: fix endless loop in interrupt handler if CANINTF_MERRF is set
can-raw: add msg_flags to distinguish local traffic
9p: client code cleanup
rds: make local functions/variables static
...
Fix up conflicts in net/core/dev.c, drivers/net/pcmcia/smc91c92_cs.c and
drivers/net/wireless/ath/ath9k/debug.c as per David
The function napi_reuse_skb is only used inside core.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When there are VLANs on a VETH device, the packets being transmitted
through the VETH device may be 4 bytes bigger than the MTU. A check
in dev_forward_skb did not take this into account and so dropped
these packets.
This patch is needed at least as far back as 2.6.34.7 and should
be considered for -stable.
Signed-off-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function rtnl_kill_links is defined but never used.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that vlan acceleration is handled consistently regardless of usage,
it is possible to enable and disable it at will. This adds support for
Ethtool operations that change the offloading status for debugging
purposes, similar to other forms of hardware acceleration.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently each driver that is capable of vlan hardware acceleration
must be aware of the vlan groups that are configured and then pass
the stripped tag to a specialized receive function. This is
different from other types of hardware offload in that it places a
significant amount of knowledge in the driver itself rather than keeping
it in the networking core.
This makes vlan offloading function more similarly to other forms
of offloading (such as checksum offloading or TSO) by doing the
following:
* On receive, stripped vlans are passed directly to the network
core, without attempting to check for vlan groups or reconstructing
the header if no group
* vlans are made less special by folding the logic into the main
receive routines
* On transmit, the device layer will add the vlan header in software
if the hardware doesn't support it, instead of spreading that logic
out in upper layers, such as bonding.
There are a number of advantages to this:
* Fixes all bugs with drivers incorrectly dropping vlan headers at once.
* Avoids having to disable VLAN acceleration when in promiscuous mode
(good for bridging since it always puts devices in promiscuous mode).
* Keeps VLAN tag separate until given to ultimate consumer, which
avoids needing to do header reconstruction as in tg3 unless absolutely
necessary.
* Consolidates common code in core networking.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently users of hardware vlan acceleration need to know whether
the device supports it before generating packets. However, vlan
acceleration will soon be available in a more flexible manner so
knowing ahead of time becomes much more difficult. This adds
a software fallback path for vlan packets on devices without the
necessary offloading support, similar to other types of hardware
acceleration.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no point using RCU for dsts we allocate for a very short time
(used once).
Change dst_release() to take DST_NOCACHE into account, but also change
skb_dst_set_noref() to force a refcount increment for such dsts.
This is a _huge_ gain, because we don't waste memory to store xx thousand
dsts. Instead of queueing them to RCU, we can free them instantly.
CPU caches can stay hot, re-using the same memory blocks to hold temporary
dsts.
Note : remove unneeded smp_mb__before_atomic_dec(); in dst_release(),
since atomic_dec_return() implies a full memory barrier.
Stress test, 160.000.000 udp frames sent, IP route cache disabled
(DDOS).
Before:
real 0m38.091s
user 0m13.189s
sys 7m53.018s
After:
real 0m29.946s
user 0m12.157s
sys 7m40.605s
For reference, if IP route cache was enabled :
real 0m32.030s
user 0m10.521s
sys 8m15.243s
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
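A rough sketch of the skb_dst_set_noref() side of the change, as described above (not necessarily the exact upstream code):
void skb_dst_set_noref(struct sk_buff *skb, struct dst_entry *dst)
{
	WARN_ON(!rcu_read_lock_held() && !rcu_read_lock_bh_held());
	/* A DST_NOCACHE dst is destroyed as soon as its refcount drops to
	 * zero, so a plain noref pointer is unsafe: take a real reference. */
	if (unlikely(dst->flags & DST_NOCACHE)) {
		dst_hold(dst);
		skb_dst_set(skb, dst);
	} else {
		skb->_skb_refdst = (unsigned long)dst | SKB_DST_NOREF;
	}
}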
This patch introduces netif_alloc_netdev_queues which is called from
register_device instead of alloc_netdev_mq. This makes TX queue
allocation symmetric with RX allocation. Also, queue locks allocation
is done in netdev_init_one_queue. Change set_real_num_tx_queues to
fail if requested number < 1 or greater than number of allocated
queues.
Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Clean up in RX queue allocation. In netif_set_real_num_rx_queues
return error on attempt to set zero queues, or requested number is
greater than number of allocated queues. In netif_alloc_rx_queues,
do BUG_ON if queue_count is zero.
Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In alloc_netdev_mq fail if requested queue_count < 1.
Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In an earlier patch I modified napi_poll so that devices with IFF_MASTER polled
the per_cpu list instead of the device list for napi. I did this because the
bonding driver has no napi instances to poll; it instead expects to check the
slave devices' napi instances, which napi_poll was unaware of. Looking at this
more closely however, I now see this isn't strictly needed. The bond driver's
poll_controller calls the slaves' poll_controller via netpoll_poll_dev, which
recursively calls poll_napi on each slave, allowing those napi instances to get
serviced. The earlier patch isn't at all harmful, it's just not needed, so let's
revert it to make the code cleaner. Sorry for the noise,
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reviewed-by: WANG Cong <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Usually the netpoll path, when performing a napi poll, can get away with just
polling all the napi instances of the configured device. That's not the case for
the bonding driver however, as the napi instances which may wind up getting
flagged as needing polling after the poll_controller call don't belong to the
bonded device, but rather to the slave devices. Fix this by checking the device
in question for the IFF_MASTER flag; if it is set, we know we need to check the full
poll list for this cpu, rather than just the device's napi instance list.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bonding driver currently modifies the netpoll structure in its xmit path
while sending frames from netpoll. This is racy, as other cpus can access the
netpoll structure in parallel. Since the bonding driver points np->dev to a
slave device, other cpus can inadvertently attempt to send data directly to
slave devices, leading to improper locking with the bonding master, lost frames,
and deadlocks. This patch fixes that up.
This patch also removes the real_dev pointer from the netpoll structure as that
data is really only used by bonding in the poll_controller, and we can emulate
its behavior by checking each slave for IS_UP.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fib_nl_delrule() calls synchronize_rcu() for no apparent reason,
while rtnl is held.
I suspect it was done to avoid an atomic_inc_not_zero() in
fib_rules_lookup(), which commit 7fa7cb7109 added anyway.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit b30973f877 (node-aware skb allocation) spread a wrong habit of
allocating net drivers' skbs on a given memory node : the one closest to
the NIC hardware. This is wrong because as soon as we try to scale the
network stack, we need to use many cpus to handle traffic and hit
slub/slab management on cross-node allocations/frees when these cpus
have to alloc/free skbs bound to a central node.
skbs allocated in the RX path are ephemeral; they have a very short
lifetime : the extra cost to maintain NUMA affinity is too expensive. What
appeared to be a nice idea four years ago is in fact a bad one.
In 2010, NIC hardware is multiqueue, or we use RPS to spread the load,
and two 10Gb NICs might deliver more than 28 million packets per second,
needing all the available cpus.
The cost of cross-node handling in the network and vm stacks outweighs the
small benefit hardware had when doing its DMA transfer to its 'local'
memory node at RX time. Even trying to differentiate the two allocations
done for one skb (the sk_buff on the local node, the data part on the NIC
hardware node) is not enough to bring good performance.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We tried very hard to remove all possible dev_hold()/dev_put() pairs in
network stack, using RCU conversions.
There is still an unavoidable device refcount change for every dst we
create/destroy, and this can slow down some workloads (routers or some
app servers, mmap af_packet)
We can switch to a percpu refcount implementation, now that the dynamic per_cpu
infrastructure is mature. On a 64 cpu machine, this consumes 256 bytes
per device.
On x86, dev_hold(dev) code :
before
lock incl 0x280(%ebx)
after:
movl 0x260(%ebx),%eax
incl fs:(%eax)
Stress bench :
(Sending 160.000.000 UDP frames,
IP route cache disabled, dual E5540 @2.53GHz,
32bit kernel, FIB_TRIE)
Before:
real 1m1.662s
user 0m14.373s
sys 12m55.960s
After:
real 0m51.179s
user 0m15.329s
sys 10m15.942s
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
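A sketch of the resulting hold/put helpers, assuming a per-cpu counter (here called pcpu_refcnt) allocated with alloc_percpu() at register time:
static inline void dev_hold(struct net_device *dev)
{
	/* per-cpu increment: no locked cycle, no shared cache line */
	this_cpu_inc(*dev->pcpu_refcnt);
}

static inline void dev_put(struct net_device *dev)
{
	this_cpu_dec(*dev->pcpu_refcnt);
}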
struct dst_ops tracks the number of allocated dsts in an atomic_t field,
subject to high cache line contention in stress workloads.
Switch to a percpu_counter, to reduce the number of times we need to dirty a
central location. Place it on a separate cache line to avoid dirtying
read-only fields.
Stress test :
(Sending 160.000.000 UDP frames,
IP route cache disabled, dual E5540 @2.53GHz,
32bit kernel, FIB_TRIE, SLUB/NUMA)
Before:
real 0m51.179s
user 0m15.329s
sys 10m15.942s
After:
real 0m45.570s
user 0m15.525s
sys 9m56.669s
With a small reordering of struct neighbour fields, subject of a
following patch, (to separate refcnt from other read mostly fields)
real 0m41.841s
user 0m15.261s
sys 8m45.949s
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a seqlock in struct neighbour to protect neigh->ha[], and avoid
dirtying the neighbour in stress situations (many different flows / dsts).
Dirtying takes place because of read_lock(&n->lock) and n->used writes.
Switching to a seqlock, and writing n->used only on jiffies changes,
permits less dirtying.
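For illustration, a minimal sketch of the resulting read side (assuming the
new seqlock field is named ha_lock and dst_ha is a caller-provided buffer;
simplified, not the exact patch):

        unsigned int seq;

        do {
                seq = read_seqbegin(&n->ha_lock);
                /* copy the hardware address without write-locking n->lock */
                memcpy(dst_ha, n->ha, n->dev->addr_len);
        } while (read_seqretry(&n->ha_lock, seq));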
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Several other ethtool functions (potentially) leave heap memory uncleared
by drivers. Some interfaces appear safe (eeprom, etc), in that the sizes
are well controlled. In some situations (e.g. unchecked error conditions),
the heap will remain unchanged in areas before copying back to userspace.
Note that these are less of an issue since they all require CAP_NET_ADMIN.
Cc: stable@kernel.org
Signed-off-by: Kees Cook <kees.cook@canonical.com>
Acked-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a new dst is used to send a frame, neigh_resolve_output() tries to
associate a struct hh_cache with this dst, calling neigh_hh_init() with
the neigh rwlock write-locked.
Most of the time, the hh_cache is already known and linked into the
neighbour, so we find it and increment its refcount.
This patch changes the logic so that we call neigh_hh_init() with the
neighbour lock only read-locked, so that the fast path can run in
parallel on concurrent cpus.
This brings part of the speedup we got with commit c7d4426a98
(introduce DST_NOCACHE flag) for non-cached dsts, even for cached ones,
removing one of the contention points that routers hit on multiqueue
enabled machines.
Further improvements would need to use a seqlock instead of an rwlock to
protect neigh->ha[], to not dirty neigh too often and remove two atomic
ops.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The rx->count reference is used to track reference counts to the
number of rx-queue kobjects created for the device. This patch
eliminates initialization of the counter in netif_alloc_rx_queues
and instead increments the counter each time a kobject is created.
This is now symmetric with the decrement that is done when an object is
released.
Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Calling ETHTOOL_GRXCLSRLALL with a large rule_cnt will allocate kernel
heap without clearing it. For the one driver (niu) that implements it,
it will leave the unused portion of heap unchanged and copy the full
contents back to userspace.
Signed-off-by: Kees Cook <kees.cook@canonical.com>
Acked-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Synchronise the comment with the preceding implementation change.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Do not set num_rx_queues in netif_set_real_num_rx_queues(); some
drivers will increase real_num_rx_queues later due to feature
changes or an increase in available interrupts. Setting num_rx_queues
here ends up creating a cap on the number of rx queues
available.
For example, the ixgbe driver sets the maximum number of queues it ever
intends to use, then sets the current number in use with the
netif_set_real_num_{rx|tx}_queues calls. With the current implementation
the number of rx queues gets limited, so when a feature such as DCB
or FCoE is enabled the queues are no longer available.
kobjects will only be allocated for real_num_rx_queues, so the memory
waste is minimal.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the second step for neighbour RCU conversion.
(first was commit d6bf7817 : RCU conversion of neigh hash table)
neigh_lookup() becomes lockless, but still takes a reference on the found
neighbour. (no more read_lock()/read_unlock() on tbl->lock)
struct neighbour gets an additional rcu_head field and is freed after an
RCU grace period.
Future work would need to eventually not take a reference on neighbour
for temporary dst (DST_NOCACHE), but this would need dst->_neighbour to
use a noref bit like we did for skb->_dst.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the first step for RCU conversion of neigh code.
Next patches will convert hash_buckets[] and "struct neighbour" to RCU
protected objects.
[PATCH net-next] net neigh: RCU conversion of neigh hash table
Instead of storing hash_buckets, hash_mask and hash_rnd in "struct
neigh_table", a new structure is defined :
struct neigh_hash_table {
struct neighbour **hash_buckets;
unsigned int hash_mask;
__u32 hash_rnd;
struct rcu_head rcu;
};
And "struct neigh_table" has an RCU protected pointer to such a
neigh_hash_table.
This means the signature of the (*hash)() function changed: we need to add a
third parameter with the actual hash_rnd value, since this is no longer
a neigh_table field.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
neigh_delete() and neigh_add() don't need to touch the device refcount;
we hold RTNL when calling them, so the device cannot disappear under us.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In various situations, a device provides a packet to our stack and we
drop it before it enters the protocol stack:
- softnet backlog full (accounted in /proc/net/softnet_stat)
- bad vlan tag (not accounted)
- unknown/unregistered protocol (not accounted)
We can handle a per-device counter of such dropped frames at the core level,
and automatically add it to the device-provided stats (rx_dropped), so
that standard tools can be used (ifconfig, ip link, cat /proc/net/dev)
This is a generalization of commit 8990f468a (net: rx_dropped
accounting), thus reverting it.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fib_rules_cleanup_ups is only defined and used in one place.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since ingress is not used very much, and net_device->ingress_queue is
quite a big object (128 or 256 bytes), use a dynamic allocation if
needed (tc qdisc add dev eth0 ingress ...).
The dev_ingress_queue(dev) helper should be used only with RTNL taken.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While doing stress tests with IP route cache disabled, and multi queue
devices, I noticed a very high contention on one rwlock used in
neighbour code.
When many cpus are trying to send frames (possibly using a high
performance multiqueue device) to the same neighbour, they fight for the
neigh->lock rwlock in order to call neigh_hh_init(), and fight on
hh->hh_refcnt (a pair of atomic_inc/atomic_dec_and_test()).
But we don't need to call neigh_hh_init() for dsts that are used only
once. It costs four atomic operations at least, on two contended cache
lines, plus the high contention on the neigh->lock rwlock.
Introduce a new dst flag, DST_NOCACHE, that is set when the dst was not
inserted in the route cache.
With the stress test bench, sending 160000000 frames on one neighbour,
results are:
Before patch:
real 2m28.406s
user 0m11.781s
sys 36m17.964s
After patch:
real 1m26.532s
user 0m12.185s
sys 20m3.903s
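Conceptually (a simplified sketch, not the literal patch), the output path
now amounts to:

        if (dst->hh)
                return neigh_hh_output(dst->hh, skb);   /* cached header, fast path */
        if (!(dst->flags & DST_NOCACHE))
                neigh_hh_init(neigh, dst, dst->ops->protocol);
        /* one-shot (non-cached) dst: skip hh caching and its refcount/lock traffic */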
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes the condition (3rd arg) passed to sk_wait_event() in
sk_stream_wait_memory(). The incorrect check in sk_stream_wait_memory()
causes the following soft lockup in tcp_sendmsg() when the global tcp
memory pool has been exhausted.
>>> snip <<<
localhost kernel: BUG: soft lockup - CPU#3 stuck for 11s! [sshd:6429]
localhost kernel: CPU 3:
localhost kernel: RIP: 0010:[sk_stream_wait_memory+0xcd/0x200] [sk_stream_wait_memory+0xcd/0x200] sk_stream_wait_memory+0xcd/0x200
localhost kernel:
localhost kernel: Call Trace:
localhost kernel: [sk_stream_wait_memory+0x1b1/0x200] sk_stream_wait_memory+0x1b1/0x200
localhost kernel: [<ffffffff802557c0>] autoremove_wake_function+0x0/0x40
localhost kernel: [ipv6:tcp_sendmsg+0x6e6/0xe90] tcp_sendmsg+0x6e6/0xce0
localhost kernel: [sock_aio_write+0x126/0x140] sock_aio_write+0x126/0x140
localhost kernel: [xfs:do_sync_write+0xf1/0x130] do_sync_write+0xf1/0x130
localhost kernel: [<ffffffff802557c0>] autoremove_wake_function+0x0/0x40
localhost kernel: [hrtimer_start+0xe3/0x170] hrtimer_start+0xe3/0x170
localhost kernel: [vfs_write+0x185/0x190] vfs_write+0x185/0x190
localhost kernel: [sys_write+0x50/0x90] sys_write+0x50/0x90
localhost kernel: [system_call+0x7e/0x83] system_call+0x7e/0x83
>>> snip <<<
What is happening is that the sk_wait_event() condition passed from
sk_stream_wait_memory() evaluates to true in the case of global tcp memory
exhaustion. This is because both sk_stream_memory_free() and vm_wait are true,
which causes sk_wait_event() to *not* call schedule_timeout().
Hence sk_stream_wait_memory() returns immediately to the caller without
sleeping. The caller then retries the allocation, which fails again and
calls sk_stream_wait_memory() again, and so on.
[ Bug introduced by commit c1cbe4b7ad
("[NET]: Avoid atomic xchg() for non-error case") -DaveM ]
Signed-off-by: Nagendra Singh Tomar <tomer_iisc@yahoo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is some confusion with the rx_queue name after RPS, and net drivers'
private rx_queue fields.
I suggest renaming "struct net_device"->rx_queue to ingress_queue.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As tunnel devices are going to be lockless, we need to make sure a
misconfigured machine won't enter an infinite loop.
Add a percpu variable, and limit the number of stacked xmits to three.
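A rough sketch of the idea (names illustrative, not the exact patch):

static DEFINE_PER_CPU(int, xmit_recursion);
#define RECURSION_LIMIT 3

        /* in the xmit path, around the call into the lower device */
        if (__this_cpu_read(xmit_recursion) > RECURSION_LIMIT)
                goto recursion_alert;   /* drop instead of looping forever */
        __this_cpu_inc(xmit_recursion);
        rc = dev_hard_start_xmit(skb, dev, txq);
        __this_cpu_dec(xmit_recursion);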
Reported-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows a host to be configured to respond to any address in
a specified range as if it were local, without actually needing to
configure the address on an interface. This is done through routing
table configuration. For instance, to configure a host to respond
to any address in 10.1/16 received on eth0 as a local address we can do:
ip rule add from all iif eth0 lookup 200
ip route add local 10.1/16 dev lo proto kernel scope host src 127.0.0.1 table 200
This host is now reachable by any 10.1/16 address (route lookup on
input for packets received on eth0 can find the route). On output, the
rule will not be matched so that this host can still send packets to
10.1/16 (not sent on loopback). Presumably, external routing can be
configured to make sense out of this.
To make this work, we needed to modify the logic in finding the
interface which is assigned a given source address for output
(dev_ip_find). We perform a normal fib_lookup instead of just a
lookup on the local table, and in the lookup we ignore the input
interface for matching.
This patch is useful to implement IP-anycast for subnets of virtual
addresses.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For RPS, we create a kobject for each RX queue based on the number of
queues passed to alloc_netdev_mq(). However, drivers generally do not
determine the numbers of hardware queues to use until much later, so
this usually represents the maximum number the driver may use and not
the actual number in use.
For TX queues, drivers can update the actual number using
netif_set_real_num_tx_queues(). Add a corresponding function for RX
queues, netif_set_real_num_rx_queues().
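A hypothetical driver snippet showing the intended use (nr_rings is a made-up
driver variable):

        /* tell the stack how many RX queues are actually in use */
        err = netif_set_real_num_rx_queues(dev, nr_rings);
        if (err)
                return err;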
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_attach_filter() and sk_detach_filter() are run with the socket locked.
Use the appropriate rcu_dereference_protected() instead of blocking BH
and using rcu_dereference_bh().
There is no point adding BH prevention and a memory barrier.
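An illustrative sketch of the resulting pattern (simplified, socket lock held
by the caller):

        struct sk_filter *old_fp;

        old_fp = rcu_dereference_protected(sk->sk_filter,
                                           sock_owned_by_user(sk));
        rcu_assign_pointer(sk->sk_filter, fp);
        if (old_fp)
                sk_filter_uncharge(sk, old_fp);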
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It seems we don't use an appropriate refcount increment in an
rcu_read_lock() protected section.
fib_rule_get() might increment a null refcount and bad things could
happen.
While fib_nl_delrule() respects an rcu grace period before calling
fib_rule_put(), fib_rules_cleanup_ops() calls fib_rule_put() without a
grace period.
Note: after this patch, we might avoid the synchronize_rcu() call done
in fib_nl_delrule().
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes kernel bugzilla #16603
tcp_sendmsg() truncates iov_len to an 'int', which causes a 4GB write to
write zero bytes, for example.
There is also the problem higher up of how verify_iovec() works. It
wants to prevent the total length from looking like an error return
value.
However it does this using 'int', but syscalls return 'long' (and
thus signed 64-bit on 64-bit machines). So it could trigger
false-positives on 64-bit as written. So fix it to use 'long'.
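A small standalone illustration of the truncation problem on a 64-bit
machine:

        unsigned long len = 4UL << 30;  /* a 4GB request */
        int  as_int  = len;             /* truncates to 0: looks like "nothing written" */
        long as_long = len;             /* keeps the value, and can still represent errors */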
Reported-by: Olaf Bonorden <bono@onlinehome.de>
Reported-by: Daniel Büse <dbuese@gmx.de>
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of having two places where we allocate dev->_rx, introduce the
netif_alloc_rx_queues() helper and call it only from
register_netdevice(), not from alloc_netdev_mq().
The goal is to let drivers change dev->num_rx_queues after allocating a netdev
and before registering it.
This also removes a lot of ifdefs in net/core/dev.c
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Automatically allow vlans to get NETIF_F_HIGHDMA if the underlying device
supports it.
On 32bit arches (and more precisely if CONFIG_HIGHMEM is enabled), it
can help to reduce the cost of illegal_highdma() and __skb_linearize()
calls.
Tested on tg3, bnx2 and bonding; this worked very well.
This is a generalization of a patch provided by Yi Zou & Jeff Kirsher.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We have for each socket :
One spinlock (sk_slock.slock)
One rwlock (sk_callback_lock)
Possible scenarios are :
(A) (this is used in net/sunrpc/xprtsock.c)
read_lock(&sk->sk_callback_lock) (without blocking BH)
<BH>
spin_lock(&sk->sk_slock.slock);
...
read_lock(&sk->sk_callback_lock);
...
(B)
write_lock_bh(&sk->sk_callback_lock)
stuff
write_unlock_bh(&sk->sk_callback_lock)
(C)
spin_lock_bh(&sk->sk_slock)
...
write_lock_bh(&sk->sk_callback_lock)
stuff
write_unlock_bh(&sk->sk_callback_lock)
spin_unlock_bh(&sk->sk_slock)
This (C) case conflicts with (A) :
CPU1 [A] CPU2 [C]
read_lock(callback_lock)
<BH> spin_lock_bh(slock)
<wait to spin_lock(slock)>
<wait to write_lock_bh(callback_lock)>
We have one problematic (C) use case in inet_csk_listen_stop() :
local_bh_disable();
bh_lock_sock(child); // spin_lock_bh(&sk->sk_slock)
WARN_ON(sock_owned_by_user(child));
...
sock_orphan(child); // write_lock_bh(&sk->sk_callback_lock)
lockdep is not happy with this, as reported by Tetsuo Handa
It seems the only way to deal with this is to use read_lock_bh(callback_lock)
everywhere.
Thanks to Jarek for pointing out a bug in my first attempt and suggesting
this solution.
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Jarek Poplawski <jarkao2@gmail.com>
Tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change "return (EXPR);" to "return EXPR;"
return is not a function, parentheses are not required.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/core/ethtool.c: In function 'ethtool_get_regs':
net/core/ethtool.c:818:2: error: implicit declaration of function 'vmalloc'
net/core/ethtool.c:818:9: warning: assignment makes pointer from integer without a cast
net/core/ethtool.c:833:2: error: implicit declaration of function 'vfree'
Signed-off-by: David S. Miller <davem@davemloft.net>
Some NICs have huge register files which exceed the maximum heap
allocation size.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ethtool utility does not set masks for flow parameters that are
not specified, so if both value and mask are 0 then this must be
treated as equivalent to a mask with all bits set. Currently that is
done in the only driver that implements RX n-tuple filtering, ixgbe.
Move it to the ethtool core.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previously, if a vlan master device was moved from one network namespace
to another, all 802.1q and macvlan slaves were deleted.
We can use dev->reg_state to figure out whether dev_change_net_namespace
is happening, since that won't set dev->reg_state to NETREG_UNREGISTERING.
So, this changes 8021q and macvlan to ignore NETDEV_UNREGISTER when
reg_state is not NETREG_UNREGISTERING.
Signed-off-by: David Lamparter <equinox@diac24.net>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
To be able to switch on GRO on a device, ethtool_set_gro() checks that this
device provides a get_rx_csum() method.
Some devices don't provide this method even though they do support RX
checksumming.
This patch allows bonding to support GRO:
ethtool -K bond0 gro on
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rcu_dereference_raw() now needs to know the type of its argument.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently vlan devices don't have GRO by default as none of the Ethernet
drivers add NETIF_F_GRO to their vlan_features.
As GRO is a software feature, add GRO to dev->vlan_features in
register_netdevice() and let vlan_dev_init() ensure it gets
enabled only when dev->features has NETIF_F_GRO too.
Signed-off-by: Brandon Philips <bphilips@suse.de>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dev->ip_ptr is protected by rtnl and rcu.
Yet some places don't use the appropriate primitives and/or locking rules.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
struct ethtool_rawip4_spec and struct ethtool_ether_spec are neither
commented nor used by any driver, so remove them. Adjust padding in
the user-visible unions that included these structures.
Fix references to struct ethtool_rawip4_spec in
ethtool_get_rx_ntuple(), which should use struct ethtool_usrip4_spec.
struct ethtool_usrip4_spec cannot hold IPv6 host addresses and there
is no separate structure that can, so remove ETH_RX_NFC_IP6 and the
reference to it in niu.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
netdev_wait_allrefs() waits until all references to a device vanish.
It currently uses a _very_ pessimistic 250 ms delay between each probe.
Some users reported that no more than 4 devices can be dismantled per
second; this is a pretty serious problem for some setups.
Most of the time, a refcount is about to be released by an RCU callback
that is still in flight, because rollback_registered_many() uses a
synchronize_rcu() call instead of rcu_barrier(). The problem is visible if
the number of online cpus is one, because synchronize_rcu() is then a no-op.
Time to remove 50 ipip tunnels on a UP machine:
before patch : real 11.910s
after patch : real 1.250s
Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reported-by: Octavian Purdila <opurdila@ixiacom.com>
Reported-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allocate hash tables for every online cpu, not every possible one.
NUMA-aware allocations.
Don't use a full page on arches where PAGE_SIZE > 1024*sizeof(void *).
misc:
__percpu, __read_mostly, __cpuinit annotations
flow_compare_t is just an "unsigned long"
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that est_tree_lock is acquired with BH protection, the other
call is unnecessary.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__lock_sock() and __release_sock() release and regrab the lock but
were missing proper annotations. Add them. This removes the following
warning from sparse. (Currently __lock_sock() does not emit any
warning about it, but I think it is better to annotate it as well.)
net/core/sock.c:1580:17: warning: context imbalance in '__release_sock' - unexpected unlock
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
move_addr_to_kernel() and copy_from_user() require their arguments
to be __user pointers but were missing proper markup. Add it.
This removes the following warnings from sparse.
net/core/iovec.c:44:52: warning: incorrect type in argument 1 (different address spaces)
net/core/iovec.c:44:52: expected void [noderef] <asn:1>*uaddr
net/core/iovec.c:44:52: got void *msg_name
net/core/iovec.c:55:34: warning: incorrect type in argument 2 (different address spaces)
net/core/iovec.c:55:34: expected void const [noderef] <asn:1>*from
net/core/iovec.c:55:34: got struct iovec *msg_iov
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When there is only one rps_cpus, skb_get_rxhash() can be eliminated.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch: "gro: fix different skb headrooms" in its part:
"2) allocate a minimal skb for head of frag_list" is buggy. The copied
skb has p->data set at the ip header at the moment, and skb_gro_offset
is the length of ip + tcp headers. So, after the change the length of
mac header is skipped. Later skb_set_mac_header() sets it into the
NET_SKB_PAD area (if it's long enough) and ip header is misaligned at
NET_SKB_PAD + NET_IP_ALIGN offset. There is no reason to assume the
original skb was wrongly allocated, so let's copy it as it was.
bugzilla : https://bugzilla.kernel.org/show_bug.cgi?id=16626
fixes commit: 3d3be4333f
Reported-by: Plamen Petrov <pvp-lsts@fs.uni-ruse.bg>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Plamen Petrov <pvp-lsts@fs.uni-ruse.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a net device is implementing the select_queue callback and is part of
a bridge, frames coming from the bridge already have a tx queue associated
to the socket (introduced in commit a4ee3ce329,
"net: Use sk_tx_queue_mapping for connected sockets"). The call to
sk_tx_queue_get will then return the tx queue used by the bridge instead
of calling the select_queue callback.
In case of mac80211 this broke QoS which is implemented by using the
select_queue callback. Furthermore it introduced problems with rt2x00
because frames with the same TID and RA sometimes appeared on different
tx queues which the hw cannot handle correctly.
Fix this by always calling select_queue first if it is available and only
afterwards use the socket tx queue mapping.
Signed-off-by: Helmut Schaa <helmut.schaa@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a tracepoint to consume_skb and adds trace_kfree_skb
before __kfree_skb in skb_free_datagram_locked and net_tx_action.
Combined with the tracepoint on dev_hard_start_xmit, we can check
how long it takes to free transmitted packets, and from that we can
calculate how many packets the driver held at that time. It is useful when
drops of transmitted packets are a problem.
sshd-6828 [000] 112689.258154: consume_skb: skbaddr=f2d99bb8
Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Kaneshige Kenji <kaneshige.kenji@jp.fujitsu.com>
Cc: Izumo Taku <izumi.taku@jp.fujitsu.com>
Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Scott Mcmillan <scott.a.mcmillan@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
LKML-Reference: <4C724364.50903@jp.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
No need to test sk->sk_shutdown & RCV_SHUTDOWN twice.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
pskb_expand_head() blindly takes references on fragments before calling
skb_release_data(), potentially releasing these references.
We can add a fast path, avoiding these atomic operations, if we own the
last reference on skb->head.
Based on a previous patch from David
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__alloc_skb() uses a memset() to clear the whole beginning of the skb,
including the bitfields contained in 'flags1' & 'flags2'.
We no longer need to use kmemcheck_annotate_bitfield() on these
fields. However, we still need it for the clone part, which is not
cleared.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes a lockdep warning:
[ 516.287584] =========================================================
[ 516.288386] [ INFO: possible irq lock inversion dependency detected ]
[ 516.288386] 2.6.35b #7
[ 516.288386] ---------------------------------------------------------
[ 516.288386] swapper/0 just changed the state of lock:
[ 516.288386] (&qdisc_tx_lock){+.-...}, at: [<c12eacda>] est_timer+0x62/0x1b4
[ 516.288386] but this lock took another, SOFTIRQ-unsafe lock in the past:
[ 516.288386] (est_tree_lock){+.+...}
[ 516.288386]
[ 516.288386] and interrupts could create inverse lock ordering between them.
...
So, est_tree_lock needs BH protection because it's taken by
qdisc_tx_lock, which is used both in BH and process contexts.
(Full warning with this patch at netdev, 02 Sep 2010.)
Fixes commit: ae638c47dc
("pkt_sched: gen_estimator: add a new lock")
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a small helper, ptype_head(), to get the list head to manipulate.
dev_add_pack() & __dev_remove_pack() can use a spinlock without
blocking BH, since softirq uses RCU, and these functions are run from
process context only.
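Roughly, an illustrative sketch of the resulting code:

void dev_add_pack(struct packet_type *pt)
{
        struct list_head *head = ptype_head(pt);

        spin_lock(&ptype_lock);         /* plain spin_lock, no _bh variant */
        list_add_rcu(&pt->list, head);  /* readers walk the list under RCU */
        spin_unlock(&ptype_lock);
}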
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Packets entering GRO might have different headrooms, even for a given
flow (because of implementation details in drivers, like copybreak).
We can't force drivers to deliver packets with a fixed headroom.
1) fix skb_segment()
skb_segment() makes the false assumption that the headrooms of fragments
are the same as the head's. When CHECKSUM_PARTIAL is used, this can give
csum_start errors, and crash later in skb_copy_and_csum_dev().
2) allocate a minimal skb for head of frag_list
skb_gro_receive() uses netdev_alloc_skb(headroom + skb_gro_offset(p)) to
allocate a fresh skb. This adds NET_SKB_PAD to padding already
provided by the netdevice, depending on various things, like copybreak.
Use alloc_skb() to allocate an exact padding, to reduce cache line
needs:
NET_SKB_PAD + NET_IP_ALIGN
bugzilla : https://bugzilla.kernel.org/show_bug.cgi?id=16626
Many thanks to Plamen Petrov for testing many debugging patches!
With the help of Jarek Poplawski.
Reported-by: Plamen Petrov <pvp-lsts@fs.uni-ruse.bg>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(skb->data - skb->head) can be replaced by skb_headroom(skb).
Remove some uses of NET_SKBUFF_DATA_USES_OFFSET, using
(skb_end_pointer(skb) - skb->head) or
(skb_tail_pointer(skb) - skb->head): the compiler does the right thing,
and this is more readable for us ;)
Add (struct skb_shared_info *) casts in pskb_expand_head() to help memcpy()
use aligned moves.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- napi_gro_flush() is exported from net/core/dev.c, to avoid
an irq_save/irq_restore in the packet receive path.
- use napi_gro_receive() instead of netif_receive_skb()
- use napi_gro_flush() before calling __napi_complete()
- turn on NETIF_F_GRO by default
- Tested on a Marvell 88E8001 Gigabit NIC
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the unused variable "queue" in pg_cleanup.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
compare_ether_header() can have a special implementation on 64 bit
arches if CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is defined.
__napi_gro_receive() and vlan_gro_common() can avoid a conditional
branch to perform device match.
On x86_64, __napi_gro_receive() now has 38 instructions instead of 53.
As gcc-4.4.3 still chooses not to inline it, add the inline keyword to this
performance-critical function.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The SNMP daemon uses ethtool to determine the speed of
network interfaces. This fails on Debian (and probably elsewhere)
because, for security, the SNMP daemon runs as a non-root user (snmp).
Note: A similar patch was rejected previously because of a concern about
the possibility that on some hardware querying the ethtool settings
requires access to the PHY and could slow the machine down. But the
security risk of requiring the SNMP daemon (and related services)
to run as root far outweighs the risk of denial-of-service.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
No need to use a temporary struct rtnl_link_stats64 variable,
just copy the source to skb buffer.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SKBs can be "fragmented" in two ways, via a page array (called
skb_shinfo(skb)->frags[]) and via a list of SKBs (called
skb_shinfo(skb)->frag_list).
Since skb_has_frags() tests the latter, its name is confusing
since it sounds more like it's testing the former.
Signed-off-by: David S. Miller <davem@davemloft.net>
vlan_hwaccel_do_receive() always returns 0, so make it return void.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__skb_get_rxhash() was broken after the commit:
commit bfb564e739
Author: Krishna Kumar <krkumar2@in.ibm.com>
Date: Wed Aug 4 06:15:52 2010 +0000
core: Factor out flow calculation from get_rps_cpu
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fragmented IP packets may have no transport header, so when computing
the rxhash, we should skip them.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_get_rxhash() assumes the network header pointer of the skb is set
properly after the commit:
commit bfb564e739
Author: Krishna Kumar <krkumar2@in.ibm.com>
Date: Wed Aug 4 06:15:52 2010 +0000
core: Factor out flow calculation from get_rps_cpu
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes the abstraction introduced by the union skb_shared_tx in
the shared skb data.
The access of the different union elements at several places led to some
confusion about accessing the shared tx_flags e.g. in skb_orphan_try().
http://marc.info/?l=linux-netdev&m=128084897415886&w=2
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
>Xin Xiaohui wrote:
> I looked into the code dev_gro_receive(), found the code here:
> if the frags[0] is pulled to 0, then the page will be released,
> and memmove() frags left.
> Is that right? I'm not sure if memmove do right or not, but
> frags[0].size is never set after memove at least. what I think
> a simple way is not to do anything if we found frags[0].size == 0.
> The patch is as followed.
...
This version of the patch fixes the bug directly in memmove.
Reported-by: "Xin, Xiaohui" <xiaohui.xin@intel.com>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The driver name and bus address for a net_device can normally be found
through the driver model now. Instead of requiring drivers to provide
this information redundantly through the ethtool_ops::get_drvinfo
operation, use the driver model to do so if the driver does not define
the operation. Since ETHTOOL_GDRVINFO no longer requires the driver
to implement any operations, do not require net_device::ethtool_ops to
be set either.
Remove implementations of get_drvinfo and ethtool_ops that provide
only this information.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Factor out flow calculation code from get_rps_cpu, since other
functions can use the same code.
Revisions:
v2 (Ben): Separate flow calculation out and use in select queue.
v3 (Arnd): Don't re-implement MIN.
v4 (Changli): skb->data points to ethernet header in macvtap, and
make a fast path. Tested macvtap with this patch.
v5 (Changli):
- Cache skb->rxhash in skb_get_rxhash
- macvtap may not have pow(2) queues, so change code for
queue selection.
(Arnd):
- Use first available queue if all fails.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Enable using network namespaces with
wireless devices even when sysfs is
enabled using the same infrastructure
that was built for netdevs.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Although netif_rx() isn't expected to be called in process context with
preemption enabled, it had better handle this case. And this is why get_cpu()
is used in the non-RPS #ifdef branch. If tree RCU is selected,
rcu_read_lock() won't disable preemption, so preempt_disable() should be
called explicitly.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The netpoll_rx_on() check in __napi_gro_receive() skips part of the
"common" GRO_NORMAL path, especially "pull:" in dev_gro_receive(),
where at least eth header should be copied for entirely paged skbs.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 15e83ed788.
As explained by Johannes Berg, the optimization made here is
invalid. Or, at best, incomplete.
Not only destructor invocation, but also conntrack entry releasing,
must be executed outside of hw IRQ context.
So just checking "skb->destructor" is insufficient.
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit ab95bfe01f replaces bridge and macvlan
hooks in __netif_receive_skb(), so dev.c doesn't need to include their headers.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a user misconfigures ingress and causes a redirection loop, don't
overwhelm the log. This is also an error case, so mark it unlikely.
Found by inspection, luckily not in a real system.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[ Fix unused local variable build warnings. -DaveM ]
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With conn-track zones and probably with different network
namespaces, the netfilter logic needs to be re-calculated
on packet receive. If the netfilter logic is not reset,
it will not be recalculated properly. This patch adds
the nf_reset logic to dev_forward_skb.
Signed-off-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move frags[] at the end of struct skb_shared_info, and make
pskb_expand_head() copy only the used part of it instead of whole array.
This should avoid kmemcheck warnings and speedup pskb_expand_head() as
well, avoiding a lot of cache misses.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add addr_assign_type to struct net_device and expose it via sysfs.
This new attribute has the purpose of giving user-space the ability to
distinguish between different assignment types of MAC addresses.
For example user-space can treat NICs with randomly generated MAC
addresses differently than NICs that have permanent (locally assigned)
MAC addresses.
For the former udev could write a persistent net rule by matching the
device path instead of the MAC address.
There's also the case of devices that 'steal' MAC addresses from slave
devices, in which case it is also beneficial for user-space to be aware
of the fact.
This patch also introduces a helper function to assist adoption of
drivers that generate MAC addresses randomly.
Signed-off-by: Stefan Assmann <sassmann@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make pskb_expand_head() check ip_summed to make sure csum_start is really
csum_start and not csum before adjusting it.
This fixes a bug I encountered using a Sun Quad-Fast Ethernet card and VLANs.
On my configuration, the sunhme driver produces skbs with differing amounts
of headroom on receive depending on the packet size. See line 2030 of
drivers/net/sunhme.c; packets smaller than RX_COPY_THRESHOLD have 52 bytes
of headroom but packets larger than that cutoff have only 20 bytes.
When these packets reach the VLAN driver, vlan_check_reorder_header()
calls skb_cow(), which, if the packet has less than NET_SKB_PAD (== 32) bytes
of headroom, uses pskb_expand_head() to make more.
Then, pskb_expand_head() needs to adjust a lot of offsets into the skb,
including csum_start. Since csum_start is a union with csum, if the packet
has a valid csum value this will corrupt it, which was the effect I observed.
The sunhme hardware computes receive checksums, so the skbs would be created
by the driver with ip_summed == CHECKSUM_COMPLETE and a valid csum field, and
then pskb_expand_head() would corrupt the csum field, leading to an "hw csum
error" message later on, for example in icmp_rcv() for pings larger than the
sunhme RX_COPY_THRESHOLD.
On the basis of the comment at the beginning of include/linux/skbuff.h,
I believe that the csum_start skb field is only meaningful if ip_summed is
CHECKSUM_PARTIAL, so this patch makes pskb_expand_head() adjust it only in that
case to avoid corrupting a valid csum value.
Please see my more in-depth discussion of tracking down this bug for
more details if you like:
http://puellavulnerata.livejournal.com/112186.html
http://puellavulnerata.livejournal.com/112567.html
http://puellavulnerata.livejournal.com/112891.html
http://puellavulnerata.livejournal.com/113096.html
http://puellavulnerata.livejournal.com/113591.html
I am not subscribed to this list, so please CC me on replies.
Signed-off-by: Andrea Shepard <andrea@persephoneslair.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/vhost/net.c
net/bridge/br_device.c
Fix merge conflict in drivers/vhost/net.c with guidance from
Stephen Rothwell.
Revert the effects of net-2.6 commit 573201f36f
since net-next-2.6 has fixes that make bridge netpoll work properly thus
we don't need it disabled.
Signed-off-by: David S. Miller <davem@davemloft.net>
Patch to add -EAGAIN error to dropwatch netlink message handling code.
-EAGAIN will be returned anytime userspace attempts to transition the state of
the drop monitor service to a state that it's already in. That allows user space
to detect this condition, so it doesn't wait for a success ACK that will never
arrive. Tested successfully by me.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the modern this_cpu_xxx() api, saving a few bytes on x86.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since struct netdev_queue tx_bytes/tx_packets/tx_dropped are already
protected by _xmit_lock, it's easy to convert these fields to u64 instead
of unsigned long.
This completes 64bit stats for devices using them (vlan, macvlan, ...).
Strictly, we could avoid the locking in dev_txq_stats_fold() on 64bit
arches, but it's a slow path and we prefer to keep it simple.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a new networking option to allow hardware time stamps
from PHY devices. When enabled, likely candidates among incoming and
outgoing network packets are offered to the PHY driver for possible
time stamping. When accepted by the PHY driver, incoming packets are
deferred for later delivery by the driver.
The patch also adds phylib driver methods for the SIOCSHWTSTAMP ioctl
and callbacks for transmit and receive time stamping. Drivers may
optionally implement these functions.
Signed-off-by: Richard Cochran <richard.cochran@omicron.at>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix problem in reading the tx_queue recorded in a socket. In
dev_pick_tx, the TX queue is read by doing a check with
sk_tx_queue_recorded on the socket, followed by a sk_tx_queue_get.
The problem is that there is no mutual exclusion across these
calls in the socket, so it is possible that the queue in the
sock can be invalidated after sk_tx_queue_recorded is called, so
that sk_tx_queue_get returns -1, which sets 65535 in queue_index
and thus dev_pick_tx returns 65536, which is a bogus queue and
can cause a crash in dev_queue_xmit.
We fix this by only calling sk_tx_queue_get which does the proper
checks. The interface is that sk_tx_queue_get returns the TX queue
if the sock argument is non-NULL and TX queue is recorded, else it
returns -1. sk_tx_queue_recorded is no longer used so it can be
completely removed.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When configuring DMVPN (GRE + openNHRP) and a GRE remote
address is configured, a kernel Oops is observed. The
observed Oops is caused by a NULL header_ops pointer
(neigh->dev->header_ops) in neigh_update_hhs() when
void (*update)(struct hh_cache*, const struct net_device*, const unsigned char *)
= neigh->dev->header_ops->cache_update;
is executed. The dev associated with the NULL header_ops is
the GRE interface. This patch guards against the
possibility that header_ops is NULL.
This Oops was first observed in kernel version 2.6.26.8.
Signed-off-by: Doug Kehn <rdkehn@yahoo.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit fc6055a5ba (net: Introduce skb_orphan_try()) added early
orphaning of skbs.
This unfortunately added a performance regression in skb_tx_hash() in
case of stacked devices (bonding, vlans, ...)
Since skb->sk is now NULL, we cannot access sk->sk_hash anymore to
spread tx packets to multiple NIC queues on multiqueue devices.
skb_tx_hash() in this case only uses skb->protocol, same value for all
flows.
skb_orphan_try() can copy sk->sk_hash into skb->rxhash and skb_tx_hash()
can use this saved sk_hash value to compute its internal hash value.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Avoid two extra instructions in sock_free(), to reload
skb->truesize and skb->sk
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
CodingStyle cleanups
EXPORT_SYMBOL should immediately follow the symbol declaration.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Document that dev_get_stats() returns the same stats pointer it was
given. Remove const qualification from the returned pointer since the
caller may do what it likes with that structure.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit be1f3c2c02 "net: Enable 64-bit
net device statistics on 32-bit architectures" I redefined struct
net_device_stats so that it could be used in a union with struct
rtnl_link_stats64, avoiding the need for explicit copying or
conversion between the two. However, this is unsafe because there is
no locking required and no lock consistently held around calls to
dev_get_stats() and use of the statistics structure it returns.
In commit 28172739f0 "net: fix 64 bit
counters on 32 bit arches" Eric Dumazet dealt with that problem by
requiring callers of dev_get_stats() to provide storage for the
result. This means that the net_device::stats64 field and the padding
in struct net_device_stats are now redundant, so remove them.
Update the comment on net_device_ops::ndo_get_stats64 to reflect its
new usage.
Change dev_txq_stats_fold() to use struct rtnl_link_stats64, since
that is what all its callers are really using and it is no longer
going to be compatible with struct net_device_stats.
Eric Dumazet suggested the separate function for the structure
conversion.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a small possibility that a reader gets incorrect values on 32
bit arches. SNMP applications could catch incorrect counters when a
32bit high part is changed by another stats consumer/provider.
One way to solve this is to add a rtnl_link_stats64 param to all
ndo_get_stats64() methods, and also add such a parameter to
dev_get_stats().
The rule is that we are not allowed to use dev->stats64 as temporary
storage for 64bit stats; instead, a caller-provided area (usually on the
stack) must be used.
Old drivers (only providing a get_stats() method) need no changes.
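Callers then follow a pattern along these lines (illustrative):

        struct rtnl_link_stats64 temp;
        const struct rtnl_link_stats64 *stats = dev_get_stats(dev, &temp);

        /* 'stats' points at caller-provided storage, never at dev->stats64 */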
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reduces an x86 defconfig text and data ~2k.
text is smaller, data is larger.
$ size vmlinux*
text data bss dec hex filename
7198862 720112 1366288 9285262 8dae8e vmlinux
7205273 716016 1366288 9287577 8db799 vmlinux.device_h
Uses %pV and struct va_format
Format arguments are verified before printk
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reducing real_num_tx_queues needs to flush the qdisc, otherwise
skbs with queue_mappings greater than real_num_tx_queues can
be sent to the underlying driver.
The flow for this is,
dev_queue_xmit()
dev_pick_tx()
skb_tx_hash() => hash using real_num_tx_queues
skb_set_queue_mapping()
...
qdisc_enqueue_root() => enqueue skb on txq from hash
...
dev->real_num_tx_queues -= n
...
sch_direct_xmit()
dev_hard_start_xmit()
ndo_start_xmit(skb,dev) => skb queue set with old hash
skbs are enqueued on the qdisc with skb->queue_mapping set such that
0 < queue_mapping < real_num_tx_queues. When the driver
decreases real_num_tx_queues, skbs may be dequeued from the
qdisc with a queue_mapping greater than real_num_tx_queues.
This fixes a case in ixgbe where this was occurring with DCB
and FCoE. Because the driver is using queue_mapping to map
skbs to tx descriptor rings we can potentially map skbs to
rings that no longer exist.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Tested-by: Ross Brattain <ross.b.brattain@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Many NICs use an indirection table to map an RX flow hash value to one
of an arbitrary number of queues (not necessarily a power of 2). It
can be useful to remove some queues from this indirection table so
that they are only used for flows that are specifically filtered
there. It may also be useful to weight the mapping to account for
user processes with the same CPU-affinity as the RX interrupts.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ethtool_op_set_flags() does not check for unsupported flags, and has
no way of doing so. This means it is not suitable for use as a
default implementation of ethtool_ops::set_flags.
Add a 'supported' parameter specifying the flags that the driver and
hardware support, validate the requested flags against this, and
change all current callers to pass this parameter.
Change some other trivial implementations of ethtool_ops::set_flags to
call ethtool_op_set_flags().
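The validation amounts to roughly this (a simplified sketch of the new
signature, not the exact implementation):

int ethtool_op_set_flags(struct net_device *dev, u32 data, u32 supported)
{
        if (data & ~supported)
                return -EINVAL;         /* requested a flag the driver cannot do */

        dev->features = (dev->features & ~supported) | (data & supported);
        return 0;
}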
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Reviewed-by: Stanislaw Gruszka <sgruszka@redhat.com>
Acked-by: Jeff Garzik <jgarzik@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is only noticed by people that are not doing everything correctly in
the first place.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
struct ethtool_rxnfc was originally defined in 2.6.27 for the
ETHTOOL_{G,S}RXFH command with only the cmd, flow_type and data
fields. It was then extended in 2.6.30 to support various additional
commands. These commands should have been defined to use a new
structure, but it is too late to change that now.
Since user-space may still be using the old structure definition
for the ETHTOOL_{G,S}RXFH commands, and since they do not need the
additional fields, only copy the originally defined fields to and
from user-space.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Cc: stable@kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
On a 32-bit machine, info.rule_cnt >= 0x40000000 leads to integer
overflow and the buffer may be smaller than needed. Since
ETHTOOL_GRXCLSRLALL is unprivileged, this can presumably be used for at
least denial of service.
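An illustration of the overflow on a 32-bit machine:

        u32 rule_cnt = 0x40000000;
        size_t bytes = rule_cnt * sizeof(u32);  /* 0x100000000 wraps to 0 when size_t is 32-bit */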
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Cc: stable@kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
use this_cpu_ptr(p) instead of per_cpu_ptr(p, smp_processor_id())
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add pr_fmt(fmt) KBUILD_MODNAME ": " fmt
Remove "pktgen: " from formats
Convert printks to pr_<level>
Added func_enter() for debugging
Moved version to end of string at module_init
Coalesced long formats
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gcc is currently not able to optimize the switch statement in
sk_run_filter() because of the dense case labels. This patch replaces the
OR'd labels with ordered, sequenced case labels. The sk_chk_filter()
function is modified to patch/replace the original OPCODES in an
ordered but equivalent form. gcc is now able to transform the
switch statement in sk_run_filter into a jump table of complexity O(1).
Until this patch, gcc generated a sequence of conditional branches (O(n)) of 567
bytes .text segment size (arch x86_64):
7ff: 8b 06 mov (%rsi),%eax
801: 66 83 f8 35 cmp $0x35,%ax
805: 0f 84 d0 02 00 00 je adb <sk_run_filter+0x31d>
80b: 0f 87 07 01 00 00 ja 918 <sk_run_filter+0x15a>
811: 66 83 f8 15 cmp $0x15,%ax
815: 0f 84 c5 02 00 00 je ae0 <sk_run_filter+0x322>
81b: 77 73 ja 890 <sk_run_filter+0xd2>
81d: 66 83 f8 04 cmp $0x4,%ax
821: 0f 84 17 02 00 00 je a3e <sk_run_filter+0x280>
827: 77 29 ja 852 <sk_run_filter+0x94>
829: 66 83 f8 01 cmp $0x1,%ax
[...]
With the modification, the compiler translates the switch statement into
the following jump table fragment:
7ff: 66 83 3e 2c cmpw $0x2c,(%rsi)
803: 0f 87 1f 02 00 00 ja a28 <sk_run_filter+0x26a>
809: 0f b7 06 movzwl (%rsi),%eax
80c: ff 24 c5 00 00 00 00 jmpq *0x0(,%rax,8)
813: 44 89 e3 mov %r12d,%ebx
816: e9 43 03 00 00 jmpq b5e <sk_run_filter+0x3a0>
81b: 41 89 dc mov %ebx,%r12d
81e: e9 3b 03 00 00 jmpq b5e <sk_run_filter+0x3a0>
Furthermore, I reordered the instructions to reduce cache line misses by
ordering the most common instructions at the start.
Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove rtnl_unlock() which had no corresponding rtnl_lock().
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
netif_needs_gso() is checked twice in the TX path: once
before submitting the skb to the qdisc and once after
it is dequeued from the qdisc, just before calling
ndo_start_xmit(). This opens a window for a user to
change the gso/tso or tx checksum settings that can
cause netif_needs_gso to be true in one check and false
in the other.
Specifically, changing TX checksum setting may cause
the warning in skb_gso_segment() to be triggered if
the checksum is calculated earlier.
This consolidates the netif_needs_gso() calls so that
the stack only checks if gso is needed in
dev_hard_start_xmit().
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Start capturing not only the userspace pid, uid and gid values of the
sending process but also the struct pid and struct cred of the sending
process.
This is in preparation for properly supporting SCM_CREDENTIALS for
sockets that have different uid and/or pid namespaces at the different
ends.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Serge E. Hallyn <serge@hallyn.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use struct pid and struct cred to store the peer credentials on struct
sock. This gives enough information to convert the peer credential
information to a value relative to whatever namespace the socket is in
at the time.
This removes nasty surprises when using SO_PEERCRED on socket
connections where the processes on either side are in different pid and
user namespaces.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
To keep the coming code clear and to allow both the sock
code and the scm code to share the logic, introduce a
function to translate from struct cred to struct ucred.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Register net_bridge_port pointer as rx_handler data pointer. As br_port is
removed from struct net_device, another netdev priv_flag is added to indicate
the device serves as a bridge port. Also rcuized pointers are now correctly
dereferenced in br_fdb.c and in netfilter parts.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add possibility to register rx_handler data pointer along with a rx_handler.
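Illustrative caller (handler and data names hypothetical):

        /* registration, e.g. at port-add time */
        err = netdev_rx_handler_register(dev, my_rx_handler, my_priv);
        if (err)
                return err;

        /* inside my_rx_handler(skb): retrieve the data pointer again */
        priv = rcu_dereference(skb->dev->rx_handler_data);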
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the functions __netpoll_setup/__netpoll_cleanup,
which are designed to be called recursively through ndo_netpoll_setup.
They must be called with RTNL held, and the caller must initialise
np->dev and ensure that it has a valid reference count.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds ndo_netpoll_setup as the initialisation primitive
to complement ndo_netpoll_cleanup.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
As it stands, netpoll_setup and netpoll_cleanup have no locking
protection whatsoever. So chaos ensues if two entities try to
perform them on the same device.
This patch adds RTNL to the equation. The code has been rearranged so
that bits that do not need RTNL protection are now moved to the top of
netpoll_setup.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The use of RCU in netpoll is incorrect in a number of places:
1) The initial setting is lacking a write barrier.
2) The synchronize_rcu is in the wrong place.
3) Read barriers are missing.
4) Some places are even missing rcu_read_lock.
5) npinfo is zeroed after freeing.
This patch fixes those issues. As most users are in BH context,
this also converts the RCU usage to the BH variant.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since we have to NULL npinfo regardless of whether there is a
ndo_netpoll_cleanup, it makes sense to do this unconditionally
in netpoll_cleanup rather than having every driver do it by
themselves.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
No need to copy rxhash again in __skb_clone()
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
deliver_no_wcard is not being set in skb_copy_header.
In the skb_cloned case it is not being cleared and
may cause the skb to be dropped when the loopback device
pushes it back up the stack.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use struct rtnl_link_stats64 as the statistics structure.
On 32-bit architectures, insert 32 bits of padding after/before each
field of struct net_device_stats to make its layout compatible with
struct rtnl_link_stats64. Add an anonymous union in net_device; move
stats into the union and add struct rtnl_link_stats64 stats64.
Add net_device_ops::ndo_get_stats64, implementations of which will
return a pointer to struct rtnl_link_stats64. Drivers that implement
this operation must not update the structure asynchronously.
Change dev_get_stats() to call ndo_get_stats64 if available, and to
return a pointer to struct rtnl_link_stats64. Change callers of
dev_get_stats() accordingly.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
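An illustrative driver-side sketch of the new operation, assuming the original signature in which ndo_get_stats64 returns the storage pointer (the "my_" names are hypothetical):
static struct rtnl_link_stats64 *my_get_stats64(struct net_device *dev,
                                                struct rtnl_link_stats64 *storage)
{
        struct my_priv *priv = netdev_priv(dev);

        /* fill the 64-bit structure synchronously; it must not be
         * updated asynchronously after this call returns */
        storage->rx_packets = priv->rx_packets;
        storage->tx_packets = priv->tx_packets;
        return storage;
}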
This patch increases the granularity of the rate generated by pktgen.
The previous version of pktgen used microsecond (udelay) resolution for its
delays, causing gaps in the achievable rates. It has been changed to nanosecond
(ndelay) resolution, so now any rate is possible.
It also allows setting the desired rate in Mb/s or packets per second.
The documentation has been updated.
Signed-off-by: Daniel Turull <daniel.turull@gmail.com>
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
The gen_kill_estimator() API is incomplete or not well documented, since
the caller should make sure an RCU grace period is respected before
freeing stats_lock.
This was partially addressed in commit 5d944c640b
(gen_estimator: deadlock fix), but the same problem exists for all
gen_kill_estimator() users if the lock they use is not already RCU
protected.
A code review shows xt_RATEEST.c, act_api.c, act_police.c have this
problem. Others are ok because they use the qdisc lock, which is already
RCU protected.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch corrects a bug in the delay handling of pktgen.
It makes sure the inter-packet interval is accurate.
Signed-off-by: Daniel Turull <daniel.turull@gmail.com>
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
gen_kill_estimator() / gen_new_estimator() are not always called with
RTNL held.
net/netfilter/xt_RATEEST.c is one user of these APIs that does not hold
RTNL, so random corruptions can occur between "tc" and "iptables".
Add a new fine grained lock instead of trying to use RTNL in netfilter.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, the accelerated receive path for VLAN's will
drop packets if the real device is an inactive slave and
is not one of the special pkts tested for in
skb_bond_should_drop(). This behavior is different from
the non-accelerated path and from that for pkts over a bonded vlan.
For example,
vlanx -> bond0 -> ethx
will be dropped in the vlan path and not delivered to any
packet handlers at all. However,
bond0 -> vlanx -> ethx
and
bond0 -> ethx
will be delivered to handlers that match the exact dev,
because the VLAN path checks the real_dev which is not a
slave and netif_recv_skb() doesn't drop frames but only
delivers them to exact matches.
This patch adds an sk_buff flag which is used for tagging
skbs that would previously have been dropped and allows the
skb to continue to skb_netif_recv(). Here we add
logic to check for the deliver_no_wcard flag and if it
is set only deliver to handlers that match exactly. This
makes both paths above consistent and gives pkt handlers
a way to identify skbs that come from inactive slaves.
Without this patch, in some configurations skbs will be
delivered to handlers with exact matches and in others
be dropped outright in the vlan path.
I have tested the following 4 configurations in failover modes
and load balancing modes.
# bond0 -> ethx
# vlanx -> bond0 -> ethx
# bond0 -> vlanx -> ethx
# bond0 -> ethx
|
vlanx -> --
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
BugLink: http://bugs.launchpad.net/bugs/591416
There are a number of network drivers (bridge, bonding, etc) that are not yet
receive multi-queue enabled and use alloc_netdev(), so don't print a
num_rx_queues imbalance warning in that case.
Also, only print the warning once for those drivers that _are_ multi-queue
enabled.
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
- dev_get_by_flags() changed to dev_get_by_flags_rcu()
- ipv6_sock_ac_join() doesn't touch dev & idev refcounts
- ipv6_sock_ac_drop() doesn't touch dev & idev refcounts
- ipv6_sock_ac_close() doesn't touch dev & idev refcounts
- ipv6_dev_ac_dec() doesn't touch idev refcount
- ipv6_chk_acast_addr() doesn't touch idev refcount
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The extra assertion to allow packet munging only when there are
no other ptypes listening which may have worked around an old bug
is unnecessary. It is sufficient to check if the skb is cloned before
trampling on it. Thanks to Herbert Xu for being persistent and patient
in getting this across.
[Note that cloning checks and assertions are the general rule used
by tc actions (documentation/networking/tc-actions-env-rules.txt)].
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
CAIF is using "xxx-AF_MAX" strings for the lock validator. It should use
its own strings.
Signed-off-by: Alex Lorca <alex.lorca@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can avoid an unnecessary cache miss by checking if the skb is non-linear
before accessing gso_size/gso_type in skb_warn_if_lro; the same can also be
done to avoid a cache miss on nr_frags if data_len is 0.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
What this patch does is remove two receive frame hooks (for bridge and for
macvlan) from __netif_receive_skb. These are replaced with a single
hook for both. Only one hook per device is supported, because it makes no
sense to do bridging and macvlan on the same device.
Then a network driver (of virtual netdev like macvlan or bridge) can register
an rx_handler for needed net device.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When many cpus compete for sending frames on a given qdisc, the qdisc
spinlock suffers from very high contention.
The cpu owning the __QDISC_STATE_RUNNING bit has the same priority as the
others in acquiring the lock, and cannot dequeue packets fast enough, since
it must wait for this lock for each dequeued packet.
One solution to this problem is to force all cpus spinning on a second
lock before trying to get the main lock, when/if they see
__QDISC_STATE_RUNNING already set.
The owning cpu then competes with at most one other cpu for the main
lock, allowing for a higher dequeueing rate.
Based on a previous patch from Alexander Duyck. I added the heuristic to
avoid the atomic in fast path, and put the new lock far away from the
cache line used by the dequeue worker. Also try to release the busylock
lock as late as possible.
Tests with the following script gave a boost from ~50.000 pps to ~600.000
pps on a dual quad core machine (E5450 @3.00GHz), tg3 driver.
(A single netperf flow can reach ~800.000 pps on this platform)
for j in `seq 0 3`; do
for i in `seq 0 7`; do
netperf -H 192.168.0.1 -t UDP_STREAM -l 60 -N -T $i -- -m 6 &
done
done
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
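A simplified sketch of the busylock idea (not the exact kernel code; it assumes the qdisc_is_running() helper and the q->busylock field introduced around this change):
/* cpus that see the qdisc already running serialize on a secondary lock
 * first, so at most one of them contends with the dequeue worker for the
 * qdisc root lock. */
static void example_xmit_skb(struct Qdisc *q, struct sk_buff *skb)
{
        spinlock_t *root_lock = qdisc_lock(q);
        bool contended = qdisc_is_running(q);

        if (contended)
                spin_lock(&q->busylock);        /* wait here, off the hot lock */

        spin_lock(root_lock);
        /* ... enqueue skb, possibly run the qdisc ... */
        spin_unlock(root_lock);

        if (contended)
                spin_unlock(&q->busylock);
}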
If a skb is received on an inactive bond that does not meet
the special cases checked for by skb_bond_should_drop it should
only be delivered to exact matches as the comment in
netif_receive_skb() says.
However because null_or_bond could also be null this is not
always true. This patch renames null_or_bond to orig_or_bond
and initializes it to orig_dev. This keeps the intent of
null_or_bond to pass frames received on VLAN interfaces stacked
on bonding interfaces without invalidating the statement for
null_or_orig.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Define three helpers to manipulate the QDISC_STATE_RUNNING flag, which a
second patch will move to another location.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Correct sk_forward_alloc handling for the error_queue would need to use a
backlog of frames that the softirq handler could not deliver because the
socket is owned by a user thread, or to extend backlog processing to be
able to process normal and error packets.
Another possibility is to not use memory charging for the error queue;
this is what I implemented in this patch.
Note: this reverts commit 29030374
(net: fix sk_forward_alloc corruptions), since we don't need to lock the
socket anymore.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
netpoll does interesting work in zap_completion_queue(), but this was
before we did skb orphaning before delivering packets to the device.
It now makes sense to add a test in dev_kfree_skb_irq() to not queue an
skb if it is already orphaned, and to remove netpoll's
zap_completion_queue() as a bonus.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch saves 224 bytes of text on my machine.
__this_cpu_inc() generates a single instruction, using no scratch
registers :
65 ff 04 25 a8 30 01 00 incl %gs:0x130a8
instead of :
48 c7 c2 80 30 01 00 mov $0x13080,%rdx
65 48 8b 04 25 88 ea 00 00 mov %gs:0xea88,%rax
83 44 10 28 01 addl $0x1,0x28(%rax,%rdx,1)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As David found out, sock_queue_err_skb() should be called with the socket
lock held, or we risk sk_forward_alloc corruption, since we use non-atomic
operations to update this field.
This patch adds bh_lock_sock()/bh_unlock_sock() pair to three spots.
(BH already disabled)
1) skb_tstamp_tx()
2) Before calling ip_icmp_error(), in __udp4_lib_err()
3) Before calling ipv6_icmp_error(), in __udp6_lib_err()
Reported-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (22 commits)
netlink: bug fix: wrong size was calculated for vfinfo list blob
netlink: bug fix: don't overrun skbs on vf_port dump
xt_tee: use skb_dst_drop()
netdev/fec: fix ifconfig eth0 down hang issue
cnic: Fix context memory init. on 5709.
drivers/net: Eliminate a NULL pointer dereference
drivers/net/hamradio: Eliminate a NULL pointer dereference
be2net: Patch removes redundant while statement in loop.
ipv6: Add GSO support on forwarding path
net: fix __neigh_event_send()
vhost: fix the memory leak which will happen when memory_access_ok fails
vhost-net: fix to check the return value of copy_to/from_user() correctly
vhost: fix to check the return value of copy_to/from_user() correctly
vhost: Fix host panic if ioctl called with wrong index
net: fix lock_sock_bh/unlock_sock_bh
net/iucv: Add missing spin_unlock
net: ll_temac: fix checksum offload logic
net: ll_temac: fix interrupt bug when interrupt 0 is used
sctp: dubious bitfields in sctp_transport
ipmr: off by one in __ipmr_fill_mroute()
...
The wrong size was being calculated for vfinfo. In one case, it was over-
calculating using nlmsg_total_size on attrs, in another case, it was
under-calculating by assuming ifla_vf_* structs are packed together, but
each struct is its own attr w/ hdr (and padding).
Signed-off-by: Scott Feldman <scofeldm@cisco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Noticed by Patrick McHardy: was continuing to fill skb after a
nla_put_failure, ignoring the size calculated by upper layer. Now,
return -EMSGSIZE on any overruns, but also allow netdev to
fail ndo_get_vf_port with error other than -EMSGSIZE, thus unwinding
nest.
Signed-off-by: Scott Feldman <scofeldm@cisco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 7fee226ad2 (net: add a noref bit on skb dst) missed one spot
where an skb is enqueued, with a possibly not refcounted dst entry.
__neigh_event_send() inserts skb into arp_queue, so we must make sure
dst entry is refcounted, or dst entry can be freed by garbage collector
after caller exits from rcu protected section.
Reported-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (61 commits)
tracing: Add __used annotation to event variable
perf, trace: Fix !x86 build bug
perf report: Support multiple events on the TUI
perf annotate: Fix up usage of the build id cache
x86/mmiotrace: Remove redundant instruction prefix checks
perf annotate: Add TUI interface
perf tui: Remove annotate from popup menu after failure
perf report: Don't start the TUI if -D is used
perf: Fix getline undeclared
perf: Optimize perf_tp_event_match()
perf: Remove more code from the fastpath
perf: Optimize the !vmalloc backed buffer
perf: Optimize perf_output_copy()
perf: Fix wakeup storm for RO mmap()s
perf-record: Share per-cpu buffers
perf-record: Remove -M
perf: Ensure that IOC_OUTPUT isn't used to create multi-writer buffers
perf, trace: Optimize tracepoints by using per-tracepoint-per-cpu hlist to track events
perf, trace: Optimize tracepoints by removing IRQ-disable from perf/tracepoint interaction
perf tui: Allow disabling the TUI on a per command basis in ~/.perfconfig
...
This new sock lock primitive was introduced to speed up some user-context
socket manipulation. But it is unsafe to protect two threads, one using
regular lock_sock/release_sock and one using lock_sock_bh/unlock_sock_bh.
This patch changes lock_sock_bh to be careful against 'owned' state.
If owned is found to be set, we must take the slow path.
lock_sock_bh() now returns a boolean to say if the slow path was taken,
and this boolean is used at unlock_sock_bh time to call the appropriate
unlock function.
After this change, BH are either disabled or enabled during the
lock_sock_bh/unlock_sock_bh protected section. This might be misleading,
so we rename these functions to lock_sock_fast()/unlock_sock_fast().
Reported-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Anton Blanchard <anton@samba.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
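A caller-side sketch of the resulting pattern (illustrative, not a specific call site): the boolean returned by lock_sock_fast() records whether the slow path (a full lock_sock) was taken, and selects the matching unlock variant.
bool slow;

slow = lock_sock_fast(sk);
/* ... short critical section, e.g. socket memory reclaim ... */
unlock_sock_fast(sk, slow);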
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (63 commits)
drivers/net/usb/asix.c: Fix pointer cast.
be2net: Bug fix to avoid disabling bottom half during firmware upgrade.
proc_dointvec: write a single value
hso: add support for new products
Phonet: fix potential use-after-free in pep_sock_close()
ath9k: remove VEOL support for ad-hoc
ath9k: change beacon allocation to prefer the first beacon slot
sock.h: fix kernel-doc warning
cls_cgroup: Fix build error when built-in
macvlan: do proper cleanup in macvlan_common_newlink() V2
be2net: Bug fix in init code in probe
net/dccp: expansion of error code size
ath9k: Fix rx of mcast/bcast frames in PS mode with auto sleep
wireless: fix sta_info.h kernel-doc warnings
wireless: fix mac80211.h kernel-doc warnings
iwlwifi: testing the wrong variable in iwl_add_bssid_station()
ath9k_htc: rare leak in ath9k_hif_usb_alloc_tx_urbs()
ath9k_htc: dereferencing before check in hif_usb_tx_cb()
rt2x00: Fix rt2800usb TX descriptor writing.
rt2x00: Fix failed SLEEP->AWAKE and AWAKE->SLEEP transitions.
...
This patch makes tun update its socket classid every time we
inject a packet into the network stack. This is so that any
updates made by the admin to the process writing packets to
tun take effect.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Up until now cls_cgroup has relied on fetching the classid out of
the current executing thread. This runs into trouble when a packet
processing is delayed in which case it may execute out of another
thread's context.
Furthermore, even when a packet is not delayed we may fail to
classify it if soft IRQs have been disabled, because this scenario
is indistinguishable from one where a packet unrelated to the
current thread is processed by a real soft IRQ.
In fact, the current semantics is inherently broken, as a single
skb may be constructed out of the writes of two different tasks.
A different manifestation of this problem is when the TCP stack
transmits in response of an incoming ACK. This is currently
unclassified.
As we already have a concept of packet ownership for accounting
purposes in the skb->sk pointer, this is a natural place to store
the classid in a persistent manner.
This patch adds the cls_cgroup classid in struct sock, filling up
an existing hole on 64-bit :)
The value is set at socket creation time. So all sockets created
via socket(2) automatically gain the ID of the thread creating them.
Whenever another process touches the socket by either reading or
writing to it, we will change the socket classid to that of the
process if it has a valid (non-zero) classid.
For sockets created on inbound connections through accept(2), we
inherit the classid of the original listening socket through
sk_clone, possibly preceding the actual accept(2) call.
In order to minimise risks, I have not made this the authoritative
classid. For now it is only used as a backup when we execute
with soft IRQs disabled. Once we're completely happy with its
semantics we can use it as the sole classid.
Footnote: I have rearranged the error path on cls_cgroup module
creation. If we didn't do this, then there would be a window where
someone could create a tc rule using cls_cgroup before the cgroup
subsystem has been registered.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
the commit:
commit d90310243f
Author: Octavian Purdila <opurdila@ixiacom.com>
Date: Wed Nov 18 02:36:59 2009 +0000
net: device name allocation cleanups
introduced a bug when there is a hash collision, making it impossible
to rename a device with eth%d. This bug is very hard to reproduce
and appears rarely.
The problem comes from the fact that we don't pass a temporary buffer to
__dev_alloc_name but 'dev->name', which is modified by the function.
A detailed explanation is here:
http://marc.info/?l=linux-netdev&m=127417784011987&w=2
Changelog:
V2 : replaced strings comparison by pointers comparison
Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
Reviewed-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit c02db8c629:
Author: Chris Wright <chrisw@sous-sol.org>
Date: Sun May 16 01:05:45 2010 -0700
Subject: rtnetlink: make SR-IOV VF interface symmetric
adds broken error handling to do_setlink() in net/core/rtnetlink.c. The
problem is the following chunk of code:
if (tb[IFLA_VFINFO_LIST]) {
struct nlattr *attr;
int rem;
nla_for_each_nested(attr, tb[IFLA_VFINFO_LIST], rem) {
if (nla_type(attr) != IFLA_VF_INFO)
----> goto errout;
err = do_setvfinfo(dev, attr);
if (err < 0)
goto errout;
modified = 1;
}
}
which can get to errout without setting err, resulting in the following error:
net/core/rtnetlink.c: In function 'do_setlink':
net/core/rtnetlink.c:904: warning: 'err' may be used uninitialized in this function
Change the code to return -EINVAL in this case. Note that this might not be
the appropriate error though.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Chris Wright <chrisw@sous-sol.org>
cc: David S. Miller <davem@davemloft.net>
Acked-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds F_GETPIPE_SZ and F_SETPIPE_SZ fcntl() actions for
growing and shrinking the size of a pipe and adjusts pipe.c and splice.c
(and relay and network splice) usage to work with these larger (or smaller)
pipes.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
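A small userspace sketch of the new fcntl actions (assumptions: _GNU_SOURCE exposes the F_*PIPE_SZ constants, and the kernel may round the request up before granting it):
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        int fds[2];

        if (pipe(fds))
                return 1;
        if (fcntl(fds[1], F_SETPIPE_SZ, 1 << 20) < 0)
                perror("F_SETPIPE_SZ");
        printf("pipe buffer is now %d bytes\n",
               fcntl(fds[1], F_GETPIPE_SZ));
        return 0;
}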
This reverts commit aaf8cdc34d.
Drivers like the ipw2100 call device_create_group when they
are initialized and device_remove_group when they are shutdown.
Moving them between namespaces deletes their sysfs groups early.
In particular the following call chain results.
netdev_unregister_kobject -> device_del -> kobject_del -> sysfs_remove_dir
With sysfs_remove_dir recursively deleting all of its subdirectories,
and nothing adding them back.
Ouch!
Therefore we need to call something that ultimately calls sysfs_mv_dir,
as that sysfs function can move sysfs directories between namespaces
without deleting their subdirectories or their contents. This allows
us to avoid placing extra boilerplate into every driver that does
something interesting with sysfs.
Currently the function that provides that capability is device_rename.
That is, the code works without nasty side effects as originally written.
So remove the misguided fix for moving devices between namespaces. The
bug in the kobject layer that inspired it has now been recognized and
fixed.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
I had a couple of stupid bugs in:
netns: Teach network device kobjects which namespace they are in.
- I duplicated the Kconfig for the NET_NS
- The build was broken when sysfs was not compiled in
The sysfs breakage is because, after I moved the sysfs operations
to the kobject layer to make things cleaner,
I forgot to move the ifdefs. Oops.
I'm not quite certain how I introduced a second NET_NS Kconfig,
but it was probably a 3-way merge somewhere along the way that
did not notice that the NET_NS Kconfig option had moved and thought
that was a bug. It probably slipped in because the sysfs patches
used to be the first patches in my network namespace patches.
Some things just don't go like you would expect.
Neither of these bugs actually affect anything in the common case
but they should be fixed.
Thanks to Serge for noticing they were present.
Reported-by: Serge E. Hallyn <serue@us.ibm.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Acked-by: David S. Miller <davem@davemloft.net>
The problem: network devices show up in sysfs, and with the network
namespace active, multiple devices with the same name can show up in
the same directory. Ouch!
To avoid that problem and allow existing applications in network namespaces
to see the same interface that is currently presented in sysfs, this
patch enables the tagging directory support in sysfs.
By using the network namespace pointers as tags to separate out the
sysfs directory entries, we ensure that we don't have conflicts
in the directories and that applications only see a limited set of
the network devices.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Fix some issues introduced in batch skb dequeuing for input_pkt_queue.
The primary issue is that the queue head must be incremented only
after a packet has been processed, that is only after
__netif_receive_skb has been called. This is needed for the mechanism
to prevent OOO packets in RFS. Also, when flushing the input_pkt_queue
and process_queue, the process_queue should be drained first to prevent
OOO packets.
Because the input_pkt_queue has been effectively split into two queues,
the calculation of the tail ptr is no longer correct. The correct value
would be head+input_pkt_queue->len+process_queue->len. To avoid
this calculation we added an explicit input_queue_tail in softnet_data.
The tail value is simply incremented when queuing to input_pkt_queue.
Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When GRO produces fraglist entries, and the resulting skb hits
an interface that is incapable of TSO but capable of FRAGLIST,
we end up producing a bogus packet with gso_size non-zero.
This was reported in the field with older versions of KVM that
did not set the TSO bits on tuntap.
This patch fixes that.
Reported-by: Igor Zhang <yugzhang@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add new netdev ops ndo_{set|get}_vf_port to allow setting of
port-profile on a netdev interface. Extends netlink socket RTM_SETLINK/
RTM_GETLINK with two new sub msgs called IFLA_VF_PORTS and IFLA_PORT_SELF
(added to the end of the IFLA_cmd list). These are both nested attributes
using this layout:
[IFLA_NUM_VF]
[IFLA_VF_PORTS]
[IFLA_VF_PORT]
[IFLA_PORT_*], ...
[IFLA_VF_PORT]
[IFLA_PORT_*], ...
...
[IFLA_PORT_SELF]
[IFLA_PORT_*], ...
These attributes are designed to be set and get symmetrically. VF_PORTS
is a list of VF_PORTs, one for each VF, when dealing with an SR-IOV
device. PORT_SELF is for the PF of the SR-IOV device, in case it wants
to also have a port-profile, or for the case where the VF==PF, like in
enic patch 2/2 of this patch set.
A port-profile is used to configure/enable the external switch virtual port
backing the netdev interface, not to configure the host-facing side of the
netdev. A port-profile is an identifier known to the switch. How port-
profiles are installed on the switch or how available port-profiles are
made known to the host is outside the scope of this patch.
There are two types of port-profile specs in the netlink msg. The first spec
is for the 802.1Qbg (pre-)standard VDP protocol. The second spec is for devices
that run a protocol similar to VDP but in firmware, thus hiding the protocol
details. In either case, the specs have much in common, and it makes sense to
define the netlink msg as the union of the two specs. For example, both specs
have a notion of associating/deassociating a port-profile. And both specs
require some information from the hypervisor manager, such as client port
instance ID.
The general flow is the port-profile is applied to a host netdev interface
using RTM_SETLINK, the receiver of the RTM_SETLINK msg communicates with the
switch, and the switch virtual port backing the host netdev interface is
configured/enabled based on the settings defined by the port-profile. What
those settings comprise, and how those settings are managed is again
outside the scope of this patch, since this patch only deals with the
first step in the flow.
Signed-off-by: Scott Feldman <scofeldm@cisco.com>
Signed-off-by: Roopa Prabhu <roprabhu@cisco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
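A hedged sketch of how the nested layout above might be emitted with the generic netlink attribute helpers (error handling elided; num_vfs is an illustrative variable, and filling the IFLA_PORT_* attributes is left to the driver, e.g. via ndo_get_vf_port):
struct nlattr *vf_ports, *vf_port;
int vf;

vf_ports = nla_nest_start(skb, IFLA_VF_PORTS);
for (vf = 0; vf < num_vfs; vf++) {
        vf_port = nla_nest_start(skb, IFLA_VF_PORT);
        /* driver fills the IFLA_PORT_* attributes for this VF, e.g. via
         * dev->netdev_ops->ndo_get_vf_port(dev, vf, skb) */
        nla_nest_end(skb, vf_port);
}
nla_nest_end(skb, vf_ports);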
Also added an explicit break; to avoid
a fallthrough in net/ipv4/tcp_input.c
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use low order bit of skb->_skb_dst to tell dst is not refcounted.
Change _skb_dst to _skb_refdst to make sure all uses are caught.
skb_dst() returns the dst, regardless of noref bit set or not, but
with a lockdep check to make sure a noref dst is not given if current
user is not rcu protected.
New skb_dst_set_noref() helper to set a non-refcounted dst on an skb.
(with lockdep check)
skb_dst_drop() drops a reference only if skb dst was refcounted.
skb_dst_force() helper is used to force a refcount on dst, when skb
is queued and not anymore RCU protected.
Use skb_dst_force() in __sk_add_backlog(), __dev_xmit_skb() if
!IFF_XMIT_DST_RELEASE or skb enqueued on qdisc queue, in
sock_queue_rcv_skb(), in __nf_queue().
Use skb_dst_force() in dev_requeue_skb().
Note: dst_use_noref() still dirties dst, we might transform it
later to do one dirtying per jiffies.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
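A simplified sketch of the low-order-bit encoding (the EXAMPLE_* macros and accessor are illustrative stand-ins for the real skbuff.h definitions): bit 0 of skb->_skb_refdst marks a dst borrowed under RCU rather than refcounted, and skb_dst_force() must clear that state before the skb leaves the RCU-protected path.
#define EXAMPLE_DST_NOREF       1UL
#define EXAMPLE_DST_PTRMASK     (~(EXAMPLE_DST_NOREF))

static inline struct dst_entry *example_skb_dst(const struct sk_buff *skb)
{
        /* strip the noref bit to recover the dst pointer itself */
        return (struct dst_entry *)(skb->_skb_refdst & EXAMPLE_DST_PTRMASK);
}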
If CONFIG_SMP=y, then we own a queue spinlock, we can avoid the atomic
test_and_set_bit() from napi_schedule_prep().
We now have the same number of atomic ops per netif_rx() call as with the
pre-RPS kernel.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now we have a set of nested attributes:
IFLA_VFINFO_LIST (NESTED)
IFLA_VF_INFO (NESTED)
IFLA_VF_MAC
IFLA_VF_VLAN
IFLA_VF_TX_RATE
This allows a single set to operate on multiple attributes if desired.
Among other things, it means a dump can be replayed to set state.
The current interface has yet to be released, so this seems like
something to consider for 2.6.34.
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP-MD5 sessions have intermittent failures, when route cache is
invalidated. ip_queue_xmit() has to find a new route, calls
sk_setup_caps(sk, &rt->u.dst), destroying the
sk->sk_route_caps &= ~NETIF_F_GSO_MASK
that MD5 desperately tries to apply all along its path (from
tcp_transmit_skb() for example).
So we send a few bad packets, and everything is fine when
tcp_transmit_skb() is called again for this socket.
Since ip_queue_xmit() is at a lower level than TCP-MD5, I chose to use a
socket field, sk_route_nocaps, containing bits to mask on sk_route_caps.
Reported-by: Bhaskar Dutta <bhaskie@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With RPS inclusion, skb timestamping is not consistent in RX path.
If netif_receive_skb() is used, it's deferred until after RPS dispatch.
If netif_rx() is used, it's done before RPS dispatch.
This can give strange tcpdump timestamps results.
I think timestamping should be done as soon as possible in the receive
path, to get meaningful values (ie timestamps taken at the time packet
was delivered by NIC driver to our stack), even if NAPI already can
defer timestamping a bit (RPS can help to reduce the gap)
Tom Herbert prefer to sample timestamps after RPS dispatch. In case
sampling is expensive (HPET/acpi_pm on x86), this makes sense.
Let admins switch from one mode to another, using a new
sysctl, /proc/sys/net/core/netdev_tstamp_prequeue
Its default value (1) means timestamps are taken as soon as possible,
before backlog queueing, giving accurate timestamps.
Setting a value of 0 permits sampling timestamps when processing the
backlog, after RPS dispatch, to lower the load on the pre-RPS cpu.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now there's a null check here and also again in the hook. Looking at the
bridge bits, which are similar, the port structure is rcu_dereferenced right
away in handle_bridge and passed to the hook. Looks nicer.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds data to be passed to tracepoint callbacks.
The created functions from DECLARE_TRACE() now need a mandatory data
parameter. For example:
DECLARE_TRACE(mytracepoint, int value, value)
Will create the register function:
int register_trace_mytracepoint((void(*)(void *data, int value))probe,
void *data);
As the first argument, all callbacks (probes) must take a (void *data)
parameter. So a callback for the above tracepoint will look like:
void myprobe(void *data, int value)
{
}
The callback may choose to ignore the data parameter.
This change allows callbacks to register a private data pointer along
with the function probe.
void mycallback(void *data, int value);
register_trace_mytracepoint(mycallback, mydata);
Then the mycallback() will receive the "mydata" as the first parameter
before the args.
A more detailed example:
DECLARE_TRACE(mytracepoint, TP_PROTO(int status), TP_ARGS(status));
/* In the C file */
DEFINE_TRACE(mytracepoint, TP_PROTO(int status), TP_ARGS(status));
[...]
trace_mytracepoint(status);
/* In a file registering this tracepoint */
int my_callback(void *data, int status)
{
struct my_struct *my_data = data;
[...]
}
[...]
my_data = kmalloc(sizeof(*my_data), GFP_KERNEL);
init_my_data(my_data);
register_trace_mytracepoint(my_callback, my_data);
The same callback can also be registered to the same tracepoint as long
as the data registered is different. Note, the data must also be used
to unregister the callback:
unregister_trace_mytracepoint(my_callback, my_data);
Because of the data parameter, tracepoints declared this way can not have
no args. That is:
DECLARE_TRACE(mytracepoint, TP_PROTO(void), TP_ARGS());
will cause an error.
If no arguments are needed, a new macro can be used instead:
DECLARE_TRACE_NOARGS(mytracepoint);
Since there are no arguments, the proto and args fields are left out.
This is part of a series to make the tracepoint footprint smaller:
text data bss dec hex filename
4913961 1088356 861512 6863829 68bbd5 vmlinux.orig
4914025 1088868 861512 6864405 68be15 vmlinux.class
4918492 1084612 861512 6864616 68bee8 vmlinux.tracepoint
Again, this patch also increases the size of the kernel, but
lays the ground work for decreasing it.
v5: Fixed net/core/drop_monitor.c to handle these updates.
v4: Moved the DECLARE_TRACE() DECLARE_TRACE_NOARGS out of the
#ifdef CONFIG_TRACE_POINTS, since the two are the same in both
cases. The __DECLARE_TRACE() is what changes.
Thanks to Frederic Weisbecker for pointing this out.
v3: Made all register_* functions require data to be passed and
all callbacks to take a void * parameter as its first argument.
This makes the calling functions comply with C standards.
Also added more comments to the modifications of DECLARE_TRACE().
v2: Made the DECLARE_TRACE() have the ability to pass arguments
and added a new DECLARE_TRACE_NOARGS() for tracepoints that
do not need any arguments.
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Introduce ____napi_schedule() helper for callers in irq disabled
contexts. rps_trigger_softirq() becomes a leaf function.
Use container_of() in process_backlog() instead of accessing per_cpu
address.
Use a custom inlined version of __napi_complete() in process_backlog()
to avoid one locked instruction :
only current cpu owns and manipulates this napi,
and NAPI_STATE_SCHED is the only possible flag set on backlog.
we can use a plain write instead of clear_bit(),
and we dont need an smp_mb() memory barrier, since RPS is on,
backlog is protected by a spinlock.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case of congestion, netif_rx() frees the skb, so we must assume
dev_forward_skb() also consumes the skb.
Bug introduced by commit 445409602c
(veth: move loopback logic to common location)
We must change dev_forward_skb() to always consume skb, and veth to not
double free it.
Bug report : http://marc.info/?l=linux-netdev&m=127310770900442&w=3
Reported-by: Martín Ferrari <martin.ferrari@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This whole patchset is for adding netpoll support to bridge and bonding
devices. I already tested it for bridge, bonding, bridge over bonding,
and bonding over bridge. It looks fine now.
To make bridge and bonding support netpoll, we need to adjust
some netpoll generic code. This patch does the following things:
1) introduce two new priv_flags for struct net_device:
IFF_IN_NETPOLL which identifies we are processing a netpoll;
IFF_DISABLE_NETPOLL is used to disable netpoll support for a device
at run-time;
2) introduce one new method for netdev_ops:
->ndo_netpoll_cleanup() is used to clean up netpoll when a device is
removed.
3) introduce netpoll_poll_dev() which takes a struct net_device * parameter;
export netpoll_send_skb() and netpoll_poll_dev() which will be used later;
4) hide a pointer to struct netpoll in struct netpoll_info, ditto.
5) introduce ->real_dev for struct netpoll.
6) introduce a new status, NETDEV_BONDING_DESLAVE, which is used to disable
netconsole before releasing a slave, to avoid deadlocks.
Cc: David Miller <davem@davemloft.net>
Cc: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: WANG Cong <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With the following patch I can reach the maximum rate of my pktgen+udpsink
simulator:
- 'old' machine : dual quad core E5450 @3.00GHz
- 64 UDP rx flows (only differ by destination port)
- RPS enabled, NIC interrupts serviced on cpu0
- rps dispatched on 7 other cores. (~130.000 IPI per second)
- SLAB allocator (faster than SLUB in this workload)
- tg3 NIC
- 1.080.000 pps without a single drop at NIC level.
The idea is to add two prefetchw() calls in __alloc_skb(): one to prefetch
the first sk_buff cache line, the second to prefetch the shinfo part.
Also use one memset() to initialize all skb_shared_info fields instead
of one by one, reducing the number of instructions by using long word moves.
All skb_shared_info fields before 'dataref' are cleared in
__alloc_skb().
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 4b0b72f7dd ( net: speedup udp receive path )
introduced a bug in skb_free_datagram_locked().
We should not skb_orphan() the skb if we don't have the guarantee that we
are the last skb user; this might happen with concurrent MSG_PEEK users.
To keep the socket locked for the smallest period of time, we split the
consume_skb() logic, inlined in skb_free_datagram_locked().
Reported-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Per cpu variable softnet_data.total was shared between IRQ and SoftIRQ context
without any protection. And enqueue_to_backlog should update the netdev_rx_stat
of the target CPU.
This patch renames softnet_data.total to softnet_data.processed: the number of
packets processed in upper levels (IP stacks).
softnet_stat data is moved into softnet_data.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
----
include/linux/netdevice.h | 17 +++++++----------
net/core/dev.c | 26 ++++++++++++--------------
net/sched/sch_generic.c | 2 +-
3 files changed, 20 insertions(+), 25 deletions(-)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 6be8ac2f ("[NET]: uninline skb_pull, de-bloats a lot")
we uninlined skb_pull.
But in some critical paths it makes sense to inline this thing
and it helps performance significantly.
Create an skb_pull_inline() so that we can do this in a way that
serves also as annotation.
Based upon a patch by Eric Dumazet.
Signed-off-by: David S. Miller <davem@davemloft.net>
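A sketch of what such an inline variant looks like (simplified; the function name is illustrative and the paged-data check of __skb_pull is omitted): same semantics as skb_pull(), but kept in the header so hot callers avoid a function call.
static inline unsigned char *example_skb_pull_inline(struct sk_buff *skb,
                                                     unsigned int len)
{
        if (unlikely(len > skb->len))
                return NULL;            /* cannot pull more than is present */
        skb->len -= len;
        return skb->data += len;
}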
sk_callback_lock rwlock actually protects sk->sk_sleep pointer, so we
need two atomic operations (and associated dirtying) per incoming
packet.
RCU conversion is pretty much needed :
1) Add a new structure, called "struct socket_wq" to hold all fields
that will need rcu_read_lock() protection (currently: a
wait_queue_head_t and a struct fasync_struct pointer).
[Future patch will add a list anchor for wakeup coalescing]
2) Attach one of such structure to each "struct socket" created in
sock_alloc_inode().
3) Respect RCU grace period when freeing a "struct socket_wq"
4) Change sk_sleep pointer in "struct sock" by sk_wq, pointer to "struct
socket_wq"
5) Change sk_sleep() function to use new sk->sk_wq instead of
sk->sk_sleep
6) Change sk_has_sleeper() to wq_has_sleeper() that must be used inside
a rcu_read_lock() section.
7) Change all sk_has_sleeper() callers to :
- Use rcu_read_lock() instead of read_lock(&sk->sk_callback_lock)
- Use wq_has_sleeper() to eventually wakeup tasks.
- Use rcu_read_unlock() instead of read_unlock(&sk->sk_callback_lock)
8) sock_wake_async() is modified to use rcu protection as well.
9) Exceptions :
macvtap, drivers/net/tun.c, af_unix use integrated "struct socket_wq"
instead of dynamically allocated ones. They dont need rcu freeing.
Some cleanups or followups are probably needed (a possible
sk_callback_lock conversion to a spinlock, for example...).
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
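A sketch of the converted wakeup pattern from step 7 (illustrative callback name; it assumes the socket_wq-based wq_has_sleeper() of that era): the waitqueue is reached through the RCU-protected struct socket_wq instead of taking sk->sk_callback_lock.
static void example_data_ready(struct sock *sk)
{
        struct socket_wq *wq;

        rcu_read_lock();
        wq = rcu_dereference(sk->sk_wq);
        if (wq_has_sleeper(wq))
                wake_up_interruptible(&wq->wait);
        rcu_read_unlock();
}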
Since commit 95766fff ([UDP]: Add memory accounting.),
each received packet needs one extra sock_lock()/sock_release() pair.
This added latency because of possible backlog handling. Then later,
ticket spinlocks added yet another latency source in case of DDOS.
This patch introduces lock_sock_bh() and unlock_sock_bh()
synchronization primitives, avoiding one atomic operation and backlog
processing.
skb_free_datagram_locked() uses them instead of full blown
lock_sock()/release_sock(). skb is orphaned inside locked section for
proper socket memory reclaim, and finally freed outside of it.
The UDP receive path now takes the socket spinlock only once.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now there's no need to use this function directly because it's handled by
register_pernet_device. So, to make this simple and easy to understand,
make it static so as not to tempt potential users.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current socket backlog limit is not enough to really stop DDOS attacks,
because the user thread spends a lot of time processing a full backlog each
round, and might spin wildly on the socket lock.
We should add the backlog size and receive_queue size (aka rmem_alloc) to
pace writers, and let the user run without being slowed down too much.
Introduce a sk_rcvqueues_full() helper, to avoid taking socket lock in
stress situations.
Under huge stress from a multiqueue/RPS enabled NIC, a single flow udp
receiver can now process ~200.000 pps (instead of ~100 pps before the
patch) on a 8 core machine.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
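A simplified sketch of the helper's intent (the function name is illustrative): treat the socket as full when backlog length plus already-charged receive memory exceeds sk_rcvbuf, so the softirq path can drop early without taking the socket lock.
static inline bool example_rcvqueues_full(const struct sock *sk)
{
        unsigned int qsize = sk->sk_backlog.len +
                             atomic_read(&sk->sk_rmem_alloc);

        return qsize > sk->sk_rcvbuf;
}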
batch skb dequeueing from softnet input_pkt_queue to reduce potential lock
contention when RPS is enabled.
Note: in the worst case, the number of packets in a softnet_data may
be double netdev_max_backlog.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
reimplement softnet_data.output_queue as a FIFO queue to keep the
fairness among the qdiscs rescheduled.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
----
include/linux/netdevice.h | 1 +
net/core/dev.c | 22 ++++++++++++----------
2 files changed, 13 insertions(+), 10 deletions(-)
Signed-off-by: David S. Miller <davem@davemloft.net>
Decouple rtnetlink address families from real address families in socket.h to
be able to add rtnetlink interfaces to code that is not a real address family
without increasing AF_MAX/NPROTO.
This will be used to add support for multicast route dumping from all tables
as the proc interface can't be extended to support anything but the main table
without breaking compatibility.
This partially undoes the patch to introduce independent families for routing
rules and converts ipmr routing rules to a new rtnetlink family. Similar to
that patch, values up to 127 are reserved for real address families, values
above that may be used arbitrarily.
Signed-off-by: Patrick McHardy <kaber@trash.net>
fib_rules_register() duplicates the template passed to it without modification,
mark the argument as const. Additionally the templates are only needed when
instantiating a new namespace, so mark them as __net_initdata, which means
they can be discarded when CONFIG_NET_NS=n.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Stay consistent with other functions and with the comment, and name
the pernet_operations parameter properly.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Optimize rps_get_cpu():
don't initialize ports when we can get the ports directly. One memory access
for the ports instead of two.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add an SKF_AD_HATYPE field to the packet ancillary data area, giving
access to skb->dev->type, as reported in the sll_hatype field.
When capturing packets on a PF_PACKET/SOCK_RAW socket bound to all
interfaces, there doesn't appear to be a way for the filter program to
actually find out the underlying hardware type the packet was captured
on. This patch adds such ability.
This patch also handles the case where skb->dev can be NULL, such as on
netlink sockets.
Signed-off-by: Paul Evans <leonerd@leonerd.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
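A userspace sketch of a classic BPF filter using the new ancillary word (illustrative; it only accepts frames captured on Ethernet devices, ARPHRD_ETHER):
#include <linux/filter.h>
#include <linux/if_arp.h>
#include <sys/socket.h>

static struct sock_filter code[] = {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, SKF_AD_OFF + SKF_AD_HATYPE),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, ARPHRD_ETHER, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, 0xffff),      /* accept */
        BPF_STMT(BPF_RET | BPF_K, 0),           /* drop */
};
static struct sock_fprog prog = { .len = 4, .filter = code };

/* ... attach with setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER,
 *     &prog, sizeof(prog)) on a PF_PACKET/SOCK_RAW socket ... */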
In the original code, if rtnl_create_link() returned an ERR_PTR then that
would get passed to rtnl_configure_link() which dereferences it.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This way GSO packets don't get handled differently.
With help from Eric Dumazet.
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
If some skbs are queued to our backlog, we delay IPI sending until
the end of net_rx_action(), increasing latencies. This defeats the
queueing, since we want to quickly dispatch packets to the pool of
worker cpus, then eventually deeply process our packets.
It's better to send IPI before processing our packets in upper layers,
from process_backlog().
Change the _and_disable_irq suffix to _and_enable_irq(), since we enable
local irq in net_rps_action(), sorry for the confusion.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
At this point, skb->destructor is not the original one (stored in
DEV_GSO_CB(skb)->destructor)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no need to export skb_under_panic() and skb_over_panic() in
skbuff.c, since these methods are used only in skbuff.c; this patch
removes these two exports. It also marks these functions as 'static'
and removes their extern declarations from
include/linux/skbuff.h
Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Define a new function to return the waitqueue of a "struct sock".
static inline wait_queue_head_t *sk_sleep(struct sock *sk)
{
return sk->sk_sleep;
}
Change all read occurrences of sk_sleep to a call to this function.
Needed for a future RCU conversion; sk_sleep won't be a field directly
available.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since netdev_chain is guarded by rtnl_lock, ASSERT_RTNL should be
present here to make sure that all callers of call_netdevice_notifiers
do the locking properly.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case we compute a software skb->rxhash, we can generate a consistent
hash : Its value will be the same in both flow directions.
This helps some workloads, like conntracking, since the same state needs
to be accessed in both directions.
tbench + RFS + this patch gives better results than tbench with default
kernel configuration (no RPS, no RFS)
Also fixed some sparse warnings.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
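A simplified sketch of the symmetry trick (not the exact kernel code; the function name is illustrative): order the two addresses before hashing, swapping the port halves accordingly, so A->B and B->A hash to the same value.
static u32 example_sym_rxhash(u32 saddr, u32 daddr,
                              u16 sport, u16 dport, u32 hashrnd)
{
        u32 addr1 = saddr, addr2 = daddr;
        u32 ports = ((u32)sport << 16) | dport;

        if (addr2 < addr1) {
                swap(addr1, addr2);
                ports = ((u32)dport << 16) | sport; /* keep ports consistent */
        }
        return jhash_3words(addr1, addr2, ports, hashrnd);
}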
struct softnet_data holds many queues, so consistently using the name
"sd" instead of "queue" is better.
Adds a rps_ipi_queued() helper to cleanup enqueue_to_backlog()
Adds a _and_irq_disable suffix to net_rps_action() name, as David
suggested.
incr_input_queue_head() becomes input_queue_head_incr()
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
store_rps_map() & store_rps_dev_flow_table_cnt() are static.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net_rps_action() is a bit expensive on NR_CPUS=64..4096 kernels, even if
RPS is not active.
Tom Herbert used two bitmasks to hold information needed to send IPI,
but a single LIFO list seems more appropriate.
Move all RPS logic into net_rps_action() to cleanup net_rx_action() code
(remove two ifdefs)
Move rps_remote_softirq_cpus into softnet_data to share its first cache
line, filling an existing hole.
In a future patch, we could call net_rps_action() from process_backlog()
to make sure we send IPI before handling this cpu backlog.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Transmitted skb might be attached to a socket and a destructor, for
memory accounting purposes.
Traditionally, this destructor is called at tx completion time, when skb
is freed.
When tx completion is performed by another cpu than the sender, this
forces some cache lines to change ownership. XPS was an attempt to give
tx completion to initial cpu.
David idea is to call destructor right before giving skb to device (call
to ndo_start_xmit()). Because device queues are usually small, orphaning
skb before tx completion is not a big deal. Some drivers already do
this, we could do it in upper level.
There is one known exception to this early orphaning, called tx
timestamping. It needs to keep a reference to socket until device can
give a hardware or software timestamp.
This patch adds a skb_orphan_try() helper, to centralize all exceptions
to early orphaning in one spot, and use it in dev_hard_start_xmit().
"tbench 16" results on a Nehalem machine (2 X5570 @ 2.93GHz)
before: Throughput 4428.9 MB/sec 16 procs
after: Throughput 4448.14 MB/sec 16 procs
UDP should get even better results, its destructor being more complex,
since SOCK_USE_WRITE_QUEUE is not set (four atomic ops instead of one)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
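A sketch of the helper's intent (illustrative name; the exact tx-timestamp flag field has changed name across kernel versions): orphan the skb early unless tx timestamping still needs the socket reference.
static inline void example_orphan_try(struct sk_buff *skb)
{
        /* keep the socket reference if a tx timestamp is pending */
        if (skb->sk && !skb_shinfo(skb)->tx_flags)
                skb_orphan(skb);
}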
- There is no point in enforcing a time limit in process_backlog(), since
other napi instances don't follow the same rule. We can exit after only one
packet is processed...
The normal quota of 64 packets per napi instance should be the norm, and
net_rx_action() already has its own time limit.
Note : /proc/net/core/dev_weight can be used to tune this 64 default
value.
- Use DEFINE_PER_CPU_ALIGNED for softnet_data definition.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements receive flow steering (RFS). RFS steers
received packets for layer 3 and 4 processing to the CPU where
the application for the corresponding flow is running. RFS is an
extension of Receive Packet Steering (RPS).
The basic idea of RFS is that when an application calls recvmsg
(or sendmsg) the application's running CPU is stored in a hash
table that is indexed by the connection's rxhash which is stored in
the socket structure. The rxhash is passed in skb's received on
the connection from netif_receive_skb. For each received packet,
the associated rxhash is used to look up the CPU in the hash table,
if a valid CPU is set then the packet is steered to that CPU using
the RPS mechanisms.
The convolution of the simple approach is that it would potentially
allow OOO packets. If threads are thrashing around CPUs or multiple
threads are trying to read from the same sockets, a quickly changing
CPU value in the hash table could cause rampant OOO packets--
we consider this a non-starter.
To avoid OOO packets, this solution implements two types of hash
tables: rps_sock_flow_table and rps_dev_flow_table.
rps_sock_flow_table is a global hash table. Each entry is just a CPU
number and it is populated in recvmsg and sendmsg as described above.
This table contains the "desired" CPUs for flows.
rps_dev_flow_table is specific to each device queue. Each entry
contains a CPU and a tail queue counter. The CPU is the "current"
CPU for a matching flow. The tail queue counter holds the value
of a tail queue counter for the associated CPU's backlog queue at
the time of last enqueue for a flow matching the entry.
Each backlog queue has a queue head counter which is incremented
on dequeue, and so a queue tail counter is computed as queue head
count + queue length. When a packet is enqueued on a backlog queue,
the current value of the queue tail counter is saved in the hash
entry of the rps_dev_flow_table.
And now the trick: when selecting the CPU for RPS (get_rps_cpu)
the rps_sock_flow table and the rps_dev_flow table for the RX queue
are consulted. When the desired CPU for the flow (found in the
rps_sock_flow table) does not match the current CPU (found in the
rps_dev_flow table), the current CPU is changed to the desired CPU
if one of the following is true:
- The current CPU is unset (equal to RPS_NO_CPU)
- Current CPU is offline
- The current CPU's queue head counter >= queue tail counter in the
rps_dev_flow table. This checks if the queue tail has advanced
beyond the last packet that was enqueued using this table entry.
This guarantees that all packets queued using this entry have been
dequeued, thus preserving in order delivery.
Making each queue have its own rps_dev_flow table has two advantages:
1) the tail queue counters will be written on each receive, so
keeping the table local to interrupting CPU s good for locality. 2)
this allows lockless access to the table-- the CPU number and queue
tail counter need to be accessed together under mutual exclusion
from netif_receive_skb, we assume that this is only called from
device napi_poll which is non-reentrant.
This patch implements RFS for TCP and connected UDP sockets.
It should be usable for other flow oriented protocols.
There are two configuration parameters for RFS. The
"rps_flow_entries" kernel init parameter sets the number of
entries in the rps_sock_flow_table, the per rxqueue sysfs entry
"rps_flow_cnt" contains the number of entries in the rps_dev_flow
table for the rxqueue. Both are rounded to power of two.
The obvious benefit of RFS (over just RPS) is that it achieves
CPU locality between the receive processing for a flow and the
applications processing; this can result in increased performance
(higher pps, lower latency).
The benefits of RFS are dependent on cache hierarchy, application
load, and other factors. On simple benchmarks, we don't necessarily
see improvement and sometimes see degradation. However, for more
complex benchmarks and for applications where cache pressure is
much higher this technique seems to perform very well.
Below are some benchmark results which show the potential benefit of
this patch. The netperf test has 500 instances of netperf TCP_RR
test with 1 byte req. and resp. The RPC test is a request/response
test similar in structure to the netperf RR test with 100 threads on
each host, but does more work in userspace than netperf.
e1000e on 8 core Intel
No RFS or RPS               104K tps at 30% CPU
No RFS (best RPS config):   290K tps at 63% CPU
RFS                         303K tps at 61% CPU

RPC test        tps     CPU%    50/90/99% usec latency  Latency StdDev
No RFS/RPS      103K    48%     757/900/3185            4472.35
RPS only:       174K    73%     415/993/2468            491.66
RFS             223K    73%     379/651/1382            315.61
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When dev_pick_tx() caches tx queue_index on a socket, we must check that
the socket dst_entry matches the skb one, or risk a crash later, as reported
by Denys Fedorysychenko, if old packets are in flight during a route
change, involving devices with different numbers of queues.
Bug introduced by commit a4ee3ce3
(net: Use sk_tx_queue_mapping for connected sockets)
Reported-by: Denys Fedorysychenko <nuclearcat@nuclearcat.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Paris reported netif_rx() is calling smp_processor_id() from
preemptible context, in particular when caller is
ip_dev_loopback_xmit().
The RPS commit added this smp_processor_id() call; this patch makes sure
preemption is disabled. rps_get_cpus() wants rcu_read_lock() anyway, so we
can do it a bit earlier.
Reported-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Decouple the address family values used for fib_rules from the real
address families in socket.h. This allows fib_rules to be used for
code that is not a real address family without increasing AF_MAX/NPROTO.
Values up to 127 are reserved for real address families and map directly
to the corresponding AF value, values starting from 128 are for other
uses. rtnetlink is changed to invoke the AF_UNSPEC dumpit/doit handlers
for these families.
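A minimal sketch of the resulting mapping (the helper name is illustrative;
the 128 boundary follows the text above):

static inline int fib_rule_family_to_af(int family)
{
	/* 0..127 map 1:1 to real socket address families; 128 and above are
	 * fib_rules-only pseudo-families handled via the AF_UNSPEC handlers. */
	return family < 128 ? family : AF_UNSPEC;
}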
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
All fib_rules implementations need to set the family in their ->fill()
functions. Since the value is available to the generic fib_nl_fill_rule()
function, set it there.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Both functions are equivalent, consolidate them since a following patch
needs a third implementation for multicast routing.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function dst_ifdown is called from only two places, both in non-
performance-critical code paths, so there is no reason to inline it.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_bond_should_drop() is too big to be inlined.
This patch reduces kernel text size, and its compilation time as well
(shrinking include/linux/netdevice.h)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With latest CONFIG_PROVE_RCU stuff, I felt more comfortable to make this
work.
sk->sk_dst_cache is currently protected by a rwlock (sk_dst_lock)
This rwlock is readlocked for a very small amount of time, and dst
entries are already freed after RCU grace period. This calls for RCU
again :)
This patch converts sk_dst_lock to a spinlock, and uses RCU for readers.
__sk_dst_get() is supposed to be called with rcu_read_lock() or if
socket locked by user, so use appropriate rcu_dereference_check()
condition (rcu_read_lock_held() || sock_owned_by_user(sk))
This patch avoids two atomic ops per tx packet on UDP connected sockets,
for example, and permits sk_dst_lock to be much less dirtied.
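The reader-side accessor implied above looks roughly like this (a sketch
following the condition quoted in the text):

static inline struct dst_entry *__sk_dst_get(struct sock *sk)
{
	return rcu_dereference_check(sk->sk_dst_cache,
				     rcu_read_lock_held() ||
				     sock_owned_by_user(sk));
}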
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Don't use netdev_warn() in dev_cap_txqueue() and get_rps_cpu(), so that we
can catch the following warnings without a crash.
bond0.2240 received packet on queue 6, but number of RX queues is 1
bond0.2240 received packet on queue 11, but number of RX queues is 1
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix coding style errors and warnings output while running checkpatch.pl
on the files net/core/ethtool.c and include/linux/ethtool.h
Signed-off-by: chavey <chavey@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As pointed by Randy Dunlap, we must include linux/proc_fs.h in
net/core/dev_addr_lists.c, regardless of CONFIG_PROC_FS
Reported-by: Randy Dunlap <randy.dunlap@oracle.com>,
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Speed up lookups by freeing flow cache entries later. After
virtualizing flow cache entry operations, the flow cache may now
end up calling a policy or bundle destructor, which can be slowish.
As gc_list is more efficient with a doubly linked list, the flow cache
is converted to use the common hlist and list macros where appropriate.
Signed-off-by: Timo Teras <timo.teras@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
This allows the cached object to be validated before returning it.
It also allows the object to be destructed properly, if the last reference
was held in the flow cache. This is also a preparation for caching
bundles in the flow cache.
In return for virtualizing the methods, we save on:
- not having to regenerate the whole flow cache on policy removal:
each flow matching a killed policy gets refreshed as the getter
function notices it smartly.
- we do not have to call flow_cache_flush from policy gc, since the
flow cache now properly deletes the object if it had any references
Signed-off-by: Timo Teras <timo.teras@iki.fi>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
As noticed by Changli Gao, we must call local_irq_enable() after
rps_unlock()
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix spin_unlock_irq which needs to be rps_unlock.
Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Converts the list and the core code manipulating it to be the same as uc_list.
+uses two functions for adding/removing mc address (normal and "global"
variant) instead of a function parameter.
+removes dev_mcast.c completely.
+exposes netdev_hw_addr_list_* macros along with __hw_addr_* functions for
manipulating lists in a sandbox (used in bonding and 80211 drivers)
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
+slight renaming of unicast functions to be consistent with the multicast ones
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Followup to commit 5acbbd428d
(net: change illegal_highdma to use dma_mask)
If dev->dev.parent is NULL, we should not try to dereference it.
Don't force inline illegal_highdma(), as it's pretty big now.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Robert Hancock pointed out two problems about NETIF_F_HIGHDMA:
-Many drivers only set the flag when they detect they can use 64-bit DMA,
since otherwise they could receive DMA addresses that they can't handle
(which on platforms without IOMMU/SWIOTLB support is fatal). This means that if
64-bit support isn't available, even buffers located below 4GB will get copied
unnecessarily.
-Some drivers set the flag even though they can't actually handle 64-bit DMA,
which would mean that on platforms without IOMMU/SWIOTLB they would get a DMA
mapping error if the memory they received happened to be located above 4GB.
http://lkml.org/lkml/2010/3/3/530
We can use the dma_mask to decide whether we need bouncing here. Then we can
safely fix drivers that misuse NETIF_F_HIGHDMA.
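A hedged sketch of the idea (illustrative helper only; the real check has to
walk the skb fragments):

static bool device_needs_bounce(struct device *dev, u64 dma_addr)
{
	/* If the device's DMA mask cannot address this buffer, it must be
	 * bounced (or copied) regardless of NETIF_F_HIGHDMA. */
	return !dev || !dev->dma_mask || dma_addr > *dev->dma_mask;
}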
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
Group all per-cpu data to one structure instead of having many
globals. Also prepare the internals so that we can have multiple
instances of the flow cache if needed.
Only the kmem_cache is left as a global as all flow caches share
the same element size, and benefit from using a common cache.
Signed-off-by: Timo Teras <timo.teras@iki.fi>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
keep the old behavior on SMP without rps
RPS introduces a lock operation to per cpu variable input_pkt_queue on
SMP whether rps is enabled or not. On SMP without RPS, this lock isn't
needed at all.
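One way to express this, sketched under the assumption of a compile-time RPS
switch (such as the CONFIG_RPS option added elsewhere in this series):

#ifdef CONFIG_RPS
#define rps_lock(sd)	spin_lock(&(sd)->input_pkt_queue.lock)
#define rps_unlock(sd)	spin_unlock(&(sd)->input_pkt_queue.lock)
#else
#define rps_lock(sd)	do { } while (0)
#define rps_unlock(sd)	do { } while (0)
#endif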
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
----
net/core/dev.c | 42 ++++++++++++++++++++++++++++--------------
1 file changed, 28 insertions(+), 14 deletions(-)
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix coding style errors and warnings output while running checkpatch.pl
on the file net/core/dst.c.
Signed-off-by: chavey <chavey@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds an ethtool and device feature flag to allow control
of receive hashing offload.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Jeff Garzik <jgarzik@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When more data is stuffed into an nlmsg than initially projected, an
extra allocation needs to be done. Reserve enough for IFLA_STATS64 so
that this does not needlessly happen.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tony Luck observes that the original IFLA_STATS64 submission causes
unaligned accesses. This is because nla_data() returns a pointer to a
memory region that is only aligned to 32 bits. Do some memcpying to
work around this.
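A minimal sketch of the workaround (treat as illustrative, not the exact
patch):

static int put_stats64(struct sk_buff *skb,
		       const struct rtnl_link_stats64 *stats)
{
	struct nlattr *attr;

	attr = nla_reserve(skb, IFLA_STATS64, sizeof(*stats));
	if (!attr)
		return -EMSGSIZE;
	/* nla_data() only guarantees 32-bit alignment, so copy the counters
	 * instead of storing 64-bit values through an unaligned pointer. */
	memcpy(nla_data(attr), stats, sizeof(*stats));
	return 0;
}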
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
RPS currently depends on SMP and SYSFS
Adding a CONFIG_RPS makes sense in case this requirement changes in the
future. This patch saves about 1500 bytes of kernel text in case SMP is
on but SYSFS is off.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Need to take spinlocks when dequeuing from input_pkt_queue in flush_backlog.
Also, flush_backlog can now be called directly from netdev_run_todo.
Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
v2: update according to Frans' comments.
Currently, if we leave spaces before dst port,
netconsole will silently accept it as 0. Warn about this.
Also, when spaces appear in other places, make them
visible in error messages.
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: David Miller <davem@davemloft.net>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Here is a patch to control packet node allocation and, implicitly,
how packets are DMA'd etc.
The flag NODE_ALLOC enables the feature and uses numa_node_id();
when enabled it can also be explicitly controlled via a new
node parameter
Tested this with 10 Intel 82599 ports with TYAN S7025 E5520 CPUs.
Was able to TX/DMA ~80 Gbit/s to Ethernet wires.
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use RCU to avoid RTNL use in dev_getfirstbyhwtype()
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently force a synchronize_net() in netdev_set_master().
This seems necessary only when a slave had a master and we dismantle it.
In the other case ("ifenslave bond0 eth0"), we don't need this long
delay.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After the type change, addresses in the unicast and multicast lists wouldn't
make sense, not to mention possibly different lengths. So flush both lists here.
Note "dev_addr_discard" will very soon be replaced by "dev_mc_flush" (once
the mc_list conversion is done).
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ignore the new NETDEV_PRE_TYPE_CHANGE event in rtnetlink_event() since
there have been no changes userspace needs to be notified of.
Also add a comment to the netdev notifier event definitions to remind
people to update the exclusion list when adding new event types.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
When doing "ifenslave -d bond0 eth0", there is chance to get NULL
dereference in netif_receive_skb(), because dev->master suddenly becomes
NULL after we tested it.
We should use ACCESS_ONCE() to avoid this (or rcu_dereference())
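A sketch of the fix pattern (illustrative; the point is that dev->master is
read exactly once):

static bool use_bond_master(struct sk_buff *skb)
{
	/* Snapshot dev->master once; "ifenslave -d" can set it to NULL at any
	 * time, so never re-read it after the NULL check. */
	struct net_device *master = ACCESS_ONCE(skb->dev->master);

	if (!master)
		return false;
	skb->dev = master;	/* illustrative use of the snapshot */
	return true;
}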
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the possibility for other subsystems (such as, for example,
bridge, vlan, etc.) to refuse the bonding type change.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
`ip -s link` shows interface counters truncated to 32 bits. This is
because interface statistics are transported to userspace only as 32-bit
quantities. This commit adds a new IFLA_STATS64 attribute that
exports them in full 64 bit.
References: http://lkml.indiana.edu/hypermail/linux/kernel/0307.3/0215.html
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
We hold RTNL at this point and don't use RCU variants of list traversals,
so we don't need rcu_read_lock()/rcu_read_unlock()
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements software receive side packet steering (RPS). RPS
distributes the load of received packet processing across multiple CPUs.
Problem statement: Protocol processing done in the NAPI context for received
packets is serialized per device queue and becomes a bottleneck under high
packet load. This substantially limits pps that can be achieved on a single
queue NIC and provides no scaling with multiple cores.
This solution queues packets early on in the receive path on the backlog queues
of other CPUs. This allows protocol processing (e.g. IP and TCP) to be
performed on packets in parallel. For each device (or each receive queue in
a multi-queue device) a mask of CPUs is set to indicate the CPUs that can
process packets. A CPU is selected on a per packet basis by hashing contents
of the packet header (e.g. the TCP or UDP 4-tuple) and using the result to index
into the CPU mask. The IPI mechanism is used to raise networking receive
softirqs between CPUs. This effectively emulates in software what a multi-queue
NIC can provide, but is generic requiring no device support.
Many devices now provide a hash over the 4-tuple on a per packet basis
(e.g. the Toeplitz hash). This patch allows drivers to set the HW reported hash
in an skb field, and that value in turn is used to index into the RPS maps.
Using the HW generated hash can avoid cache misses on the packet when
steering it to a remote CPU.
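A hypothetical sketch of the per-queue selection described above (the map
layout and helper name are illustrative):

struct rps_map_sketch {
	unsigned int len;	/* number of enabled CPUs for this queue */
	u16 cpus[];		/* CPU numbers from the rps_cpus bitmap */
};

static int rps_map_pick_cpu(const struct rps_map_sketch *map, u32 rxhash)
{
	if (!map || !map->len)
		return -1;	/* RPS disabled for this queue */
	/* Scale the 32-bit hash into the map rather than using a modulo. */
	return map->cpus[((u64)rxhash * map->len) >> 32];
}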
The CPU mask is set on a per device and per queue basis in the sysfs variable
/sys/class/net/<device>/queues/rx-<n>/rps_cpus. This is a set of canonical
bit maps for receive queues in the device (numbered by <n>). If a device
does not support multi-queue, a single variable is used for the device (rx-0).
Generally, we have found this technique increases pps capabilities of a single
queue device with good CPU utilization. Optimal settings for the CPU mask
seem to depend on architectures and cache hierarchy. Below are some results
running 500 instances of netperf TCP_RR test with 1 byte req. and resp.
Results show cumulative transaction rate and system CPU utilization.
e1000e on 8 core Intel
Without RPS: 108K tps at 33% CPU
With RPS: 311K tps at 64% CPU
forcedeth on 16 core AMD
Without RPS: 156K tps at 15% CPU
With RPS: 404K tps at 49% CPU
bnx2x on 16 core AMD
Without RPS 567K tps at 61% CPU (4 HW RX queues)
Without RPS 738K tps at 96% CPU (8 HW RX queues)
With RPS: 854K tps at 76% CPU (4 HW RX queues)
Caveats:
- The benefits of this patch are dependent on architecture and cache hierarchy.
Tuning the masks to get best performance is probably necessary.
- This patch adds overhead in the path for processing a single packet. In
a lightly loaded server this overhead may eliminate the advantages of
increased parallelism, and possibly cause some relative performance degradation.
We have found that masks that are cache aware (share same caches with
the interrupting CPU) mitigate much of this.
- The RPS masks can be changed dynamically, however whenever the mask is changed
this introduces the possibility of generating out of order packets. It's
probably best not to change the masks too frequently.
Signed-off-by: Tom Herbert <therbert@google.com>
include/linux/netdevice.h | 32 ++++-
include/linux/skbuff.h | 3 +
net/core/dev.c | 335 +++++++++++++++++++++++++++++++++++++--------
net/core/net-sysfs.c | 225 ++++++++++++++++++++++++++++++-
net/core/skbuff.c | 2 +
5 files changed, 538 insertions(+), 59 deletions(-)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stanse found that one error path in netpoll_setup dereferences npinfo
even though it is NULL. Avoid that by adding a new label and jumping to it
instead.
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Daniel Borkmann <danborkmann@googlemail.com>
Cc: David S. Miller <davem@davemloft.net>
Acked-by: chavey@google.com
Acked-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 6e17d45a (net: add addr len check to dev_mc_add)
added a bug in dev_mc_add(), since it can now exit with a lock
imbalance.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Annotates neigh_invalidate() with __releases() and __acquires() for
sparse sake.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use self documenting noinline_for_stack instead of duplicated comments.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We test that "prot->rsk_prot" is non-null right before we dereference it
on this line.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On 03/04/2010 09:26 AM, Ben Hutchings wrote:
> On Thu, 2010-03-04 at 00:51 -0800, Jeff Kirsher wrote:
>> From: Jeff Garzik<jgarzik@redhat.com>
>>
>> This patch is an alternative approach for accessing string
>> counts, vs. the drvinfo indirect approach. This way the drvinfo
>> space doesn't run out, and we don't break ABI later.
> [...]
>> --- a/net/core/ethtool.c
>> +++ b/net/core/ethtool.c
>> @@ -214,6 +214,10 @@ static noinline int ethtool_get_drvinfo(struct net_device *dev, void __user *use
>> info.cmd = ETHTOOL_GDRVINFO;
>> ops->get_drvinfo(dev, &info);
>>
>> + /*
>> + * this method of obtaining string set info is deprecated;
>> + * consider using ETHTOOL_GSSET_INFO instead
>> + */
>
> This comment belongs on the interface (ethtool.h) not the
> implementation.
Debatable -- the current comment is located at the callsite of
ops->get_sset_count(), which is where an implementor might think to add
a new call. Not all the numeric fields in ethtool_drvinfo are obtained
from ->get_sset_count().
Hence the "some" in the attached patch to include/linux/ethtool.h,
addressing your comment.
> [...]
>> +static noinline int ethtool_get_sset_info(struct net_device *dev,
>> + void __user *useraddr)
>> +{
> [...]
>> + /* calculate size of return buffer */
>> + for (i = 0; i < 64; i++)
>> + if (sset_mask & (1ULL << i))
>> + n_bits++;
> [...]
>
> We have a function for this:
>
> n_bits = hweight64(sset_mask);
Agreed.
I've attached a follow-up patch, which should enable my/Jeff's kernel
patch to be applied, followed by this one.
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is an alternative approach for accessing string
counts, vs. the drvinfo indirect approach. This way the drvinfo
space doesn't run out, and we don't break ABI later.
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_add_backlog -> __sk_add_backlog
sk_add_backlog_limited -> sk_add_backlog
Signed-off-by: Zhu Yi <yi.zhu@intel.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We got system OOM while running some UDP netperf testing on the loopback
device. The case is that multiple senders sent stream UDP packets to a single
receiver via loopback on the local host. Of course, the receiver is not able
to handle all the packets in time. But we surprisingly found that these
packets were not discarded due to the receiver's sk->sk_rcvbuf limit.
Instead, they kept queuing to sk->sk_backlog and finally ate up all
the memory. We believe this is a security hole that a non-privileged user
can use to crash the system.
The root cause for this problem is, when the receiver is doing
__release_sock() (i.e. after userspace recv, kernel udp_recvmsg ->
skb_free_datagram_locked -> release_sock), it moves skbs from backlog to
sk_receive_queue with the softirq enabled. In the above case, multiple
busy senders will almost make it an endless loop. The skbs in the
backlog end up eating all the system memory.
The issue is not only for UDP. Any protocol using the socket backlog is
potentially affected. The patch adds a limit for the socket backlog so that
the backlog size cannot be expanded endlessly.
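A sketch of such a limit (names loosely follow the kernel; treat as
illustrative rather than the exact patch):

static inline int sk_add_backlog_limited_sketch(struct sock *sk,
						struct sk_buff *skb)
{
	/* Refuse to queue once the backlog would exceed the receive buffer
	 * budget, instead of letting busy senders grow it without bound. */
	if (sk->sk_backlog.len + skb->truesize > (unsigned int)sk->sk_rcvbuf)
		return -ENOBUFS;

	__sk_add_backlog(sk, skb);
	sk->sk_backlog.len += skb->truesize;
	return 0;
}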
Reported-by: Alex Shi <alex.shi@intel.com>
Cc: David Miller <davem@davemloft.net>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: "Pekka Savola (ipv6)" <pekkas@netcore.fi>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Vlad Yasevich <vladislav.yasevich@hp.com>
Cc: Sridhar Samudrala <sri@us.ibm.com>
Cc: Jon Maloy <jon.maloy@ericsson.com>
Cc: Allan Stephens <allan.stephens@windriver.com>
Cc: Andrew Hendry <andrew.hendry@gmail.com>
Signed-off-by: Zhu Yi <yi.zhu@intel.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We use scm_send and scm_recv on both unix domain and
netlink sockets, but only unix domain sockets support
everything required for file descriptor passing,
so return an error if someone attempts to pass file descriptors
over netlink sockets.
Cc: stable@kernel.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
NETIF_F_NTUPLE flag setting introduced a bug: non-ntuple flags
like LRO may be successfully set, before ioctl(2) returns failure
to userspace.
The set-flags operation should be all-or-none, rather than leaving
things in an inconsistent state prior to reporting failure to
userspace.
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit e8469ed959c373c2ff9e6f488aa5a14971aebe1f
Author: Patrick McHardy <kaber@trash.net>
Date: Tue Feb 23 20:41:30 2010 +0100
Support specifying the initial device flags when creating a device through
rtnl_link. Devices allocated by rtnl_create_link() are marked as INITIALIZING
in order to suppress netlink registration notifications. To complete setup,
rtnl_configure_link() must be called, which performs the device flag changes
and invokes the deferred notifiers if everything went well.
Two examples:
# add macvlan to eth0
#
$ ip link add link eth0 up allmulticast on type macvlan
[LINK]11: macvlan0@eth0: <BROADCAST,MULTICAST,ALLMULTI,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
link/ether 26:f8:84:02:f9:2a brd ff:ff:ff:ff:ff:ff
[ROUTE]ff00::/8 dev macvlan0 table local metric 256 mtu 1500 advmss 1440 hoplimit 0
[ROUTE]fe80::/64 dev macvlan0 proto kernel metric 256 mtu 1500 advmss 1440 hoplimit 0
[LINK]11: macvlan0@eth0: <BROADCAST,MULTICAST,ALLMULTI,UP,LOWER_UP> mtu 1500
link/ether 26:f8:84:02:f9:2a
[ADDR]11: macvlan0 inet6 fe80::24f8:84ff:fe02:f92a/64 scope link
valid_lft forever preferred_lft forever
[ROUTE]local fe80::24f8:84ff:fe02:f92a via :: dev lo table local proto none metric 0 mtu 16436 advmss 16376 hoplimit 0
[ROUTE]default via fe80::215:e9ff:fef0:10f8 dev macvlan0 proto kernel metric 1024 mtu 1500 advmss 1440 hoplimit 0
[NEIGH]fe80::215:e9ff:fef0:10f8 dev macvlan0 lladdr 00:15:e9:f0:10:f8 router STALE
[ROUTE]2001:6f8:974::/64 dev macvlan0 proto kernel metric 256 expires 0sec mtu 1500 advmss 1440 hoplimit 0
[PREFIX]prefix 2001:6f8:974::/64 dev macvlan0 onlink autoconf valid 14400 preferred 131084
[ADDR]11: macvlan0 inet6 2001:6f8:974:0:24f8:84ff:fe02:f92a/64 scope global dynamic
valid_lft 86399sec preferred_lft 14399sec
# add VLAN to eth1, eth1 is down
#
$ ip link add link eth1 up type vlan id 1000
RTNETLINK answers: Network is down
<no events>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Split dev_change_flags() into two functions: __dev_change_flags() to
perform the actual changes and __dev_notify_flags() to invoke netdevice
notifiers. This will be used by rtnl_link to defer netlink notifications
until the device has been fully configured.
This changes ordering of some operations, in particular:
- netlink notifications are sent after all changes have been performed.
As a side effect this suppresses one unnecessary netlink message when
the IFF_UP and other flags are changed simultaneously.
- The NETDEV_UP/NETDEV_DOWN and NETDEV_CHANGE notifiers are invoked
after all changes have been performed. Their relative order is unchanged.
- net_dmaengine_put() is invoked before the NETDEV_DOWN notifier instead
of afterwards. This should not make any difference since both RX and TX
are already shut down at this point.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to support specifying device flags during device creation,
we must be able to roll back device registration in case setting the
flags fails without sending any notifications related to the device
to userspace.
This patch changes rollback_registered_many() and register_netdevice()
to manually send netlink notifications for devices not handled by
rtnl_link and allows to defer notifications for devices handled by
rtnl_link until setup is complete.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 3b8bcfd (net: introduce pre-up netdev notifier) added a new
notifier which is run before a device is set UP for use by cfg80211.
The patch failed to add the new notifier to the ignore list in
rtnetlink_event(), so we currently get an unnecessary netlink
notification before a device is set UP.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit c79c5ffdce.
As Jeff points out we can't break the user visible interface
like this, we need to add this into the reserved[] thing.
Signed-off-by: David S. Miller <davem@davemloft.net>
The drvinfo struct should include the number of strings that
get_rx_ntuple will return. It will be variable if an underlying
driver implements its own get_rx_ntuple routine, so userspace
needs to know how much data is coming.
Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use list_first_entry macro; no longer any need to use
'next' directly in list to find first entry.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch consists of a few minor cleanups to the SR-IOV
configuration code in rtnetlink.
- Remove unnecessary lock
- Remove unnecessary casts
- Return correct error code for no driver support
These changes are based on comments from Patrick McHardy
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Update rcu_dereference() primitives to use new lockdep-based
checking. The rcu_dereference() in __in6_dev_get() may be
protected either by rcu_read_lock() or RTNL, per Eric Dumazet.
The rcu_dereference() in __sk_free() is protected by the fact
that it is never reached if an update could change it. Check
for this by using rcu_dereference_check() to verify that the
struct sock's ->sk_wmem_alloc counter is zero.
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1266887105-1528-5-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Traffic (tcp) does not start on a vlan interface when gro is enabled.
Even the tcp handshake was not taking place.
This is because the eth_type_trans call before the netif_receive_skb
in napi_gro_finish() resets skb->dev to napi->dev from the previously
set vlan netdev interface. This causes ip_route_input to drop the
incoming packet, considering it a packet coming from a martian source.
I could repro this on 2.6.32.7 (stable) and 2.6.33-rc7.
With this fix, the traffic starts and the test runs fine on both vlan
and non-vlan interfaces.
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: Patrick McHardy <kaber@trash.net>
Signed-off-by: Ajit Khaparde <ajitk@serverengines.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
pass mark to all SA lookups to prepare them for when we add code
to have them search.
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
The wireless sysfs methods like the rest of the networking sysfs
methods are removed with the rtnl_lock held and block until
the existing methods stop executing. So use rtnl_trylock
and restart_syscall so that the code continues to work.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Export sk_attach_filter/sk_detach_filter routines,
so that tun module can use them.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Traffic (tcp) does not start on a vlan interface when gro is enabled.
Even the tcp handshake was not taking place.
This is because the eth_type_trans call before the netif_receive_skb
in napi_gro_finish() resets skb->dev to napi->dev from the previously
set vlan netdev interface. This causes ip_route_input to drop the
incoming packet, considering it a packet coming from a martian source.
I could repro this on 2.6.32.7 (stable) and 2.6.33-rc7.
With this fix, the traffic starts and the test runs fine on both vlan
and non-vlan interfaces.
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: Patrick McHardy <kaber@trash.net>
Signed-off-by: Ajit Khaparde <ajitk@serverengines.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The n-tuple list should be flushed if and only if the ETH_RESET_FILTER
flag is set and the driver is able to reset filtering/flow direction
hardware without also resetting a component whose flag is not set.
This test is best left to the driver.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
call_rcu() will unconditionally reinitialize RCU head anyway.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stop computing the number of neighbour table settings we have by
counting the number of binary sysctls. This behaviour was silly
and meant that we could not add another neighbour table setting
without also adding another binary sysctl.
Don't pass the binary sysctl path for neighbour table entries
into neigh_sysctl_register. These parameters are no longer
used and so are just dead code.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dev_ethtool() is currently using 604 bytes of stack, even with gcc-4.4.2
objdump -d vmlinux | scripts/checkstack.pl
...
0xc04bbc33 dev_ethtool [vmlinux]: 604
...
Adding noinline attributes to selected functions can reduce stack usage.
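For example (illustrative only), a helper with a large local can be kept out
of dev_ethtool()'s frame:

static noinline int ethtool_helper_sketch(struct net_device *dev,
					  void __user *useraddr)
{
	struct ethtool_drvinfo info = { .cmd = ETHTOOL_GDRVINFO };

	/* 'info' is charged to this helper's frame, not to dev_ethtool()'s */
	dev->ethtool_ops->get_drvinfo(dev, &info);
	return copy_to_user(useraddr, &info, sizeof(info)) ? -EFAULT : 0;
}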
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
set_flags should check if the underlying device supports
n-tuple filter programming before setting the device flags
on the netdevice.
Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can allow a filter to be added successfully to the underlying
hardware, but still return an error if the cached list memory
allocation fails. This patch fixes that condition.
Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add code to allow rtnetlink clients to query and set VF information through
the PF driver.
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch replaces dev->mc_count in all drivers (hopefully I didn't miss
anything). Used spatch and did small tweaks and coding style changes when
it was suitable.
Jirka
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Initialize the .cmd member of various ethtool structures using a
designated struct initializer rather than an assignment. This makes things
a teeny bit more robust, although
the chance of a struct layout changing is extremely remote, and also
makes the code a little easier to read.
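For instance (illustrative):

/* assignment after declaration */
struct ethtool_value edata;
edata.cmd = ETHTOOL_GRXCSUM;

/* designated initializer: names the field and zero-fills the rest */
struct ethtool_value edata2 = { .cmd = ETHTOOL_GRXCSUM };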
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patchset enables the ethtool layer to program n-tuple
filters to an underlying device. The idea is to allow capable
hardware to have static rules applied that can assist steering
flows into appropriate queues.
Hardware that is known to support these types of filters today
are ixgbe and niu.
Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kernel bugzilla #15239
On some workloads, it is quite possible to get a huge dst list to
process in dst_gc_task(), and trigger soft lockup detection.
Fix is to call cond_resched(), as we run in process context.
Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add missing try_to_freeze() to one of the pktgen_thread_worker() code
paths so that it doesn't block suspend/hibernation.
Fixes http://bugzilla.kernel.org/show_bug.cgi?id=15006
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Reported-and-tested-by: Ciprian Dorin Craciun <ciprian.craciun@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the vlan and macvlan drivers, the start_xmit function forwards
data to the dev_queue_xmit function for another device, which may
potentially belong to a different namespace.
To make sure that classification stays within a single namespace,
this resets the potentially critical fields.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces three macros to work with uc list from net drivers.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simply pass 'struct neigh_table' via the seq_file private pointer,
and save one dereference. The proc entry itself isn't interesting.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Avoid checking twice whether skb needs to be linearized, if one
skb_linearize was already done.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__net_init/__net_exit are apparently not going away, so use them
to their full extent.
In some cases __net_init was removed, because it was called from
__net_exit code.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In sock_getsockopt the symbol 'lv' is declared as an
unsigned int type, probably due to sizeof returning a
size_t which is really an unsigned int.
This produces a sparse warning for SO_PEERNAME due to
the sock->ops->getname() call:
warning: incorrect type in argument 3 (different signedness)
expected int *sockaddr_len
got unsigned int *<noident>
Quiet the warning by changing the type of 'lv' to an int.
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (56 commits)
sky2: Fix oops in sky2_xmit_frame() after TX timeout
Documentation/3c509: document ethtool support
af_packet: Don't use skb after dev_queue_xmit()
vxge: use pci_dma_mapping_error to test return value
netfilter: ebtables: enforce CAP_NET_ADMIN
e1000e: fix and commonize code for setting the receive address registers
e1000e: e1000e_enable_tx_pkt_filtering() returns wrong value
e1000e: perform 10/100 adaptive IFS only on parts that support it
e1000e: don't accumulate PHY statistics on PHY read failure
e1000e: call pci_save_state() after pci_restore_state()
netxen: update version to 4.0.72
netxen: fix set mac addr
netxen: fix smatch warning
netxen: fix tx ring memory leak
tcp: update the netstamp_needed counter when cloning sockets
TI DaVinci EMAC: Handle emac module clock correctly.
dmfe/tulip: Let dmfe handle DM910x except for SPARC on-board chips
ixgbe: Fix compiler warning about variable being used uninitialized
netfilter: nf_ct_ftp: fix out of bounds read in update_nl_seq()
mv643xx_eth: don't include cache padding in rx desc buffer size
...
Fix trivial conflict in drivers/scsi/cxgb3i/cxgb3i_offload.c
The contents of /proc/net/dev are annoying to parse, because
whether there is a space after the "ethX:" or not changes.
It depends upon the size of the "Receive bytes" counter:
if the number is below 7 digits, then there is whitespace,
whereas if the number is 8 digits or above there is no space
between the ":" and the number.
This patch changes the output to ensure there is always a space
between the ":" and the number. Given that all existing userspace
applications already need to handle the whitespace, I see
no breakage of existing tools.
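The effect can be illustrated with plain printf() semantics (values are made
up):

#include <stdio.h>

int main(void)
{
	printf("[%6s:%8lu]\n", "eth0", 1234UL);       /* [  eth0:    1234]   */
	printf("[%6s:%8lu]\n", "eth0", 123456789UL);  /* [  eth0:123456789]  */
	printf("[%6s: %7lu]\n", "eth0", 123456789UL); /* [  eth0: 123456789] */
	return 0;
}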
Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On Wed, Jan 06, 2010 at 10:10:03PM +0100, Eric Dumazet wrote:
> Le 06/01/2010 19:38, Eric Dumazet a écrit :
> >
> > (net-next-2.6 doesnt work well on my bond/vlan setup, I suspect I need a bisection)
>
> David, I had to revert 1f3c8804ac
> (bonding: allow arp_ip_targets on separate vlans to use arp validation)
>
> Or else, my vlan devices dont work (unfortunatly I dont have much time
> these days to debug the thing)
>
> My config :
>
> +---------+
> vlan.103 -----+ bond0 +--- eth1 (bnx2)
> | +
> vlan.825 -----+ +--- eth2 (tg3)
> +---------+
>
> $ cat /proc/net/bonding/bond0
> Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)
>
> Bonding Mode: fault-tolerance (active-backup)
> Primary Slave: None
> Currently Active Slave: eth2
> MII Status: up
> MII Polling Interval (ms): 100
> Up Delay (ms): 0
> Down Delay (ms): 0
>
> Slave Interface: eth1 (bnx2)
> MII Status: down
> Link Failure Count: 1
> Permanent HW addr: 00:1e:0b:ec:d3:d2
>
> Slave Interface: eth2 (tg3)
> MII Status: up
> Link Failure Count: 0
> Permanent HW addr: 00:1e:0b:92:78:50
>
This patch fixes up a problem found with commit
1f3c8804ac. The original change
overloaded null_or_orig, but doing that prevented any packet handlers
that were not tied to a specific device (i.e. ptype->dev == NULL) from
ever receiving any frames.
The null_or_orig variable cannot be overloaded, and must be kept as NULL
to prevent the frame from being ignored by packet handlers designed to
accept frames on any interface.
Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This allows a bond device to specify an arp_ip_target as a host that is
not on the same vlan as the base bond device and still use arp
validation. A configuration like this, now works:
BONDING_OPTS="mode=active-backup arp_interval=1000 arp_ip_target=10.0.100.1 arp_validate=3"
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 qlen 1000
link/ether 00:13:21:be:33:e9 brd ff:ff:ff:ff:ff:ff
3: eth0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 qlen 1000
link/ether 00:13:21:be:33:e9 brd ff:ff:ff:ff:ff:ff
8: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue
link/ether 00:13:21:be:33:e9 brd ff:ff:ff:ff:ff:ff
inet6 fe80::213:21ff:febe:33e9/64 scope link
valid_lft forever preferred_lft forever
9: bond0.100@bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue
link/ether 00:13:21:be:33:e9 brd ff:ff:ff:ff:ff:ff
inet 10.0.100.2/24 brd 10.0.100.255 scope global bond0.100
inet6 fe80::213:21ff:febe:33e9/64 scope link
valid_lft forever preferred_lft forever
Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)
Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth1
MII Status: up
MII Polling Interval (ms): 0
Up Delay (ms): 0
Down Delay (ms): 0
ARP Polling Interval (ms): 1000
ARP IP target/s (n.n.n.n form): 10.0.100.1
Slave Interface: eth1
MII Status: up
Link Failure Count: 1
Permanent HW addr: 00:40:05:30:ff:30
Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:13:21:be:33:e9
Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (74 commits)
Revert "b43: Enforce DMA descriptor memory constraints"
iwmc3200wifi: fix array out-of-boundary access
wl1251: timeout one too soon in wl1251_boot_run_firmware()
mac80211: fix propagation of failed hardware reconfigurations
mac80211: fix race with suspend and dynamic_ps_disable_work
ath9k: fix missed error codes in the tx status check
ath9k: wake hardware during AMPDU TX actions
ath9k: wake hardware for interface IBSS/AP/Mesh removal
ath9k: fix suspend by waking device prior to stop
cfg80211: fix error path in cfg80211_wext_siwscan
wl1271_cmd.c: cleanup char => u8
iwlwifi: Storage class should be before const qualifier
ath9k: Storage class should be before const qualifier
cfg80211: fix race between deauth and assoc response
wireless: remove remaining qual code
rt2x00: Add USB ID for Linksys WUSB 600N rev 2.
ath5k: fix SWI calibration interrupt storm
mac80211: fix ibss join with fixed-bssid
libertas: Remove carrier signaling from the scan code
orinoco: fix GFP_KERNEL in orinoco_set_key with interrupts disabled
...
This updates pktgen so that it does not decrement skb->users
when it receives valid NET_XMIT_xxx values. These are now
valid return values from ndo_start_xmit in net-next-2.6.
They also indicate the skb has been consumed.
This fixes pktgen to work correctly with vlan devices.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Non-GSO code drops dst entry for performance reasons, but
the same is missing for GSO code. Drop dst while cache-hot
for GSO case too.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (26 commits)
net: sh_eth alignment fix for sh7724 using NET_IP_ALIGN V2
ixgbe: allow tx of pre-formatted vlan tagged packets
ixgbe: Fix 82598 premature copper PHY link indicatation
ixgbe: Fix tx_restart_queue/non_eop_desc statistics counters
bcm63xx_enet: fix compilation failure after get_stats_count removal
packet: dont call sleeping functions while holding rcu_read_lock()
tcp: Revert per-route SACK/DSACK/TIMESTAMP changes.
ipvs: zero usvc and udest
netfilter: fix crashes in bridge netfilter caused by fragment jumps
ipv6: reassembly: use seperate reassembly queues for conntrack and local delivery
sky2: leave PCI config space writeable
sky2: print Optima chip name
x25: Update maintainer.
ipvs: fix synchronization on connection close
netfilter: xtables: document minimal required version
drivers/net/bonding/: : use pr_fmt
can: CAN_MCP251X should depend on HAS_DMA
drivers/net/usb: Correct code taking the size of a pointer
drivers/net/cpmac.c: Correct code taking the size of a pointer
drivers/net/sfc: Correct code taking the size of a pointer
...
I received some bug reports about userspace programs having problems
because after RTM_NEWLINK was received they could not immediately access
files under /proc/sys/net/ because they had not been registered yet.
The original problem was trivially fixed by moving the userspace
notification from rtnetlink_event() to the end of
register_netdevice().
When testing that change I discovered I was still getting RTM_NEWLINK
events before I could access proc and I was also getting RTM_NEWLINK
events after I was seeing RTM_DELLINK. Things practically guaranteed
to confuse userspace.
After a little more investigation these extra notifications proved to
be from the new notifiers NETDEV_POST_INIT and NETDEV_UNREGISTER_BATCH
hitting the default case in rtnetlink_event, and triggering
unnecessary RTM_NEWLINK messages.
rtnetlink_event now explicitly handles NETDEV_UNREGISTER_BATCH and
NETDEV_POST_INIT to avoid sending the incorrect userspace
notifications.
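A sketch of the resulting event filtering (the two case labels follow the
text; everything else is illustrative):

static int rtnetlink_event_sketch(struct notifier_block *this,
				  unsigned long event, void *ptr)
{
	struct net_device *dev = ptr;

	switch (event) {
	case NETDEV_UNREGISTER_BATCH:
	case NETDEV_POST_INIT:
		/* nothing userspace-visible has changed yet: stay silent */
		break;
	default:
		rtmsg_ifinfo(RTM_NEWLINK, dev, 0);
		break;
	}
	return NOTIFY_DONE;
}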
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix two problems:
1. If unregister_netdevice_many() is called with both registered
and unregistered devices, rollback_registered_many() bails out
when it reaches the first unregistered device. The processing
of the prior registered devices is unfinished, and the
remaining devices are skipped, and possible registered netdev's
are leaked/unregistered.
2. System hangs or panics depending on how the devices are passed,
since when netdev_run_todo() runs, some devices were not fully
processed.
Tested by passing intermingled unregistered and registered vlan
devices to unregister_netdevice_many() as follows:
1. dev, fake_dev1, fake_dev2: hangs in run_todo
("unregister_netdevice: waiting for eth1.100 to become
free. Usage count = 1")
2. fake_dev1, dev, fake_dev2: failure during de-registration
and next registration, followed by a vlan driver Oops
during subsequent registration.
Confirmed that the patch fixes both cases.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1815 commits)
mac80211: fix reorder buffer release
iwmc3200wifi: Enable wimax core through module parameter
iwmc3200wifi: Add wifi-wimax coexistence mode as a module parameter
iwmc3200wifi: Coex table command does not expect a response
iwmc3200wifi: Update wiwi priority table
iwlwifi: driver version track kernel version
iwlwifi: indicate uCode type when fail dump error/event log
iwl3945: remove duplicated event logging code
b43: fix two warnings
ipw2100: fix rebooting hang with driver loaded
cfg80211: indent regulatory messages with spaces
iwmc3200wifi: fix NULL pointer dereference in pmkid update
mac80211: Fix TX status reporting for injected data frames
ath9k: enable 2GHz band only if the device supports it
airo: Fix integer overflow warning
rt2x00: Fix padding bug on L2PAD devices.
WE: Fix set events not propagated
b43legacy: avoid PPC fault during resume
b43: avoid PPC fault during resume
tcp: fix a timewait refcnt race
...
Fix up conflicts due to sysctl cleanups (dead sysctl_check code and
CTL_UNNUMBERED removed) in
kernel/sysctl_check.c
net/ipv4/sysctl_net_ipv4.c
net/ipv6/addrconf.c
net/sctp/sysctl.c
* git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/sysctl-2.6: (43 commits)
security/tomoyo: Remove now unnecessary handling of security_sysctl.
security/tomoyo: Add a special case to handle accesses through the internal proc mount.
sysctl: Drop & in front of every proc_handler.
sysctl: Remove CTL_NONE and CTL_UNNUMBERED
sysctl: kill dead ctl_handler definitions.
sysctl: Remove the last of the generic binary sysctl support
sysctl net: Remove unused binary sysctl code
sysctl security/tomoyo: Don't look at ctl_name
sysctl arm: Remove binary sysctl support
sysctl x86: Remove dead binary sysctl support
sysctl sh: Remove dead binary sysctl support
sysctl powerpc: Remove dead binary sysctl support
sysctl ia64: Remove dead binary sysctl support
sysctl s390: Remove dead sysctl binary support
sysctl frv: Remove dead binary sysctl support
sysctl mips/lasat: Remove dead binary sysctl support
sysctl drivers: Remove dead binary sysctl support
sysctl crypto: Remove dead binary sysctl support
sysctl security/keys: Remove dead binary sysctl support
sysctl kernel: Remove binary sysctl logic
...
* 'core-printk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
ratelimit: Make suppressed output messages more useful
printk: Remove ratelimit.h from kernel.h
ratelimit: Fix/allow use in atomic contexts
ratelimit: Use per ratelimit context locking
Provide a common routine for the transition of operational state for a leaf
device during a root device transition.
Signed-off-by: Patrick Mullaney <pmullaney@novell.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor the code so fib_rules_register always takes a template instead
of the actual fib_rules_ops structure that will be used. This is
required for network namespace support so 2 out of the 3 callers already
do this, it allows the error handling to be made common, and it allows
fib_rules_unregister to free the template for the caller.
Modify fib_rules_unregister to use call_rcu instead of synchronize_rcu
to allow multiple namespaces to be cleaned up in the same rcu grace
period.
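A sketch of the template copy (error handling trimmed; treat as illustrative):

struct fib_rules_ops *fib_rules_register(const struct fib_rules_ops *tmpl,
					 struct net *net)
{
	struct fib_rules_ops *ops;

	/* callers now pass a template; register a private copy per netns */
	ops = kmemdup(tmpl, sizeof(*ops), GFP_KERNEL);
	if (!ops)
		return ERR_PTR(-ENOMEM);
	ops->fro_net = net;	/* field name illustrative */
	return ops;
}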
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This allows namespace exit methods to batch work that requires an
rcu barrier using call_rcu, without having to treat the
unregister_pernet_operations cases specially.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move network device exit batching from a special case in
net_namespace.c to using common mechanisms in dev.c
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Add exit_list to struct net to support building lists of network
namespaces to clean up.
- Add exit_batch to pernet_operations to allow running operations only
once during a network namespace exit, instead of once per network
namespace (see the sketch after this list).
- Factor out ops_exit_list and ops_exit_free so the logic for cleaning
up a network namespace does not need to be duplicated.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit d124356ce314fff22a047ea334379d5105b2d834
Author: Patrick McHardy <kaber@trash.net>
Date: Thu Dec 3 12:16:35 2009 +0100
net: fib_rules: allow to delete local rule
Allow to delete the local rule and recreate it with a higher priority. This
can be used to force packets with a local destination out on the wire instead
of routing them to loopback. Additionally this patch allows to recreate rules
with a priority of 0.
Combined with the previous patch to allow oif classification, a socket can
be bound to the desired interface and packets routed to the wire like this:
# move local rule to lower priority
ip rule add pref 1000 lookup local
ip rule del pref 0
# route packets of sockets bound to eth0 to the wire independent
# of the destination address
ip rule add pref 100 oif eth0 lookup 100
ip route add default dev eth0 table 100
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 68144d350f4f6c348659c825cde6a82b34c27a91
Author: Patrick McHardy <kaber@trash.net>
Date: Thu Dec 3 12:05:25 2009 +0100
net: fib_rules: add oif classification
Support routing table lookup based on the flow's oif. This is useful to
classify packets originating from sockets bound to interfaces differently.
The route cache already includes the oif and needs no changes.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 229e77eec406ad68662f18e49fda8b5d366768c5
Author: Patrick McHardy <kaber@trash.net>
Date: Thu Dec 3 12:05:23 2009 +0100
net: fib_rules: rename ifindex/ifname/FRA_IFNAME to iifindex/iifname/FRA_IIFNAME
The next patch will add oif classification, rename interface related members
and attributes to reflect that they're used for iif classification.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The two functions skb_dma_map/unmap are unsafe to use as they cause
problems when packets are cloned and sent to multiple devices while a HW
IOMMU is enabled. Due to this it is best to remove the code so it is not
used by any other network driver maintainers.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Defer dellink to net_cleanup() allowing for batching.
- Fix comment.
- Use for_each_netdev_safe again as dev_change_net_namespace touches
at most one network device (unlike veth dellink).
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To get the full benefit of batched network namespace cleanup, network
device deletion needs to be performed by the generic code. When
using register_pernet_gen_device and freeing the data in exit_net,
it is impossible to delay freeing the data until after exit_net has been called,
as the device uninit methods are no longer safe.
To correct this, and to simplify working with per network namespace data
I have moved allocation and deletion of per network namespace data into
the network namespace core. The core now frees the data only after
all of the network namespace exit routines have run.
Now it is only required to set the new fields .id and .size
in the pernet_operations structure if you want network namespace
data to be managed for you automatically.
This makes the current register_pernet_gen_device and
register_pernet_gen_subsys routines unnecessary. For the moment
I have left them as compatibility wrappers in net_namespace.h
They will be removed once all of the users have been updated.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is fairly common to kill several network namespaces at once. Either
because they are nested one inside the other or because they are cooperating
in multiple machine networking experiments. As the network stack control logic
does not parallelize easily, batch up multiple network namespaces exiting
together.
To get the full benefit of batching the virtual network devices to be
removed must be all removed in one batch. For that purpose I have added
a loop after the last network device operations have run that batches
up all remaining network devices and deletes them.
An extra benefit is that the reorganization slightly shrinks the size
of the per network namespace data structures, replacing a work_struct
with a list_head.
In a trivial test with 4K namespaces this change reduced the cost of
destroying 4K namespaces from 7+ minutes (at 12% cpu) to 44 seconds
(at 60% cpu). The bulk of that 44s was spent in inet_twsk_purge.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The motivation for an additional notifier in batched netdevice
notification is that the route cache flush (rt_do_flush) only needs to
be called once per batch, not once per namespace.
For further batching improvements I need a guarantee that the
netdevices are unregistered in order, allowing me to unregister all
of the network devices in a network namespace at the same time with
the guarantee that the loopback device is really and truly
unregistered last.
Additionally it appears that we moved the route cache flush after
the final synchronize_net, which seems wrong and there was no
explanation. So I have restored the original location of the final
synchronize_net.
Cc: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Not including net/atm/
Compile tested x86 allyesconfig only
Added a > 80 column line or two, which I ignored.
Existing checkpatch plaints willfully, cheerfully ignored.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
pktgen threads are bound to a given CPU, so we can allocate memory for
these threads in a NUMA aware way.
After a pktgen session on two threads, we can check that flow memory was
allocated on the right node, instead of an unrelated one.
# grep pktgen_thread_write /proc/vmallocinfo
0xffffc90007204000-0xffffc90007385000 1576960 pktgen_thread_write+0x3a4/0x6b0 [pktgen] pages=384 vmalloc N0=384
0xffffc90007386000-0xffffc90007507000 1576960 pktgen_thread_write+0x3a4/0x6b0 [pktgen] pages=384 vmalloc N1=384
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The veth driver contains code to forward an skb
from the start_xmit function of one network
device into the receive path of another device.
Moving that code into a common location lets us
reuse the code for direct forwarding of data
between macvlan ports, and possibly in other
drivers.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
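The shared helper itself is not quoted in this log; a minimal sketch of such a
forwarding routine (names and checks assumed here, not taken from the patch)
could be:

static int forward_skb_sketch(struct net_device *dev, struct sk_buff *skb)
{
	/* drop the sending socket's accounting before re-injecting */
	skb_orphan(skb);

	/* refuse frames the receiving device cannot accept */
	if (!(dev->flags & IFF_UP) ||
	    skb->len > (dev->mtu + dev->hard_header_len)) {
		kfree_skb(skb);
		return NET_RX_DROP;
	}

	/* retarget the skb and feed it to the normal receive path */
	skb->protocol = eth_type_trans(skb, dev);
	return netif_rx(skb);
}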
Generated with the following semantic patch
@@
struct net *n1;
struct net *n2;
@@
- n1 == n2
+ net_eq(n1, n2)
@@
struct net *n1;
struct net *n2;
@@
- n1 != n2
+ !net_eq(n1, n2)
applied over {include,net,drivers/net}.
Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When multi queue compatible names are used by pktgen (eg eth0@0),
we currently cannot unload a NIC driver if one of its devices
is currently in use.
Allow pktgen_find_dev() to find pktgen devices by their suffix (netdev name)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit e6fce5b916 (pktgen: multiqueue etc.) tried to relax
the pktgen restriction of one device per kernel thread, adding a '@'
tag to device names.
Problem is we don't perform the check on the full pktgen device name.
This allows adding the same 'device' to a pktgen thread many times:
pgset "add_device eth0@0"
one session later :
pgset "add_device eth0@0"
(This doesn't find the previous device)
This consumes ~1.5 MBytes of vmalloc memory per round and also triggers
this warning :
[ 673.186380] proc_dir_entry 'pktgen/eth0@0' already registered
[ 673.186383] Modules linked in: pktgen ixgbe ehci_hcd psmouse mdio mousedev evdev [last unloaded: pktgen]
[ 673.186406] Pid: 6219, comm: bash Tainted: G W 2.6.32-rc7-03302-g41cec6f-dirty #16
[ 673.186410] Call Trace:
[ 673.186417] [<ffffffff8104a29b>] warn_slowpath_common+0x7b/0xc0
[ 673.186422] [<ffffffff8104a341>] warn_slowpath_fmt+0x41/0x50
[ 673.186426] [<ffffffff8114e789>] proc_register+0x109/0x210
[ 673.186433] [<ffffffff8100bf2e>] ? apic_timer_interrupt+0xe/0x20
[ 673.186438] [<ffffffff8114e905>] proc_create_data+0x75/0xd0
[ 673.186444] [<ffffffffa006ad38>] pktgen_thread_write+0x568/0x640 [pktgen]
[ 673.186449] [<ffffffffa006a7d0>] ? pktgen_thread_write+0x0/0x640 [pktgen]
[ 673.186453] [<ffffffff81149144>] proc_reg_write+0x84/0xc0
[ 673.186458] [<ffffffff810f5a58>] vfs_write+0xb8/0x180
[ 673.186463] [<ffffffff810f5c11>] sys_write+0x51/0x90
[ 673.186468] [<ffffffff8100b51b>] system_call_fastpath+0x16/0x1b
[ 673.186470] ---[ end trace ccbb991b0a8d994d ]---
Solution to this problem is to use an odevname field (includes @ tag and suffix),
instead of using the netdevice name.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix the following htmldocs warning:
Warning(net/core/dev.c:5378): bad line:
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu wrote:
> On Tue, Nov 17, 2009 at 04:26:04AM -0800, David Miller wrote:
>> Really, the link watch stuff is just due for a redesign. I don't
>> think a simple hack is going to cut it this time, sorry Eric :-)
>
> I have no objections against any redesigns, but since the only
> caller of linkwatch_forget_dev runs in process context with the
> RTNL, it could also legally emit those events.
Thanks guys, here is an updated version then, before linkwatch surgery?
In this version, I force the event to be sent synchronously.
[PATCH net-next-2.6] linkwatch: linkwatch_forget_dev() to speedup device dismantle
time ip link del eth3.103 ; time ip link del eth3.104 ; time ip link del eth3.105
real 0m0.266s
user 0m0.000s
sys 0m0.001s
real 0m0.770s
user 0m0.000s
sys 0m0.000s
real 0m1.022s
user 0m0.000s
sys 0m0.000s
One problem of the current scheme in the vlan dismantle phase is the
hold on the device taken by the following chain:
vlan_dev_stop() ->
netif_carrier_off(dev) ->
linkwatch_fire_event(dev) ->
dev_hold() ...
And __linkwatch_run_queue() runs up to one second later...
A generic fix to this problem is to add a linkwatch_forget_dev() method
to unlink the device from the list of watched devices.
dev->link_watch_next becomes dev->link_watch_list (and uses a bit more memory),
to be able to unlink device in O(1).
After patch :
time ip link del eth3.103 ; time ip link del eth3.104 ; time ip link del eth3.105
real 0m0.024s
user 0m0.000s
sys 0m0.000s
real 0m0.032s
user 0m0.000s
sys 0m0.001s
real 0m0.033s
user 0m0.000s
sys 0m0.000s
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
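A minimal sketch of the O(1) unlink this enables (field and lock names assumed;
as described above, the real helper also emits the pending event synchronously):

static LIST_HEAD(lweventlist);
static DEFINE_SPINLOCK(lweventlist_lock);

static void linkwatch_forget_dev_sketch(struct net_device *dev)
{
	unsigned long flags;

	spin_lock_irqsave(&lweventlist_lock, flags);
	/* a list_head lets us unlink in O(1) instead of walking a
	 * singly linked ->link_watch_next chain */
	if (!list_empty(&dev->link_watch_list)) {
		list_del_init(&dev->link_watch_list);
		dev_put(dev);	/* drop the reference taken when queued */
	}
	spin_unlock_irqrestore(&lweventlist_lock, flags);
}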
This new event is called once for each unique net namespace in batched
unregister operations (with the argument set to a random device from
that namespace) and once per device in non-batched unregister
operations.
It allows us to factorize some device unregister work such as clearing the
routing cache.
Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some drivers' ndo_get_stats() methods need to perform txqueue stats folding.
Move the folding from dev_get_stats() to a new dev_txq_stats_fold() function.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
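A sketch of what such a fold helper can look like (per-queue counter names
assumed from that era, not necessarily the exact code):

void dev_txq_stats_fold_sketch(const struct net_device *dev,
			       struct net_device_stats *stats)
{
	unsigned long tx_bytes = 0, tx_packets = 0, tx_dropped = 0;
	unsigned int i;

	/* sum the per-queue counters into the device-wide stats */
	for (i = 0; i < dev->num_tx_queues; i++) {
		struct netdev_queue *txq = netdev_get_tx_queue(dev, i);

		tx_bytes   += txq->tx_bytes;
		tx_packets += txq->tx_packets;
		tx_dropped += txq->tx_dropped;
	}
	if (tx_bytes || tx_packets || tx_dropped) {
		stats->tx_bytes   = tx_bytes;
		stats->tx_packets = tx_packets;
		stats->tx_dropped = tx_dropped;
	}
}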
When we've merged skb's with page frags, and subsequently receive
a trailer skb (< MSS) that is not completely non-linear (this can
occur on Intel NICs if the packet size falls below the threshold),
GRO ends up producing an illegal GSO skb with a frag_list.
This is harmless unless the skb is then forwarded through an
interface that requires software GSO, whereupon the GSO code
will BUG.
This patch detects this case in GRO and avoids merging the
trailer skb.
Reported-by: Mark Wagner <mwagner@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
net: Fix the rollback test in dev_change_name()
In dev_change_name() an err variable is used for storing the original
call_netdevice_notifiers() errno (negative) and testing for a rollback
error later, but the test for non-zero is wrong, because the err might
have positive value as well - from dev_alloc_name(). It means the
rollback for a netdevice with a number > 0 will never happen. (The err
test is reordered btw. to make it more readable.)
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
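An illustrative fragment of the corrected test (not the exact upstream diff;
err holds the earlier, possibly positive, dev_alloc_name() result):

	ret = call_netdevice_notifiers(NETDEV_CHANGENAME, dev);
	ret = notifier_to_errno(ret);

	if (ret) {
		if (err >= 0) {
			/* first notifier failure: undo the rename */
			err = ret;
			memcpy(dev->name, oldname, IFNAMSIZ);
			goto rollback;
		} else {
			/* rollback itself failed, only complain */
			printk(KERN_ERR "%s: name change rollback failed: %d.\n",
			       dev->name, ret);
		}
	}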
Recent changes in the TX error propagation require additional checking
and masking of values returned from hard_start_xmit(), mainly to
separate cases where skb was consumed. This aim can be simplified by
changing the order of NETDEV_TX and NET_XMIT codes, because the latter
are treated similarly to negative (ERRNO) values.
After this change much simpler dev_xmit_complete() is also used in
sch_direct_xmit(), so it is moved to netdevice.h.
Additionally NET_RX definitions in netdevice.h are moved up from
between TX codes to avoid confusion while reading the TX comment.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
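A sketch of the simplified helper this reordering allows (exact constant
values assumed):

static inline bool dev_xmit_complete(int rc)
{
	/* NETDEV_TX_OK, negative errnos and the NET_XMIT_* codes all mean
	 * the skb has been consumed; only the NETDEV_TX_BUSY range above
	 * NET_XMIT_MASK means the caller must retry with the same skb */
	if (likely(rc < NET_XMIT_MASK))
		return true;

	return false;
}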
Check the return value of ndo_select_queue(). If the value isn't smaller
than the real_num_tx_queues, print a warning message, and reset it to zero.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
----
Signed-off-by: David S. Miller <davem@davemloft.net>
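A sketch of such a clamp (close in spirit to the described check, not
necessarily the exact code):

static u16 dev_cap_txqueue_sketch(struct net_device *dev, u16 queue_index)
{
	if (unlikely(queue_index >= dev->real_num_tx_queues)) {
		if (net_ratelimit())
			WARN(1, "%s selects TX queue %d, but real number "
			     "of TX queues is %d\n",
			     dev->name, queue_index,
			     dev->real_num_tx_queues);
		/* fall back to queue 0 rather than indexing out of range */
		return 0;
	}
	return queue_index;
}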
Currently the ->ndo_hard_start_xmit() callbacks are only permitted to return
one of the NETDEV_TX codes. This prevents any kind of error propagation for
virtual devices, like queue congestion of the underlying device in case of
layered devices, or unreachability in case of tunnels.
This patches changes the NET_XMIT codes to avoid clashes with the NETDEV_TX
codes and changes the two callers of dev_hard_start_xmit() to expect either
errno codes, NET_XMIT codes or NETDEV_TX codes as return value.
In case of qdisc_restart(), all non NETDEV_TX codes are mapped to NETDEV_TX_OK
since no error propagation is possible when using qdiscs. In case of
dev_queue_xmit(), the error is propagated upwards.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that sys_sysctl is a compatibility wrapper around /proc/sys
all sysctl strategy routines, and all ctl_name and strategy
entries in the sysctl tables are unused, and can be
removed.
In addition neigh_sysctl_register has been modified to no longer
take a strategy argument and its callers have been modified not
to pass one.
Cc: "David Miller" <davem@davemloft.net>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: netdev@vger.kernel.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
The full_name_hash function does not produce well distributed values in
the lower bits, so most code uses hash_32() to fold it. This is really
a bug introduced back in 2.5 when I added name hashing.
hash_32 is all that is needed since full_name_hash returns unsigned int
which is only 32 bits on 64 bit platforms.
Also, there is no point in using hash_32 on ifindex, because it is naturally
sequential and usually well distributed.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
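An illustrative sketch of the two lookups after this change (hash sizes and
struct members assumed):

static struct hlist_head *dev_name_hash_sketch(struct net *net, const char *name)
{
	unsigned int hash = full_name_hash(name, strnlen(name, IFNAMSIZ));

	/* fold the poorly distributed low bits with hash_32() */
	return &net->dev_name_head[hash_32(hash, NETDEV_HASHBITS)];
}

static struct hlist_head *dev_index_hash_sketch(struct net *net, int ifindex)
{
	/* ifindex is already sequential, no extra hashing needed */
	return &net->dev_index_head[ifindex & (NETDEV_HASHENTRIES - 1)];
}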
NAPI drivers try to recycle SKBs in their polling routine, but we
generally don't know the context in which the polling will be called,
and the skb recycling itself may require IRQs to be enabled.
This patch adds irqs_disabled() test to the skb_recycle_check()
routine, so that we'll not let the drivers hit the skb recycling
path with IRQs disabled.
As a side effect, this patch actually disables skb recycling for some
[broken] drivers. E.g. gianfar driver grabs an irqsave spinlock during
TX ring processing, and then tries to recycle an skb, and that caused
the following badness:
nf_conntrack version 0.5.0 (1008 buckets, 4032 max)
------------[ cut here ]------------
Badness at kernel/softirq.c:143
NIP: c003e3c4 LR: c423a528 CTR: c003e344
...
NIP [c003e3c4] local_bh_enable+0x80/0xc4
LR [c423a528] destroy_conntrack+0xd4/0x13c [nf_conntrack]
Call Trace:
[c15d1b60] [c003e32c] local_bh_disable+0x1c/0x34 (unreliable)
[c15d1b70] [c423a528] destroy_conntrack+0xd4/0x13c [nf_conntrack]
[c15d1b80] [c02c6370] nf_conntrack_destroy+0x3c/0x70
Signed-off-by: David S. Miller <davem@davemloft.net>
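A sketch of the added guard (the remaining recycling checks are elided):

bool skb_recycle_check_sketch(struct sk_buff *skb, int skb_size)
{
	/* reinitialising the skb may require IRQs to be enabled, so refuse
	 * to recycle from a context that has them disabled (e.g. under an
	 * irqsave spinlock) */
	if (irqs_disabled())
		return false;

	/* ... existing size / cloned / shared checks go here ... */

	return true;
}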
There is no good reason to not support userspace specifying the
network namespace during device creation, and it makes it easier
to create a network device and pass it to a child network namespace
with a well known name.
We have to be careful to ensure that the target network namespace
for the new device exists through the life of the call. To keep
that logic clear I have factored out the network namespace grabbing
logic into rtnl_link_get_net.
In addition we need to continue to pass the source network namespace
to the rtnl_link_ops.newlink method so that we can find the base
device source network namespace.
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
SPIN_LOCK_UNLOCKED is deprecated. Use DEFINE_SPINLOCK instead.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/usb/cdc_ether.c
All CDC ethernet devices of type USB_CLASS_COMM need to use
'&mbm_info'.
Signed-off-by: David S. Miller <davem@davemloft.net>
net/core/sock.c: In function 'sock_setsockopt':
net/core/sock.c:396: warning: 'index' may be used uninitialized in this function
net/core/sock.c:396: note: 'index' was declared here
GCC can't see that all paths initialize index, so just
set it to the default (0) and eliminate the specific
code block that handles the null device name string.
Signed-off-by: David S. Miller <davem@davemloft.net>
cur_pkt_size can be changed in proc fs while pktgen is running, so
we'd better use a private field to get a precise tx-bytes counter.
Signed-off-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Avoid dev_hold()/dev_put() in sock_bindtodevice()
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adds RCU management to the list of netdevices.
Convert some for_each_netdev() users to RCU version, if
it can avoid read_lock-ing dev_base_lock
Ie:
    read_lock(&dev_base_lock);
    for_each_netdev(net, dev)
            some_action();
    read_unlock(&dev_base_lock);
becomes:
    rcu_read_lock();
    for_each_netdev_rcu(net, dev)
            some_action();
    rcu_read_unlock();
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
All ioctls() implemented by dev_ifsioc_locked() :
SIOCGIFFLAGS, SIOCGIFMETRIC, SIOCGIFMTU, SIOCGIFHWADDR,
SIOCGIFSLAVE, SIOCGIFMAP, SIOCGIFINDEX & SIOCGIFTXQLEN
can use RCU lock instead of dev_base_lock rwlock
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I tested the recent unregister many changes and got a weird,
nasty and seemingly unrelated kernel oops. Changing
unregister_netdevice_queue to use list_move_tail fixes
the problem for me.
ip link add type veth
rmmod veth
ls /sys/class/net/
showed one of the veth devices still present.
A subsequent ip link oopsed the box.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some workloads hit dev_base_lock rwlock pretty hard.
We can use RCU lookups to avoid touching this rwlock
(and avoid touching netdevice refcount)
netdevices are already freed after a RCU grace period, so this patch
adds no penalty at device dismantle time.
However, it adds a synchronize_rcu() call in dev_change_name()
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This isn't beautifully abstracted, but it is simple,
simplifies uses and so far is only needed for the bonding driver.
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On UDP sockets, we must call skb_free_datagram() with socket locked,
or risk sk_forward_alloc corruption. This requirement is not respected
in SUNRPC.
Add a convenient helper, skb_free_datagram_locked() and use it in SUNRPC
Reported-by: Francis Moreau <francis.moro@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
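A minimal sketch of such a helper (assumed shape, the in-tree version may
differ):

void skb_free_datagram_locked_sketch(struct sock *sk, struct sk_buff *skb)
{
	/* take the socket lock so sk_forward_alloc updates stay consistent */
	lock_sock(sk);
	skb_free_datagram(sk, skb);
	release_sock(sk);
}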
Small cleanup of __dev_get_by_name() and __dev_get_by_index()
to use hlist_for_each_entry() : They'll look like their _rcu variant.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This will allow drivers to adjust their receive path dynamically
based on whether GRO is being applied successfully.
Currently all in-tree callers ignore the return values of these
functions and do not need to be changed.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This clarifies which return and parameter types are GRO result codes
and not RX result codes.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some workloads hit dev_base_lock rwlock pretty hard.
We can use RCU lookups to avoid touching this rwlock.
netdevices are already freed after a RCU grace period, so this patch
adds no penalty at device dismantle time.
dev_ifname() converted to dev_get_by_index_rcu()
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit d519e17e2d
(net: export device speed and duplex via sysfs)
made the wrong assumption that netdev->ethtool_ops was always set.
This makes it possible to crash the kernel and leave rtnl in a locked state.
modprobe dummy
ip link set dummy0 up
(udev runs and crash)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use unregister_netdevice_many() to speedup master device unregister.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adding a list_head parameter to rtnl_link_ops->dellink() methods
allows us to queue devices on a list, in order to dismantle
them all at once.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce rollback_registered_many() and unregister_netdevice_many()
rollback_registered_many() is able to perform necessary steps at device dismantle
time, factorizing two expensive synchronize_net() calls.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds an unreg_list anchor to struct net_device, and
introduces an unregister_netdevice_queue() function, able to queue
a net_device to a list instead of immediately unregister it.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently use a 16 bit field (vlan_tci) to store VLAN ID/PRIO on a skb.
Null value is used as a special value, meaning vlan tagging not enabled.
This forbids use of null vlan ID.
As pointed out by David, some drivers use the 3 high order bits (PRIO).
As VLAN ID is 12 bits, we can use the remaining bit (CFI) as a flag, and
allow null VLAN ID.
In case future code really wants to use VLAN_CFI_MASK, we'll have to use
a bit outside of vlan_tci.
#define VLAN_PRIO_MASK 0xe000 /* Priority Code Point */
#define VLAN_PRIO_SHIFT 13
#define VLAN_CFI_MASK 0x1000 /* Canonical Format Indicator */
#define VLAN_TAG_PRESENT VLAN_CFI_MASK
#define VLAN_VID_MASK 0x0fff /* VLAN Identifier */
Reported-by: Gertjan Hofman <gertjan_hofman@yahoo.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
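For illustration, accessors of that era built on top of these bits look roughly
like this (sketch, not quoted from the patch):

#define vlan_tx_tag_present(__skb)	((__skb)->vlan_tci & VLAN_TAG_PRESENT)
#define vlan_tx_tag_get(__skb)		((__skb)->vlan_tci & ~VLAN_TAG_PRESENT)

/* a driver only inserts a tag when the flag bit is set */
if (vlan_tx_tag_present(skb))
	vid = vlan_tx_tag_get(skb) & VLAN_VID_MASK;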
While playing with pktgen, I realized IP ID was not filled and a
random value was taken, possibly leaking 2 bytes of kernel memory.
We can use an increasing ID, this can help diagnostics anyway.
Also clear packet payload, instead of leaking kernel memory.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When handling large number of netdevice, rtnl_dump_ifinfo()
is very slow because it has O(N^2) complexity.
Instead of scanning one single list, we can use the 256 sub lists
of the dev_index hash table.
This considerably speedups "ip link" operations
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rtnl_getlink() & rtnl_setlink() run with RTNL held, we can use
__dev_get_by_index() and __dev_get_by_name() variants and avoid
dev_hold()/dev_put()
Adds to rtnl_getlink() the capability to find a device by its name,
not only by its index.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For connected sockets, the first run of dev_pick_tx saves the
calculated txq in sk_tx_queue_mapping. This is not saved if
either the device has a queue select or the socket is not
connected. Next iterations of dev_pick_tx uses the cached value
of sk_tx_queue_mapping.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dst_negative_advice() should check for changed dst and reset
sk_tx_queue_mapping accordingly. Pass sock to the callers of
dst_negative_advice.
(sk_reset_txq is defined just for use by dst_negative_advice. The
only way I could find to get around this is to move dst_negative_advice()
from dst.h to dst.c, include sock.h in dst.c, etc)
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce sk_tx_queue_mapping; and functions that set, test and
get this value. Reset sk_tx_queue_mapping to -1 whenever the dst
cache is set/reset, and in socket alloc. Setting txq to -1 and
using valid txq=<0 to n-1> allows the tx path to use the value
of sk_tx_queue_mapping directly instead of subtracting 1 on every
tx.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
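A sketch of the helpers described above (assumed shape):

static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
{
	sk->sk_tx_queue_mapping = tx_queue;
}

static inline void sk_tx_queue_clear(struct sock *sk)
{
	/* -1 means "no queue recorded yet" */
	sk->sk_tx_queue_mapping = -1;
}

static inline int sk_tx_queue_get(const struct sock *sk)
{
	return sk ? sk->sk_tx_queue_mapping : -1;
}

static inline bool sk_tx_queue_recorded(const struct sock *sk)
{
	return sk && sk->sk_tx_queue_mapping >= 0;
}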
It can help being able to filter packets on their queue_mapping.
If filter performance is not good, we could add a "numqueue" field
in struct packet_type, so that netif_nit_deliver() and other functions
can directly ignore packets with an unexpected queue number.
Let's experiment with this simple filter extension first.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow bpf to set a filter to drop packets that don't
match a specific mark.
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
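Both of the filter extensions above are reachable from classic BPF as ancillary
loads (SKF_AD_QUEUE and SKF_AD_MARK). A hedged userspace sketch that accepts
only packets with skb->mark == 1 (the mark value and helper names are
illustrative):

#include <linux/filter.h>
#include <sys/socket.h>

#ifndef SO_ATTACH_FILTER
#define SO_ATTACH_FILTER 26		/* generic value, assumed here */
#endif

static struct sock_filter mark_filter[] = {
	/* A = skb->mark (ancillary load) */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, SKF_AD_OFF + SKF_AD_MARK),
	/* keep the whole packet if A == 1, otherwise drop it */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 1, 0, 1),
	BPF_STMT(BPF_RET | BPF_K, 0xffffffff),
	BPF_STMT(BPF_RET | BPF_K, 0),
};

static int attach_mark_filter(int fd)
{
	struct sock_fprog prog = {
		.len	= sizeof(mark_filter) / sizeof(mark_filter[0]),
		.filter	= mark_filter,
	};

	return setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER,
			  &prog, sizeof(prog));
}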
- skb_kill_datagram() can increment sk->sk_drops itself, not callers.
- UDP on IPV4 & IPV6 dropped frames (because of bad checksum or policy checks) increment sk_drops
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sock_queue_rcv_skb() can update sk_drops itself, removing need for
callers to take care of it. This is more consistent since
sock_queue_rcv_skb() also reads sk_drops when queueing a skb.
This adds sk_drops management to many protocols that did not care about it yet.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create a new socket level option to report number of queue overflows
Recently I augmented the AF_PACKET protocol to report the number of frames lost
on the socket receive queue between any two enqueued frames. This value was
exported via a SOL_PACKET level cmsg. After I completed that work it was
requested that this feature be generalized so that any datagram oriented socket
could make use of this option. As such I've created this patch. It creates a
new SOL_SOCKET level option called SO_RXQ_OVFL, which when enabled exports a
SOL_SOCKET level cmsg that reports the number of times the sk_receive_queue
overflowed between any two given frames. It also augments the AF_PACKET
protocol to take advantage of this new feature (as it previously did not touch
sk->sk_drops, which this patch uses to record the overflow count). Tested
successfully by me.
Notes:
1) Unlike my previous patch, this patch simply records the sk_drops value, which
is not a number of drops between packets, but rather a total number of drops.
Deltas must be computed in user space.
2) While this patch currently works with datagram oriented protocols, it will
also be accepted by non-datagram oriented protocols. I'm not sure if that's
agreeable to everyone, but my argument in favor of doing so is that, for those
protocols which aren't applicable to this option, sk_drops will always be zero,
and reporting no drops on a receive queue that isn't used for those
non-participating protocols seems reasonable to me. This also saves us having
to code in a per-protocol opt in mechanism.
3) This applies cleanly to net-next assuming that commit
977750076d (my af packet cmsg patch) is reverted
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
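A userspace sketch of consuming the new option (buffer sizes and names are
illustrative; the fallback define is an assumption for older headers):

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

#ifndef SO_RXQ_OVFL
#define SO_RXQ_OVFL 40			/* generic value, assumed here */
#endif

static void recv_with_drop_count(int fd)
{
	char data[2048], cbuf[CMSG_SPACE(sizeof(uint32_t))];
	struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = cbuf, .msg_controllen = sizeof(cbuf),
	};
	struct cmsghdr *cmsg;
	int one = 1;

	/* ask the kernel to attach the cumulative drop counter */
	setsockopt(fd, SOL_SOCKET, SO_RXQ_OVFL, &one, sizeof(one));

	if (recvmsg(fd, &msg, 0) < 0)
		return;

	for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
		if (cmsg->cmsg_level == SOL_SOCKET &&
		    cmsg->cmsg_type == SO_RXQ_OVFL) {
			uint32_t drops;

			memcpy(&drops, CMSG_DATA(cmsg), sizeof(drops));
			/* total drops so far; deltas are computed in userspace */
			printf("sk_drops = %u\n", drops);
		}
	}
}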
Now that software UFO is supported, UFO can be enabled on master
devices like bridge, bond even though the attached device doesn't
support this feature in hardware.
This allows UFO to be used between KVM host and guest even when a
physical interface attached to the bridge doesn't support UFO.
Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor wext to
* split out iwpriv handling
* split out iwspy handling
* split out procfs support
* allow cfg80211 to have wireless extensions compat code
w/o CONFIG_WIRELESS_EXT
After this, drivers need to
- select WIRELESS_EXT - for wext support
- select WEXT_PRIV - for iwpriv support
- select WEXT_SPY - for iwspy support
except cfg80211 -- which gets new hooks in wext-core.c
and can then get wext handlers without CONFIG_WIRELESS_EXT.
Wireless extensions procfs support is auto-selected
based on PROC_FS and anything that requires the wext core
(i.e. WIRELESS_EXT or CFG80211_WEXT).
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
After updating firmware stored in flash, users may wish to reset the
relevant hardware and start the new firmware immediately. This should
not be completely automatic as it may be disruptive.
A selective reset may also be useful for debugging or diagnostics.
This adds a separate reset operation which takes flags indicating the
components to be reset. Drivers are allowed to reset only a subset of
those requested, and must indicate the actual subset. This allows the
use of generic component masks and some future expansion.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jarek Poplawski wrote:
>
>
> Hmm... So you made me to do some "real" work here, and guess what?:
> there is one serious checkpatch warning! ;-) Plus, this new parameter
> should be added to the function description. Otherwise:
> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
>
> Thanks,
> Jarek P.
>
> PS: I guess full "Don't" would show we really mean it...
Okay :) Here is the last round, before the night !
Thanks again
[RFC] pkt_sched: gen_estimator: Don't report fake rate estimators
We currently send TCA_STATS_RATE_EST elements to netlink users, even if no estimator
is running.
# tc -s -d qdisc
qdisc pfifo_fast 0: dev eth0 root bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
Sent 112833764978 bytes 1495081739 pkt (dropped 0, overlimits 0 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0
User has no way to tell if the "rate 0bit 0pps" is a real estimation, or a fake
one (because no estimator is active)
After this patch, tc command output is :
$ tc -s -d qdisc
qdisc pfifo_fast 0: dev eth0 root bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
Sent 561075 bytes 1196 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
We add a parameter to the gnet_stats_copy_rate_est() function so that
it can use gen_estimator_active(bstats, r), as suggested by Jarek.
This parameter can be NULL if the check is not necessary (htb for
example has a mandatory rate estimator).
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A number of drivers (recently including cfg80211-based ones)
assume that all wireless handlers, including statistics, can
sleep and they often also implicitly assume that the rtnl is
held around their invocation. This is almost always true now
except when reading from sysfs:
BUG: sleeping function called from invalid context at kernel/mutex.c:280
in_atomic(): 1, irqs_disabled(): 0, pid: 10450, name: head
2 locks held by head/10450:
#0: (&buffer->mutex){+.+.+.}, at: [<c10ceb99>] sysfs_read_file+0x24/0xf4
#1: (dev_base_lock){++.?..}, at: [<c12844ee>] wireless_show+0x1a/0x4c
Pid: 10450, comm: head Not tainted 2.6.32-rc3 #1
Call Trace:
[<c102301c>] __might_sleep+0xf0/0xf7
[<c1324355>] mutex_lock_nested+0x1a/0x33
[<f8cea53b>] wdev_lock+0xd/0xf [cfg80211]
[<f8cea58f>] cfg80211_wireless_stats+0x45/0x12d [cfg80211]
[<c13118d6>] get_wireless_stats+0x16/0x1c
[<c12844fe>] wireless_show+0x2a/0x4c
Fix this by using the rtnl instead of dev_base_lock.
Reported-by: Miles Lane <miles.lane@gmail.com>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch exports the link-speed (in Mbps) and duplex of an interface
via sysfs. This eliminates the need to use ethtool just to check the
link-speed. Not requiring 'ethtool' and not relying on the SIOCETHTOOL
ioctl should be helpful in an embedded environment where space is at a
premium as well.
NOTE: This patch also intentionally allows non-root users to check the link
speed and duplex -- something not possible with ethtool.
Here's some sample output:
# cat /sys/class/net/eth0/speed
100
# cat /sys/class/net/eth0/duplex
half
# ethtool eth0
Settings for eth0:
Supported ports: [ TP ]
Supported link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Half 1000baseT/Full
Supports auto-negotiation: Yes
Advertised link modes: Not reported
Advertised auto-negotiation: No
Speed: 100Mb/s
Duplex: Half
Port: Twisted Pair
PHYAD: 1
Transceiver: internal
Auto-negotiation: off
Supports Wake-on: g
Wake-on: g
Current message level: 0x000000ff (255)
Link detected: yes
Signed-off-by: David S. Miller <davem@davemloft.net>
For various purposes including a wireless extensions
bugfix, we need to hook into the netdev creation before
netdev_register_kobject(). This will also ease
doing the dev type assignment that Marcel was working
on for cfg80211 drivers w/o touching them all.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can avoid two atomic ops on skb->users if the packet is not going to be
sent to the device (because the hardware txqueue is full).
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The in-tree implementations have all been converted to
get_sset_count().
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit fd29cf72 (pktgen: convert to use ktime_t)
inadvertently converted the "delay" parameter from nanosec to microsec.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is not currently possible to instruct pktgen to use one selected tx queue.
When Robert added multiqueue support in commit 45b270f8, he added
an interval (queue_map_min, queue_map_max), and his code doesn't take
into account the case of min = max, to select one tx queue exactly.
I suspect a high performance setup on an eight txqueue device wants
to use exactly eight cpus, and assign one tx queue to each sender.
This patch makes pktgen select the right tx queue, not the first one.
Also updates Documentation to reflect Robert changes.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
After last pktgen changes, delay handling is wrong.
pktgen actually sends packets at full line speed.
Fix is to update pkt_dev->next_tx even if spin() returns early,
so that next spin() calls have a chance to see a positive delay.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit 9b22ea5609
(net: fix packet socket delivery in rx irq handler)
we lost rx timestamping of packets received on accelerated vlans.
Effect is that tcpdump on real dev can show strange timings, since it gets rx timestamps
too late (ie at skb dequeueing time, not at skb queueing time)
14:47:26.986871 IP 192.168.20.110 > 192.168.20.141: icmp 64: echo request seq 1
14:47:26.986786 IP 192.168.20.141 > 192.168.20.110: icmp 64: echo reply seq 1
14:47:27.986888 IP 192.168.20.110 > 192.168.20.141: icmp 64: echo request seq 2
14:47:27.986781 IP 192.168.20.141 > 192.168.20.110: icmp 64: echo reply seq 2
14:47:28.986896 IP 192.168.20.110 > 192.168.20.141: icmp 64: echo request seq 3
14:47:28.986780 IP 192.168.20.141 > 192.168.20.110: icmp 64: echo reply seq 3
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 2b85a34e91
(net: No more expensive sock_hold()/sock_put() on each tx)
opens a window in sock_wfree() where another cpu
might free the socket we are working on.
A fix is to call sk->sk_write_space(sk) while still
holding a reference on sk.
Reported-by: Jike Song <albcamus@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This provides safety against negative optlen at the type
level instead of depending upon (sometimes non-trivial)
checks against this sprinkled all over the place, in
each and every implementation.
Based upon work done by Arjan van de Ven and feedback
from Linus Torvalds.
Signed-off-by: David S. Miller <davem@davemloft.net>
The move away from having drivers assign wireless handlers,
in favour of making cfg80211 assign them, broke the sysfs
registration (the wireless/ dir went missing) because the
handlers are now assigned only after registration, which is
too late.
Fix this by special-casing cfg80211-based devices, all
of which are required to have an ieee80211_ptr, in the
sysfs code, and also using get_wireless_stats() to have
the same values reported as in procfs.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Reported-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Tested-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Previous update did not resched in inner loop causing watchdogs.
Rewrite inner loop to:
* account for delays better with less clock calls
* more accurate timing of delay:
- only delay if packet was successfully sent
- if delay is 100ns and it takes 10ns to build packet then
account for that
* use wait_event_interruptible_timeout rather than open coding it.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Decouple kernel.h from ratelimit.h: the global declaration of
printk's ratelimit_state is not needed, and it leads to messy
circular dependencies due to ratelimit.h's (new) adding of a
spinlock_types.h include.
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: David S. Miller <davem@davemloft.net>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Sizing of memory allocations shouldn't depend on the number of physical
pages found in a system, as that generally includes (perhaps a huge amount
of) non-RAM pages. The amount of what actually is usable as storage
should instead be used as a basis here.
Some of the calculations (i.e. those not intending to use high memory)
should likely even use (totalram_pages - totalhigh_pages).
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Dave Airlie <airlied@linux.ie>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (66 commits)
be2net: fix some cmds to use mccq instead of mbox
atl1e: fix 2.6.31-git4 -- ATL1E 0000:03:00.0: DMA-API: device driver frees DMA
pkt_sched: Fix qstats.qlen updating in dump_stats
ipv6: Log the affected address when DAD failure occurs
wl12xx: Fix print_mac() conversion.
af_iucv: fix race when queueing skbs on the backlog queue
af_iucv: do not call iucv_sock_kill() twice
af_iucv: handle non-accepted sockets after resuming from suspend
af_iucv: fix race in __iucv_sock_wait()
iucv: use correct output register in iucv_query_maxconn()
iucv: fix iucv_buffer_cpumask check when calling IUCV functions
iucv: suspend/resume error msg for left over pathes
wl12xx: switch to %pM to print the mac address
b44: the poll handler b44_poll must not enable IRQ unconditionally
ipv6: Ignore route option with ROUTER_PREF_INVALID
bonding: make ab_arp select active slaves as other modes
cfg80211: fix SME connect
rc80211_minstrel: fix contention window calculation
ssb/sdio: fix printk format warnings
p54usb: add Zcomax XG-705A usbid
...
Let attribute group vectors be declared "const". We'd
like to let most attribute metadata live in read-only
sections... this is a start.
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This patch fixes commit e36b9d16c6. The approach
there is to call dev_close()/dev_open() whenever the device type is changed in
order to remap the device IP multicast addresses to HW multicast addresses.
This approach suffers from 2 drawbacks:
*. It assumes that the device is UP when calling dev_close(), or otherwise
dev_close() has no effect. It is worth mentioning that initscripts (Redhat)
and sysconfig (Suse) don't act the same in this matter.
*. dev_close() has other side effects, like deleting entries from the routing
table, which might be unnecessary.
The fix here is to directly remap the IP multicast addresses to HW multicast
addresses for a bonding device that changes its type, and nothing else.
Reported-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Moni Shoua <monis@voltaire.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1623 commits)
netxen: update copyright
netxen: fix tx timeout recovery
netxen: fix file firmware leak
netxen: improve pci memory access
netxen: change firmware write size
tg3: Fix return ring size breakage
netxen: build fix for INET=n
cdc-phonet: autoconfigure Phonet address
Phonet: back-end for autoconfigured addresses
Phonet: fix netlink address dump error handling
ipv6: Add IFA_F_DADFAILED flag
net: Add DEVTYPE support for Ethernet based devices
mv643xx_eth.c: remove unused txq_set_wrr()
ucc_geth: Fix hangs after switching from full to half duplex
ucc_geth: Rearrange some code to avoid forward declarations
phy/marvell: Make non-aneg speed/duplex forcing work for 88E1111 PHYs
drivers/net/phy: introduce missing kfree
drivers/net/wan: introduce missing kfree
net: force bridge module(s) to be GPL
Subject: [PATCH] appletalk: Fix skb leak when ipddp interface is not loaded
...
Fixed up trivial conflicts:
- arch/x86/include/asm/socket.h
converted to <asm-generic/socket.h> in the x86 tree. The generic
header has the same new #define's, so that works out fine.
- drivers/net/tun.c
fix conflict between 89f56d1e9 ("tun: reuse struct sock fields") that
switched over to using 'tun->socket.sk' instead of the redundantly
available (and thus removed) 'tun->sk', and 2b980dbd ("lsm: Add hooks
to the TUN driver") which added a new 'tun->sk' use.
Noted in 'next' by Stephen Rothwell.
The only valid usage for the bridge frame hooks is by
GPL components (such as the bridge module).
The kernel should not leave a crack in the door for proprietary
networking stacks to slip in.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the multiqueue integration with the qdisc API suffers from
a few problems:
- with multiple queues, all root qdiscs use the same handle. This means
they can't be exposed to userspace in a backwards compatible fashion.
- all API operations always refer to queue number 0. Newly created
qdiscs are automatically shared between all queues, it's not possible
to address individual queues or restore multiqueue behaviour once a
shared qdisc has been attached.
- Dumps only contain the root qdisc of queue 0, in case of non-shared
qdiscs this means the statistics are incomplete.
This patch reintroduces dev->qdisc, which points to the (single) root qdisc
from userspace's point of view. Currently it either points to the first
(non-shared) default qdisc, or a qdisc shared between all queues. The
following patches will introduce a classful dummy qdisc, which will be used
as root qdisc and contain the per-queue qdiscs as children.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove a debugging aid I accidentally left in the previous 'cleanup' patch
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support to flash a firmware image to a device using ethtool.
The driver gets the filename of the firmware image and flashes the image
using the request firmware path.
The region "on the chip" to be flashed can be specified by an option.
It is up to the device driver to map the region number passed by ethtool
to the region to be flashed.
The default behavior is to flash all the regions on the chip.
Signed-off-by: Ajit Khaparde <ajitk@serverengines.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
vlan devices are currently not multi-queue capable.
We can fix that with a new rtnl_link_ops method,
get_tx_queues(), called from rtnl_create_link().
This new method gets num_tx_queues/real_num_tx_queues
from real device.
register_vlan_device() is also handled.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It was recently pointed out to me that the last_rx field of the
net_device structure wasn't updated regularly. In fact only the
bonding driver really uses it currently. Since the drop_monitor code
relies on the last_rx field to detect drops on receive in hardware, we
need to find a more reliable way to rate limit our drop checks (so
that we don't check for drops on every frame received, which would be
inefficient). This patch adds a last_rx timestamp that is private to
the drop monitor code and is updated for every device that we track.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The net_dev of backlog napi is NULL, like below:
__get_cpu_var(softnet_data).backlog.dev == NULL
So, we should check it in napi tracepoint's probe function
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit 2b85a34e91
(net: No more expensive sock_hold()/sock_put() on each tx)
sk_free() frees socks conditionally and depends
on sk_wmem_alloc being set e.g. in sock_init_data(). But in some
cases sk_free() is called earlier, usually after other alloc errors.
Fix is to move sk_wmem_alloc initialization from sock_init_data()
to sk_alloc() itself.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch compiled and 32 simultaneous netperf sessions ran fine.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It looks like after a rename the device proc entry is unusable,
because it has no ->read_proc or ->proc_fops.
Also, create_proc_entry() is deprecated.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Increase module version, and cleanup module info.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simpler to have one place that spins and accounts for delays;
this will also make the last packet be detected faster for more
repeatable timing.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This changes how the pktgen thread spins/waits between
packets if delay is configured. It uses a high res timer to
wait for time to arrive.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The kernel ktime_t is a nice generic infrastructure for managing
high resolution times, as is done in pktgen.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If not using delay then no need to update next_tx after
each packet sent. This allows pktgen to send faster especially
on systems with slower clock sources.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Handle standard (and non-standard) return values in a switch.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
netdev_alloc_skb is NUMA node aware.
Also, don't exhaust the atomic emergency pool. We don't want pktgen
to cause OOM behaviour.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The if statement to test for "should a new packet be used"
can be simplified.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Do some reorganization of transmit logic path:
* move transmit queue full idle to separate routine
* add a cpu_relax()
* eliminate some of the unneeded goto's
* if queue is still stopped, go back to main thread loop.
* don't give up transmitting if quantum is exhausted (be greedy)
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
All the callers were freeing skb after stopping device.
Remove unneeded forward decl.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Don't force inlining where not needed. Gcc does a better job
of deciding to inline local functions.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A couple of minor functions can be written more compactly.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
WARN_ONCE if ndo_start_xmit() enables interrupts in netpoll_send_skb(),
because the NETPOLL API requires that interrupts remain disabled in
netpoll_send_skb().
Signed-off-by: Dongdong Deng <dongdong.deng@windriver.com>
Acked-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are no master devices in mac802154 anymore, so drop
the ARPHRD_IEEE802154_PHY definition.
Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
In 5e140dfc1f "net: reorder struct Qdisc
for better SMP performance" the definition of struct gnet_stats_basic
changed incompatibly, as copies of this struct are shipped to
userland via netlink.
Restoring the old behavior is not welcome, for performance reasons.
Fix is to use a private structure for the kernel, and
teach gnet_stats_copy_basic() to convert from kernel to user land,
using the legacy structure (struct gnet_stats_basic).
Based on a report and initial patch from Michael Spang.
Reported-by: Michael Spang <mspang@csclub.uwaterloo.ca>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The networking code checks CAP_SYS_MODULE before using request_module() to
try to load a kernel module. While this seems reasonable it's actually
weakening system security since we have to allow CAP_SYS_MODULE for things
like /sbin/ip and bluetoothd which need to be able to trigger module loads.
CAP_SYS_MODULE actually grants those binaries the ability to directly load
any code into the kernel. We should instead be protecting modprobe and the
modules on disk, rather than granting random programs the ability to load code
directly into the kernel. Instead we are going to gate those networking checks
on CAP_NET_ADMIN which still limits them to root but which does not grant
those processes the ability to load arbitrary code into the kernel.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Paul Moore <paul.moore@hp.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: James Morris <jmorris@namei.org>
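At each affected call site the pattern described above reduces to something
like this sketch (function name illustrative):

static void load_net_module_sketch(const char *name)
{
	/* demand-load only for netadmin-capable callers; unlike
	 * CAP_SYS_MODULE, CAP_NET_ADMIN does not let the caller load
	 * arbitrary code directly into the kernel */
	if (capable(CAP_NET_ADMIN))
		request_module("%s", name);
}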
skb allocation / consumption tracer - Add consumption tracepoint
This patch adds a tracepoint to skb_copy_datagram_iovec, which is called each
time a userspace process copies a frame from a socket receive queue to a user
space buffer. It allows us to hook in and examine each sk_buff that the system
receives on a per-socket basis, and can be used to compile a list of which skb's
were received by which processes.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
include/trace/events/skb.h | 20 ++++++++++++++++++++
net/core/datagram.c | 3 +++
2 files changed, 23 insertions(+)
Signed-off-by: David S. Miller <davem@davemloft.net>
dev_queue_xmit enqueues a skb and calls qdisc_run which
dequeues the skb and xmits it. In most cases, the skb that
is enqueued is the same one that is dequeued (unless the
queue gets stopped or multiple cpu's write to the same queue
and ends in a race with qdisc_run). For default qdiscs, we
can remove the redundant enqueue/dequeue and simply xmit the
skb since the default qdisc is work-conserving.
The patch uses a new flag - TCQ_F_CAN_BYPASS to identify the
default fast queue. The controversial part of the patch is
incrementing qlen when a skb is requeued - this is to avoid
checks like the second line below:
+ } else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
>> !q->gso_skb &&
+ !test_and_set_bit(__QDISC_STATE_RUNNING, &q->state)) {
Results of a 2 hour testing for multiple netperf sessions (1,
2, 4, 8, 12 sessions on a 4 cpu system-X). The BW numbers are
aggregate Mb/s across iterations tested with this version on
System-X boxes with Chelsio 10gbps cards:
----------------------------------
Size | ORG BW NEW BW |
----------------------------------
128K | 156964 159381 |
256K | 158650 162042 |
----------------------------------
Changes from ver1:
1. Move sch_direct_xmit declaration from sch_generic.h to
pkt_sched.h
2. Update qdisc basic statistics for direct xmit path.
3. Set qlen to zero in qdisc_reset.
4. Changed some function names to more meaningful ones.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This sockopt goes in line with SO_TYPE and SO_PROTOCOL. It makes it
possible for userspace programs to pass around file descriptors — I
am referring to arguments-to-functions, but it may even work for the
fd passing over UNIX sockets — without needing to also pass the
auxiliary information (PF_INET6/IPPROTO_TCP).
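A userspace sketch of what this enables (assuming a libc that exposes SO_DOMAIN, SO_TYPE and SO_PROTOCOL): the receiver of a bare fd can recover the auxiliary information itself:

#include <stdio.h>
#include <sys/socket.h>

/* Recover domain/type/protocol from a file descriptor alone. */
static void describe_socket(int fd)
{
	int domain, type, protocol;
	socklen_t len;

	len = sizeof(domain);
	getsockopt(fd, SOL_SOCKET, SO_DOMAIN, &domain, &len);
	len = sizeof(type);
	getsockopt(fd, SOL_SOCKET, SO_TYPE, &type, &len);
	len = sizeof(protocol);
	getsockopt(fd, SOL_SOCKET, SO_PROTOCOL, &protocol, &len);

	printf("domain=%d type=%d protocol=%d\n", domain, type, protocol);
}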
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to SO_TYPE returning the socket type, SO_PROTOCOL allows
retrieving the protocol used with a given socket.
I am not quite sure why we have that many copies of socket.h, or why
the values are not the same on all arches either, but on those where hex
numbers dominate, I use 0x1029 for SO_PROTOCOL as that seems to be
the next free unused number across a bunch of operating systems, or
so Google results make me want to believe. On the others, SO_PROTOCOL
just uses the next free Linux number, 38.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
String literals are constant, and usually, we can also tag the array
of pointers const too, moving it to the .rodata section.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, neigh_periodic_timer() is fired by a timer IRQ and
scans one hash bucket each round (very little work, in fact).
As we are supposed to scan the whole hash table in 15 seconds, this means
neigh_periodic_timer() can be fired very often (depending on the number
of concurrent hash entries stored in this table).
Converting this to a workqueue permits scanning the whole table at once,
minimizing icache pollution, and firing this work every 15 seconds,
independently of the hash table size.
This 15-second delay is not a hard number, as the work is a deferrable one.
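A minimal sketch of the conversion using today's workqueue macro names (the work and function names below are illustrative, not the actual ones):

#include <linux/workqueue.h>

static void neigh_periodic_scan(struct work_struct *work);
static DECLARE_DEFERRABLE_WORK(neigh_scan_work, neigh_periodic_scan);

static void neigh_periodic_scan(struct work_struct *work)
{
	/* walk every hash bucket here, from process context */

	/* deferrable: an idle CPU is not woken up just for this */
	schedule_delayed_work(&neigh_scan_work, 15 * HZ);
}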
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a path where an assertion in dev_unicast_sync() triggers:
igmp6_group_added -> dev_mc_add -> __dev_set_rx_mode ->
-> vlan_dev_set_rx_mode -> dev_unicast_sync
Therefore we cannot protect this list with the rtnl. This patch restores the
original behaviour of protecting this list with a spinlock.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Tested-by: Meelis Roos <mroos@linux.ee>
Signed-off-by: David S. Miller <davem@davemloft.net>
memcpy() should take into account the size of the pointers,
not only the number of pointers to copy.
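An illustration of the class of bug being fixed (not the actual call site):

#include <string.h>

void copy_ptr_array(void **dst, void * const *src, size_t count)
{
	/* Buggy: copies 'count' bytes, i.e. only a fraction of the entries. */
	/* memcpy(dst, src, count); */

	/* Fixed: copies 'count' pointers. */
	memcpy(dst, src, count * sizeof(*src));
}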
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to make cfg80211/nl80211 aware of network namespaces,
we have to do the following things:
* del_virtual_intf method takes an interface index rather
than a netdev pointer - simply change this
* nl80211 uses init_net a lot, it changes to use the sender's
network namespace
* scan requests use the interface index, hold a netdev pointer
and reference instead
* we want a wiphy and its associated virtual interfaces to be
in one netns together, so
- we need to be able to change ns for a given interface, so
export dev_change_net_namespace()
- for each virtual interface set the NETIF_F_NETNS_LOCAL
flag, and clear that flag only when the wiphy changes ns,
to disallow breaking this invariant
* when a network namespace goes away, we need to reparent the
wiphy to init_net
* cfg80211 users that support creating virtual interfaces must
create them in the wiphy's namespace, currently this affects
only mac80211
The end result is that you can now switch an entire wiphy into
a different network namespace with the new command
iw phy#<idx> set netns <pid>
and all virtual interfaces will follow (or the operation fails).
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This helps avoid error messages with ethtool -k on devices that
don't provide device specific routines.
Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
mac80211 required this due to the master netdev, but now
it can put all information into skb->cb and this can go.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
For mac80211, with the master netdev removal, we need to be
able to sync a multicast address list onto another list that
is not tracked within a netdev, so we need access to the
functions doing that.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
I guess it should be -EINVAL rather than EINVAL. I have not checked
when the bug came in. Perhaps a candidate for -stable?
Signed-off-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit e912b1142b
(net: sk_prot_alloc() should not blindly overwrite memory)
took care of not zeroing the whole new socket at allocation time.
sock_copy() is another spot where we should be very careful.
We should not set refcnt to a non-null value until
we are sure the other fields are correctly set up, or
a lockless reader could catch this socket by mistake
while it is not fully (re)initialized.
This patch moves sk_node & sk_refcnt to the very beginning
of struct sock to ease the job of sock_copy() & sk_prot_alloc().
We add an appropriate smp_wmb() before each sk_refcnt initialization
to match our RCU requirements (changes to sock keys should
be committed to memory before sk_refcnt is set).
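A simplified sketch of the ordering requirement (not the real sock_copy(); sk_refcnt is assumed to still be a plain atomic_t, as it was at the time):

#include <net/sock.h>

static void copy_and_publish(struct sock *nsk)
{
	/* ... copy / reinitialize every field except sk_node and sk_refcnt ... */

	smp_wmb();			/* commit the key fields to memory ... */
	atomic_set(&nsk->sk_refcnt, 1);	/* ... before a lockless reader can grab the socket */
}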
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename lookup_neigh_params to lookup_neigh_parms as the struct is named
neigh_parms and all other functions dealing with the struct carry
neigh_parms in their names.
Signed-off-by: Tobias Klauser <klto@zhaw.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function get_net_ns_by_pid(), to get a network
namespace from a pid_t, will be required in cfg80211
as well. Therefore, let's move it to net_namespace.c
and export it. We can't make it a static inline in
the !NETNS case because it needs to verify that the
given pid even exists (and return -ESRCH).
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
All we need to take care of is using proper RCU list
add/del primitives and inserting a synchronize_rcu()
at one place to make sure the exit notifiers are run
after everybody has stopped iterating the list.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some sockets use SLAB_DESTROY_BY_RCU, and our RCU code correctness
depends on sk->sk_nulls_node.next always being valid. A NULL
value is not allowed as it might fault a lockless reader.
The current sk_prot_alloc() implementation doesn't respect this hypothesis,
calling kmem_cache_alloc() with __GFP_ZERO. Just call memset() around
the forbidden field.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a memory barrier after the poll_wait function, paired with the
receive callbacks. Add functions sock_poll_wait and sk_has_sleeper
to wrap the memory barrier.
Without the memory barrier, the following race can happen.
The race fires, when following code paths meet, and the tp->rcv_nxt
and __add_wait_queue updates stay in CPU caches.
CPU1 CPU2
sys_select receive packet
... ...
__add_wait_queue update tp->rcv_nxt
... ...
tp->rcv_nxt check sock_def_readable
... {
schedule ...
if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
wake_up_interruptible(sk->sk_sleep)
...
}
If there were no caches the code would work ok, since the wait_queue and
rcv_nxt updates are done in opposite order on the two sides.
Meaning that once tp->rcv_nxt is updated by CPU2, CPU1 either already
passed the tp->rcv_nxt check and sleeps, or will get the new value for
tp->rcv_nxt and will return with a new data mask.
In both cases the process (CPU1) is being added to the wait queue, so the
waitqueue_active (CPU2) call cannot miss and will wake up CPU1.
The bad case is when the __add_wait_queue changes done by CPU1 stay in its
cache, and so does the tp->rcv_nxt update on CPU2's side. CPU1 will then
end up calling schedule and sleep forever if there is no more data on the
socket.
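A simplified sketch of the two wrappers and their barrier pairing (field and barrier details follow the kernels of that era, where the wait queue was still sk->sk_sleep):

static inline int sk_has_sleeper(struct sock *sk)
{
	/* Pairs with the barrier in sock_poll_wait(): order the
	 * receive-queue update before testing for sleepers. */
	smp_mb();
	return sk->sk_sleep && waitqueue_active(sk->sk_sleep);
}

static inline void sock_poll_wait(struct file *filp,
				  wait_queue_head_t *wait_address,
				  poll_table *p)
{
	if (p && wait_address) {
		poll_wait(filp, wait_address, p);
		/* Order __add_wait_queue() before the caller's
		 * tp->rcv_nxt (or similar) check. */
		smp_mb();
	}
}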
Calls to poll_wait in the following modules were omitted:
net/bluetooth/af_bluetooth.c
net/irda/af_irda.c
net/irda/irnet/irnet_ppp.c
net/mac80211/rc80211_pid_debugfs.c
net/phonet/socket.c
net/rds/af_rds.c
net/rfkill/core.c
net/sunrpc/cache.c
net/sunrpc/rpc_pipe.c
net/tipc/socket.c
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using early netconsole and gianfar driver this error pops up:
netconsole: timeout waiting for carrier
It appears that net/core/netpoll.c:netpoll_setup() is using
cond_resched() in a loop waiting for a carrier.
The thing is that cond_resched() is a no-op when system_state !=
SYSTEM_RUNNING, and so drivers/net/phy/phy.c's state_queue is never
scheduled, therefore link detection doesn't work.
I believe that the main problem is in cond_resched()[1], but regardless of
how the cond_resched() story ends, it might be a good idea to call
msleep(1) instead of cond_resched(), as suggested by Andrew Morton.
[1] http://lkml.org/lkml/2009/7/7/463
Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some PHYs require longer timeouts for carrier detection, and
auto-negotiation process may take indefinite amount of time.
It may be inconvenient to force longer timeouts for sane PHYs,
so let's introduce a kernel command line option.
Since we're using module_param(), the option also can be
changed in runtime.
Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch converts the remaining occurrences of raw return values to their
symbolic counterparts in ndo_start_xmit() functions that were missed by the
previous automatic conversion.
Additionally code that assumed the symbolic value of NETDEV_TX_OK to be zero
is changed to explicitly use NETDEV_TX_OK.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
When NAPI is disabled while we're in net_rx_action, we end up
calling __napi_complete without flushing GRO packets. This is
a bug as it would cause the GRO packets to linger, of course it
also literally BUGs to catch error like this :)
This patch changes it to napi_complete, with the obligatory IRQ
reenabling. This should be safe because we've only just disabled
IRQs and it does not materially affect the test conditions in
between.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6:
bnx2: Fix the behavior of ethtool when ONBOOT=no
qla3xxx: Don't sleep while holding lock.
qla3xxx: Give the PHY time to come out of reset.
ipv4 routing: Ensure that route cache entries are usable and reclaimable with caching is off
net: Move rx skb_orphan call to where needed
ipv6: Use correct data types for ICMPv6 type and code
net: let KS8842 driver depend on HAS_IOMEM
can: let SJA1000 driver depend on HAS_IOMEM
netxen: fix firmware init handshake
netxen: fix build with without CONFIG_PM
netfilter: xt_rateest: fix comparison with self
netfilter: xt_quota: fix incomplete initialization
netfilter: nf_log: fix direct userspace memory access in proc handler
netfilter: fix some sparse endianess warnings
netfilter: nf_conntrack: fix conntrack lookup race
netfilter: nf_conntrack: fix confirmation race condition
netfilter: nf_conntrack: death_by_timeout() fix
In order to get the tun driver to account packets, we need to be
able to receive packets with destructors set. To be on the safe
side, I added an skb_orphan call for all protocols by default since
some of them (IP in particular) cannot handle receiving packets
destructors properly.
Now it seems that at least one protocol (CAN) expects to be able
to pass skb->sk through the rx path without getting clobbered.
So this patch attempts to fix this properly by moving the skb_orphan
call to where it's actually needed. In particular, I've added it
to skb_set_owner_[rw] which is what most users of skb->destructor
call.
This is actually an improvement for tun too since it means that
we only give back the amount charged to the socket when the skb
is passed to another socket that will also be charged accordingly.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Tested-by: Oliver Hartkopp <olver@hartkopp.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (55 commits)
netxen: fix tx ring accounting
netxen: fix detection of cut-thru firmware mode
forcedeth: fix dma api mismatches
atm: sk_wmem_alloc initial value is one
net: correct off-by-one write allocations reports
via-velocity : fix no link detection on boot
Net / e100: Fix suspend of devices that cannot be power managed
TI DaVinci EMAC : Fix rmmod error
net: group address list and its count
ipv4: Fix fib_trie rebalancing, part 2
pkt_sched: Update drops stats in act_police
sky2: version 1.23
sky2: add GRO support
sky2: skb recycling
sky2: reduce default transmit ring
sky2: receive counter update
sky2: fix shutdown synchronization
sky2: PCI irq issues
sky2: more receive shutdown
sky2: turn off pause during shutdown
...
Manually fix trivial conflict in net/core/skbuff.c due to kmemcheck
This patch is inspired by a patch recently posted by Johannes Berg. Basically what
my patch does is group the list and a count of addresses into a newly introduced
structure, netdev_hw_addr_list (sketched below). This brings us two benefits:
1) struct net_device becomes a bit nicer.
2) in the future it will be possible to operate on these lists independently
of netdevices (by exporting the right functions).
I wanted to introduce this patch before I post a multicast list conversion.
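The grouping, roughly (the real definition lives in include/linux/netdevice.h):

struct netdev_hw_addr_list {
	struct list_head	list;
	int			count;
};

/* struct net_device can then embed e.g. a unicast list as:
 *	struct netdev_hw_addr_list	uc;
 * instead of a bare list head plus a separate counter field. */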
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
drivers/net/bnx2.c | 4 +-
drivers/net/e1000/e1000_main.c | 4 +-
drivers/net/ixgbe/ixgbe_main.c | 6 +-
drivers/net/mv643xx_eth.c | 2 +-
drivers/net/niu.c | 4 +-
drivers/net/virtio_net.c | 10 ++--
drivers/s390/net/qeth_l2_main.c | 2 +-
include/linux/netdevice.h | 17 +++--
net/core/dev.c | 130 ++++++++++++++++++--------------------
9 files changed, 89 insertions(+), 90 deletions(-)
Signed-off-by: David S. Miller <davem@davemloft.net>
The skb mac_header field is sometimes NULL (or ~0u) as a sentinel
value. The places where skb is expanded add an offset which would
change this flag into an invalid pointer (or offset).
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Looking at the crash in log_martians(), one suspect is that the check for
mac header being set is not correct. The value of mac_header defaults to
0 on allocation, therefore skb_mac_header_was_set will always be true on
platforms using NET_SKBUFF_USES_OFFSET.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch changes the FDB entry check for the ATM LANE bridge integration.
There's no point in holding an FDB entry around SKB building.
The br_fdb_get()/br_fdb_put() pair is changed into a single br_fdb_test_addr()
hook that checks whether the addr has an FDB entry pointing to a port other
than the one the request arrived on.
FDB entry refcounting is removed as it's not used anywhere else.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit b00055aacd " [NET] core: add
RFC2863 operstate" defined new interface flag values. Its
documentation specified that these flags could be accessed from user
space via SIOCGIFFLAGS. However, this does not work because the new
flags do not fit in that ioctl's argument width.
Change the documentation to match the code's behavior. Also change
the source to explicitly show the truncation. This _should_ have no
effect on executable code, and did not with gcc 4.2.4 generating x86
code.
A new ioctl could be defined to return all interface flags to user
space. However, since this has been broken for three years with no
one complaining, there doesn't seem much need. They are still
accessible via netlink.
Reported-by: "Fredrik Arnerup" <fredrik.arnerup@edgeware.tv>
Signed-off-by: John Dykstra <john.dykstra1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current code errors out the INCOMPLETE neigh entry skb queue only from
the timer, if the maximum number of probes has been attempted and there has been
no reply. This also causes the transition to the FAILED state.
However, the neigh entry can also be updated via netlink to inform that the
address is unavailable. Currently, neigh_update() just stops the timers and
leaves the pending skb's unreleased. The result is that the cleanup code in
the timer callback is never called, which also prevents proper garbage collection.
This fixes neigh_update() to process the pending skb queue immediately if an
INCOMPLETE -> FAILED state transition occurs due to a netlink request.
Signed-off-by: Timo Teras <timo.teras@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
One of the problems with sock memory accounting is that it uses
a pair of sock_hold()/sock_put() for each transmitted packet.
This slows down bidirectional flows because the receive path
also needs to take a refcount on the socket and might use a different
cpu than the transmit path or transmit completion path. So these
two atomic operations also trigger cache line bounces.
We can see this in tx or tx/rx workloads (media gateways for example),
where sock_wfree() can be in the top five functions in profiles.
We use this sock_hold()/sock_put() so that sock freeing
is delayed until all tx packets are completed.
As we also update sk_wmem_alloc, we can instead offset sk_wmem_alloc
by one unit at init time, until sk_free() is called.
Once sk_free() is called, we atomic_dec_and_test(sk_wmem_alloc)
to drop the initial offset and atomically check if any packets
are in flight.
skb_set_owner_w() doesn't call sock_hold() anymore.
sock_wfree() doesn't call sock_put() anymore, but checks if sk_wmem_alloc
reached 0 to perform the final freeing.
The drawback is that an skb->truesize error could lead to unfreeable sockets, or
even worse, prematurely calling __sk_free() on a live socket.
Nice speedups on SMP. tbench, for example, goes from 2691 MB/s to 2711 MB/s
on my 8 cpu dev machine, even though tbench was not really hitting the sk_refcnt
contention point. 5% speedup on a UDP transmit workload (depending
on the number of flows), lowering TX completion cpu usage.
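A minimal sketch of the accounting scheme described above; the helper names are illustrative, and sk_wmem_alloc is assumed to be the plain atomic_t it was at the time:

#include <net/sock.h>

/* Initial offset: keeps the socket alive while packets are in flight. */
static void tx_accounting_init(struct sock *sk)
{
	atomic_set(&sk->sk_wmem_alloc, 1);
}

/* TX completion (sock_wfree()-like): last decrement back to 0 frees. */
static void tx_completion(struct sock *sk, unsigned int truesize)
{
	if (atomic_sub_and_test(truesize, &sk->sk_wmem_alloc))
		__sk_free(sk);
}

/* sk_free()-like: drop the initial offset; frees only if nothing is in flight. */
static void socket_release(struct sock *sk)
{
	if (atomic_dec_and_test(&sk->sk_wmem_alloc))
		__sk_free(sk);
}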
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (244 commits)
Revert "x86, bts: reenable ptrace branch trace support"
tracing: do not translate event helper macros in print format
ftrace/documentation: fix typo in function grapher name
tracing/events: convert block trace points to TRACE_EVENT(), fix !CONFIG_BLOCK
tracing: add protection around module events unload
tracing: add trace_seq_vprint interface
tracing: fix the block trace points print size
tracing/events: convert block trace points to TRACE_EVENT()
ring-buffer: fix ret in rb_add_time_stamp
ring-buffer: pass in lockdep class key for reader_lock
tracing: add annotation to what type of stack trace is recorded
tracing: fix multiple use of __print_flags and __print_symbolic
tracing/events: fix output format of user stack
tracing/events: fix output format of kernel stack
tracing/trace_stack: fix the number of entries in the header
ring-buffer: discard timestamps that are at the start of the buffer
ring-buffer: try to discard unneeded timestamps
ring-buffer: fix bug in ring_buffer_discard_commit
ftrace: do not profile functions when disabled
tracing: make trace pipe recognize latency format flag
...
In order to handle powersave frames properly we had needed
to pass these out to the device queues again, and introduce
the skb->requeue bit. This, however, also has unnecessary
overhead by needing to 'clean up' already tried frames, and
this clean-up code is also buggy when software encryption
is used.
Instead of sending the frames via the master netdev queue
again, simply put them into the pending queue. This also
fixes a problem where frames for that particular station
could be reordered when some were still on the software
queues and older ones are re-injected into the software
queue after them.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
IEEE 802.15.4 stack requires several constants to be defined/adjusted.
Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
Signed-off-by: Sergey Lapin <slapin@ossfans.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit f001fde5ea
(net: introduce a list of device addresses dev_addr_list (v6))
added one regression that Vegard Nossum found in his testing.
With kmemcheck's help, Vegard found that some uninitialized memory
was read and reported to the user, potentially leaking kernel data.
( the thread can be found at http://lkml.org/lkml/2009/5/30/177 )
dev_addr_init() incorrectly uses the sizeof() operator. We were
initializing one byte instead of MAX_ADDR_LEN bytes.
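An illustration of the described mistake (simplified, not the exact call site):

#include <string.h>

#define MAX_ADDR_LEN	32	/* as in <linux/netdevice.h> */

void addr_init(unsigned char *dev_addr)
{
	/* Buggy: sizeof(*dev_addr) is 1, so only the first byte was initialized,
	 * leaving the remaining MAX_ADDR_LEN - 1 bytes as stale kernel memory. */
	/* memset(dev_addr, 0, sizeof(*dev_addr)); */

	/* Fixed: initialize the full address buffer. */
	memset(dev_addr, 0, MAX_ADDR_LEN);
}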
Reported-by: Vegard Nossum <vegard.nossum@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
vfree() does its own 'NULL' check, so no need for check before
calling it.
Signed-off-by: Figo.zhang <figo1802@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Increment the iovec base by the offset passed in for the initial
copy_to_user() in memcpy_to_iovecend().
Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_dma_unmap() is quite expensive for small packets,
because we use two different cache lines from skb_shared_info:
one to access nr_frags, one to access dma_maps[0].
Instead of dma_maps being an array of MAX_SKB_FRAGS + 1 elements,
put the head mapping alone in a new dma_head field, close to nr_frags,
to reduce cache line misses.
Tested on my dev machine (bnx2 & tg3 adapters), nice speedup !
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Get rid of num_dma_maps in struct skb_shared_info, as it seems unused.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On Thu, Jun 04, 2009 at 09:06:00PM +1000, Herbert Xu wrote:
>
> tun: Optimise handling of bogus gso->hdr_len
>
> As all current versions of virtio_net generate a value for the
> header length that's too small, we should optimise this so that
> we don't copy it twice. This can be done by ensuring that it is
> at least as large as the place where we'll write the checksum.
>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
With this applied we can strengthen the partial checksum check:
In skb_partial_csum_set we check to see if the checksum offset
is within the packet. However, we really should check that it
is within the skb head as that's the only bit we can modify
without copying.
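A sketch of the strengthened check (simplified from skb_partial_csum_set()):

#include <linux/skbuff.h>

static bool partial_csum_in_head(const struct sk_buff *skb, u16 start, u16 off)
{
	/* The 16-bit checksum field must lie entirely within the linear
	 * head, since that is the only part we can modify without copying. */
	return (unsigned int)start + off + sizeof(__sum16) <= skb_headlen(skb);
}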
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
NETDEV_UP is called after the device is set UP, but sometimes
it is useful to be able to veto the device UP. Introduce a
new NETDEV_PRE_UP notifier that can be used for exactly this.
The first use case will be cfg80211 denying interfaces to be
set UP if the device is known to be rfkill'ed.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Define three accessors to get/set dst attached to a skb
struct dst_entry *skb_dst(const struct sk_buff *skb)
void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)
void skb_dst_drop(struct sk_buff *skb)
These accessors should replace occurrences of:
dst_release(skb->dst)
skb->dst = NULL;
Delete skb->dst field
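A sketch of the accessors; the exact way the dst pointer is stored inside struct sk_buff has changed over time, so the field name below is only illustrative:

static inline struct dst_entry *skb_dst(const struct sk_buff *skb)
{
	return skb->_skb_dst;
}

static inline void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)
{
	skb->_skb_dst = dst;
}

static inline void skb_dst_drop(struct sk_buff *skb)
{
	dst_release(skb_dst(skb));
	skb_dst_set(skb, NULL);
}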
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch converts the unicast address list to a standard list_head using
the previously introduced struct netdev_hw_addr. It also relaxes the
locking. The original spinlock (still used for multicast addresses) is not
needed and is no longer used to protect this list. All
reading and writing takes place under the rtnl (with no changes).
I also removed the possibility to specify the length of the address
while adding or deleting a unicast address. It's always dev->addr_len.
The conversion especially touched the e1000 and ixgbe code, where the
change is not so trivial.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
drivers/net/bnx2.c | 13 +--
drivers/net/e1000/e1000_main.c | 24 +++--
drivers/net/ixgbe/ixgbe_common.c | 14 ++--
drivers/net/ixgbe/ixgbe_common.h | 4 +-
drivers/net/ixgbe/ixgbe_main.c | 6 +-
drivers/net/ixgbe/ixgbe_type.h | 4 +-
drivers/net/macvlan.c | 11 +-
drivers/net/mv643xx_eth.c | 11 +-
drivers/net/niu.c | 7 +-
drivers/net/virtio_net.c | 7 +-
drivers/s390/net/qeth_l2_main.c | 6 +-
drivers/scsi/fcoe/fcoe.c | 16 ++--
include/linux/netdevice.h | 18 ++--
net/8021q/vlan.c | 4 +-
net/8021q/vlan_dev.c | 10 +-
net/core/dev.c | 195 +++++++++++++++++++++++++++-----------
net/dsa/slave.c | 10 +-
net/packet/af_packet.c | 4 +-
18 files changed, 227 insertions(+), 137 deletions(-)
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the ALIGN() and PTR_ALIGN() macros instead of handcoding them.
Get rid of the ugly NETDEV_ALIGN_CONST define.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch stores the two shinfo pointers in local variables
because they're used over and over again in skb_gro_receive.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch reverses the direction of the frags array copy in
skb_gro_receive in order simplify the loop conditional. It
also avoids touching the first element of the original frags
array.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
As we know the only packets which need the final pskb_may_pull
are completely non-linear, and have all the required bits in
frag0, we can perform a straight memcpy instead of going through
pskb_may_pull and doing skb_copy_bits.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
For the overwhelming majority of cases, skb_gro_header's return
value cannot be NULL. Yet we must check it because of its current
form. This patch splits it up into multiple functions in order
to avoid this.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
By caching frag0_len, we can avoid checking both frag0 and the
length separately in skb_gro_header. This helps as skb_gro_header
is called four times per packet which amounts to a few million
times at 10Gb/s.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently skb_gro_header is used for packets which put the hardware
header in skb->data with the rest in frags. Since the drivers that
need this optimisation all provide completely non-linear packets,
we can gain extra optimisations by only performing the frag0
optimisation for completely non-linear packets.
In particular, we can simply test frag0 (instead of skb_headlen)
to see whether the optimisation is in force.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch stores the offset/headlen in local variables as they're
used repeatedly in skb_gro_offset.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function skb_gro_header is called four times per packet which
quickly adds up at 10Gb/s. This patch inlines it to allow better
optimisations.
Some architectures perform multiplication for page_address, which
is done by each skb_gro_header invocation. This patch caches that
value in skb->cb to avoid the unnecessary multiplications.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
gcc does a poor job at generating code for the memcpy of the frags
array in skb_gro_receive, which is the primary purpose of that
function when merging frags. In particular, it can't utilise the
alignment information of the source and destination. This patch
open-codes the copy so we process words instead of bytes.
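A sketch of the idea (not the exact skb_gro_receive() code): copy the frag descriptors element by element so the compiler can emit aligned, word-sized moves instead of a generic byte-wise memcpy:

static void copy_frags(skb_frag_t *dst, const skb_frag_t *src, int n)
{
	int i;

	for (i = 0; i < n; i++)
		dst[i] = src[i];	/* struct assignment: word-sized copies */
}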
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
BUS_ID_SIZE is really no more, and device names are dynamically
allocated and thus can be any necessary size.
So remove the BUG check here making sure BUS_ID_SIZE is at least
as large as IFNAMSIZ.
Signed-off-by: David S. Miller <davem@davemloft.net>
We would like to get rid of the netdev->trans_start = jiffies; that just about all net
drivers have to use in their start_xmit() function, and use txq->trans_start
instead.
This can be done generically in the core network code, as suggested by David.
Some devices (particularly loopback) don't need the trans_start update, because
they don't have a transmit watchdog. We could add a new device flag, or rely
on the fact that txq->trans_start may only be updated if txq->xmit_lock_owner is
different from -1. Use a helper function to hide our choice.
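A sketch of the helper, close to the txq_trans_update() this patch adds:

static inline void txq_trans_update(struct netdev_queue *txq)
{
	/* Devices without a transmit watchdog (loopback, ...) never take
	 * the xmit lock, so xmit_lock_owner stays -1 and we skip the write. */
	if (txq->xmit_lock_owner != -1)
		txq->trans_start = jiffies;
}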
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Right-shifts of signed integers are implementation-defined and therefore unportable.
With feedback from: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
All drivers are already converted to new net_device_ops API
and nobody uses old API anymore.
Signed-off-by: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hi:
skbuff: Copy csum instead of csum_start/csum_offset
It's easier to copy the u32 csum instead of its two u16
constituents.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Cheers,
Signed-off-by: David S. Miller <davem@davemloft.net>
Hi:
skbuff: Move new __skb_clone code into __copy_skb_header
It seems that people just keep on adding stuff to __skb_clone
instead of __copy_skb_header. This is wrong as it means your brand-new
attributes won't always get copied as you intended.
This patch moves them to the right place, and adds a comment to
prevent this from happening again.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Thanks,
Signed-off-by: David S. Miller <davem@davemloft.net>
Patch to add the ability to detect drops in hardware interfaces via dropwatch.
Adds a tracepoint to net_rx_action to signal every time a napi instance is
polled. The dropmon code then periodically checks to see if the rx_frames
counter has changed, and if so, adds a drop notification to the netlink
protocol, using the reserved all-0's vector to indicate that the drop location
was in hardware, rather than somewhere in the code.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
include/linux/net_dropmon.h | 8 ++
include/trace/napi.h | 11 +++
net/core/dev.c | 5 +
net/core/drop_monitor.c | 124 ++++++++++++++++++++++++++++++++++++++++++--
net/core/net-traces.c | 4 +
net/core/netpoll.c | 2
6 files changed, 149 insertions(+), 5 deletions(-)
Signed-off-by: David S. Miller <davem@davemloft.net>
The net_ns_init code can be simplified. No need to save error code
if it is only going to panic if it is set 4 lines later.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
typo -- pkt_dev->nflows is for stats only, the number of concurrent
flows is stored in cflows.
Reported-By: Vladimir Ivashchenko <hazard@francoudi.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The netlink message header (struct nlmsghdr) is an unused parameter in
fill method of fib_rules_ops struct. This patch removes this
parameter from this method and fixes the places where this method is
called.
(include/net/fib_rules.h)
Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
One point of contention under high network load is the dst_release() performed
when a transmitted skb is freed. This is because NIC tx completion calls
dev_kfree_skb() long after the original call to dev_queue_xmit(skb).
The CPU cache is cold and the atomic op in dst_release() stalls. On SMP, this is
quite visible if one CPU is 100% busy handling softirqs for a network device,
since dst_clone() is done by other cpus, involving cache line ping pongs.
It seems the right place to release the dst is dev_hard_start_xmit(), for most
devices except virtual ones and a few other exceptions.
David Miller suggested defining a new device flag, set in alloc_netdev_mq()
(so that most devices set it at init time), and carefully unset in devices
which don't want a NULL skb->dst in their ndo_start_xmit(); a sketch of the
resulting check follows the device list below.
List of devices that must clear this flag is :
- loopback device, because it calls netif_rx() and quoting Patrick :
"ip_route_input() doesn't accept loopback addresses, so loopback packets
already need to have a dst_entry attached."
- appletalk/ipddp.c : needs skb->dst in its xmit function
- And all devices that call again dev_queue_xmit() from their xmit function
(as some classifiers need skb->dst) : bonding, vlan, macvlan, eql, ifb, hdlc_fr
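The resulting check in dev_hard_start_xmit() looks roughly like this (IFF_XMIT_DST_RELEASE being the priv_flags bit introduced for this):

	if (dev->priv_flags & IFF_XMIT_DST_RELEASE) {
		dst_release(skb->dst);
		skb->dst = NULL;
	}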
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The earlier patch to fix the deadlock between a network device going
away and writing to sysfs attributes was incomplete.
- It did not set signal_pending, so we would leak ERESTARTSYS to user space.
- It used ERESTARTSYS, which only restarts if sigaction configures it to.
- It did not cover store and show for ifalias.
So fix all of these up and use the new helper restart_syscall so we get
the details right about what it takes.
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When called with a consumed value that is less than skb_headlen(skb)
bytes into a page frag, skb_seq_read() incorrectly returns an
offset/length relative to skb->data. Ensure that data which should come
from a page frag does.
Signed-off-by: Thomas Chenault <thomas_chenault@dell.com>
Tested-by: Shyam Iyer <shyam_iyer@dell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
gen_estimator can overflow bps (bytes per second) with Gb links, while
it was designed with a u32 API, with a theoretical limit of 34360 Mbit
(2^32 bytes).
Using 64-bit intermediate avbps/brate counters allows us to reach
this theoretical limit.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
offsetof(struct net_device, features)=0x44
offsetof(struct net_device, stats.tx_packets)=0x54
offsetof(struct net_device, stats.tx_bytes)=0x5c
offsetof(struct net_device, stats.tx_dropped)=0x6c
Network drivers that touch dev->stats.tx_packets/stats.tx_bytes in their
tx path can slow down SMP operations, since they dirty a cache line
that should stay shared (dev->features is needed in rx and tx paths)
We could move the stats field away in net_device but it won't help that much
(two cache lines would still be dirtied in the tx path, when we can do with only one).
A better solution is to add tx_packets/tx_bytes/tx_dropped to struct
netdev_queue, because this structure is already touched in the tx path and
the counter updates will then be free (no increase in size).
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When TCP frees up write buffer space, avoid waking up tasks that have
done a poll() or select() on the same socket specifying read-side
events.
This is an extension of a read-side patch by Eric Dumazet.
Signed-off-by: John Dykstra <john.dykstra1@gmail.com>
Acked-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It looks like the dev in netpoll_poll can be NULL - at least it's
checked at the beginning of the function. Thus the dev->netdev_ops dereference
looks dangerous.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add missed checking of dev_addr_init return value in alloc_netdev_mq.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
net/core/dev.c | 15 ++++++++++++---
1 files changed, 12 insertions(+), 3 deletions(-)
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit ead2ceb0ec ("Network Drop Monitor:
Adding kfree_skb_clean for non-drops and modifying end-of-line points
for skbs") established new conventions for identifying dropped packets.
Align skb_kill_datagram() with these conventions so that packets that
get dropped just before the copy to userspace are properly tracked.
Signed-off-by: John Dykstra <john.dykstra1@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge reason: tracing/core was on a .30-rc1 base and was missing out on
a handful of tracing fixes present in .30-rc5-almost.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Commit ac45f602ee ("net: infrastructure
for hardware time stamping") added two skb initialization actions to
__alloc_skb(), which need to be added to skb_recycle_check() as well.
Signed-off-by: Lennert Buytenhek <buytenh@wantstofly.org>
Signed-off-by: Patrick Ohly <patrick.ohly@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
v5 -> v6 (current):
-removed so far unused static functions
-corrected dev_addr_del_multiple to call del instead of add
v4 -> v5:
-added device address type (suggested by davem)
-removed refcounting (better to have simpler code than to save potentially a few
bytes)
v3 -> v4:
-changed kzalloc to kmalloc in __hw_addr_add_ii()
-ASSERT_RTNL() avoided in dev_addr_flush() and dev_addr_init()
v2 -> v3:
-removed unnecessary rcu read locking
-moved dev_addr_flush() calling to ensure no null dereference of dev_addr
v1 -> v2:
-added forgotten ASSERT_RTNL to dev_addr_init and dev_addr_flush
-removed unnecessary rcu_read locking in dev_addr_init
-use compare_ether_addr_64bits instead of compare_ether_addr
-use L1_CACHE_BYTES as size for allocating struct netdev_hw_addr
-use call_rcu instead of rcu_synchronize
-moved is_etherdev_addr into __KERNEL__ ifdef
This patch introduces a new list in struct net_device and brings a set of
functions to handle working with the device address list. The list is a replacement
for the original dev_addr field, because in some situations there is a need to
carry several device addresses with the net device. To be backward compatible,
dev_addr is made to point to the first member of the list, so original drivers
see no difference.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net_create() will be used by C/R to create fresh netns on restart.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
copy_net_ns() doesn't copy anything, it creates fresh netns, so
get/put of old netns isn't needed.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based almost entirely upon a patch by Eric Dumazet.
The common case is to have num-tx-queues <= num_rx_queues
and even if num_tx_queues is larger it will not be significantly
larger.
Therefore, a subtraction loop is always going to be faster than
modulus.
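A sketch of the mapping as used in skb_tx_hash()-style code: fold the recorded rx queue index into [0, real_num_tx_queues) without an integer divide:

static u16 fold_rx_queue(u16 rx_queue, u16 num_tx_queues)
{
	/* usually zero or one iteration, since num_tx <= num_rx in the common case */
	while (rx_queue >= num_tx_queues)
		rx_queue -= num_tx_queues;
	return rx_queue;
}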
Signed-off-by: David S. Miller <davem@davemloft.net>
When skb_rx_queue_recorded() is true, we don't want to use the jhash distribution,
as the device driver told us exactly which queue was selected at RX time.
jhash makes a statistical shuffle, but this won't work with 8 static inputs.
A later improvement would be to compute the reciprocal value of real_num_tx_queues
to avoid a divide here. But this computation should be done once,
when real_num_tx_queues is set. This needs a separate patch, and a new
field in struct net_device.
Reported-by: Andrew Dickinson <andrew@whydna.net>
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lennert Buytenhek wrote:
> Since 4fb6699481 ("net: Optimize memory
> usage when splicing from sockets.") I'm seeing this oops (e.g. in
> 2.6.30-rc3) when splicing from a TCP socket to /dev/null on a driver
> (mv643xx_eth) that uses LRO in the skb mode (lro_receive_skb) rather
> than the frag mode:
My patch incorrectly assumed skb->sk was always valid, but for
"frag_listed" skbs we can only use skb->sk of their parent.
Reported-by: Lennert Buytenhek <buytenh@wantstofly.org>
Debugged-by: Lennert Buytenhek <buytenh@wantstofly.org>
Tested-by: Lennert Buytenhek <buytenh@wantstofly.org>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In 2.6.25 we added UDP mem accounting.
This unfortunately added a penalty when a frame is transmitted, since
at TX completion time we have to call sock_wfree() to perform the necessary
memory accounting. This calls sock_def_write_space() and ultimately the
scheduler if any thread is waiting on the socket.
Thread(s) waiting for an incoming frame were scheduled, then had to sleep
again as the event was meaningless.
(All threads waiting on a socket use the same sk_sleep anchor)
This adds a lot of extra wakeups and increases latencies, as noted
by Christoph Lameter, and slows down the softirq handler.
Reference : http://marc.info/?l=linux-netdev&m=124060437012283&w=2
Fortunately, Davide Libenzi recently added the concept of keyed wakeups
to the kernel, and particularly for sockets (see commit
37e5540b3c
epoll keyed wakeups: make sockets use keyed wakeups)
Davide goal was to optimize epoll, but this new wakeup infrastructure
can help non epoll users as well, if they care to setup an appropriate
handler.
This patch introduces a new DEFINE_WAIT_FUNC() helper and uses it
in wait_for_packet(), so that only a relevant event can wake up a thread
blocked in this function.
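A simplified sketch of the keyed-wakeup usage (type names as in kernels of that era):

static int receiver_wake_function(wait_queue_t *wait, unsigned int mode,
				  int sync, void *key)
{
	unsigned long bits = (unsigned long)key;

	/* Avoid a wakeup if the event is not interesting for a receiver. */
	if (bits && !(bits & (POLLIN | POLLERR)))
		return 0;
	return autoremove_wake_function(wait, mode, sync, key);
}

/* wait_for_packet() then sleeps with:
 *	DEFINE_WAIT_FUNC(wait, receiver_wake_function);
 *	prepare_to_wait_exclusive(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
 */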
Trace of function calls from bnx2 TX completion bnx2_poll_work() is :
__kfree_skb()
skb_release_head_state()
sock_wfree()
sock_def_write_space()
__wake_up_sync_key()
__wake_up_common()
receiver_wake_function() : Stops here since thread is waiting for an INPUT
Reported-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The skb_gro_* code fails to handle the case where a header starts
in the linear area but ends in the frags area. Since the goal
of skb_gro_* is to optimise the case of completely non-linear
packets, we can simply bail out if we have anything in the linear
area.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
When I initially implemented this protocol, I disregarded the use of netlink
attribute headers, thinking for my purposes they weren't needed. I've come to
find out that, as I'm starting to work with sending down messages with
associated data (like config messages), the kernel code spits out warnings about
trailing data in a netlink skb that doesn't have an associated header on it. As
such, I'm going to start including attribute headers in my netlink transaction,
and so for completeness, I should likely include them on messages bound from the
kernel to user space. This patch adds that header to the kernel, and bumps the
protocol version accordingly
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
aio_write gets const struct iovec * but tun_chr_aio_write casts this to struct
iovec * and modifies the iovec. As a result, attempts to use io_submit
to send packets to a tun device fail with weird errors such as EINVAL.
Since tun is the only user of skb_copy_datagram_from_iovec, we can
fix this simply by changing the later so that it does not
touch the iovec passed to it.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There's an skb_copy_datagram_iovec() to copy out of a paged skb,
but it modifies the iovec, and does not support starting
at an offset in the destination. We want both in tun.c, so let's
add the function.
It's a carbon copy of skb_copy_datagram_iovec() with enough changes to
be annoying.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This loop over fragments in napi_fraginfo_skb() was "interesting".
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Sidorenko reported:
"while experimenting with 'netem' we have found some strange behaviour. It
seemed that ingress delay as measured by 'ping' command shows up on some
hosts but not on others.
After some investigation I have found that the problem is that skbuff->tstamp
field value depends on whether there are any packet sniffers enabled. That
is:
- if any ptype_all handler is registered, the tstamp field is as expected
- if there are no ptype_all handlers, the tstamp field does not show the delay"
This patch prevents the unnecessary update of tstamp in dev_queue_xmit_nit()
on the ingress path (with act_mirred) by adding a check, so there is minimal
overhead on the fast path, and only when sniffers etc. are active.
Since netem at ingress seems to logically emulate a network before a host,
tstamp is zeroed to trigger the update and pretend delays are from the
outside.
Reported-by: Alex Sidorenko <alexandre.sidorenko@hp.com>
Tested-by: Alex Sidorenko <alexandre.sidorenko@hp.com>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It turns out that copying a 16-byte area at ~800k times a second
can be really expensive :) This patch redesigns the frags GRO
interface to avoid copying that area twice.
The two disciples of the frags interface have been converted.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Impact: clean up
Create a sub directory in include/trace called events to keep the
trace point headers in their own separate directory. Only headers that
declare trace points should be defined in this directory.
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Zhao Lei <zhaolei@cn.fujitsu.com>
Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Since everybody has been focusing on baremetal GRO performance
no one noticed when I added a bug that zapped gso_size for all
GRO packets. This only gets picked up when you forward the skb
out of an interface.
Thanks to Mark Wagner for noticing this bug when testing kvm.
Reported-by: Mark Wagner <mwagner@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch lowers the number of places a developer must modify to add
new tracepoints. The current method to add a new tracepoint
into an existing system is to write the trace point macro in the
trace header with one of the macros TRACE_EVENT, TRACE_FORMAT or
DECLARE_TRACE, then they must add the same named item into the C file
with the macro DEFINE_TRACE(name) and then add the trace point.
This change cuts out the need to add the DEFINE_TRACE(name).
Every file that uses the tracepoint must still include the trace/<type>.h
file, but the one C file must also add a define before the including
of that file.
#define CREATE_TRACE_POINTS
#include <trace/mytrace.h>
This will cause the trace/mytrace.h file to also produce the C code
necessary to implement the trace point.
Note, if more than one trace/<type>.h is used to create the C code
it is best to list them all together.
#define CREATE_TRACE_POINTS
#include <trace/foo.h>
#include <trace/bar.h>
#include <trace/fido.h>
Thanks to Mathieu Desnoyers and Christoph Hellwig for coming up with
the cleaner solution of the define above the includes over my first
design to have the C code include a "special" header.
This patch converts sched, irq and lockdep and skb to use this new
method.
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Zhao Lei <zhaolei@cn.fujitsu.com>
Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently netif_device_attach/detach are only stopping one queue. They
should be starting and stopping all the queues on a given device.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (54 commits)
glge: remove unused #include <version.h>
dnet: remove unused #include <version.h>
tcp: miscounts due to tcp_fragment pcount reset
tcp: add helper for counter tweaking due mid-wq change
hso: fix for the 'invalid frame length' messages
hso: fix for crash when unplugging the device
fsl_pq_mdio: Fix compile failure
fsl_pq_mdio: Revive UCC MDIO support
ucc_geth: Pass proper device to DMA routines, otherwise oops happens
i.MX31: Fixing cs89x0 network building to i.MX31ADS
tc35815: Fix build error if NAPI enabled
hso: add Vendor/Product ID's for new devices
ucc_geth: Remove unused header
gianfar: Remove unused header
kaweth: Fix locking to be SMP-safe
net: allow multiple dev per napi with GRO
r8169: reset IntrStatus after chip reset
ixgbe: Fix potential memory leak/driver panic issue while setting up Tx & Rx ring parameters
ixgbe: fix ethtool -A|a behavior
ixgbe: Patch to fix driver panic while freeing up tx & rx resources
...
GRO assumes that there is a one-to-one relationship between NAPI
structure and network device. Some devices like sky2 share multiple
devices on a single interrupt so only have one NAPI handler. Rather than
split GRO from NAPI, just have GRO assume that if the device changes,
it is a different flow.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for event-aware wakeups to the sockets code. Events are
delivered to the wakeup target, so that epoll can avoid spurious wakeups
for non-interesting events.
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Acked-by: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: William Lee Irwin III <wli@movementarian.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
wireless: remove duplicated .ndo_set_mac_address
netfilter: xtables: fix IPv6 dependency in the cluster match
tg3: Add GRO support.
niu: Add GRO support.
ucc_geth: Fix use-after-of_node_put() in ucc_geth_probe().
gianfar: Fix use-after-of_node_put() in gfar_of_init().
kernel: remove HIPQUAD()
netpoll: store local and remote ip in net-endian
netfilter: fix endian bug in conntrack printks
dmascc: fix incomplete conversion to network_device_ops
gso: Fix support for linear packets
skbuff.h: fix missing kernel-doc
ni5010: convert to net_device_ops
Setting ->owner as done currently (pde->owner = THIS_MODULE) is racy,
as correctly noted at bug #12454. Someone can look up an entry with a NULL
->owner, thus not pinning anything, and release it later, resulting
in a module refcount underflow.
We can keep ->owner and supply it at registration time like ->proc_fops
and ->data.
But this leaves ->owner as an easily manipulated field (just one C assignment),
and somebody will forget to unpin the previous/pin the current module when
switching ->owner. ->proc_fops is declared as "const", which should give
some pause.
->read_proc/->write_proc were just fixed to not require ->owner for
protection.
rmmod'ed directories will be empty and return "." and ".." -- no harm.
And directories with tricky enough readdir and lookup shouldn't be modular.
We definitely don't want such modular code.
Removing ->owner will also make PDE smaller.
So, let's nuke it.
Kudos to Jeff Layton for reminding about this, let's say, oversight.
http://bugzilla.kernel.org/show_bug.cgi?id=12454
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Allows for the removal of byteswapping in some places and
the removal of HIPQUAD (replaced by %pI4).
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When GRO/frag_list support was added to GSO, I made an error
which broke the support for segmenting linear GSO packets (GSO
packets are normally non-linear in the payload).
These days most of these packets are constructed by the tun
driver, which prefers to allocate linear memory if possible.
This is fixed in the latest kernel, but for 2.6.29 and earlier
it is still the norm.
Therefore this bug causes failures with GSO when used with tun
in 2.6.29.
Reported-by: James Huang <jamesclhuang@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
When I fixed the GRO crash in the legacy receive path I used
napi_complete to replace __napi_complete. Unfortunately they're
not the same when NETPOLL is enabled, which may result in us
not calling __napi_complete at all.
What's more, we really do need to keep the __napi_complete call
within the IRQ-off section since in theory an IRQ can occur in
between and fill up the backlog to the maximum, causing us to
lock up.
Since we can't seem to find a fix that works properly right now,
this patch reverts all the GRO support from the netif_rx path.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no gain using prefetch() in dev_hard_start_xmit(), since
we already had to read ops->ndo_select_queue pointer in dev_pick_tx(),
and both pointers are probably located in the same cache line.
This prefetch call slows down fast path because of a stall in address
computation.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some minor changes to queue hashing:
1. Use const on accessor functions
2. Export skb_tx_hash for use in drivers (see ixgbe)
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
struct sk_buff pointers should be freed with kfree_skb.
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On the legacy netif_rx path, I incorrectly tried to optimise
the napi_complete call by using __napi_complete before we reenable
IRQs. This simply doesn't work since we need to flush the held
GRO packets first.
This patch fixes it by doing the obvious thing of reenabling
IRQs first and then calling napi_complete.
Reported-by: Frank Blaschka <blaschka@linux.vnet.ibm.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
As my netpoll fix for net doesn't really work for net-next, we
need this update to move the checks into the right place. As it
stands we may pass freed skbs to netpoll_receive_skb.
This patch also introduces a netpoll_rx_on function to avoid GRO
completely if we're invoked through netpoll. This might seem
paranoid but as netpoll may have an external receive hook it's
better to be safe than sorry. I don't think we need this for
2.6.29 though since there's nothing immediately broken by it.
This patch also moves the GRO_* return values to netdevice.h since
VLAN needs them too (I tried to avoid this originally but alas
this seems to be the easiest way out). This fixes a bug in VLAN
where it continued to use the old return value 2 instead of the
correct GRO_DROP.
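A hedged sketch of the shape of the new helper (the exact field tests may differ from the patch): it answers "did this skb come in through netpoll?" so the GRO paths can avoid GRO entirely in that case.
	static inline int netpoll_rx_on(struct sk_buff *skb)
	{
		struct netpoll_info *npinfo = skb->dev->npinfo;

		return npinfo && (npinfo->rx_np || npinfo->rx_flags);
	}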
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add FC CRC offload check for ETH_P_FCOE.
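A hedged sketch of the idea, loosely modelled on the checksum-offload test in net/core/dev.c of that era (helper name and exact feature tests are illustrative, not necessarily the patch's own code): ETH_P_FCOE joins the protocols whose checksum/CRC the device may offload, keyed on NETIF_F_FCOE_CRC.
	static bool can_checksum_protocol(unsigned long features, __be16 protocol)
	{
		return (features & NETIF_F_GEN_CSUM) ||
		       (protocol == htons(ETH_P_IP) &&
			(features & NETIF_F_IP_CSUM)) ||
		       (protocol == htons(ETH_P_IPV6) &&
			(features & NETIF_F_IPV6_CSUM)) ||
		       (protocol == htons(ETH_P_FCOE) &&
			(features & NETIF_F_FCOE_CRC));
	}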
Signed-off-by: Yi Zou <yi.zou@intel.com>
Acked-by: David Miller <davem@davemloft.net>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Since dev_set_name() takes a printf-style format string, newer gcc complains
if the argument is not const.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As analyzed by Patrick McHardy, vlan needs to reset its
netdev_ops pointer in its ->init() function, but this
leaves the compat method pointers stale.
Add a netdev_resync_ops() and call it from the vlan code.
Any other driver which changes ->netdev_ops after register_netdevice()
will need to call this new function after doing so too.
With help from Patrick McHardy.
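A minimal usage sketch (the ops structure name is illustrative): any ->netdev_ops swap after register_netdevice() needs the resync so the compat method pointers are refreshed.
	dev->netdev_ops = &my_new_netdev_ops;	/* e.g. from a ->init() hook */
	netdev_resync_ops(dev);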
Tested-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently it is possible to do just about everything with the arp table
from user space except treat an entry like you are using it. To that end,
implement a flag NTF_USE that, when set in a netlink update request,
treats the neighbour table entry like the kernel does on the output path.
This allows user space applications to share the kernel's arp cache.
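A hedged userspace sketch of what such an update request carries (fragment only; building and sending the full RTM_NEWNEIGH message over a NETLINK_ROUTE socket is omitted). Field names come from <linux/neighbour.h>; ifindex and the NDA_DST payload are assumed to be filled in elsewhere.
	struct ndmsg ndm = {
		.ndm_family  = AF_INET,
		.ndm_ifindex = ifindex,		/* interface of the neighbour entry */
		.ndm_flags   = NTF_USE,		/* "treat this entry as used" */
	};
	/* append an NDA_DST attribute carrying the neighbour's IPv4 address */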
Signed-off-by: Eric Biederman <ebiederm@aristanetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It turns out that net_alive is unnecessary, and the original problem
that led to it being added was simply that the icmp code thought
it was a network device and wound up being unable to handle packets
while there were still packets in the network namespace.
Now that icmp and tcp have been fixed to properly register themselves,
this problem is no longer present and we have a stronger guarantee
that packets will not arrive in a network namespace than that provided
by net_alive in netif_receive_skb. So remove net_alive, allowing
packet reception to run a little faster.
Additionally document the strong reason why network namespace cleanup
is safe so that if something happens again someone else will have
a chance of figuring it out.
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The netpoll entry checks are required to ensure that we don't
receive normal packets when invoked via netpoll. Unfortunately
it only ever worked for the netif_receive_skb/netif_rx entry
points. The VLAN (and subsequently GRO) entry point didn't
have the check and therefore can trigger all sorts of weird
problems.
This patch adds the netpoll check to all entry points.
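For reference, the guard at the classic entry points looks roughly like this (as in netif_rx()); this change adds an equivalent check on the VLAN and GRO entry points.
	if (netpoll_rx(skb))
		return NET_RX_DROP;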
I'm still uneasy with receiving at all under netpoll (which
apparently is only used by the out-of-tree kdump code). The
reason is that it is perfectly legal to receive all data including
headers into highmem if netpoll is off, but if you try to do
that with netpoll on and someone gets a printk in an IRQ handler
you're going to get a nice BUG_ON.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Impact: Include header file.
Fix this sparse warning:
net/core/sysctl_net_core.c:123:32: warning: symbol 'net_core_path' was not declared. Should it be static?
Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove some pointless conditionals before kfree_skb().
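The pattern in question (illustrative): kfree_skb(NULL) is a no-op, so the guard is redundant.
-	if (skb)
-		kfree_skb(skb);
+	kfree_skb(skb);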
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove some pointless conditionals before kfree_skb().
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch changes the return value of nlmsg_notify() as follows:
If NETLINK_BROADCAST_ERROR is set by any of the listeners and
an error in the delivery happened, return the broadcast error;
else if there are no listeners apart from the socket that
requested a change with the echo flag, return the result of the
unicast notification. Thus, with this patch, the unicast
notification is handled in the same way as a broadcast listener
that has set the NETLINK_BROADCAST_ERROR socket flag.
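A condensed sketch of the resulting decision in nlmsg_notify() (simplified, not the verbatim patch): the broadcast error is kept when one was reported, otherwise the unicast (echo) result wins.
	err = 0;
	if (group)
		err = nlmsg_multicast(sk, skb, exclude_pid, group, flags);
	if (report) {
		err2 = nlmsg_unicast(sk, skb, pid);
		if (!err || err == -ESRCH)
			err = err2;
	}
	return err;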
This patch is useful when the caller of nlmsg_notify()
wants to know the result of the delivery of a netlink notification
(including the broadcast delivery) and take action if the delivery
failed. For example, ctnetlink can drop packets when event delivery
fails, providing reliable logging and state synchronization at the
cost of dropped packets.
This patch also modifies the rtnetlink code to ignore the return
value of rtnl_notify() in all callers. The function rtnl_notify()
(before this patch) returned the error of the unicast notification
which made rtnl_set_sk_err() report errors to all listeners. This
is of no help, since the origin of the change (the socket that
requested the echoing) notices the ENOBUFS error if the notification
fails and should resync itself.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The fix for CVE-2009-0676 (upstream commit df0bca04) is incomplete. Note
that the same problem of leaking kernel memory will reappear if, on some
architecture, struct timeval has internal padding (for example, a 64-bit
tv_sec and a 32-bit tv_usec); the padded bytes would then be leaked to
userspace.
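A hedged sketch of the defensive pattern (variable names illustrative): zero the whole structure, padding included, before copying it to userspace.
	struct timeval tv;

	memset(&tv, 0, sizeof(tv));		/* clears any padding bytes too */
	tv.tv_sec  = timeo / HZ;
	tv.tv_usec = ((timeo % HZ) * USEC_PER_SEC) / HZ;
	if (copy_to_user(optval, &tv, sizeof(tv)))
		return -EFAULT;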
Signed-off-by: Eugene Teo <eugeneteo@kernel.sg>
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net_alloc_generic was defined in #ifdef CONFIG_NET_NS, but used
unconditionally. Move net_alloc_generic out of #ifdef.
Signed-off-by: Clemens Noss <cnoss@gmx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes a double free when setting up a network namespace fails.
The previous code does a kfree of the net_generic structure when
one of the subsystem init functions fails.
The 'setup_net' function does kfree(ng) and returns an error.
The caller, 'copy_net_ns', calls net_free on error, and net_free in turn
calls kfree(net->gen), making this pointer freed twice.
This patch makes the code symmetric: net_alloc does the net_generic
allocation and net_free frees the net_generic.
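A simplified sketch of the symmetric ownership (error handling trimmed): the net_generic area is allocated in net_alloc() and released only in net_free().
	static struct net *net_alloc(void)
	{
		struct net *net = kmem_cache_zalloc(net_cachep, GFP_KERNEL);

		if (net)
			net->gen = net_alloc_generic();	/* allocated here ... */
		return net;
	}

	static void net_free(struct net *net)
	{
		kfree(net->gen);			/* ... and freed only here */
		kmem_cache_free(net_cachep, net);
	}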
Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current implementation of the TX software time stamping fallback is
faulty because it accesses the skb after ndo_start_xmit() returns
successfully. This patch removes the fallback, which fixes kernel panics
seen during stress tests. Hardware time stamping is not affected by this
removal.
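A hedged sketch of the problematic pattern this removal targets (names illustrative, not the exact hunk): once ndo_start_xmit() returns NETDEV_TX_OK the driver owns the skb and may already have freed it, so touching it afterwards is a use-after-free.
	rc = ops->ndo_start_xmit(skb, dev);
	if (rc == NETDEV_TX_OK)
		skb_tstamp_tx(skb, NULL);	/* BROKEN: skb may already be gone */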
Signed-off-by: Patrick Ohly <patrick.ohly@intel.com>
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>