NLMSG_HDRLEN is already aligned value. It's for directly reference
without extra alignment.
The redundant alignment here may confuse the API users.
Signed-off-by: Hong Zhiguo <honkiko@gmail.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Inspection of upper layer protocol is considered harmful, especially
if it is about ARP or other stateful upper layer protocol; driver
cannot (and should not) have full state of them.
IPv4 over Firewire module used to inspect ARP (both in sending path
and in receiving path), and record peer's GUID, max packet size, max
speed and fifo address. This patch removes such inspection by extending
our "hardware address" definition to include other information as well:
max packet size, max speed and fifo. By doing this, The neighbour
module in networking subsystem can cache them.
Note: As we have started ignoring sspd and max_rec in ARP/NDP, those
information will not be used in the driver when sending.
When a packet is being sent, the IP layer fills our pseudo header with
the extended "hardware address", including GUID and fifo. The driver
can look-up node-id (the real but rather volatile low-level address)
by GUID, and then the module can send the packet to the wire using
parameters provided in the extendedn hardware address.
This approach is realistic because IP over IEEE1394 (RFC2734) and IPv6
over IEEE1394 (RFC3146) share same "hardware address" format
in their address resolution protocols.
Here, extended "hardware address" is defined as follows:
union fwnet_hwaddr {
u8 u[16];
struct {
__be64 uniq_id; /* EUI-64 */
u8 max_rec; /* max packet size */
u8 sspd; /* max speed */
__be16 fifo_hi; /* hi 16bits of FIFO addr */
__be32 fifo_lo; /* lo 32bits of FIFO addr */
} __packed uc;
};
Note that Hardware address is declared as union, so that we can map full
IP address into this, when implementing MCAP (Multicast Cannel Allocation
Protocol) for IPv6, but IP and ARP subsystem do not need to know this
format in detail.
One difference between original ARP (RFC826) and 1394 ARP (RFC2734)
is that 1394 ARP Request/Reply do not contain the target hardware address
field (aka ar$tha). This difference is handled in the ARP subsystem.
CC: Stephan Gatzka <stephan.gatzka@gmail.com>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Following patch refactors GRE code into ip tunneling code and GRE
specific code. Common tunneling code is moved to ip_tunnel module.
ip_tunnel module is written as generic library which can be used
by different tunneling implementations.
ip_tunnel module contains following components:
- packet xmit and rcv generic code. xmit flow looks like
(gre_xmit/ipip_xmit)->ip_tunnel_xmit->ip_local_out.
- hash table of all devices.
- lookup for tunnel devices.
- control plane operations like device create, destroy, ioctl, netlink
operations code.
- registration for tunneling modules, like gre, ipip etc.
- define single pcpu_tstats dev->tstats.
- struct tnl_ptk_info added to pass parsed tunnel packet parameters.
ipip.h header is renamed to ip_tunnel.h
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip-header id needs to be incremented even if IP_DF flag is set.
This behaviour was changed in commit 490ab08127
(IP_GRE: Fix IP-Identification).
Following patch fixes it so that identification is always
incremented.
Reported-by: Cong Wang <amwang@redhat.com>
Acked-by: Cong Wang <amwang@redhat.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Pablo Neira Ayuso says:
====================
The following patchset contains Netfilter/IPVS updates for
your net-next tree, they are:
* Better performance in nfnetlink_queue by avoiding copy from the
packet to netlink message, from Eric Dumazet.
* Remove unnecessary locking in the exit path of ebt_ulog, from Gao Feng.
* Use new function ipv6_iface_scope_id in nf_ct_ipv6, from Hannes Frederic Sowa.
* A couple of sparse fixes for IPVS, from Julian Anastasov.
* Use xor hashing in nfnetlink_queue, as suggested by Eric Dumazet, from
myself.
* Allow to dump expectations per master conntrack via ctnetlink, from myself.
* A couple of cleanups to use PTR_RET in module init path, from Silviu-Mihai
Popescu.
* Remove nf_conntrack module a bit faster if netns are in use, from
Vladimir Davydov.
* Use checksum_partial in ip6t_NPT, from YOSHIFUJI Hideaki.
* Sparse fix for nf_conntrack, from Stephen Hemminger.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Hello!
After patch 1 got accepted to net-next I will also send a patch to
netfilter-devel to make the corresponding changes to the netfilter
reassembly logic.
Thanks,
Hannes
-- >8 --
[PATCH 2/2] ipv6: implement RFC3168 5.3 (ecn protection) for ipv6 fragmentation handling
This patch also ensures that INET_ECN_CE is propagated if one fragment
had the codepoint set.
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jesper Dangaard Brouer <jbrouer@redhat.com>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch just moves some code arround to make the ip4_frag_ecn_table
and IPFRAG_ECN_* constants accessible from the other reassembly engines. I
also renamed ip4_frag_ecn_table to ip_frag_ecn_table.
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jesper Dangaard Brouer <jbrouer@redhat.com>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a dev_addr_genid for IPv6. The goal is to use it, combined with
dev_base_seq to check if a change occurs during a netlink dump.
If a change is detected, the flag NLM_F_DUMP_INTR is set in the first message
after the dump was interrupted.
Note that only dump of unicast addresses is checked (multicast and anycast are
not checked).
Reported-by: Junwei Zhang <junwei.zhang@6wind.com>
Reported-by: Hongjun Li <hongjun.li@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull to get the thermal netlink multicast group name fix, otherwise
the assertion added in net-next to netlink to detect that kind of bug
makes systems unbootable for some folks.
Signed-off-by: David S. Miller <davem@davemloft.net>
With decnet converted, we can finally get rid of rta_buf and its
computations around it. It also gets rid of the minimal header
length verification since all message handlers do that explicitly
anyway.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
decnet is the only subsystem left that is relying on the global
netlink attribute buffer rta_buf. It's horrible design and we
want to get rid of it.
This converts all of decnet to do implicit attribute parsing. It
also gets rid of the error prone struct dn_kern_rta.
Yes, the fib_magic() stuff is not pretty.
It's compiled tested but I need someone with appropriate hardware
to test the patch since I don't have access to it.
Cc: linux-decnet-user@lists.sourceforge.net
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch converts the Marvell MV643XX ethernet driver to use the
Marvell Orion MDIO driver. As a result, PowerPC and ARM platforms
registering the Marvell MV643XX ethernet driver are also updated to
register a Marvell Orion MDIO driver. This driver voluntarily overlaps
with the Marvell Ethernet shared registers because it will use a subset
of this shared register (shared_base + 0x4 to shared_base + 0x84). The
Ethernet driver is also updated to look up for a PHY device using the
Orion MDIO bus driver.
For ARM and PowerPC we register a single instance of the "mvmdio" driver
in the system like it used to be done with the use of the "shared_smi"
platform_data cookie on ARM.
Note that it is safe to register the mvmdio driver only for the "ge00"
instance of the driver because this "ge00" interface is guaranteed to
always be explicitely registered by consumers of
arch/arm/plat-orion/common.c and other instances (ge01, ge10 and ge11)
were all pointing their shared_smi to ge00. For PowerPC the in-tree
Device Tree Source files mention only one MV643XX ethernet MAC instance
so the MDIO bus driver is registered only when id == 0.
Signed-off-by: Florian Fainelli <florian@openwrt.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
You can access it directly now, since 3.8: v3.7-rc1-13-g06ca287
'virtio: move queue_index and num_free fields into core struct
virtqueue.'
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If bpf_jit_enable > 1, then we dump the emitted JIT compiled image
after creation. Currently, only SPARC and PowerPC has similar output
as in the reference implementation on x86_64. Make a small helper
function in order to reduce duplicated code and make the dump output
uniform across architectures x86_64, SPARC, PPC, ARM (e.g. on ARM
flen, pass and proglen are currently not shown, but would be
interesting to know as well), also for future BPF JIT implementations
on other archs.
Cc: Mircea Gherzan <mgherzan@gmail.com>
Cc: Matt Evans <matt@ozlabs.org>
Cc: Eric Dumazet <eric.dumazet@google.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The netlink_diag can be built as a module, just like it's done in
unix sockets.
The core dumping message carries the basic info about netlink sockets:
family, type and protocol, portis, dst_group, dst_portid, state.
Groups can be received as an optional parameter NETLINK_DIAG_GROUPS.
Netlink sockets cab be filtered by protocols.
The socket inode number and cookie is reserved for future per-socket info
retrieving. The per-protocol filtering is also reserved for future by
requiring the sdiag_protocol to be zero.
The file /proc/net/netlink doesn't provide enough information for
dumping netlink sockets. It doesn't provide dst_group, dst_portid,
groups above 32.
v2: fix NETLINK_DIAG_MAX. Now it's equal to the last constant.
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Follow the common pattern and define *_DIAG_MAX like:
[...]
__XXX_DIAG_MAX,
};
Because everyone is used to do:
struct nlattr *attrs[XXX_DIAG_MAX+1];
nla_parse([...], XXX_DIAG_MAX, [...]
Reported-by: Thomas Graf <tgraf@suug.ch>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements F-RTO (foward RTO recovery):
When the first retransmission after timeout is acknowledged, F-RTO
sends new data instead of old data. If the next ACK acknowledges
some never-retransmitted data, then the timeout was spurious and the
congestion state is reverted. Otherwise if the next ACK selectively
acknowledges the new data, then the timeout was genuine and the
loss recovery continues. This idea applies to recurring timeouts
as well. While F-RTO sends different data during timeout recovery,
it does not (and should not) change the congestion control.
The implementaion follows the three steps of SACK enhanced algorithm
(section 3) in RFC5682. Step 1 is in tcp_enter_loss(). Step 2 and
3 are in tcp_process_loss(). The basic version is not supported
because SACK enhanced version also works for non-SACK connections.
The new implementation is functionally in parity with the old F-RTO
implementation except the one case where it increases undo events:
In addition to the RFC algorithm, a spurious timeout may be detected
without sending data in step 2, as long as the SACK confirms not
all the original data are dropped. When this happens, the sender
will undo the cwnd and perhaps enter fast recovery instead. This
additional check increases the F-RTO undo events by 5x compared
to the prior implementation on Google Web servers, since the sender
often does not have new data to send for HTTP.
Note F-RTO may detect spurious timeout before Eifel with timestamps
does so.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch series refactor the F-RTO feature (RFC4138/5682).
This is to simplify the loss recovery processing. Existing F-RTO
was developed during the experimental stage (RFC4138) and has
many experimental features. It takes a separate code path from
the traditional timeout processing by overloading CA_Disorder
instead of using CA_Loss state. This complicates CA_Disorder state
handling because it's also used for handling dubious ACKs and undos.
While the algorithm in the RFC does not change the congestion control,
the implementation intercepts congestion control in various places
(e.g., frto_cwnd in tcp_ack()).
The new code implements newer F-RTO RFC5682 using CA_Loss processing
path. F-RTO becomes a small extension in the timeout processing
and interfaces with congestion control and Eifel undo modules.
It lets congestion control (module) determines how many to send
independently. F-RTO only chooses what to send in order to detect
spurious retranmission. If timeout is found spurious it invokes
existing Eifel undo algorithms like DSACK or TCP timestamp based
detection.
The first patch removes all F-RTO code except the sysctl_tcp_frto is
left for the new implementation. Since CA_EVENT_FRTO is removed, TCP
westwood now computes ssthresh on regular timeout CA_EVENT_LOSS event.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Process connector can now also detect coredumping events.
Main aim of patch is get notified at start of coredumping, instead of
having to wait for it to finish and then being notified through EXIT
event.
Could be used for instance by process-managers that want to get
notified as soon as possible about process failures, and not
necessarily beeing notified after coredump, which could be in the
order of minutes depending on size of coredump, piping and so on.
Signed-off-by: Jesper Derehag <jderehag@hotmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is very useful to do dynamic truncation of packets. In particular,
we're interested to push the necessary header bytes to the user space and
cut off user payload that should probably not be transferred for some reasons
(e.g. privacy, speed, or others). With the ancillary extension PAY_OFFSET,
we can load it into the accumulator, and return it. E.g. in bpfc syntax ...
ld #poff ; { 0x20, 0, 0, 0xfffff034 },
ret a ; { 0x16, 0, 0, 0x00000000 },
... as a filter will accomplish this without having to do a big hackery in
a BPF filter itself. Follow-up JIT implementations are welcome.
Thanks to Eric Dumazet for suggesting and discussing this during the
Netfilter Workshop in Copenhagen.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__skb_get_poff() returns the offset to the payload as far as it could
be dissected. The main user is currently BPF, so that we can dynamically
truncate packets without needing to push actual payload to the user
space and instead can analyze headers only.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In skb_flow_dissect(), we perform a dissection of a skbuff. Since we're
doing the work here anyway, also store thoff for a later usage, e.g. in
the BPF filter.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Users of udp encapsulation currently have an encap_rcv callback which they can
use to hook into the udp receive path.
In situations where a encapsulation user allocates resources associated with a
udp encap socket, it may be convenient to be able to also hook the proto
.destroy operation. For example, if an encap user holds a reference to the
udp socket, the destroy hook might be used to relinquish this reference.
This patch adds a socket destroy hook into udp, which is set and enabled
in the same way as the existing encap_rcv hook.
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
The following patchset contains 7 Netfilter/IPVS fixes for 3.9-rc, they are:
* Restrict IPv6 stateless NPT targets to the mangle table. Many users are
complaining that this target does not work in the nat table, which is the
wrong table for it, from Florian Westphal.
* Fix possible use before initialization in the netns init path of several
conntrack protocol trackers (introduced recently while improving conntrack
netns support), from Gao Feng.
* Fix incorrect initialization of copy_range in nfnetlink_queue, spotted
by Eric Dumazet during the NFWS2013, patch from myself.
* Fix wrong calculation of next SCTP chunk in IPVS, from Julian Anastasov.
* Remove rcu_read_lock section in IPVS while calling ipv4_update_pmtu
not required anymore after change introduced in 3.7, again from Julian.
* Fix SYN looping in IPVS state sync if the backup is used a real server
in DR/TUN modes, this required a new /proc entry to disable the director
function when acting as backup, also from Julian.
* Remove leftover IP_NF_QUEUE Kconfig after ip_queue removal, noted by
Paul Bolle.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Changes:
v3->v2: rebase (no other changes)
passes selftest
v2->v1: read f->num_members only once
fix bug: test rollover mode + flag
Minimize packet drop in a fanout group. If one socket is full,
roll over packets to another from the group. Maintain flow
affinity during normal load using an rxhash fanout policy, while
dispersing unexpected traffic storms that hit a single cpu, such
as spoofed-source DoS flows. Rollover breaks affinity for flows
arriving at saturated sockets during those conditions.
The patch adds a fanout policy ROLLOVER that rotates between sockets,
filling each socket before moving to the next. It also adds a fanout
flag ROLLOVER. If passed along with any other fanout policy, the
primary policy is applied until the chosen socket is full. Then,
rollover selects another socket, to delay packet drop until the
entire system is saturated.
Probing sockets is not free. Selecting the last used socket, as
rollover does, is a greedy approach that maximizes chance of
success, at the cost of extreme load imbalance. In practice, with
sufficiently long queues to absorb bursts, sockets are drained in
parallel and load balance looks uniform in `top`.
To avoid contention, scales counters with number of sockets and
accesses them lockfree. Values are bounds checked to ensure
correctness.
Tested using an application with 9 threads pinned to CPUs, one socket
per thread and sufficient busywork per packet operation to limits each
thread to handling 32 Kpps. When sent 500 Kpps single UDP stream
packets, a FANOUT_CPU setup processes 32 Kpps in total without this
patch, 270 Kpps with the patch. Tested with read() and with a packet
ring (V1).
Also, passes psock_fanout.c unit test added to selftests.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Fix ARM BPF JIT handling of negative 'k' values, from Chen Gang.
2) Insufficient space reserved for bridge netlink values, fix from
Stephen Hemminger.
3) Some dst_neigh_lookup*() callers don't interpret error pointer
correctly, fix from Zhouyi Zhou.
4) Fix transport match in SCTP active_path loops, from Xugeng Zhang.
5) Fix qeth driver handling of multi-order SKB frags, from Frank
Blaschka.
6) fec driver is missing napi_disable() call, resulting in crashes on
unload, from Georg Hofmann.
7) Don't try to handle PMTU events on a listening socket, fix from Eric
Dumazet.
8) Fix timestamp location calculations in IP option processing, from
David Ward.
9) FIB_TABLE_HASHSZ setting is not controlled by the correct kconfig
tests, from Denis V Lunev.
10) Fix TX descriptor push handling in SFC driver, from Ben Hutchings.
11) Fix isdn/hisax and tulip/de4x5 kconfig dependencies, from Arnd
Bergmann.
12) bnx2x statistics don't handle 4GB rollover correctly, fix from
Maciej Żenczykowski.
13) Openvswitch bug fixes for vport del/new error reporting, missing
genlmsg_end() call in netlink processing, and mis-parsing of
LLC/SNAP ethernet types. From Rich Lane.
14) SKB pfmemalloc state should only be propagated from the head page of
a compound page, fix from Pavel Emelyanov.
15) Fix link handling in tg3 driver for 5715 chips when autonegotation
is disabled. From Nithin Sujir.
16) Fix inverted test of cpdma_check_free_tx_desc return value in
davinci_emac driver, from Mugunthan V N.
17) vlan_depth is incorrectly calculated in skb_network_protocol(), from
Li RongQing.
18) Fix probing of Gobi 1K devices in qmi_wwan driver, and fix NCM
device mode backwards compat in cdc_ncm driver. From Bjørn Mork.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (52 commits)
inet: limit length of fragment queue hash table bucket lists
qeth: Fix scatter-gather regression
qeth: Fix invalid router settings handling
qeth: delay feature trace
tcp: dont handle MTU reduction on LISTEN socket
bnx2x: fix occasional statistics off-by-4GB error
vhost/net: fix heads usage of ubuf_info
bridge: Add support for setting BR_ROOT_BLOCK flag.
bnx2x: add missing napi deletion in error path
drivers: net: ethernet: ti: davinci_emac: fix usage of cpdma_check_free_tx_desc()
ethernet/tulip: DE4x5 needs VIRT_TO_BUS
isdn: hisax: netjet requires VIRT_TO_BUS
net: cdc_ncm, cdc_mbim: allow user to prefer NCM for backwards compatibility
rtnetlink: Mask the rta_type when range checking
Revert "ip_gre: make ipgre_tunnel_xmit() not parse network header as IP unconditionally"
Fix dst_neigh_lookup/dst_neigh_lookup_skb return value handling bug
smsc75xx: configuration help incorrectly mentions smsc95xx
net: fec: fix missing napi_disable call
net: fec: restart the FEC when PHY speed changes
skb: Propagate pfmemalloc on skb from head page only
...
The patch introduces nf_conntrack_cleanup_net_list(), which cleanups
nf_conntrack for a list of netns and calls synchronize_net() only once
for them all. This should reduce netns destruction time.
I've measured cleanup time for 1k dummy net ns. Here are the results:
<without the patch>
# modprobe nf_conntrack
# time modprobe -r nf_conntrack
real 0m10.337s
user 0m0.000s
sys 0m0.376s
<with the patch>
# modprobe nf_conntrack
# time modprobe -r nf_conntrack
real 0m5.661s
user 0m0.000s
sys 0m0.216s
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch introduces a constant limit of the fragment queue hash
table bucket list lengths. Currently the limit 128 is choosen somewhat
arbitrary and just ensures that we can fill up the fragment cache with
empty packets up to the default ip_frag_high_thresh limits. It should
just protect from list iteration eating considerable amounts of cpu.
If we reach the maximum length in one hash bucket a warning is printed.
This is implemented on the caller side of inet_frag_find to distinguish
between the different users of inet_fragment.c.
I dropped the out of memory warning in the ipv4 fragment lookup path,
because we already get a warning by the slab allocator.
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jesper Dangaard Brouer <jbrouer@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dmitry Akindinov is reporting for a problem where SYNs are looping
between the master and backup server when the backup server is used as
real server in DR mode and has IPVS rules to function as director.
Even when the backup function is enabled we continue to forward
traffic and schedule new connections when the current master is using
the backup server as real server. While this is not a problem for NAT,
for DR and TUN method the backup server can not determine if a request
comes from client or from director.
To avoid such loops add new sysctl flag backup_only. It can be needed
for DR/TUN setups that do not need backup and director function at the
same time. When the backup function is enabled we stop any forwarding
and pass the traffic to the local stack (real server mode). The flag
disables the director function when the backup function is enabled.
For setups that enable backup function for some virtual services and
director function for other virtual services there should be another
more complex solution to support DR/TUN mode, may be to assign
per-virtual service syncid value, so that we can differentiate the
requests.
Reported-by: Dmitry Akindinov <dimak@stalker.com>
Tested-by: German Myzovsky <lawyer@sipnet.ru>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
This fixes a couple of problems. Firstly, some people are actually still
using old small-page flash and we broke it by removing the ready check.
Secondly. fix the handling of partitions on Broadcom 47xx devices.
Recent changes had made it misdetect the location of the NVRAM and
scribble over the bootloader when it tried to update the variables there.
With predictably sad results.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.13 (GNU/Linux)
iEYEABECAAYFAlFG/koACgkQdwG7hYl686OShQCgyOXCXhTltBJU3wHDtTeT1rc9
Vx0AoL655JpZaUf+BpPsCSTagH4k28vQ
=OK6b
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20130318' of git://git.infradead.org/linux-mtd
Pull MTD fixes from David Woodhouse:
"This fixes a couple of problems. Firstly, some people are actually
still using old small-page flash and we broke it by removing the ready
check.
Secondly. fix the handling of partitions on Broadcom 47xx devices.
Recent changes had made it misdetect the location of the NVRAM and
scribble over the bootloader when it tried to update the variables
there. With predictably sad results."
* tag 'for-linus-20130318' of git://git.infradead.org/linux-mtd:
mtd: nand: reintroduce NAND_NO_READRDY as NAND_NEED_READRDY
mtd: bcm47xxpart: look for NVRAM at the end of device
Revert "mtd: bcm47xxpart: improve probing of nvram partition"
Things are calming down for arm-soc as well. This set of bug fixes is
dominated in size by the at91 platform bug fixes. Some of them were
meant to go through the framebuffer tree during the merge window, but
since the framebuffer maintainer could not be reached, I offered to
take them here. The other notable at91 change is the addition of pinctrl
definitions to fix the NAND controller.
The rest are mostly simple regression fixes:
* Our removal of VIRT_TO_BUS conflicted with Stephen Rothwell's
renaming of the Kconfig symbol. You will get a trivial merge conflict
here, we still want to remove it.
* missing bits for clocks on imx and s5pv210
* missing header inclusions in mmp and shmobile
* typos in s5pv210 camera and vt8500 clock support code
and three trivial fixes for pre-3.8 bugs:
* an old bogus build warning in the joystick driver
* a misleading Kconfig description
* a NULL pointer check on davinci
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIVAwUAUUchPmCrR//JCVInAQLxHhAA4bbv0+aS3vhEV8sMomBQ7XpjlI2wJ5wy
cd2jA04Gb54bQlRkZNuflIHH5xYq9bslR98Y3iEMqPHrxheDV5qgfZ9wO1E5b8wd
bl/Fj1bj7D7AeQpvhAYHZufQnV4xGSpW7j/6hkEWCDDgla82BaEwQq3aVCqFsZu5
u41xlWCFYbwS+sEcdALnGmFdEBtNHzsfwkY7AClcunARWcFTyIAm5J2VhO/1Z3eY
sA31DBizTsxhkfgOEXTDvyH1N3YwcGlm3Mb7J0ZfdU5d5QQlthmU1ims2fVPoo3t
x1rJNb5HARsJuuuFIgoRa/Vbcytqxj2+MhJGy2cLhsmAxr8L61cb618oniZxxDoW
y4DMurF790q3uSkJOrhtcAmGBmHNBdTHcvV4U05EYIQl64k/oY+L7IB18ACAHVqO
LwimbZ+KF1kxv/hVosGbu7l0EKDt7MS4ykc5QJAtiYu7RDikoRmH05742feWfQ+2
Fy6V1GqIyUCea1cWDjomeTx+lERknSWPweesrcyiRhIs2BsqrtDRDngse/S59Lf9
mUFiLh+tZqZxTh8HqZbnHbuJoqNvfVyZVYWrvifkH0Ji8VZqeLuzxx/8fBvnCDWz
tXZOkl4m2U4lVYzkYOLN9VAurEHSYcHOw51IIgQp4IfS3U32sA1a4/fF/ATq0ugP
tdJBtr7mpzA=
=oLKI
-----END PGP SIGNATURE-----
Merge tag 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC bug fixes from Arnd Bergmann:
"Things are calming down for arm-soc as well. This set of bug fixes is
dominated in size by the at91 platform bug fixes. Some of them were
meant to go through the framebuffer tree during the merge window, but
since the framebuffer maintainer could not be reached, I offered to
take them here. The other notable at91 change is the addition of
pinctrl definitions to fix the NAND controller.
The rest are mostly simple regression fixes:
- Our removal of VIRT_TO_BUS conflicted with Stephen Rothwell's
renaming of the Kconfig symbol. You will get a trivial merge
conflict here, we still want to remove it.
- missing bits for clocks on imx and s5pv210
- missing header inclusions in mmp and shmobile
- typos in s5pv210 camera and vt8500 clock support code
and three trivial fixes for pre-3.8 bugs:
- an old bogus build warning in the joystick driver
- a misleading Kconfig description
- a NULL pointer check on davinci"
* tag 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
ARM: fix CONFIG_VIRT_TO_BUS handling
ARM: i.MX35: enable MAX clock
ARM: Scorpion is a v7 architecture, not v6
ARM: mmp: add platform_device head file in gplugd
input/joystick: use get_cycles on ARM
[media] s5p-fimc: fix s5pv210 build
clk: vt8500: Fix "fix device clock divisor calculations"
ARM: i.MX25: Fix DT compilation
ARM: at91: fix infinite loop in at91_irq_suspend/resume
ARM: at91: add gpio suspend/resume support when using pinctrl
ARM: at91: fix LCD-wiring mode
atmel_lcdfb: fix 16-bpp modes on older SOCs
ARM: at91: dt: at91sam9x5: complete NAND pinctrl
ARM: at91: dt: at91sam9x5: correct NAND pins comments
ARM: davinci: edma: fix dmaengine induced null pointer dereference on da830
ARM: shmobile: marzen: Include mmc/host.h
ARM: EXYNOS: Add #dma-cells for generic dma binding support for PL330
ARM: S5PV210: Fix PL330 DMA controller clkdev entries
Commit 1d9d8639c0 ("perf,x86: fix kernel crash with PEBS/BTS after
suspend/resume") introduces a link failure since
perf_restore_debug_store() is only defined for CONFIG_CPU_SUP_INTEL:
arch/x86/power/built-in.o: In function `restore_processor_state':
(.text+0x45c): undefined reference to `perf_restore_debug_store'
Fix it by defining the dummy function appropriately.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
TCPCT uses option-number 253, reserved for experimental use and should
not be used in production environments.
Further, TCPCT does not fully implement RFC 6013.
As a nice side-effect, removing TCPCT increases TCP's performance for
very short flows:
Doing an apache-benchmark with -c 100 -n 100000, sending HTTP-requests
for files of 1KB size.
before this patch:
average (among 7 runs) of 20845.5 Requests/Second
after:
average (among 7 runs) of 21403.6 Requests/Second
Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Pravin B Shelar <pshelar@nicira.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
net/openvswitch/vport-internal_dev.c
Jesse Gross says:
====================
A couple of minor enhancements for net-next/3.10. The largest is an
extension to allow variable length metadata to be passed to userspace
with packets.
There is a merge conflict in net/openvswitch/vport-internal_dev.c:
A existing commit modifies internal_dev_mac_addr() and a new commit
deletes it. The new one is correct, so you can just remove that function.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch generalizes VXLAN forwarding table entries allowing an administrator
to:
1) specify multiple destinations for a given MAC
2) specify alternate vni's in the VXLAN header
3) specify alternate destination UDP ports
4) use multicast MAC addresses as fdb lookup keys
5) specify multicast destinations
6) specify the outgoing interface for forwarded packets
The combination allows configuration of more complex topologies using VXLAN
encapsulation.
Changes since v1: rebase to 3.9.0-rc2
Signed-Off-By: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
caif_shm is an old implementation
caif_shm will be replaced by caif_virtio
[ As explained by Linus Walleij: "U5500 used this, but was cancelled
and the silicon did not reach anyone outside ST-Ericsson. Then for
the next platforms, we have gone for the leaner & cleaner approach
of using virtio, rpmesg and rproc." ]
Signed-off-by: Erwan Yvin <erwan.yvin@stericsson.com>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Acked-by: Sjur Brendeland <sjur.brandeland@stericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit bd329e1 ("net: cdc_ncm: do not bind to NCM compatible MBIM devices")
introduced a new policy, preferring MBIM for dual NCM/MBIM functions if
the cdc_mbim driver was enabled. This caused a regression for users
wanting to use NCM.
Devices implementing NCM backwards compatibility according to section
3.2 of the MBIM v1.0 specification allow either NCM or MBIM on a single
USB function, using different altsettings. The cdc_ncm and cdc_mbim
drivers will both probe such functions, and must agree on a common
policy for selecting either MBIM or NCM. Until now, this policy has
been set at build time based on CONFIG_USB_NET_CDC_MBIM.
Use a module parameter to set the system policy at runtime, allowing the
user to prefer NCM on systems with the cdc_mbim driver.
Cc: Greg Suarez <gsuarez@smithmicro.com>
Cc: Alexey Orishko <alexey.orishko@stericsson.com>
Reported-by: Geir Haatveit <nospam@haatveit.nu>
Reported-by: Tommi Kyntola <kynde@ts.ray.fi>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=54791
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
* The GPIO descriptor work has exposed how broken the non-GPIOLIB
bits for OpenRISC were. We now require GPIOLIB as this is the
preferred way forward.
* The system.h split introduced a bug in llist.h for arches using
asm-generic/cmpxchg.h directly, which is currently only OpenRISC.
The patch here moves two defines from asm-generic/atomic.h to
asm-generic/cmpxchg.h to make things work as they should.
* The VIRT_TO_BUS selector was added for OpenRISC, but OpenRISC does
not have the virt_to_bus methods, so there's a patch to remove it
again.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iEYEABECAAYFAlFCu28ACgkQ70gcjN2673OzpwCfdv7JYVybaCXWHhmcvGIpyrjU
ed0AnAuz0rGFqyN7vSAxmHtusUnJVEsT
=Lddj
-----END PGP SIGNATURE-----
Merge tag 'for-3.9-rc3' of git://openrisc.net/jonas/linux
Pull OpenRISC bug fixes from Jonas Bonn:
- The GPIO descriptor work has exposed how broken the non-GPIOLIB bits
for OpenRISC were. We now require GPIOLIB as this is the preferred
way forward.
- The system.h split introduced a bug in llist.h for arches using
asm-generic/cmpxchg.h directly, which is currently only OpenRISC.
The patch here moves two defines from asm-generic/atomic.h to
asm-generic/cmpxchg.h to make things work as they should.
- The VIRT_TO_BUS selector was added for OpenRISC, but OpenRISC does
not have the virt_to_bus methods, so there's a patch to remove it
again.
* tag 'for-3.9-rc3' of git://openrisc.net/jonas/linux:
openrisc: remove HAVE_VIRT_TO_BUS
asm-generic: move cmpxchg*_local defs to cmpxchg.h
openrisc: require gpiolib
With this one we have:
- An ab8500 build failure fix.
- An ab8500 device tree parsing fix.
- A fix for twl4030_madc remove routine to work properly (when built-in).
- A fix for properly registering palmas interrupt handler.
- A fix for omap-usb init routine to actually write into the hostconfig
register.
- A couple of warning fixes for ab8500-gpadc and tps65912.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJRQs2IAAoJEIqAPN1PVmxKN6sQAISiEmSjOiLRZ3b13qeKGiMX
3F3W5qXGYJQzhiVU/unPK9/2rTdfJ0Mqnuj88bLY+336pr89vi8g0k2VBWPfHlcv
jv2mJKWuNUM15D0d6uZ3t27jrXXBzNEvMjKKen2kYfm1+JcR+1N8pbJUZtOlgREJ
2Z23or/XP+ZCSbXcvzH1KvRFrcaBDedqgT4m0nat2dTfZ77MpKZEA58sutNBOUMa
YfoE2ncJBq5Ku4LBo6wGhsVptzilmpH3YBnSEgTh7tccyLrt4SnodNdT3vxmU4+1
C0r/oX9Idcij4/VAW/2bcEq1E12YUeKnPeVDAARXc+X/1Oyw+Mc6Cqd/rnvFomfe
9uTGMOmH1me1mSUgrzFNM1nnKXaZkWti56rLEe8tLsN7tkc+oWSJPIlWwjCmeWkC
R5SPgR5Jf6QeA4mW9qMxnBM3y8YlaV4awW6wwVYXGUDICZMG04qURSLo9NX3kDAA
AHsbE+IiSxdDqmZr8aejhaGd7PIVjfPhnM25TNlUZ5P5EWlApl4ha7QZ/BK2fx6k
DqaRujfB2uGcQAwqTprVpDI6lHUEneUKaa3zDpoxkP4vPVXDsRIiBo7HRfY8QXjg
vMWrYfLvAYBXHx69I5q2DghyFmofu/H0SXDdGlo33cQ5SfzeNGS7h0pPGrkNJyvN
7WJbiOdAYeQXVlBCy7sB
=XFgy
-----END PGP SIGNATURE-----
Merge tag 'mfd-fixes-3.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-fixes
Pull MFD fixes from Samuel Ortiz:
"This is the first batch of MFD fixes for 3.9.
With this one we have:
- An ab8500 build failure fix.
- An ab8500 device tree parsing fix.
- A fix for twl4030_madc remove routine to work properly (when
built-in).
- A fix for properly registering palmas interrupt handler.
- A fix for omap-usb init routine to actually write into the
hostconfig register.
- A couple of warning fixes for ab8500-gpadc and tps65912"
* tag 'mfd-fixes-3.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-fixes:
mfd: twl4030-madc: Remove __exit_p annotation
mfd: ab8500: Kill "reg" property from binding
mfd: ab8500-gpadc: Complain if we fail to enable vtvout LDO
mfd: wm831x: Don't forward declare enum wm831x_auxadc
mfd: twl4030-audio: Fix argument type for twl4030_audio_disable_resource()
mfd: tps65912: Declare and use tps65912_irq_exit()
mfd: palmas: Provide irq flags through DT/platform data
mfd: Make AB8500_CORE select POWER_SUPPLY to fix build error
mfd: omap-usb-host: Actually update hostconfig
This patch fixes a kernel crash when using precise sampling (PEBS)
after a suspend/resume. Turns out the CPU notifier code is not invoked
on CPU0 (BP). Therefore, the DS_AREA (used by PEBS) is not restored properly
by the kernel and keeps it power-on/resume value of 0 causing any PEBS
measurement to crash when running on CPU0.
The workaround is to add a hook in the actual resume code to restore
the DS Area MSR value. It is invoked for all CPUS. So for all but CPU0,
the DS_AREA will be restored twice but this is harmless.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When neighbour table is full, dst_neigh_lookup/dst_neigh_lookup_skb will return
-ENOBUFS which is absolutely non zero, while all the code in kernel which use
above functions assume failure only on zero return which will cause panic. (for
example: : https://bugzilla.kernel.org/show_bug.cgi?id=54731).
This patch corrects above error with smallest changes to kernel source code and
also correct two return value check missing bugs in drivers/infiniband/hw/cxgb4/cm.c
Tested on my x86_64 SMP machine
Reported-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Tested-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
COMPAT_NET_DEV_OPS was removed a while back and with it the definition of
netdev_resync_ops() went away. Let's finish the clean-up.
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows LRO aggregation on bonded devices that contain an
NX3031 device. It also adds a for_each_netdev_in_bond_rcu(bond, slave)
macro which executes for each slave that has bond as master.
V3: After testing and discussing this with Rajesh, I decided to keep the
vlan ip cache and just rename it to ip_cache since it will store bond
ip addresses too. A new master flag has been added to the ip cache to
denote that the address has been added because of a master device.
I've taken care of the enslave/release cases by checking for various
combinations of events and flags (e.g. netxen has a master, it's a
bond master and it's not marked as a slave means it is being enslaved
and is dev_open()ed in bond_enslave).
I've changed netxen_free_ip_list() to have a "master" parameter which
causes all IP addresses marked as master to be deleted (used when a
netxen is being released). I've made the patch use the new upper
device API as well. The following cases were tested:
- bond -> netxen
- vlan -> netxen
- vlan -> bond -> netxen
V2: Remove local ip caching, retrieve addresses dynamically and
restore them if necessary.
Note: Tested with NX3031 adapter.
Tested-by: Rajesh Borundia <rajesh.borundia@qlogic.com>
Signed-off-by: Andy Gospodarek <agospoda@redhat.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull fix for hlist_entry_safe() regression from Paul McKenney:
"This contains a single commit that fixes a regression in
hlist_entry_safe(). This macro references its argument twice, which
can cause NULL-pointer errors. This commit applies a gcc statement
expression, creating a temporary variable to avoid the double
reference. This has been posted to LKML at
https://lkml.org/lkml/2013/3/9/75.
Kudos to CAI Qian, whose testing uncovered this, to Eric Dumazet, who
spotted root cause, and to Li Zefan, who tested this commit."
* 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
list: Fix double fetch of pointer in hlist_entry_safe()