DSA drivers may drive different families of switches which need
different tag protocol. Rather than hard code the tag protocol in the
driver structure, have a callback for the DSA core to call.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If no RARP, BOOTP, or DHCP response is received, ic_dev is never set,
causing a NULL pointer dereference in ic_close_devs():
Sending DHCP requests ...... timed out!
Unable to handle kernel NULL pointer dereference at virtual address 00000004
To fix this, add a check to avoid dereferencing ic_dev if it is still
NULL.
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Fixes: 2647cffb2b ("net: ipconfig: Support using "delayed" DHCP replies")
Signed-off-by: David S. Miller <davem@davemloft.net>
Encoding of the metadata was using the padded length as opposed to
the real length of the data which is a bug per specification.
This has not been an issue todate because all metadatum specified
so far has been 32 bit where aligned and data length are the same width.
This also includes a bug fix for validating the length of a u16 field.
But since there is no metadata of size u16 yes we are fine to include it
here.
While at it get rid of magic numbers.
Fixes: ef6980b6be ("net sched: introduce IFE action")
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- place kref_get near usage of referenced objects, separate patches
for various used objects to improve readability and maintainability
by Sven Eckelmann (18 patches)
- Keep batadv net device when all hard interfaces disappear, to
improve situations where tools currently use work arounds, by
Sven Eckelmann
- Add an option to disable debugfs support to minimize footprint when
userspace uses netlink only, by Sven Eckelmann
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdBQJXuwOiFhxzd0BzaW1vbnd1bmRlcmxpY2guZGUACgkQoSvjmEKS
nqFK2A/9HOwk9n9MEDBcGu3S2524LPO4yBIBlp9eo59gMuRezTPIKuqFT/bFhEfI
/wwQDK8/lm4AaWGyo7JGCe5Q/4LJourue+QyT/IG0leoIeO4QxopvmskQnhKr8GT
mB1vwVzMChBwCTjcr4qTxVNub5Zf2FrvpI8rEQsIYTATstyATQNszIllw19X0Eum
2Tc29QNoLJ1G1TgH57VCchf0QUEBXPk/yj/pll7YOFnv2LTN+hi6GHaw+A3DHXJL
svT8m1eS70UhgtgXzEE6W6XSqHDXP96nyTd4DGd+YkpiXL480i2SanFY5e79ECIH
5W4blUa5QKhO7vhxFG01cf5TmiwNL9OpI7i4fj/UB+qlwLmowtnoQu07Wojtd1aL
nR61zCjW+hvTPIvW3sNxskabZM8UHtMpiyoPIj/ZhnMchNfOCDVhaDgF+eiNpzL+
UlPrN1SOIftQ1IBjbX4KjQVVpHP8F3CWrfg+QlAzzkVtJ5Z7WMGedKBz1bobdlvJ
N0Q6ytxG13eDaQHmjTDyAqFHfhCESNFMhjDaC7qOTovvkLUfA4PS0cqUzus0OJuD
q78DKCxtkRhkxC7QKeUcI/WwONOmKcKPkD6pscP4RXBEhanhKPHAfDsdvMtxDG++
jGJLFbQHA9x09J5nrvUb35W2UhX/MLo3emTcGcBayP1dV8jFTJU=
=/eyc
-----END PGP SIGNATURE-----
Merge tag 'batadv-next-for-davem-20160822' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
This feature patchset includes the following changes:
- place kref_get near usage of referenced objects, separate patches
for various used objects to improve readability and maintainability
by Sven Eckelmann (18 patches)
- Keep batadv net device when all hard interfaces disappear, to
improve situations where tools currently use work arounds, by
Sven Eckelmann
- Add an option to disable debugfs support to minimize footprint when
userspace uses netlink only, by Sven Eckelmann
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
In b8247f095e,
"net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for local udp tunneled skbs"
gso skbs arriving from an ingress interface that go through UDP
tunneling, are allowed to be fragmented if the resulting encapulated
segments exceed the dst mtu of the egress interface.
This aligned the behavior of gso skbs to non-gso skbs going through udp
encapsulation path.
However the non-gso vs gso anomaly is present also in the following
cases of a GRE tunnel:
- ip_gre in collect_md mode, where TUNNEL_DONT_FRAGMENT is not set
(e.g. OvS vport-gre with df_default=false)
- ip_gre in nopmtudisc mode, where IFLA_GRE_IGNORE_DF is set
In both of the above cases, the non-gso skbs get fragmented, whereas the
gso skbs (having skb_gso_network_seglen that exceeds dst mtu) get dropped,
as they don't go through the segment+fragment code path.
Fix: Setting IPSKB_FRAG_SEGS if the tunnel specified IP_DF bit is NOT set.
Tunnels that do set IP_DF, will not go to fragmentation of segments.
This preserves behavior of ip_gre in (the default) pmtudisc mode.
Fixes: b8247f095e ("net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for local udp tunneled skbs")
Reported-by: wenxu <wenxu@ucloud.cn>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Tested-by: wenxu <wenxu@ucloud.cn>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit 22dc13c837 ("net_sched: convert tcf_exts from list to pointer array")
we do dynamic allocation in tcf_exts_init(), therefore we need
to handle the ENOMEM case properly.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If DAD fails with accept_dad set to 2, global addresses and host routes
are incorrectly left in place. Even though disable_ipv6 is set,
contrary to documentation, the addresses are not dynamically deleted
from the interface. It is only on a subsequent link down/up that these
are removed. The fix is not only to set the disable_ipv6 flag, but
also to call addrconf_ifdown(), which is the action to carry out when
disabling IPv6. This results in the addresses and routes being deleted
immediately. The DAD failure for the LL addr is determined as before
via netlink, or by the absence of the LL addr (which also previously
would have had to be checked for in case of an intervening link down
and up). As the call to addrconf_ifdown() requires an rtnl lock, the
logic to disable IPv6 when DAD fails is moved to addrconf_dad_work().
Previous behavior:
root@vm1:/# sysctl net.ipv6.conf.eth3.accept_dad=2
net.ipv6.conf.eth3.accept_dad = 2
root@vm1:/# ip -6 addr add 2000::10/64 dev eth3
root@vm1:/# ip link set up eth3
root@vm1:/# ip -6 addr show dev eth3
5: eth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qlen 1000
inet6 2000::10/64 scope global
valid_lft forever preferred_lft forever
inet6 fe80::5054:ff:fe43:dd5a/64 scope link tentative dadfailed
valid_lft forever preferred_lft forever
root@vm1:/# ip -6 route show dev eth3
2000::/64 proto kernel metric 256
fe80::/64 proto kernel metric 256
root@vm1:/# ip link set down eth3
root@vm1:/# ip link set up eth3
root@vm1:/# ip -6 addr show dev eth3
root@vm1:/# ip -6 route show dev eth3
root@vm1:/#
New behavior:
root@vm1:/# sysctl net.ipv6.conf.eth3.accept_dad=2
net.ipv6.conf.eth3.accept_dad = 2
root@vm1:/# ip -6 addr add 2000::10/64 dev eth3
root@vm1:/# ip link set up eth3
root@vm1:/# ip -6 addr show dev eth3
root@vm1:/# ip -6 route show dev eth3
root@vm1:/#
Signed-off-by: Mike Manning <mmanning@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes the following sparse warning:
net/netfilter/nft_hash.c:40:25: warning:
symbol 'nft_hash_policy' was not declared. Should it be static?
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
trivial fix to spelling mistake in pr_debug message
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds the numgen expression that allows us to generated
incremental and random numbers, this generator is bound to a upper limit
that is specified by userspace.
This expression is useful to distribute packets in a round-robin fashion
as well as randomly.
Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds the quota expression. This new stateful expression
integrate easily into the dynset expression to build 'hashquota' flow
tables.
Arguably, we could use instead "counter bytes > 1000" instead, but this
approach has several problems:
1) We only support for one single stateful expression in dynamic set
definitions, and the expression above is a composite of two
expressions: get counter + comparison.
2) We would need to restore the packed counter representation (that we
used to have) based on seqlock to synchronize this, since per-cpu is
not suitable for this.
So instead of bloating the counter expression back with the seqlock
representation and extending the existing set infrastructure to make it
more complex for the composite described above, let's follow the more
simple approach of adding a quota expression that we can plug into our
existing infrastructure.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Subash reported that commit 42a7b32b73 ("xfrm: Add oif to dst lookups")
broke a wifi use case that uses fib rules and xfrms. The intent of
42a7b32b73 was driven by VRFs with IPsec. As a compromise relax the
use of oif in xfrm lookups to L3 master devices only (ie., oif is either
an L3 master device or is enslaved to a master device).
Fixes: 42a7b32b73 ("xfrm: Add oif to dst lookups")
Reported-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Use PPP_ALLSTATIONS, PPP_UI, and SEND_SHUTDOWN instead of 0xff,
0x03, and 2 separately.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We've already set sk to sock->sk and dereferenced it, so if it's NULL
we would have crashed already. Moreover, if it was NULL we would have
crashed anyway when jumping to 'out' and trying to unlock the sock.
Furthermore, if we had assigned a different value to 'sk' we would
have been calling lock_sock() and release_sock() on different sockets.
My conclusion is that these two lines are complete nonsense and only
serve to confuse the reader.
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The sk->sk_state is bits flag, so need use bit operation check
instead of value check.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Tested-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Broadcom Starfighter 2 switch driver should be a proper platform
driver, now that the DSA code has been updated to allow that, register a
switch device, feed it with the proper configuration data coming from
Device Tree and register our switch device with DSA.
The bulk of the changes consist in moving what bcm_sf2_sw_setup() did
into the platform driver probe function.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for allowing switch drivers to implement system-wide
suspend/resume functions, export dsa_switch_suspend and
dsa_switch_resume() such that these are callable from the appropriate
driver specific suspend/resume functions.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes following sparse errors :
net/ipv4/fib_semantics.c:1579:61: warning: incorrect type in argument 2
(different base types)
net/ipv4/fib_semantics.c:1579:61: expected unsigned int [unsigned]
[usertype] key
net/ipv4/fib_semantics.c:1579:61: got restricted __be32 const
[usertype] nh_gw
Fixes: a6db4494d2 ("net: ipv4: Consider failed nexthops in multipath routes")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Include ipv4_rcv_saddr_equal() definition to avoid this sparse error :
net/ipv4/udp.c:362:5: warning: symbol 'ipv4_rcv_saddr_equal' was not
declared. Should it be static?
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit 19689e38ec ("tcp: md5: use kmalloc() backed scratch
areas") this function is no longer used.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch converts the diag dumping code to use the rhashtable
walk code instead of going through rhashtable by hand. The lock
nl_table_lock is now only taken while we process the multicast
list as it's not needed for the rhashtable walk.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
1) Fix one typo: s/tn/tp/
2) Fix the description about the "u" bits.
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As recently discussed during the task_under_cgroup_hierarchy() addition,
we should get rid of the ifdefs surrounding the bpf_skb_under_cgroup()
helper. If related functionality is not built-in, the helper cannot be
used anyway, which is also in line with what we do for all other helpers.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Follow-up to 555c8a8623 ("bpf: avoid stack copy and use skb ctx for
event output") for also adding the event output helper for XDP typed
programs. The event output helper has been very useful in particular for
debugging or event notification purposes, since it's much faster and
flexible than regular trace printk due to programmatically being able to
attach meta data. Same flags structure applies as with tc BPF programs.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This work adds a bpf_skb_change_tail() helper for tc BPF programs. The
basic idea is to expand or shrink the skb in a controlled manner. The
eBPF program can then rewrite the rest via helpers like bpf_skb_store_bytes(),
bpf_lX_csum_replace() and others rather than passing a raw buffer for
writing here.
bpf_skb_change_tail() is really a slow path helper and intended for
replies with f.e. ICMP control messages. Concept is similar to other
helpers like bpf_skb_change_proto() helper to keep the helper without
protocol specifics and let the BPF program mangle the remaining parts.
A flags field has been added and is reserved for now should we extend
the helper in future.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since we have a skb_pkt_type_ok() helper for checking the type before
mangling, make use of it instead of open coding. Follow-up to commit
8b10cab64c ("net: simplify and make pkt_type_ok() available for other
users") that came in after d2485c4242 ("bpf: add bpf_skb_change_type
helper").
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add TIPC_NL_PEER_REMOVE netlink command. This command can remove
an offline peer node from the internal data structures.
This will be supported by the tipc user space tool in iproute2.
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Over the years, TCP BDP has increased a lot, and is typically
in the order of ~10 Mbytes with help of clever Congestion Control
modules.
In presence of packet losses, TCP stores incoming packets into an out of
order queue, and number of skbs sitting there waiting for the missing
packets to be received can match the BDP (~10 Mbytes)
In some cases, TCP needs to make room for incoming skbs, and current
strategy can simply remove all skbs in the out of order queue as a last
resort, incurring a huge penalty, both for receiver and sender.
Unfortunately these 'last resort events' are quite frequent, forcing
sender to send all packets again, stalling the flow and wasting a lot of
resources.
This patch cleans only a part of the out of order queue in order
to meet the memory constraints.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: C. Stephen Gun <csg@google.com>
Cc: Van Jacobson <vanj@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While chasing tcp_xmit_retransmit_queue() kasan issue, I found
that we could avoid reading sacked field of skb that we wont send,
possibly removing one cache line miss.
Very minor change in slow path, but why not ? ;)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use one of the vlan xstats padding fields to export the vlan flags. This is
needed in order to be able to distinguish between master (bridge) and port
vlan entries in user-space when dumping the bridge vlan stats.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the bridge driver we usually have the same function working for both
port and bridge. In order to follow that logic and also avoid code
duplication, consolidate the bridge_ and brport_ linkxstats calls into
one since they share most of their code. As a side effect this allows us
to dump the vlan stats also via the slave call which is in preparation for
the upcoming per-port vlan stats and vlan flag dumping.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current vlan push action supports only vid and protocol options.
Add priority option.
Example script that adds vlan push action with vid and
priority:
tc filter add dev veth0 protocol ip parent ffff: \
flower \
indev veth0 \
action vlan push id 100 priority 5
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Enhance flower to support 802.1Q vlan protocol classification.
Currently, the supported fields are vlan_id and vlan_priority.
Example:
# add a flower filter with vlan id and priority classification
tc filter add dev ens4f0 protocol 802.1Q parent ffff: \
flower \
indev ens4f0 \
vlan_ethtype ipv4 \
vlan_id 100 \
vlan_prio 3 \
action vlan pop
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current flower implementation checks the mask range and set all the
keys included in that range as "used_keys", even if a specific key in
the range has a zero mask.
This behavior can cause a false positive return value of
dissector_uses_key function and unnecessary dissection in
__skb_flow_dissect.
This patch checks explicitly the mask of each key and "used_keys" will
be set accordingly.
Fixes: 77b9900ef5 ('tc: introduce Flower classifier')
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add vlan priority check to the flow dissector by adding new flow
dissector struct, flow_dissector_key_vlan which includes vlan tag
fields.
vlan_id and flow_label fields were under the same struct
(flow_dissector_key_tags). It was a convenient setting since struct
flow_dissector_key_tags is used by struct flow_keys and by setting
vlan_id and flow_label under the same struct, we get precisely 24 or 48
bytes in flow_keys from flow_dissector_key_basic.
Now, when adding vlan priority support, the code will be cleaner if
flow_label and vlan tag won't be under the same struct anymore.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Early in the datapath skb_vlan_untag function is called, stripped
the vlan from the skb and set skb->vlan_tci and skb->vlan_proto fields.
The current dissection doesn't handle stripped vlan packets correctly.
In some flows, vlan doesn't exist in skb->data anymore when applying
flow dissection on the skb, fix that.
In case vlan info wasn't stripped before applying flow_dissector (RPS
flow for example), or in case of skb with multiple vlans (e.g. 802.1ad),
get the vlan info from skb->data. The flow_dissector correctly skips
any number of vlans and stores only the first level vlan.
Fixes: 0744dd00c1 ('net: introduce skb_flow_dissect()')
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tc_dump_qdisc() performs dumping of the per-device qdiscs in two phases;
first, the "standard" dev->qdisc is being dumped. Second, if there is/are
ingress queue(s), they are being dumped as well.
After conversion of netdevice's qdisc linked-list into hashtable, these
two sets are not in two disjunctive sets/lists any more, but are both
"reachable" directly from netdevice's hashtable. As a consequence, the
"full-depth" dump of the ingress qdiscs results in immediately hitting the
netdevice hashtable again, and duplicating the dump that has already been
performed for dev->qdisc.
What in fact needs to be dumped in case of ingress queue is "just" the
top-level ingress qdisc, as everything else has been dumped already.
Fix this by extending tc_dump_qdisc_root() in a way that it can be instructed
whether it should (while performing the "full" per-netdev qdisc dump) perform
the whole recursion, or just dump "additional" top-level (ingress) qdiscs
without performing any kind of recursion.
This fixes duplicate dumps such as
qdisc mq 0: root
qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc clsact ffff: parent ffff:fff1
qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
Fixes: 59cc1f61f ("net: sched: convert qdisc linked list to hashtable")
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
qdisc_match_from_root() is now iterating over per-netdevice qdisc
hashtable instead of going through a linked-list of qdiscs (independently
on the actual underlying netdev), which was the case before the switch to
hashtable for qdiscs.
For singleton qdiscs, there is no underlying netdev associated though, and
therefore dumping a singleton qdisc will panic, as qdisc_dev(root) will
always be NULL.
BUG: unable to handle kernel NULL pointer dereference at 0000000000000410
IP: [<ffffffff8167efac>] qdisc_match_from_root+0x2c/0x70
PGD 1aceba067 PUD 1aceb7067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
[ ... ]
task: ffff8801ec996e00 task.stack: ffff8801ec934000
RIP: 0010:[<ffffffff8167efac>] [<ffffffff8167efac>] qdisc_match_from_root+0x2c/0x70
RSP: 0018:ffff8801ec937ab0 EFLAGS: 00010203
RAX: 0000000000000408 RBX: ffff88025e612000 RCX: ffffffffffffffd8
RDX: 0000000000000000 RSI: 00000000ffff0000 RDI: ffffffff81cf8100
RBP: ffff8801ec937ab0 R08: 000000000001c160 R09: ffff8802668032c0
R10: ffffffff81cf8100 R11: 0000000000000030 R12: 00000000ffff0000
R13: ffff88025e612000 R14: ffffffff81cf3140 R15: 0000000000000000
FS: 00007f24b9af6740(0000) GS:ffff88026f280000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000410 CR3: 00000001aceec000 CR4: 00000000001406e0
Stack:
ffff8801ec937ad0 ffffffff81681210 ffff88025dd51a00 00000000fffffff1
ffff8801ec937b88 ffffffff81681e4e ffffffff81c42bc0 ffff880262431500
ffffffff81cf3140 ffff88025dd51a10 ffff88025dd51a24 00000000ec937b38
Call Trace:
[<ffffffff81681210>] qdisc_lookup+0x40/0x50
[<ffffffff81681e4e>] tc_modify_qdisc+0x21e/0x550
[<ffffffff8166ae25>] rtnetlink_rcv_msg+0x95/0x220
[<ffffffff81209602>] ? __kmalloc_track_caller+0x172/0x230
[<ffffffff8166ad90>] ? rtnl_newlink+0x870/0x870
[<ffffffff816897b7>] netlink_rcv_skb+0xa7/0xc0
[<ffffffff816657c8>] rtnetlink_rcv+0x28/0x30
[<ffffffff8168919b>] netlink_unicast+0x15b/0x210
[<ffffffff81689569>] netlink_sendmsg+0x319/0x390
[<ffffffff816379f8>] sock_sendmsg+0x38/0x50
[<ffffffff81638296>] ___sys_sendmsg+0x256/0x260
[<ffffffff811b1275>] ? __pagevec_lru_add_fn+0x135/0x280
[<ffffffff811b1a90>] ? pagevec_lru_move_fn+0xd0/0xf0
[<ffffffff811b1140>] ? trace_event_raw_event_mm_lru_insertion+0x180/0x180
[<ffffffff811b1b85>] ? __lru_cache_add+0x75/0xb0
[<ffffffff817708a6>] ? _raw_spin_unlock+0x16/0x40
[<ffffffff811d8dff>] ? handle_mm_fault+0x39f/0x1160
[<ffffffff81638b15>] __sys_sendmsg+0x45/0x80
[<ffffffff81638b62>] SyS_sendmsg+0x12/0x20
[<ffffffff810038e7>] do_syscall_64+0x57/0xb0
Fix this by special-casing singleton qdiscs (those that don't have
underlying netdevice) and introduce immediate handling of those rather
than trying to go over an underlying netdevice. We're in the same
situation in tc_dump_qdisc_root() and tc_dump_tclass_root().
Ultimately, this will have to be slightly reworked so that we are actually
able to show singleton qdiscs (noop) in the dump properly; but we're not
currently doing that anyway, so no regression there, and better do this in
a gradual manner.
Fixes: 59cc1f61f ("net: sched: convert qdisc linked list to hashtable")
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Daniel Borkmann <daniel@iogearbox.net>
Reported-by: David Ahern <dsa@cumulusnetworks.com>
Tested-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a link is attempted woken up after congestion, it uses a different,
more generous criteria than when it was originally declared congested.
This has the effect that the link, and the sending process, sometimes
will be woken up unnecessarily, just to immediately return to congestion
when it turns out there is not not enough space in its send queue to
host the pending message. This is a waste of CPU cycles.
We now change the function link_prepare_wakeup() to use exactly the same
criteria as tipc_link_xmit(). However, since we are now excluding the
window limit from the wakeup calculation, and the current backlog limit
for the lowest level is too small to house even a single maximum-size
message, we have to expand this limit. We do this by evaluating an
alternative, minimum value during the setting of the importance limits.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 5b7066c3dd ("tipc: stricter filtering of packets in bearer
layer") we introduced a method of filtering out messages while a bearer
is being reset, to avoid that links may be re-created and come back in
working state while we are still in the process of shutting them down.
This solution works well, but is limited to only work with L2 media, which
is insufficient with the increasing use of UDP as carrier media.
We now replace this solution with a more generic one, by introducing a
new flag "up" in the generic struct tipc_bearer. This field will be set
and reset at the same locations as with the previous solution, while
the packet filtering is moved to the generic code for the sending side.
On the receiving side, the filtering is still done in media specific
code, but now including the UDP bearer.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dev->name is a char array of IFNAMSIZ elements, hence can never be
null, so the null pointer check is redundant. Remove it.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter updates for your net tree,
they are:
1) Dump only conntrack that belong to this namespace via /proc file.
This is some fallout from the conversion to single conntrack table
for all netns, patch from Liping Zhang.
2) Missing MODULE_ALIAS_NF_LOGGER() for the ARP family that prevents
module autoloading, also from Liping Zhang.
3) Report overquota event to the right netnamespace, again from Liping.
4) Fix tproxy listener sk refcount that leads to crash, from
Eric Dumazet.
5) Fix racy refcounting on object deletion from nfnetlink and rule
removal both for nfacct and cttimeout, from Liping Zhang.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This is required to iterate over the hash table in cttimeout, ctnetlink
and nf_conntrack_ipv4.
>> ERROR: "nf_conntrack_htable_size" [net/netfilter/nfnetlink_cttimeout.ko] undefined!
ERROR: "nf_conntrack_htable_size" [net/netfilter/nf_conntrack_netlink.ko] undefined!
ERROR: "nf_conntrack_htable_size" [net/ipv4/netfilter/nf_conntrack_ipv4.ko] undefined!
Fixes: adf0516845 ("netfilter: remove ip_conntrack* sysctl compat code")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
In general, when we want to delete a netns, cttimeout_net_exit will
be called before ipt_unregister_table, i.e. before ctnl_timeout_put.
But after call kfree_rcu in cttimeout_net_exit, we will still decrease
the timeout object's refcnt in ctnl_timeout_put, this is incorrect,
and will cause a use after free error.
It is easy to reproduce this problem:
# while : ; do
ip netns add xxx
ip netns exec xxx nfct add timeout testx inet icmp timeout 200
ip netns exec xxx iptables -t raw -p icmp -I OUTPUT -j CT --timeout testx
ip netns del xxx
done
=======================================================================
BUG kmalloc-96 (Tainted: G B E ): Poison overwritten
-----------------------------------------------------------------------
INFO: 0xffff88002b5161e8-0xffff88002b5161e8. First byte 0x6a instead of
0x6b
INFO: Allocated in cttimeout_new_timeout+0xd4/0x240 [nfnetlink_cttimeout]
age=104 cpu=0 pid=3330
___slab_alloc+0x4da/0x540
__slab_alloc+0x20/0x40
__kmalloc+0x1c8/0x240
cttimeout_new_timeout+0xd4/0x240 [nfnetlink_cttimeout]
nfnetlink_rcv_msg+0x21a/0x230 [nfnetlink]
[ ... ]
So only when the refcnt decreased to 0, we call kfree_rcu to free the
timeout object. And like nfnetlink_acct do, use atomic_cmpxchg to
avoid race between ctnl_timeout_try_del and ctnl_timeout_put.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Suppose that we input the following commands at first:
# nfacct add test
# iptables -A INPUT -m nfacct --nfacct-name test
And now "test" acct's refcnt is 2, but later when we try to delete the
"test" nfacct and the related iptables rule at the same time, race maybe
happen:
CPU0 CPU1
nfnl_acct_try_del nfnl_acct_put
atomic_dec_and_test //ref=1,testfail -
- atomic_dec_and_test //ref=0,testok
- kfree_rcu
atomic_inc //ref=1 -
So after the rcu grace period, nf_acct will be freed but it is still linked
in the nfnl_acct_list, and we can access it later, then oops will happen.
Convert atomic_dec_and_test and atomic_inc combinaiton to one atomic
operation atomic_cmpxchg here to fix this problem.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Minor overlapping changes for both merge conflicts.
Resolution work done by Stephen Rothwell was used
as a reference.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Buffers powersave frame test is reversed in cfg80211, fix from Felix
Fietkau.
2) Remove bogus WARN_ON in openvswitch, from Jarno Rajahalme.
3) Fix some tg3 ethtool logic bugs, and one that would cause no
interrupts to be generated when rx-coalescing is set to 0. From
Satish Baddipadige and Siva Reddy Kallam.
4) QLCNIC mailbox corruption and napi budget handling fix from Manish
Chopra.
5) Fix fib_trie logic when walking the trie during /proc/net/route
output than can access a stale node pointer. From David Forster.
6) Several sctp_diag fixes from Phil Sutter.
7) PAUSE frame handling fixes in mlxsw driver from Ido Schimmel.
8) Checksum fixup fixes in bpf from Daniel Borkmann.
9) Memork leaks in nfnetlink, from Liping Zhang.
10) Use after free in rxrpc, from David Howells.
11) Use after free in new skb_array code of macvtap driver, from Jason
Wang.
12) Calipso resource leak, from Colin Ian King.
13) mediatek bug fixes (missing stats sync init, etc.) from Sean Wang.
14) Fix bpf non-linear packet write helpers, from Daniel Borkmann.
15) Fix lockdep splats in macsec, from Sabrina Dubroca.
16) hv_netvsc bug fixes from Vitaly Kuznetsov, mostly to do with VF
handling.
17) Various tc-action bug fixes, from CONG Wang.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (116 commits)
net_sched: allow flushing tc police actions
net_sched: unify the init logic for act_police
net_sched: convert tcf_exts from list to pointer array
net_sched: move tc offload macros to pkt_cls.h
net_sched: fix a typo in tc_for_each_action()
net_sched: remove an unnecessary list_del()
net_sched: remove the leftover cleanup_a()
mlxsw: spectrum: Allow packets to be trapped from any PG
mlxsw: spectrum: Unmap 802.1Q FID before destroying it
mlxsw: spectrum: Add missing rollbacks in error path
mlxsw: reg: Fix missing op field fill-up
mlxsw: spectrum: Trap loop-backed packets
mlxsw: spectrum: Add missing packet traps
mlxsw: spectrum: Mark port as active before registering it
mlxsw: spectrum: Create PVID vPort before registering netdevice
mlxsw: spectrum: Remove redundant errors from the code
mlxsw: spectrum: Don't return upon error in removal path
i40e: check for and deal with non-contiguous TCs
ixgbe: Re-enable ability to toggle VLAN filtering
ixgbe: Force VLNCTRL.VFE to be set in all VMDq paths
...
Adapt KCM to use the stream parser. This mostly involves removing
the RX handling and setting up the strparser using the interface.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces a utility for parsing application layer protocol
messages in a TCP stream. This is a generalization of the mechanism
implemented of Kernel Connection Multiplexor.
The API includes a context structure, a set of callbacks, utility
functions, and a data ready function.
A stream parser instance is defined by a strparse structure that
is bound to a TCP socket. The function to initialize the structure
is:
int strp_init(struct strparser *strp, struct sock *csk,
struct strp_callbacks *cb);
csk is the TCP socket being bound to and cb are the parser callbacks.
The upper layer calls strp_tcp_data_ready when data is ready on the lower
socket for strparser to process. This should be called from a data_ready
callback that is set on the socket:
void strp_tcp_data_ready(struct strparser *strp);
A parser is bound to a TCP socket by setting data_ready function to
strp_tcp_data_ready so that all receive indications on the socket
go through the parser. This is assumes that sk_user_data is set to
the strparser structure.
There are four callbacks.
- parse_msg is called to parse the message (returns length or error).
- rcv_msg is called when a complete message has been received
- read_sock_done is called when data_ready function exits
- abort_parser is called to abort the parser
The input to parse_msg is an skbuff which contains next message under
construction. The backend processing of parse_msg will parse the
application layer protocol headers to determine the length of
the message in the stream. The possible return values are:
>0 : indicates length of successfully parsed message
0 : indicates more data must be received to parse the message
-ESTRPIPE : current message should not be processed by the
kernel, return control of the socket to userspace which
can proceed to read the messages itself
other < 0 : Error is parsing, give control back to userspace
assuming that synchronzation is lost and the stream
is unrecoverable (application expected to close TCP socket)
In the case of error return (< 0) strparse will stop the parser
and report and error to userspace. The application must deal
with the error. To handle the error the strparser is unbound
from the TCP socket. If the error indicates that the stream
TCP socket is at recoverable point (ESTRPIPE) then the application
can read the TCP socket to process the stream. Once the application
has dealt with the exceptions in the stream, it may again bind the
socket to a strparser to continue data operations.
Note that ENODATA may be returned to the application. In this case
parse_msg returned -ESTRPIPE, however strparser was unable to maintain
synchronization of the stream (i.e. some of the message in question
was already read by the parser).
strp_pause and strp_unpause are used to provide flow control. For
instance, if rcv_msg is called but the upper layer can't immediately
consume the message it can hold the message and pause strparser.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While commit 9c706a49d6 ("net: ipconfig: fix use after free") avoids
the use after free, the resulting code still ends up calling both the
ic_setup_if() and ic_setup_routes() after calling ic_close_devs(), and
access to the device is still required.
Move the call to ic_close_devs() to the very end of the function.
Signed-off-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The act_police uses its own code to walk the
action hashtable, which leads to that we could
not flush standalone tc police actions, so just
switch to tcf_generic_walker() like other actions.
(Joint work from Roman and Cong.)
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jamal reported a crash when we create a police action
with a specific index, this is because the init logic
is not correct, we should always create one for this
case. Just unify the logic with other tc actions.
Fixes: a03e6fe569 ("act_police: fix a crash during removal")
Reported-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As pointed out by Jamal, an action could be shared by
multiple filters, so we can't use list to chain them
any more after we get rid of the original tc_action.
Instead, we could just save pointers to these actions
in tcf_exts, since they are refcount'ed, so convert
the list to an array of pointers.
The "ugly" part is the action API still accepts list
as a parameter, I just introduce a helper function to
convert the array of pointers to a list, instead of
relying on the C99 feature to iterate the array.
Fixes: a85a970af2 ("net_sched: move tc_action into tcf_common")
Reported-by: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This list_del() for tc action is not needed actually,
because we only use this list to chain bulk operations,
therefore should not be carried for latter operations.
Fixes: ec0595cc44 ("net_sched: get rid of struct tcf_common")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After refactoring tc_action into tcf_common, we no
longer need to cleanup temporary "actions" in list,
they are permanently stored in the hashtable.
Fixes: a85a970af2 ("net_sched: move tc_action into tcf_common")
Reported-by: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
supersede our debugfs configuration interface in the long run. It is
especially necessary when batman-adv should be used in different
namespaces, since debugfs can not differentiate between those.
More specifically, the following changes are included:
- Two fixes for namespace handling by Andrew Lunn, checking also the
namespaces for parent interfaces, and supress debugfs entries
for non-default netns
- Implement various netlink commands for the new interface, by
Matthias Schiffer, Andrew Lunn, Sven Eckelmann and Simon Wunderlich
(13 patches):
* routing algorithm list
* hardif list
* translation tables (local and global)
* TTVN for the translation tables
* originator and neighbor tables for B.A.T.M.A.N. IV
and B.A.T.M.A.N. V
* gateway dump functionality for B.A.T.M.A.N. IV
and B.A.T.M.A.N. V
* Bridge Loop Avoidance claims, and corresponding BLA group
* Bridge Loop Avoidance backbone tables
- Finally, mark batman-adv as netns compatible, by Andrew Lunn (1 patch)
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdBQJXss7MFhxzd0BzaW1vbnd1bmRlcmxpY2guZGUACgkQoSvjmEKS
nqGvTBAAw7A0lG5ghEEDTVWl++/q3fc41ZPn+XGihizQ3z9Hy5ZAuyREKqMz43RP
MJb2sHnoS/guCY7Y0Mn/ubQDuvp7PmqJxNmHiqdW0UVKrwgRrlhk/uZfd3Blib8J
TR1ktRAT/OKtPIxps2CSq2UX1GcnadtstaUvDDSWnak/0zsQl5GWVYxOkbdsbYUb
qYAbcHBXkdvTfIpZxSwb3QDfKoRs+Hf8hr09V19DH/GZs4puYbIxjw1QhC2TBe0f
SkcMVkmQ6GqJsjRU4BDVCrrfYvv3ncBWXtb5CKyq8il2AvdI1HbXha9hpg0SO69p
fAC5yzyB0rCCr7AKMYBgeIf9u6z5mllKly9QJkZMjtWuIIxt4J5rFK2PN+M3xprb
BWXrINWR4/1C4LA3dDvCL7sFHlObHVKRjSNwzmQ3b6UNY72d6UILG0D9JTI8M+y7
YXtjwCQYNCvjmkprM6mgPMnlk90RdXNhNUngfOe2/2C1li2gaodX7lrx+lBS8/5N
oK5W85vmO41FChLFof5PV6mn4cUV7sKlKPmv93xRvHd89RWBWIU/kGpQQjkCgh5U
44CJiD+FDRkEkDJVo7IkqTxGF39zYR39mQrNFXc6G1H4wRFtqHGP+VOa72a/7arV
FeGtulzeGBK3z1Qi9UyjS2N9mDYSKkfj4f2H+AC1GCRC2mTMCQU=
=KDfF
-----END PGP SIGNATURE-----
Merge tag 'batadv-next-for-davem-20160816' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
pull request for net-next: batman-adv 2016-08-16
This feature patchset is all about adding netlink support, which should
supersede our debugfs configuration interface in the long run. It is
especially necessary when batman-adv should be used in different
namespaces, since debugfs can not differentiate between those.
More specifically, the following changes are included:
- Two fixes for namespace handling by Andrew Lunn, checking also the
namespaces for parent interfaces, and supress debugfs entries
for non-default netns
- Implement various netlink commands for the new interface, by
Matthias Schiffer, Andrew Lunn, Sven Eckelmann and Simon Wunderlich
(13 patches):
* routing algorithm list
* hardif list
* translation tables (local and global)
* TTVN for the translation tables
* originator and neighbor tables for B.A.T.M.A.N. IV
and B.A.T.M.A.N. V
* gateway dump functionality for B.A.T.M.A.N. IV
and B.A.T.M.A.N. V
* Bridge Loop Avoidance claims, and corresponding BLA group
* Bridge Loop Avoidance backbone tables
- Finally, mark batman-adv as netns compatible, by Andrew Lunn (1 patch)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit 64b87639c9 ("netfilter: conntrack: fix race between
nf_conntrack proc read and hash resize") introduce the
nf_conntrack_get_ht, so there's no need to check nf_conntrack_generation
again and again to get the hash table and hash size. And convert
nf_conntrack_get_ht to inline function here.
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
inet_lookup_listener() and inet6_lookup_listener() no longer
take a reference on the found listener.
This minimal patch adds back the refcounting, but we might do
this differently in net-next later.
Fixes: 3b24d854cb ("tcp/dccp: do not touch listener sk_refcnt under synflood")
Reported-and-tested-by: Denys Fedoryshchenko <nuclearcat@nuclearcat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We should report the over quota message to the right net namespace
instead of the init netns.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Otherwise, if nfnetlink_log.ko is not loaded, we cannot add rules
to log packets to the userspace when we specify it with arp family,
such as:
# nft add rule arp filter input log group 0
<cmdline>:1:1-37: Error: Could not process rule: No such file or
directory
add rule arp filter input log group 0
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We should skip the conntracks that belong to a different namespace,
otherwise other unrelated netns's conntrack entries will be dumped via
/proc/net/nf_conntrack.
Fixes: 56d52d4892 ("netfilter: conntrack: use a single hashtable for all namespaces")
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
- Test fixes.
- A vsock fix.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJXsSOEAAoJECgfDbjSjVRpmCIIAKe6m+gWBiC4GJHJTYP5Q+lR
c6meEwxMBTZ+EVSeqUrAIN7slXu/w4NMVE/7IOo9Y+OUGK9MpQiRDOTzw2m3ps8d
W2gEJ+kvc7wFZZKXPkrgvzSuct0yv2Ho+lhZ9wpENU8KulyjBjAZ4xUDw/4LPM7G
nmE8GwOx625N4KCJh3dw5jZsgdyVWzqPuVYUqFctOWdDEqEs4f/Zb3kHR81DoMai
crri3p0fDOo+9zYPDTteG1ILayY6yIIFiPx8jrHTdL9DS+LcBHYJMXunu4Ensjth
9xdJeqWyb20DjSjzhrjwxS7Li4FJDiG5xYHNfuf2OQb+Or7ZvUGga1PbaNa7m6I=
=nlXU
-----END PGP SIGNATURE-----
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio/vhost fixes from Michael Tsirkin:
- test fixes
- a vsock fix
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
tools/virtio: add dma stubs
vhost/test: fix after swiotlb changes
vhost/vsock: drop space available check for TX vq
ringtest: test build fix
Move exporting of switchdev_port_same_parent_id to be right
below it and not elsewhere.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reported-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ensure that the inner_protocol is set on transmit so that GSO segmentation,
which relies on that field, works correctly.
This is achieved by setting the inner_protocol in gre_build_header rather
than each caller of that function. It ensures that the inner_protocol is
set when gre_fb_xmit() is used to transmit GRE which was not previously the
case.
I have observed this is not the case when OvS transmits GRE using
lwtunnel metadata (which it always does).
Fixes: 3872035241 ("gre: Use inner_proto to obtain inner header protocol")
Cc: Pravin Shelar <pshelar@ovn.org>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ping_v6_sendmsg does not set flowi6_oif in response to
sin6_scope_id or sk_bound_dev_if, so it is not possible to use
these APIs to ping an IPv6 address on a different interface.
Instead, it sets flowi6_iif, which is incorrect but harmless.
Stop setting flowi6_iif, and support various ways of setting oif
in the same priority order used by udpv6_sendmsg.
Tested: https://android-review.googlesource.com/#/c/254470/
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If net namespace is attached to a user namespace let's make container's
root owner of sysctls affecting said network namespace instead of global
root.
This also allows us to clean up net_ctl_permissions() because we do not
need to fudge permissions anymore for the container's owner since it now
owns the objects in question.
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When CONFIG_NET_NS is disabled, registering pernet operations causes
init() to be called immediately with init_net as an argument. Unfortunately
this leads to some pernet ops, such as proc_net_ns_init() to be called too
early, when init_net namespace has not been fully initialized. This causes
issues when we want to change pernet ops to use more data from the net
namespace in question, for example reference user namespace that owns our
network namespace.
To fix this we could either play game of musical chairs and rearrange init
order, or we could do the same as when CONFIG_NET_NS is enabled, and
postpone calling pernet ops->init() until namespace is set up properly.
Note that we can not simply undo commit ed160e839d ("[NET]: Cleanup
pernet operation without CONFIG_NET_NS") and use the same implementations
for __register_pernet_operations() and __unregister_pernet_operations(),
because many pernet ops are marked as __net_initdata and will be discarded,
which wreaks havoc on our ops lists. Here we rely on the fact that we only
use lists until init_net is fully initialized, which happens much earlier
than discarding __net_initdata sections.
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove unnecessary use of enable/disable callback notifications
and the incorrect more space available check.
The virtio_transport_tx_work handles when the TX virtqueue
has more buffers available.
Signed-off-by: Gerard Garcia <ggarcia@deic.uab.cat>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
The idea for type_check in dev_get_nest_level() was to count the number
of nested devices of the same type (currently, only macvlan or vlan
devices).
This prevented the false positive lockdep warning on configurations such
as:
eth0 <--- macvlan0 <--- vlan0 <--- macvlan1
However, this doesn't prevent a warning on a configuration such as:
eth0 <--- macvlan0 <--- vlan0
eth1 <--- vlan1 <--- macvlan1
In this case, all the locks end up with a nesting subclass of 1, so
lockdep thinks that there is still a deadlock:
- in the first case we have (macvlan_netdev_addr_lock_key, 1) and then
take (vlan_netdev_xmit_lock_key, 1)
- in the second case, we have (vlan_netdev_xmit_lock_key, 1) and then
take (macvlan_netdev_addr_lock_key, 1)
By removing the linktype check in dev_get_nest_level() and always
incrementing the nesting depth, lockdep considers this configuration
valid.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
If IPv6 is disabled when the option is set to keep IPv6
addresses on link down, userspace is unaware of this as
there is no such indication via netlink. The solution is to
remove the IPv6 addresses in this case, which results in
netlink messages indicating removal of addresses in the
usual manner. This fix also makes the behavior consistent
with the case of having IPv6 disabled first, which stops
IPv6 addresses from being added.
Fixes: f1705ec197 ("net: ipv6: Make address flushing on ifdown optional")
Signed-off-by: Mike Manning <mmanning@brocade.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* send more nl80211 events for interfaces
* remove useless network/transport offset mangling code
* validate beacon intervals identically for all interface types
* use driver rate estimates for mesh
* fix a compiler type/signedness warning
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJXrZSMAAoJEGt7eEactAAd3kAP/1zUHvAZ3qDaz9zz08cIL35R
CXXDtaFip6k4q78ZU1OLKBbR6VXlxZ8mplQ15xtU5Cx7KIzJMG1pWMDpmuvOSttB
4pyg+43CDYSWP9gdxS8jYZVx+KL8+SzrIs4ygvfZv8B5ATSLcTbOGUBlxSylY9FP
CK35I9bL68girl6INb12NtkUwDiA6iC/n9i/62ao/2ywOXEljkwx+JNnk3aiT7Uj
PBRPHY1lK9AynpNyJjpB9Mwip0HNnbS0Ay8WLpPsqQ8NlWH60tEcpEf4L0pBB8cU
l5vs9y6TX/9V9OiVPXHgHjPJ2B3XzCY6DbqE1X5jlzFO1hc9o6y1vh8lqE4PXY/J
brN6nwXObPnx3PDiiqxU33oiXvoQvo8JjBllA3isD2tbseSNZBLewbtdpLOdXtsS
eaY8WyPaLvEfZjI8KJaQn2GvzPzXWLKHdxpYZtxavK0ss7OYioexS13qYxu+i2Tu
8QHOqymI0rfEuyTqOKWxpbUmu4xRyH3Gn6WxOFa/tY4BSKNLfUH7EHvnbh+aX1Hp
IGDTc3erD43agPcn2wYpQtFJ6k9oRIyIakKicP66l4kQrIItfUm6jPIHhYA2D9F4
SrToRZRoVFFOYQEzPoQ0ZEXBYXXa1x8DxoghX50hzX89cqRlFT0jvZHJxlSCLg+h
OzfMU4jtWCwRSK0N2EZ5
=C5wa
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-davem-2016-08-12' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
Not much for -next so far, but here it goes:
* send more nl80211 events for interfaces
* remove useless network/transport offset mangling code
* validate beacon intervals identically for all interface types
* use driver rate estimates for mesh
* fix a compiler type/signedness warning
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
sctp_transport_seq_start() does not currently clear iter->start_fail on
success, but relies on it being zero when it is allocated (by
seq_open_net()).
This can be a problem in the following sequence:
open() // allocates iter (and implicitly sets iter->start_fail = 0)
read()
- iter->start() // fails and sets iter->start_fail = 1
- iter->stop() // doesn't call sctp_transport_walk_stop() (correct)
read() again
- iter->start() // succeeds, but doesn't change iter->start_fail
- iter->stop() // doesn't call sctp_transport_walk_stop() (wrong)
We should initialize sctp_ht_iter::start_fail to zero if ->start()
succeeds, otherwise it's possible that we leave an old value of 1 there,
which will cause ->stop() to not call sctp_transport_walk_stop(), which
causes all sorts of problems like not calling rcu_read_unlock() (and
preempt_enable()), eventually leading to more warnings like this:
BUG: sleeping function called from invalid context at mm/slab.h:388
in_atomic(): 0, irqs_disabled(): 0, pid: 16551, name: trinity-c2
Preemption disabled at:[<ffffffff819bceb6>] rhashtable_walk_start+0x46/0x150
[<ffffffff81149abb>] preempt_count_add+0x1fb/0x280
[<ffffffff83295892>] _raw_spin_lock+0x12/0x40
[<ffffffff819bceb6>] rhashtable_walk_start+0x46/0x150
[<ffffffff82ec665f>] sctp_transport_walk_start+0x2f/0x60
[<ffffffff82edda1d>] sctp_transport_seq_start+0x4d/0x150
[<ffffffff81439e50>] traverse+0x170/0x850
[<ffffffff8143aeec>] seq_read+0x7cc/0x1180
[<ffffffff814f996c>] proc_reg_read+0xbc/0x180
[<ffffffff813d0384>] do_loop_readv_writev+0x134/0x210
[<ffffffff813d2a95>] do_readv_writev+0x565/0x660
[<ffffffff813d6857>] vfs_readv+0x67/0xa0
[<ffffffff813d6c16>] do_preadv+0x126/0x170
[<ffffffff813d710c>] SyS_preadv+0xc/0x10
[<ffffffff8100334c>] do_syscall_64+0x19c/0x410
[<ffffffff83296225>] return_from_SYSCALL_64+0x0/0x6a
[<ffffffffffffffff>] 0xffffffffffffffff
Notice that this is a subtly different stacktrace from the one in commit
5fc382d875 ("net/sctp: terminate rhashtable walk correctly").
Cc: Xin Long <lucien.xin@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Acked-By: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If iriap_register_lsap() fails to allocate memory, self->lsap is
set to NULL. However, none of the callers handle the failure and
irlmp_connect_request() will happily dereference it:
iriap_register_lsap: Unable to allocated LSAP!
================================================================================
UBSAN: Undefined behaviour in net/irda/irlmp.c:378:2
member access within null pointer of type 'struct lsap_cb'
CPU: 1 PID: 15403 Comm: trinity-c0 Not tainted 4.8.0-rc1+ #81
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org
04/01/2014
0000000000000000 ffff88010c7e78a8 ffffffff82344f40 0000000041b58ab3
ffffffff84f98000 ffffffff82344e94 ffff88010c7e78d0 ffff88010c7e7880
ffff88010630ad00 ffffffff84a5fae0 ffffffff84d3f5c0 000000000000017a
Call Trace:
[<ffffffff82344f40>] dump_stack+0xac/0xfc
[<ffffffff8242f5a8>] ubsan_epilogue+0xd/0x8a
[<ffffffff824302bf>] __ubsan_handle_type_mismatch+0x157/0x411
[<ffffffff83b7bdbc>] irlmp_connect_request+0x7ac/0x970
[<ffffffff83b77cc0>] iriap_connect_request+0xa0/0x160
[<ffffffff83b77f48>] state_s_disconnect+0x88/0xd0
[<ffffffff83b78904>] iriap_do_client_event+0x94/0x120
[<ffffffff83b77710>] iriap_getvaluebyclass_request+0x3e0/0x6d0
[<ffffffff83ba6ebb>] irda_find_lsap_sel+0x1eb/0x630
[<ffffffff83ba90c8>] irda_connect+0x828/0x12d0
[<ffffffff833c0dfb>] SYSC_connect+0x22b/0x340
[<ffffffff833c7e09>] SyS_connect+0x9/0x10
[<ffffffff81007bd3>] do_syscall_64+0x1b3/0x4b0
[<ffffffff845f946a>] entry_SYSCALL64_slow_path+0x25/0x25
================================================================================
The bug seems to have been around since forever.
There's more problems with missing error checks in iriap_init() (and
indeed all of irda_init()), but that's a bigger problem that needs
very careful review and testing. This patch will fix the most serious
bug (as it's easily reached from unprivileged userspace).
I have tested my patch with a reproducer.
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix the bpf_try_make_writable() helper and all call sites we have in BPF,
it's currently defect with regards to skbs when the write_len spans into
non-linear parts, no matter if cloned or not.
There are multiple issues at once. First, using skb_store_bits() is not
correct since even if we have a cloned skb, page frags can still be shared.
To really make them private, we need to pull them in via __pskb_pull_tail()
first, which also gets us a private head via pskb_expand_head() implicitly.
This is for helpers like bpf_skb_store_bytes(), bpf_l3_csum_replace(),
bpf_l4_csum_replace(). Really, the only thing reasonable and working here
is to call skb_ensure_writable() before any write operation. Meaning, via
pskb_may_pull() it makes sure that parts we want to access are pulled in and
if not does so plus unclones the skb implicitly. If our write_len still fits
the headlen and we're cloned and our header of the clone is not writable,
then we need to make a private copy via pskb_expand_head(). skb_store_bits()
is a bit misleading and only safe to store into non-linear data in different
contexts such as 357b40a18b ("[IPV6]: IPV6_CHECKSUM socket option can
corrupt kernel memory").
For above BPF helper functions, it means after fixed bpf_try_make_writable(),
we've pulled in enough, so that we operate always based on skb->data. Thus,
the call to skb_header_pointer() and skb_store_bits() becomes superfluous.
In bpf_skb_store_bytes(), the len check is unnecessary too since it can
only pass in maximum of BPF stack size, so adding offset is guaranteed to
never overflow. Also bpf_l3/4_csum_replace() helpers must test for proper
offset alignment since they use __sum16 pointer for writing resulting csum.
The remaining helpers that change skb data not discussed here yet are
bpf_skb_vlan_push(), bpf_skb_vlan_pop() and bpf_skb_change_proto(). The
vlan helpers internally call either skb_ensure_writable() (pop case) and
skb_cow_head() (push case, for head expansion), respectively. Similarly,
bpf_skb_proto_xlat() takes care to not mangle page frags.
Fixes: 608cd71a9c ("tc: bpf: generalize pedit action")
Fixes: 91bc4822c3 ("tc: bpf: add checksum helpers")
Fixes: 3697649ff2 ("bpf: try harder on clones when writing into skb")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, if calipso_genopt fails then the error exit path
does not free the ipv6_opt_hdr new causing a memory leak. Fix
this by kfree'ing new on the error exit path.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This backward compatibility has been around for more than ten years,
since Yasuyuki Kozakai introduced IPv6 in conntrack. These days, we have
alternate /proc/net/nf_conntrack* entries, the ctnetlink interface and
the conntrack utility got adopted by many people in the user community
according to what I observed on the netfilter user mailing list.
So let's get rid of this.
Note that nf_conntrack_htable_size and unsigned int nf_conntrack_max do
not need to be exported as symbol anymore.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
While hashing out BPF's current_task_under_cgroup helper bits, it came
to discussion that the skb_in_cgroup helper name was suboptimally chosen.
Tejun says:
So, I think in_cgroup should mean that the object is in that
particular cgroup while under_cgroup in the subhierarchy of that
cgroup. Let's rename the other subhierarchy test to under too. I
think that'd be a lot less confusing going forward.
[...]
It's more intuitive and gives us the room to implement the real
"in" test if ever necessary in the future.
Since this touches uapi bits, we need to change this as long as v4.8
is not yet officially released. Thus, change the helper enum and rename
related bits.
Fixes: 4a482f34af ("cgroup: bpf: Add bpf_skb_in_cgroup_proto")
Reference: http://patchwork.ozlabs.org/patch/658500/
Suggested-by: Sargun Dhillon <sargun@sargun.me>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Fixes the following sparse warning:
net/ipv6/sit.c:1129:6: warning:
symbol 'ipip6_valid_ip_proto' was not declared. Should it be static?
Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
chronological order):
- bump version strings, by Simon Wunderlich
- kerneldoc clean up, by Sven Eckelmann
- enable RTNL automatic loading and according documentation
changes, by Sven Eckelmann (2 patches)
- fix/improve interface removal and associated locking, by
Sven Eckelmann (3 patches)
- clean up unused variables, by Linus Luessing
- implement Gateway selection code for B.A.T.M.A.N. V by
Antonio Quartulli (4 patches)
- rewrite TQ comparison by Markus Pargmann
- fix Cocinelle warnings on bool vs integers (by Fenguang Wu/Intels
kbuild test robot) and bitwise arithmetic operations (by Linus
Luessing)
- rewrite packet creation for forwarding for readability and to avoid
reference count mistakes, by Linus Luessing
- use kmem_cache for translation table, which results in more efficient
storing of translation table entries, by Sven Eckelmann
- rewrite/clarify reference handling for send_skb_unicast, by Sven
Eckelmann
- fix debug messages when updating routes, by Sven Eckelmann
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCgAGBQJXrY1iAAoJEKEr45hCkp6hqjsP/1GI9Mm9iHGE5s4jY9+JORkn
yR57i0l8IENLtQ2jrxu48VtyBKI5gQoitftRpAMZw5iUjgWXVTzA8/ik1Hy7VHnG
NkDRAwHG/XH0peoubQGGPNbX2pZzBDnjR3wC9/8rOk/q6VqVqcLtgyHKbJFS5hd7
dW3okXqKCZhJcTFnu95i7PZ9zTB7BrHEqcu9aDuA6VHdf4HF9ndCizP9bdnRCOVr
wR1CkCrSjt2pMqRPLDAFcHzq/Lr+4LsNwodO0zqK5yetysJNaFJ7j5nTle2REk4L
V2Wvbmyzxa8MznRphisYM+UJ12BxVjwmuilVoxgeu/FmfCpopA1L7lbWf+xxtAcP
VLegLq2BG3fqkG6Pvk8emabC6oDxZaHsFV6uC4HylzLy09mCKWuap0qtgvNFjloM
ntdYail5BFFsTq9j7KK7k4cfYikeLfmd3/j7F/ok+PJXGpAHKqOsfRABV61rxELH
era8GrQmllh1UA/KX7j6rS4DK+AjaXmh+nk6+KDrd6IKo4+hZ1dg3UHB+ytrnx+6
p0BoLgUnjBvNT44LsjtUZlt+3ILUspJWfb86kBgTFuZm8rJqulrJu6qDbmBZEayb
PPrubxjYSKxR0nMlOVTBsGmNjugQIGn0ku89HKD210YZlpfnYENxwsxtYucWK/Tm
AvwINRUXfumyJIZ385BQ
=6XUi
-----END PGP SIGNATURE-----
Merge tag 'batadv-next-for-davem-20160812' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
This feature patchset includes the following changes (mostly
chronological order):
- bump version strings, by Simon Wunderlich
- kerneldoc clean up, by Sven Eckelmann
- enable RTNL automatic loading and according documentation
changes, by Sven Eckelmann (2 patches)
- fix/improve interface removal and associated locking, by
Sven Eckelmann (3 patches)
- clean up unused variables, by Linus Luessing
- implement Gateway selection code for B.A.T.M.A.N. V by
Antonio Quartulli (4 patches)
- rewrite TQ comparison by Markus Pargmann
- fix Cocinelle warnings on bool vs integers (by Fenguang Wu/Intels
kbuild test robot) and bitwise arithmetic operations (by Linus
Luessing)
- rewrite packet creation for forwarding for readability and to avoid
reference count mistakes, by Linus Luessing
- use kmem_cache for translation table, which results in more efficient
storing of translation table entries, by Sven Eckelmann
- rewrite/clarify reference handling for send_skb_unicast, by Sven
Eckelmann
- fix debug messages when updating routes, by Sven Eckelmann
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Highlights include:
- Stable patch from Olga to fix RPCSEC_GSS upcalls when the same user needs
multiple different security services (e.g. krb5i and krb5p).
- Stable patch to fix a regression introduced by the use of SO_REUSEPORT,
and that prevented the use of multiple different NFS versions to the
same server.
- TCP socket reconnection timer fixes.
- Patch from Neil to disable the use of IPv6 temporary addresses.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXrh03AAoJEGcL54qWCgDyp4EQALwZpmYCxWJE5xSHW95Fs124
HYM8g4LznOfs3/ohInb1ja2FaQqUy0XEk3pSjNKfyYgjuwB4qJSOpnAqoIKxJFGB
h4582leYZOZYMMCGslS2I4zcElBYO1WjnKNyb7MpZjCHmN0AdFfIcOXd2K7eL9hM
/poImcs5KfMGIEJqmKqMUxmJ3RjxpK3LySQAes/Y5odOiHC4SGJdGUmSeuPGTbQd
YjFWVHRFU6kVAzPd2Jl46Sgy6SpDaVz82HodXCSY+8lklmIkbIsVqJs0VWo3WkfL
r5WLQ3PzZvloQ7o/E9tZGiB/LEi7roa51hYsG4sleN6Kap5vwyWg0QIKjqyJdFxB
JmFanlCMfae3zNz4cusvgu1okvMnNqO4uRXJIAKfk64k775N9ebY7TXAZUK4/UbY
4nxCHcxygamP/k/8HYFpc4964tMaimIs9JUdojad5a3dzffwXcgEC/0HPUih9R+i
DO/cbVtWeDkmQPLrUqFfOAbmQdyAjELrv48d5BVIst49uuCULU2LlDlVLiAvaZvq
s2YNmr7lkHowvgaH4ShL89wuyyD14Xu5/f49oFBFNKEQay9YthQ8s3XmdZBG7Zl0
oyA1XJjWEq3p8nvPGIqFD26w75ppUbAWLTHsyoU0YfEYrZJrF9jPxowI7WlHgfVo
Io79x1sbgTrckjG+osAf
=UHph
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.8-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:
- Stable patch from Olga to fix RPCSEC_GSS upcalls when the same user
needs multiple different security services (e.g. krb5i and krb5p).
- Stable patch to fix a regression introduced by the use of
SO_REUSEPORT, and that prevented the use of multiple different NFS
versions to the same server.
- TCP socket reconnection timer fixes.
- Patch from Neil to disable the use of IPv6 temporary addresses"
* tag 'nfs-for-4.8-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv4: Cap the transport reconnection timer at 1/2 lease period
NFSv4: Cleanup the setting of the nfs4 lease period
SUNRPC: Limit the reconnect backoff timer to the max RPC message timeout
SUNRPC: Fix reconnection timeouts
NFSv4.2: LAYOUTSTATS may return NFS4ERR_ADMIN/DELEG_REVOKED
SUNRPC: disable the use of IPv6 temporary addresses.
SUNRPC: allow for upcalls for same uid but different gss service
SUNRPC: Fix up socket autodisconnect
SUNRPC: Handle EADDRNOTAVAIL on connection failures
This patch adds a new hash expression, this provides jhash support but
this can be extended to support for other hash functions. The modulus
and seed already comes embedded into this new expression.
Use case example:
... meta mark set hash ip saddr mod 10
Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
After earlier patches conversions all spots acquire the writer lock and
we can now convert this to a normal spinlock.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
It doesn't seem that important.
We now get inconsistent view of the counters, but those are stale anyway
right after we drop the lock.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Don't acquire the readlock anymore and rely on rcu alone.
In case writer on other CPU changed policy at the wrong moment (after we
obtained sk policy pointer but before we could obtain the reference)
just repeat the lookup.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
If we don't hold the policy lock anymore the refcnt might
already be 0, i.e. policy struct is about to be free'd.
Switch to atomic_inc_not_zero to avoid this.
On removal policies are already unlinked from the tables (lists)
before the last _put occurs so we are not supposed to find the same
'dead' entry on the next loop, so its safe to just repeat the lookup.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Once xfrm_policy_lookup_bytype doesn't grab xfrm_policy_lock anymore its
possible for a hash resize to occur in parallel.
Use sequence counter to block lookup in case a resize is in
progress and to also re-lookup in case hash table was altered
in the mean time (might cause use to not find the best-match).
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Since commit 56f047305d
("xfrm: add rcu grace period in xfrm_policy_destroy()") xfrm policy
objects are already free'd via rcu.
In order to make more places lockless (i.e. use rcu_read_lock instead of
grabbing read-side of policy rwlock) we only need to:
- use rcu_assign_pointer to store address of new hash table backend memory
- add rcu barrier so that freeing of old memory is delayed (expansion
and free happens from system workqueue, so synchronize_rcu is fine)
- use rcu_dereference to fetch current address of the hash table.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This is required once we allow lockless readers.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Use nft_set_* prefix for backend set implementations, thus we can use
nft_hash for the new hash expression.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Once timer is removed from nf_conn struct we cannot open-code
the removal sequence anymore.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
... so we don't need to touch all of these places when we get rid of the
timer in nf_conn.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
IP header checksum will be recalculated at ip_local_out, so
there's no need to calculated it here, remove it. Also update
code comments to illustrate it, and delete the misleading
comments about checksum recalculation.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We only need first 4 bytes instead of 8 bytes to get the ports of
tcp/udp/dccp/sctp/udplite in their pkt_to_tuple function.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
- Misc fixes and cleanups all over the place.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJXq0ruAAoJECgfDbjSjVRp5P8H/2OlDJdSS1l+TwOXbY95ntQ1
vxUX4vGCX5IujC+Rbt7sQV2prE3b6IktFNagpbRoWn21JkpoDMvPtYJrn5BhLtoh
fvDkZE6Wo3QztFSjaUBZWEABBt03KPX0yrAIZplu8ne/Z8KAT3zK57BPnKfmxwv+
dpxt+1wlnqAvYsoUUQZBFT4Gmk2oDiTofiIbQq7W9W/fooznLtLB+ArYtdfNJizC
JnI/vJuWceEXfjT26HexCRhA2OZskrA4ZadDhOjAqkTPN5DHfweLDuHh7IsVfDd1
wXqjc4ks3cYG0CloJ2qY2K7RpDOFIxIizixeDIuAbn9aX4sPOYYfqRm+4iRwmqQ=
=9aUO
-----END PGP SIGNATURE-----
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio/vhost fixes and cleanups from Michael Tsirkin:
"Misc fixes and cleanups all over the place"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
virtio/s390: deprecate old transport
virtio/s390: keep early_put_chars
virtio_blk: Fix a slient kernel panic
virtio-vsock: fix include guard typo
vhost/vsock: fix vhost virtio_vsock_pkt use-after-free
9p/trans_virtio: use kvfree() for iov_iter_get_pages_alloc()
virtio: fix error handling for debug builds
virtio: fix memory leak in virtqueue_add()
Different gcc versions appear to be treating enum with different
signedness, causing warnings with the out parameter one way or
the other.
Just use the correct type to avoid all that.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Depending on which method the driver implements, userspace could
call this (indirectly, by getting station info) before the driver
knows about the station, possibly causing it to misbehave.
Therefore, add a check for sta->uploaded which indicates that the
driver knows about the station.
Signed-off-by: Maxim Altshul <maxim.altshul@ti.com>
[reword commit message]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Beacon interval interface combinations validation was missing
for MESH/IBSS join, add those.
Johannes: also move the beacon interval check disallowing really
tiny and really big intervals into the common function, which
adds it for AP mode.
Signed-off-by: Purushottam Kushwaha <pkushwah@qti.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This change alters the semantics of NL80211_CMD_DEL_INTERFACE events
by always sending this event whenever a net_device object associated
with a wdev is destroyed. Prior to this change, this event was only
emitted as a result of NL80211_CMD_DEL_INTERFACE command sent from
userspace. This allows userspace to reliably detect when wireless
interfaces have been removed, e.g. due to USB removal events, etc.
For wireless device objects without an associated net_device (e.g.
NL80211_IFTYPE_P2P_DEVICE), the NL80211_CMD_DEL_INTERFACE event is
now generated inside cfg80211_unregister_wdev.
Signed-off-by: Denis Kenzior <denkenz@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This change alters the semantics of NL80211_CMD_NEW_INTERFACE events
by always sending this event whenever a new net_device object
associated with a wdev is registered. Prior to this change, this event
was only sent as a result of NL80211_CMD_NEW_INTERFACE command sent
from userspace. This allows userspace to reliably detect new wireless
interfaces (e.g. due to hardware hot-plug events, etc).
For wdevs created without an associated net_device object (e.g.
NL80211_IFTYPE_P2P_DEVICE), the NL80211_CMD_NEW_INTERFACE event is
still generated inside the relevant nl80211 command handler.
Signed-off-by: Denis Kenzior <denkenz@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Since the code only touches the MAC headers, the offsets to the
network/transport headers remain the same throughout this function.
Remove pointless pieces of code that try to 'preserve' them.
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Mesh HWMP module will be able to rely on the HW
RC algorithm if it exists, for path metric calculations.
This allows the metric calculation mechanism to calculate
a correct metric, based on PER and last TX rate both via
HW RC algorithm if it exists or via parameters collected
by the SW.
Signed-off-by: Maxim Altshul <maxim.altshul@ti.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Running LTP 'icmp-uni-basic.sh -6 -p ipcomp -m tunnel' test over
openvswitch + veth can trigger kernel panic:
BUG: unable to handle kernel NULL pointer dereference
at 00000000000000e0 IP: [<ffffffff8169d1d2>] xfrm_input+0x82/0x750
...
[<ffffffff816d472e>] xfrm6_rcv_spi+0x1e/0x20
[<ffffffffa082c3c2>] xfrm6_tunnel_rcv+0x42/0x50 [xfrm6_tunnel]
[<ffffffffa082727e>] tunnel6_rcv+0x3e/0x8c [tunnel6]
[<ffffffff8169f365>] ip6_input_finish+0xd5/0x430
[<ffffffff8169fc53>] ip6_input+0x33/0x90
[<ffffffff8169f1d5>] ip6_rcv_finish+0xa5/0xb0
...
It seems that tunnel.ip6 can have garbage values and also dereferenced
without a proper check, only tunnel.ip4 is being verified. Fix it by
adding one more if block for AF_INET6 and initialize tunnel.ip6 with NULL
inside xfrm6_rcv_spi() (which is similar to xfrm4_rcv_spi()).
Fixes: 049f8e2 ("xfrm: Override skb->mark with tunnel->parm.i_key in xfrm_input")
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
The creation of a tunnel vport (geneve, gre, vxlan) brings up a
corresponding netdev, a multi-step operation which can fail.
For example, changing a vxlan vport's netdev state to 'up' binds the
vport's socket to a UDP port - if the binding fails (e.g. due to the
port being in use), the error is currently ignored giving the
appearance that the tunnel vport creation completed successfully.
Signed-off-by: Martynas Pumputis <martynas@weave.works>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit cf6f7e1d51 ("tipc: dump monitor attributes"),
I dereferenced a pointer before checking if its valid.
This is reported by static check Smatch as:
net/tipc/monitor.c:733 tipc_nl_add_monitor_peer()
warn: variable dereferenced before check 'mon' (see line 731)
In this commit, we check for a valid monitor before proceeding
with any other operation.
Fixes: cf6f7e1d51 ("tipc: dump monitor attributes")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The PPTP is encapsulated by GRE header with that GRE_VERSION bits
must contain one. But current GRE RPS needs the GRE_VERSION must be
zero. So RPS does not work for PPTP traffic.
In my test environment, there are four MIPS cores, and all traffic
are passed through by PPTP. As a result, only one core is 100% busy
while other three cores are very idle. After this patch, the usage
of four cores are balanced well.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Reviewed-by: Philip Prindeville <philipp@redfish-solutions.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert the per-device linked list into a hashtable. The primary
motivation for this change is that currently, we're not tracking all the
qdiscs in hierarchy (e.g. excluding default qdiscs), as the lookup
performed over the linked list by qdisc_match_from_root() is rather
expensive.
The ultimate goal is to get rid of hidden qdiscs completely, which will
bring much more determinism in user experience.
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a preparatory patch for converting qdisc linked list into a
hashtable. As we'll need to include hashtable.h in netdevice.h, we first
have to make sure that this will not introduce symbol conflicts for any of
the netdevice.h users.
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for your net tree,
they are:
1) Use mod_timer_pending() to avoid reactivating a dead expectation in
the h323 conntrack helper, from Liping Zhang.
2) Oneliner to fix a type in the register name defined in the nf_tables
header.
3) Don't try to look further when we find an inactive elements with no
descendants in the rbtree set implementation, otherwise we crash.
4) Handle valid zero CSeq in the SIP conntrack helper, from
Christophe Leroy.
5) Don't display a trailing slash in conntrack helper with no classes
via /proc/net/nf_conntrack_expect, from Liping Zhang.
6) Fix an expectation leak during creation from the nfqueue path, again
from Liping Zhang.
7) Validate netlink port ID in verdict message from nfqueue, otherwise
an injection can be possible. Again from Zhang.
8) Reject conntrack tuples with different transport protocol on
original and reply tuples, also from Zhang.
9) Validate offset and length in nft_exthdr, make sure they are under
sizeof(u8), from Laura Garcia Liebana.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
ic_close_devs() calls kfree() for all devices's ic_device. Since commit
2647cffb2b ("net: ipconfig: Support using "delayed" DHCP replies")
the active device's ic_device is still used however to print the
ipconfig summary which results in an oops if the memory is already
changed. So delay freeing until after the autoconfig results are
reported.
Fixes: 2647cffb2b ("net: ipconfig: Support using "delayed" DHCP replies")
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix the direct assignment of offset and length attributes included in
nft_exthdr structure from u32 data to u8.
Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
push the lock down, after earlier patches we can rely on rcu to
make sure state struct won't go away.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Before xfrm_state_find() can use rcu_read_lock instead of xfrm_state_lock
we need to switch users of the hash table to assign/obtain the pointers
with the appropriate rcu helpers.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Once xfrm_state_find is lockless we have to cope with a concurrent
resize opertion.
We use a sequence counter to block in case a resize is in progress
and to detect if we might have missed a state that got moved to
a new hash table.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
The hash table backend memory and the state structs are free'd via
kfree/vfree.
Once we only rely on rcu during lookups we have to make sure no other cpu
is currently accessing this before doing the free.
Free operations already happen from worker so we can use synchronize_rcu
to wait until concurrent readers are done.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Once xfrm_state_lookup_byaddr no longer acquires the state lock another
cpu might be freeing the state entry at the same time.
To detect this we use atomic_inc_not_zero, we then signal -EAGAIN to
caller in case our result was stale.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This is required once we allow lockless access of bydst/bysrc hash tables.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
The xfrm_replay structures are never modified, so declare them as const.
Done with the help of Coccinelle.
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Adding fdb entries pointing to the bridge device uses fdb_insert(),
which lacks various checks and does not respect added_by_user flag.
As a result, some inconsistent behavior can happen:
* Adding temporary entries succeeds but results in permanent entries.
* Same goes for "dynamic" and "use".
* Changing mac address of the bridge device causes deletion of
user-added entries.
* Replacing existing entries looks successful from userspace but actually
not, regardless of NLM_F_EXCL flag.
Use the same logic as other entries and fix them.
Fixes: 3741873b4f ("bridge: allow adding of fdb entries pointing to the bridge device")
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit 0ddcf43d5d ("ipv4: FIB Local/MAIN table collapse")
fib_local is set but not used. Remove it.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When executing the script included below, the netns delete operation
hangs with the following message (repeated at 10 second intervals):
kernel:unregister_netdevice: waiting for lo to become free. Usage count = 1
This occurs because a reference to the lo interface in the "secure" netns
is still held by a dst entry in the xfrm bundle cache in the init netns.
Address this problem by garbage collecting the tunnel netns flow cache
when a cross-namespace vti interface receives a NETDEV_DOWN notification.
A more detailed description of the problem scenario (referencing commands
in the script below):
(1) ip link add vti_test type vti local 1.1.1.1 remote 1.1.1.2 key 1
The vti_test interface is created in the init namespace. vti_tunnel_init()
attaches a struct ip_tunnel to the vti interface's netdev_priv(dev),
setting the tunnel net to &init_net.
(2) ip link set vti_test netns secure
The vti_test interface is moved to the "secure" netns. Note that
the associated struct ip_tunnel still has tunnel->net set to &init_net.
(3) ip netns exec secure ping -c 4 -i 0.02 -I 192.168.100.1 192.168.200.1
The first packet sent using the vti device causes xfrm_lookup() to be
called as follows:
dst = xfrm_lookup(tunnel->net, skb_dst(skb), fl, NULL, 0);
Note that tunnel->net is the init namespace, while skb_dst(skb) references
the vti_test interface in the "secure" namespace. The returned dst
references an interface in the init namespace.
Also note that the first parameter to xfrm_lookup() determines which flow
cache is used to store the computed xfrm bundle, so after xfrm_lookup()
returns there will be a cached bundle in the init namespace flow cache
with a dst referencing a device in the "secure" namespace.
(4) ip netns del secure
Kernel begins to delete the "secure" namespace. At some point the
vti_test interface is deleted, at which point dst_ifdown() changes
the dst->dev in the cached xfrm bundle flow from vti_test to lo (still
in the "secure" namespace however).
Since nothing has happened to cause the init namespace's flow cache
to be garbage collected, this dst remains attached to the flow cache,
so the kernel loops waiting for the last reference to lo to go away.
<Begin script>
ip link add br1 type bridge
ip link set dev br1 up
ip addr add dev br1 1.1.1.1/8
ip netns add secure
ip link add vti_test type vti local 1.1.1.1 remote 1.1.1.2 key 1
ip link set vti_test netns secure
ip netns exec secure ip link set vti_test up
ip netns exec secure ip link s lo up
ip netns exec secure ip addr add dev lo 192.168.100.1/24
ip netns exec secure ip route add 192.168.200.0/24 dev vti_test
ip xfrm policy flush
ip xfrm state flush
ip xfrm policy add dir out tmpl src 1.1.1.1 dst 1.1.1.2 \
proto esp mode tunnel mark 1
ip xfrm policy add dir in tmpl src 1.1.1.2 dst 1.1.1.1 \
proto esp mode tunnel mark 1
ip xfrm state add src 1.1.1.1 dst 1.1.1.2 proto esp spi 1 \
mode tunnel enc des3_ede 0x112233445566778811223344556677881122334455667788
ip xfrm state add src 1.1.1.2 dst 1.1.1.1 proto esp spi 1 \
mode tunnel enc des3_ede 0x112233445566778811223344556677881122334455667788
ip netns exec secure ping -c 4 -i 0.02 -I 192.168.100.1 192.168.200.1
ip netns del secure
<End script>
Reported-by: Hangbin Liu <haliu@redhat.com>
Reported-by: Jan Tluka <jtluka@redhat.com>
Signed-off-by: Lance Richardson <lrichard@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Once a packet has been posted to a connection in the data_ready handler, we
mustn't try reposting if we then find that the connection is dying as the
refcount has been given over to the dying connection and the packet might
no longer exist.
Losing the packet isn't a problem as the peer will retransmit.
Signed-off-by: David Howells <dhowells@redhat.com>
The call state machine processor sets up the message parameters for a UDP
message that it might need to transmit in advance on the basis that there's
a very good chance it's going to have to transmit either an ACK or an
ABORT. This requires it to look in the connection struct to retrieve some
of the parameters.
However, if the call is complete, the call connection pointer may be NULL
to dissuade the processor from transmitting a message. However, there are
some situations where the processor is still going to be called - and it's
still going to set up message parameters whether it needs them or not.
This results in a NULL pointer dereference at:
net/rxrpc/call_event.c:837
To fix this, skip the message pre-initialisation if there's no connection
attached.
Signed-off-by: David Howells <dhowells@redhat.com>
If rxrpc_new_client_call() fails to make a connection, the call record that
it allocated needs to be marked as RXRPC_CALL_RELEASED before it is passed
to rxrpc_put_call() to indicate that it no longer has any attachment to the
AF_RXRPC socket.
Without this, an assertion failure may occur at:
net/rxrpc/call_object:635
Signed-off-by: David Howells <dhowells@redhat.com>
A newly added bugfix caused an uninitialized variable to be
used for printing debug output. This is harmless as long
as the debug setting is disabled, but otherwise leads to an
immediate crash.
gcc warns about this when -Wmaybe-uninitialized is enabled:
net/rxrpc/call_object.c: In function 'rxrpc_release_call':
net/rxrpc/call_object.c:496:163: error: 'sp' may be used uninitialized in this function [-Werror=maybe-uninitialized]
The initialization was removed but one of the users remains.
This adds back the initialization.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 372ee16386 ("rxrpc: Fix races between skb free, ACK generation and replying")
Signed-off-by: David Howells <dhowells@redhat.com>
Currently, user can add a conntrack with different l4proto via nfnetlink.
For example, original tuple is TCP while reply tuple is SCTP. This is
invalid combination, we should report EINVAL to userspace.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Like NFQNL_MSG_VERDICT_BATCH do, we should also reject the verdict
request when the portid is not same with the initial portid(maybe
from another process).
Fixes: 97d32cf944 ("netfilter: nfnetlink_queue: batch verdict support")
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
User can use NFQA_EXP to attach expectations to conntracks, but we
forget to put back nf_conntrack_expect when it is inserted successfully,
i.e. in this normal case, expect's use refcnt will be 3. So even we
unlink it and put it back later, the use refcnt is still 1, then the
memory will be leaked forever.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The 'name' filed in struct nf_conntrack_expect_policy{} is not a
pointer, so check it is NULL or not will always return true. Even if the
name is empty, slash will always be displayed like follows:
# cat /proc/net/nf_conntrack_expect
297 l3proto = 2 proto=6 src=1.1.1.1 dst=2.2.2.2 sport=1 dport=1025 ftp/
^
Fixes: 3a8fc53a45 ("netfilter: nf_ct_helper: allocate 16 bytes for the helper and policy names")
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The files provided by batman-adv via debugfs are currently converted to
netlink. Tools which are not yet converted to use the netlink interface may
still rely on the old debugfs files. But systems which already upgraded
their tools can save some space by disabling this feature. The default
configuration of batman-adv on amd64 can reduce the size of the module by
around 11% when this feature is disabled.
$ size net/batman-adv/batman-adv.ko*
text data bss dec hex filename
150507 10395 4160 165062 284c6 net/batman-adv/batman-adv.ko.y
137106 7099 2112 146317 23b8d net/batman-adv/batman-adv.ko.n
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Switch-like virtual interfaces like bridge or openvswitch don't destroy
itself when all their attached netdevices dissappear. Instead they only
remove the link to the unregistered device and keep working until they get
removed manually.
This has the benefit that all configurations for this interfaces are kept
and daemons reacting to rtnl events can just add new slave interfaces
without going through the complete configuration of the switch-like
netdevice.
Handling unregister events of client devices similar in batman-adv allows
users to drop their current workaround of dummy netdevices attached to
batman-adv soft-interfaces.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
It is hard to understand why the refcnt is increased when it isn't done
near the actual place the new reference is used. So using kref_get right
before the place which requires the reference and in the same function
helps to avoid accidental problems caused by incorrect reference counting.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
It is hard to understand why the refcnt is increased when it isn't done
near the actual place the new reference is used. So using kref_get right
before the place which requires the reference and in the same function
helps to avoid accidental problems caused by incorrect reference counting.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
It is hard to understand why the refcnt is increased when it isn't done
near the actual place the new reference is used. So using kref_get right
before the place which requires the reference and in the same function
helps to avoid accidental problems caused by incorrect reference counting.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
It is hard to understand why the refcnt is increased when it isn't done
near the actual place the new reference is used. So using kref_get right
before the place which requires the reference and in the same function
helps to avoid accidental problems caused by incorrect reference counting.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
It is hard to understand why the refcnt is increased when it isn't done
near the actual place the new reference is used. So using kref_get right
before the place which requires the reference and in the same function
helps to avoid accidental problems caused by incorrect reference counting.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
It is hard to understand why the refcnt is increased when it isn't done
near the actual place the new reference is used. So using kref_get right
before the place which requires the reference and in the same function
helps to avoid accidental problems caused by incorrect reference counting.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
It is hard to understand why the refcnt is increased when it isn't done
near the actual place the new reference is used. So using kref_get right
before the place which requires the reference and in the same function
helps to avoid accidental problems caused by incorrect reference counting.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
It is hard to understand why the refcnt is increased when it isn't done
near the actual place the new reference is used. So using kref_get right
before the place which requires the reference and in the same function
helps to avoid accidental problems caused by incorrect reference counting.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
It is hard to understand why the refcnt is increased when it isn't done
near the actual place the new reference is used. So using kref_get right
before the place which requires the reference and in the same function
helps to avoid accidental problems caused by incorrect reference counting.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
It is hard to understand why the refcnt is increased when it isn't done
near the actual place the new reference is used. So using kref_get right
before the place which requires the reference and in the same function
helps to avoid accidental problems caused by incorrect reference counting.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
It is hard to understand why the refcnt is increased when it isn't done
near the actual place the new reference is used. So using kref_get right
before the place which requires the reference and in the same function
helps to avoid accidental problems caused by incorrect reference counting.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
It is hard to understand why the refcnt is increased when it isn't done
near the actual place the new reference is used. So using kref_get right
before the place which requires the reference and in the same function
helps to avoid accidental problems caused by incorrect reference counting.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
It is hard to understand why the refcnt is increased when it isn't done
near the actual place the new reference is used. So using kref_get right
before the place which requires the reference and in the same function
helps to avoid accidental problems caused by incorrect reference counting.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
It is hard to understand why the refcnt is increased when it isn't done
near the actual place the new reference is used. So using kref_get right
before the place which requires the reference and in the same function
helps to avoid accidental problems caused by incorrect reference counting.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
It is hard to understand why the refcnt is increased when it isn't done
near the actual place the new reference is used. So using kref_get right
before the place which requires the reference and in the same function
helps to avoid accidental problems caused by incorrect reference counting.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
It is hard to understand why the refcnt is increased when it isn't done
near the actual place the new reference is used. So using kref_get right
before the place which requires the reference and in the same function
helps to avoid accidental problems caused by incorrect reference counting.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
It is hard to understand why the refcnt is increased when it isn't done
near the actual place the new reference is used. So using kref_get right
before the place which requires the reference and in the same function
helps to avoid accidental problems caused by incorrect reference counting.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
It is hard to understand why the refcnt is increased when it isn't done
near the actual place the new reference is used. So using kref_get right
before the place which requires the reference and in the same function
helps to avoid accidental problems caused by incorrect reference counting.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Set the netnsof flag on the family structure, indicating it can
be used with different network name spaces.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Dump the list of bridge loop avoidance backbones via the netlink socket.
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
The bridge loop avoidange is the main information for the debugging of of
bridge loop detection problems. It is therefore necessary when comparing
the bla claim tables.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Dump the list of bridge loop avoidance claims via the netlink socket.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
[sven.eckelmann@open-mesh.com: add policy for attributes, fix includes, fix
soft_iface reference leak]
Signed-off-by: Sven Eckelmann <sven.eckelmann@open-mesh.com>
[sw@simonwunderlich.de: fix kerneldoc, fix error reporting]
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Dump the list of gateways via the netlink socket.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
[sven.eckelmann@open-mesh.com: integrate in batadv_algo_ops]
Signed-off-by: Sven Eckelmann <sven.eckelmann@open-mesh.com>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Add BATADV_CMD_GET_GATEWAYS commands, using handlers bat_gw_dump in
batadv_algo_ops. Will always return -EOPNOTSUPP for now, as no
implementations exist yet.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Dump the algo V originators and neighbours.
Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
[sven@narfation.org: Fix includes, fix algo_ops integration]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Add BATADV_CMD_GET_ORIGINATORS and BATADV_CMD_GET_NEIGHBORS commands,
using handlers bat_orig_dump and bat_neigh_dump in batadv_algo_ops. Will
always return -EOPNOTSUPP for now, as no implementations exist yet.
Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
[sven@narfation.org: Rewrite based on new algo_ops structures]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
The TTVN is the main information for the debugging of translation table
problems. It is therefore necessary when comparing the global translation
tables.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
This adds the commands BATADV_CMD_GET_TRANSTABLE_LOCAL and
BATADV_CMD_GET_TRANSTABLE_GLOBAL, which correspond to the transtable_local
and transtable_global debugfs files.
The batadv_tt_client_flags enum is moved to the UAPI to expose it as part
of the netlink API.
Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
[sven.eckelmann@open-mesh.com: add policy for attributes, fix includes]
Signed-off-by: Sven Eckelmann <sven.eckelmann@open-mesh.com>
[sw@simonwunderlich.de: fix VID attributes content]
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
BATADV_CMD_GET_HARDIFS will return the list of hardifs (including index,
name and MAC address) of all hardifs for a given softif.
Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
[sven.eckelmann@open-mesh.com: Reduce the number of changes to
BATADV_CMD_GET_HARDIFS, add policy for attributes]
Signed-off-by: Sven Eckelmann <sven.eckelmann@open-mesh.com>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
BATADV_CMD_GET_ROUTING_ALGOS is used to get the list of supported routing
algorithms.
Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
[sven.eckelmann@open-mesh.com: Reduce the number of changes to
BATADV_CMD_GET_ROUTING_ALGOS, fix includes]
Signed-off-by: Sven Eckelmann <sven.eckelmann@open-mesh.com>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Debugfs is not netns aware. It thus has problems when the same
interface name exists in multiple network name spaces.
Work around this by not creating entries for interfaces in name spaces
other than the default name space. This means meshes in network
namespaces cannot be managed via debugfs, but there will soon be a
netlink interface which is netns aware.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
batman-adv tries to prevent the user from placing a batX soft
interface into another batman mesh as a hard interface. It does this
by walking up the devices list of parents and ensures they are all
none batX interfaces. iflink can point to an interface in a different
namespace, so also retrieve the parents name space when finding the
parent and use it when doing the comparison.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
[sven@narfation.org: Fix alignments, simplify parent netns retrieval]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
The debug messages of _batadv_update_route were printed before the actual
route change is done. At this point it is not really known which
curr_router will be replaced. Thus the messages could print the wrong
operation.
Printing the debug messages after the operation was done avoids this
problem.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
This silences the following coccinelle warning:
"WARNING: sum of probable bitmasks, consider |"
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The function batadv_send_skb_unicast is not acquiring a reference for an
orig_node nor removing it from any datastructure. It still reduces the
reference counter for an object which is still in the hands of the caller.
This is confusing and can lead in the future to problems in the reference
handling of the caller function.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Acked-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The translation table (global, local) is usually the part of batman-adv
which has the most dynamical allocated objects. Most of them
(tt_local_entry, tt_global_entry, tt_orig_list_entry, tt_change_node,
tt_req_node, tt_roam_node) are equally sized. So it makes sense to have
them allocated from a kmem_cache for each type.
This approach allowed a small wireless router (TP-Link TL-841NDv8; SLUB
allocator) to store 34% more translation table entries compared to the
current implementation.
[1] https://open-mesh.org/projects/batman-adv/wiki/Kmalloc-kmem-cache-tests
Reported-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
This patch abstracts the forward packet creation into the new function
batadv_forw_packet_alloc().
The queue counting and interface reference counters are now handled
internally within batadv_forw_packet_alloc() and its
batadv_forw_packet_free() counterpart. This should reduce the risk of
having reference/queue counting bugs again and should increase
code readibility.
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
net/batman-adv/bridge_loop_avoidance.c:1105:9-10: WARNING: return of 0/1 in function 'batadv_bla_process_claim' with return type bool
Return statements in functions returning bool should use
true/false instead of 1/0.
Generated by: scripts/coccinelle/misc/boolreturn.cocci
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The difference between tq1 and tq2 are calculated the same way in two
separate functions.
This patch moves the common code to a separate function
'batadv_iv_ogm_neigh_diff' which handles everything necessary. The other
two functions can then handle errors and use the difference directly.
Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
[sven@narfation.org: rebased on current version, initialize return variable
in batadv_iv_ogm_neigh_diff, add kerneldoc, convert to bool return type]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Now that the GW-mode code is algorithm specific, batman-adv expects the
routing algorithm to implement some APIs to make it work.
However, such APIs are not mandatory, therefore we might have algorithms
not providing them. In this case all the sysfs knobs related to GW-mode
should be deactivated to make sure that settings injected by the user
for this feature are rejected.
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Since the GW selection logic has been made routing protocol specific
it is now possible for B.A.T.M.A.N V to have its own mechanism by
providing the API implementation.
Implement the GW specific API in the B.A.T.M.A.N. V protocol in
order to provide a working GW selection mechanism.
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Each routing protocol may have its own specific logic about
gateway election which is potentially based on the metric being
used.
Create two GW specific API functions and move the current election
logic in the B.A.T.M.A.N. IV specific code.
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The B.A.T.M.A.N. V algorithm uses a different metric compared to its
predecessor and for this reason the logic used to compute the best
Gateway is also changed. This means that the GW selection class
fed to this logic has a semantics that depends on the algorithm being
used.
Make the parsing and printing routine of the GW selection class
routing algorithm specific. Each algorithm can now parse (and print)
this value independently.
If no API is provided by any algorithm, the default is to use the
current mechanism of considering such value like an integer between
1 and 255.
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The kobject_put is only removing the sysfs entry and corresponding entries
when its reference counter becomes zero. This tends to lead to collisions
when a device is moved between two different network namespaces because
some of the sysfs files have to be removed first and then added again to
the already moved sysfs entry.
WARNING: CPU: 0 PID: 290 at lib/kobject.c:240 kobject_add_internal+0x5ec/0x8a0
kobject_add_internal failed for batman_adv with -EEXIST, don't try to register things with the same name in the same directory.
But the caller of kobject_put can already remove the sysfs entry before it
does the kobject_put. This removal is done even when the reference counter
is not yet zero and thus avoids the problem.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Postponing the removal of the interface breaks the expected behavior of
NETDEV_UNREGISTER and NETDEV_PRE_TYPE_CHANGE. This is especially
problematic when an interface is removed and added in quick succession.
This reverts commit 5bc44dc845 ("batman-adv: postpone sysfs removal when
unregistering").
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The legacy sysfs interface to modify interfaces belonging to batman-adv
is run inside a region holding s_lock. And to add a net_device, it has
to also get the rtnl_lock. This is exactly the other way around than in
other virtual net_devices and conflicts with netdevice notifier which
executes inside rtnl_lock.
The inverted lock situation is currently solved by executing the removal
of netdevices via workqueue. The workqueue isn't executed inside
rtnl_lock and thus can independently get the s_lock and the rtnl_lock.
But this workaround fails when the netdevice notifier creates events in
quick succession and the earlier triggered removal of a net_device isn't
processed in the workqueue before the adding of the new netdevice (with
same name) event is issued.
Instead the legacy sysfs interface store events have to be enqueued in
a workqueue to loose the s_lock. The worker is then free to get the
required locks and the deadlock is avoided.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The batman-adv module can automatically be loaded when operations over the
rtnl link are triggered. This requires only the correct rtnl link name in
the module header.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Some operations in batadv_algo_ops are optional and marked as such in the
kerneldoc. But some of them miss the "(optional)" in their kerneldoc. These
have to also be marked to give an implementor of an algorithm the correct
background information without looking in the code calling these function
pointers.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
This is helpful to detect at compile-time errors related to format
strings.
Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The code using this variable has been commented out in the past as it
was causing issues in upperlimited link-sharing scenarios.
Signed-off-by: Michal Soltys <soltys@ziu.info>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch simplifies how we update fsc and calculate vt from it - while
keeping the expected functionality identical with how hfsc behaves
curently. It also fixes a certain issue introduced with
a very old patch.
The idea is, that instead of correcting cl_vt before fsc curve update
(rtsc_min) and correcting cl_vt after calculation (rtsc_y2x) to keep
cl_vt local to the current period - we can simply rely on virtual times
and curve values always being in sync - analogously to how rsc and usc
function, except that we use virtual time here.
Why hasn't it been done since the beginning this way ? The likely scenario
(basing on the code trying to correct curves whenever possible) was to
keep the virtual times as small as possible - as they have tendency to
"gallop" forward whenever their siblings and other fair sharing
subtrees are idling. On top of that, current code is subtly bugged, so
cumulative time (without any corrections) is always kept and used in
init_vf() when a new backlog period begins (using cl_cvtoff).
Is cumulative value safe ? Generally yes, though corner cases are easy
to create. For example consider:
1gbit interface
some 100kbit leaf, everything else idle
With current tick (64ns) 1s is 15625000 ticks, but the leaf is alone and
it's virtual time, so in reality it's 10000 times more. ITOW 38 bits are
needed to hold 1 second. 54 - 1 day, 59 - 1 month, 63 - 1 year (all
logarithms rounded up). It's getting somewhat dangerous, but also
requires setup excusing this kind of values not mentioning permanently
backlogged class for a year. In near most extreme case (10gbit, 10kbit
leaf), we have "enough" to hold ~13.6 days in 64 bits.
Well, the issue remains mostly theoretical and cl_cvtoff has been
working fine for all those years. Sensible configuration are de-facto
immune to this issue, and not so sensible can solve it with a cronjob
and its period inversely proportional to the insanity of such setup =)
Now let's explain the subtle bug mentioned earlier.
The issue is related to how offsets are kept and how we calculate
virtual times and update fair service curve(s). The issue itself is
subtle, but easy to observe with long m1 segments. It was introduced in
rather old patch:
Commit 99296150c7: "[NET_SCHED]: O(1) children vtoff adjustment
in HFSC scheduler"
(available in git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git)
Originally when a new backlog period was started, cl_vtoff of each
sibling was updated with cl_cvtmax from past period - naturally moving
all cl_vt to proper starting point. That patch adjusted it so cumulative
offset is kept in the parent, and there is no need for traversing the
list (as any subsequent child activation derives new vt from already
active sibling(s)).
But with this change, cl_vtoff (of each sibling) is no longer persistent
across the inactivity periods, as it's calculated from parent's
cl_cvtoff on a new backlog period, conflicting with the following curve
correction from the previous period:
if (cl->cl_virtual.x == vt) {
cl->cl_virtual.x -= cl->cl_vtoff;
cl->cl_vtoff = 0;
}
This essentially tries to keep curve as if it was local to the period
and resets cl_vtoff (cumulative vt offset of the class) to 0 when
possible (read: when we have an intersection or if a new curve is below
the old one). But then it's recalculated from cl_cvtoff on next active
period. Then rtsc_min() call preceding the above if() doesn't really
do what we expect it to do in such scenario - as it calculates the
minimum of corrected curve (from the previous backlog period) and the
new uncorrected curve (with offset derived from cl_cvtoff).
Example:
tc class add dev $ife parent 1:0 classid 1:1 hfsc ls m2 100mbit ul m2 100mbit
tc class add dev $ife parent 1:1 classid 1:10 hfsc ls m1 80mbit d 10s m2 20mbit
tc class add dev $ife parent 1:1 classid 1:11 hfsc ls m2 20mbit
start B, keep it backlogged, let it run 6s (30s worth of vt as A is idle)
pause B briefly to force cl_cvtoff update in parent (whole 1:1 going idle)
start A, let it run 10s
pause A briefly to force rtsc_min()
At this point we would expect A to continue at 20mbit after a brief
moment of 80mbit. But instead A will use 80mbit for full 10s again. It's
the effect of first correcting A (during 'start A'), and then - after
unpausing - calculating rtsc_min() from old corrected and new uncorrected
curve.
The patch fixes this bug and keepis vt and fsc in sync (virtual times
are cumulative, not local to the backlog period).
Signed-off-by: Michal Soltys <soltys@ziu.info>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based on RFC3376 5.1 and RFC3810 6.1
If the per-interface listening change that triggers the new report is
a filter mode change, then the next [Robustness Variable] State
Change Reports will include a Filter Mode Change Record. This
applies even if any number of source list changes occur in that
period.
Old State New State State Change Record Sent
--------- --------- ------------------------
INCLUDE (A) EXCLUDE (B) TO_EX (B)
EXCLUDE (A) INCLUDE (B) TO_IN (B)
So we should not send source-list change if there is a filter-mode change.
Here are two scenarios:
1. Group deleted and filter mode is EXCLUDE, which means we need send a
TO_IN { }.
2. Not group deleted, but has pcm->crcount, which means we need send a
normal filter-mode-change.
At the same time, if the type is ALLOW or BLOCK, and have psf->sf_crcount,
we stop add records and decrease sf_crcount directly
Reference: https://www.ietf.org/mail-archive/web/magma/current/msg01274.html
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that ipconfig learned to handle "delayed replies" in the previous
commit, there is no reason any more to delay sending a first request per
device.
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The dhcp code only waits 1s between sending DHCP requests on different
devices and only accepts an answer for the device that sent out the last
request. Only the timeout at the end of a loop is increased iteratively
which favours only the last device. This makes it impossible to work
with a dhcp server that takes little more than 1s connected to a device
that is not the last one.
Instead of also increasing the inter-device timeout, teach the code to
handle delayed replies.
To accomplish that, make *ic_dev track the current ic_device instead of
the current net_device and adapt all users accordingly. The relevant
change then is to reset d to ic_dev on a reply to assert that the
followup request goes through the right device.
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This simplifies understanding what happens when there is more than one
device.
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Admin should be able to set any state. Currently, this fails
when lladdr is not changed and state is changed from
NUD_CONNECTED to NUD_STALE:
ip neigh add 192.168.8.1 lladdr 00:11:22:33:44:55 nud perm dev wlan0
ip neigh show to 192.168.8.1
192.168.8.1 dev wlan0 lladdr 00:11:22:33:44:55 PERMANENT
ip neigh change 192.168.8.1 lladdr 00:11:22:33:44:55 nud stale dev wlan0
ip neigh show to 192.168.8.1
192.168.8.1 dev wlan0 lladdr 00:11:22:33:44:55 PERMANENT
Problem may be from 2.1.X days.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Reviewed-by: Chunhui He <hchunhui@mail.ustc.edu.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 52253db924 ("sctp: also point GSO head_skb to the sk when
it's available") used event->chunk->head_skb to get the head_skb in
sctp_ulpevent_set_owner().
But at that moment, the event->chunk was NULL, as it cloned the skb
in sctp_ulpevent_make_rcvmsg(). Therefore, that patch didn't really
work.
This patch is to move the event->chunk initialization before calling
sctp_ulpevent_receive_data() so that it uses event->chunk when it's
valid.
Fixes: 52253db924 ("sctp: also point GSO head_skb to the sk when it's available")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When having skbs on ingress with CHECKSUM_COMPLETE, tc BPF programs don't
push rcsum of mac header back in and after BPF run back pull out again as
opposed to some other subsystems (ovs, for example).
For cases like q-in-q, meaning when a vlan tag for offloading is already
present and we're about to push another one, then skb_vlan_push() pushes the
inner one into the skb, increasing mac header and skb_postpush_rcsum()'ing
the 4 bytes vlan header diff. Likewise, for the reverse operation in
skb_vlan_pop() for the case where vlan header needs to be pulled out of the
skb, we're decreasing the mac header and skb_postpull_rcsum()'ing the 4 bytes
rcsum of the vlan header that was removed.
However mangling the rcsum here will lead to hw csum failure for BPF case,
since we're pulling or pushing data that was not part of the current rcsum.
Changing tc BPF programs in general to push/pull rcsum around BPF_PROG_RUN()
is also not really an option since current behaviour is ABI by now, but apart
from that would also mean to do quite a bit of useless work in the sense that
usually 12 bytes need to be rcsum pushed/pulled also when we don't need to
touch this vlan related corner case. One way to fix it would be to push the
necessary rcsum fixup down into vlan helpers that are (mostly) slow-path
anyway.
Fixes: 4e10df9a60 ("bpf: introduce bpf_skb_vlan_push/pop() helpers")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
bpf_skb_store_bytes() invocations above L2 header need BPF_F_RECOMPUTE_CSUM
flag for updates, so that CHECKSUM_COMPLETE will be fixed up along the way.
Where we ran into an issue with bpf_skb_store_bytes() is when we did a
single-byte update on the IPv6 hoplimit despite using BPF_F_RECOMPUTE_CSUM
flag; simple ping via ICMPv6 triggered a hw csum failure as a result. The
underlying issue has been tracked down to a buffer alignment issue.
Meaning, that csum_partial() computations via skb_postpull_rcsum() and
skb_postpush_rcsum() pair invoked had a wrong result since they operated on
an odd address for the hoplimit, while other computations were done on an
even address. This mix doesn't work as-is with skb_postpull_rcsum(),
skb_postpush_rcsum() pair as it always expects at least half-word alignment
of input buffers, which is normally the case. Thus, instead of these helpers
using csum_sub() and (implicitly) csum_add(), we need to use csum_block_sub(),
csum_block_add(), respectively. For unaligned offsets, they rotate the sum
to align it to a half-word boundary again, otherwise they work the same as
csum_sub() and csum_add().
Adding __skb_postpull_rcsum(), __skb_postpush_rcsum() variants that take the
offset as an input and adapting bpf_skb_store_bytes() to them fixes the hw
csum failures again. The skb_postpull_rcsum(), skb_postpush_rcsum() helpers
use a 0 constant for offset so that the compiler optimizes the offset & 1
test away and generates the same code as with csum_sub()/_add().
Fixes: 608cd71a9c ("tc: bpf: generalize pedit action")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Follow-up to commit f8ffad69c9 ("bpf: add skb_postpush_rcsum and fix
dev_forward_skb occasions") to fix an issue for dev_queue_xmit() redirect
locations which need CHECKSUM_COMPLETE fixups on ingress.
For the same reasons as described in f8ffad69c9 already, we of course
also need this here, since dev_queue_xmit() on a veth device will let us
end up in the dev_forward_skb() helper again to cross namespaces.
Latter then calls into skb_postpull_rcsum() to pull out L2 header, so
that netif_rx_internal() sees CHECKSUM_COMPLETE as it is expected. That
is, CHECKSUM_COMPLETE on ingress covering L2 _payload_, not L2 headers.
Also here we have to address bpf_redirect() and bpf_clone_redirect().
Fixes: 3896d655f4 ("bpf: introduce bpf_clone_redirect() helper")
Fixes: 27b29f6305 ("bpf: add bpf_redirect() helper")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since 'ss' always adds TCPF_CLOSE to idiag_states flags, sctp_diag can't
rely upon TCPF_LISTEN flag solely being present when listening sockets
are requested.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
The asoc's timer value is not kept in asoc->timeouts array but in it's
primary transport instead.
Furthermore, we must export the timer only if it is pending, otherwise
the value will underrun when stored in an unsigned variable and
user space will only see a very large timeout value.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
The callback function of call_rcu() just calls a kfree(), so we
can use kfree_rcu() instead of call_rcu() + callback function.
Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Fixes the following sparse warning:
net/ceph/mon_client.c:577:6: warning:
symbol 'cancel_generic_request' was not declared. Should it be static?
Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In case of error, the function ceph_alloc_page_vector() returns
ERR_PTR() and never returns NULL. The NULL test in the return value
check should be replaced with IS_ERR().
Fixes: 1907920324 ('libceph: support for sending notifies')
Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Passing nsproxy into sysctl_table_root.lookup was a premature
optimization in attempt to avoid depending on current. The
directory /proc/self/sys has not appeared and if and when
it does this code will need to be reviewed closely and reworked
anyway. So remove the premature optimization.
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Do not drop packet when CSeq is 0 as 0 is also a valid value for CSeq.
simple_strtoul() will return 0 either when all digits are 0
or if there are no digits at all. Therefore when simple_strtoul()
returns 0 we check if first character is digit 0 or not.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If we find a matching element that is inactive with no descendants, we
jump to the found label, then crash because of nul-dereference on the
left branch.
Fix this by checking that the element is active and not an interval end
and skipping the logic that only applies to the tree iteration.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Tested-by: Anders K. Pedersen <akp@akp.dk>
Commit 96d1327ac2 ("netfilter: h323: Use mod_timer instead of
set_expect_timeout") just simplify the source codes
if (!del_timer(&exp->timeout))
return 0;
add_timer(&exp->timeout);
to mod_timer(&exp->timeout, jiffies + info->timeout * HZ);
This is not correct, and introduce a race codition:
CPU0 CPU1
- timer expire
process_rcf expectation_timed_out
lock(exp_lock) -
find_exp waiting exp_lock...
re-activate timer!! waiting exp_lock...
unlock(exp_lock) lock(exp_lock)
- unlink expect
- free(expect)
- unlock(exp_lock)
So when the timer expires again, we will access the memory that
was already freed.
Replace mod_timer with mod_timer_pending here to fix this problem.
Fixes: 96d1327ac2 ("netfilter: h323: Use mod_timer instead of set_expect_timeout")
Cc: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This reverts commit 3d5fdff46c.
Ben Hutchings pointed out that the commit isn't safe since it assumes
that the structure used by the driver is iw_point, when in fact there's
no way to know about that.
Fortunately, the only driver in the tree that ever runs this code path
is the wilc1000 staging driver, so it doesn't really matter.
Clearly I should have investigated this better before applying, sorry.
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Cc: stable@vger.kernel.org [though I guess it doesn't matter much]
Fixes: 3d5fdff46c ("wext: Fix 32 bit iwpriv compatibility issue with 64 bit Kernel")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
* fix 80+80 bandwidth warning
* fix powersave with mac80211 TXQ implementation
* use correct way to free SKBs from multicast buffering
* mesh: fix operation ordering to work with all drivers
* mesh: end service period even when peer goes away
* mesh: correct HT opmode validity checks
* pass hw pointer from mac80211 to driver in TPT method,
fixing a bug (in a bit the wrong way, but that's what
we have right now)
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJXpIcJAAoJEGt7eEactAAdSFUP/0zeMBnYsxm0UYFPKOYf7+rF
P9s88XRpYNiTQqA5YgkaoiSaORMrdj9AeSTIDJ1MDOHVJSQ3jBbmmWUlM7h+VNQw
P6YQp4xw+yxQeB2Lobb0E/7lxpG5nRKFtbPMkDasSJv+0fzGTqm68Cpjs7IMjfOw
+I7ZjWZzClZdpTS4avyziEbpxAdSvJqf9SczLeDw7BjbufsSWKNT8yBPeTNa0Mfz
IVzKh84eEyHBWQqWhqNclA4QMqQPoTQQ1YYqG1lmc8Jiq7/9y5pImedlNyHkiwgY
t4vh7tFEL1HtWKiq9nbO7fSFkZqJHVyNSpdrQSxsx3FFYkcoOEZu0GbeWQhwXr/s
a1l91GgNoH4Sv9xn3YRVPT+1RygzzGR6MUuNiU9DTSdohg+BBscSSBXm7op39H+Z
z+X7z6a1mQAujfCbW1mNJ2Ajymr2RfEAXHRTUo/8/4Y86+wIbTe1vk0jqgkHOIV2
9Z1Nt/83iP12ON5s1Tnh1H619Pv+UXxujMV3plWPeaPTxG3F34Xnpsnw2AE1cAZ5
Mu0sMMfh9w2rPo5miPyMpU7dJo2mY95qC/+aosZlbeAMyEPqRtSE3sHLzEkUyuJI
5VskVEIBYukIahsRN9Qd9FldQNwcZuqFpo43qkDYkE67Q3/oNokAlMb9SWv/V6D4
FQmZbX1DcL+iYlAx8rN7
=4hWm
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2016-08-05' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
First set of fixes for the current cycle:
* fix 80+80 bandwidth warning
* fix powersave with mac80211 TXQ implementation
* use correct way to free SKBs from multicast buffering
* mesh: fix operation ordering to work with all drivers
* mesh: end service period even when peer goes away
* mesh: correct HT opmode validity checks
* pass hw pointer from mac80211 to driver in TPT method,
fixing a bug (in a bit the wrong way, but that's what
we have right now)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
- New vsock device support in host and guest
- Platform IOMMU support in host and guest,
including compatibility quirks for legacy systems.
- Misc fixes and cleanups.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJXofvbAAoJECgfDbjSjVRpUTIH/iEoK9h636tBayXy0PXkPby0
6fMaRFy6H1HgEttgDhJE8Pqg/ba3qaW9Em0fHyFq7Mp2waFHAZ8hAT8phC6TAK3c
CIBnfzyyuI8u3N9SnNOfelPVcwCBfuALuuTsXB/rwKbYQEVv+U5Rdt3Vyx9+lXkj
P005klz7PfqxFhQrrnj4Eh7VawtHwmMuLH8YoWpCZpM71dHPo6eL+3ftKwhH2boo
qK86uVprwba03Pewpm13vQnotemfVfUUkjXd4EJpG3dx7E0KZosuj0ZG9OV8mPGQ
Cl2gBdUhocdJgeUnAHmf6tumYi9KFlYfy6xLy44YMmN7FL3E9nQjaKZp25UKfiM=
=ztIm
-----END PGP SIGNATURE-----
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio/vhost updates from Michael Tsirkin:
- new vsock device support in host and guest
- platform IOMMU support in host and guest, including compatibility
quirks for legacy systems.
- misc fixes and cleanups.
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
VSOCK: Use kvfree()
vhost: split out vringh Kconfig
vhost: detect 32 bit integer wrap around
vhost: new device IOTLB API
vhost: drop vringh dependency
vhost: convert pre sorted vhost memory array to interval tree
vhost: introduce vhost memory accessors
VSOCK: Add Makefile and Kconfig
VSOCK: Introduce vhost_vsock.ko
VSOCK: Introduce virtio_transport.ko
VSOCK: Introduce virtio_vsock_common.ko
VSOCK: defer sock removal to transports
VSOCK: transport-specific vsock_transport functions
vhost: drop vringh dependency
vop: pull in vhost Kconfig
virtio: new feature to detect IOMMU device quirk
balloon: check the number of available pages in leak balloon
vhost: lockless enqueuing
vhost: simplify work flushing
Inside the kafs filesystem it is possible to occasionally have a call
processed and terminated before we've had a chance to check whether we need
to clean up the rx queue for that call because afs_send_simple_reply() ends
the call when it is done, but this is done in a workqueue item that might
happen to run to completion before afs_deliver_to_call() completes.
Further, it is possible for rxrpc_kernel_send_data() to be called to send a
reply before the last request-phase data skb is released. The rxrpc skb
destructor is where the ACK processing is done and the call state is
advanced upon release of the last skb. ACK generation is also deferred to
a work item because it's possible that the skb destructor is not called in
a context where kernel_sendmsg() can be invoked.
To this end, the following changes are made:
(1) kernel_rxrpc_data_consumed() is added. This should be called whenever
an skb is emptied so as to crank the ACK and call states. This does
not release the skb, however. kernel_rxrpc_free_skb() must now be
called to achieve that. These together replace
rxrpc_kernel_data_delivered().
(2) kernel_rxrpc_data_consumed() is wrapped by afs_data_consumed().
This makes afs_deliver_to_call() easier to work as the skb can simply
be discarded unconditionally here without trying to work out what the
return value of the ->deliver() function means.
The ->deliver() functions can, via afs_data_complete(),
afs_transfer_reply() and afs_extract_data() mark that an skb has been
consumed (thereby cranking the state) without the need to
conditionally free the skb to make sure the state is correct on an
incoming call for when the call processor tries to send the reply.
(3) rxrpc_recvmsg() now has to call kernel_rxrpc_data_consumed() when it
has finished with a packet and MSG_PEEK isn't set.
(4) rxrpc_packet_destructor() no longer calls rxrpc_hard_ACK_data().
Because of this, we no longer need to clear the destructor and put the
call before we free the skb in cases where we don't want the ACK/call
state to be cranked.
(5) The ->deliver() call-type callbacks are made to return -EAGAIN rather
than 0 if they expect more data (afs_extract_data() returns -EAGAIN to
the delivery function already), and the caller is now responsible for
producing an abort if that was the last packet.
(6) There are many bits of unmarshalling code where:
ret = afs_extract_data(call, skb, last, ...);
switch (ret) {
case 0: break;
case -EAGAIN: return 0;
default: return ret;
}
is to be found. As -EAGAIN can now be passed back to the caller, we
now just return if ret < 0:
ret = afs_extract_data(call, skb, last, ...);
if (ret < 0)
return ret;
(7) Checks for trailing data and empty final data packets has been
consolidated as afs_data_complete(). So:
if (skb->len > 0)
return -EBADMSG;
if (!last)
return 0;
becomes:
ret = afs_data_complete(call, skb, last);
if (ret < 0)
return ret;
(8) afs_transfer_reply() now checks the amount of data it has against the
amount of data desired and the amount of data in the skb and returns
an error to induce an abort if we don't get exactly what we want.
Without these changes, the following oops can occasionally be observed,
particularly if some printks are inserted into the delivery path:
general protection fault: 0000 [#1] SMP
Modules linked in: kafs(E) af_rxrpc(E) [last unloaded: af_rxrpc]
CPU: 0 PID: 1305 Comm: kworker/u8:3 Tainted: G E 4.7.0-fsdevel+ #1303
Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
Workqueue: kafsd afs_async_workfn [kafs]
task: ffff88040be041c0 ti: ffff88040c070000 task.ti: ffff88040c070000
RIP: 0010:[<ffffffff8108fd3c>] [<ffffffff8108fd3c>] __lock_acquire+0xcf/0x15a1
RSP: 0018:ffff88040c073bc0 EFLAGS: 00010002
RAX: 6b6b6b6b6b6b6b6b RBX: 0000000000000000 RCX: ffff88040d29a710
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88040d29a710
RBP: ffff88040c073c70 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: ffff88040be041c0 R15: ffffffff814c928f
FS: 0000000000000000(0000) GS:ffff88041fa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fa4595f4750 CR3: 0000000001c14000 CR4: 00000000001406f0
Stack:
0000000000000006 000000000be04930 0000000000000000 ffff880400000000
ffff880400000000 ffffffff8108f847 ffff88040be041c0 ffffffff81050446
ffff8803fc08a920 ffff8803fc08a958 ffff88040be041c0 ffff88040c073c38
Call Trace:
[<ffffffff8108f847>] ? mark_held_locks+0x5e/0x74
[<ffffffff81050446>] ? __local_bh_enable_ip+0x9b/0xa1
[<ffffffff8108f9ca>] ? trace_hardirqs_on_caller+0x16d/0x189
[<ffffffff810915f4>] lock_acquire+0x122/0x1b6
[<ffffffff810915f4>] ? lock_acquire+0x122/0x1b6
[<ffffffff814c928f>] ? skb_dequeue+0x18/0x61
[<ffffffff81609dbf>] _raw_spin_lock_irqsave+0x35/0x49
[<ffffffff814c928f>] ? skb_dequeue+0x18/0x61
[<ffffffff814c928f>] skb_dequeue+0x18/0x61
[<ffffffffa009aa92>] afs_deliver_to_call+0x344/0x39d [kafs]
[<ffffffffa009ab37>] afs_process_async_call+0x4c/0xd5 [kafs]
[<ffffffffa0099e9c>] afs_async_workfn+0xe/0x10 [kafs]
[<ffffffff81063a3a>] process_one_work+0x29d/0x57c
[<ffffffff81064ac2>] worker_thread+0x24a/0x385
[<ffffffff81064878>] ? rescuer_thread+0x2d0/0x2d0
[<ffffffff810696f5>] kthread+0xf3/0xfb
[<ffffffff8160a6ff>] ret_from_fork+0x1f/0x40
[<ffffffff81069602>] ? kthread_create_on_node+0x1cf/0x1cf
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net_device->ndo_set_rx_headroom (introduced in
871b642ade) says
"Setting a negtaive value reset the rx headroom
to the default value".
It seems that the OVS implementation in
3a927bc7cf overlooked this and sets
dev->needed_headroom unconditionally.
This doesn't have an immediate effect, but can mess up later
LL_RESERVED_SPACE calculations, such as done in
net/ipv6/mcast.c:mld_newpack. For reference, this issue was found
from a skb_panic raised there after the length calculations had given
the wrong result.
Note the other current users of this interface
(drivers/net/tun.c:tun_set_headroom and
drivers/net/veth.c:veth_set_rx_headroom) are both checking this
correctly thus need no modification.
Thanks to Ben for some pointers from the crash dumps!
Cc: Benjamin Poirier <bpoirier@suse.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1361414
Signed-off-by: Ian Wienand <iwienand@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We don't want to miss a lease period renewal due to the TCP connection
failing to reconnect in a timely fashion. To ensure this doesn't happen,
cap the reconnection timer so that we retry the connection attempt
at least every 1/2 lease period.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
When the connect attempt fails and backs off, we should start the clock
at the last connection attempt, not time at which we queue up the
reconnect job.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If the net.ipv6.conf.*.use_temp_addr sysctl is set to '2',
then TCP connections over IPv6 will prefer a 'private' source
address.
These eventually expire and become invalid, typically after a week,
but the time is configurable.
When the local address becomes invalid the client will not be able to
receive replies from the server. Eventually the connection will timeout
or break and a new connection will be established, but this can take
half an hour (typically TCP connection break time).
RFC 4941, which describes private IPv6 addresses, acknowledges that some
applications might not work well with them and that the application may
explicitly a request non-temporary (i.e. "public") address.
I believe this is correct for SUNRPC clients. Without this change, a
client will occasionally experience a long delay if private addresses
have been enabled.
The privacy offered by private addresses is of little value for an NFS
server which requires client authentication.
For NFSv3 this will often not be a problem because idle connections are
closed after 5 minutes. For NFSv4 connections never go idle due to the
period RENEW (or equivalent) request.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
It's possible to have simultaneous upcalls for the same UIDs but
different GSS service. In that case, we need to allow for the
upcall to gssd to proceed so that not the same context is used
by two different GSS services. Some servers lock the use of context
to the GSS service.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Cc: stable@vger.kernel.org # v3.9+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The variable is added to allow the driver an easy access to
it's own hw->priv when the op is invoked.
This fixes a crash in wlcore because it was relying on a
station pointer that wasn't initialized yet. It's the wrong
way to fix the crash, but it solves the problem for now and
it does make sense to have the hw pointer here.
Signed-off-by: Maxim Altshul <maxim.altshul@ti.com>
[rewrite commit message, fix indentation]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Previously, NL80211_MESHCONF_HT_OPMODE validation rejected correct
flag combinations, e.g. IEEE80211_HT_OP_MODE_PROTECTION_NONHT_MIXED |
IEEE80211_HT_OP_MODE_NON_HT_STA_PRSNT.
Doing just a range-check allows setting flags that don't exist (0x8)
and invalid flag combinations.
Implements some checks based on IEEE 802.11 2012 8.4.2.59 "HT
Operation element".
Signed-off-by: Masashi Honma <masashi.honma@gmail.com>
[reword commit message, simplify a bit]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If QoS frame with EOSP (end of service period) subfield=1 sent by local
peer was not acked by remote peer, local peer did not end the MPSP. This
prevents local peer from going to DOZE state. And if the remote peer
goes away without closing connection, local peer continues AWAKE state
and wastes battery.
Signed-off-by: Masashi Honma <masashi.honma@gmail.com>
Acked-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The code currently assumes that buffered multicast PS frames don't have
a pending ACK frame for tx status reporting.
However, hostapd sends a broadcast deauth frame on teardown for which tx
status is requested. This can lead to the "Have pending ack frames"
warning on module reload.
Fix this by using ieee80211_free_txskb/ieee80211_purge_tx_queue.
Cc: stable@vger.kernel.org
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Trond made a change to the server's tcp logic that allows a fast
client to better take advantage of high bandwidth networks, but
may increase the risk that a single client could starve other
clients; a new sunrpc.svc_rpc_per_connection_limit parameter
should help mitigate this in the (hopefully unlikely) event this
becomes a problem in practice.
Tom Haynes added a minimal flex-layout pnfs server, which is of
no use in production for now--don't build it unless you're doing
client testing or further server development.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXo7HNAAoJECebzXlCjuG+zqUP/RxO5jZjBhNI8/ayGdDW/Jnq
s0Fu6B+aNRV3GnugmIeI4tWNGnPyERNzFtjLKlnwaasz/oW4qBLqGbNUWC5xKARS
erODs0hM/1aCYWwNBEc5qXP2u23HrWVuQ+B5fg42ACyliKFGq5faDRmf6XGU/1kB
8unXGWPAiLiNZD/bWP91fYhThlLgpfHBFZ7M3G2IqmzWZTSELPzwp1bpRWt7yWQQ
z1oYtXToycbwz3yPVk3cXtaoqpjDUVZf2Guqgqi1BwEyEtYOSaYo1VHNsKDf4OId
QXQh64AqIK4uszpvtNhvsEaAECN7IiB+N4n2laFiQVmAf8Hfl3AnV/gKeD4lKmTj
TY6knnjZO/X88wn80MB7JR1H1WXvvzNIHwNR95qfub/lVKX+C+0AORRtYhi5F9ec
ixNs/z1ImLpYxAjiP/T5anD5xcX2S+LcSv7kRjhEufqNFtRAIqBZO9ZWbCdXAAyE
tcH9Cru4jeIlFO/y6O61EVrn9FFj2+0uu+7urefNRQ2Y9pmKeculJrLF6WO8WHms
4IzXMmjZK+358RVdX2Ji5Hw6rBDvfgP+LjB8Jn8CeIiNRONEjT+2/AYQcfk61aLb
INUbk6G6Vfd8iMO4aaRI9tmW+vKCOZa0IbnrNE1oHKp/AKBDr25i5YPSCsnl3r4Q
iR7rRe9FIkfqBpbfjVFv
=mo54
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.8' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"Highlights:
- Trond made a change to the server's tcp logic that allows a fast
client to better take advantage of high bandwidth networks, but may
increase the risk that a single client could starve other clients;
a new sunrpc.svc_rpc_per_connection_limit parameter should help
mitigate this in the (hopefully unlikely) event this becomes a
problem in practice.
- Tom Haynes added a minimal flex-layout pnfs server, which is of no
use in production for now--don't build it unless you're doing
client testing or further server development"
* tag 'nfsd-4.8' of git://linux-nfs.org/~bfields/linux: (32 commits)
nfsd: remove some dead code in nfsd_create_locked()
nfsd: drop unnecessary MAY_EXEC check from create
nfsd: clean up bad-type check in nfsd_create_locked
nfsd: remove unnecessary positive-dentry check
nfsd: reorganize nfsd_create
nfsd: check d_can_lookup in fh_verify of directories
nfsd: remove redundant zero-length check from create
nfsd: Make creates return EEXIST instead of EACCES
SUNRPC: Detect immediate closure of accepted sockets
SUNRPC: accept() may return sockets that are still in SYN_RECV
nfsd: allow nfsd to advertise multiple layout types
nfsd: Close race between nfsd4_release_lockowner and nfsd4_lock
nfsd/blocklayout: Make sure calculate signature/designator length aligned
xfs: abstract block export operations from nfsd layouts
SUNRPC: Remove unused callback xpo_adjust_wspace()
SUNRPC: Change TCP socket space reservation
SUNRPC: Add a server side per-connection limit
SUNRPC: Micro optimisation for svc_data_ready
SUNRPC: Call the default socket callbacks instead of open coding
SUNRPC: lock the socket while detaching it
...
The use of config_enabled() against config options is ambiguous. In
practical terms, config_enabled() is equivalent to IS_BUILTIN(), but the
author might have used it for the meaning of IS_ENABLED(). Using
IS_ENABLED(), IS_BUILTIN(), IS_MODULE() etc. makes the intention
clearer.
This commit replaces config_enabled() with IS_ENABLED() where possible.
This commit is only touching bool config options.
I noticed two cases where config_enabled() is used against a tristate
option:
- config_enabled(CONFIG_HWMON)
[ drivers/net/wireless/ath/ath10k/thermal.c ]
- config_enabled(CONFIG_BACKLIGHT_CLASS_DEVICE)
[ drivers/gpu/drm/gma500/opregion.c ]
I did not touch them because they should be converted to IS_BUILTIN()
in order to keep the logic, but I was not sure it was the authors'
intention.
Link: http://lkml.kernel.org/r/1465215656-20569-1-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Stas Sergeev <stsp@list.ru>
Cc: Matt Redfearn <matt.redfearn@imgtec.com>
Cc: Joshua Kinard <kumba@gentoo.org>
Cc: Jiri Slaby <jslaby@suse.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Markos Chandras <markos.chandras@imgtec.com>
Cc: "Dmitry V. Levin" <ldv@altlinux.org>
Cc: yu-cheng yu <yu-cheng.yu@intel.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Will Drewry <wad@chromium.org>
Cc: Nikolay Martynov <mar.kolya@gmail.com>
Cc: Huacai Chen <chenhc@lemote.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com>
Cc: Rafal Milecki <zajec5@gmail.com>
Cc: James Cowgill <James.Cowgill@imgtec.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Alex Smith <alex.smith@imgtec.com>
Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
Cc: Qais Yousef <qais.yousef@imgtec.com>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Mikko Rapeli <mikko.rapeli@iki.fi>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Brian Norris <computersforpeace@gmail.com>
Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: "Luis R. Rodriguez" <mcgrof@do-not-panic.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: Paul Burton <paul.burton@imgtec.com>
Cc: Kalle Valo <kvalo@qca.qualcomm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Tony Wu <tung7970@gmail.com>
Cc: Huaitong Han <huaitong.han@intel.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrea Gelmini <andrea.gelmini@gelma.net>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Rabin Vincent <rabin@rab.in>
Cc: "Maciej W. Rozycki" <macro@imgtec.com>
Cc: David Daney <david.daney@cavium.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ovs_ct_find_existing() issues a warning if an existing conntrack entry
classified as IP_CT_NEW is found, with the premise that this should
not happen. However, a newly confirmed, non-expected conntrack entry
remains IP_CT_NEW as long as no reply direction traffic is seen. This
has resulted into somewhat confusing kernel log messages. This patch
removes this check and warning.
Fixes: 289f2253 ("openvswitch: Find existing conntrack entry after upcall.")
Suggested-by: Joe Stringer <joe@ovn.org>
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some drivers (e.g. wl18xx) expect that the last stage in the
de-initialization process will be stopping the beacons, similar to AP flow.
Update ieee80211_stop_mesh() flow accordingly.
As peers can be removed dynamically, this would not impact other drivers.
Tested also on Ralink RT3572 chipset.
Signed-off-by: Maital Hahn <maitalm@ti.com>
Signed-off-by: Yaniv Machani <yanivma@ti.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
* RADOS namespace support in libceph and CephFS (Zheng Yan and myself).
The stopgaps added in 4.5 to deny access to inodes in namespaces are
removed and CEPH_FEATURE_FS_FILE_LAYOUT_V2 feature bit is now fully
supported.
* A large rework of the MDS cap flushing code (Zheng Yan).
* Handle some of ->d_revalidate() in RCU mode (Jeff Layton). We were
overly pessimistic before, bailing at the first sight of LOOKUP_RCU.
On top of that we've got a few CephFS bug fixes, a couple of cleanups
and Arnd's workaround for a weird genksyms issue.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJXoKLJAAoJEEp/3jgCEfOLDTUIAIcctpKUiNBokc95mQaXYl34
j7lPIaD0/Ur7JPt4nMdtlywYJYSVV2c+SglHztj/+fv0G4bWbLVEFRruh9SwKIci
PzttcmycIAqSn1f5gBZwyQbGuffd/F0EnBj7fFjcukt01i3s1ZQ7t4XtLGtAV0Ts
aIfFtx9SqWig57Z1OZqNgnhnOoh6IqNbic3FL5Hvdl5N5pFbBcQho6Vzoa5O1osH
URG6RmCcO4nykfSoxiivE7UZ+CImsXHkRD7rupBuIjqjZ8wvmZqQF5qxnkb9Dw2F
IkNhrHkTSIiv4EsNPLAETTnFSozrL1nEykKr2FBW+ti8nxNcav+8FgVapqLvFIw=
=gQ0/
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.8-rc1' of git://github.com/ceph/ceph-client
Pull Ceph updates from Ilya Dryomov:
"The highlights are:
- RADOS namespace support in libceph and CephFS (Zheng Yan and
myself). The stopgaps added in 4.5 to deny access to inodes in
namespaces are removed and CEPH_FEATURE_FS_FILE_LAYOUT_V2 feature
bit is now fully supported
- A large rework of the MDS cap flushing code (Zheng Yan)
- Handle some of ->d_revalidate() in RCU mode (Jeff Layton). We were
overly pessimistic before, bailing at the first sight of LOOKUP_RCU
On top of that we've got a few CephFS bug fixes, a couple of cleanups
and Arnd's workaround for a weird genksyms issue"
* tag 'ceph-for-4.8-rc1' of git://github.com/ceph/ceph-client: (34 commits)
ceph: fix symbol versioning for ceph_monc_do_statfs
ceph: Correctly return NXIO errors from ceph_llseek
ceph: Mark the file cache as unreclaimable
ceph: optimize cap flush waiting
ceph: cleanup ceph_flush_snaps()
ceph: kick cap flushes before sending other cap message
ceph: introduce an inode flag to indicates if snapflush is needed
ceph: avoid sending duplicated cap flush message
ceph: unify cap flush and snapcap flush
ceph: use list instead of rbtree to track cap flushes
ceph: update types of some local varibles
ceph: include 'follows' of pending snapflush in cap reconnect message
ceph: update cap reconnect message to version 3
ceph: mount non-default filesystem by name
libceph: fsmap.user subscription support
ceph: handle LOOKUP_RCU in ceph_d_revalidate
ceph: allow dentry_lease_is_valid to work under RCU walk
ceph: clear d_fsinfo pointer under d_lock
ceph: remove ceph_mdsc_lease_release
ceph: don't use ->d_time
...
Ensure that we don't forget to set up the disconnection timer for the
case when a connect request is fulfilled after the RPC request that
initiated it has timed out or been interrupted.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The logic was inverted here, set the bit if frames are pending.
Fixes: ba8c3d6f16 ("mac80211: add an intermediate software queue implementation")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The switch on chandef->width is missing a break on the
NL8211_CHAN_WIDTH_80P80 case; currently we get a WARN_ON when
center_freq2 is non-zero because of the missing break.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Enable virtio-vsock and vhost-vsock.
Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
VM sockets virtio transport implementation. This driver runs in the
guest.
Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
This module contains the common code and header files for the following
virtio_transporto and vhost_vsock kernel modules.
Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
The virtio transport will implement graceful shutdown and the related
SO_LINGER socket option. This requires orphaning the sock but keeping
it in the table of connections after .release().
This patch adds the vsock_remove_sock() function and leaves it up to the
transport when to remove the sock.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
struct vsock_transport contains function pointers called by AF_VSOCK
core code. The transport may want its own transport-specific function
pointers and they can be added after struct vsock_transport.
Allow the transport to fetch vsock_transport. It can downcast it to
access transport-specific function pointers.
The virtio transport will use this.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
This modification is useful for debugging issues that happen while
the socket is being initialised.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We're seeing traces of the following form:
[10952.396347] svc: transport ffff88042ba4a 000 dequeued, inuse=2
[10952.396351] svc: tcp_accept ffff88042ba4 a000 sock ffff88042a6e4c80
[10952.396362] nfsd: connect from 10.2.6.1, port=187
[10952.396364] svc: svc_setup_socket ffff8800b99bcf00
[10952.396368] setting up TCP socket for reading
[10952.396370] svc: svc_setup_socket created ffff8803eb10a000 (inet ffff88042b75b800)
[10952.396373] svc: transport ffff8803eb10a000 put into queue
[10952.396375] svc: transport ffff88042ba4a000 put into queue
[10952.396377] svc: server ffff8800bb0ec000 waiting for data (to = 3600000)
[10952.396380] svc: transport ffff8803eb10a000 dequeued, inuse=2
[10952.396381] svc_recv: found XPT_CLOSE
[10952.396397] svc: svc_delete_xprt(ffff8803eb10a000)
[10952.396398] svc: svc_tcp_sock_detach(ffff8803eb10a000)
[10952.396399] svc: svc_sock_detach(ffff8803eb10a000)
[10952.396412] svc: svc_sock_free(ffff8803eb10a000)
i.e. an immediate close of the socket after initialisation.
The culprit appears to be the test at the end of svc_tcp_init, which
checks if the newly created socket is in the TCP_ESTABLISHED state,
and immediately closes it if not. The evidence appears to suggest that
the socket might still be in the SYN_RECV state at this time.
The fix is to check for both states, and then to add a check in
svc_tcp_state_change() to ensure we don't close the socket when
it transitions into TCP_ESTABLISHED.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
If the connect attempt immediately fails with an EADDRNOTAVAIL error, then
that means our choice of source port number was bad.
This error is expected when we set the SO_REUSEPORT socket option and we
have 2 sockets sharing the same source and destination address and port
combinations.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Fixes: 402e23b4ed ("SUNRPC: Fix stupid typo in xs_sock_set_reuseport")
Cc: stable@vger.kernel.org # v4.0+
Commit 141ddefce7 ("sctp: change sk state to CLOSED instead of
CLOSING in sctp_sock_migrate") changed sk state to CLOSED if the
assoc is closed when sctp_accept clones a new sk.
If there is still data in sk receive queue, users will not be able
to read it any more, as sctp_recvmsg returns directly if sk state
is CLOSED.
This patch is to add CLOSED state check in sctp_recvmsg to allow
reading data from TCP-style sk with CLOSED state as what TCP does.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Prior to this patch, once sctp received SHUTDOWN or shutdown with RD,
sk->sk_shutdown would be set with RCV_SHUTDOWN, and all events would
be dropped in sctp_ulpq_tail_event(). It would cause:
1. some notifications couldn't be received by users. like
SCTP_SHUTDOWN_COMP generated by sctp_sf_do_4_C().
2. sctp would also never trigger sk_data_ready when the association
was closed, making it harder to identify the end of the association
by calling recvmsg() and getting an EOF. It was not convenient for
kernel users.
The check here should be stopping delivering DATA chunks after receiving
SHUTDOWN, and stopping delivering ANY chunks after sctp_close().
So this patch is to allow notifications to enqueue into receive queue
even if sk->sk_shutdown is set to RCV_SHUTDOWN in sctp_ulpq_tail_event,
but if sk->sk_shutdown == RCV_SHUTDOWN | SEND_SHUTDOWN, it drops all
events.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sctp needs to queue auth chunk back when we know that we are going
to generate another segment. But commit f1533cce60 ("sctp: fix
panic when sending auth chunks") requeues the last chunk processed
which is probably not the auth chunk.
It causes panic when calculating the MAC in sctp_auth_calculate_hmac(),
as the incorrect offset of the auth chunk in skb->data.
This fix is to requeue it by using packet->auth.
Fixes: f1533cce60 ("sctp: fix panic when sending auth chunks")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_select_initial_window() intends to advertise a window
scaling for the maximum possible window size. To do so,
it considers the maximum of net.ipv4.tcp_rmem[2] and
net.core.rmem_max as the only possible upper-bounds.
However, users with CAP_NET_ADMIN can use SO_RCVBUFFORCE
to set the socket's receive buffer size to values
larger than net.ipv4.tcp_rmem[2] and net.core.rmem_max.
Thus, SO_RCVBUFFORCE is effectively ignored by
tcp_select_initial_window().
To fix this, consider the maximum of net.ipv4.tcp_rmem[2],
net.core.rmem_max and socket's initial buffer space.
Fixes: b0573dea1f ("[NET]: Introduce SO_{SND,RCV}BUFFORCE socket options")
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Suggested-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using list_move() instead of list_del() + list_add().
Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the error handling case of nla_nest_start() failed read_unlock_bh()
is called to unlock a lock that had not been taken yet. sparse warns
about the context imbalance as the following:
net/tipc/monitor.c:799:23: warning:
context imbalance in '__tipc_nl_add_monitor' - different lock contexts for basic block
Fixes: cf6f7e1d51 ('tipc: dump monitor attributes')
Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Highlights include:
Stable bugfixes:
- nfs: don't create zero-length requests
- Several LAYOUTGET bugfixes
Features:
- Several performance related features
- More aggressive caching when we can rely on close-to-open cache
consistency
- Remove serialisation of O_DIRECT reads and writes
- Optimise several code paths to not flush to disk unnecessarily. However
allow for the idiosyncracies of pNFS for those layout types that need
to issue a LAYOUTCOMMIT before the metadata can be updated on the server.
- SUNRPC updates to the client data receive path
- pNFS/SCSI support RH/Fedora dm-mpath device nodes
- pNFS files/flexfiles can now use unprivileged ports when the generic NFS
mount options allow it.
Bugfixes:
- Don't use RDMA direct data placement together with data integrity or
privacy security flavours
- Remove the RDMA ALLPHYSICAL memory registration mode as it has potential
security holes.
- Several layout recall fixes to improve NFSv4.1 protocol compliance.
- Fix an Oops in the pNFS files and flexfiles connection setup to the DS
- Allow retry of operations that used a returned delegation stateid
- Don't mark the inode as revalidated if a LAYOUTCOMMIT is outstanding
- Fix writeback races in nfs4_copy_range() and nfs42_proc_deallocate()
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXnSq8AAoJEGcL54qWCgDyn8cP/RCHLekUCq7Klh+NAnEsvuBi
C7w9YpVHaC83/8Q0tR6LyFShSBJBWi/clWwO0IEomkNK/MuO77v4iyPujtEyqowK
0+eWFh/e8CsTf7mNGoi0avrHAZDB3deSuOQeYbwnNWHmd7qKVkB6tHus8LQjk852
eqwYmZ4kVr+eaCO6MttCCxJHf6datPnsbe0stiC9MpxmCzsdpZmFptfauidsFX+p
0U1IHi/ABN6zIFoc4R0iXXbaDb8ErxGw32SWIb8cnnWwdlSD8I0+Jqxs4opp23LY
lAm9E0vtDJ49bJBllYl0dUmizdhJ3+NefK4aqPh5H5h3Csub+MLIsuQv/+r2AOhH
qLBi5kThpspPhGHZ40VDmfV825+csUPTc8WkDaNLvb4f4UGIPakK/KBrBtxiqn+P
0etvYiWBuoBaqRVQpstawnyDdnBK0IMF/3LAULo+ozo7iTkpaZmOALYgPcBUYw2f
d6pxZGeNN0GwWfjDmoUDGC07OpO/CSN5WouArgKsp5+VhjzPxjyaZLCnUhzHzXiM
RV1oBytEs/iw2BLXX809noM9mqHYkdgSVmrZ9OvvDMslcLHaslpq6eaJKZSWqV2J
fAws6rbcZdTFSnbAWr0OSxct6w6BijEjc3/uk+wWRtw9nkOhFqtlxI3y7k4odpW9
wVcEmRNkxfA0LlYNXWuL
=WNyE
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.8-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Stable bugfixes:
- nfs: don't create zero-length requests
- several LAYOUTGET bugfixes
Features:
- several performance related features
- more aggressive caching when we can rely on close-to-open
cache consistency
- remove serialisation of O_DIRECT reads and writes
- optimise several code paths to not flush to disk unnecessarily.
However allow for the idiosyncracies of pNFS for those layout
types that need to issue a LAYOUTCOMMIT before the metadata can
be updated on the server.
- SUNRPC updates to the client data receive path
- pNFS/SCSI support RH/Fedora dm-mpath device nodes
- pNFS files/flexfiles can now use unprivileged ports when
the generic NFS mount options allow it.
Bugfixes:
- Don't use RDMA direct data placement together with data
integrity or privacy security flavours
- Remove the RDMA ALLPHYSICAL memory registration mode as
it has potential security holes.
- Several layout recall fixes to improve NFSv4.1 protocol
compliance.
- Fix an Oops in the pNFS files and flexfiles connection
setup to the DS
- Allow retry of operations that used a returned delegation
stateid
- Don't mark the inode as revalidated if a LAYOUTCOMMIT is
outstanding
- Fix writeback races in nfs4_copy_range() and
nfs42_proc_deallocate()"
* tag 'nfs-for-4.8-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (104 commits)
pNFS: Actively set attributes as invalid if LAYOUTCOMMIT is outstanding
NFSv4: Clean up lookup of SECINFO_NO_NAME
NFSv4.2: Fix warning "variable ‘stateids’ set but not used"
NFSv4: Fix warning "no previous prototype for ‘nfs4_listxattr’"
SUNRPC: Fix a compiler warning in fs/nfs/clnt.c
pNFS: Remove redundant smp_mb() from pnfs_init_lseg()
pNFS: Cleanup - do layout segment initialisation in one place
pNFS: Remove redundant stateid invalidation
pNFS: Remove redundant pnfs_mark_layout_returned_if_empty()
pNFS: Clear the layout metadata if the server changed the layout stateid
pNFS: Cleanup - don't open code pnfs_mark_layout_stateid_invalid()
NFS: pnfs_mark_matching_lsegs_return() should match the layout sequence id
pNFS: Do not set plh_return_seq for non-callback related layoutreturns
pNFS: Ensure layoutreturn acts as a completion for layout callbacks
pNFS: Fix CB_LAYOUTRECALL stateid verification
pNFS: Always update the layout barrier seqid on LAYOUTGET
pNFS: Always update the layout stateid if NFS_LAYOUT_INVALID_STID is set
pNFS: Clear the layout return tracking on layout reinitialisation
pNFS: LAYOUTRETURN should only update the stateid if the layout is valid
nfs: don't create zero-length requests
...
Pull security subsystem updates from James Morris:
"Highlights:
- TPM core and driver updates/fixes
- IPv6 security labeling (CALIPSO)
- Lots of Apparmor fixes
- Seccomp: remove 2-phase API, close hole where ptrace can change
syscall #"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (156 commits)
apparmor: fix SECURITY_APPARMOR_HASH_DEFAULT parameter handling
tpm: Add TPM 2.0 support to the Nuvoton i2c driver (NPCT6xx family)
tpm: Factor out common startup code
tpm: use devm_add_action_or_reset
tpm2_i2c_nuvoton: add irq validity check
tpm: read burstcount from TPM_STS in one 32-bit transaction
tpm: fix byte-order for the value read by tpm2_get_tpm_pt
tpm_tis_core: convert max timeouts from msec to jiffies
apparmor: fix arg_size computation for when setprocattr is null terminated
apparmor: fix oops, validate buffer size in apparmor_setprocattr()
apparmor: do not expose kernel stack
apparmor: fix module parameters can be changed after policy is locked
apparmor: fix oops in profile_unpack() when policy_db is not present
apparmor: don't check for vmalloc_addr if kvzalloc() failed
apparmor: add missing id bounds check on dfa verification
apparmor: allow SYS_CAP_RESOURCE to be sufficient to prlimit another task
apparmor: use list_next_entry instead of list_entry_next
apparmor: fix refcount race when finding a child profile
apparmor: fix ref count leak when profile sha1 hash is read
apparmor: check that xindex is in trans_table bounds
...
Pull userns vfs updates from Eric Biederman:
"This tree contains some very long awaited work on generalizing the
user namespace support for mounting filesystems to include filesystems
with a backing store. The real world target is fuse but the goal is
to update the vfs to allow any filesystem to be supported. This
patchset is based on a lot of code review and testing to approach that
goal.
While looking at what is needed to support the fuse filesystem it
became clear that there were things like xattrs for security modules
that needed special treatment. That the resolution of those concerns
would not be fuse specific. That sorting out these general issues
made most sense at the generic level, where the right people could be
drawn into the conversation, and the issues could be solved for
everyone.
At a high level what this patchset does a couple of simple things:
- Add a user namespace owner (s_user_ns) to struct super_block.
- Teach the vfs to handle filesystem uids and gids not mapping into
to kuids and kgids and being reported as INVALID_UID and
INVALID_GID in vfs data structures.
By assigning a user namespace owner filesystems that are mounted with
only user namespace privilege can be detected. This allows security
modules and the like to know which mounts may not be trusted. This
also allows the set of uids and gids that are communicated to the
filesystem to be capped at the set of kuids and kgids that are in the
owning user namespace of the filesystem.
One of the crazier corner casees this handles is the case of inodes
whose i_uid or i_gid are not mapped into the vfs. Most of the code
simply doesn't care but it is easy to confuse the inode writeback path
so no operation that could cause an inode write-back is permitted for
such inodes (aka only reads are allowed).
This set of changes starts out by cleaning up the code paths involved
in user namespace permirted mounts. Then when things are clean enough
adds code that cleanly sets s_user_ns. Then additional restrictions
are added that are possible now that the filesystem superblock
contains owner information.
These changes should not affect anyone in practice, but there are some
parts of these restrictions that are changes in behavior.
- Andy's restriction on suid executables that does not honor the
suid bit when the path is from another mount namespace (think
/proc/[pid]/fd/) or when the filesystem was mounted by a less
privileged user.
- The replacement of the user namespace implicit setting of MNT_NODEV
with implicitly setting SB_I_NODEV on the filesystem superblock
instead.
Using SB_I_NODEV is a stronger form that happens to make this state
user invisible. The user visibility can be managed but it caused
problems when it was introduced from applications reasonably
expecting mount flags to be what they were set to.
There is a little bit of work remaining before it is safe to support
mounting filesystems with backing store in user namespaces, beyond
what is in this set of changes.
- Verifying the mounter has permission to read/write the block device
during mount.
- Teaching the integrity modules IMA and EVM to handle filesystems
mounted with only user namespace root and to reduce trust in their
security xattrs accordingly.
- Capturing the mounters credentials and using that for permission
checks in d_automount and the like. (Given that overlayfs already
does this, and we need the work in d_automount it make sense to
generalize this case).
Furthermore there are a few changes that are on the wishlist:
- Get all filesystems supporting posix acls using the generic posix
acls so that posix_acl_fix_xattr_from_user and
posix_acl_fix_xattr_to_user may be removed. [Maintainability]
- Reducing the permission checks in places such as remount to allow
the superblock owner to perform them.
- Allowing the superblock owner to chown files with unmapped uids and
gids to something that is mapped so the files may be treated
normally.
I am not considering even obvious relaxations of permission checks
until it is clear there are no more corner cases that need to be
locked down and handled generically.
Many thanks to Seth Forshee who kept this code alive, and putting up
with me rewriting substantial portions of what he did to handle more
corner cases, and for his diligent testing and reviewing of my
changes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (30 commits)
fs: Call d_automount with the filesystems creds
fs: Update i_[ug]id_(read|write) to translate relative to s_user_ns
evm: Translate user/group ids relative to s_user_ns when computing HMAC
dquot: For now explicitly don't support filesystems outside of init_user_ns
quota: Handle quota data stored in s_user_ns in quota_setxquota
quota: Ensure qids map to the filesystem
vfs: Don't create inodes with a uid or gid unknown to the vfs
vfs: Don't modify inodes with a uid or gid unknown to the vfs
cred: Reject inodes with invalid ids in set_create_file_as()
fs: Check for invalid i_uid in may_follow_link()
vfs: Verify acls are valid within superblock's s_user_ns.
userns: Handle -1 in k[ug]id_has_mapping when !CONFIG_USER_NS
fs: Refuse uid/gid changes which don't map into s_user_ns
selinux: Add support for unprivileged mounts from user namespaces
Smack: Handle labels consistently in untrusted mounts
Smack: Add support for unprivileged mounts from user namespaces
fs: Treat foreign mounts as nosuid
fs: Limit file caps to the user namespace of the super block
userns: Remove the now unnecessary FS_USERNS_DEV_MOUNT flag
userns: Remove implicit MNT_NODEV fragility.
...
Whenever thresholds are changed the hash tables are rebuilt. This is
done by enumerating all policies and hashing and inserting them into
the right table according to the thresholds and direction.
Because socket policies are also contained in net->xfrm.policy_all but
no hash tables are defined for their direction (dir + XFRM_POLICY_MAX)
this causes a NULL or invalid pointer dereference after returning from
policy_hash_bysel() if the rebuild is done while any socket policies
are installed.
Since the rebuild after changing thresholds is scheduled this crash
could even occur if the userland sets thresholds seemingly before
installing any socket policies.
Fixes: 53c2e285f9 ("xfrm: Do not hash socket policies")
Signed-off-by: Tobias Brunner <tobias@strongswan.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This changes the vfs dentry hashing to mix in the parent pointer at the
_beginning_ of the hash, rather than at the end.
That actually improves both the hash and the code generation, because we
can move more of the computation to the "static" part of the dcache
setup, and do less at lookup runtime.
It turns out that a lot of other hash users also really wanted to mix in
a base pointer as a 'salt' for the hash, and so the slightly extended
interface ends up working well for other cases too.
Users that want a string hash that is purely about the string pass in a
'salt' pointer of NULL.
* merge branch 'salted-string-hash':
fs/dcache.c: Save one 32-bit multiply in dcache lookup
vfs: make the string hashes salt the hash
Add pool namesapce pointer to struct ceph_file_layout and struct
ceph_object_locator. Pool namespace is used by when mapping object
to PG, it's also used when composing OSD request.
The namespace pointer in struct ceph_file_layout is RCU protected.
So libceph can read namespace without taking lock.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
[idryomov@gmail.com: ceph_oloc_destroy(), misc minor changes]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The data structure is for storing namesapce string. It allows namespace
string to be shared between cephfs inodes with same layout. This data
structure can also be referenced by OSD request.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Define new ceph_file_layout structure and rename old ceph_file_layout
to ceph_file_layout_legacy. This is preparation for adding namespace
to ceph_file_layout structure.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Pull networking updates from David Miller:
1) Unified UDP encapsulation offload methods for drivers, from
Alexander Duyck.
2) Make DSA binding more sane, from Andrew Lunn.
3) Support QCA9888 chips in ath10k, from Anilkumar Kolli.
4) Several workqueue usage cleanups, from Bhaktipriya Shridhar.
5) Add XDP (eXpress Data Path), essentially running BPF programs on RX
packets as soon as the device sees them, with the option to mirror
the packet on TX via the same interface. From Brenden Blanco and
others.
6) Allow qdisc/class stats dumps to run lockless, from Eric Dumazet.
7) Add VLAN support to b53 and bcm_sf2, from Florian Fainelli.
8) Simplify netlink conntrack entry layout, from Florian Westphal.
9) Add ipv4 forwarding support to mlxsw spectrum driver, from Ido
Schimmel, Yotam Gigi, and Jiri Pirko.
10) Add SKB array infrastructure and convert tun and macvtap over to it.
From Michael S Tsirkin and Jason Wang.
11) Support qdisc packet injection in pktgen, from John Fastabend.
12) Add neighbour monitoring framework to TIPC, from Jon Paul Maloy.
13) Add NV congestion control support to TCP, from Lawrence Brakmo.
14) Add GSO support to SCTP, from Marcelo Ricardo Leitner.
15) Allow GRO and RPS to function on macsec devices, from Paolo Abeni.
16) Support MPLS over IPV4, from Simon Horman.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1622 commits)
xgene: Fix build warning with ACPI disabled.
be2net: perform temperature query in adapter regardless of its interface state
l2tp: Correctly return -EBADF from pppol2tp_getname.
net/mlx5_core/health: Remove deprecated create_singlethread_workqueue
net: ipmr/ip6mr: update lastuse on entry change
macsec: ensure rx_sa is set when validation is disabled
tipc: dump monitor attributes
tipc: add a function to get the bearer name
tipc: get monitor threshold for the cluster
tipc: make cluster size threshold for monitoring configurable
tipc: introduce constants for tipc address validation
net: neigh: disallow transition to NUD_STALE if lladdr is unchanged in neigh_update()
MAINTAINERS: xgene: Add driver and documentation path
Documentation: dtb: xgene: Add MDIO node
dtb: xgene: Add MDIO node
drivers: net: xgene: ethtool: Use phy_ethtool_gset and sset
drivers: net: xgene: Use exported functions
drivers: net: xgene: Enable MDIO driver
drivers: net: xgene: Add backward compatibility
drivers: net: phy: xgene: Add MDIO driver
...
During fuzzing I regularly run into this WARN(). According to Herbert Xu,
this "certainly shouldn't be a WARN, it probably shouldn't print anything
either".
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
AFAICT this message is just printed whenever input validation fails.
This is a normal failure and we shouldn't be dumping the stack over it.
Looks like it was originally a printk that was maybe incorrectly
upgraded to a WARN:
commit 62db5cfd70
Author: stephen hemminger <shemminger@vyatta.com>
Date: Wed May 12 06:37:06 2010 +0000
xfrm: add severity to printk
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Merge updates from Andrew Morton:
- a few misc bits
- ocfs2
- most(?) of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (125 commits)
thp: fix comments of __pmd_trans_huge_lock()
cgroup: remove unnecessary 0 check from css_from_id()
cgroup: fix idr leak for the first cgroup root
mm: memcontrol: fix documentation for compound parameter
mm: memcontrol: remove BUG_ON in uncharge_list
mm: fix build warnings in <linux/compaction.h>
mm, thp: convert from optimistic swapin collapsing to conservative
mm, thp: fix comment inconsistency for swapin readahead functions
thp: update Documentation/{vm/transhuge,filesystems/proc}.txt
shmem: split huge pages beyond i_size under memory pressure
thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE
khugepaged: add support of collapse for tmpfs/shmem pages
shmem: make shmem_inode_info::lock irq-safe
khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page()
thp: extract khugepaged from mm/huge_memory.c
shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings
shmem: add huge pages support
shmem: get_unmapped_area align huge page
shmem: prepare huge= mount option and sysfs knob
mm, rmap: account shmem thp pages
...
Unix sockets can consume a significant amount of system memory, hence
they should be accounted to kmemcg.
Since unix socket buffers are always allocated from process context, all
we need to do to charge them to kmemcg is set __GFP_ACCOUNT in
sock->sk_allocation mask.
Eric asked:
> 1) What happens when a buffer, allocated from socket <A> lands in a
> different socket <B>, maybe owned by another user/process.
>
> Who owns it now, in term of kmemcg accounting ?
We never move memcg charges. E.g. if two processes from different
cgroups are sharing a memory region, each page will be charged to the
process which touched it first. Or if two processes are working with
the same directory tree, inodes and dentries will be charged to the
first user. The same is fair for unix socket buffers - they will be
charged to the sender.
> 2) Has performance impact been evaluated ?
I ran netperf STREAM_STREAM with default options in a kmemcg on a 4 core
x2 HT box. The results are below:
# clients bandwidth (10^6bits/sec)
base patched
1 67643 +- 725 64874 +- 353 - 4.0 %
4 193585 +- 2516 186715 +- 1460 - 3.5 %
8 194820 +- 377 187443 +- 1229 - 3.7 %
So the accounting doesn't come for free - it takes ~4% of performance.
I believe we could optimize it by using per cpu batching not only on
charge, but also on uncharge in memcg core, but that's beyond the scope
of this patch set - I'll take a look at this later.
Anyway, if performance impact is found to be unacceptable, it is always
possible to disable kmem accounting at boot time (cgroup.memory=nokmem)
or not use memory cgroups at runtime at all (thanks to jump labels
there'll be no overhead even if they are compiled in).
Link: http://lkml.kernel.org/r/fcfe6cae27a59fbc5e40145664b3cf085a560c68.1464079538.git.vdavydov@virtuozzo.com
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If 'tunnel' is NULL we should return -EBADF but the 'end_put_sess' path
unconditionally sets 'error' back to zero. Rework the error path so it
more closely matches pppol2tp_sendmsg.
Fixes: fd558d186d ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
Signed-off-by: Phil Turnbull <phil.turnbull@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently lastuse is updated on entry creation and cache hit, but it should
also be updated on entry change. Since both on add and update the ttl array
is updated we can simply update the lastuse in ipmr_update_thresholds.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
CC: Donald Sharp <sharpd@cumulusnetworks.com>
CC: David S. Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
In this commit, we dump the monitor attributes when queried.
The link monitor attributes are separated into two kinds:
1. general attributes per bearer
2. specific attributes per node/peer
This style resembles the socket attributes and the nametable
publications per socket.
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce a new function to get the bearer name from
its id. This is used in subsequent commit.
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In this commit, we add support to fetch the configured
cluster monitoring threshold.
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In this commit, we introduce support to configure the minimum
threshold to activate the new link monitoring algorithm.
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In this commit, we introduce defines for tipc address size,
offset and mask specification for Zone.Cluster.Node.
There is no functional change in this commit.
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
NUD_STALE is used when the caller(e.g. arp_process()) can't guarantee
neighbour reachability. If the entry was NUD_VALID and lladdr is unchanged,
the entry state should not be changed.
Currently the code puts an extra "NUD_CONNECTED" condition. So if old state
was NUD_DELAY or NUD_PROBE (they are NUD_VALID but not NUD_CONNECTED), the
state can be changed to NUD_STALE.
This may cause problem. Because NUD_STALE lladdr doesn't guarantee
reachability, when we send traffic, the state will be changed to
NUD_DELAY. In normal case, if we get no confirmation (by dst_confirm()),
we will change the state to NUD_PROBE and send probe traffic. But now the
state may be reset to NUD_STALE again(e.g. by broadcast ARP packets),
so the probe traffic will not be sent. This situation may happen again and
again, and packets will be sent to an non-reachable lladdr forever.
The fix is to remove the "NUD_CONNECTED" condition. After that the
"NEIGH_UPDATE_F_WEAK_OVERRIDE" condition (used by IPv6) in that branch will
be redundant, so remove it.
This change may increase probe traffic, but it's essential since NUD_STALE
lladdr is unreliable. To ensure correctness, we prefer to resolve lladdr,
when we can't get confirmation, even while remote packets try to set
NUD_STALE state.
Signed-off-by: Chunhui He <hchunhui@mail.ustc.edu.cn>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull s390 updates from Martin Schwidefsky:
"There are a couple of new things for s390 with this merge request:
- a new scheduling domain "drawer" is added to reflect the unusual
topology found on z13 machines. Performance tests showed up to 8
percent gain with the additional domain.
- the new crc-32 checksum crypto module uses the vector-galois-field
multiply and sum SIMD instruction to speed up crc-32 and crc-32c.
- proper __ro_after_init support, this requires RO_AFTER_INIT_DATA in
the generic vmlinux.lds linker script definitions.
- kcov instrumentation support. A prerequisite for that is the
inline assembly basic block cleanup, which is the reason for the
net/iucv/iucv.c change.
- support for 2GB pages is added to the hugetlbfs backend.
Then there are two removals:
- the oprofile hardware sampling support is dead code and is removed.
The oprofile user space uses the perf interface nowadays.
- the ETR clock synchronization is removed, this has been superseeded
be the STP clock synchronization. And it always has been
"interesting" code..
And the usual bug fixes and cleanups"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (82 commits)
s390/pci: Delete an unnecessary check before the function call "pci_dev_put"
s390/smp: clean up a condition
s390/cio/chp : Remove deprecated create_singlethread_workqueue
s390/chsc: improve channel path descriptor determination
s390/chsc: sanitize fmt check for chp_desc determination
s390/cio: make fmt1 channel path descriptor optional
s390/chsc: fix ioctl CHSC_INFO_CU command
s390/cio/device_ops: fix kernel doc
s390/cio: allow to reset channel measurement block
s390/console: Make preferred console handling more consistent
s390/mm: fix gmap tlb flush issues
s390/mm: add support for 2GB hugepages
s390: have unique symbol for __switch_to address
s390/cpuinfo: show maximum thread id
s390/ptrace: clarify bits in the per_struct
s390: stack address vs thread_info
s390: remove pointless load within __switch_to
s390: enable kcov support
s390/cpumf: use basic block for ecctr inline assembly
s390/hypfs: use basic block for diag inline assembly
...
After the previous patch, struct tc_action should be enough
to represent the generic tc action, tcf_common is not necessary
any more. This patch gets rid of it to make tc action code
more readable.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
struct tc_action is confusing, currently we use it for two purposes:
1) Pass in arguments and carry out results from helper functions
2) A generic representation for tc actions
The first one is error-prone, since we need to make sure we don't
miss anything. This patch aims to get rid of this use, by moving
tc_action into tcf_common, so that they are allocated together
in hashtable and can be cast'ed easily.
And together with the following patch, we could really make
tc_action a generic representation for all tc actions and each
type of action can inherit from it.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After a612769774 ("udp: prevent bugcheck if filter truncates packet
too much"), there followed various other fixes for similar cases such
as f4979fcea7 ("rose: limit sk_filter trim to payload").
Latter introduced a new helper sk_filter_trim_cap(), where we can pass
the trim limit directly to the socket filter handling. Make use of it
here as well with sizeof(struct udphdr) as lower cap limit and drop the
extra skb->len test in UDP's input path.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull timer updates from Thomas Gleixner:
"This update provides the following changes:
- The rework of the timer wheel which addresses the shortcomings of
the current wheel (cascading, slow search for next expiring timer,
etc). That's the first major change of the wheel in almost 20
years since Finn implemted it.
- A large overhaul of the clocksource drivers init functions to
consolidate the Device Tree initialization
- Some more Y2038 updates
- A capability fix for timerfd
- Yet another clock chip driver
- The usual pile of updates, comment improvements all over the place"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (130 commits)
tick/nohz: Optimize nohz idle enter
clockevents: Make clockevents_subsys static
clocksource/drivers/time-armada-370-xp: Fix return value check
timers: Implement optimization for same expiry time in mod_timer()
timers: Split out index calculation
timers: Only wake softirq if necessary
timers: Forward the wheel clock whenever possible
timers/nohz: Remove pointless tick_nohz_kick_tick() function
timers: Optimize collect_expired_timers() for NOHZ
timers: Move __run_timers() function
timers: Remove set_timer_slack() leftovers
timers: Switch to a non-cascading wheel
timers: Reduce the CPU index space to 256k
timers: Give a few structs and members proper names
hlist: Add hlist_is_singular_node() helper
signals: Use hrtimer for sigtimedwait()
timers: Remove the deprecated mod_timer_pinned() API
timers, net/ipv4/inet: Initialize connection request timers as pinned
timers, drivers/tty/mips_ejtag: Initialize the poll timer as pinned
timers, drivers/tty/metag_da: Initialize the poll timer as pinned
...
I was seeing a lot of these:
BUG: sleeping function called from invalid context at mm/slab.h:388
in_atomic(): 0, irqs_disabled(): 0, pid: 14971, name: trinity-c2
Preemption disabled at:[<ffffffff819bcd46>] rhashtable_walk_start+0x46/0x150
[<ffffffff81149abb>] preempt_count_add+0x1fb/0x280
[<ffffffff83295722>] _raw_spin_lock+0x12/0x40
[<ffffffff811aac87>] console_unlock+0x2f7/0x930
[<ffffffff811ab5bb>] vprintk_emit+0x2fb/0x520
[<ffffffff811aba6a>] vprintk_default+0x1a/0x20
[<ffffffff812c171a>] printk+0x94/0xb0
[<ffffffff811d6ed0>] print_stack_trace+0xe0/0x170
[<ffffffff8115835e>] ___might_sleep+0x3be/0x460
[<ffffffff81158490>] __might_sleep+0x90/0x1a0
[<ffffffff8139b823>] kmem_cache_alloc+0x153/0x1e0
[<ffffffff819bca1e>] rhashtable_walk_init+0xfe/0x2d0
[<ffffffff82ec64de>] sctp_transport_walk_start+0x1e/0x60
[<ffffffff82edd8ad>] sctp_transport_seq_start+0x4d/0x150
[<ffffffff8143a82b>] seq_read+0x27b/0x1180
[<ffffffff814f97fc>] proc_reg_read+0xbc/0x180
[<ffffffff813d471b>] __vfs_read+0xdb/0x610
[<ffffffff813d4d3a>] vfs_read+0xea/0x2d0
[<ffffffff813d615b>] SyS_pread64+0x11b/0x150
[<ffffffff8100334c>] do_syscall_64+0x19c/0x410
[<ffffffff832960a5>] return_from_SYSCALL_64+0x0/0x6a
[<ffffffffffffffff>] 0xffffffffffffffff
Apparently we always need to call rhashtable_walk_stop(), even when
rhashtable_walk_start() fails:
* rhashtable_walk_start - Start a hash table walk
* @iter: Hash table iterator
*
* Start a hash table walk. Note that we take the RCU lock in all
* cases including when we return an error. So you must always call
* rhashtable_walk_stop to clean up.
otherwise we never call rcu_read_unlock() and we get the splat above.
Fixes: 53fa1036 ("sctp: fix some rhashtable functions using in sctp proc/diag")
See-also: 53fa1036 ("sctp: fix some rhashtable functions using in sctp proc/diag")
See-also: f2dba9c6 ("rhashtable: Introduce rhashtable_walk_*")
Cc: Xin Long <lucien.xin@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: stable@vger.kernel.org
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull locking updates from Ingo Molnar:
"The locking tree was busier in this cycle than the usual pattern - a
couple of major projects happened to coincide.
The main changes are:
- implement the atomic_fetch_{add,sub,and,or,xor}() API natively
across all SMP architectures (Peter Zijlstra)
- add atomic_fetch_{inc/dec}() as well, using the generic primitives
(Davidlohr Bueso)
- optimize various aspects of rwsems (Jason Low, Davidlohr Bueso,
Waiman Long)
- optimize smp_cond_load_acquire() on arm64 and implement LSE based
atomic{,64}_fetch_{add,sub,and,andnot,or,xor}{,_relaxed,_acquire,_release}()
on arm64 (Will Deacon)
- introduce smp_acquire__after_ctrl_dep() and fix various barrier
mis-uses and bugs (Peter Zijlstra)
- after discovering ancient spin_unlock_wait() barrier bugs in its
implementation and usage, strengthen its semantics and update/fix
usage sites (Peter Zijlstra)
- optimize mutex_trylock() fastpath (Peter Zijlstra)
- ... misc fixes and cleanups"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (67 commits)
locking/atomic: Introduce inc/dec variants for the atomic_fetch_$op() API
locking/barriers, arch/arm64: Implement LDXR+WFE based smp_cond_load_acquire()
locking/static_keys: Fix non static symbol Sparse warning
locking/qspinlock: Use __this_cpu_dec() instead of full-blown this_cpu_dec()
locking/atomic, arch/tile: Fix tilepro build
locking/atomic, arch/m68k: Remove comment
locking/atomic, arch/arc: Fix build
locking/Documentation: Clarify limited control-dependency scope
locking/atomic, arch/rwsem: Employ atomic_long_fetch_add()
locking/atomic, arch/qrwlock: Employ atomic_fetch_add_acquire()
locking/atomic, arch/mips: Convert to _relaxed atomics
locking/atomic, arch/alpha: Convert to _relaxed atomics
locking/atomic: Remove the deprecated atomic_{set,clear}_mask() functions
locking/atomic: Remove linux/atomic.h:atomic_fetch_or()
locking/atomic: Implement atomic{,64,_long}_fetch_{add,sub,and,andnot,or,xor}{,_relaxed,_acquire,_release}()
locking/atomic: Fix atomic64_relaxed() bits
locking/atomic, arch/xtensa: Implement atomic_fetch_{add,sub,and,or,xor}()
locking/atomic, arch/x86: Implement atomic{,64}_fetch_{add,sub,and,or,xor}()
locking/atomic, arch/tile: Implement atomic{,64}_fetch_{add,sub,and,or,xor}()
locking/atomic, arch/sparc: Implement atomic{,64}_fetch_{add,sub,and,or,xor}()
...
The head skb for GSO packets won't travel through the inner depths of
SCTP stack as it doesn't contain any chunks on it. That means skb->sk
doesn't get set and then when sctp_recvmsg() calls
sctp_inet6_skb_msgname() on the head_skb it panics, as this last needs
to check flags at the socket (sp->v4mapped).
The fix is to initialize skb->sk for th head skb once we are able to do
it. That is, when the first chunk is processed.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The check for a -ve error is redundant, remove it and just
immediately return the return value from the call to
seq_open_net.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Default kernel behavior is to delete IPv6 addresses on link
down, which entails deletion of the multicast and the
subnet-router anycast addresses. These deletions do not
happen with sysctl setting to keep global IPv6 addresses on
link down, so every link down/up causes an increment of the
anycast and multicast refcounts. These bogus refcounts may
stop these addrs from being removed on subsequent calls to
delete them. The solution is to leave the groups for the
multicast and subnet anycast on link down for the callflow
when global IPv6 addresses are kept.
Fixes: f1705ec197 ("net: ipv6: Make address flushing on ifdown optional")
Signed-off-by: Mike Manning <mmanning@brocade.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 486bdee013 ("sctp: add support for RPS and RFS")
saves skb->hash into sk->sk_rxhash so that the inet_* can
record it to flow table.
But sctp uses sock_common_recvmsg as .recvmsg instead
of inet_recvmsg, sock_common_recvmsg doesn't invoke
sock_rps_record_flow to record the flow. It may cause
that the receiver has no chances to record the flow if
it doesn't send msg or poll the socket.
So this patch fixes it by using inet_recvmsg as .recvmsg
in sctp.
Fixes: 486bdee013 ("sctp: add support for RPS and RFS")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 8626c56c82 ("bridge: fix potential use-after-free when hook
returns QUEUE or STOLEN verdict") caused LLDP packets arriving through a
bridge port to be re-injected to the Rx path with skb->dev set to the
bridge device, but this breaks the lldpad daemon.
The lldpad daemon opens a packet socket with protocol set to ETH_P_LLDP
for any valid device on the system, which doesn't not include soft
devices such as bridge and VLAN.
Since packet sockets (ptype_base) are processed in the Rx path after the
Rx handler, LLDP packets with skb->dev set to the bridge device never
reach the lldpad daemon.
Fix this by making the bridge's Rx handler re-inject LLDP packets with
RX_HANDLER_PASS, which effectively restores the behaviour prior to the
mentioned commit.
This means netfilter will never receive LLDP packets coming through a
bridge port, as I don't see a way in which we can have okfn() consume
the packet without breaking existing behaviour. I've already carried out
a similar fix for STP packets in commit 56fae404fb ("bridge: Fix
incorrect re-injection of STP packets").
Fixes: 8626c56c82 ("bridge: fix potential use-after-free when hook returns QUEUE or STOLEN verdict")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes sctp support ipv6 nonlocal bind by adding
sp->inet.freebind and net->ipv6.sysctl.ip_nonlocal_bind
check in sctp_v6_available as what sctp did to support
ipv4 nonlocal bind (commit cdac4e0774).
Reported-by: Shijoe George <spanjikk@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes the __output_custom() routine we currently use with
bpf_skb_copy(). I missed that when len is larger than the size of the
current handle, we can issue multiple invocations of copy_func, and
__output_custom() advances destination but also source buffer by the
written amount of bytes. When we have __output_custom(), this is actually
wrong since in that case the source buffer points to a non-linear object,
in our case an skb, which the copy_func helper is supposed to walk.
Therefore, since this is non-linear we thus need to pass the offset into
the helper, so that copy_func can use it for extracting the data from
the source object.
Therefore, adjust the callback signatures properly and pass offset
into the skb_header_pointer() invoked from bpf_skb_copy() callback. The
__DEFINE_OUTPUT_COPY_BODY() is adjusted to accommodate for two things:
i) to pass in whether we should advance source buffer or not; this is
a compile-time constant condition, ii) to pass in the offset for
__output_custom(), which we do with help of __VA_ARGS__, so everything
can stay inlined as is currently. Both changes allow for adapting the
__output_* fast-path helpers w/o extra overhead.
Fixes: 555c8a8623 ("bpf: avoid stack copy and use skb ctx for event output")
Fixes: 7e3f977edd ("perf, events: add non-linear data support for raw records")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
gcc-4.9 and higher warn about the newly added NSCI code:
net/ncsi/ncsi-manage.c: In function 'ncsi_process_next_channel':
net/ncsi/ncsi-manage.c:1003:2: error: 'old_state' may be used uninitialized in this function [-Werror=maybe-uninitialized]
The warning is a false positive and therefore harmless, but it would be good to
avoid it anyway. I have determined that the barrier in the spin_unlock_irqsave()
is what confuses gcc to the point that it cannot track whether the variable
was unused or not.
This rearranges the code in a way that makes it obvious to gcc that old_state
is always initialized at the time of use, functionally this should not
change anything.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change the ageing_time type in br_set_ageing_time() from u32 to what it
is expected to be, i.e. a clock_t.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_stp_enable_bridge() does take the br->lock spinlock. Fix its wrongly
pasted comment and use the same as br_stp_disable_bridge().
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Following the work that have been done on offloading classifiers like u32
and flower, now the match-all classifier hw offloading is possible. if
the interface supports tc offloading.
To control the offloading, two tc flags have been introduced: skip_sw and
skip_hw. Typical usage:
tc filter add dev eth25 parent ffff: \
matchall skip_sw \
action mirred egress mirror \
dev eth27
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The matchall classifier matches every packet and allows the user to apply
actions on it. This filter is very useful in usecases where every packet
should be matched, for example, packet mirroring (SPAN) can be setup very
easily using that filter.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter/IPVS updates for net-next,
they are:
1) Count pre-established connections as active in "least connection"
schedulers such that pre-established connections to avoid overloading
backend servers on peak demands, from Michal Kubecek via Simon Horman.
2) Address a race condition when resizing the conntrack table by caching
the bucket size when fulling iterating over the hashtable in these
three possible scenarios: 1) dump via /proc/net/nf_conntrack,
2) unlinking userspace helper and 3) unlinking custom conntrack timeout.
From Liping Zhang.
3) Revisit early_drop() path to perform lockless traversal on conntrack
eviction under stress, use del_timer() as synchronization point to
avoid two CPUs evicting the same entry, from Florian Westphal.
4) Move NAT hlist_head to nf_conn object, this simplifies the existing
NAT extension and it doesn't increase size since recent patches to
align nf_conn, from Florian.
5) Use rhashtable for the by-source NAT hashtable, also from Florian.
6) Don't allow --physdev-is-out from OUTPUT chain, just like
--physdev-out is not either, from Hangbin Liu.
7) Automagically set on nf_conntrack counters if the user tries to
match ct bytes/packets from nftables, from Liping Zhang.
8) Remove possible_net_t fields in nf_tables set objects since we just
simply pass the net pointer to the backend set type implementations.
9) Fix possible off-by-one in h323, from Toby DiPasquale.
10) early_drop() may be called from ctnetlink patch, so we must hold
rcu read size lock from them too, this amends Florian's patch #3
coming in this batch, from Liping Zhang.
11) Use binary search to validate jump offset in x_tables, this
addresses the O(n!) validation that was introduced recently
resolve security issues with unpriviledge namespaces, from Florian.
12) Fix reference leak to connlabel in error path of nft_ct, from Zhang.
13) Three updates for nft_log: Fix log prefix leak in error path. Bail
out on loglevel larger than debug in nft_log and set on the new
NF_LOG_F_COPY_LEN flag when snaplen is specified. Again from Zhang.
14) Allow to filter rule dumps in nf_tables based on table and chain
names.
15) Simplify connlabel to always use 128 bits to store labels and
get rid of unused function in xt_connlabel, from Florian.
16) Replace set_expect_timeout() by mod_timer() from the h323 conntrack
helper, by Gao Feng.
17) Put back x_tables module reference in nft_compat on error, from
Liping Zhang.
18) Add a reference count to the x_tables extensions cache in
nft_compat, so we can remove them when unused and avoid a crash
if the extensions are rmmod, again from Zhang.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Here is the big tty and serial driver update for 4.8-rc1.
Lots of good cleanups from Jiri on a number of vt and other tty related
things, and the normal driver updates. Full details are in the
shortlog.
All of these have been in linux-next for a while with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iFYEABECABYFAleVPbQPHGdyZWdAa3JvYWguY29tAAoJEDFH1A3bLfspWXgAn046
QCMeFya4J1zjYjcGXJzNfGMUAKCHxha8Xe65cc0LDz8mNB0MgzjHEg==
=ED8v
-----END PGP SIGNATURE-----
Merge tag 'tty-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial driver updates from Greg KH:
"Here is the big tty and serial driver update for 4.8-rc1.
Lots of good cleanups from Jiri on a number of vt and other tty
related things, and the normal driver updates. Full details are in
the shortlog.
All of these have been in linux-next for a while with no reported
issues"
* tag 'tty-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (90 commits)
tty/serial: atmel: enforce tasklet init and termination sequences
serial: sh-sci: Stop transfers in sci_shutdown()
serial: 8250_ingenic: drop #if conditional surrounding earlycon code
serial: 8250_mtk: drop !defined(MODULE) conditional
serial: 8250_uniphier: drop !defined(MODULE) conditional
earlycon: mark earlycon code as __used iif the caller is built-in
tty/serial/8250: use mctrl_gpio helpers
serial: mctrl_gpio: enable API usage only for initialized mctrl_gpios struct
serial: mctrl_gpio: add modem control read routine
tty/serial/8250: make UART_MCR register access consistent
serial: 8250_mid: Read RX buffer on RX DMA timeout for DNV
serial: 8250_dma: Export serial8250_rx_dma_flush()
dmaengine: hsu: Export hsu_dma_get_status()
tty: serial: 8250: add CON_CONSDEV to flags
tty: serial: samsung: add byte-order aware bit functions
tty: serial: samsung: fixup accessors for endian
serial: sirf: make fifo functions static
serial: mps2-uart: make driver explicitly non-modular
serial: mvebu-uart: free the IRQ in ->shutdown()
serial/bcm63xx_uart: use correct alias naming
...
Fix the report:
net/sunrpc/clnt.c:2580:1: warning: ‘static’ is not at beginning of declaration [-Wold-style-declaration]
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
that caused misdirected requests, tagged for stable.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJXk4kSAAoJEEp/3jgCEfOLvpsIAKJTs1ELIQ5RmfvwdvqyqI0N
DoSA6rBYIwQvjqBjevJw72w5HKR7hhJoxaEjXEFrw9zbRmXMNnlk5xZpgD8vy5E3
1iCA8LtscFp4ld4ZNWIus45mUpf6/a5ugPd9Mr3V5C4J05LWqZeXufpAHNHyFbII
++hTu6J/RAg8DddEUhBcDl7c65tQpc8ai0h8ll0pLRYNFLPeCoYO3yTitEYax4fR
i6erB3+7pNWnZIsPnUTrXS4B2NG5kPmflVkD7UH9i14PwdQ4QO85LSXD1o8xYrpa
Occ9EvgFuT8zTJHckCEcT2Y0dINz2uHiE05DUea3Udz82keV9zKeZhZUDwJ95RE=
=P1qk
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.7-rc8' of git://github.com/ceph/ceph-client
Pull ceph fix from Ilya Dryomov:
"A fix for a long-standing bug in the incremental osdmap handling code
that caused misdirected requests, tagged for stable"
The tag is signed with a brand new key - Sage is on vacation and I
didn't anticipate this"
* tag 'ceph-for-4.7-rc8' of git://github.com/ceph/ceph-client:
libceph: apply new_state before new_up_client on incrementals
We "cache" the loaded match/target modules and reuse them, but when the
modules are removed, we still point to them. Then we may end up with
invalid memory references when using iptables-compat to add rules later.
Input the following commands will reproduce the kernel crash:
# iptables-compat -A INPUT -j LOG
# iptables-compat -D INPUT -j LOG
# rmmod xt_LOG
# iptables-compat -A INPUT -j LOG
BUG: unable to handle kernel paging request at ffffffffa05a9010
IP: [<ffffffff813f783e>] strcmp+0xe/0x30
Call Trace:
[<ffffffffa05acc43>] nft_target_select_ops+0x83/0x1f0 [nft_compat]
[<ffffffffa058a177>] nf_tables_expr_parse+0x147/0x1f0 [nf_tables]
[<ffffffffa058e541>] nf_tables_newrule+0x301/0x810 [nf_tables]
[<ffffffff8141ca00>] ? nla_parse+0x20/0x100
[<ffffffffa057fa8f>] nfnetlink_rcv+0x33f/0x53d [nfnetlink]
[<ffffffffa057f94b>] ? nfnetlink_rcv+0x1fb/0x53d [nfnetlink]
[<ffffffff817116b8>] netlink_unicast+0x178/0x220
[<ffffffff81711a5b>] netlink_sendmsg+0x2fb/0x3a0
[<ffffffff816b7fc8>] sock_sendmsg+0x38/0x50
[<ffffffff816b8a7e>] ___sys_sendmsg+0x28e/0x2a0
[<ffffffff816bcb7e>] ? release_sock+0x1e/0xb0
[<ffffffff81804ac5>] ? _raw_spin_unlock_bh+0x35/0x40
[<ffffffff816bcbe2>] ? release_sock+0x82/0xb0
[<ffffffff816b93d4>] __sys_sendmsg+0x54/0x90
[<ffffffff816b9422>] SyS_sendmsg+0x12/0x20
[<ffffffff81805172>] entry_SYSCALL_64_fastpath+0x1a/0xa9
So when nobody use the related match/target module, there's no need to
"cache" it. And nft_[match|target]_release are useless anymore, remove
them.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If the user specify the invalid NFTA_MATCH_INFO/NFTA_TARGET_INFO attr
or memory alloc fail, we should call module_put to the related match
or target. Otherwise, we cannot remove the module even nobody use it.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Simplify the code without any side effect. The set_expect_timeout is
used to modify the timer expired time. It tries to delete timer, and
add it again. So we could use mod_timer directly.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The conntrack label extension is currently variable-sized, e.g. if
only 2 labels are used by iptables rules then the labels->bits[] array
will only contain one element.
We track size of each label storage area in the 'words' member.
But in nftables and openvswitch we always have to ask for worst-case
since we don't know what bit will be used at configuration time.
As most arches are 64bit we need to allocate 24 bytes in this case:
struct nf_conn_labels {
u8 words; /* 0 1 */
/* XXX 7 bytes hole, try to pack */
long unsigned bits[2]; /* 8 24 */
Make bits a fixed size and drop the words member, it simplifies
the code and only increases memory requirements on x86 when
less than 64bit labels are required.
We still only allocate the extension if its needed.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently, osd_weight and osd_state fields are updated in the encoding
order. This is wrong, because an incremental map may look like e.g.
new_up_client: { osd=6, addr=... } # set osd_state and addr
new_state: { osd=6, xorstate=EXISTS } # clear osd_state
Suppose osd6's current osd_state is EXISTS (i.e. osd6 is down). After
applying new_up_client, osd_state is changed to EXISTS | UP. Carrying
on with the new_state update, we flip EXISTS and leave osd6 in a weird
"!EXISTS but UP" state. A non-existent OSD is considered down by the
mapping code
2087 for (i = 0; i < pg->pg_temp.len; i++) {
2088 if (ceph_osd_is_down(osdmap, pg->pg_temp.osds[i])) {
2089 if (ceph_can_shift_osds(pi))
2090 continue;
2091
2092 temp->osds[temp->size++] = CRUSH_ITEM_NONE;
and so requests get directed to the second OSD in the set instead of
the first, resulting in OSD-side errors like:
[WRN] : client.4239 192.168.122.21:0/2444980242 misdirected client.4239.1:2827 pg 2.5df899f2 to osd.4 not [1,4,6] in e680/680
and hung rbds on the client:
[ 493.566367] rbd: rbd0: write 400000 at 11cc00000 (0)
[ 493.566805] rbd: rbd0: result -6 xferred 400000
[ 493.567011] blk_update_request: I/O error, dev rbd0, sector 9330688
The fix is to decouple application from the decoding and:
- apply new_weight first
- apply new_state before new_up_client
- twiddle osd_state flags if marking in
- clear out some of the state if osd is destroyed
Fixes: http://tracker.ceph.com/issues/14901
Cc: stable@vger.kernel.org # 3.15+: 6dd74e44dc: libceph: set 'exists' flag for newly up osd
Cc: stable@vger.kernel.org # 3.15+
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
sock_cmsg_send() can return different error codes and not only
-EINVAL, and we should properly propagate them.
Fixes: c14ac9451c ("sock: enable timestamping using control messages")
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the first NFC pull request for 4.8. We have:
- A fairly large NFC digital stack patchset:
* RTOX fixes.
* Proper DEP RWT support.
* ACK and NACK PDUs handling fixes, in both initiator
and target modes.
* A few memory leak fixes.
- A conversion of the nfcsim driver to use the digital stack.
The driver supports the DEP protocol in both NFC-A and NFC-F.
- Error injection through debugfs for the nfcsim driver.
- Improvements to the port100 driver for the Sony USB chipset, in
particular to the command abort and cancellation code paths.
- A few minor fixes for the pn533, trf7970a and fdp drivers.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXjp/3AAoJEIqAPN1PVmxKjpEP/2bgptSwf2cCql+Jv+YaLLny
PjKr+qBXxqvogclucaGg5iIFnVjJ/pHd+csCeqHWpT1jAK7DtK1IJZjg6K9nD7vO
OQQ49+oxcFLTTy00rbfCFzxRaDDnkhD/qafXkTomEMiH7QVK0qssaxm2FVFVEblW
1NTmcsEUnbDbccQhXnxh+N7Xt2CAgsMIbbyHM+4yQuqGtSYjFd164c3jTL13W4a5
SQEJZkCtI7DIdFd6SiXkTGNjlN/fqXuUqXsf2EHxdFb7EE0c067uHpudp2hFdAem
WmAYjjmIuTRFwRFKPJMLUakSen3XbBKVUbtDnIMYykNWYnC4CmedrCrX3YRw4GQt
hZgkj6o5IweSSf6foIgihurE6m5jqd2mAcauwYC/K9wW5nHLaKg8fd9gAngoWY7P
MKBOCyjqIPWkNDC5tne6qftpsDhCrBcdrAtbkorx0lHF20OFto7Gjzxx1Ca+fnJg
N9/fMulQJu8rz3FYpvfvogQMJjkjeFUfyZDa3/ft/ySU6qohxDwXOFaZ82lieTAo
PztTq8tY7GDrdJdyvvHx78RpRVCJT8qHzBRIiiZRpt9MM/aPSepLcozwM97WrJDa
sPvz0jol4d12VIy02j2ArPjMon1MrQePed+Y1y2OtBt8rGiSUxC94t5LE3aMqPs/
a9tNLZYL/nixpLeXbeWa
=yrVP
-----END PGP SIGNATURE-----
Merge tag 'nfc-next-4.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/nfc-next
Samuel Ortiz says:
====================
NFC 4.8 pull request
This is the first NFC pull request for 4.8. We have:
- A fairly large NFC digital stack patchset:
* RTOX fixes.
* Proper DEP RWT support.
* ACK and NACK PDUs handling fixes, in both initiator
and target modes.
* A few memory leak fixes.
- A conversion of the nfcsim driver to use the digital stack.
The driver supports the DEP protocol in both NFC-A and NFC-F.
- Error injection through debugfs for the nfcsim driver.
- Improvements to the port100 driver for the Sony USB chipset, in
particular to the command abort and cancellation code paths.
- A few minor fixes for the pn533, trf7970a and fdp drivers.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The IFLA_XDP_ATTACHED nested attribute is meant for read-only, and while
do_setlink properly ignores it, it should be more paranoid and reject
commands that try to set it.
Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the table and/or chain attributes are set in a rule dump request,
we filter out the rules based on this selection.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
There's a similar problem in xt_NFLOG, and was fixed by commit 7643507fe8
("netfilter: xt_NFLOG: nflog-range does not truncate packets"). Only set
copy_len here does not work, so we should enable NF_LOG_F_COPY_LEN also.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
User can specify the log level larger than 7(debug level) via
nfnetlink, this is invalid. So in this case, we should report
EINVAL to the userspace.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Suppose that we specify the NFTA_LOG_PREFIX, then NFTA_LOG_LEVEL
and NFTA_LOG_GROUP are specified together or nf_logger_find_get
call returns fail, i.e. expr init fail, memory leak will happen.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add nf_ct_helper_init(), nf_conntrack_helpers_register() and
nf_conntrack_helpers_unregister() functions to avoid repetitive
opencoded initialization in helpers.
This patch keeps an id parameter for nf_ct_helper_init() not to break
helper matching by name that has been inconsistently exposed to
userspace through ports, eg. ftp-2121, and through an incremental id,
eg. tftp-1.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Johan Hedberg says:
====================
pull request: bluetooth-next 2016-07-19
Here's likely the last bluetooth-next pull request for the 4.8 kernel:
- Fix for L2CAP setsockopt
- Fix for is_suspending flag handling in btmrvl driver
- Addition of Bluetooth HW & FW info fields to debugfs
- Fix to use int instead of char for callback status.
The last one (from Geert Uytterhoeven) is actually not purely a
Bluetooth (or 802.15.4) patch, but it was agreed with other maintainers
that we take it through the bluetooth-next tree.
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Sets the bpf program represented by fd as an early filter in the rx path
of the netdev. The fd must have been created as BPF_PROG_TYPE_XDP.
Providing a negative value as fd clears the program. Getting the fd back
via rtnl is not possible, therefore reading of this value merely
provides a bool whether the program is valid on the link or not.
Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add one new netdev op for drivers implementing the BPF_PROG_TYPE_XDP
filter. The single op is used for both setup/query of the xdp program,
modelled after ndo_setup_tc.
Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new bpf prog type that is intended to run in early stages of the
packet rx path. Only minimal packet metadata will be available, hence a
new context type, struct xdp_md, is exposed to userspace. So far only
expose the packet start and end pointers, and only in read mode.
An XDP program must return one of the well known enum values, all other
return codes are reserved for future use. Unfortunately, this
restriction is hard to enforce at verification time, so take the
approach of warning at runtime when such programs are encountered. Out
of bounds return codes should alias to XDP_ABORTED.
Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes an issue that a syscall (e.g. sendto syscall) cannot
work correctly. Since the sendto syscall doesn't have msg_control buffer,
the sock_tx_timestamp() in packet_snd() cannot work correctly because
the socks.tsflags is set to 0.
So, this patch sets the socks.tsflags to sk->sk_tsflags as default.
Fixes: c14ac9451c ("sock: enable timestamping using control messages")
Reported-by: Kazuya Mizuguchi <kazuya.mizuguchi.ks@renesas.com>
Reported-by: Keita Kobayashi <keita.kobayashi.ym@renesas.com>
Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This introduces NCSI AEN packet handlers that result in (A) the
currently active channel is reconfigured; (B) Currently active
channel is deconfigured and disabled, another channel is chosen
as active one and configured. Case (B) won't happen if hardware
arbitration has been enabled, the channel that was in active
state is suspended simply.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This manages NCSI packages and channels:
* The available packages and channels are enumerated in the first
time of calling ncsi_start_dev(). The channels' capabilities are
probed in the meanwhile. The NCSI network topology won't change
until the NCSI device is destroyed.
* There in a queue in every NCSI device. The element in the queue,
channel, is waiting for configuration (bringup) or suspending
(teardown). The channel's state (inactive/active) indicates the
futher action (configuration or suspending) will be applied on the
channel. Another channel's state (invisible) means the requested
action is being applied.
* The hardware arbitration will be enabled if all available packages
and channels support it. All available channels try to provide
service when hardware arbitration is enabled. Otherwise, one channel
is selected as the active one at once.
* When channel is in active state, meaning it's providing service, a
timer started to retrieve the channe's link status. If the channel's
link status fails to be updated in the determined period, the channel
is going to be reconfigured. It's the error handling implementation
as defined in NCSI spec.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The NCSI response packets are sent to MC (Management Controller)
from the remote end. They are responses of NCSI command packets
for multiple purposes: completion status of NCSI command packets,
return NCSI channel's capability or configuration etc.
This defines struct to represent NCSI response packets and introduces
function ncsi_rcv_rsp() which will be used to receive NCSI response
packets and parse them.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The NCSI command packets are sent from MC (Management Controller)
to remote end. They are used for multiple purposes: probe existing
NCSI package/channel, retrieve NCSI channel's capability, configure
NCSI channel etc.
This defines struct to represent NCSI command packets and introduces
function ncsi_xmit_cmd(), which will be used to transmit NCSI command
packet according to the request. The request is represented by struct
ncsi_cmd_arg.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
NCSI spec (DSP0222) defines several objects: package, channel, mode,
filter, version and statistics etc. This introduces the data structs
to represent those objects and implement functions to manage them.
Also, this introduces CONFIG_NET_NCSI for the newly implemented NCSI
stack.
* The user (e.g. netdev driver) dereference NCSI device by
"struct ncsi_dev", which is embedded to "struct ncsi_dev_priv".
The later one is used by NCSI stack internally.
* Every NCSI device can have multiple packages simultaneously, up
to 8 packages. It's represented by "struct ncsi_package" and
identified by 3-bits ID.
* Every NCSI package can have multiple channels, up to 32. It's
represented by "struct ncsi_channel" and identified by 5-bits ID.
* Every NCSI channel has version, statistics, various modes and
filters. They are represented by "struct ncsi_channel_version",
"struct ncsi_channel_stats", "struct ncsi_channel_mode" and
"struct ncsi_channel_filter" separately.
* Apart from AEN (Asynchronous Event Notification), the NCSI stack
works in terms of command and response. This introduces "struct
ncsi_req" to represent a complete NCSI transaction made of NCSI
request and response.
link: https://www.dmtf.org/sites/default/files/standards/documents/DSP0222_1.1.0.pdf
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new function for DSA drivers to handle the switchdev
SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME attribute.
The ageing time is passed as milliseconds.
Also because we can have multiple logical bridges on top of a physical
switch and ageing time are switch-wide, call the driver function with
the fastest ageing time in use on the chip instead of the requested one.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Given:
- tap0 and vxlan0 are bridged
- vxlan0 stacked on eth0, eth0 having small mtu (e.g. 1400)
Assume GSO skbs arriving from tap0 having a gso_size as determined by
user-provided virtio_net_hdr (e.g. 1460 corresponding to VM mtu of 1500).
After encapsulation these skbs have skb_gso_network_seglen that exceed
eth0's ip_skb_dst_mtu.
These skbs are accidentally passed to ip_finish_output2 AS IS.
Alas, each final segment (segmented either by validate_xmit_skb or by
hardware UFO) would be larger than eth0 mtu.
As a result, those above-mtu segments get dropped on certain networks.
This behavior is not aligned with the NON-GSO case:
Assume a non-gso 1500-sized IP packet arrives from tap0. After
encapsulation, the vxlan datagram is fragmented normally at the
ip_finish_output-->ip_fragment code path.
The expected behavior for the GSO case would be segmenting the
"gso-oversized" skb first, then fragmenting each segment according to
dst mtu, and finally passing the resulting fragments to ip_finish_output2.
'ip_finish_output_gso' already supports this "Slowpath" behavior,
according to the IPSKB_FRAG_SEGS flag, which is only set during ipv4
forwarding (not set in the bridged case).
In order to support the bridged case, we'll mark skbs arriving from an
ingress interface that get udp-encaspulated as "allowed to be fragmented",
causing their network_seglen to be validated by 'ip_finish_output_gso'
(and fragment if needed).
Note the TUNNEL_DONT_FRAGMENT tun_flag is still honoured (both in the
gso and non-gso cases), which serves users wishing to forbid
fragmentation at the udp tunnel endpoint.
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This flag indicates whether fragmentation of segments is allowed.
Formerly this policy was hardcoded according to IPSKB_FORWARDED (set by
either ip_forward or ipmr_forward).
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current min/max resvport settings are independently limited
by the entire range of allowed ports, so max_resvport can be
set to a port lower than min_resvport.
Prevent inversion of min/max values when set through sysfs and
module parameter by setting the limits dependent on each other.
Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The current min/max resvport settings are independently limited
by the entire range of allowed ports, so max_resvport can be
set to a port lower than min_resvport.
Prevent inversion of min/max values when set through sysctl by
setting the limits dependent on each other.
Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The range calculation for choosing the random reserved port will panic
with divide-by-zero when min_resvport == max_resvport, a range of one
port, not zero.
Fix the reserved port range calculation by adding one to the difference.
Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Author: Frank Sorenson <sorenson@redhat.com>
Date: 2016-06-27 13:55:48 -0500
sunrpc: Fix bit count when setting hashtable size to power-of-two
The hashtable size is incorrectly calculated as the next higher
power-of-two when being set to a power-of-two. fls() returns the
bit number of the most significant set bit, with the least
significant bit being numbered '1'. For a power-of-two, fls()
will return a bit number which is one higher than the number of bits
required, leading to a hashtable which is twice the requested size.
In addition, the value of (1 << nbits) will always be at least num,
so the test will never be true.
Fix the hash table size calculation to correctly set hashtable
size, and eliminate the unnecessary check.
Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
A generic_cred can be used to look up a unx_cred or a gss_cred, so it's
not really safe to use the the generic_cred->acred->ac_flags to store
the NO_CRKEY_TIMEOUT flag. A lookup for a unx_cred triggered while the
KEY_EXPIRE_SOON flag is already set will cause both NO_CRKEY_TIMEOUT and
KEY_EXPIRE_SOON to be set in the ac_flags, leaving the user associated
with the auth_cred to be in a state where they're perpetually doing 4K
NFS_FILE_SYNC writes.
This can be reproduced as follows:
1. Mount two NFS filesystems, one with sec=krb5 and one with sec=sys.
They do not need to be the same export, nor do they even need to be from
the same NFS server. Also, v3 is fine.
$ sudo mount -o v3,sec=krb5 server1:/export /mnt/krb5
$ sudo mount -o v3,sec=sys server2:/export /mnt/sys
2. As the normal user, before accessing the kerberized mount, kinit with
a short lifetime (but not so short that renewing the ticket would leave
you within the 4-minute window again by the time the original ticket
expires), e.g.
$ kinit -l 10m -r 60m
3. Do some I/O to the kerberized mount and verify that the writes are
wsize, UNSTABLE:
$ dd if=/dev/zero of=/mnt/krb5/file bs=1M count=1
4. Wait until you're within 4 minutes of key expiry, then do some more
I/O to the kerberized mount to ensure that RPC_CRED_KEY_EXPIRE_SOON gets
set. Verify that the writes are 4K, FILE_SYNC:
$ dd if=/dev/zero of=/mnt/krb5/file bs=1M count=1
5. Now do some I/O to the sec=sys mount. This will cause
RPC_CRED_NO_CRKEY_TIMEOUT to be set:
$ dd if=/dev/zero of=/mnt/sys/file bs=1M count=1
6. Writes for that user will now be permanently 4K, FILE_SYNC for that
user, regardless of which mount is being written to, until you reboot
the client. Renewing the kerberos ticket (assuming it hasn't already
expired) will have no effect. Grabbing a new kerberos ticket at this
point will have no effect either.
Move the flag to the auth->au_flags field (which is currently unused)
and rename it slightly to reflect that it's no longer associated with
the auth_cred->ac_flags. Add the rpc_auth to the arg list of
rpcauth_cred_key_to_expire and check the au_flags there too. Finally,
add the inode to the arg list of nfs_ctx_key_to_expire so we can
determine the rpc_auth to pass to rpcauth_cred_key_to_expire.
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
We only get nf_connlabels if the user add ct label set expr successfully,
but we will also put nf_connlabels if the user delete ct lable get expr.
This is mismathced, and will cause ct label expr cannot work properly.
Also, if we init something fail, we should put nf_connlabels back.
Otherwise, we may waste to alloc the memory that will never be used.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Do not cache pointers into the skb linear segment across sk_filter.
The function call can trigger pskb_expand_head.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In kernel HTB keeps tokens in signed 64-bit in nanoseconds. In netlink
protocol these values are converted into pshed ticks (64ns for now) and
truncated to 32-bit. In struct tc_htb_xstats fields "tokens" and "ctokens"
are declared as unsigned 32-bit but they could be negative thus tool 'tc'
prints them as signed. Big values loose higher bits and/or become negative.
This patch clamps tokens in xstat into range from INT_MIN to INT_MAX.
In this way it's easier to understand what's going on here.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
The dummy ruleset I used to test the original validation change was broken,
most rules were unreachable and were not tested by mark_source_chains().
In some cases rulesets that used to load in a few seconds now require
several minutes.
sample ruleset that shows the behaviour:
echo "*filter"
for i in $(seq 0 100000);do
printf ":chain_%06x - [0:0]\n" $i
done
for i in $(seq 0 100000);do
printf -- "-A INPUT -j chain_%06x\n" $i
printf -- "-A INPUT -j chain_%06x\n" $i
printf -- "-A INPUT -j chain_%06x\n" $i
done
echo COMMIT
[ pipe result into iptables-restore ]
This ruleset will be about 74mbyte in size, with ~500k searches
though all 500k[1] rule entries. iptables-restore will take forever
(gave up after 10 minutes)
Instead of always searching the entire blob for a match, fill an
array with the start offsets of every single ipt_entry struct,
then do a binary search to check if the jump target is present or not.
After this change ruleset restore times get again close to what one
gets when reverting 3647234101 (~3 seconds on my workstation).
[1] every user-defined rule gets an implicit RETURN, so we get
300k jumps + 100k userchains + 100k returns -> 500k rule entries
Fixes: 3647234101 ("netfilter: x_tables: validate targets of jumps")
Reported-by: Jeff Wu <wujiafu@gmail.com>
Tested-by: Jeff Wu <wujiafu@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If we hit any of the error conditions inside xfrm_dump_sa(), then
xfrm_state_walk_init() never gets called. However, we still call
xfrm_state_walk_done() from xfrm_dump_sa_done(), which will crash
because the state walk was never initialized properly.
We can fix this by setting cb->args[0] only after we've processed the
first element and checking this before calling xfrm_state_walk_done().
Fixes: d3623099d3 ("ipsec: add support of limited SA dump")
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Some Bluetooth controllers allow for reading hardware and firmware
related vendor specific infos. If they are available, then they can be
exposed via debugfs now.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
When we retrieve imtu value from userspace we should use 16 bit pointer
cast instead of 32 as it's defined that way in headers. Fixes setsockopt
calls on big-endian platforms.
Signed-off-by: Amadeusz Sławiński <amadeusz.slawinski@tieto.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Cc: stable@vger.kernel.org
commit 90017accff ("sctp: Add GSO support") didn't register SCTP GSO
offloading for IPv6 and yet didn't put any restrictions on generating
GSO packets while in IPv6, which causes all IPv6 GSO'ed packets to be
silently dropped.
The fix is to properly register the offload this time.
Fixes: 90017accff ("sctp: Add GSO support")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit d46e416c11 missed to update some other places which checked for
the socket being TCP-style AND Established state, as Closing state has
some overlapping with the previous understanding of Established.
Without this fix, one of the effects is that some already queued rx
messages may not be readable anymore depending on how the association
teared down, and sending may also not be possible if peer initiated the
shutdown.
Also merge two if() blocks into one condition on sctp_sendmsg().
Cc: Xin Long <lucien.xin@gmail.com>
Fixes: d46e416c11 ("sctp: sctp should change socket state when shutdown is received")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for hardware offloading of ipmr/ip6mr we need an
interface that allows to check (and later update) the age of entries.
Relying on stats alone can show activity but not actual age of the entry,
furthermore when there're tens of thousands of entries a lot of the
hardware implementations only support "hit" bits which are cleared on
read to denote that the entry was active and shouldn't be aged out,
these can then be naturally translated into age timestamp and will be
compatible with the software forwarding age. Using a lastuse entry doesn't
affect performance because the members in that cache line are written to
along with the age.
Since all new users are encouraged to use ipmr via netlink, this is
exported via the RTA_EXPIRES attribute.
Also do a minor local variable declaration style adjustment - arrange them
longest to shortest.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
CC: Shrijeet Mukherjee <shm@cumulusnetworks.com>
CC: Satish Ashok <sashok@cumulusnetworks.com>
CC: Donald Sharp <sharpd@cumulusnetworks.com>
CC: David S. Miller <davem@davemloft.net>
CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
CC: James Morris <jmorris@namei.org>
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
CC: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
macsec can't cope with mtu frames which need vlan tag insertion, and
vlan device set the default mtu equal to the underlying dev's one.
By default vlan over macsec devices use invalid mtu, dropping
all the large packets.
This patch adds a netif helper to check if an upper vlan device
needs mtu reduction. The helper is used during vlan devices
initialization to set a valid default and during mtu updating to
forbid invalid, too bit, mtu values.
The helper currently only check if the lower dev is a macsec device,
if we get more users, we need to update only the helper (possibly
reserving an additional IFF bit).
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before this patch we had two flavors of most forwarding functions -
_forward and _deliver, the difference being that the latter are used
when the packets are locally originated. Instead of all this function
pointer passing and code duplication, we can just pass a boolean noting
that the packet was locally originated and use that to perform the
necessary checks in __br_forward. This gives a minor performance
improvement but more importantly consolidates the forwarding paths.
Also add a kernel doc comment to explain the exported br_forward()'s
arguments.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently if the packet is going to be received locally we set skb0 or
sometimes called skb2 variables to the original skb. This can get
confusing and also we can avoid one conditional on the fast path by
simply using a boolean and passing it around. Thanks to Roopa for the
name suggestion.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes one conditional from the unicast path by using the fact
that skb is NULL only when the packet is multicast or is local.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Trivial style changes in br_handle_frame_finish.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If there were less than 2 entries in the multipath list, then
xprt_iter_next_entry_multiple() would never advance beyond the
first entry, which is correct for round robin behaviour, but not
for the list iteration.
The end result would be infinite looping in rpc_clnt_iterate_for_each_xprt()
as we would never see the xprt == NULL condition fulfilled.
Reported-by: Oleg Drokin <green@linuxhacker.ru>
Fixes: 80b14d5e61 ("SUNRPC: Add a structure to track multiple transports")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This adds kernel-doc style descriptions for 6 functions and
fixes 1 typo.
Signed-off-by: Richard Sailer <richard@weltraumpflege.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This work addresses a couple of issues bpf_skb_event_output()
helper currently has: i) We need two copies instead of just a
single one for the skb data when it should be part of a sample.
The data can be non-linear and thus needs to be extracted via
bpf_skb_load_bytes() helper first, and then copied once again
into the ring buffer slot. ii) Since bpf_skb_load_bytes()
currently needs to be used first, the helper needs to see a
constant size on the passed stack buffer to make sure BPF
verifier can do sanity checks on it during verification time.
Thus, just passing skb->len (or any other non-constant value)
wouldn't work, but changing bpf_skb_load_bytes() is also not
the proper solution, since the two copies are generally still
needed. iii) bpf_skb_load_bytes() is just for rather small
buffers like headers, since they need to sit on the limited
BPF stack anyway. Instead of working around in bpf_skb_load_bytes(),
this work improves the bpf_skb_event_output() helper to address
all 3 at once.
We can make use of the passed in skb context that we have in
the helper anyway, and use some of the reserved flag bits as
a length argument. The helper will use the new __output_custom()
facility from perf side with bpf_skb_copy() as callback helper
to walk and extract the data. It will pass the data for setup
to bpf_event_output(), which generates and pushes the raw record
with an additional frag part. The linear data used in the first
frag of the record serves as programmatically defined meta data
passed along with the appended sample.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The per-socket rate limit for 'challenge acks' was introduced in the
context of limiting ack loops:
commit f2b2c582e8 ("tcp: mitigate ACK loops for connections as tcp_sock")
And I think it can be extended to rate limit all 'challenge acks' on a
per-socket basis.
Since we have the global tcp_challenge_ack_limit, this patch allows for
tcp_challenge_ack_limit to be set to a large value and effectively rely on
the per-socket limit, or set tcp_challenge_ack_limit to a lower value and
still prevents a single connections from consuming the entire challenge ack
quota.
It further moves in the direction of eliminating the global limit at some
point, as Eric Dumazet has suggested. This a follow-up to:
Subject: tcp: make challenge acks less predictable
Cc: Eric Dumazet <edumazet@google.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Yue Cao <ycao009@ucr.edu>
Signed-off-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The rxrpc_lookup_peer() function returns NULL on error, it never returns
error pointers.
Fixes: 8496af50eb ('rxrpc: Use RCU to access a peer's service connection tree')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use RDS probe-ping to compute how many paths may be used with
the peer, and to synchronously start the multiple paths. If mprds is
supported, hash outgoing traffic to one of multiple paths in rds_sendmsg()
when multipath RDS is supported by the transport.
CC: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some code duplication in rds_tcp_reset_callbacks() can be avoided
by having the function call rds_tcp_restore_callbacks() and
rds_tcp_set_callbacks().
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As the existing comments in rds_tcp_listen_data_ready() indicate,
it is possible under some race-windows to get to this function with the
accept() socket. If that happens, we could run into a sequence whereby
thread 1 thread 2
rds_tcp_accept_one() thread
sets up new_sock via ->accept().
The sk_user_data is now
sock_def_readable
data comes in for new_sock,
->sk_data_ready is called, and
we land in rds_tcp_listen_data_ready
rds_tcp_set_callbacks()
takes the sk_callback_lock and
sets up sk_user_data to be the cp
read_lock sk_callback_lock
ready = cp
unlock sk_callback_lock
page fault on ready
In the above sequence, we end up with a panic on a bad page reference
when trying to execute (*ready)(). Instead we need to call
sock_def_readable() safely, which is what this patch achieves.
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This helper serves to know if two switchdev port netdevices belong to the
same HW ASIC, e.g to figure out if forwarding offload is possible between them.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently only read-only checks are performed up to the point on where
we check if peer is ECN capable, checks which we can avoid otherwise.
The flag ecn_ce_done is only used to perform this check once per
incoming packet, and nothing more.
Thus this patch moves the peer check up.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We should not clear that flag when switching to a new skb from a GSO skb
because it would cause ECN processing to happen multiple times per GSO
skb, which is not wanted. Instead, let it be processed once per chunk.
That is, in other words, once per IP header available.
Fixes: 90017accff ("sctp: Add GSO support")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Identifying address family operations during rx path is not something
expensive but it's ugly to the eye to have it done multiple times,
specially when we already validated it during initial rx processing.
This patch takes advantage of the now shared sctp_input_cb and make the
pointer to the operations readily available.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SCTP will try to access original IP headers on sctp_recvmsg in order to
copy the addresses used. There are also other places that do similar access
to IP or even SCTP headers. But after 90017accff ("sctp: Add GSO
support") they aren't always there because they are only present in the
header skb.
SCTP handles the queueing of incoming data by cloning the incoming skb
and limiting to only the relevant payload. This clone has its cb updated
to something different and it's then queued on socket rx queue. Thus we
need to fix this in two moments.
For rx path, not related to socket queue yet, this patch uses a
partially copied sctp_input_cb to such GSO frags. This restores the
ability to access the headers for this part of the code.
Regarding the socket rx queue, it removes iif member from sctp_event and
also add a chunk pointer on it.
With these changes we're always able to reach the headers again.
The biggest change here is that now the sctp_chunk struct and the
original skb are only freed after the application consumed the buffer.
Note however that the original payload was already like this due to the
skb cloning.
For iif, SCTP's IPv4 code doesn't use it, so no change is necessary.
IPv6 now can fetch it directly from original's IPv6 CB as the original
skb is still accessible.
In the future we probably can simplify sctp_v*_skb_iif() stuff, as
sctp_v4_skb_iif() was called but it's return value not used, and now
it's not even called, but such cleanup is out of scope for this change.
Fixes: 90017accff ("sctp: Add GSO support")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The next patch needs 8 bytes in there. sctp_ulpevent has a hole due to
bad alignment; msg_flags is using 4 bytes while it actually uses only 2, so
we shrink it, and iif member (4 bytes) which can be easily fetched from
another place once the next patch is there, so we remove it and thus
creating space for 8 bytes.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We process input path in other files too and having access to it is
nice, so move it to a header where it's shared.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth-next 2016-07-13
Here's our main bluetooth-next pull request for the 4.8 kernel:
- Fixes and cleanups in 802.15.4 and 6LoWPAN code
- Fix out of bounds issue in btmrvl driver
- Fixes to Bluetooth socket recvmsg return values
- Use crypto_cipher_encrypt_one() instead of crypto_skcipher
- Cleanup of Bluetooth connection sysfs interface
- New Authentication failure reson code for Disconnected mgmt event
- New USB IDs for Atheros, Qualcomm and Intel Bluetooth controllers
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The current server rpc tcp code attempts to predict how much writeable
socket space will be available to a given RPC call before accepting it
for processing. On a 40GigE network, we've found this throttles
individual clients long before the network or disk is saturated. The
server may handle more clients easily, but the bandwidth of individual
clients is still artificially limited.
Instead of trying (and failing) to predict how much writeable socket space
will be available to the RPC call, just fall back to the simple model of
deferring processing until the socket is uncongested.
This may increase the risk of fast clients starving slower clients; in
such cases, the previous patch allows setting a hard per-connection
limit.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Allow the user to limit the number of requests serviced through a single
connection, to help prevent faster clients from starving slower clients.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Don't call svc_xprt_enqueue() if the XPT_DATA flag is already set.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Rather than code up our own versions of the socket callbacks, just
call the defaults.
This also allows us to merge svc_udp_data_ready() and svc_tcp_data_ready().
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Prevent callbacks from triggering while we're detaching the socket.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Dropping and/or deferring requests has an impact on performance. Let's
make sure we can trace those events.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Add a tracepoint to track when the processing of incoming RPC data gets
deferred due to out-of-space issues on the outgoing transport.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
GSS-Proxy doesn't produce very much debug logging at all. Printing out
the gss minor status will aid in troubleshooting if the
GSS_Accept_sec_context upcall fails.
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Dccp verifies packet integrity, including length, at initial rcv in
dccp_invalid_packet, later pulls headers in dccp_enqueue_skb.
A call to sk_filter in-between can cause __skb_pull to wrap skb->len.
skb_copy_datagram_msg interprets this as a negative value, so
(correctly) fails with EFAULT. The negative length is reported in
ioctl SIOCINQ or possibly in a DCCP_WARN in dccp_close.
Introduce an sk_receive_skb variant that caps how small a filter
program can trim packets, and call this in dccp with the header
length. Excessively trimmed packets are now processed normally and
queued for reception as 0B payloads.
Fixes: 7c657876b6 ("[DCCP]: Initial implementation")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sockets can have a filter program attached that drops or trims
incoming packets based on the filter program return value.
Rose requires data packets to have at least ROSE_MIN_LEN bytes. It
verifies this on arrival in rose_route_frame and unconditionally pulls
the bytes in rose_recvmsg. The filter can trim packets to below this
value in-between, causing pull to fail, leaving the partial header at
the time of skb_copy_datagram_msg.
Place a lower bound on the size to which sk_filter may trim packets
by introducing sk_filter_trim_cap and call this for rose packets.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Increment the mgmt revision due to the recently added new
reason code for the Disconnected event.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Define a tracepoint and allow user to trace messages going to and from
hardware associated with devlink instance.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes the following sparse warning:
net/dsa/dsa2.c:680:6: warning:
symbol '_dsa_unregister_switch' was not declared. Should it be static?
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
security initialized after alloc workqueue, so we should exit security
before destroy workqueue in the error handing.
Fixes: 648af7fca1 ("rxrpc: Absorb the rxkad security module")
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS fixes for net
The following patchset contains Netfilter/IPVS fixes for your net tree.
they are:
1) Fix leak in the error path of nft_expr_init(), from Liping Zhang.
2) Tracing from nf_tables cannot be disabled, also from Zhang.
3) Fix an integer overflow on 32bit archs when setting the number of
hashtable buckets, from Florian Westphal.
4) Fix configuration of ipvs sync in backup mode with IPv6 address,
from Quentin Armitage via Simon Horman.
5) Fix incorrect timeout calculation in nft_ct NFT_CT_EXPIRATION,
from Florian Westphal.
6) Skip clash resolution in conntrack insertion races if NAT is in
place.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The clash resolution is not easy to apply if the NAT table is
registered. Even if no NAT rules are installed, the nul-binding ensures
that a unique tuple is used, thus, the packet that loses race gets a
different source port number, as described by:
http://marc.info/?l=netfilter-devel&m=146818011604484&w=2
Clash resolution with NAT is also problematic if addresses/port range
ports are used since the conntrack that wins race may describe a
different mangling that we may have earlier applied to the packet via
nf_nat_setup_info().
Fixes: 71d8c47fc6 ("netfilter: conntrack: introduce clash resolution on insertion race")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Tested-by: Marc Dionne <marc.c.dionne@gmail.com>
User can add ct entry via nfnetlink(IPCTNL_MSG_CT_NEW), and if the total
number reach the nf_conntrack_max, we will try to drop some ct entries.
But in this case(the main function call path is ctnetlink_create_conntrack
-> nf_conntrack_alloc -> early_drop), rcu_read_lock is not held, so race
with hash resize will happen.
Fixes: 242922a027 ("netfilter: conntrack: simplify early_drop")
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The Makefile controlling compilation of this file is obj-y,
meaning that it currently is never being built as a module.
Since MODULE_ALIAS is a no-op for non-modular code, we can simply
remove the MODULE_ALIAS_NETPROTO variant used here.
We replace module.h with kmod.h since the file does make use of
request_module() in order to load other modules from here.
We don't have to worry about init.h coming in via the removed
module.h since the file explicitly includes init.h already.
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In test situations with many nodes and a heavily stressed system we have
observed that the transmission broadcast link may fail due to an
excessive number of retransmissions of the same packet. In such
situations we need to reset all unicast links to all peers, in order to
reset and re-synchronize the broadcast link.
In this commit, we add a new function tipc_bearer_reset_all() to be used
in such situations. The function scans across all bearers and resets all
their pertaining links.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After a new receiver peer has been added to the broadcast transmission
link, we allow immediate transmission of new broadcast packets, trusting
that the new peer will not accept the packets until it has received the
previously sent unicast broadcast initialiation message. In the same
way, the sender must not accept any acknowledges until it has itself
received the broadcast initialization from the peer, as well as
confirmation of the reception of its own initialization message.
Furthermore, when a receiver peer goes down, the sender has to produce
the missing acknowledges from the lost peer locally, in order ensure
correct release of the buffers that were expected to be acknowledged by
the said peer.
In a highly stressed system we have observed that contact with a peer
may come up and be lost before the above mentioned broadcast initial-
ization and confirmation have been received. This leads to the locally
produced acknowledges being rejected, and the non-acknowledged buffers
to linger in the broadcast link transmission queue until it fills up
and the link goes into permanent congestion.
In this commit, we remedy this by temporarily setting the corresponding
broadcast receive link state to ESTABLISHED and the 'bc_peer_is_up'
state to true before we issue the local acknowledges. This ensures that
those acknowledges will always be accepted. The mentioned state values
are restored immediately afterwards when the link is reset.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
At first contact between two nodes, an endpoint might sometimes have
time to send out a LINK_PROTOCOL/STATE packet before it has received
the broadcast initialization packet from the peer, i.e., before it has
received a valid broadcast packet number to add to the 'bc_ack' field
of the protocol message.
This means that the peer endpoint will receive a protocol packet with an
invalid broadcast acknowledge value of 0. Under unlucky circumstances
this may lead to the original, already received acknowledge value being
overwritten, so that the whole broadcast link goes stale after a while.
We fix this by delaying the setting of the link field 'bc_peer_is_up'
until we know that the peer really has received our own broadcast
initialization message. The latter is always sent out as the first
unicast message on a link, and always with seqeunce number 1. Because
of this, we only need to look for a non-zero unicast acknowledge value
in the arriving STATE messages, and once that is confirmed we know we
are safe and can set the mentioned field. Before this moment, we must
ignore all broadcast acknowledges from the peer.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sergei Trofimovich reported that pulse audio sends SCM_CREDENTIALS
as a control message to TCP. Since __sock_cmsg_send does not
support SCM_RIGHTS and SCM_CREDENTIALS, it returns an error and
hence breaks pulse audio over TCP.
SCM_RIGHTS and SCM_CREDENTIALS are sent on the SOL_SOCKET layer
but they semantically belong to SOL_UNIX. Since all
cmsg-processing functions including sock_cmsg_send ignore control
messages of other layers, it is best to ignore SCM_RIGHTS
and SCM_CREDENTIALS for consistency (and also for fixing pulse
audio over TCP).
Fixes: c14ac9451c ("sock: enable timestamping using control messages")
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Reported-by: Sergei Trofimovich <slyfox@gentoo.org>
Tested-by: Sergei Trofimovich <slyfox@gentoo.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vegard Nossum is reporting for a crash in fib_dump_info
when nh_dev = NULL and fib_nhs == 1:
Pid: 50, comm: netlink.exe Not tainted 4.7.0-rc5+
RIP: 0033:[<00000000602b3d18>]
RSP: 0000000062623890 EFLAGS: 00010202
RAX: 0000000000000000 RBX: 000000006261b800 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000024 RDI: 000000006245ba00
RBP: 00000000626238f0 R08: 000000000000029c R09: 0000000000000000
R10: 0000000062468038 R11: 000000006245ba00 R12: 000000006245ba00
R13: 00000000625f96c0 R14: 00000000601e16f0 R15: 0000000000000000
Kernel panic - not syncing: Kernel mode fault at addr 0x2e0, ip 0x602b3d18
CPU: 0 PID: 50 Comm: netlink.exe Not tainted 4.7.0-rc5+ #581
Stack:
626238f0 960226a02 00000400 000000fe
62623910 600afca7 62623970 62623a48
62468038 00000018 00000000 00000000
Call Trace:
[<602b3e93>] rtmsg_fib+0xd3/0x190
[<602b6680>] fib_table_insert+0x260/0x500
[<602b0e5d>] inet_rtm_newroute+0x4d/0x60
[<60250def>] rtnetlink_rcv_msg+0x8f/0x270
[<60267079>] netlink_rcv_skb+0xc9/0xe0
[<60250d4b>] rtnetlink_rcv+0x3b/0x50
[<60265400>] netlink_unicast+0x1a0/0x2c0
[<60265e47>] netlink_sendmsg+0x3f7/0x470
[<6021dc9a>] sock_sendmsg+0x3a/0x90
[<6021e0d0>] ___sys_sendmsg+0x300/0x360
[<6021fa64>] __sys_sendmsg+0x54/0xa0
[<6021fac0>] SyS_sendmsg+0x10/0x20
[<6001ea68>] handle_syscall+0x88/0x90
[<600295fd>] userspace+0x3fd/0x500
[<6001ac55>] fork_handler+0x85/0x90
$ addr2line -e vmlinux -i 0x602b3d18
include/linux/inetdevice.h:222
net/ipv4/fib_semantics.c:1264
Problem happens when RTNH_F_LINKDOWN is provided from user space
when creating routes that do not use the flag, catched with
netlink fuzzer.
Currently, the kernel allows user space to set both flags
to nh_flags and fib_flags but this is not intentional, the
assumption was that they are not set. Fix this by rejecting
both flags with EINVAL.
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Fixes: 0eeb075fad ("net: ipv4 sysctl option to ignore routes when nexthop link is down")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Cc: Andy Gospodarek <gospo@cumulusnetworks.com>
Cc: Dinesh Dutt <ddutt@cumulusnetworks.com>
Cc: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yue Cao claims that current host rate limiting of challenge ACKS
(RFC 5961) could leak enough information to allow a patient attacker
to hijack TCP sessions. He will soon provide details in an academic
paper.
This patch increases the default limit from 100 to 1000, and adds
some randomization so that the attacker can no longer hijack
sessions without spending a considerable amount of probes.
Based on initial analysis and patch from Linus.
Note that we also have per socket rate limiting, so it is tempting
to remove the host limit in the future.
v2: randomize the count of challenge acks per second, not the period.
Fixes: 282f23c6ee ("tcp: implement RFC 5961 3.2")
Reported-by: Yue Cao <ycao009@ucr.edu>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using a combination if #if conditionals and goto labels to unwind
tunnel4_init seems unwieldy. This patch takes a simpler approach of
directly unregistering previously registered protocols when an error
occurs.
This fixes a number of problems with the current implementation
including the potential presence of labels when they are unused
and the potential absence of unregister code when it is needed.
Fixes: 8afe97e5d4 ("tunnels: support MPLS over IPv4 tunnels")
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
prsctp PRIO policy is a policy to abandon lower priority chunks when
asoc doesn't have enough snd buffer, so that the current chunk with
higher priority can be queued successfully.
Similar to TTL/RTX policy, we will set the priority of the chunk to
prsctp_param with sinfo->sinfo_timetolive in sctp_set_prsctp_policy().
So if PRIO policy is enabled, msg->expire_at won't work.
asoc->sent_cnt_removable will record how many chunks can be checked to
remove. If priority policy is enabled, when the chunk is queued into
the out_queue, we will increase sent_cnt_removable. When the chunk is
moved to abandon_queue or dequeue and free, we will decrease
sent_cnt_removable.
In sctp_sendmsg, we will check if there is enough snd buffer for current
msg and if sent_cnt_removable is not 0. Then try to abandon chunks in
sctp_prune_prsctp when sendmsg from the retransmit/transmited queue, and
free chunks from out_queue in right order until the abandon+free size >
msg_len - sctp_wfree. For the abandon size, we have to wait until it
sends FORWARD TSN, receives the sack and the chunks are really freed.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
prsctp RTX policy is a policy to abandon chunks when they are
retransmitted beyond the max count.
This patch uses sent_count to count how many times one chunk has
been sent, and prsctp_param is the max rtx count, which is from
sinfo->sinfo_timetolive in sctp_set_prsctp_policy(). So similar
to TTL policy, if RTX policy is enabled, msg->expire_at won't
work.
Then in sctp_chunk_abandoned, this patch checks if chunk->sent_count
is bigger than chunk->prsctp_param to abandon this chunk.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
prsctp TTL policy is a policy to abandon chunks when they expire
at the specific time in local stack. It's similar with expires_at
in struct sctp_datamsg.
This patch uses sinfo->sinfo_timetolive to set the specific time for
TTL policy. sinfo->sinfo_timetolive is also used for msg->expires_at.
So if prsctp_enable or TTL policy is not enabled, msg->expires_at
still works as before.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds SCTP_PR_ASSOC_STATUS to sctp sockopt, which is used
to dump the prsctp statistics info from the asoc. The prsctp statistics
includes abandoned_sent/unsent from the asoc. abandoned_sent is the
count of the packets we drop packets from retransmit/transmited queue,
and abandoned_unsent is the count of the packets we drop from out_queue
according to the policy.
Note: another option for prsctp statistics dump described in rfc is
SCTP_PR_STREAM_STATUS, which is used to dump the prsctp statistics
info from each stream. But by now, linux doesn't yet have per stream
statistics info, it needs rfc6525 to be implemented. As the prsctp
statistics for each stream has to be based on per stream statistics,
we will delay it until rfc6525 is done in linux.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds SCTP_DEFAULT_PRINFO to sctp sockopt. It is used
to set/get sctp Partially Reliable Policies' default params,
which includes 3 policies (ttl, rtx, prio) and their values.
Still, if we set policy params in sndinfo, we will use the params
of sndinfo against chunks, instead of the default params.
In this patch, we will use 5-8bit of sp/asoc->default_flags
to store prsctp policies, and reuse asoc->default_timetolive
to store their values. It means if we enable and set prsctp
policy, prior ttl timeout in sctp will not work any more.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
According to section 4.5 of rfc7496, prsctp_enable should be per asoc.
We will add prsctp_enable to both asoc and ep, and replace the places
where it used net.sctp->prsctp_enable with asoc->prsctp_enable.
ep->prsctp_enable will be initialized with net.sctp->prsctp_enable, and
asoc->prsctp_enable will be initialized with ep->prsctp_enable. We can
also modify it's value through sockopt SCTP_PR_SUPPORTED.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before commit 778be232a2 ("NFS do not find client in NFSv4
pg_authenticate"), the Linux callback server replied with
RPC_AUTH_ERROR / RPC_AUTH_BADCRED, instead of dropping the CB
request. Let's restore that behavior so the server has a chance to
do something useful about it, and provide a warning that helps
admins correct the problem.
Fixes: 778be232a2 ("NFS do not find client in NFSv4 ...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If an RPC program does not set vs_dispatch and pc_func() returns
rpc_drop_reply, the server sends a reply anyway containing a single
word containing the value RPC_DROP_REPLY (in network byte-order, of
course). This is a nonsense RPC message.
Fixes: 9e701c6109 ("svcrpc: simpler request dropping")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Direct data placement is not allowed when using flavors that
guarantee integrity or privacy. When such security flavors are in
effect, don't allow the use of Read and Write chunks for moving
individual data items. All messages larger than the inline threshold
are sent via Long Call or Long Reply.
On my systems (CX-3 Pro on FDR), for small I/O operations, the use
of Long messages adds only around 5 usecs of latency in each
direction.
Note that when integrity or encryption is used, the host CPU touches
every byte in these messages. Even if it could be used, data
movement offload doesn't buy much in this case.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
fixup_copy_count should count only the number of bytes copied to the
page list. The head and tail are now always handled without a data
copy.
And the debugging at the end of rpcrdma_inline_fixup() is also no
longer necessary, since copy_len will be non-zero when there is reply
data in the tail (a normal and valid case).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Now that rpcrdma_inline_fixup() updates only two fields in
rq_rcv_buf, a full memcpy of that structure to rq_private_buf is
unwarranted. Updating rq_private_buf fields only where needed also
better documents what is going on.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
While trying NFSv4.0/RDMA with sec=krb5p, I noticed small NFS READ
operations failed. After the client unwrapped the NFS READ reply
message, the NFS READ XDR decoder was not able to decode the reply.
The message was "Server cheating in reply", with the reported
number of received payload bytes being zero. Applications reported
a read(2) that returned -1/EIO.
The problem is rpcrdma_inline_fixup() sets the tail.iov_len to zero
when the incoming reply fits entirely in the head iovec. The zero
tail.iov_len confused xdr_buf_trim(), which then mangled the actual
reply data instead of simply removing the trailing GSS checksum.
As near as I can tell, RPC transports are not supposed to update the
head.iov_len, page_len, or tail.iov_len fields in the receive XDR
buffer when handling an incoming RPC reply message. These fields
contain the length of each component of the XDR buffer, and hence
the maximum number of bytes of reply data that can be stored in each
XDR buffer component. I've concluded this because:
- This is how xdr_partial_copy_from_skb() appears to behave
- rpcrdma_inline_fixup() already does not alter page_len
- call_decode() compares rq_private_buf and rq_rcv_buf and WARNs
if they are not exactly the same
Unfortunately, as soon as I tried the simple fix to just remove the
line that sets tail.iov_len to zero, I saw that the logic that
appends the implicit Write chunk pad inline depends on inline_fixup
setting tail.iov_len to zero.
To address this, re-organize the tail iovec handling logic to use
the same approach as with the head iovec: simply point tail.iov_base
to the correct bytes in the receive buffer.
While I remember all this, write down the conclusion in documenting
comments.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
When the remaining length of an incoming reply is longer than the
XDR buf's page_len, switch over to the tail iovec instead of
copying more than page_len bytes into the page list.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Currently, all three chunk list encoders each use a portion of the
one rl_segments array in rpcrdma_req. This is because the MWs for
each chunk list were preserved in rl_segments so that ro_unmap could
find and invalidate them after the RPC was complete.
However, now that MWs are placed on a per-req linked list as they
are registered, there is no longer any information in rpcrdma_mr_seg
that is shared between ro_map and ro_unmap_{sync,safe}, and thus
nothing in rl_segments needs to be preserved after
rpcrdma_marshal_req is complete.
Thus the rl_segments array can be used now just for the needs of
each rpcrdma_convert_iovs call. Once each chunk list is encoded, the
next chunk list encoder is free to re-use all of rl_segments.
This means all three chunk lists in one RPC request can now each
encode a full size data payload with no increase in the size of
rl_segments.
This is a key requirement for Kerberos support, since both the Call
and Reply for a single RPC transaction are conveyed via Long
messages (RDMA Read/Write). Both can be large.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Instead of placing registered MWs sparsely into the rl_segments
array, place these MWs on a per-req list.
ro_unmap_{sync,safe} can then simply pull those MWs off the list
instead of walking through the array.
This change significantly reduces the size of struct rpcrdma_req
by removing nsegs and rl_mw from every array element.
As an additional clean-up, chunk co-ordinates are returned in the
"*mw" output argument so they are no longer needed in every
array element.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Instead of leaving orphaned MRs to be released when the transport
is destroyed, release them immediately. The MR free list can now be
replenished if it becomes exhausted.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Frequent MR list exhaustion can impact I/O throughput, so enough MRs
are always created during transport set-up to prevent running out.
This means more MRs are created than most workloads need.
Commit 94f58c58c0 ("xprtrdma: Allow Read list and Reply chunk
simultaneously") introduced support for sending two chunk lists per
RPC, which consumes more MRs per RPC.
Instead of trying to provision more MRs, introduce a mechanism for
allocating MRs on demand. A few MRs are allocated during transport
set-up to kick things off.
This significantly reduces the average number of MRs per transport
while allowing the MR count to grow for workloads or devices that
need more MRs.
FRWR with mlx4 allocated almost 400 MRs per transport before this
patch. Now it starts with 32.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up, based on code audit: Remove the possibility that the
chunk list XDR encoders can return zero, which would be interpreted
as a NULL.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Commit c93c62231c ("xprtrdma: Disconnect on registration failure")
added a disconnect for some RPC marshaling failures. This is needed
only in a handful of cases, but it was triggering for simple stuff
like temporary resource shortages. Try to straighten this out.
Fix up the lower layers so they don't return -ENOMEM or other error
codes that the RPC client's FSM doesn't explicitly recognize.
Also fix up the places in the send_request path that do want a
disconnect. For example, when ib_post_send or ib_post_recv fail,
this is a sign that there is a send or receive queue resource
miscalculation. That should be rare, and is a sign of a software
bug. But xprtrdma can recover: disconnect to reset the transport and
start over.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Not having an rpcrdma_rep at call_allocate time can be a problem.
It means that send_request can't post a receive buffer to catch
the RPC's reply. Possible consequences are RPC timeouts or even
transport deadlock.
Instead of allowing an RPC to proceed if an rpcrdma_rep is
not available, return NULL to force call_allocate to wait and
try again.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: ALLPHYSICAL is gone and FMR has been converted to use
scatterlists. There are no more users of these functions.
This patch shrinks the size of struct rpcrdma_req by about 3500
bytes on x86_64. There is one of these structs for each RPC credit
(128 credits per transport connection).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
No HCA or RNIC in the kernel tree requires the use of ALLPHYSICAL.
ALLPHYSICAL advertises in the clear on the network fabric an R_key
that is good for all of the client's memory. No known exploit
exists, but theoretically any user on the server can use that R_key
on the client's QP to read or update any part of the client's memory.
ALLPHYSICAL exposes the client to server bugs, including:
o base/bounds errors causing data outside the i/o buffer to be
accessed
o RDMA access after reply causing data corruption and/or integrity
fail
ALLPHYSICAL can't protect application memory regions from server
update after a local signal or soft timeout has terminated an RPC.
ALLPHYSICAL chunks are no larger than a page. Special cases to
handle small chunks and long chunk lists have been a source of
implementation complexity and bugs.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Based on code audit.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
I found that commit ead3f26e35 ("xprtrdma: Add ro_unmap_safe
memreg method"), which introduces ro_unmap_safe, never wired up the
FMR recovery worker.
The FMR and FRWR recovery work queues both do the same thing.
Instead of setting up separate individual work queues for this,
schedule a delayed worker to deal with them, since recovering MRs is
not performance-critical.
Fixes: ead3f26e35 ("xprtrdma: Add ro_unmap_safe memreg method")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The use of a scatterlist for handling DMA mapping and unmapping
was recently introduced in frwr_ops.c in commit 4143f34e01
("xprtrdma: Port to new memory registration API"). That commit did
not make a similar update to xprtrdma's FMR support because the
core ib_map_phys_fmr() and ib_unmap_fmr() APIs have not been changed
to take a scatterlist argument.
However, FMR still needs to do DMA mapping and unmapping. It appears
that RDS, for example, uses a scatterlist for this, then builds the
DMA addr array for the ib_map_phys_fmr call separately. I see that
SRP also utilizes a scatterlist for DMA mapping. xprtrdma can do
something similar.
This modernization is used immediately to properly defer DMA
unmapping during fmr_unmap_safe (a FIXME). It separates the DMA
unmapping coordinates from the rl_segments array. This array, being
part of an rpcrdma_req, is always re-used immediately when an RPC
exits. A scatterlist is allocated in memory independent of the
rl_segments array, so it can be preserved indefinitely (ie, until
the MR invalidation and DMA unmapping can actually be done by a
worker thread).
The FRWR and FMR DMA mapping code are slightly different from each
other now, and will diverge further when the "Check for holes" logic
can be removed from FRWR (support for SG_GAP MRs). So I chose not to
create helpers for the common-looking code.
Fixes: ead3f26e35 ("xprtrdma: Add ro_unmap_safe memreg method")
Suggested-by: Sagi Grimberg <sagi@lightbits.io>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: Use the same naming convention used in other
RPC/RDMA-related data structures.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: Moving these helpers in a separate patch makes later
patches more readable.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: FMR is about to replace the rpcrdma_map_one code with
scatterlists. Move the scatterlist fields out of the FRWR-specific
union and into the generic part of rpcrdma_mw.
One minor change: -EIO is now returned if FRWR registration fails.
The RPC is terminated immediately, since the problem is likely due
to a software bug, thus retrying likely won't help.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
ib_unmap_fmr() takes a list of FMRs to unmap. However, it does not
remove the FMRs from this list as it processes them. Other
ib_unmap_fmr() call sites are careful to remove FMRs from the list
after ib_unmap_fmr() returns.
Since commit 7c7a5390dc ("xprtrdma: Add ro_unmap_sync method for FMR")
fmr_op_unmap_sync passes more than one FMR to ib_unmap_fmr(), but
it didn't bother to remove the FMRs from that list once the call was
complete.
I've noticed some instability that could be related to list
tangling by the new fmr_op_unmap_sync() logic. In an abundance
of caution, add some defensive logic to clean up properly after
ib_unmap_fmr().
Fixes: 7c7a5390dc ("xprtrdma: Add ro_unmap_sync method for FMR")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If socket filter truncates an udp packet below the length of UDP header
in udpv6_queue_rcv_skb() or udp_queue_rcv_skb(), it will trigger a
BUG_ON in skb_pull_rcsum(). This BUG_ON (and therefore a system crash if
kernel is configured that way) can be easily enforced by an unprivileged
user which was reported as CVE-2016-6162. For a reproducer, see
http://seclists.org/oss-sec/2016/q3/8
Fixes: e6afc8ace6 ("udp: remove headers from UDP packets before queueing")
Reported-by: Marco Grassi <marco.gra@gmail.com>
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Fix possible NULL pointer dereference for vlan_insert_tag (two patches)
- Fix reference handling in some features, which may lead to reference
leaks or invalid memory access (four patches)
- Fix speedy join: DHCP packets handled by the gateway feature should
be sent with 4-address unicast instead of 3-address unicast to make
speedy join work. This fixes/speeds up DHCP assignment for clients
which join a mesh for the first time. (one patch)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCgAGBQJXf3SCAAoJEKEr45hCkp6hAaAQAJxKFavGbXHWvj1M1VxqVFkN
AlxP7JZ6OHgnWxBT3drk4ZRaxIA7v/2VkRYrCbxoYjIENiyrmNz+93SAzaBcTKxE
nnUntdDbQWYE3MOGC1lUBIoPgjvs4DQRejyq5dvG9CYEcK9hE4pDKV7FUfeBgmgL
dG5+9ht8JEjMYZq48FQp4SQwkQGpWRiS4fekZEUmcO1pIQpx0uOYTMfMZ/HpqpCN
im1QhUXlAGCBcOIJwztqVb/04LKcuTS8Du+b50BFF5uITmCZdK0NmG5yBH+1Nn8K
uKYanY3dHYUE4eGw3NAqnJ0uSiMQFlhk3gqKgHY8uu/KoMiqZ3tjBkNp+3fF3KqH
0AnXEPPsQPU8RJ5WAHH6TR/UNnoCrfqU6AjbIclHNq7l3WY6u0fD2uKHCGlaV13M
8XolPWECum8iLEptmYDlhYZrh5D9kteGDV7kt3XtQY8Hpv/UE1Jh1/iGrhNjtbdX
7P6NsZdi/cnkGPhIaRnoEQaWHZVmbO4Rl8Q2Yb3Ze2LEUuLdrkmBjTBKqiOFMnMe
7ltA3JL7ip/alRPeNsuiHOY28uNaog3YuEHg8QYiyTs449Os/TjWoh9pzD44dhkB
auIxmiy/IyVdYwlQwfBHDJupVK7WncUq+iF/rv3TfTmY25FO4FC+EV+PsBZdWsc+
co+amJR57ZOAygd0GgU2
=7Z04
-----END PGP SIGNATURE-----
Merge tag 'batadv-net-for-davem-20160708' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
Here are a couple batman-adv bugfix patches, all by Sven Eckelmann:
- Fix possible NULL pointer dereference for vlan_insert_tag (two patches)
- Fix reference handling in some features, which may lead to reference
leaks or invalid memory access (four patches)
- Fix speedy join: DHCP packets handled by the gateway feature should
be sent with 4-address unicast instead of 3-address unicast to make
speedy join work. This fixes/speeds up DHCP assignment for clients
which join a mesh for the first time. (one patch)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch corrects an off-by-one error in the DecodeQ931 function in
the nf_conntrack_h323 module. This error could result in reading off
the end of a Q.931 frame.
Signed-off-by: Toby DiPasquale <toby@cbcg.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Simon Horman says:
====================
IPVS Updates for v4.8
please consider these enhancements to the IPVS. This alters the behaviour
of the "least connection" schedulers such that pre-established connections
are included in the active connection count. This avoids overloading
servers when a large number of new connections arrive in a short space of
time - e.g. when clients reconnect after a node or network failure.
====================
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We can pass the netns pointer as parameter to the functions that need to
gain access to it. From basechains, I didn't find any client for this
field anymore so let's remove this too.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If we want to use ct packets expr, and add a rule like follows:
# nft add rule filter input ct packets gt 1 counter
We will find that no packets will hit it, because
nf_conntrack_acct is disabled by default. So It will
not work until we enable it manually via
"echo 1 > /proc/sys/net/netfilter/nf_conntrack_acct".
This is not friendly, so like xt_connbytes do, if the user
want to use ct byte/packet expr, enable nf_conntrack_acct
automatically.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
physdev_mt() will check skb->nf_bridge first, which was alloced in
br_nf_pre_routing. So if we want to use --physdev-out and physdev-is-out,
we need to match it in FORWARD or POSTROUTING chain. physdev_mt_check()
only checked physdev-out and missed physdev-is-out. Fix it and update the
debug message to make it clearer.
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Marcelo R Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
It did use a fixed-size bucket list plus single lock to protect add/del.
Unlike the main conntrack table we only need to add and remove keys.
Convert it to rhashtable to get table autosizing and per-bucket locking.
The maximum number of entries is -- as before -- tied to the number of
conntracks so we do not need another upperlimit.
The change does not handle rhashtable_remove_fast error, only possible
"error" is -ENOENT, and that is something that can happen legitimetely,
e.g. because nat module was inserted at a later time and no src manip
took place yet.
Tested with http-client-benchmark + httpterm with DNAT and SNAT rules
in place.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Simon Horman says:
====================
Second Round of IPVS Fixes for v4.7
The fix from Quentin Armitage allows the backup sync daemon to
be bound to a link-local mcast IPv6 address as is already the case
for IPv4.
====================
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The nat extension structure is 32bytes in size on x86_64:
struct nf_conn_nat {
struct hlist_node bysource; /* 0 16 */
struct nf_conn * ct; /* 16 8 */
union nf_conntrack_nat_help help; /* 24 4 */
int masq_index; /* 28 4 */
/* size: 32, cachelines: 1, members: 4 */
/* last cacheline: 32 bytes */
};
The hlist is needed to quickly check for possible tuple collisions
when installing a new nat binding. Storing this in the extension
area has two drawbacks:
1. We need ct backpointer to get the conntrack struct from the extension.
2. When reallocation of extension area occurs we need to fixup the bysource
hash head via hlist_replace_rcu.
We can avoid both by placing the hlist_head in nf_conn and place nf_conn in
the bysource hash rather than the extenstion.
We can also remove the ->move support; no other extension needs it.
Moving the entire nat extension into nf_conn would be possible as well but
then we have to add yet another callback for deletion from the bysource
hash table rather than just using nat extension ->destroy hook for this.
nf_conn size doesn't increase due to aligment, followup patch replaces
hlist_node with single pointer.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We don't need to acquire the bucket lock during early drop, we can
use lockless traveral just like ____nf_conntrack_find.
The timer deletion serves as synchronization point, if another cpu
attempts to evict same entry, only one will succeed with timer deletion.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
From: Liping Zhang <liping.zhang@spreadtrum.com>
Similar to ctnl_untimeout, when hash resize happened, we should try
to do unhelp from the 0# bucket again.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Imagine such situation, nf_conntrack_htable_size now is 4096, we are doing
ctnl_untimeout, and iterate on 3000# bucket.
Meanwhile, another user try to reduce hash size to 2048, then all nf_conn
are removed to the new hashtable. When this hash resize operation finished,
we still try to itreate ct begin from 3000# bucket, find nothing to do and
just return.
We may miss unlinking some timeout objects. And later we will end up with
invalid references to timeout object that are already gone.
So when we find that hash resize happened, try to unlink timeout objects
from the 0# bucket again.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When we do "cat /proc/net/nf_conntrack", and meanwhile resize the conntrack
hash table via /sys/module/nf_conntrack/parameters/hashsize, race will
happen, because reader can observe a newly allocated hash but the old size
(or vice versa). So oops will happen like follows:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000017
IP: [<ffffffffa0418e21>] seq_print_acct+0x11/0x50 [nf_conntrack]
Call Trace:
[<ffffffffa0412f4e>] ? ct_seq_show+0x14e/0x340 [nf_conntrack]
[<ffffffff81261a1c>] seq_read+0x2cc/0x390
[<ffffffff812a8d62>] proc_reg_read+0x42/0x70
[<ffffffff8123bee7>] __vfs_read+0x37/0x130
[<ffffffff81347980>] ? security_file_permission+0xa0/0xc0
[<ffffffff8123cf75>] vfs_read+0x95/0x140
[<ffffffff8123e475>] SyS_read+0x55/0xc0
[<ffffffff817c2572>] entry_SYSCALL_64_fastpath+0x1a/0xa4
It is very easy to reproduce this kernel crash.
1. open one shell and input the following cmds:
while : ; do
echo $RANDOM > /sys/module/nf_conntrack/parameters/hashsize
done
2. open more shells and input the following cmds:
while : ; do
cat /proc/net/nf_conntrack
done
3. just wait a monent, oops will happen soon.
The solution in this patch is based on Florian's Commit 5e3c61f981
("netfilter: conntrack: fix lookup race during hash resize"). And
add a wrapper function nf_conntrack_get_ht to get hash and hsize
suggested by Florian Westphal.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When the target needs more time to process the received PDU, it sends
Response Timeout Extension (RTOX) PDU.
When the initiator receives a RTOX PDU, it must reply with a RTOX PDU
and extends the current rwt value with the formula:
rwt_int = rwt * rtox
This patch takes care of the rtox value passed by the target in the RTOX
PDU and extends the timeout for the next response accordingly.
Signed-off-by: Thierry Escande <thierry.escande@collabora.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
When sending an ATR_REQ, the initiator must wait for the ATR_RES at
least 'RWT(nfcdep,activation) + dRWT(nfcdep)' and no more than
'RWT(nfcdep,activation) + dRWT(nfcdep) + dT(nfcdep,initiator)'. This
gives a timeout value between 1237 ms and 1337 ms. This patch defines
DIGITAL_ATR_RES_RWT to 1337 used for the timeout value of ATR_REQ
command.
For other DEP PDUs, the initiator must wait between 'RWT + dRWT(nfcdep)'
and 'RWT + dRWT(nfcdep) + dT(nfcdep,initiator)' where RWT is given by
the following formula: '(256 * 16 / f(c)) * 2^wt' where wt is the value
of the TO field in the ATR_RES response and is in the range between 0
and 14. This patch declares a mapping table for wt values and gives RWT
max values between 100 ms and 5049 ms.
This patch also defines DIGITAL_ATR_RES_TO_WT, the maximum wt value in
target mode, to 8.
Signed-off-by: Thierry Escande <thierry.escande@collabora.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
This patch frees the RTOX resp sk_buff in initiator mode. It also makes
use of the free_resp exit point for ATN supervisor PDUs in both
initiator and target mode.
Signed-off-by: Thierry Escande <thierry.escande@collabora.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
With this patch, ACK PDU sk_buffs are now freed and code has been
refactored for better errors handling.
Signed-off-by: Thierry Escande <thierry.escande@collabora.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
When the target receives a NACK PDU, it re-sends the last sent PDU.
ACK PDUs are received by the target as a reply from the initiator to
chained I-PDUs. There are 3 cases to handle:
- If the target has previously received 1 or more ATN PDUs and the PNI
in the ACK PDU is equal to the target PNI - 1, then it means that the
initiator did not received the last issued PDU from the target. In
this case it re-sends this PDU.
- If the target has received 1 or more ATN PDUs but the ACK PNI is not
the target PNI - 1, then this means that this ACK is the reply of the
previous chained I-PDU sent by the target. The target did not received
it on the first attempt and it is being re-sent by the initiator. The
process continues as usual.
- No ATN PDU received before this ACK PDU. This is the reply of a
chained I-PDU. The target keeps on processing its chained I-PDU.
The code has been refactored to avoid too many indentation levels.
Also, ACK and NACK PDUs were not freed. This is now fixed.
Signed-off-by: Thierry Escande <thierry.escande@collabora.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
When the initiator sends a DEP_REQ I-PDU, the target device may not
reply in a timely manner. In this case the initiator device must send an
attention PDU (ATN) and if the recipient replies with an ATN PDU in
return, then the last I-PDU must be sent again by the initiator.
This patch fixes how the target handles I-PDU received after an ATN PDU
has been received.
There are 2 possible cases:
- The target has received the initial DEP_REQ and sends back the DEP_RES
but the initiator did not receive it. In this case, after the
initiator has sent an ATN PDU and the target replied it (with an ATN
as well), the initiator sends the saved skb of the initial DEP_REQ
again and the target replies with the saved skb of the initial
DEP_RES.
- Or the target did not even received the initial DEP_REQ. In this case,
after the ATN PDUs exchange, the initiator sends the saved skb and the
target simply passes it up, just as usual.
This behavior is controlled using the atn_count and the PNI field of the
digital device structure.
Signed-off-by: Thierry Escande <thierry.escande@collabora.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
When allocating chained I-PDUs, there is no need to call skb_reserve()
since it's already done by digital_alloc_skb() and contains enough room
for the driver head and tail data.
Signed-off-by: Thierry Escande <thierry.escande@collabora.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
This patch fixes the way an I-PDU is saved in case it needs to be sent
again. It is now copied using pskb_copy() and not simply referenced
using skb_get() since it could be modified by the driver.
digital_in_send_saved_skb() and digital_tg_send_saved_skb() still get a
reference on the saved skb which is re-sent but release it if the send
operation fails. That way the caller doesn't have to take care about skb
ref in case of error.
RTOX supervisor PDU must not be saved as this can override a previously
saved I-PDU that should be re-sent later on.
Signed-off-by: Thierry Escande <thierry.escande@collabora.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
In the prep work I did before enabling BH while handling socket backlog,
I missed two points in DCCP :
1) dccp_v4_ctl_send_reset() uses bh_lock_sock(), assuming BH were
blocked. It is not anymore always true.
2) dccp_v4_route_skb() was using __IP_INC_STATS() instead of
IP_INC_STATS()
A similar fix was done for TCP, in commit 47dcc20a39
("ipv4: tcp: ip_send_unicast_reply() is not BH safe")
Fixes: 7309f8821f ("dccp: do not assume DCCP code is non preemptible")
Fixes: 5413d1babe ("net: do not block BH while processing socket backlog")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
All inet6_netconf_notify_devconf() callers are in process context,
so we can use GFP_KERNEL allocations if we take care of not holding
a rwlock while not needed in ip6mr (we hold RTNL there)
Fixes: d67b8c616b ("netconf: advertise mc_forwarding status")
Fixes: f3a1bfb11c ("rtnl/ipv6: use netconf msg to advertise forwarding status")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
inet_forward_change() runs with RTNL held.
We are allowed to sleep if required.
If we use __in_dev_get_rtnl() instead of __in_dev_get_rcu(),
we no longer have to use GFP_ATOMIC allocations in
inet_netconf_notify_devconf(), meaning we are less likely to miss
notifications under memory pressure, and wont touch precious memory
reserves either and risk dropping incoming packets.
inet_netconf_get_devconf() can also use GFP_KERNEL allocation.
Fixes: edc9e74893 ("rtnl/ipv4: use netconf msg to advertise forwarding status")
Fixes: 9e5511106f ("rtnl/ipv4: add support of RTM_GETNETCONF")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
An important information for the napi_poll tracepoint is knowing
the work done (packets processed) by the napi_poll() call. Add
both the work done and budget, as they are related.
Handle trace_napi_poll() param change in dropwatch/drop_monitor
and in python perf script netdev-times.py in backward compat way,
as python fortunately supports optional parameter handling.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow MPLS routes on IPIP and SIT devices now that they
support forwarding MPLS packets.
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Dinan Gunawardena <dinan.gunawardena@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Extend the IPIP driver to support MPLS over IPv4. The implementation is an
extension of existing support for IPv4 over IPv4 and is based of multiple
inner-protocol support for the SIT driver.
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Dinan Gunawardena <dinan.gunawardena@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Extend the SIT driver to support MPLS over IPv4. This implementation
extends existing support for IPv6 over IPv4 and IPv4 over IPv4.
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Dinan Gunawardena <dinan.gunawardena@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Extend tunnel support to MPLS over IPv4. The implementation extends the
existing differentiation between IPIP and IPv6 over IPv4 to also cover MPLS
over IPv4.
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Dinan Gunawardena <dinan.gunawardena@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As was suggested this patch adds support for the different versions of MLD
and IGMP query types. Since the user visible structure is still in net-next
we can augment it instead of adding netlink attributes.
The distinction between the different IGMP/MLD query types is done as
suggested in Section 7.1, RFC 3376 [1] and Section 8.1, RFC 3810 [2] based
on query payload size and code for IGMP. Since all IGMP packets go through
multicast_rcv() and it uses ip_mc_check_igmp/ipv6_mc_check_mld we can be
sure that at least the ip/ipv6 header can be directly used.
[1] https://tools.ietf.org/html/rfc3376#section-7
[2] https://tools.ietf.org/html/rfc3810#section-8.1
Suggested-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The HCI_BREDR naming is confusing since it actually stands for Primary
Bluetooth Controller. Which is a term that has been used in the latest
standard. However from a legacy point of view there only really have
been Basic Rate (BR) and Enhanced Data Rate (EDR). Recent versions of
Bluetooth introduced Low Energy (LE) and made this terminology a little
bit confused since Dual Mode Controllers include BR/EDR and LE. To
simplify this the name HCI_PRIMARY stands for the Primary Controller
which can be a single mode or dual mode controller.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
The controller device attributes are not used and expose no valuable
information.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
The connection link attributes are not used and expose no valuable
information.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
When we introduced GSO support, if using auth the auth chunk was being
left queued on the packet even after the final segment was generated.
Later on sctp_transmit_packet it calls sctp_packet_reset, which zeroed
the packet len while not accounting for this left-over. This caused more
space to be used the next packet due to the chunk still being queued,
but space which wasn't allocated as its size wasn't accounted.
The fix is to only queue it back when we know that we are going to
generate another segment.
Fixes: 90017accff ("sctp: Add GSO support")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The routing table of every switch in a tree is currently initialized to
all zeros. This is an issue since 0 is a valid port number.
Add a DSA_RTABLE_NONE=-1 constant to initialize the signed values of the
routing table pointing to other switches.
This fixes the device mapping of the mv88e6xxx driver where the port
pointing to the switch itself and to non-existent switches was wrongly
configured to be 0. It is now set to the expected 0xf value.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIVAwUAV3zeiPSw1s6N8H32AQI9Iw//RxwhAzb+ovNIizPqcFRt4BIS3Or6unov
Bx9jtq+U1POWZWtlIHYZKDH8ndxMnC56NkEHmpIK3uEiqJ6BoQPpDdfN+hhREysV
Hoa+JOkMgufPBanU/JyKY/4vsmlvLuoOdppN1OA/kx1KECux9xJWIrvFsCUQGeat
nDsdWChHkZAm/GDPZiFvxEBVaxDe2dmnDMBFTst1RsrH2uICSqM4k5srmjc3NPAY
bsTqZeQGTIK1V9MggwBHxHFMvvERlGDpcrpoMRjeTzMmCpCg5endJoSu3hdNjHUO
o5Fi50dhLI5jo84DQiXL0wM4SLND0QQygl+QeU3zlJYtsQsF6WxPnIEGqlGr3+WV
I4wjDc5lxECyQIjCsrCo5ZwJ47Kqmm/ZQ4uGd9JooAVhqhP7/2dhFH0zXywJZzKs
zo+dWTF5Xvde+mlknm1RCTgkdx3msPH9EVkEoO4FOPOAg6lhIMQFFXLXZfGr9oX6
V99t+8t8YhDPTbL4AQnzh/aMHtbpM6be4TYiRRjT6iZLuvWOPW/zpp/1hmyeEkbU
KZNDunC2tH030Fx5toGi3b2i8M5SJdyex9Udg/YsNexpWmyHMS49PoGk9ZnRRPA+
xn9+xIVsqTh+xbiyCOPJqlQMMK9ayF7isT2N8T19qoVJxurdE/tMBdtKrJ5uTJFT
W0n8KV46a+4=
=1ZSd
-----END PGP SIGNATURE-----
Merge tag 'rxrpc-rewrite-20160706' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc: Improve conn/call lookup and fix call number generation [ver #3]
I've fixed a couple of patch descriptions and excised the patch that
duplicated the connections list for reconsideration at a later date.
For reference, the excised patch is sitting on the rxrpc-experimental
branch of my git tree, based on top of the rxrpc-rewrite branch. Diffing
it against yesterday's tag shows no differences.
Would you prefer the patch set to be emailed afresh instead of a git-pull
request?
David
---
Here's the next part of the AF_RXRPC rewrite. The two main purposes of
this set are to fix the call number handling and to make use of RCU when
looking up the connection or call to pass a received packet to.
Important changes in this set include:
(1) Avoidance of placing stack data into SG lists in rxkad so that kernel
stacks can become vmalloc'd (Herbert Xu).
(2) Calls cease pinning the connection they used as soon as possible,
which allows the connection to be discarded sooner and allows the call
channel on that connection to be reused earlier.
(3) Make each call channel on a connection have a separate and independent
call number space rather than having a shared number space for the
connection. Call numbers should increment monotonically per channel
on the client, and the server should ignore a call with a lower call
number for that channel than the latest it has seen. The RESPONSE
packet sets the minimum values of each call ID counter on a
connection.
(4) Look up calls by indexing the channel array on a connection rather
than by keeping calls in an rbtree on that connection. Also look up
calls using the channel array rather than using a hashtable.
The call hashtable can then be removed.
(5) Call terminal statuses are cached in the channel array for the last
call. It is assumed that if we the server have seen call N, then the
client no longer cares about call N-1 on the same channel.
This will allow retransmission of the terminal status in future
without the need to keep the rxrpc_call struct around.
(6) Peer lookups are moved out of common connection handling code and into
service connection handling code as client connections (a) must point
to a peer before they can be used and (b) are looked up by a
machine-unique connection ID directly, so we only need to look up the
peer first if we're going to deal with a service call.
(7) The reference count on a connection is held elevated by 1 whilst it is
alive (ie. idle unused connections have a refcount of 1). The reaper
will attempt to change the refcount from 1->0 and skip if this cannot
be done, whilst look ups only increment the refcount if it's non-zero.
This makes the implementation of RCU lookups easier as we don't have
to get a ref on the connection or a lock on the connection list to
prevent a connection being reaped whilst we're contemplating queueing
a packet that initiates a new service call upon it.
If we need to get a connection, but there's a dead connection in the
tree, we use rb_replace_node() to replace the dead one with a new one.
(8) Use a seqlock to validate the walk over the service connection rbtree
attached to a peer when it's being walked in RCU mode.
(9) Make the incoming call/connection packet handling code use RCU mode
and locks and make it only take a reference if the call/connection
gets queued on a workqueue.
The intention is that the next set will introduce the connection lifetime
management and capacity limits to prevent clients from overloading the
server.
There are some fixes too:
(1) Verifying that a packet coming in to a client connection came from the
expected source.
(2) Fix handling of connection failure in client call creation where we
don't reinitialise the list linkage block and a second attempt to
unlink the failed connection oopses and also we don't set the state
correctly, which causes an assertion failure.
(3) New service calls were being added to the socket's accept queue under
the wrong lock.
Changes:
(V2) In rxrpc_find_service_conn_rcu() initialised the sequence number to 0.
Fixed the RCU handling in conn_service.c by introducing and using
rb_replace_node_rcu() as an RCU-safe alternative in
rxrpc_publish_service_conn().
Modified and used rcu_dereference_raw() to avoid RCU sparse warnings
in rxrpc_find_service_conn_rcu().
Added in some missing RCU dereference wrappers. It seems to be
necessary to turn on CONFIG_PROVE_RCU_REPEATEDLY as well as
CONFIG_SPARSE_RCU_POINTER to get the static __rcu annotation checking
to happen.
Fixed some other sparse warnings, including a missing ntohs() in
jumbo packet processing.
(V3) Fixed some commit descriptions.
Excised the patch that duplicated the connection list to separate out
the procfs list for reconsideration at a later date.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
hfsc_sched is huge (size: 920, cachelines: 15), but we can get it to 14
cachelines by placing level after filter_cnt (covering 4 byte hole) and
reducing period/nactive/flags to u32 (period is just a counter,
incremented when class becomes active -- 2**32 is plenty for this
purpose, also, long is only 32bit wide on 32bit platforms anyway).
cl_vtperiod is exported to userspace via tc_hfsc_stats, but its period
member is already u32, so no precision is lost there either.
Cc: Michal Soltys <soltys@ziu.info>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to compute timeout.expires - jiffies, not the other way around.
Add a helper, another patch can then later change more places in
conntrack code where we currently open-code this.
Will allow us to only change one place later when we remove per-ct timer.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When receiving neighbour information with short address option field we
should check the complete range of invalid short addresses and set it to
one invalid address setting which is the unspecified address. This
address is also used when by creating at first a new neighbour entry to
indicate no short address is set.
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The RIOT-OS stack does send intra-pan frames but don't set the intra pan
flag inside the mac header. It seems this is valid frame addressing but
inefficient. Anyway this patch adds a new function for intra pan
addressing, doesn't matter if intra pan flag or source and destination
are the same. The newly introduction function will be used to check on
intra pan addressing for 6lowpan.
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
If recvmsg is called with a destination buffer that is too small to
receive the contents of skb in its entirety, the return value from
recvmsg was inconsistent with common SOCK_SEQPACKET or SOCK_DGRAM
semantics.
If destination buffer provided by userspace is too small (e.g. len <
copied), then MSG_TRUNC flag is set and copied is returned. Instead, it
should return the length of the message, which is consistent with how
other datagram based sockets act. Quoting 'man recv':
"All three calls return the length of the message on successful comple‐
tion. If a message is too long to fit in the supplied buffer, excess
bytes may be discarded depending on the type of socket the message is
received from."
and
"MSG_TRUNC (since Linux 2.2)
For raw (AF_PACKET), Internet datagram (since Linux
2.4.27/2.6.8), netlink (since Linux 2.6.22), and UNIX datagram
(since Linux 3.4) sockets: return the real length of the packet
or datagram, even when it was longer than the passed buffer."
Signed-off-by: Denis Kenzior <denkenz@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
If recvmsg is called with a destination buffer that is too small to
receive the contents of skb in its entirety, the return value from
recvmsg was inconsistent with common SOCK_SEQPACKET or SOCK_DGRAM
semantics.
If destination buffer provided by userspace is too small (e.g. len <
copied), then MSG_TRUNC flag is set and copied is returned. Instead, it
should return the length of the message, which is consistent with how
other datagram based sockets act. Quoting 'man recv':
"All three calls return the length of the message on successful comple‐
tion. If a message is too long to fit in the supplied buffer, excess
bytes may be discarded depending on the type of socket the message is
received from."
and
"MSG_TRUNC (since Linux 2.2)
For raw (AF_PACKET), Internet datagram (since Linux
2.4.27/2.6.8), netlink (since Linux 2.6.22), and UNIX datagram
(since Linux 3.4) sockets: return the real length of the packet
or datagram, even when it was longer than the passed buffer."
Signed-off-by: Denis Kenzior <denkenz@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch reverts commit f9d1ce8f81 ("ieee802154: fix netns settings").
The lowpan interface need to be created inside the net namespace where
the wpan interface is available. The wpan namespace can be changed only
by nl802154 before. Without this patch it's not possible to create a
lowpan interface for a wpan interface which isn't inside init_net
namespace.
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch adds netns support for 802.15.4 subsystem. Most parts are
copy&pasted from wireless subsystem, it has the identically userspace
API.
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch adds a missing check to handle short address parsing for
802.15.4 6LoWPAN only.
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch removes a double unlock case to accessing neighbour private
data.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
SMP does ECB crypto on stack buffers. This is complicated and
fragile, and it will not work if the stack is virtually allocated.
Switch to the crypto_cipher interface, which is simpler and safer.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Johan Hedberg <johan.hedberg@intel.com>
Tested-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Some users observed that "least connection" distribution algorithm doesn't
handle well bursts of TCP connections from reconnecting clients after
a node or network failure.
This is because the algorithm counts active connection as worth 256
inactive ones where for TCP, "active" only means TCP connections in
ESTABLISHED state. In case of a connection burst, new connections are
handled before previous ones have finished the three way handshaking so
that all are still counted as "inactive", i.e. cheap ones. The become
"active" quickly but at that time, all of them are already assigned to one
real server (or few), resulting in highly unbalanced distribution.
Address this by counting the "pre-established" states as "active".
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
When using HEAD from
https://git.kernel.org/cgit/utils/kernel/ipvsadm/ipvsadm.git/,
the command:
ipvsadm --start-daemon backup --mcast-interface eth0.60 \
--mcast-group ff02::1:81
fails with the error message:
Argument list too long
whereas both:
ipvsadm --start-daemon master --mcast-interface eth0.60 \
--mcast-group ff02::1:81
and:
ipvsadm --start-daemon backup --mcast-interface eth0.60 \
--mcast-group 224.0.0.81
are successful.
The error message "Argument list too long" isn't helpful. The error occurs
because an IPv6 address is given in backup mode.
The error is in make_receive_sock() in net/netfilter/ipvs/ip_vs_sync.c,
since it fails to set the interface on the address or the socket before
calling inet6_bind() (via sock->ops->bind), where the test
'if (!sk->sk_bound_dev_if)' failed.
Setting sock->sk->sk_bound_dev_if on the socket before calling
inet6_bind() resolves the issue.
Fixes: d33288172e ("ipvs: add more mcast parameters for the sync daemon")
Signed-off-by: Quentin Armitage <quentin@armitage.org.uk>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Pinned timers must carry the pinned attribute in the timer structure
itself, so convert the code to the new API.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094341.617891430@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
* beacon report (for radio measurement) support in cfg80211/mac80211
* hwsim: allow wmediumd in namespaces
* mac80211: extend 160MHz workaround to CSA IEs
* mesh: properly encrypt group-addressed privacy action frames
* mesh: allow setting peer AID
* first steps for MU-MIMO monitor mode
* along with various other cleanups and improvements
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJXfSCuAAoJEGt7eEactAAdaugP/ilrcELQRsIN5ZXCAKYZuwXV
T01JPwgOaWL9ILu7h1SfG/+j9kzMnyk4WmRpeoj2FGNcyfG2AvULWSLpQJ2abwgQ
8o/emuLinQwRENevaMUTRSOE0HkXoFPCbbq37+a2i6bAv1QSYY3A0xvWpcU5fZ4D
7CKYDYPBAdMXYwEwy1g4nYWfDAYqS4rthr3l3rS1Cy7Q2T1ZlMlD90GjD7oeQAEw
orKulhkkDSzvxfvZCYTzXmUoBQE8sNXGDD+OFsJyowyt+ugM/xan+2tmhCaHSnda
HpdCS2aRj779UBn9cOfELjffTNpS++PM6KFd8ZDaPcJSMginn/BAHTOeNfNUfL0Q
+Enu59I82qMDzbG2z1Qezzjv7OTzyEvyvYzNbLOqljTBSBklDa3rHwhyk+g1oVBH
+4xX1Vk5QBLde+Q0NS0gTkcqOQK8KT5+HEqiUfgLSNDETN0lSGsKbtvMfU/ikE1Y
aLRkTp7nzUd03qjIFLS6RMf7JjucWWzH1ZXTHvbpDFAG7riOhYRD3Sw+0e7madTd
+HXjH9dnOGnGPDL+FyDwtW6iclYwNjcIPQiNdOjwWfMA2Wmr7iq+aFUptRCwQTHB
WtgJ3f8OHax2JXcm4grYfxELZip5vbWJJHUC84Drvmzw3X7FRITf+OEWjdNOzsRD
Fc7w5ceThh9Id1BvcH2+
=R1qP
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-davem-2016-07-06' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
One more set of new features:
* beacon report (for radio measurement) support in cfg80211/mac80211
* hwsim: allow wmediumd in namespaces
* mac80211: extend 160MHz workaround to CSA IEs
* mesh: properly encrypt group-addressed privacy action frames
* mesh: allow setting peer AID
* first steps for MU-MIMO monitor mode
* along with various other cleanups and improvements
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/mellanox/mlx5/core/en.h
drivers/net/ethernet/mellanox/mlx5/core/en_main.c
drivers/net/usb/r8152.c
All three conflicts were overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next,
they are:
1) Don't use userspace datatypes in bridge netfilter code, from
Tobin Harding.
2) Iterate only once over the expectation table when removing the
helper module, instead of once per-netns, from Florian Westphal.
3) Extra sanitization in xt_hook_ops_alloc() to return error in case
we ever pass zero hooks, xt_hook_ops_alloc():
4) Handle NFPROTO_INET from the logging core infrastructure, from
Liping Zhang.
5) Autoload loggers when TRACE target is used from rules, this doesn't
change the behaviour in case the user already selected nfnetlink_log
as preferred way to print tracing logs, also from Liping Zhang.
6) Conntrack slabs with SLAB_HWCACHE_ALIGN to allow rearranging fields
by cache lines, increases the size of entries in 11% per entry.
From Florian Westphal.
7) Skip zone comparison if CONFIG_NF_CONNTRACK_ZONES=n, from Florian.
8) Remove useless defensive check in nf_logger_find_get() from Shivani
Bhardwaj.
9) Remove zone extension as place it in the conntrack object, this is
always include in the hashing and we expect more intensive use of
zones since containers are in place. Also from Florian Westphal.
10) Owner match now works from any namespace, from Eric Bierdeman.
11) Make sure we only reply with TCP reset to TCP traffic from
nf_reject_ipv4, patch from Liping Zhang.
12) Introduce --nflog-size to indicate amount of network packet bytes
that are copied to userspace via log message, from Vishwanath Pai.
This obsoletes --nflog-range that has never worked, it was designed
to achieve this but it has never worked.
13) Introduce generic macros for nf_tables object generation masks.
14) Use generation mask in table, chain and set objects in nf_tables.
This allows fixes interferences with ongoing preparation phase of
the commit protocol and object listings going on at the same time.
This update is introduced in three patches, one per object.
15) Check if the object is active in the next generation for element
deactivation in the rbtree implementation, given that deactivation
happens from the commit phase path we have to observe the future
status of the object.
16) Support for deletion of just added elements in the hash set type.
17) Allow to resize hashtable from /proc entry, not only from the
obscure /sys entry that maps to the module parameter, from Florian
Westphal.
18) Get rid of NFT_BASECHAIN_DISABLED, this code is not exercised
anymore since we tear down the ruleset whenever the netdevice
goes away.
19) Support for matching inverted set lookups, from Arturo Borrero.
20) Simplify the iptables_mangle_hook() by removing a superfluous
extra branch.
21) Introduce ether_addr_equal_masked() and use it from the netfilter
codebase, from Joe Perches.
22) Remove references to "Use netfilter MARK value as routing key"
from the Netfilter Kconfig description given that this toggle
doesn't exists already for 10 years, from Moritz Sichert.
23) Introduce generic NF_INVF() and use it from the xtables codebase,
from Joe Perches.
24) Setting logger to NONE via /proc was not working unless explicit
nul-termination was included in the string. This fixes seems to
leave the former behaviour there, so we don't break backward.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Speedy join only works when the received packet is either broadcast or an
4addr unicast packet. Thus packets converted from broadcast to unicast via
the gateway handling code have to be converted to 4addr packets to allow
the receiving gateway server to add the sender address as temporary entry
to the translation table.
Not doing it will make the batman-adv gateway server drop the DHCP response
in many situations because it doesn't yet have the TT entry for the
destination of the DHCP response.
Fixes: 371351731e ("batman-adv: change interface_rx to get orig node")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Acked-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Previously, mesh power management functionality works only with kernel
MPM. Because user space MPM did not report mesh peer AID to kernel,
the kernel could not identify the bit in TIM element. So this patch
adds mesh peer AID setting API.
Signed-off-by: Masashi Honma <masashi.honma@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Continuing the workaround implemented in commit 23665aaf91
("mac80211: Interoperability workaround for 80+80 and 160 MHz channels")
use the same code to parse the Wide Bandwidth Channel Switch element
by converting to VHT Operation element since the spec also just refers
to that for parsing semantics, particularly with the workaround.
While at it, remove some dead code - the IEEE80211_STA_DISABLE_40MHZ
flag can never be set at this point since it's checked earlier and the
wide_bw_chansw_ie pointer is set to NULL if it's set.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add the following to support beacon report radio measurement
with the measurement mode field set to passive or active:
1. Propagate the required scan duration to the device
2. Report the scan start time (in terms of TSF)
3. Report each BSS's detection time (also in terms of TSF)
TSF times refer to the BSS that the interface that requested the
scan is connected to.
Signed-off-by: Assaf Krauss <assaf.krauss@intel.com>
Signed-off-by: Avraham Stern <avraham.stern@intel.com>
[changed ath9k/10k, at76c59x-usb, iwlegacy, wl1251 and wlcore to match
the new API]
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Beacon report radio measurement requires reporting observed BSSs
on the channels specified in the beacon request. If the measurement
mode is set to passive or active, it requires actually performing a
scan (passive or active, accordingly), and reporting the time that
the scan was started and the time each beacon/probe was received
(both in terms of TSF of the BSS of the requesting AP). If the
request mode is table, this information is optional.
In addition, the radio measurement request specifies the channel
dwell time for the measurement.
In order to use scan for beacon report when the mode is active or
passive, add a parameter to scan request that specifies the
channel dwell time, and add scan start time and beacon received time
to scan results information.
Supporting beacon report is required for Multi Band Operation (MBO).
Signed-off-by: Assaf Krauss <assaf.krauss@intel.com>
Signed-off-by: David Spinadel <david.spinadel@intel.com>
Signed-off-by: Avraham Stern <avraham.stern@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
add API to support VHT MU-MIMO air sniffer.
in MU-MIMO there are parallel frames on the air while the HW
has only one RX.
add the capability to sniff one of the MU-MIMO parallel frames by
giving the sniffer additional information so it'll know which
of the parallel frames it shall follow.
Add attribute - NL80211_ATTR_MU_MIMO_GROUP_DATA - for getting
a MU-MIMO groupID in order to monitor packets from that group
using VHT MU-MIMO.
And add attribute -NL80211_ATTR_MU_MIMO_FOLLOW_ADDR - for passing
MAC address to monitor mode.
that option will be used by VHT MU-MIMO air sniffer to follow a
station according to it's MAC address using VHT MU-MIMO.
Signed-off-by: Aviya Erenfeld <aviya.erenfeld@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The current implementation of handling ADDBA Request while a session
is already active with the peer is wrong - in case the peer is using
the existing session's dialog token this should be treated as update
to the session, which can update the timeout value.
We don't really have a good way of supporting that, so reject, but
implement the required behaviour in the spec of "Even if the updated
ADDBA Request frame is not accepted, the original Block ACK setup
remains active." (802.11-2012 10.5.4)
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
No support for pbss results in a memory leak for the acl_data
(if parse_acl_data succeeds). Fix this by moving the ACL parsing later.
Cc: stable@vger.kernel.org
Fixes: 34d505193b ("cfg80211: basic support for PBSS network type")
Signed-off-by: Purushottam Kushwaha <pkushwah@qti.qualcomm.com>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
The call hash table is now no longer used as calls are looked up directly
by channel slot on the connection, so kill it off.
Signed-off-by: David Howells <dhowells@redhat.com>
Move to using RCU access to a peer's service connection tree when routing
an incoming packet. This is done using a seqlock to trigger retrying of
the tree walk if a change happened.
Further, we no longer get a ref on the connection looked up in the
data_ready handler unless we queue the connection's work item - and then
only if the refcount > 0.
Note that I'm avoiding the use of a hash table for service connections
because each service connection is addressed by a 62-bit number
(constructed from epoch and connection ID >> 2) that would allow the client
to engage in bucket stuffing, given knowledge of the hash algorithm.
Peers, however, are hashed as the network address is less controllable by
the client. The total number of peers will also be limited in a future
commit.
Signed-off-by: David Howells <dhowells@redhat.com>
Overhaul the usage count accounting for the rxrpc_connection struct to make
it easier to implement RCU access from the data_ready handler.
The problem is that currently we're using a lock to prevent the garbage
collector from trying to clean up a connection that we're contemplating
unidling. We could just stick incoming packets on the connection we find,
but we've then got a problem that we may race when dispatching a work item
to process it as we need to give that a ref to prevent the rxrpc_connection
struct from disappearing in the meantime.
Further, incoming packets may get discarded if attached to an
rxrpc_connection struct that is going away. Whilst this is not a total
disaster - the client will presumably resend - it would delay processing of
the call. This would affect the AFS client filesystem's service manager
operation.
To this end:
(1) We now maintain an extra count on the connection usage count whilst it
is on the connection list. This mean it is not in use when its
refcount is 1.
(2) When trying to reuse an old connection, we only increment the refcount
if it is greater than 0. If it is 0, we replace it in the tree with a
new candidate connection.
(3) Two connection flags are added to indicate whether or not a connection
is in the local's client connection tree (used by sendmsg) or the
peer's service connection tree (used by data_ready). This makes sure
that we don't try and remove a connection if it got replaced.
The flags are tested under lock with the removal operation to prevent
the reaper from killing the rxrpc_connection struct whilst someone
else is trying to effect a replacement.
This could probably be alleviated by using memory barriers between the
flag set/test and the rb_tree ops. The rb_tree op would still need to
be under the lock, however.
(4) When trying to reap an old connection, we try to flip the usage count
from 1 to 0. If it's not 1 at that point, then it must've come back
to life temporarily and we ignore it.
Signed-off-by: David Howells <dhowells@redhat.com>
Move the lookup of a peer from a call that's being accepted into the
function that creates a new incoming connection. This will allow us to
avoid incrementing the peer's usage count in some cases in future.
Note that I haven't bother to integrate rxrpc_get_addr_from_skb() with
rxrpc_extract_addr_from_skb() as I'm going to delete the former in the very
near future.
Signed-off-by: David Howells <dhowells@redhat.com>
Split the service-specific connection code out into into its own file. The
client-specific code has already been split out. This will leave just the
common code in the original file.
Signed-off-by: David Howells <dhowells@redhat.com>
Split the client-specific connection code out into its own file. It will
behave somewhat differently from the service-specific connection code, so
it makes sense to separate them.
Signed-off-by: David Howells <dhowells@redhat.com>
Each channel on a connection has a separate, independent number space from
which to allocate callNumber values. It is entirely possible, for example,
to have a connection with four active calls, each with call number 1.
Note that the callNumber values for any particular channel don't have to
start at 1, but they are supposed to increment monotonically for that
channel from a client's perspective and may not be reused once the call
number is transmitted (until the epoch cycles all the way back round).
Currently, however, call numbers are allocated on a per-connection basis
and, further, are held in an rb-tree. The rb-tree is redundant as the four
channel pointers in the rxrpc_connection struct are entirely capable of
pointing to all the calls currently in progress on a connection.
To this end, make the following changes:
(1) Handle call number allocation independently per channel.
(2) Get rid of the conn->calls rb-tree. This is overkill as a connection
may have a maximum of four calls in progress at any one time. Use the
pointers in the channels[] array instead, indexed by the channel
number from the packet.
(3) For each channel, save the result of the last call that was in
progress on that channel in conn->channels[] so that the final ACK or
ABORT packet can be replayed if necessary. Any call earlier than that
is just ignored. If we've seen the next call number in a packet, the
last one is most definitely defunct.
(4) When generating a RESPONSE packet for a connection, the call number
counter for each channel must be included in it.
(5) When parsing a RESPONSE packet for a connection, the call number
counters contained therein should be used to set the minimum expected
call numbers on each channel.
To do in future commits:
(1) Replay terminal packets based on the last call stored in
conn->channels[].
(2) Connections should be retired before the callNumber space on any
channel runs out.
(3) A server is expected to disregard or reject any new incoming call that
has a call number less than the current call number counter. The call
number counter for that channel must be advanced to the new call
number.
Note that the server cannot just require that the next call that it
sees on a channel be exactly the call number counter + 1 because then
there's a scenario that could cause a problem: The client transmits a
packet to initiate a connection, the network goes out, the server
sends an ACK (which gets lost), the client sends an ABORT (which also
gets lost); the network then reconnects, the client then reuses the
call number for the next call (it doesn't know the server already saw
the call number), but the server thinks it already has the first
packet of this call (it doesn't know that the client doesn't know that
it saw the call number the first time).
Signed-off-by: David Howells <dhowells@redhat.com>
The socket's accept queue (socket->acceptq) should be accessed under
socket->call_lock, not under the connection lock.
Signed-off-by: David Howells <dhowells@redhat.com>
Add RCU destruction for connections and calls as the RCU lookup from the
transport socket data_ready handler is going to come along shortly.
Whilst we're at it, move the cleanup workqueue flushing and RCU barrierage
into the destruction code for the objects that need it (locals and
connections) and add the extra RCU barrier required for connection cleanup.
Signed-off-by: David Howells <dhowells@redhat.com>
When a call is disconnected, clear the call's pointer to the connection and
release the associated ref on that connection. This means that the call no
longer pins the connection and the connection can be discarded even before
the call is.
As the code currently stands, the call struct is effectively pinned by
userspace until userspace has enacted a recvmsg() to retrieve the final
call state as sk_buffs on the receive queue pin the call to which they're
related because:
(1) The rxrpc_call struct contains the userspace ID that recvmsg() has to
include in the control message buffer to indicate which call is being
referred to. This ID must remain valid until the terminal packet is
completely read and must be invalidated immediately at that point as
userspace is entitled to immediately reuse it.
(2) The final ACK to the reply to a client call isn't sent until the last
data packet is entirely read (it's probably worth altering this in
future to be send the ACK as soon as all the data has been received).
This change requires a bit of rearrangement to make sure that the call
isn't going to try and access the connection again after protocol
completion:
(1) Delete the error link earlier when we're releasing the call. Possibly
network errors should be distributed via connections at the cost of
adding in an access to the rxrpc_connection struct.
(2) Remove the call from the connection's call tree before disconnecting
the call. The call tree needs to be removed anyway and incoming
packets delivered by channel pointer instead.
(3) The release call event should be considered last after all other
events have been processed so that we don't need access to the
connection again.
(4) Move the channel_lock taking from rxrpc_release_call() to
rxrpc_disconnect_call() where it will be required in future.
Signed-off-by: David Howells <dhowells@redhat.com>
If rxrpc_connect_call() fails during the creation of a client connection,
there are two bugs that we can hit that need fixing:
(1) The call state should be moved to RXRPC_CALL_DEAD before the call
cleanup phase is invoked. If not, this can cause an assertion failure
later.
(2) call->link should be reinitialised after being deleted in
rxrpc_new_client_call() - which otherwise leads to a failure later
when the call cleanup attempts to delete the link again.
Signed-off-by: David Howells <dhowells@redhat.com>
Rather than calling rxrpc_get_connection() manually before calling
rxrpc_queue_conn(), do it inside the queue wrapper.
This allows us to do some important fixes:
(1) If the usage count is 0, do nothing. This prevents connections from
being reanimated once they're dead.
(2) If rxrpc_queue_work() fails because the work item is already queued,
retract the usage count increment which would otherwise be lost.
(3) Don't take a ref on the connection in the work function. By passing
the ref through the work item, this is unnecessary. Doing it in the
work function is too late anyway. Previously, connection-directed
packets held a ref on the connection, but that's not really the best
idea.
And another useful changes:
(*) Don't need to take a refcount on the connection in the data_ready
handler unless we invoke the connection's work item. We're using RCU
there so that's otherwise redundant.
Signed-off-by: David Howells <dhowells@redhat.com>
Check that the client conns cache is empty before module removal and bug if
not, listing any offending connections that are still present. Unfortunately,
if there are connections still around, then the transport socket is still
unexpectedly open and active, so we can't just unallocate the connections.
Signed-off-by: David Howells <dhowells@redhat.com>
Turn the connection event and state #define lists into enums and move
outside of the struct definition.
Whilst we're at it, change _SERVER to _SERVICE in those identifiers and add
EV_ into the event name to distinguish them from flags and states.
Also add a symbol indicating the number of states and use that in the state
text array.
Signed-off-by: David Howells <dhowells@redhat.com>
Provide queueing helper functions so that the queueing of local and
connection objects can be fixed later.
The issue is that a ref on the object needs to be passed to the work queue,
but the act of queueing the object may fail because the object is already
queued. Testing the queuedness of an object before hand doesn't work
because there can be a race with someone else trying to queue it. What
will have to be done is to adjust the refcount depending on the result of
the queue operation.
Signed-off-by: David Howells <dhowells@redhat.com>
rxkad uses stack memory in SG lists which would not work if stacks were
allocated from vmalloc memory. In fact, in most cases this isn't even
necessary as the stack memory ends up getting copied over to kmalloc
memory.
This patch eliminates all the unnecessary stack memory uses by supplying
the final destination directly to the crypto API. In two instances where a
temporary buffer is actually needed we also switch use a scratch area in
the rxrpc_call struct (only one DATA packet will be being secured or
verified at a time).
Finally there is no need to split a split-page buffer into two SG entries
so code dealing with that has been removed.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
When looking up a client connection to which to route a packet, we need to
check that the packet came from the correct source so that a peer can't try
to muck around with another peer's connection.
Signed-off-by: David Howells <dhowells@redhat.com>
With this patch, the Digital Protocol layer abort the last issued
command when the dep link goes down. That way it does not have to wait
for the driver to reply with a timeout error before sending a new
command (i.e. a start poll command if constant polling is on).
Signed-off-by: Thierry Escande <thierry.escande@collabora.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
There is a flag in the command structure indicating that this command is
pending. It was checked before sending the command to not send the same
command twice but it was actually never set. This is now fixed.
Signed-off-by: Thierry Escande <thierry.escande@collabora.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
With this patch, when freeing the command queue in the module unregister
function, the callbacks of the commands still queued are called with a
ENODEV error. This gives a chance to the command issuer to free any
memory it could have allocate.
Signed-off-by: Thierry Escande <thierry.escande@collabora.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
The Digital Protocol stack used to send a NACK frame whatever the error
type it receives in digital_in_recv_dep_res(). It actually should only
send a NACK frame on CRC or parity check errors or on any transmission
error if a NACK frame was previously sent. Existing drivers used to send
EIO error for this kind of issues so this patch limits sending of NACK
frames on EIO errors. All other errors will be reported to the upper
layers.
Signed-off-by: Thierry Escande <thierry.escande@collabora.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
When configured as a target listening for a SENSF_REQ poll command, a
nfcid2 array was allocated for no reason leading to a memory leak. The
nfcid2 is sent by the target in the SENSF_RES reply.
Signed-off-by: Thierry Escande <thierry.escande@collabora.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Once copied into the sk_buff data area using llcp_add_tlv(), the
allocated TLVs must be freed.
With this patch nfc_llcp_send_connect() and nfc_llcp_send_cc() don't
return immediately on success and now free the allocated TLVs.
Signed-off-by: Thierry Escande <thierry.escande@collabora.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
In functions using llcp_add_tlv(), a skb pointer could be set to NULL
and then reuse afterward.
With this patch, the skb pointer returned by llcp_add_tlv() is ignored
since it can only be the passed skb pointer or NULL when the passed TLV
is NULL. There is also no need to check for the TLV pointer as this is
done by llcp_add_tlv().
Signed-off-by: Thierry Escande <thierry.escande@collabora.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
It was first reported and reproduced by Petr (thanks!) in
https://bugzilla.kernel.org/show_bug.cgi?id=119581
free_percpu(rt->rt6i_pcpu) used to always happen in ip6_dst_destroy().
However, after fixing a deadlock bug in
commit 9c7370a166 ("ipv6: Fix a potential deadlock when creating pcpu rt"),
free_percpu() is not called before setting non_pcpu_rt->rt6i_pcpu to NULL.
It is worth to note that rt6i_pcpu is protected by table->tb6_lock.
kmemleak somehow did not report it. We nailed it down by
observing the pcpu entries in /proc/vmallocinfo (first suggested
by Hannes, thanks!).
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Fixes: 9c7370a166 ("ipv6: Fix a potential deadlock when creating pcpu rt")
Reported-by: Petr Novopashenniy <pety@rusnet.ru>
Tested-by: Petr Novopashenniy <pety@rusnet.ru>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Petr Novopashenniy <pety@rusnet.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
dn_fib_count_nhs() could enter an infinite loop if nhp->rtnh_len == 0
(i.e. if userspace passes a malformed netlink message).
Let's use the helpers from net/nexthop.h which take care of all this
stuff. We can do exactly the same as e.g. fib_count_nexthops() and
fib_get_nhs() from net/ipv4/fib_semantics.c.
This fixes the softlockup for me.
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the data plane is offloaded the traffic doesn't go through the
networking stack. Therefore, after first resolving a neighbour the NUD
state machine will transition it from REACHABLE to STALE until it's
finally deleted by the garbage collector.
To prevent such situations the offloading driver should notify the NUD
state machine on any neighbours that were recently used. The driver's
polling interval should be set so that the NUD state machine can
function as if the traffic wasn't offloaded.
Currently, there are no in-tree drivers that can report confirmation for
a neighbour, but only 'used' indication. Therefore, the polling interval
should be set according to DELAY_FIRST_PROBE_TIME, as a neighbour will
transition from REACHABLE state to DELAY (instead of STALE) if "a packet
was sent within the last DELAY_FIRST_PROBE_TIME seconds" (RFC 4861).
Send a netevent whenever the DELAY_FIRST_PROBE_TIME changes - either via
netlink or sysctl - so that offloading drivers can correctly set their
polling interval.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
L2 upper device needs to propagate neigh_construct/destroy calls down to
lower devices. Do this by defining default ndo functions and use them in
team, bond, bridge and vlan.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As the following patch will allow upper devices to follow the call down
lower devices, we need to add dev here and not rely on n->dev.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is hard to unbind nf-logger:
echo NONE > /proc/sys/net/netfilter/nf_log/0
bash: echo: write error: No such file or directory
sysctl -w net.netfilter.nf_log.0=NONE
sysctl: setting key "net.netfilter.nf_log.0": No such file or directory
net.netfilter.nf_log.0 = NONE
You need explicitly send '\0', for instance like:
echo -e "NONE\0" > /proc/sys/net/netfilter/nf_log/0
That seem to be strange, so fix it using proc_dostring.
Now it works fine:
modprobe nfnetlink_log
echo nfnetlink_log > /proc/sys/net/netfilter/nf_log/0
cat /proc/sys/net/netfilter/nf_log/0
nfnetlink_log
echo NONE > /proc/sys/net/netfilter/nf_log/0
cat /proc/sys/net/netfilter/nf_log/0
NONE
v2: add missed error check for proc_dostring
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The orig_ifinfo reference counter for last_bonding_candidate in
batadv_orig_node has to be reduced when an originator node is released.
Otherwise the orig_ifinfo is leaked and the reference counter the netdevice
is not reduced correctly.
Fixes: f3b3d90189 ("batman-adv: add bonding again")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The replacement of last_bonding_candidate in batadv_orig_node has to be an
atomic operation. Otherwise it is possible that the reference counter of a
batadv_orig_ifinfo is reduced which was no longer the
last_bonding_candidate when the new candidate is added. This can either
lead to an invalid memory access or to reference leaks which make it
impossible to an interface which was added to batman-adv.
Fixes: f3b3d90189 ("batman-adv: add bonding again")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The pointer batadv_bla_claim::backbone_gw can be changed at any time.
Therefore, access to it must be protected to ensure that two function
accessing the same backbone_gw are actually accessing the same. This is
especially important when the crc_lock is used or when the backbone_gw of a
claim is exchanged.
Not doing so leads to invalid memory access and/or reference leaks.
Fixes: 23721387c4 ("batman-adv: add basic bridge loop avoidance code")
Fixes: 5a1dd8a477 ("batman-adv: lock crc access in bridge loop avoidance")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
batadv_orig_node_new uses batadv_orig_node_vlan_new to allocate a new
batadv_orig_node_vlan and add it to batadv_orig_node::vlan_list. References
to this list have also to be cleaned when the batadv_orig_node is removed.
Fixes: 7ea7b4a142 ("batman-adv: make the TT CRC logic VLAN specific")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
vlan_insert_tag can return NULL on errors. The distributed arp table code
therefore has to check the return value of vlan_insert_tag for NULL before
it can safely operate on this pointer.
Fixes: be1db4f661 ("batman-adv: make the Distributed ARP Table vlan aware")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
vlan_insert_tag can return NULL on errors. The bridge loop avoidance code
therefore has to check the return value of vlan_insert_tag for NULL before
it can safely operate on this pointer.
Fixes: 23721387c4 ("batman-adv: add basic bridge loop avoidance code")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
- Cleanup work by Markus Pargmann and Sven Eckelmann (six patches)
- Initial Netlink support by Matthias Schiffer (two patches)
- Throughput Meter implementation by Antonio Quartulli, a kernel-space
traffic generator to estimate link speeds. This feature is useful on
low-end WiFi APs where running iperf or netperf from userspace
gives wrong results due to heavy userspace/kernelspace overhead.
(two patches)
- API clean-up work by Antonio Quartulli (one patch)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCgAGBQJXel3cAAoJEKEr45hCkp6hxs0P/jQZBJ37Bd4EHRGdhvCJWwsO
j79zr7mIECub8a6PMkO1GI87ksJNtRdRw7XAIbLKTwsKEsUE0Gpv/MLLKgv/nD7X
zatcoI4DujkgSojZIcOV/061+M9FAnCtAYv13jIS8nbXdqfGPxPfLua6Zbvx1GS2
z/Rqg/WbB2qDtDlUrp0W/8oXQ+k6062G7GigroPLmjdWd5lF0H6ly4loWsxFyr0U
GVl44HM4nOj7DwkVlrGoOXnAbjpz9TNC/aA5TIS/tLFZkm5dvJjjKLDbxo5NM9aq
hRhFy8Gbe0TmOxV3mKZUT1oHuaHgFDY2tADLiLF2g/ijgaTetXCBJ6DXQ7BkiZnh
+t1fnutOB1D05+cZGDmlfb2bFXO6vdDwNzKYuPdeW0tUOVwzNIaMK+US1HffUD3F
ciK/cALsLbfJ3QkUHJclE57baMuB2c7YWJUxGdp2r4lKHak6tc8+BsornI6lB6qY
kcxip6EEaT7edjT66Qjq8GtFK7WIri5nHI9n5Js+Wwl1QENvkLmZRQ6qZexwSplS
RTZmmO+i+Y4rGDa3xoVSlC+CEUO7D4VwhET2Jir7KJrVS+pFNRAmfpUNWxeauAls
D1xWgBrWjjOYu5i3LjwC6cHl4eTWxBwWmBUaxLUUeyoR22ulIs6bXCQMWOLMbupd
q8k2B5BW9waTAgb4Tam9
=PFHu
-----END PGP SIGNATURE-----
Merge tag 'batadv-next-for-davem-20160704' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
This feature patchset includes the following changes:
- Cleanup work by Markus Pargmann and Sven Eckelmann (six patches)
- Initial Netlink support by Matthias Schiffer (two patches)
- Throughput Meter implementation by Antonio Quartulli, a kernel-space
traffic generator to estimate link speeds. This feature is useful on
low-end WiFi APs where running iperf or netperf from userspace
gives wrong results due to heavy userspace/kernelspace overhead.
(two patches)
- API clean-up work by Antonio Quartulli (one patch)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add functions that iterate over lower devices and find port device.
As a dependency add netdev_for_each_all_lower_dev and
netdev_for_each_all_lower_dev_rcu macro with
netdev_all_lower_get_next and netdev_all_lower_get_next_rcu shelpers.
Also, add functions to return mlxsw struct according to lower device
found and mlxsw_port struct with a reference to lower device.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If register_pernet_subsys() fails, we shouldn't try to call
unregister_pernet_subsys().
Fixes: 467fa15356 ("RDS-TCP: Support multiple RDS-TCP listen endpoints, one per netns.")
Cc: stable@vger.kernel.org
Cc: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If skb_clear_hash() was invoked due to mangling of relevant headers and
BPF program needs skb->hash later on, we can add a helper to trigger hash
recalculation via bpf_get_hash_recalc().
The helper will return the newly retrieved hash directly, but later access
can also be done via skb context again through skb->hash directly (inline)
without needing to call the helper once more.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add another xmit_mode to pktgen to allow testing xmit functionality
of qdiscs. The new mode "queue_xmit" injects packets at
__dev_queue_xmit() so that qdisc is called.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Extremely useful for setting packet type to host so i dont
have to modify the dst mac address using pedit (which requires
that i know the mac address)
Example usage:
tc filter add dev eth0 parent ffff: protocol ip pref 9 u32 \
match ip src 5.5.5.5/32 \
flowid 1:5 action skbedit ptype host
This will tag all packets incoming from 5.5.5.5 with type
PACKET_HOST
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The routing API data structure contains several function
pointers that can easily be grouped together based on the
component they work with.
Split the API in subobjects in order to improve definition readability.
At the same time, remove the "bat_" prefix from the API object and
its fields names. These are batman-adv private structs and there is no
need to always prepend such prefix, which only makes function invocations
much much longer.
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Reviewed-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The throughput meter module is a simple, kernel-space replacement for
throughtput measurements tool like iperf and netperf. It is intended to
approximate TCP behaviour.
It is invoked through batctl: the protocol is connection oriented, with
cumulative acknowledgment and a dynamic-size sliding window.
The test *can* be interrupted by batctl. A receiver side timeout avoids
unlimited waitings for sender packets: after one second of inactivity, the
receiver abort the ongoing test.
Based on a prototype from Edo Monticelli <montik@autistici.org>
Signed-off-by: Antonio Quartulli <antonio.quartulli@open-mesh.com>
Signed-off-by: Sven Eckelmann <sven.eckelmann@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Return the proper netdev TX status along the TX path so that the tp_meter
can understand when the queue is full and should stop sending packets.
Signed-off-by: Antonio Quartulli <antonio.quartulli@open-mesh.com>
Signed-off-by: Sven Eckelmann <sven.eckelmann@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
BATADV_CMD_GET_MESH_INFO is used to query basic information about a
batman-adv softif (name, index and MAC address for both the softif and
the primary hardif; routing algorithm; batman-adv version).
Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
[sven.eckelmann@open-mesh.com: Reduce the number of changes to
BATADV_CMD_GET_MESH_INFO, add missing kerneldoc, add policy for attributes]
Signed-off-by: Sven Eckelmann <sven.eckelmann@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
debugfs is currently severely broken virtually everywhere in the kernel
where files are dynamically added and removed (see
http://lkml.iu.edu/hypermail/linux/kernel/1506.1/02196.html for some
details). In addition to that, debugfs is not namespace-aware.
Instead of adding new debugfs entries, the whole infrastructure should be
moved to netlink. This will fix the long standing problem of large buffers
for debug tables and hard to parse text files.
Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
[sven.eckelmann@open-mesh.com: Strip down patch to only add genl family,
add missing kerneldoc]
Signed-off-by: Sven Eckelmann <sven.eckelmann@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
LLCP skb tx and rx functions now use print_hex_dump_debug() making
these verbose traces controllable using dynamic debug.
Signed-off-by: Thierry Escande <thierry.escande@collabora.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
This replaces the polling work struct with a delayed work struct and add
a 10 ms delay between 2 poll cycles. This avoids to flood the device
with 'switch off'/'switch on' commands.
Signed-off-by: Thierry Escande <thierry.escande@collabora.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
It used to be EXPORTed, but then EXPORT usage was cleaned up
(in 2012), without noticing that the function has no users at all
(and curiously, never had any users).
Delete it.
While at it, remove non-static "inline" hints on nearby functions:
these hints don't work across compilation units anyway,
and these functions are not used in their .c file, thus they are
never inlined. IOW: "inline" here does not help in any way.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Samuel Ortiz <sameo@linux.intel.com>
CC: Christophe Ricard <christophe.ricard@gmail.com>
CC: linux-wireless@vger.kernel.org
CC: linux-kernel@vger.kernel.org
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
netfilter uses multiple FWINV #defines with identical form that hide a
specific structure variable and dereference it with a invflags member.
$ git grep "#define FWINV"
include/linux/netfilter_bridge/ebtables.h:#define FWINV(bool,invflg) ((bool) ^ !!(info->invflags & invflg))
net/bridge/netfilter/ebtables.c:#define FWINV2(bool, invflg) ((bool) ^ !!(e->invflags & invflg))
net/ipv4/netfilter/arp_tables.c:#define FWINV(bool, invflg) ((bool) ^ !!(arpinfo->invflags & (invflg)))
net/ipv4/netfilter/ip_tables.c:#define FWINV(bool, invflg) ((bool) ^ !!(ipinfo->invflags & (invflg)))
net/ipv6/netfilter/ip6_tables.c:#define FWINV(bool, invflg) ((bool) ^ !!(ip6info->invflags & (invflg)))
net/netfilter/xt_tcpudp.c:#define FWINVTCP(bool, invflg) ((bool) ^ !!(tcpinfo->invflags & (invflg)))
Consolidate these macros into a single NF_INVF macro.
Miscellanea:
o Neaten the alignment around these uses
o A few lines are > 80 columns for intelligibility
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add the commands to set and show the mode of SRIOV E-Switch, two modes
are supported:
* legacy: operating in the "old" L2 based mode (DMAC --> VF vport)
* switchdev: the E-Switch is referred to as whitebox switch configured
using standard tools such as tc, bridge, openvswitch etc. To allow
working with the tools, for each VF, a VF representor netdevice is
created by the E-Switch manager vendor device driver instance (e.g PF).
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- two patches with minimal clean up work by Antonio Quartulli and
Simon Wunderlich
- eight patches of B.A.T.M.A.N. V, API and documentation clean
up work, by Antonio Quartulli and Marek Lindner
- Andrew Lunn fixed the skb priority adoption when forwarding
fragmented packets (two patches)
- Multicast optimization support is now enabled for bridges which
comes with some protocol updates, by Linus Luessing
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCgAGBQJXdmhpAAoJEKEr45hCkp6h/R0P/1K0mZjs1lk15j1oc0EeE/lJ
a47nwLQiAj9O790SUhQuonUYtbw5jhxZq5P1zYvg44ngRdUhsH9yiwr+Yado40CW
5ek+EdfGfkwNThGG++knVrbhLPrGxSC9Q2qJCApergt4wViWvvovSJOZsKVcanei
Qv9uGn6TIhZW3FFhvYk6/xgseZhjRISgxPkO1N20tMcC3f0w4YgM5QcCPGG2KB3N
CYq1qryyl4Jf6NeNap/lXiTo6JA5qOHYe68ziotJTtlsrsYQ3WitJvuKO+bWuQIv
zOU/jQ7qUwuabLT5TnzZKxQJvhrqfA5V20OM3yD4lnhdgvqVsHgHoIRy6RpN4U8M
rFlgROZvm+k0ATnL8AcUtIY7EA/EA0ifHN2fdTAfQ0XNc0VxTXEWB4qTTBJu3+se
N0+QyIjpXzgHqKxjIpr8Sm3tBO/ANCui/gWl5SToGXis3xDsRivvTMWNSNYjgDcP
WdyLtx9h7RLNOh64Idwsq4yDHt+/P86z7xJQdlkmrshkjqL/HNgS93U5CeAC3mN0
S6N5PgZG07EYnGxzxDid+6x1UP1VA7dyqHJpxpYY7qbw+/aDVlq5XH/vI/9Lbq5i
vu/54L8cVG5nBe54SZBeUib5W7Wkgf3POWzt+rrRwbHY+gAE1zUPQNNzgDtLHH0N
K/XJwdcoGQzA5LEynGE7
=js/J
-----END PGP SIGNATURE-----
Merge tag 'batadv-next-for-davem-20160701' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
This feature patchset includes the following changes:
- two patches with minimal clean up work by Antonio Quartulli and
Simon Wunderlich
- eight patches of B.A.T.M.A.N. V, API and documentation clean
up work, by Antonio Quartulli and Marek Lindner
- Andrew Lunn fixed the skb priority adoption when forwarding
fragmented packets (two patches)
- Multicast optimization support is now enabled for bridges which
comes with some protocol updates, by Linus Luessing
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix incorrect use of nla_strlcpy() where the first NLA_HDRLEN bytes
of the link name where left out.
Making the output of tipc-config -ls look something like:
Link statistics:
dcast-link
1:data0-1.1.2:data0
1:data0-1.1.3:data0
Also, for the record, the patch that introduce this regression
claims "Sending the whole object out can cause a leak". Which isn't
very likely as this is a compat layer, where the data we are parsing
is generated by us and we know the string to be NULL terminated. But
you can of course never be to secure.
Fixes: 5d2be1422e (tipc: fix an infoleak in tipc_nl_compat_link_dump)
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RDS ping messages are sent with a non-zero src port to a zero
dst port, so that the rds pong messages can be sent back to the
originators src port. However if a confused/malicious sender
sends a ping with a 0 src port, we'd have an infinite ping-pong
loop. To avoid this, the receiver should ignore ping messages
with a 0 src port.
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When reconnecting, the peer with the smaller IP address will initiate
the reconnect, to avoid needless duelling SYN issues.
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds ->conn_path_connect callbacks in the rds_transport
that are used to set up a single connection path.
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ->sk_user_data contains a pointer to the rds_conn_path
for the socket. Use this consistently in the rds_tcp_data_ready
callbacks to get the rds_conn_path for rds_recv_incoming.
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The socket callbacks should all operate on a struct rds_conn_path,
in preparation for a MP capable RDS-TCP.
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A single rds_connection may have multiple rds_conn_paths that have
to be carefully and correctly destroyed, for both rmmod and
netns-delete cases.
For both cases, we extract a single rds_tcp_connection for
each conn into a temporary list, and then invoke rds_conn_destroy()
which iteratively dismantles every path in the rds_connection.
For the netns deletion case, we additionally have to make sure
that we do not leave a socket in TIME_WAIT state, as this will
hold up the netns deletion. Thus we call rds_tcp_conn_paths_destroy()
to reset state quickly.
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The struct rds_tcp_connection is the transport-specific private
data structure that tracks TCP information per rds_conn_path.
Modify this structure to have a back-pointer to the rds_conn_path
for which it is the ->cp_transport_data.
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The c_passive bit is only intended for the IB transport and will
never be encountered in rds-tcp, so remove the dead logic that
predicates on this bit.
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor code to avoid separate indirections for single-path
and multipath transports. All transports (both single and mp-capable)
will get a pointer to the rds_conn_path, and can trivially derive
the rds_connection from the ->cp_conn.
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adds a bpf helper, bpf_skb_in_cgroup, to decide if a skb->sk
belongs to a descendant of a cgroup2. It is similar to the
feature added in netfilter:
commit c38c4597e4 ("netfilter: implement xt_cgroup cgroup2 path match")
The user is expected to populate a BPF_MAP_TYPE_CGROUP_ARRAY
which will be used by the bpf_skb_in_cgroup.
Modifications to the bpf verifier is to ensure BPF_MAP_TYPE_CGROUP_ARRAY
and bpf_skb_in_cgroup() are always used together.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Alexei Starovoitov <ast@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to commit 9b368814b3 ("net: fix bridge multicast packet checksum validation")
we need to fixup the checksum for CHECKSUM_COMPLETE when
pushing skb on RX path. Otherwise we get similar splats.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
People who use PACKET_FANOUT_HASH want a symmetric hash, meaning that
they want packets going in both directions on a flow to hash to the
same bucket.
The core kernel SKB hash became non-symmetric when the ipv6 flow label
and other entities were incorporated into the standard flow hash order
to increase entropy.
But there are no users of PACKET_FANOUT_HASH who want an assymetric
hash, they all want a symmetric one.
Therefore, use the flow dissector to compute a flat symmetric hash
over only the protocol, addresses and ports. This hash does not get
installed into and override the normal skb hash, so this change has
no effect whatsoever on the rest of the stack.
Reported-by: Eric Leblond <eric@regit.org>
Tested-by: Eric Leblond <eric@regit.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since bpf_prog_get() and program type check is used in a couple of places,
refactor this into a small helper function that we can make use of. Since
the non RO prog->aux part is not used in performance critical paths and a
program destruction via RCU is rather very unlikley when doing the put, we
shouldn't have an issue just doing the bpf_prog_get() + prog->type != type
check, but actually not taking the ref at all (due to being in fdget() /
fdput() section of the bpf fd) is even cleaner and makes the diff smaller
as well, so just go for that. Callsites are changed to make use of the new
helper where possible.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This option was removed in commit 47dcf0cb10 ("[NET]: Rethink mark field
in struct flowi").
Signed-off-by: Moritz Sichert <moritz+linux@sichert.me>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
There are code duplications of a masked ethernet address comparison here
so make it a separate function instead.
Miscellanea:
o Neaten alignment of FWINV macro uses to make it clearer for the reader
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
No need for a special case to handle NF_INET_POST_ROUTING, this is
basically the same handling as for prerouting, input, forward.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Can overflow so we might allocate very small table when bucket count is
high on a 32bit platform.
Note: resize is only possible from init_netns.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch introduces a new event - NETDEV_CHANGE_TX_QUEUE_LEN, this
will be triggered when tx_queue_len. It could be used by net device
who want to do some processing at that time. An example is tun who may
want to resize tx array when tx_queue_len is changed.
Cc: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
cl->cl_vt alone is relative only to the current backlog period, while
the curve operates on cumulative virtual time. This patch adds missing
cl->cl_vtoff.
Signed-off-by: Michal Soltys <soltys@ziu.info>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a class is going passive, it should update its cl_vt first
to be consistent with the last dequeue operation.
Otherwise its cl_vt will be one packet behind and parent's cvtmax might
not be updated as well.
One possible side effect is if some class goes passive and subsequently
goes active /without/ its parent going passive - with cl_vt lagging one
packet behind - comparison made in init_vf() will be affected (same
period).
Signed-off-by: Michal Soltys <soltys@ziu.info>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is update to:
commit a09ceb0e08 ("sched: remove qdisc->drop")
That commit removed qdisc->drop, but left alone dlist and droplist
that no longer serve any meaningful purpose.
Signed-off-by: Michal Soltys <soltys@ziu.info>
Signed-off-by: David S. Miller <davem@davemloft.net>
The condition can only succeed on wrong configurations.
Signed-off-by: Michal Soltys <soltys@ziu.info>
Signed-off-by: David S. Miller <davem@davemloft.net>
Realtime scheduling implemented in HFSC uses head of the queue to make
the decision about which packet to schedule next. But in case of any
head drop, the deadline calculated for the previous head is not
necessarily correct for the next head (unless both packets have the same
length).
Thanks to peek() function used during dequeue - which internally is a
dequeue operation - hfsc is almost safe from this issue, as peek()
dequeues and isolates the head storing it temporarily until the real
dequeue happens.
But there is one exception: if after the class activation a drop happens
before the first dequeue operation, there's never a chance to do the
peek().
Adding peek() call in enqueue - if this is the first packet in a new
backlog period AND the scheduler has realtime curve defined - fixes that
one corner case. The 1st hfsc_dequeue() will use that peeked packet,
similarly as every subsequent hfsc_dequeue() call uses packet peeked by
the previous call.
Signed-off-by: Michal Soltys <soltys@ziu.info>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some arches have virtually mapped kernel stacks, or will soon have.
tcp_md5_hash_header() uses an automatic variable to copy tcp header
before mangling th->check and calling crypto function, which might
be problematic on such arches.
David says that using percpu storage is also problematic on non SMP
builds.
Just use kmalloc() to allocate scratch areas.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a jumbo packet is being split up and processed, the crypto checksum
for each split-out packet is in the jumbo header and needs placing in the
reconstructed packet header.
When the code was changed to keep the stored copy of the packet header in
host byte order, this reconstruction was missed.
Found with sparse with CF=-D__CHECK_ENDIAN__:
../net/rxrpc/input.c:479:33: warning: incorrect type in assignment (different base types)
../net/rxrpc/input.c:479:33: expected unsigned short [unsigned] [usertype] _rsvd
../net/rxrpc/input.c:479:33: got restricted __be16 [addressable] [usertype] _rsvd
Fixes: 0d12f8a402 ("rxrpc: Keep the skb private record of the Rx header in host byte order")
Signed-off-by: David Howells <dhowells@redhat.com>
ip_skb_dst_mtu uses skb->sk, assuming it is an AF_INET socket (e.g. it
calls ip_sk_use_pmtu which casts sk as an inet_sk).
However, in the case of UDP tunneling, the skb->sk is not necessarily an
inet socket (could be AF_PACKET socket, or AF_UNSPEC if arriving from
tun/tap).
OTOH, the sk passed as an argument throughout IP stack's output path is
the one which is of PMTU interest:
- In case of local sockets, sk is same as skb->sk;
- In case of a udp tunnel, sk is the tunneling socket.
Fix, by passing ip_finish_output's sk to ip_skb_dst_mtu.
This augments 7026b1ddb6 'netfilter: Pass socket pointer down through okfn().'
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When adding rule with NLM_F_EXCL flag then check if the same rule exist.
If yes then exit with -EEXIST.
This is already implemented in iproute2:
if (cmd == RTM_NEWRULE) {
req.n.nlmsg_flags |= NLM_F_CREATE|NLM_F_EXCL;
req.r.rtm_type = RTN_UNICAST;
}
Tested ipv4 and ipv6 with net-next linux on qemu x86
expected behavior after patch:
localhost ~ # ip rule
0: from all lookup local
32766: from all lookup main
32767: from all lookup default
localhost ~ # ip rule add from 10.46.177.97 lookup 104 pref 1005
localhost ~ # ip rule add from 10.46.177.97 lookup 104 pref 1005
RTNETLINK answers: File exists
localhost ~ # ip rule
0: from all lookup local
1005: from 10.46.177.97 lookup 104
32766: from all lookup main
32767: from all lookup default
There was already topic regarding this but I don't see any changes
merged and problem still occurs.
https://lkml.kernel.org/r/1135778809.5944.7.camel+%28%29+localhost+%21+localdomain
Signed-off-by: Mateusz Bajorski <mateusz.bajorski@nokia.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We found that sometimes a restored tcp socket doesn't work.
A reason of this bug is incorrect window parameters and in this case
tcp_acceptable_seq() returns tcp_wnd_end(tp) instead of tp->snd_nxt. The
other side drops packets with this seq, because seq is less than
tp->rcv_nxt ( tcp_sequence() ).
Data from a send queue is sent only if there is enough space in a
window, so when we restore unacked data, we need to expand a window to
fit this data.
This was in a first version of this patch:
"tcp: extend window to fit all restored unacked data in a send queue"
Then Alexey recommended me to restore window parameters instead of
adjusted them according with data in a sent queue. This sounds resonable.
rcv_wnd has to be restored, because it was reported to another side
and the offered window is never shrunk.
One of reasons why we need to restore snd_wnd was described above.
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds stats support for the currently used IGMP/MLD types by the
bridge. The stats are per-port (plus one stat per-bridge) and per-direction
(RX/TX). The stats are exported via netlink via the new linkxstats API
(RTM_GETSTATS). In order to minimize the performance impact, a new option
is used to enable/disable the stats - multicast_stats_enabled, similar to
the recent vlan stats. Also in order to avoid multiple IGMP/MLD type
lookups and checks, we make use of the current "igmp" member of the bridge
private skb->cb region to record the type on Rx (both host-generated and
external packets pass by multicast_rcv()). We can do that since the igmp
member was used as a boolean and all the valid IGMP/MLD types are positive
values. The normal bridge fast-path is not affected at all, the only
affected paths are the flooding ones and since we make use of the IGMP/MLD
type, we can quickly determine if the packet should be counted using
cache-hot data (cb's igmp member). We add counters for:
* IGMP Queries
* IGMP Leaves
* IGMP v1/v2/v3 reports
* MLD Queries
* MLD Leaves
* MLD v1/v2 reports
These are invaluable when monitoring or debugging complex multicast setups
with bridges.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for the IFLA_STATS_LINK_XSTATS_SLAVE attribute
which allows to export per-slave statistics if the master device supports
the linkxstats callback. The attribute is passed down to the linkxstats
callback and it is up to the callback user to use it (an example has been
added to the only current user - the bridge). This allows us to query only
specific slaves of master devices like bridge ports and export only what
we're interested in instead of having to dump all ports and searching only
for a single one. This will be used to export per-port IGMP/MLD stats and
also per-port vlan stats in the future, possibly other statistics as well.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We've accumulated a couple of different fixes now to mesh_sta_cleanup()
due to the different paths that user_mpm and !user_mpm cases take -- one
fix to flush nexthop paths and one to fix the counting.
The only caller of mesh_plink_deactivate() is mesh_sta_cleanup(), so we
can push the user_mpm checks down into there in order to share more
code.
In doing so, we can remove an extra call to mesh_path_flush_by_nexthop()
and the (unnecessary) call to mesh_accept_plinks_update(). This will
also ensure the powersaving state code gets called in the user_mpm case.
The only cleanup tasks we need to avoid when MPM is in user-space
are sending the peering frames and stopping the plink timer, so wrap
those in the appropriate check.
Signed-off-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Previously, the action frames to group address was not encrypted. But
[1] "Table 8-38 Category values" indicates "Mesh" and "Multihop" category
action frames should be encrypted (Group addressed privacy == yes). And the
encyption key should be MGTK ([1] 10.13 Group addressed robust management frame
procedures). So this patch modifies the code to make it suitable for spec.
[1] IEEE Std 802.11-2012
Signed-off-by: Masashi Honma <masashi.honma@gmail.com>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
We normally return an uninitialized value, but no one checks it so it
doesn't matter. Anyway, let's silence the static checker warning.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
When building a kernel with W=1, the nl80211.c file causes a number of
warnings, all about the same problem:
net/wireless/nl80211.c: In function 'nl80211_parse_mesh_config':
net/wireless/nl80211.c:5287:103: error: comparison is always false due to limited range of data type [-Werror=type-limits]
net/wireless/nl80211.c:5290:96: error: comparison is always false due to limited range of data type [-Werror=type-limits]
net/wireless/nl80211.c:5293:124: error: comparison is always false due to limited range of data type [-Werror=type-limits]
net/wireless/nl80211.c:5295:148: error: comparison is always false due to limited range of data type [-Werror=type-limits]
net/wireless/nl80211.c:5298:106: error: comparison is always false due to limited range of data type [-Werror=type-limits]
net/wireless/nl80211.c:5305:116: error: comparison is always false due to limited range of data type [-Werror=type-limits]
The problem is that gcc does not notice that the check is generate
by a macro, so it complains about comparing an unsigned type against 0.
I've tried to come up with a way to rephrase that code in a way that
avoids the warnings and otherwise improves the code as well.
This uses a set of new helper functions that perform the range checking,
and should provide slightly better type safety than the older patch,
at the expense of adding 44 lines to the code. Binary code size is
basically unchanged though (20 bytes added to 126561 bytes .text).
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
This work adds a helper for changing skb->pkt_type in a controlled way.
We only allow a subset of possible values and can extend that in future
should other use cases come up. Doing this as a helper has the advantage
that errors can be handeled gracefully and thus helper kept extensible.
It's a write counterpart to pkt_type member we can already read from
struct __sk_buff context. Major use case is to change incoming skbs to
PACKET_HOST in a programmatic way instead of having to recirculate via
redirect(..., BPF_F_INGRESS), for example.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a minimal helper for doing the groundwork of changing
the skb->protocol in a controlled way. Currently supported is v4 to
v6 and vice versa transitions, which allows f.e. for a minimal, static
nat64 implementation where applications in containers that still
require IPv4 can be transparently operated in an IPv6-only environment.
For example, host facing veth of the container can transparently do
the transitions in a programmatic way with the help of clsact qdisc
and cls_bpf.
Idea is to separate concerns for keeping complexity of the helper
lower, which means that the programs utilize bpf_skb_change_proto(),
bpf_skb_store_bytes() and bpf_lX_csum_replace() to get the job done,
instead of doing everything in a single helper (and thus partially
duplicating helper functionality). Also, bpf_skb_change_proto()
shouldn't need to deal with raw packet data as this is done by other
helpers.
bpf_skb_proto_6_to_4() and bpf_skb_proto_4_to_6() unclone the skb to
operate on a private one, push or pop additionally required header
space and migrate the gso/gro meta data from the shared info. We do
mark the gso type as dodgy so that headers are checked and segs
recalculated by the gso/gro engine. The gso_size target is adapted
as well. The flags argument added is currently reserved and can be
used for future extensions.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use smp_processor_id() for the generic helper bpf_get_smp_processor_id()
instead of the raw variant. This allows for preemption checks when we
have DEBUG_PREEMPT, and otherwise uses the raw variant anyway. We only
need to keep the raw variant for socket filters, but we can reuse the
helper that is already there from cBPF side.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Several cases of overlapping changes, except the packet scheduler
conflicts which deal with the addition of the free list parameter
to qdisc_enqueue().
Signed-off-by: David S. Miller <davem@davemloft.net>
The bat_algo.h had some functions declared which were not part of the
bat_algo.c file. These are instead stored in bat_v.c and bat_iv_ogm.c. The
declaration should therefore be also in bat_v.h and bat_iv_ogm,h to make
them easier to find.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
This patch adds a debugfs table with originators and their according
multicast flags to help users figure out why multicast optimizations
might be enabled or disabled for them.
Tested-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
There are several places in batman-adv which provide logging related
functions. These should be grouped together in the log.* files to make them
easier to find.
Reported-by: Markus Pargmann <mpa@pengutronix.de>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
With this patch changes relevant to a node's own multicast flags are
printed to the 'mcast' log level.
Tested-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The bat_algo functionality in main.c is mostly unrelated to the rest of the
content. It still takes up a large portion of this source file (~15%, 103
lines). Moving it to a separate file makes it better visible as a main
component of the batman-adv implementation and hides it less in the other
helper functions in main.c.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
With this patch we are finally able to support multicast optimizations
in bridged setups, too. So far, if a bridge was added on top of a
soft-interface (e.g. bat0) the batman-adv multicast optimizations
needed to be disabled to avoid packetloss.
Current Linux bridge implementations and API can now provide us
with the so far missing information about interested but "remote"
multicast receivers behind bridge ports.
The Linux bridge performs the detection of remote participants
interested in multicast packets with its own and mature so
called IGMP and MLD snooping code and stores that in its
database. With the new API provided by the bridge batman-adv can
now simply hook into this database.
We then reliably announce the gathered multicast listeners to
other nodes through the batman-adv translation table.
Additionally, the Linux bridge provides us with the information about
whether an IGMP/MLD querier exists. If there is none then we need to
disable multicast optimizations as we cannot learn about multicast
listeners on external, bridged-in host then.
Tested-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The tvlv functionality in main.c is mostly unrelated to the rest of the
content. It still takes up a large portion of this source file (~45%, 588
lines). Moving it to a separate file makes it better visible as a main
component of the batman-adv implementation and hides it less in the other
helper functions in main.c
Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
[sven@narfation.org: fix conflicts with current version, fix includes,
rewrote commit message]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
With this patch IGMP or MLD reports are always flooded. This is
necessary for the upcoming bridge integration to function without
multicast packet loss.
With the report handling so far bridges might miss interested multicast
listeners, leading to wrongly excluding ports from multicast packet
forwarding.
Currently we are treating IGMP/MLD reports, the messages bridges use to
learn about interested multicast listeners, just as any other multicast
packet: We try to send them to nodes matching its multicast destination.
Unfortunately, the destination address of reports of the older
IGMPv2/MLDv1 protocol families do not strictly adhere to their own
protocol: More precisely, the interested receiver, an IGMPv2 or MLDv1
querier, itself usually does not listen to the multicast destination
address of any reports.
Therefore with this patch we are simply excluding IGMP/MLD reports from
the multicast forwarding code path and keep flooding them. By that
any bridge receives them and can properly learn about listeners.
To avoid compatibility issues with older nodes not yet implementing this
report handling, we need to force them to flood reports: We do this by
bumping the multicast TVLV version to 2, effectively disabling their
multicast optimization.
Tested-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
It is easier to detect if a include is already there for a used
functionality when the includes are ordered. Using an alphabetic order
together with the grouping in commit 1e2c2a4fe4 ("batman-adv: Add
required includes to all files") makes includes better manageable.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Unfragmented frames which traverse a node have their skb->priority set
by looking at the IP ToS byte, or the 802.1p header. However for
fragments this is not possible, only one of the fragments will contain
the headers. Instead, place the priority into the fragment header and
on receiving a fragment, use this information to set the skb->priority
for when the fragment is forwarded.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
main.h includes statements which (re)define preprocessor variables which
influence the compiled code. This makes it necessary to include it in all
files. For example, it redefines pr_fmt used to the module as prefix for
each pr_* message.
Reported-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
BATMAN will set the skb->priority based on the IP precedence or 802.1q
tag. However, if it needs to fragment the frame, it currently leaves
the fragment skb with the default priority and actually overwrites the
priority in the unfragmented frame. Fix this.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The ELP interval and throughput override interface settings are initialized
with default settings on every time an interface is added to a mesh.
This patch prevents this behavior by moving the configuration init to the
interface detection routine which runs only once per interface.
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
[a@unstable.cc: move initialization to batadv_v_hardif_init]
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
To reduce the field pollution in our main batadv_priv data structure
we've already created some substructures so that we could group fields
in a convenient manner.
However gw_mode and gw_sel_class are still part of the main object.
More both fields to the GW private substructure.
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
the compiler can optimize functions within the same C file and therefore
there is no need to make it explicit.
Remove the useless inline attribute for __batadv_store_uint_attr()
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The ogm_emit and ogm_schedule API calls were rather tight to the
B.A.T.M.A.N. IV logic and therefore rather difficult to use
with other algorithm implementations.
Remove such calls and move the surrounding logic into the
B.A.T.M.A.N. IV specific code.
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
To make it easier to search through the code it is better to print static
strings directly instead of using format strings printing constants.
This was addressed in a previous patch, but the Gateway table header
was not updated accordingly.
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Reviewed-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Pull networking fixes from David Miller:
"I've been traveling so this accumulates more than week or so of bug
fixing. It perhaps looks a little worse than it really is.
1) Fix deadlock in ath10k driver, from Ben Greear.
2) Increase scan timeout in iwlwifi, from Luca Coelho.
3) Unbreak STP by properly reinjecting STP packets back into the
stack. Regression fix from Ido Schimmel.
4) Mediatek driver fixes (missing malloc failure checks, leaking of
scratch memory, wrong indexing when mapping TX buffers, etc.) from
John Crispin.
5) Fix endianness bug in icmpv6_err() handler, from Hannes Frederic
Sowa.
6) Fix hashing of flows in UDP in the ruseport case, from Xuemin Su.
7) Fix netlink notifications in ovs for tunnels, delete link messages
are never emitted because of how the device registry state is
handled. From Nicolas Dichtel.
8) Conntrack module leaks kmemcache on unload, from Florian Westphal.
9) Prevent endless jump loops in nft rules, from Liping Zhang and
Pablo Neira Ayuso.
10) Not early enough spinlock initialization in mlx4, from Eric
Dumazet.
11) Bind refcount leak in act_ipt, from Cong WANG.
12) Missing RCU locking in HTB scheduler, from Florian Westphal.
13) Several small MACSEC bug fixes from Sabrina Dubroca (missing RCU
barrier, using heap for SG and IV, and erroneous use of async flag
when allocating AEAD conext.)
14) RCU handling fix in TIPC, from Ying Xue.
15) Pass correct protocol down into ipv4_{update_pmtu,redirect}() in
SIT driver, from Simon Horman.
16) Socket timer deadlock fix in TIPC from Jon Paul Maloy.
17) Fix potential deadlock in team enslave, from Ido Schimmel.
18) Memory leak in KCM procfs handling, from Jiri Slaby.
19) ESN generation fix in ipv4 ESP, from Herbert Xu.
20) Fix GFP_KERNEL allocations with locks held in act_ife, from Cong
WANG.
21) Use after free in netem, from Eric Dumazet.
22) Uninitialized last assert time in multicast router code, from Tom
Goff.
23) Skip raw sockets in sock_diag destruction broadcast, from Willem
de Bruijn.
24) Fix link status reporting in thunderx, from Sunil Goutham.
25) Limit resegmentation of retransmit queue so that we do not
retransmit too large GSO frames. From Eric Dumazet.
26) Delay bpf program release after grace period, from Daniel
Borkmann"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (141 commits)
openvswitch: fix conntrack netlink event delivery
qed: Protect the doorbell BAR with the write barriers.
neigh: Explicitly declare RCU-bh read side critical section in neigh_xmit()
e1000e: keep VLAN interfaces functional after rxvlan off
cfg80211: fix proto in ieee80211_data_to_8023 for frames without LLC header
qlcnic: use the correct ring in qlcnic_83xx_process_rcv_ring_diag()
bpf, perf: delay release of BPF prog after grace period
net: bridge: fix vlan stats continue counter
tcp: do not send too big packets at retransmit time
ibmvnic: fix to use list_for_each_safe() when delete items
net: thunderx: Fix TL4 configuration for secondary Qsets
net: thunderx: Fix link status reporting
net/mlx5e: Reorganize ethtool statistics
net/mlx5e: Fix number of PFC counters reported to ethtool
net/mlx5e: Prevent adding the same vxlan port
net/mlx5e: Check for BlueFlame capability before allocating SQ uar
net/mlx5e: Change enum to better reflect usage
net/mlx5: Add ConnectX-5 PCIe 4.0 to list of supported devices
net/mlx5: Update command strings
net: marvell: Add separate config ANEG function for Marvell 88E1111
...
* fix mesh peer link counter, decrement wasn't always done at all
* fix ethertype (length) for packets without RFC 1042 or bridge
tunnel header
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJXc5wQAAoJEGt7eEactAAd2HQP/RS3lZaNn97spZp9Vb9Swqd6
UjhT4vHL2wHPMhPals2M2ztRhL8df/55tOhSuCwUki0gTsAfP4+Jz1PZ4okFUSBn
6EHvRtAgSAzjBElZVEsNO2sM/qm0UOnlqm4XpOXcKzpnRzDSlIa6vpnctp2MJMBI
PB46x8BPQoi192aTzleMqgBwjrMX1ohsM6rgOIEf7rvLhftx6LRRm7/uKaiwenVT
ZxkP6jIOs5ofkUHLHS4OPyId4yReO8oMMbRRA4fW1z2T8hME9PH3jTZL3Znaj7VF
s2LLSqwUlv06k0WFu3SzJh750jCJ6cnJiIqggDlPChrb0/DPicqgWR7/tJAvNVfj
WJqugeNtEVI+4ElGNxFzFAvjAKqO8fHY5U8Ko4LtZuwhjpo5GDvA+FErFESRXvCG
c647cDMhn+6OXwpUwsggQK4c+AS4QX/PzHjmhW5tKORPXlYVXAAz+JDXn195WnzV
w4UAO4ZWai9XY6oSc63uudaHMt8Xkmq8PRQsH1hHG20CAJ+7bGugaWVTci5fzeNh
JC4X+Xh6t+5qq1jEWgr+KguVKXtP3pVetqGy9kYQrJov0fTe2Z6deEeY08sH1Ffz
xdnQUsZphvqSkosTw76JtNPXz2ajxZCD3KNWjcpkL/NNs9lnIZPZc/mnzKP7ltro
ZEYw0HbcXVt/uFhHlwdG
=dAbs
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2016-06-29-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
Just two small fixes
* fix mesh peer link counter, decrement wasn't always done at all
* fix ethertype (length) for packets without RFC 1042 or bridge
tunnel header
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Only the first and last netlink message for a particular conntrack are
actually sent. The first message is sent through nf_conntrack_confirm when
the conntrack is committed. The last one is sent when the conntrack is
destroyed on timeout. The other conntrack state change messages are not
advertised.
When the conntrack subsystem is used from netfilter, nf_conntrack_confirm
is called for each packet, from the postrouting hook, which in turn calls
nf_ct_deliver_cached_events to send the state change netlink messages.
This commit fixes the problem by calling nf_ct_deliver_cached_events in the
non-commit case as well.
Fixes: 7f8a436eaa ("openvswitch: Add conntrack action")
CC: Joe Stringer <joestringer@nicira.com>
CC: Justin Pettit <jpettit@nicira.com>
CC: Andy Zhou <azhou@nicira.com>
CC: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Samuel Gauthier <samuel.gauthier@6wind.com>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
If skb_unshare() fails, we call qdisc_drop() with a NULL skb, which
is no longer supported.
Fixes: 520ac30f45 ("net_sched: drop packets after root qdisc lock is released")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
neigh_xmit() expects to be called inside an RCU-bh read side critical
section, and while one of its two current callers gets this right, the
other one doesn't.
More specifically, neigh_xmit() has two callers, mpls_forward() and
mpls_output(), and while both callers call neigh_xmit() under
rcu_read_lock(), this provides sufficient protection for neigh_xmit()
only in the case of mpls_forward(), as that is always called from
softirq context and therefore doesn't need explicit BH protection,
while mpls_output() can be called from process context with softirqs
enabled.
When mpls_output() is called from process context, with softirqs
enabled, we can be preempted by a softirq at any time, and RCU-bh
considers the completion of a softirq as signaling the end of any
pending read-side critical sections, so if we do get a softirq
while we are in the part of neigh_xmit() that expects to be run inside
an RCU-bh read side critical section, we can end up with an unexpected
RCU grace period running right in the middle of that critical section,
making things go boom.
This patch fixes this impedance mismatch in the callee, by making
neigh_xmit() always take rcu_read_{,un}lock_bh() around the code that
expects to be treated as an RCU-bh read side critical section, as this
seems a safer option than fixing it in the callers.
Fixes: 4fd3d7d9e8 ("neigh: Add helper function neigh_xmit")
Signed-off-by: David Barroso <dbarroso@fastly.com>
Signed-off-by: Lennert Buytenhek <lbuytenhek@fastly.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The PDU length of incoming LLC frames is set to the total skb payload size
in __ieee80211_data_to_8023() of net/wireless/util.c which incorrectly
includes the length of the IEEE 802.11 header.
The resulting LLC frame header has a too large PDU length, causing the
llc_fixup_skb() function of net/llc/llc_input.c to reject the incoming
skb, effectively breaking STP.
Solve the problem by properly substracting the IEEE 802.11 frame header size
from the PDU length, allowing the LLC processor to pick up the incoming
control messages.
Special thanks to Gerry Rozema for tracking down the regression and proposing
a suitable patch.
Fixes: 2d1c304cb2 ("cfg80211: add function for 802.3 conversion with separate output buffer")
Cc: stable@vger.kernel.org
Reported-by: Gerry Rozema <gerryr@rozeware.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
I made a dumb off-by-one mistake when I added the vlan stats counter
dumping code. The increment should happen before the check, not after
otherwise we miss one entry when we continue dumping.
Fixes: a60c090361 ("bridge: netlink: export per-vlan stats")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arjun reported a bug in TCP stack and bisected it to a recent commit.
In case where we process SACK, we can coalesce multiple skbs
into fat ones (tcp_shift_skb_data()), to lower write queue
overhead, because we do not expect to retransmit these packets.
However, SACK reneging can happen, forcing the sender to retransmit
all these packets. If skb->len is above 64KB, we then send buggy
IP packets that could hang TSO engine on cxgb4.
Neal suggested to use tcp_tso_autosize() instead of tp->gso_segs
so that we cook packets of optimal size vs TCP/pacing.
Thanks to Arjun for reporting the bug and running the tests !
Fixes: 10d3be5692 ("tcp-tso: do not split TSO packets at retransmit time")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Arjun V <arjun@chelsio.com>
Tested-by: Arjun V <arjun@chelsio.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Context implies that port in struct "udp_media_addr" is referring
to a UDP port.
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The UDP msg2addr function tipc_udp_msg2addr() can return -EINVAL which
prior to this patch was unhanded in the caller.
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The space is missing before the open parenthesis '(', and this
will introduce much more noise when checking patch around.
Signed-off-by: Wei Tang <tangwei@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The untagged vlan object is only destroyed when the interface is removed
via the legacy sysfs interface. But it also has to be destroyed when the
standard rtnl-link interface is used.
Fixes: 5d2c05b213 ("batman-adv: add per VLAN interface attribute framework")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Acked-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The skb_linearize may reallocate the skb. This makes the calculated pointer
for ethhdr invalid. But it the pointer is used later to fill in the RR
field of the batadv_icmp_packet_rr packet.
Instead re-evaluate eth_hdr after the skb_linearize+skb_cow to fix the
pointer and avoid the invalid read.
Fixes: da6b8c20a5 ("batman-adv: generalize batman-adv icmp packet handling")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Each batadv_tt_local_entry hold a single reference to a
batadv_softif_vlan. In case a new entry cannot be added to the hash
table, the error path puts the reference, but the reference will also
now be dropped by batadv_tt_local_entry_release().
Fixes: a33d970d0b ("batman-adv: Fix reference counting of vlan object for tt_local_entry")
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The tt_req_node is added and removed from a list inside a spinlock. But the
locking is sometimes removed even when the object is still referenced and
will be used later via this reference. For example batadv_send_tt_request
can create a new tt_req_node (including add to a list) and later
re-acquires the lock to remove it from the list and to free it. But at this
time another context could have already removed this tt_req_node from the
list and freed it.
CPU#0
batadv_batman_skb_recv from net_device 0
-> batadv_iv_ogm_receive
-> batadv_iv_ogm_process
-> batadv_iv_ogm_process_per_outif
-> batadv_tvlv_ogm_receive
-> batadv_tvlv_ogm_receive
-> batadv_tvlv_containers_process
-> batadv_tvlv_call_handler
-> batadv_tt_tvlv_ogm_handler_v1
-> batadv_tt_update_orig
-> batadv_send_tt_request
-> batadv_tt_req_node_new
spin_lock(...)
allocates new tt_req_node and adds it to list
spin_unlock(...)
return tt_req_node
CPU#1
batadv_batman_skb_recv from net_device 1
-> batadv_recv_unicast_tvlv
-> batadv_tvlv_containers_process
-> batadv_tvlv_call_handler
-> batadv_tt_tvlv_unicast_handler_v1
-> batadv_handle_tt_response
spin_lock(...)
tt_req_node gets removed from list and is freed
spin_unlock(...)
CPU#0
<- returned to batadv_send_tt_request
spin_lock(...)
tt_req_node gets removed from list and is freed
MEMORY CORRUPTION/SEGFAULT/...
spin_unlock(...)
This can only be solved via reference counting to allow multiple contexts
to handle the list manipulation while making sure that only the last
context holding a reference will free the object.
Fixes: a73105b8d4 ("batman-adv: improved client announcement mechanism")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Tested-by: Martin Weinelt <martin@darmstadt.freifunk.net>
Tested-by: Amadeus Alfa <amadeus@chemnitz.freifunk.net>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a VLAN tagged frame is received and the corresponding VLAN is not
configured on the soft interface, it will splat a WARN on every packet
received. This is a quite annoying behaviour for some scenarios, e.g. if
bat0 is bridged with eth0, and there are arbitrary VLAN tagged frames
from Ethernet coming in without having any VLAN configuration on bat0.
The code should probably create vlan objects on the fly and
transparently transport these VLAN-tagged Ethernet frames, but until
this is done, at least the WARN splat should be replaced by a rate
limited output.
Fixes: 354136bcc3 ("batman-adv: fix kernel crash due to missing NULL checks")
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bridge is falsly dropping ipv6 mulitcast packets if there is:
1. No ipv6 address assigned on the brigde.
2. No external mld querier present.
3. The internal querier enabled.
When the bridge fails to build mld queries, because it has no
ipv6 address, it slilently returns, but keeps the local querier enabled.
This specific case causes confusing packet loss.
Ipv6 multicast snooping can only work if:
a) An external querier is present
OR
b) The bridge has an ipv6 address an is capable of sending own queries
Otherwise it has to forward/flood the ipv6 multicast traffic,
because snooping cannot work.
This patch fixes the issue by adding a flag to the bridge struct that
indicates that there is currently no ipv6 address assinged to the bridge
and returns a false state for the local querier in
__br_multicast_querier_exists().
Special thanks to Linus Lüssing.
Fixes: d1d81d4c3d ("bridge: check return value of ipv6_dev_get_saddr()")
Signed-off-by: Daniel Danzberger <daniel@dd-wrt.com>
Acked-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a user space program (e.g., wpa_supplicant) deletes a STA entry that
is currently in NL80211_PLINK_ESTAB state, the number of established
plinks counter was not decremented and this could result in rejecting
new plink establishment before really hitting the real maximum plink
limit. For !user_mpm case, this decrementation is handled by
mesh_plink_deactive().
Fix this by decrementing estab_plinks on STA deletion
(mesh_sta_cleanup() gets called from there) so that the counter has a
correct value and the Beacon frame advertisement in Mesh Configuration
element shows the proper value for capability to accept additional
peers.
Cc: stable@vger.kernel.org
Signed-off-by: Jouni Malinen <j@w1.fi>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Drop redundant include of moduleparam.h
The Coccinelle semantic patch used to make this change is as follows:
@ includesmodule @
@@
#include <linux/module.h>
@ depends on includesmodule @
@@
- #include <linux/moduleparam.h>
Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support to inet_diag facility to filter sockets based on device
index. If an interface index is in the filter only sockets bound
to that index (sk_bound_dev_if) are returned.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes wrong-interface signaling on 32-bit platforms for entries
created when jiffies > 2^31 + MFC_ASSERT_THRESH.
Signed-off-by: Tom Goff <thomas.goff@ll.mit.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use only simple inline assemblies which consist of a single basic
block if the register asm construct is being used.
Otherwise gcc would generate broken code if the compiler option
--sanitize-coverage=trace-pc would be used.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
SMACK uses similar functions to control CIPSO, these are
the equivalent functions for CALIPSO and follow exactly
the same semantics.
int netlbl_cfg_calipso_add(struct calipso_doi *doi_def,
struct netlbl_audit *audit_info)
Adds a CALIPSO doi.
void netlbl_cfg_calipso_del(u32 doi, struct netlbl_audit *audit_info)
Removes a CALIPSO doi.
int netlbl_cfg_calipso_map_add(u32 doi, const char *domain,
const struct in6_addr *addr,
const struct in6_addr *mask,
struct netlbl_audit *audit_info)
Creates a mapping between a domain and a CALIPSO doi. If
addr and mask are non-NULL this creates an address-selector
type mapping.
This also extends netlbl_cfg_map_del() to remove IPv6 address-selector
mappings.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
This works in exactly the same way as the CIPSO label cache.
The idea is to allow the lsm to cache the result of a secattr
lookup so that it doesn't need to perform the lookup for
every skbuff.
It introduces two sysctl controls:
calipso_cache_enable - enables/disables the cache.
calipso_cache_bucket_size - sets the size of a cache bucket.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Lengths, checksum and the DOI are checked. Checking of the
level and categories are left for the socket layer.
CRC validation is performed in the calipso module to avoid
unconditionally linking crc_ccitt() into ipv6.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
This makes it possible to route the error to the appropriate
labelling engine. CALIPSO is far less verbose than CIPSO
when encountering a bogus packet, so there is no need for a
CALIPSO error handler.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
In some cases, the lsm needs to add the label to the skbuff directly.
A NF_INET_LOCAL_OUT IPv6 hook is added to selinux to match the IPv4
behaviour. This allows selinux to label the skbuffs that it requires.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Request sockets need to have a label that takes into account the
incoming connection as well as their parent's label. This is used
for the outgoing SYN-ACK and for their child full-socket.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
If set, these will take precedence over the parent's options during
both sending and child creation. If they're not set, the parent's
options (if any) will be used.
This is to allow the security_inet_conn_request() hook to modify the
IPv6 options in just the same way that it already may do for IPv4.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
CALIPSO is a hop-by-hop IPv6 option. A lot of this patch is based on
the equivalent CISPO code. The main difference is due to manipulating
the options in the hop-by-hop header.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
This is to allow the CALIPSO labelling engine to use these.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
The functionality is equivalent to ipv6_renew_options() except
that the newopt pointer is in kernel, not user, memory
The kernel memory implementation will be used by the CALIPSO network
labelling engine, which needs to be able to set IPv6 hop-by-hop
options.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Remove a specified DOI through the NLBL_CALIPSO_C_REMOVE command.
It requires the attribute:
NLBL_CALIPSO_A_DOI.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
This extends the NLBL_MGMT_C_ADD and NLBL_MGMT_C_ADDDEF commands
to accept CALIPSO protocol DOIs.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Enumerate the DOI list through the NLBL_CALIPSO_C_LISTALL command.
It takes no attributes.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Query a specified DOI through the NLBL_CALIPSO_C_LIST command.
It requires the attribute:
NLBL_CALIPSO_A_DOI.
The reply will contain:
NLBL_CALIPSO_A_MTYPE
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
CALIPSO is a packet labelling protocol for IPv6 which is very similar
to CIPSO. It is specified in RFC 5570. Much of the code is based on
the current CIPSO code.
This adds support for adding passthrough-type CALIPSO DOIs through the
NLBL_CALIPSO_C_ADD command. It requires attributes:
NLBL_CALIPSO_A_TYPE which must be CALIPSO_MAP_PASS.
NLBL_CALIPSO_A_DOI.
In passthrough mode the CALIPSO engine will map MLS secattr levels
and categories directly to the packet label.
At this stage, the major difference between this and the CIPSO
code is that IPv6 may be compiled as a module. To allow for
this the CALIPSO functions are registered at module init time.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
The reason is to allow different labelling protocols for
different address families with the same domain.
This requires the addition of an address family attribute
in the netlink communication protocol. It is used in several
messages:
NLBL_MGMT_C_ADD and NLBL_MGMT_C_ADDDEF take it as an optional
attribute for the unlabelled protocol. It may be one of AF_INET,
AF_INET6 or AF_UNSPEC (to specify both address families). If it
is missing, it defaults to AF_UNSPEC.
NLBL_MGMT_C_LISTALL and NLBL_MGMT_C_LISTDEF return it as part of
the enumeration of each item. Addtionally, it may be sent to
LISTDEF to specify which address family to return.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
This fixes sparse errors of the form:
incompatible types in comparison expression (different address spaces)
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
There are several places where the listener and pending or accept queue
child sockets are accessed at the same time. Lockdep is unhappy that
two locks from the same class are held.
Tell lockdep that it is safe and document the lock ordering.
Originally Claudio Imbrenda <imbrenda@linux.vnet.ibm.com> sent a similar
patch asking whether this is safe. I have audited the code and also
covered the vsock_pending_work() function.
Suggested-by: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
with the commit 8c14586fc3 ("net: ipv6: Use passed in table for
nexthop lookups"), net hop lookup is first performed on route creation
in the passed-in table.
However device match is not enforced in table lookup, so the found
route can be later discarded due to egress device mismatch and no
global lookup will be performed.
This cause the following to fail:
ip link add dummy1 type dummy
ip link add dummy2 type dummy
ip link set dummy1 up
ip link set dummy2 up
ip route add 2001:db8:8086::/48 dev dummy1 metric 20
ip route add 2001:db8:d34d::/64 via 2001:db8:8086::2 dev dummy1 metric 20
ip route add 2001:db8:8086::/48 dev dummy2 metric 21
ip route add 2001:db8:d34d::/64 via 2001:db8:8086::2 dev dummy2 metric 21
RTNETLINK answers: No route to host
This change fixes the issue enforcing device lookup in
ip6_nh_lookup_table()
v1->v2: updated commit message title
Fixes: 8c14586fc3 ("net: ipv6: Use passed in table for nexthop lookups")
Reported-and-tested-by: Beniamino Galvani <bgalvani@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQEcBAABCgAGBQJXa6rGAAoJED07qiWsqSVqzwsH/0g/blikQuNNAB7LzXGTEMn8
ipWCODaYxgWE/V/1A0JjsKpxB1ZL3HE0/xg/OmZ10SRNlz5LJTLkoYZa7xYmyxtT
a89mT9vdnWxwfwbeIhSSjG9W9g44vqrhMIPxffIu5yqSqdIJak0HIiuQizJ0/xjx
xQVx4AykWKQ3u0/Tiyz3ez5yhyMMNEmHxKyDWpgrR6+zXlofwP/Em3NFPwk9gh32
ECAagOUOEvdjXeRs1Yn/CV0FC7Wgs4Hzr048JJ5wOteawBLr+sPOnuWZtDhNrwDK
Eceh7aEeAiFfIW+jCMEe5qrpFHi8hqyqBo4L9TDYsE7T/Z6BPyh+5OUh8+5dtaw=
=TZew
-----END PGP SIGNATURE-----
Merge tag 'linux-can-next-for-4.8-20160623' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next
Marc Kleine-Budde says:
====================
pull-request: can-next 2016-06-17
this is a pull request of 4 patches for net-next/master.
Arnd Bergmann's patch fixes a regresseion in af_can introduced in
linux-can-next-for-4.8-20160617. There are two patches by Ramesh
Shanmugasundaram, which add CAN-2.0 support to the rcar_canfd driver.
And a patch by Ed Spiridonov that adds better error diagnoses messages
to the Ed Spiridonov driver.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace calls to kmalloc followed by a memcpy with a direct call to
kmemdup.
The Coccinelle semantic patch used to make this change is as follows:
@@
expression from,to,size,flag;
statement S;
@@
- to = \(kmalloc\|kzalloc\)(size,flag);
+ to = kmemdup(from,size,flag);
if (to==NULL || ...) S
- memcpy(to, from, size);
Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIVAwUAV2q17PSw1s6N8H32AQKOixAAh+Fw6H7j/biCA73Fi0CtVJpqxvDEoo0d
jGBNtIueiyTnuGpV8yqT23xrgcuaQlLoEPhwQTRPFy/jI2qmah69kuIZzz6ZvNxA
sSsbc4M7PerMwmX/gPwUtvWflal1ECmpc6f+5y3pZBhqowMwm9HwxR0489FCurba
4k1w/OxDQIIH88RIsNcYX129xTIvekzB8bjhkIzfQM3WLfelhYyPr8haTt+CrCuF
gdLB7O+AoCI7rxXuS+7blZq4+AryzNAWjpJQdQXlClF2UJhBDNO6CptmsmL5kpuN
0a6ijumB4Onak6zZhQo5PmvX2UbQbh6QEuGm1ZsdyZoTOFPgynRv0ZVgF4JmBl0t
vhkKrbtcLkSYUHnFUoCBDuJnI9exugUH5a0BjdVPs/J6Zha0eS1pF0IOZEOQzC8j
C4U+dgjKP4OArRLS6rRR5oS99zFijcTR+fNp+0rORwqgiDJhBgkbIO9y2hyDrjr9
OW+Hnbm2EPNF86kLG+zi/OSZ2Af/fX5gtkHvfLETE1rxzZi6G1leBiEOezCVJa1P
/jtr8RR4hDOND8XZ4qvUC0yRBo2ykyW0OjtqZ2No7PuL1z/N6gvaV0PEn62yrP/g
XCcmVxXquTMrQlqu66cQGJ+ZdD6MYa44oAOHGRpX55GCvZyjpMFrLPJ8B9XbXoh3
YXIDu5U8kQs=
=HxJc
-----END PGP SIGNATURE-----
Merge tag 'rxrpc-rewrite-20160622-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc: Get rid of conn bundle and transport structs
Here's the next part of the AF_RXRPC rewrite. The primary purpose of this
set is to get rid of the rxrpc_conn_bundle and rxrpc_transport structs.
This simplifies things for future development of the connection handling.
To this end, the following significant changes are made:
(1) The rxrpc_connection struct is given pointers to the local and peer
endpoints, inside the rxrpc_conn_parameters struct. Pointers to the
transport's copy of these pointers are then redirected to the
connection struct.
(2) Exclusive connection handling is fixed. Exclusive connections should
do just one call and then be retired. They are used in security
negotiations and, I believe, the idea is to avoid reuse of negotiated
security contexts.
The current code is doing a single connection per socket and doing all
the calls over that. With this change it gets a new connection for
each call made.
(3) A new sendmsg() control message marker is added to make individual
calls operate over exclusive connections. This should be used in
future in preference to the sockopt that marks a socket as "exclusive
connection".
(4) IDs for client connections initiated by a machine are now allocated
from a global pool using the IDR facility and are unique across all
client connections, no matter their destination. The IDR facility is
then used to look up a connection on the connection ID alone. Other
parameters are then verified afterwards.
Note that the IDR facility may use a lot of memory if the IDs it holds
are widely scattered. Given this, in a future commit, client
connections will be retired if they are more than a certain distance
from the last ID allocated.
The client epoch is advanced by 1 each time the client ID counter
wraps. Connections outside the current epoch will also be retired in
a future commit.
(5) The connection bundle concept is removed and the client connection
tree is moved into the local endpoint. The queue for waiting for a
call channel is moved to the rxrpc_connection struct as there can only
be one connection for any particular key going to any particular peer
now.
(6) The rxrpc_transport struct is removed and the service connection tree
is moved into the peer struct.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When qdisc bulk dequeue was added in linux-3.18 (commit
5772e9a346 "qdisc: bulk dequeue support for qdiscs
with TCQ_F_ONETXQUEUE"), it was constrained to some
specific qdiscs.
With some extra care, we can extend this to all qdiscs,
so that typical traffic shaping solutions can benefit from
small batches (8 packets in this patch).
For example, HTB is often used on some multi queue device.
And bonding/team are multi queue devices...
Idea is to bulk-dequeue packets mapping to the same transmit queue.
This brings between 35 and 80 % performance increase in HTB setup
under pressure on a bonding setup :
1) NUMA node contention : 610,000 pps -> 1,110,000 pps
2) No node contention : 1,380,000 pps -> 1,930,000 pps
Now we should work to add batches on the enqueue() side ;)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Florian Westphal <fw@strlen.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We already get child qdisc qlen, we also can get its backlog
so that class dumps can report it.
Also replace qstats by a single drop counter, but move it in
a separate cache line so that drops do not dirty useful cache lines.
Tested:
$ tc -s cl sh dev eth0
class htb 1:1 root leaf 3: prio 0 rate 1Gbit ceil 1Gbit burst 500000b cburst 500000b
Sent 2183346912 bytes 9021815 pkt (dropped 2340774, overlimits 0 requeues 0)
rate 1001Mbit 517543pps backlog 120758b 499p requeues 0
lended: 9021770 borrowed: 0 giants: 0
tokens: 9 ctokens: 9
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now we defer skb drops, it makes sense to keep a copy
of skb->truesize in struct codel_skb_cb to avoid one
cache line miss per dropped skb in fq_codel_drop(),
to reduce latencies a bit further.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Qdisc performance suffers when packets are dropped at enqueue()
time because drops (kfree_skb()) are done while qdisc lock is held,
delaying a dequeue() draining the queue.
Nominal throughput can be reduced by 50 % when this happens,
at a time we would like the dequeue() to proceed as fast as possible.
Even FQ is vulnerable to this problem, while one of FQ goals was
to provide some flow isolation.
This patch adds a 'struct sk_buff **to_free' parameter to all
qdisc->enqueue(), and in qdisc_drop() helper.
I measured a performance increase of up to 12 %, but this patch
is a prereq so that future batches in enqueue() can fly.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In ircomm_tty_get_serial_info, struct serial_struct is memset to 0 and
then some members set to 0 explicitly.
Remove the latter as it is obviously superfluous.
And remove the retinfo check against NULL. copy_to_user will take care
of that.
Part of hub6 cleanup series.
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Samuel Ortiz <samuel@sortiz.org>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Only set conntrack mark or labels when the commit flag is specified.
This makes sure we can not set them before the connection has been
persisted, as in that case the mark and labels would be lost in an
event of an userspace upcall.
OVS userspace already requires the commit flag to accept setting
ct_mark and/or ct_labels. Validate for this in the kernel API.
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Set conntrack mark and labels right before committing so that
the initial conntrack NEW event has the mark and labels.
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce a new configuration option for this expression, which allows users
to invert the logic of set lookups.
In _init() we will now return EINVAL if NFT_LOOKUP_F_INV is in anyway
related to a map lookup.
The code in the _eval() function has been untangled and updated to sopport the
XOR of options, as we should consider 4 cases:
* lookup false, invert false -> NFT_BREAK
* lookup false, invert true -> return w/o NFT_BREAK
* lookup true, invert false -> return w/o NFT_BREAK
* lookup true, invert true -> NFT_BREAK
Signed-off-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This flag was introduced to restore rulesets from the new netdev
family, but since 5ebe0b0eec ("netfilter: nf_tables: destroy
basechain and rules on netdevice removal") the ruleset is released
once the netdev is gone.
This also removes nft_register_basechain() and
nft_unregister_basechain() since they have no clients anymore after
this rework.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
No need to restrict this to module parameter.
We export a copy of the real hash size -- when user alters the value we
allocate the new table, copy entries etc before we update the real size
to the requested one.
This is also needed because the real size is used by concurrent readers
and cannot be changed without synchronizing the conntrack generation
seqcnt.
We only allow changing this value from the initial net namespace.
Tested using http-client-benchmark vs. httpterm with concurrent
while true;do
echo $RANDOM > /proc/sys/net/netfilter/nf_conntrack_buckets
done
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
New elements are inactive in the preparation phase, and its
NFT_SET_ELEM_BUSY_MASK flag is set on.
This busy flag doesn't allow us to delete it from the same transaction,
following a sequence like:
begin transaction
add element X
delete element X
end transaction
This sequence is valid and may be triggered by robots. To resolve this
problem, allow deactivating elements that are active in the current
generation (ie. those that has been just added in this batch).
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
set->ops->deactivate() is invoked from nft_del_setelem() that happens
from the transaction path, so we have to check if the object is active
in the next generation, not the current.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch addresses two problems:
1) The netlink dump is inconsistent when interfering with an ongoing
transaction update for several reasons:
1.a) We don't honor the internal NFT_TABLE_INACTIVE flag, and we should
be skipping these inactive objects in the dump.
1.b) We perform speculative deletion during the preparation phase, that
may result in skipping active objects.
1.c) The listing order changes, which generates noise when tracking
incremental ruleset update via tools like git or our own
testsuite.
2) We don't allow to add and to update the object in the same batch,
eg. add table x; add table x { flags dormant\; }.
In order to resolve these problems:
1) If the user requests a deletion, the object becomes inactive in the
next generation. Then, ignore objects that scheduled to be deleted
from the lookup path, as they will be effectively removed in the
next generation.
2) From the get/dump path, if the object is not currently active, we
skip it.
3) Support 'add X -> update X' sequence from a transaction.
After this update, we obtain a consistent list as long as we stay
in the same generation. The userspace side can detect interferences
through the generation counter so it can restart the dumping.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Thus, we can reuse these to check the genmask of any object type, not
only rules. This is required now that tables, chain and sets will get a
generation mask field too in follow up patches.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
li->u.ulog.copy_len is currently ignored by the kernel, we should truncate
the packet to either li->u.ulog.copy_len (if set) or copy_range before
sending it to userspace. 0 is a valid input for copy_len, so add a new
flag to indicate whether this was option was specified by the user or not.
Add two flags to indicate whether nflog-size/copy_len was set or not.
XT_NFLOG_F_COPY_LEN is for XT_NFLOG and NFLOG_F_COPY_LEN for nfnetlink_log
On the userspace side, this was initially represented by the option
nflog-range, this will be replaced by --nflog-size now. --nflog-range would
still exist but does not do anything.
Reported-by: Joe Dollard <jdollard@akamai.com>
Reviewed-by: Josh Hunt <johunt@akamai.com>
Signed-off-by: Vishwanath Pai <vpai@akamai.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
In iptables, if the user add a rule to send tcp RST and specify the
non-TCP protocol, such as UDP, kernel will reject this request. But
in nftables, this validity check only occurs in nft tool, i.e. only
in userspace.
This means that user can add such a rule like follows via nfnetlink:
"nft add rule filter forward ip protocol udp reject with tcp reset"
This will generate some confusing tcp RST packets. So we should send
tcp RST only when it is TCP packet.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Today what is normally called data (the mount options) is not passed
to fill_super through mount_ns.
Pass the mount options and the namespace separately to mount_ns so
that filesystems such as proc that have mount options, can use
mount_ns.
Pass the user namespace to mount_ns so that the standard permission
check that verifies the mounter has permissions over the namespace can
be performed in mount_ns instead of in each filesystems .mount method.
Thus removing the duplication between mqueuefs and proc in terms of
permission checks. The extra permission check does not currently
affect the rpc_pipefs filesystem and the nfsd filesystem as those
filesystems do not currently allow unprivileged mounts. Without
unpvileged mounts it is guaranteed that the caller has already passed
capable(CAP_SYS_ADMIN) which guarantees extra permission check will
pass.
Update rpc_pipefs and the nfsd filesystem to ensure that the network
namespace reference is always taken in fill_super and always put in kill_sb
so that the logic is simpler and so that errors originating inside of
fill_super do not cause a network namespace leak.
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
If the packet was dropped by lower qdisc, then we must not
access it later.
Save qdisc_pkt_len(skb) in a temp variable.
Fixes: 2ccccf5fb4 ("net_sched: update hierarchical backlog too")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexey reported that we have GFP_KERNEL allocation when
holding the spinlock tcf_lock. Actually we don't have
to take that spinlock for all the cases, especially
for the new one we just create. To modify the existing
actions, we still need this spinlock to make sure
the whole update is atomic.
For net-next, we can get rid of this spinlock because
we already hold the RTNL lock on slow path, and on fast
path we can use RCU to protect the metalist.
Joint work with Jamal.
Reported-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Blair Steven noticed that ESN in conjunction with UDP encapsulation
is broken because we set the temporary ESP header to the wrong spot.
This patch fixes this by first of all using the right spot, i.e.,
4 bytes off the real ESP header, and then saving this information
so that after encryption we can restore it properly.
Fixes: 7021b2e1cd ("esp4: Switch to new AEAD interface")
Reported-by: Blair Steven <Blair.Steven@alliedtelesis.co.nz>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When user add a nft rule to set nftrace to zero, for example:
# nft add rule ip filter input nftrace set 0
We should set nf_trace to zero also.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If expr init fails then we need to free it.
So when the user add a nft rule as follows:
# nft add rule filter input tcp dport 22 flow table ssh \
{ ip saddr limit rate 0/second }
memory leak will happen.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Making this work is a little tricky as it really isn't kosher to
change the xt_owner_match_info in a check function.
Without changing xt_owner_match_info we need to know the user
namespace the uids and gids are specified in. In the common case
net->user_ns == current_user_ns(). Verify net->user_ns ==
current_user_ns() in owner_check so we can later assume it in
owner_mt.
In owner_check also verify that all of the uids and gids specified are
in net->user_ns and that the expected min/max relationship exists
between the uids and gids in xt_owner_match_info.
In owner_mt get the network namespace from the outgoing socket, as this
must be the same network namespace as the netfilter rules, and use that
network namespace to find the user namespace the uids and gids in
xt_match_owner_info are encoded in. Then convert from their encoded
from into the kernel internal format for uids and gids and perform the
owner match.
Similar to ping_group_range, this code does not try to detect
noncontiguous UID/GID ranges.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Kevin Cernekee <cernekee@chromium.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Curently we store zone information as a conntrack extension.
This has one drawback: for every lookup we need to fetch the zone data
from the extension area.
This change place the zone data directly into the main conntrack object
structure and then removes the zone conntrack extension.
The zone data is just 4 bytes, it fits into a padding hole before
the tuplehash info, so we do not even increase the nf_conn structure size.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If 'logger' was NULL, there would be a direct jump to the label 'out',
since it has already been checked for NULL, remove this unnecessary
check.
Signed-off-by: Shivani Bhardwaj <shivanib134@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
increases struct size by 32 bytes (288 -> 320), but it is the right thing,
else any attempt to (re-)arrange nf_conn members by cacheline won't work.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Consider such situation, if nf_log_ipv4 kernel module is not installed,
and the user add a following iptables rule:
# iptables -t raw -I PREROUTING -j TRACE
There will be no trace log generated until the user install nf_log_ipv4
module manully. So we should add request related nf_log module
appropriately here.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When we request NFPROTO_INET, it means both NFPROTO_IPV4 and NFPROTO_IPV6.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Since we cannot make sure that the 'hook_mask' will always be none
zero here. If it equals to zero, the num_hooks will be zero too,
and then kmalloc() will return ZERO_SIZE_PTR, which is (void *)16.
Then the following error check will fails:
ops = kmalloc(sizeof(*ops) * num_hooks, GFP_KERNEL);
if (ops == NULL)
return ERR_PTR(-ENOMEM);
So this patch will fix this with just doing the zero check before
kmalloc() is called.
Maybe the case above will never happen here, but in theory.
Signed-off-by: Xiubo Li <lixiubo@cmss.chinamobile.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The change to leave out procfs support in CAN when CONFIG_PROC_FS
is not set was incomplete and leads to a build error:
net/built-in.o: In function `can_init':
:(.init.text+0x9858): undefined reference to `can_stat_update'
ERROR: "can_stat_update" [net/can/can.ko] undefined!
This tries a better approach, encapsulating all of the calls
within IS_ENABLED(), so we also leave out the timer function
from the object file.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: a20fadf853 ("can: build proc support only if CONFIG_PROC_FS is activated")
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
The commit f2a4d086ed ("openvswitch: Add packet truncation support.")
introduces packet truncation before sending to userspace upcall receiver.
This patch passes up the skb->len before truncation so that the upcall
receiver knows the original packet size. Potentially this will be used
by sFlow, where OVS translates sFlow config header=N to a sample action,
truncating packet to N byte in kernel datapath. Thus, only N bytes instead
of full-packet size is copied from kernel to userspace, saving the
kernel-to-userspace bandwidth.
Signed-off-by: William Tu <u9012063@gmail.com>
Cc: Pravin Shelar <pshelar@nicira.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When extracting an individual message from a received "bundle" buffer,
we just create a clone of the base buffer, and adjust it to point into
the right position of the linearized data area of the latter. This works
well for regular message reception, but during periods of extremely high
load it may happen that an extracted buffer, e.g, a connection probe, is
reversed and forwarded through an external interface while the preceding
extracted message is still unhandled. When this happens, the header or
data area of the preceding message will be partially overwritten by a
MAC header, leading to unpredicatable consequences, such as a link
reset.
We now fix this by ensuring that the msg_reverse() function never
returns a cloned buffer, and that the returned buffer always contains
sufficient valid head and tail room to be forwarded.
Reported-by: Erik Hugne <erik.hugne@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Every open of /proc/net/kcm leaks 16 bytes of memory as is reported by
kmemleak:
unreferenced object 0xffff88059c0e3458 (size 192):
comm "cat", pid 1401, jiffies 4294935742 (age 310.720s)
hex dump (first 32 bytes):
28 45 71 96 05 88 ff ff 00 10 00 00 00 00 00 00 (Eq.............
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<ffffffff8156a2de>] kmem_cache_alloc_trace+0x16e/0x230
[<ffffffff8162a479>] seq_open+0x79/0x1d0
[<ffffffffa0578510>] kcm_seq_open+0x0/0x30 [kcm]
[<ffffffff8162a479>] seq_open+0x79/0x1d0
[<ffffffff8162a8cf>] __seq_open_private+0x2f/0xa0
[<ffffffff81712548>] seq_open_net+0x38/0xa0
...
It is caused by a missing free in the ->release path. So fix it by
providing seq_release_net as the ->release method.
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Fixes: cd6e111bf5 (kcm: Add statistics and proc interfaces)
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Tom Herbert <tom@herbertland.com>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
The rxrpc_transport struct is now redundant, given that the rxrpc_peer
struct is now per peer port rather than per peer host, so get rid of it.
Service connection lists are transferred to the rxrpc_peer struct, as is
the conn_lock. Previous patches moved the client connection handling out
of the rxrpc_transport struct and discarded the connection bundling code.
Signed-off-by: David Howells <dhowells@redhat.com>
Kill off the concept of maintaining a bundle of connections to a particular
target service to increase the number of call slots available for any
beyond four for that service (there are four call slots per connection).
This will make cleaning up the connection handling code easier and
facilitate removal of the rxrpc_transport struct. Bundling can be
reintroduced later if necessary.
Signed-off-by: David Howells <dhowells@redhat.com>
Provide refcount helper functions for connections so that the code doesn't
touch local or connection usage counts directly.
Also make it such that local and peer put functions can take a NULL
pointer.
Signed-off-by: David Howells <dhowells@redhat.com>
Make rxrpc_send_packet() take a connection not a transport as part of the
phasing out of the rxrpc_transport struct.
Whilst we're at it, rename the function to rxrpc_send_data_packet() to
differentiate it from the other packet sending functions.
Signed-off-by: David Howells <dhowells@redhat.com>
Allocated rxrpc calls displayed in /proc/net/rxrpc_calls may in future be
on the proc list before they're connected or after they've been
disconnected - in which case they may not have a pointer to a connection
struct that can be used to get data from there.
Deal with this by using stuff from the call struct in preference where
possible and printing "no_connection" rather than a peer address if no
connection is assigned.
This change also has the added bonus that the service ID is now taken from
the call rather the connection which will allow per-call service upgrades
to be shown - something required for AuriStor server compatibility.
Signed-off-by: David Howells <dhowells@redhat.com>
Validate the net address given to rxrpc_kernel_begin_call() before using
it.
Whilst this should be mostly unnecessary for in-kernel users, it does clear
the tail of the address struct in case we want to hash or compare the whole
thing.
Signed-off-by: David Howells <dhowells@redhat.com>
Use the IDR facility to allocate client connection IDs on a machine-wide
basis so that each client connection has a unique identifier. When the
connection ID space wraps, we advance the epoch by 1, thereby effectively
having a 62-bit ID space. The IDR facility is then used to look up client
connections during incoming packet routing instead of using an rbtree
rooted on the transport.
This change allows for the removal of the transport in the future and also
means that client connections can be looked up directly in the data-ready
handler by connection ID.
The ID management code is placed in a new file, conn-client.c, to which all
the client connection-specific code will eventually move.
Note that the IDR tree gets very expensive on memory if the connection IDs
are widely scattered throughout the number space, so we shall need to
retire connections that have, say, an ID more than four times the maximum
number of client conns away from the current allocation point to try and
keep the IDs concentrated. We will also need to retire connections from an
old epoch.
Also note that, for the moment, a pointer to the transport has to be passed
through into the ID allocation function so that we can take a BH lock to
prevent a locking issue against in-BH lookup of client connections. This
will go away later when RCU is used for server connections also.
Signed-off-by: David Howells <dhowells@redhat.com>
rxrpc_connection_lock shouldn't be accessed as a BH-excluding lock. It's
only accessed in a few places and none of those are in BH-context.
rxrpc_transport::conn_lock, however, *is* a BH-excluding lock and should be
accessed so consistently.
Signed-off-by: David Howells <dhowells@redhat.com>
Pass a pointer to struct sk_buff rather than struct rxrpc_host_header to
functions so that they can in the future get at transport protocol parameters
rather than just RxRPC parameters.
Signed-off-by: David Howells <dhowells@redhat.com>
"Exclusive connections" are meant to be used for a single client call and
then scrapped. The idea is to limit the use of the negotiated security
context. The current code, however, isn't doing this: it is instead
restricting the socket to a single virtual connection and doing all the
calls over that.
This is changed such that the socket no longer maintains a special virtual
connection over which it will do all the calls, but rather gets a new one
each time a new exclusive call is made.
Further, using a socket option for this is a poor choice. It should be
done on sendmsg with a control message marker instead so that calls can be
marked exclusive individually. To that end, add RXRPC_EXCLUSIVE_CALL
which, if passed to sendmsg() as a control message element, will cause the
call to be done on an single-use connection.
The socket option (RXRPC_EXCLUSIVE_CONNECTION) still exists and, if set,
will override any lack of RXRPC_EXCLUSIVE_CALL being specified so that
programs using the setsockopt() will appear to work the same.
Signed-off-by: David Howells <dhowells@redhat.com>
Replace accesses of conn->trans->{local,peer} with
conn->params.{local,peer} thus making it easier for a future commit to
remove the rxrpc_transport struct.
This also reduces the number of memory accesses involved.
Signed-off-by: David Howells <dhowells@redhat.com>
Define and use a structure to hold connection parameters. This makes it
easier to pass multiple connection parameters around.
Define and use a structure to hold protocol information used to hash a
connection for lookup on incoming packet. Most of these fields will be
disposed of eventually, including the duplicate local pointer.
Whilst we're at it rename "proto" to "family" when referring to a protocol
family.
Signed-off-by: David Howells <dhowells@redhat.com>
Hashing the peer key was introduced for AF_INET, but gcc
warns about the rxrpc_peer_hash_key function returning uninitialized
data for any other value of srx->transport.family:
net/rxrpc/peer_object.c: In function 'rxrpc_peer_hash_key':
net/rxrpc/peer_object.c:57:15: error: 'p' may be used uninitialized in this function [-Werror=maybe-uninitialized]
Assuming that nothing else can be set here, this changes the
function to just return zero in case of an unknown address
family.
Fixes: be6e6707f6 ("rxrpc: Rework peer object handling to use hash table and RCU")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David Howells <dhowells@redhat.com>
rxrpc_lookup_peer_rcu() and rxrpc_lookup_peer() return NULL on error, never
error pointers, so IS_ERR() can't be used.
Fix three callers of those functions.
Fixes: be6e6707f6 ('rxrpc: Rework peer object handling to use hash table and RCU')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Howells <dhowells@redhat.com>
When receiving an ICMPv4 message containing extensions as
defined in RFC 4884, and translating it to ICMPv6 at SIT
or GRE tunnel, we need some extra manipulation in order
to properly forward the extensions.
This patch only takes care of Time Exceeded messages as they
are the ones that typically carry information from various
routers in a fabric during a traceroute session.
It also avoids complex skb logic if the data_len is not
a multiple of 8.
RFC states :
The "original datagram" field MUST contain at least 128 octets.
If the original datagram did not contain 128 octets, the
"original datagram" field MUST be zero padded to 128 octets.
In practice routers use 128 bytes of original datagram, not more.
Initial translation was added in commit ca15a078bd
("sit: generate icmpv6 error when receiving icmpv4 error")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Oussama Ghorbel <ghorbel@pivasoftware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipgre_err() can call ip6_err_gen_icmpv6_unreach() for proper
support of ipv4+gre+icmp+ipv6+... frames, used for example
by traceroute/mtr.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For better traceroute/mtr support for SIT and GRE tunnels,
we translate IPV4 ICMP ICMP_TIME_EXCEEDED to ICMPV6_TIME_EXCEED
We also have to translate the IPv4 source IP address of ICMP
message to IPv6 v4mapped.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We want to use this helper from GRE as well, so this is
the time to move it in net/ipv6/icmp.c
Also add a @nhs parameter, since SIT and GRE have different
values for the header(s) to skip.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SIT or GRE tunnels might want to translate an IPV4 address
into a v4mapped one when translating ICMP to ICMPv6.
This patch adds the parameter to icmp6_send() but
does not change icmpv6_send() signature.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix coding style issues in the following files:
ib_cm.c: add space
loop.c: convert spaces to tabs
sysctl.c: add space
tcp.h: convert spaces to tabs
tcp_connect.c:remove extra indentation in switch statement
tcp_recv.c: convert spaces to tabs
tcp_send.c: convert spaces to tabs
transport.c: move brace up one line on for statement
Signed-off-by: Joshua Houghton <josh@awful.name>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A socket connection made in ax.25 is not closed when session is
completed. The heartbeat timer is stopped prematurely and this is
where the socket gets closed. Allow heatbeat timer to run to close
socket. Symptom occurs in kernels >= 4.2.0
Originally sent 6/15/2016. Resend with distribution list matching
scripts/maintainer.pl output.
Signed-off-by: Basil Gunn <basil@pacabunga.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The state of the rds_connection after rds_tcp_reset_callbacks() would
be RDS_CONN_RESETTING and this is the value that should be passed
by rds_tcp_accept_one() to rds_connect_path_complete() to transition
the socket to RDS_CONN_UP.
Fixes: b5c21c0947c1 ("RDS: TCP: fix race windows in send-path quiescence
by rds_tcp_accept_one()")
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes the following sparse warnings:
net/rds/tcp.c:59:5: warning:
symbol 'rds_tcp_min_sndbuf' was not declared. Should it be static?
net/rds/tcp.c:60:5: warning:
symbol 'rds_tcp_min_rcvbuf' was not declared. Should it be static?
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQEcBAABCgAGBQJXZAIqAAoJED07qiWsqSVqW0YH/RMgC6CDlJUtHr7+B8YLpi0e
BZOzAHH7mQdP+Z2kIXZp8Dnziq5G8heWNjRwTMPLHvOel+oms+0ZK6VY/kJYArdb
ViGGgi/gQ334JGqJYi2utkyIIIRH7ZxwcblF1aaaFVfFy7tZMVuppIWVzR/V0Gje
/5FftT1f04/6iumEq4es+Jb0OC9azoebSs1DUZTIvYOz3XrnCbB1FdmDN+a3xZKC
Qyav6QVnp/m2InzGSN+Kd/W++EP6YckdBp/++2hizsOvSIOfe8GRqkc0r7fLbfZ8
rucDJNi+GrLat4wNza4t3FKf3rBenBFzii14OEUTE0JgpY90fGUF3n/i0duaB/k=
=90Ns
-----END PGP SIGNATURE-----
Merge tag 'linux-can-next-for-4.8-20160617' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next
Marc Kleine-Budde says:
====================
pull-request: can-next 2016-06-17
this is a pull request of 14 patches for net-next/master.
Geert Uytterhoeven contributes a patch that adds a file patterns for
CAN device tree bindings to MAINTAINERS. A patch by Alexander Aring
fixes warnings when building without proc support. A patch by me
improves the sample point calculation. Marek Vasut's patch converts
the slcan driver to use CAN_MTU. A patch by William Breathitt Gray
converts the tscan1 driver to use module_isa_driver.
Two patches by Maximilian Schneider for the gs_usb driver fix coding
style and add support for set_phys_id callback. 5 patches by Oliver
Hartkopp add support for CANFD to the bcm. And finally two patches
by Ramesh Shanmugasundaram, which add support for the rcar_canfd
driver.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
We sometimes observe a 'deadly embrace' type deadlock occurring
between mutually connected sockets on the same node. This happens
when the one-hour peer supervision timers happen to expire
simultaneously in both sockets.
The scenario is as follows:
CPU 1: CPU 2:
-------- --------
tipc_sk_timeout(sk1) tipc_sk_timeout(sk2)
lock(sk1.slock) lock(sk2.slock)
msg_create(probe) msg_create(probe)
unlock(sk1.slock) unlock(sk2.slock)
tipc_node_xmit_skb() tipc_node_xmit_skb()
tipc_node_xmit() tipc_node_xmit()
tipc_sk_rcv(sk2) tipc_sk_rcv(sk1)
lock(sk2.slock) lock((sk1.slock)
filter_rcv() filter_rcv()
tipc_sk_proto_rcv() tipc_sk_proto_rcv()
msg_create(probe_rsp) msg_create(probe_rsp)
tipc_sk_respond() tipc_sk_respond()
tipc_node_xmit_skb() tipc_node_xmit_skb()
tipc_node_xmit() tipc_node_xmit()
tipc_sk_rcv(sk1) tipc_sk_rcv(sk2)
lock((sk1.slock) lock((sk2.slock)
===> DEADLOCK ===> DEADLOCK
Further analysis reveals that there are three different locations in the
socket code where tipc_sk_respond() is called within the context of the
socket lock, with ensuing risk of similar deadlocks.
We now solve this by passing a buffer queue along with all upcalls where
sk_lock.slock may potentially be held. Response or rejected message
buffers are accumulated into this queue instead of being sent out
directly, and only sent once we know we are safely outside the slock
context.
Reported-by: GUNA <gbalasun@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
"up_map" is a u64 type but we're not using the high 32 bits.
Fixes: 35c55c9877 ('tipc: add neighbor monitoring framework')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv6 version of 3f2fb9a834 ("net: l3mdev: address selection should only
consider devices in L3 domain") and the follow up commit, a17b693cdd876
("net: l3mdev: prefer VRF master for source address selection").
That is, if outbound device is given then the address preference order
is an address from that device, an address from the master device if it
is enslaved, and then an address from a device in the same L3 domain.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv6 source address selection needs to consider the real egress route.
Similar to IPv4 implement a get_saddr6 method which is called if
source address has not been set. The get_saddr6 method does a full
lookup which means pulling a route from the VRF FIB table and properly
considering linklocal/multicast destination addresses. Lookup failures
(eg., unreachable) then cause the source address selection to fail
which gets propagated back to the caller.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
VRF driver needs access to ip6_route_get_saddr code. Since it does
little beyond ipv6_dev_get_saddr and ipv6_dev_get_saddr is already
exported for modules move ip6_route_get_saddr to the header as an
inline.
Code move only; no functional change.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that we have all the drivers using udp_tunnel_get_rx_ports,
ndo_add_udp_enc_rx_port, and ndo_del_udp_enc_rx_port we can drop the
function calls that were specific to VXLAN and GENEVE.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch merges the notifiers for VXLAN and GENEVE into a single UDP
tunnel notifier. The idea is that we will want to only have to make one
notifier call to receive the list of ports for VXLAN and GENEVE tunnels
that need to be offloaded.
In addition we add a new set of ndo functions named ndo_udp_tunnel_add and
ndo_udp_tunnel_del that are meant to allow us to track the tunnel meta-data
such as port and address family as tunnels are added and removed. The
tunnel meta-data is now transported in a structure named udp_tunnel_info
which for now carries the type, address family, and port number. In the
future this could be updated so that we can include a tuple of values
including things such as the destination IP address and other fields.
I also ended up going with a naming scheme that consisted of using the
prefix udp_tunnel on function names. I applied this to the notifier and
ndo ops as well so that it hopefully points to the fact that these are
primarily used in the udp_tunnel functions.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch merges the GENEVE and VXLAN code so that both functions pass
through a shared code path. This way we can start the effort of using a
single function on the network device drivers to handle both of these
tunnel types.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for your net tree,
they are rather small patches but fixing several outstanding bugs in
nf_conntrack and nf_tables, as well as minor problems with missing
SYNPROXY header uapi installation:
1) Oneliner not to leak conntrack kmemcache on module removal, this
problem was introduced in the previous merge window, patch from
Florian Westphal.
2) Two fixes for insufficient ruleset loop validation, one due to
incorrect flag check in nf_tables_bind_set() and another related to
silly wrong generation mask logic from the walk path, from Liping
Zhang.
3) Fix double-free of anonymous sets on error, this fix simplifies the
code to let the abort path take care of releasing the set object,
also from Liping Zhang.
4) The introduction of helper function for transactions broke the skip
inactive rules logic from the nft_do_chain(), again from Liping
Zhang.
5) Two patches to install uapi xt_SYNPROXY.h header and calm down
kbuild robot due to missing #include <linux/types.h>.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The programming API of the CAN_BCM depends on struct can_frame which is
given as array directly behind the bcm_msg_head structure. To follow this
schema for the CAN FD frames a new flag 'CAN_FD_FRAME' in the bcm_msg_head
flags indicates that the concatenated CAN frame structures behind the
bcm_msg_head are defined as struct canfd_frame.
This patch adds the support to handle CAN and CAN FD frames on a per BCM-op
base. Main changes:
- generally use struct canfd_frames instead if struct can_frames
- use canfd_frame.flags instead of can_frame.can_dlc for private BCM flags
- make all CAN frame sizes depending on the new CAN_FD_FRAME flags
- separate between CAN and CAN FD when sending/receiving frames
Due to the dependence of the CAN_FD_FRAME flag the former binary interface
for classic CAN frames remains stable.
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
can_frame is the name of the struct can_frame which is not meant in
the corrected comments.
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
When building can subsystem with CONFIG_PROC_FS=n I detected some unused
variables warning by using proc functions. In CAN the proc handling is
nicely placed in one object file. This patch adds simple add a
dependency on CONFIG_PROC_FS for CAN's proc.o file and corresponding
static inline no-op functions.
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
[mkl: provide static inline noops instead of using #ifdefs]
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
When we check for RTM_DELTFILTER, we should also reject the request
for deleting all filters under a given parent when TCA_KIND attribute
is present. If present, it's currently just ignored but there's also
no point to let it pass in the first place either since this doesn't
have any meaning with wild-card removal.
Fixes: ea7f8277f9 ("net, cls: allow for deleting all filters for given parent")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Modern C standards expect the '__inline__' keyword to come before the return
type in a declaration, and we get a couple of warnings for this with "make W=1"
in the xfrm{4,6}_policy.c files:
net/ipv6/xfrm6_policy.c:369:1: error: 'inline' is not at beginning of declaration [-Werror=old-style-declaration]
static int inline xfrm6_net_sysctl_init(struct net *net)
net/ipv6/xfrm6_policy.c:374:1: error: 'inline' is not at beginning of declaration [-Werror=old-style-declaration]
static void inline xfrm6_net_sysctl_exit(struct net *net)
net/ipv4/xfrm4_policy.c:339:1: error: 'inline' is not at beginning of declaration [-Werror=old-style-declaration]
static int inline xfrm4_net_sysctl_init(struct net *net)
net/ipv4/xfrm4_policy.c:344:1: error: 'inline' is not at beginning of declaration [-Werror=old-style-declaration]
static void inline xfrm4_net_sysctl_exit(struct net *net)
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
to the big nfs4_lock_state removal around 3.17 (but that were also
probably hard to reproduce before client changes in 3.20 allowed the
client to perform parallel opens).
Also fix a 4.1 backchannel crash due to rpc multipath changes in 4.6.
Trond acked the client-side rpc fixes going through my tree.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXYsiKAAoJECebzXlCjuG+PK8P/jGBS+h7Zf4GWOOsWN5cbEs/
8VTn83KXFp/feFhGikUIcAZQJRynDK+tD9Vh2FznC2zKDTLFPS0mAvL3tZyQhUO2
nEWaCUOFR+sB3aTPlMGwxbGc7NHNQg1hKqKgqcLEEqtozxhFQye3WW0MZNfFCiUZ
qpq2tK1OGGhJVIp7wWSa8+B2nGFMuasPaGM2OVJrebip49yTG/tT3rwKxKMoB8kS
i8BwNejoP1KRD6LqvpgdV1ESzkdyokDxKXCrdY/j2lMdp2YRe+cWmX239ojjvm8G
n9Ow8DYCefuiKiF6iCLZfxpX8dcmVJvT6g+k+9V63A4YCyuGhy/CneA3MO4QyLhq
yfe2zviJ2kZVz+1Ih3v9kD7ZkyK1hjrxXx/VPrI5CBIXE5eVXin2ZDvTCSoV491h
g1zscPc9Thgk6gKXsvkaVOXxLHBoUzXeSRbNqVXXZfjl+s4TXLNJ0lcaBYkzh74/
SypiFeNHjsjNpJYz5GptlbMUpaEoeyH0Y+OiH8d5Jf8hCcQ+CLjKgKSuCH5zrypt
Lx3U5QWHTT3IXH4QS/njcTSfSDu7BUip4RTLzw6C/ZJf7hd6SS4Xv72J6ZmeDSmg
146MpAYty8HB04KQWpYx0DGI7UEPlubfRHSF9XzsSitbRtNGr6xvIug8fkKBlXDB
aHtr+/gI7UvrmnXnlGdD
=aNkl
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.7-1' of git://linux-nfs.org/~bfields/linux
Pull nfsd bugfixes from Bruce Fields:
"Oleg Drokin found and fixed races in the nfsd4 state code that go back
to the big nfs4_lock_state removal around 3.17 (but that were also
probably hard to reproduce before client changes in 3.20 allowed the
client to perform parallel opens).
Also fix a 4.1 backchannel crash due to rpc multipath changes in 4.6.
Trond acked the client-side rpc fixes going through my tree"
* tag 'nfsd-4.7-1' of git://linux-nfs.org/~bfields/linux:
nfsd: Make init_open_stateid() a bit more whole
nfsd: Extend the mutex holding region around in nfsd4_process_open2()
nfsd: Always lock state exclusively.
rpc: share one xps between all backchannels
nfsd4/rpc: move backchannel create logic into rpc code
SUNRPC: fix xprt leak on xps allocation failure
nfsd: Fix NFSD_MDS_PR_KEY on 32-bit by adding ULL postfix
Pull overlayfs fixes from Miklos Szeredi:
"This contains two regression fixes: one for the xattr API update and
one for using the mounter's creds in file creation in overlayfs.
There's also a fix for a bug in handling hard linked AF_UNIX sockets
that's been there from day one. This fix is overlayfs only despite
the fact that it touches code outside the overlay filesystem: d_real()
is an identity function for all except overlay dentries"
* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: fix uid/gid when creating over whiteout
ovl: xattr filter fix
af_unix: fix hard linked sockets on overlay
vfs: add d_real_inode() helper
The space is missing after ',', and this will introduce much more
noise when checking patch around.
Signed-off-by: Wei Tang <tangwei@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes the checkpatch.pl error to dev.c:
ERROR: do not initialise statics to 0
Signed-off-by: Wei Tang <tangwei@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This appears to be necessary and sufficient to provide
MPLS in GRE (RFC4023) support.
This can be used by establishing an ipgre tunnel device
and then routing MPLS over it.
The following example will forward MPLS frames received with an outermost
MPLS label 100 over tun1, a GRE tunnel. The forwarded packet will have the
outermost MPLS LSE removed and two new LSEs added with labels 200
(outermost) and 300 (next).
ip link add name tun1 type gre remote 10.0.99.193 local 10.0.99.192 ttl 225
ip link set up dev tun1
ip addr add 10.0.98.192/24 dev tun1
ip route sh
echo 1 > /proc/sys/net/mpls/conf/eth0/input
echo 101 > /proc/sys/net/mpls/platform_labels
ip -f mpls route add 100 as 200/300 via inet 10.0.98.193
ip -f mpls route sh
Also remove unnecessary braces.
Reviewed-by: Dinan Gunawardena <dinan.gunawardena@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since 32b8a8e59c ("sit: add IPv4 over IPv4 support")
ipip6_err() may be called for packets whose IP protocol is
IPPROTO_IPIP as well as those whose IP protocol is IPPROTO_IPV6.
In the case of IPPROTO_IPIP packets the correct protocol value is not
passed to ipv4_update_pmtu() or ipv4_redirect().
This patch resolves this problem by using the IP protocol of the packet
rather than a hard-coded value. This appears to be consistent
with the usage of the protocol of a packet by icmp_socket_deliver()
the caller of ipip6_err().
I was able to exercise the redirect case by using a setup where an ICMP
redirect was received for the destination of the encapsulated packet.
However, it appears that although incorrect the protocol field is not used
in this case and thus no problem manifests. On inspection it does not
appear that a problem will manifest in the fragmentation needed/update pmtu
case either.
In short I believe this is a cosmetic fix. None the less, the use of
IPPROTO_IPV6 seems wrong and confusing.
Reviewed-by: Dinan Gunawardena <dinan.gunawardena@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit d46e416c11 ("sctp: sctp should change socket state when
shutdown is received") may set sk_state CLOSING in sctp_sock_migrate,
but inet_accept doesn't allow the sk_state other than ESTABLISHED/
CLOSED for sctp. So we will change sk_state to CLOSED, instead of
CLOSING, as actually sk is closed already there.
Fixes: d46e416c11 ("sctp: sctp should change socket state when shutdown is received")
Reported-by: Ye Xiaolong <xiaolong.ye@intel.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ctx structure passed into bpf programs is different depending on bpf
program type. The verifier incorrectly marked ctx->data and ctx->data_end
access based on ctx offset only. That caused loads in tracing programs
int bpf_prog(struct pt_regs *ctx) { .. ctx->ax .. }
to be incorrectly marked as PTR_TO_PACKET which later caused verifier
to reject the program that was actually valid in tracing context.
Fix this by doing program type specific matching of ctx offsets.
Fixes: 969bf05eb3 ("bpf: direct packet access")
Reported-by: Sasha Goldshtein <goldshtn@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIVAwUAV2F33fSw1s6N8H32AQLahQ//XgRCyVXPhaOhkiJoHW/xOlWwSDKJg5y2
c8HEThJFO0rajne83b1mX9ijryVvGw7R/2nVExTrOTvDwf/nZm/v5LBz85a/EYPE
r47yTc8OienUFJBhAFzm346mkpACHUFJ+ihX5tdUpi5SqZ4IEMplfi4NezuapBFl
sLl0aucMQKwGb5Q/nWBu+ku6rg5qPBUkhgQ33gAlcqHvcj8oRc232Wl0LGIp6oIc
1ylUj/hHY64YseFHJE4CFHojYxyg7Gyp/8YNcffIdbmzu1B6L6myll38FlMmkPsc
uAgS5MzWMRJ0qUeYLJJuTH0lRZxPV6KWaBPCHL9RwTOmbYT7fgBPy41l4kORBI84
zRBU86oQLV5oNHTubb/CnpY2xH/pCc73m2d2c6HDsDHB80UL4GQhOpVKc9f/eQbA
/QCcl98wSoyQ8No7J9Px+WH7BKijytadj7M2RILQfjSuEKWmlAah3zs/398KdZBj
tMqq5prEevluIq/DINnNaRKa+726v/VQmhuBazIs9FYhcYaAG6oZzqbZh4M8NzAa
/vlXbkSkPDgo/PF7TGIRTYsNftMxNN56tG5Lv8ti0cU7gbPPn3IC61meKrWE3BbR
S1a2rLigS3+iULYDsKDbSg8R2MlBxQrMkzxDXF7mGSAu3qPeArFoAD7ySxVcjScX
lmDzMCaDdnk=
=HOTh
-----END PGP SIGNATURE-----
Merge tag 'rxrpc-rewrite-20160615' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc: Rework endpoint record handling
Here's the next part of the AF_RXRPC rewrite. In this set I rework
endpoint record handling. There are two types of endpoint record, local
and peer. The local endpoint record is used as an anchor for the transport
socket that AF_RXRPC uses (at the moment a UDP socket). Local endpoints
can be shared between AF_RXRPC sockets under certain restricted
circumstances.
The peer endpoint is a record of the remote end. It is (or will be) used
to keep track MTU and RTT values and, with these changes, is used to find
the call(s) to abort when a network error occurs.
The following significant changes are made:
(1) The local endpoint event handling code is split out into its own file.
(2) The local endpoint list bottom half-excluding spinlock is removed as
things are arranged such that sk_user_data will not change whilst the
transport socket callbacks are in progress.
(3) Local endpoints can now only be shared if they have the same transport
address (as before) and have a local service ID of 0 (ie. they're not
listening for incoming calls). This prevents callbacks from a server
to one process being picked up by another process.
(4) Local endpoint destruction is now accomplished by the same work item
as processes events, meaning that the destructor doesn't need to wait
for the event processor.
(5) Peer endpoints are now held in a hash table rather than a flat list.
(6) Peer endpoints are now destroyed by RCU rather than by work item.
(7) Peer endpoints are now differentiated by local endpoint and remote
transport port in addition to remote transport address and transport
type and family.
This means that a firewall that excludes access between a particular
local port and remote port won't cause calls to be aborted that use a
different port pair.
(8) Error report handling now no longer assumes that the source is always
an IPv4 ICMP message from a UDP port and has assumptions that an ICMP
message comes from an IPv4 socket removed. At some point IPv6 support
will be added.
(9) Peer endpoints rather than local endpoints are now the anchor point
for distributing network error reports.
(10) Both types of endpoint records are now disposed of as soon as all
references to them are gone. There is less hanging around and once
their usage counts hit zero, records can no longer be resurrected.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
1) gre_parse_header() can be called from gre_err()
At this point transport header points to ICMP header, not the inner
header.
2) We can not really change transport header as ipgre_err() will later
assume transport header still points to ICMP header (using icmp_hdr())
3) pskb_may_pull() logic in gre_parse_header() really works
if we are interested at zone pointed by skb->data
4) As Jiri explained in commit b7f8fe251e ("gre: do not pull header in
ICMP error processing") we should not pull headers in error handler.
So this fix :
A) changes gre_parse_header() to use skb->data instead of
skb_transport_header()
B) Adds a nhs parameter to gre_parse_header() so that we can skip the
not pulled IP header from error path.
This offset is 0 for normal receive path.
C) remove obsolete IPV6 includes
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <tom@herbertland.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/tipc/link.c: In function ‘tipc_link_timeout’:
net/tipc/link.c:744:28: warning: ‘mtyp’ may be used uninitialized in this function [-Wuninitialized]
Fixes: 42b18f605f ("tipc: refactor function tipc_link_timeout()")
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The algorithm for checksum neutral mapping is incorrect. This problem
was being hidden since we were previously always performing checksum
offload on the translated addresses and only with IPv6 HW csum.
Enabling an ILA router shows the issue.
Corrected algorithm:
old_loc is the original locator in the packet, new_loc is the value
to overwrite with and is found in the lookup table. old_flag is
the old flag value (zero of CSUM_NEUTRAL_FLAG) and new_flag is
then (old_flag ^ CSUM_NEUTRAL_FLAG) & CSUM_NEUTRAL_FLAG.
Need SUM(new_id + new_flag + diff) == SUM(old_id + old_flag) for
checksum neutral translation.
Solving for diff gives:
diff = (old_id - new_id) + (old_flag - new_flag)
compute_csum_diff8(new_id, old_id) gives old_id - new_id
If old_flag is set
old_flag - new_flag = old_flag = CSUM_NEUTRAL_FLAG
Else
old_flag - new_flag = -new_flag = ~CSUM_NEUTRAL_FLAG
Tested:
- Implemented a user space program that creates random addresses
and random locators to overwrite. Compares the checksum over
the address before and after translation (must always be equal)
- Enabled ILA router and showed proper operation.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the presence of firewalls which improperly block ICMP Unreachable
(including Fragmentation Required) messages, Path MTU Discovery is
prevented from working.
A workaround is to handle IPv4 payloads opaquely, ignoring the DF bit--as
is done for other payloads like AppleTalk--and doing transparent
fragmentation and reassembly.
Redux includes the enforcement of mutual exclusion between this feature
and Path MTU Discovery as suggested by Alexander Duyck.
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Philip Prindeville <philipp@redfish-solutions.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds necessary handling for use the short address for
802.15.4 6lowpan. It contains support for IPHC address compression
and new matching algorithmn to decide which link layer address will be
used for 802.15.4 frame.
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case of sending RA messages we need some way to get the short address
from an 802.15.4 6LoWPAN interface. This patch will add a temporary
debugfs entry for experimental userspace api.
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduce different 6lowpan handling for receive and transmit
NS/NA messages for the ipv6 neighbour discovery. The first use-case is
for supporting 802.15.4 short addresses inside the option fields and
handling for RFC6775 6CO option field as userspace option.
Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch exports some neighbour discovery functions which can be used
by 6lowpan neighbour discovery ops functionality then.
Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces neighbour discovery ops callback structure. The
idea is to separate the handling for 6LoWPAN into the 6lowpan module.
These callback offers 6lowpan different handling, such as 802.15.4 short
address handling or RFC6775 (Neighbor Discovery Optimization for IPv6
over 6LoWPANs).
Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch moves the functionality to add a RA PIO prefix generated
address in an own function. This move prepares to add a hook for
adding a second address for a second link-layer address. E.g. short
address for 802.15.4 6LoWPAN.
Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds __ndisc_fill_addr_option as low-level function for
ndisc_fill_addr_option which doesn't depend on net_device parameter.
Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since we use exported function from ipv6 kernel module we don't need to
request the module anymore to have ipv6 functionality.
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the autoconfiguration if a valid 802.15.4 short address
is available for 802.15.4 6LoWPAN interfaces.
Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch will introduce a 6lowpan neighbour private data. Like the
interface private data we handle private data for generic 6lowpan and
for link-layer specific 6lowpan.
The current first use case if to save the short address for a 802.15.4
6lowpan neighbour.
Cc: David S. Miller <davem@davemloft.net>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
saw a debug splat:
net/include/net/sch_generic.h:287 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 1, debug_locks = 0
2 locks held by kworker/2:1/710:
#0: ("events"){.+.+.+}, at: [<ffffffff8106ca1d>]
#1: ((&q->work)){+.+...}, at: [<ffffffff8106ca1d>] process_one_work+0x14d/0x690
Workqueue: events htb_work_func
Call Trace:
[<ffffffff812dc763>] dump_stack+0x85/0xc2
[<ffffffff8109fee7>] lockdep_rcu_suspicious+0xe7/0x120
[<ffffffff814ced47>] htb_work_func+0x67/0x70
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sfq_reset() can use rtnl_kfree_skbs() instead of kfree_skb()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
pie_change() can use rtnl_qdisc_drop() to benefit from
deferred freeing.
pie_reset() is already using qdisc_reset_queue()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rtnl_kfree_skbs() can be used in tfifo_reset()
It would be nice if we could iterate through rb tree instead
of removing one skb at a time, and build a single skb chain.
But this is left for a future patch.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Both htb_reset() and htb_destroy() can use __qdisc_reset_queue()
instead of __skb_queue_purge() to defer skb freeing of internal
queues.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Both hhf_reset() and hhf_change() can use rtnl_kfree_skbs()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Both fq_codel_change() and fq_codel_reset() can use rtnl_kfree_skbs()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Both fq_change() and fq_reset() can use rtnl_kfree_skbs()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
codel_change() can use rtnl_qdisc_drop()
to defer expensive skb freeing after locks are released.
codel_reset() already has support for deferred skb freeing
because it uses qdisc_reset_queue()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
choke_reset() and choke_change() can use rtnl_qdisc_drop()
to defer expensive skb freeing after locks are released.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
qdisc are changed under RTNL protection and often
while blocking BH and root qdisc spinlock.
When lots of skbs need to be dropped, we free
them under these locks causing TX/RX freezes,
and more generally latency spikes.
This commit adds rtnl_kfree_skbs(), used to queue
skbs for deferred freeing.
Actual freeing happens right after RTNL is released,
with appropriate scheduling points.
rtnl_qdisc_drop() can also be used in place
of disc_drop() when RTNL is held.
qdisc_reset_queue() and __qdisc_reset_queue() get
the new behavior, so standard qdiscs like pfifo, pfifo_fast...
have their ->reset() method automatically handled.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TIPC based clusters are by default set up with full-mesh link
connectivity between all nodes. Those links are expected to provide
a short failure detection time, by default set to 1500 ms. Because
of this, the background load for neighbor monitoring in an N-node
cluster increases with a factor N on each node, while the overall
monitoring traffic through the network infrastructure increases at
a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
scale well beyond ~100 nodes unless we significantly increase failure
discovery tolerance.
This commit introduces a framework and an algorithm that drastically
reduces this background load, while basically maintaining the original
failure detection times across the whole cluster. Using this algorithm,
background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
now have to actively monitor 38 neighbors in a 400-node cluster, instead
of as before 399.
This "Overlapping Ring Supervision Algorithm" is completely distributed
and employs no centralized or coordinated state. It goes as follows:
- Each node makes up a linearly ascending, circular list of all its N
known neighbors, based on their TIPC node identity. This algorithm
must be the same on all nodes.
- The node then selects the next M = sqrt(N) - 1 nodes downstream from
itself in the list, and chooses to actively monitor those. This is
called its "local monitoring domain".
- It creates a domain record describing the monitoring domain, and
piggy-backs this in the data area of all neighbor monitoring messages
(LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
the cluster eventually (default within 400 ms) will learn about
its monitoring domain.
- Whenever a node discovers a change in its local domain, e.g., a node
has been added or has gone down, it creates and sends out a new
version of its node record to inform all neighbors about the change.
- A node receiving a domain record from anybody outside its local domain
matches this against its own list (which may not look the same), and
chooses to not actively monitor those members of the received domain
record that are also present in its own list. Instead, it relies on
indications from the direct monitoring nodes if an indirectly
monitored node has gone up or down. If a node is indicated lost, the
receiving node temporarily activates its own direct monitoring towards
that node in order to confirm, or not, that it is actually gone.
- Since each node is actively monitoring sqrt(N) downstream neighbors,
each node is also actively monitored by the same number of upstream
neighbors. This means that all non-direct monitoring nodes normally
will receive sqrt(N) indications that a node is gone.
- A major drawback with ring monitoring is how it handles failures that
cause massive network partitionings. If both a lost node and all its
direct monitoring neighbors are inside the lost partition, the nodes in
the remaining partition will never receive indications about the loss.
To overcome this, each node also chooses to actively monitor some
nodes outside its local domain. Those nodes are called remote domain
"heads", and are selected in such a way that no node in the cluster
will be more than two direct monitoring hops away. Because of this,
each node, apart from monitoring the member of its local domain, will
also typically monitor sqrt(N) remote head nodes.
- As an optimization, local list status, domain status and domain
records are marked with a generation number. This saves senders from
unnecessarily conveying unaltered domain records, and receivers from
performing unneeded re-adaptations of their node monitoring list, such
as re-assigning domain heads.
- As a measure of caution we have added the possibility to disable the
new algorithm through configuration. We do this by keeping a threshold
value for the cluster size; a cluster that grows beyond this value
will switch from full-mesh to ring monitoring, and vice versa when
it shrinks below the value. This means that if the threshold is set to
a value larger than any anticipated cluster size (default size is 32)
the new algorithm is effectively disabled. A patch set for altering the
threshold value and for listing the table contents will follow shortly.
- This change is fully backwards compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This refers to commands to direct action access as follows:
sudo tc actions add action drop index 12
sudo tc actions add action pipe index 10
And then dumping them like so:
sudo tc actions ls action gact
iproute2 worked because it depended on absence of TCA_ACT_TAB TLV
as end of message.
This fix has been tested with iproute2 and is backward compatible.
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv6 multicast and link-local addresses require special handling by the
VRF driver:
1. Rather than using the VRF device index and full FIB lookups,
packets to/from these addresses should use direct FIB lookups based on
the VRF device table.
2. fail sends/receives on a VRF device to/from a multicast address
(e.g, make ping6 ff02::1%<vrf> fail)
3. move the setting of the flow oif to the first dst lookup and revert
the change in icmpv6_echo_reply made in ca254490c8 ("net: Add VRF
support to IPv6 stack"). Linklocal/mcast addresses require use of the
skb->dev.
With this change connections into and out of a VRF enslaved device work
for multicast and link-local addresses work (icmp, tcp, and udp)
e.g.,
1. packets into VM with VRF config:
ping6 -c3 fe80::e0:f9ff:fe1c:b974%br1
ping6 -c3 ff02::1%br1
ssh -6 fe80::e0:f9ff:fe1c:b974%br1
2. packets going out a VRF enslaved device:
ping6 -c3 fe80::18f8:83ff:fe4b:7a2e%eth1
ping6 -c3 ff02::1%eth1
ssh -6 root@fe80::18f8:83ff:fe4b:7a2e%eth1
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
L3 master devices are virtual devices similar to the loopback
device. Link local and multicast routes for these devices do
not make sense. The ipv6 addrconf code already skips adding a
linklocal address; do the same for the mcast route.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow drivers to pass flow arg to functions where the arg is not const
and allow the driver to make updates as needed (eg., setting oif).
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
And avoid calling tcf_hash_check() twice.
Fixes: a57f19d30b ("net sched: ipt action fix late binding")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now prio_init() can return -ENOMEM, it also has to make sure
any allocated qdiscs are freed, since the caller (qdisc_create()) wont
call ->destroy() handler for us.
More generally, we want a transactional behavior for "tc qdisc
change ...", so prio_tune() should not make modifications if
any error is returned.
It means that we must validate parameters and allocate missing qdisc(s)
before taking root qdisc lock exactly once, to not leave the prio qdisc
in an intermediate state.
Fixes: cbdf451164 ("net_sched: prio: properly report out of memory errors")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When an inbound message is bigger than a page, allocate a paged SKB,
and subsequently use IUCV receive primitive with IPBUFLST flag.
This relaxes the pressure to allocate big contiguous kernel buffers.
Signed-off-by: Eugene Crosser <Eugene.Crosser@ru.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before introducing paged skbs in the receive path, get rid of the
function `iucv_fragment_skb()` that replaces one large linear skb
with several smaller linear skbs.
Signed-off-by: Eugene Crosser <Eugene.Crosser@ru.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When an outbound message is bigger than a page, allocate and fill
a paged SKB, and subsequently use IUCV send primitive with IPBUFLST
flag. This relaxes the pressure to allocate big contiguous kernel
buffers.
Signed-off-by: Eugene Crosser <Eugene.Crosser@ru.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rework the local RxRPC endpoint management.
Local endpoint objects are maintained in a flat list as before. This
should be okay as there shouldn't be more than one per open AF_RXRPC socket
(there can be fewer as local endpoints can be shared if their local service
ID is 0 and they share the same local transport parameters).
Changes:
(1) Local endpoints may now only be shared if they have local service ID 0
(ie. they're not being used for listening).
This prevents a scenario where process A is listening of the Cache
Manager port and process B contacts a fileserver - which may then
attempt to send CM requests back to B. But if A and B are sharing a
local endpoint, A will get the CM requests meant for B.
(2) We use a mutex to handle lookups and don't provide RCU-only lookups
since we only expect to access the list when opening a socket or
destroying an endpoint.
The local endpoint object is pointed to by the transport socket's
sk_user_data for the life of the transport socket - allowing us to
refer to it directly from the sk_data_ready and sk_error_report
callbacks.
(3) atomic_inc_not_zero() now exists and can be used to only share a local
endpoint if the last reference hasn't yet gone.
(4) We can remove rxrpc_local_lock - a spinlock that had to be taken with
BH processing disabled given that we assume sk_user_data won't change
under us.
(5) The transport socket is shut down before we clear the sk_user_data
pointer so that we can be sure that the transport socket's callbacks
won't be invoked once the RCU destruction is scheduled.
(6) Local endpoints have a work item that handles both destruction and
event processing. The means that destruction doesn't then need to
wait for event processing. The event queues can then be cleared after
the transport socket is shut down.
(7) Local endpoints are no longer available for resurrection beyond the
life of the sockets that had them open. As soon as their last ref
goes, they are scheduled for destruction and may not have their usage
count moved from 0.
Signed-off-by: David Howells <dhowells@redhat.com>
Separate local endpoint event handling out into its own file preparatory to
overhauling the object management aspect (which remains in the original
file).
Signed-off-by: David Howells <dhowells@redhat.com>
The spec allows backchannels for multiple clients to share the same tcp
connection. When that happens, we need to use the same xprt for all of
them. Similarly, we need the same xps.
This fixes list corruption introduced by the multipath code.
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Acked-by: Trond Myklebust <trondmy@primarydata.com>
Also simplify the logic a bit.
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Acked-by: Trond Myklebust <trondmy@primarydata.com>
Callers of rpc_create_xprt expect it to put the xprt on success and
failure.
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Acked-by: Trond Myklebust <trondmy@primarydata.com>
nft_genmask_cur has already done left-shift operator on the gencursor,
so there's no need to do left-shift operator on it again.
Fixes: ea4bd995b0 ("netfilter: nf_tables: add transaction helper functions")
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When we add a nft rule like follows:
# nft add rule filter test tcp dport vmap {1: jump test}
-ELOOP error will be returned, and the anonymous set will be
destroyed.
But after that, nf_tables_abort will also try to remove the
element and destroy the set, which was already destroyed and
freed.
If we add a nft wrong rule, nft_tables_abort will do the cleanup
work rightly, so nf_tables_set_destroy call here is redundant and
wrong, remove it.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Liping Zhang says:
"Users may add such a wrong nft rules successfully, which will cause an
endless jump loop:
# nft add rule filter test tcp dport vmap {1: jump test}
This is because before we commit, the element in the current anonymous
set is inactive, so osp->walk will skip this element and miss the
validate check."
To resolve this problem, this patch passes the generation mask to the
walk function through the iter container structure depending on the code
path:
1) If we're dumping the elements, then we have to check if the element
is active in the current generation. Thus, we check for the current
bit in the genmask.
2) If we're checking for loops, then we have to check if the element is
active in the next generation, as we're in the middle of a
transaction. Thus, we check for the next bit in the genmask.
Based on original patch from Liping Zhang.
Reported-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Tested-by: Liping Zhang <liping.zhang@spreadtrum.com>
We should check "i" is used as a dictionary or not, "binding" is already
checked before.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
I forgot to move the kmem_cache_destroy into the exit path.
Fixes: 0c5366b3a8 ("netfilter: conntrack: use single slab cache)
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Use the peer record to distribute network errors rather than the transport
object (which I want to get rid of). An error from a particular peer
terminates all calls on that peer.
For future consideration:
(1) For ICMP-induced errors it might be worth trying to extract the RxRPC
header from the offending packet, if one is returned attached to the
ICMP packet, to better direct the error.
This may be overkill, though, since an ICMP packet would be expected
to be relating to the destination port, machine or network. RxRPC
ABORT and BUSY packets give notice at RxRPC level.
(2) To also abort connection-level communications (such as CHALLENGE
packets) where indicted by an error - but that requires some revamping
of the connection event handling first.
Signed-off-by: David Howells <dhowells@redhat.com>
Don't assume anything about the address in an ICMP packet in
rxrpc_error_report() as the address may not be IPv4 in future, especially
since we're just printing these details.
Signed-off-by: David Howells <dhowells@redhat.com>
Break MTU determination from ICMP out into its own function to reduce the
complexity of the error report handler.
Signed-off-by: David Howells <dhowells@redhat.com>
Rename rxrpc_UDP_error_report() to rxrpc_error_report() as it might get
called for something other than UDP.
Signed-off-by: David Howells <dhowells@redhat.com>
Rework peer object handling to use a hash table instead of a flat list and
to use RCU. Peer objects are no longer destroyed by passing them to a
workqueue to process, but rather are just passed to the RCU garbage
collector as kfree'able objects.
The hash function uses the local endpoint plus all the components of the
remote address, except for the RxRPC service ID. Peers thus represent a
UDP port on the remote machine as contacted by a UDP port on this machine.
The RCU read lock is used to handle non-creating lookups so that they can
be called from bottom half context in the sk_error_report handler without
having to lock the hash table against modification.
rxrpc_lookup_peer_rcu() *does* take a reference on the peer object as in
the future, this will be passed to a work item for error distribution in
the error_report path and this function will cease being used in the
data_ready path.
Creating lookups are done under spinlock rather than mutex as they might be
set up due to an external stimulus if the local endpoint is a server.
Captured network error messages (ICMP) are handled with respect to this
struct and MTU size and RTT are cached here.
Signed-off-by: David Howells <dhowells@redhat.com>
This function is just ->init(), rename it to make it obvious.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These should be gone when we removed CONFIG_NET_CLS_POLICE.
We can not totally remove them since they are exposed
to userspace.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor rds_conn_destroy() so that the per-path dismantling
is done in rds_conn_path_destroy, and then iterate as needed
over rds_conn_path_destroy().
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit changes rds_conn_shutdown to take a rds_conn_path *
argument, allowing it to shutdown paths other than c_path[0] for
MP-capable transports.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a for() loop in __rds_conn_create to initialize all the
conn_paths, in preparate for MP capable transports.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rds_conn_path_error() is the MP-aware analog of rds_conn_error,
to be used by multipath-capable callers.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit updates the callbacks related to the rds-info command
so that they walk through all the rds_conn_path structures and
report the requested info.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rds_conn_path_connect_if_down() works on the rds_conn_path
that it is passed. Callers who are not t_m_capable may continue
calling rds_conn_connect_if_down, which will invoke
rds_conn_path_connect_if_down() with the default c_path[0].
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit allows rds_send_pong() callers to send back
the rds pong message on some path other than c_path[0] by
passing in a struct rds_conn_path * argument. It also
removes the last dependency on the #defines in rds_single.h
from send.c
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Explicitly set up rds_conn_path, either from i_conn_path (for
MP capable transpots) or as c_path[0], and use this in
rds_send_drop_to()
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass a struct rds_conn_path to rds_send_xmit so that MP capable
transports can transmit packets on something other than c_path[0].
The eventual goal for MP capable transports is to hash the rds
socket to a path based on the bound local address/port, and use
this path as the argument to rds_send_xmit()
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass the rds_conn_path to rds_send_queue_rm, and use it to initialize
the i_conn_path field in struct rds_incoming. This commit also makes
rds_send_queue_rm() MP capable, because it now takes locks
specific to the rds_conn_path passed in, instead of defaulting to
the c_path[0] based defines from rds_single_path.h
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The only caller of rds_send_get_message() was
rds_iw_send_cq_comp_handler() which was removed as part of
commit dcdede0406 ("RDS: Drop stale iWARP RDMA transport"),
so remove rds_send_get_message() for the same reason.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rds_send_path_drop_acked() is the path-specific version of
rds_send_drop_acked() to be invoked by MP capable callers.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rds_send_path_reset() is the path specific version of rds_send_reset()
intended for MP capable callers.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
t_mp_capable transports can use rds_inc_path_init to initialize
all fields in struct rds_incoming, including the i_conn_path.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Transports that are t_mp_capable should set the rds_conn_path
on which the datagram was recived in the ->i_conn_path field
of struct rds_incoming.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The t_mp_capable bit will be used in the core rds module
to support multipathing logic when the transport supports it.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for multipath RDS, split the rds_connection
structure into a base structure, and a per-path struct rds_conn_path.
The base structure tracks information and locks common to all
paths. The workqs for send/recv/shutdown etc are tracked per
rds_conn_path. Thus the workq callbacks now work with rds_conn_path.
This commit allows for one rds_conn_path per rds_connection, and will
be extended into multiple conn_paths in subsequent commits.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make sure that dctcp_get_info() returns only the size of the
info->dctcp struct that it zeroes out and fills in. Previously it had
been returning the size of the enclosing tcp_cc_info union,
sizeof(*info). There is no problem yet, but that union that may one
day be larger than struct tcp_dctcp_info, in which case the
TCP_CC_INFO code might accidentally copy uninitialized bytes from the
stack.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix to return a negative error code from the error handling
case instead of 0, as done elsewhere in this function.
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Acked-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIVAwUAV16wsPSw1s6N8H32AQLGqA/+MC6/oufeV5WF/csMucx6I5wauPDQND0b
E1bOgzceGckwY3NUh14sYo8o/JVxzkKCWrWb0bNBSdqUup6vTMxCBgkSUxvaCbN4
RIcqu8LDB3vCrzMobMWijZDGqCpF8ZprUcvfCVj1tdrnEbK3njoYJjJ9UZN/lVPw
Tw+ojKpAty7r/JtZUkWEBhlZZtG26Ko7t+mDbgISOGulPNnw0HSsodMCDb1iSMRS
MZESOqQ5Kb8vhNOQ0zn+fNWB5d94yLZIxmt/siDXuetopHWuD1u8imqz+lhQ1BTU
7tOO1PKGxnsalw7FlgNL1cNmOJi/kEKISURG82K1u3pTk2Bf9vtiEPpLnC6p0Qy2
WwIQVJWewSCYsLiujqhXwmTGHC1m53979VtETHZoKhj/Jm8t3S38XN92Kt/P9y9g
z1r3EBs6+WzTt4zaw5rjVBY33reeDgEo2RYusLxMOgvPsmX5KMNvpov9R04+JthC
Qp1zWsE2FerxRJ+CNFBNl0ei4NVMsds3OHuEfgAiEedRowjGokDRQd6stVEev05S
S2XggUvt011Fvhud0UJWH9koeVcHA85KNCK8V1aj9xH3zsQDizMYhZMhaL34wnu+
z9XSv6DjZzJVWxC4gyQ/bsvdg34Uerv6g2BwRgqDCfh90Sa5U2n1CL8b3LwTKFU3
3R3s0l3R5gg=
=FXID
-----END PGP SIGNATURE-----
Merge tag 'rxrpc-rewrite-20160613' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc: Rename rxrpc source files
Here's the next part of the AF_RXRPC rewrite. In this set I rename some of
the files in the net/rxrpc/ directory and adjust the Makefile and
ar-internal.h to reflect the changes.
The aim is twofold:
(1) Remove the "ar-" prefix on those files that have it as it's not really
useful, especially now that I'm building rxkad in.
(2) To aid splitting the local, peer, connection and call handling code
into separate files for object and event handling in future patches by
making it easier to come up with new filenames.
There are two commits:
(1) The first commit does a bunch of renames of .c files and alters the
Makefile. ar-internal.h isn't renamed at this time to avoid having to
change the contents of the files being renamed.
(2) The second commit changes the section label comments in ar-internal.h
to reflect the changed filenames and reorders the file so that the
sections are back in filename order.
The patches can be found here also:
http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-rewrite
Tagged thusly:
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
rxrpc-rewrite-20160613
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When skip_sw is set and hardware fails to apply filter, return error to
user. This will make error propagation logic similar to the one
currently used in u32 classifier.
Also, changed code to use tc_skip_sw() utility function.
Signed-off-by: Amir Vadai <amirva@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function gretap_fb_dev_create() (only used by ovs) never calls
rtnl_configure_link(). The consequence is that dev->rtnl_link_state is
never set to RTNL_LINK_INITIALIZED.
During the deletion phase, the function rollback_registered_many() sends
a RTM_DELLINK only if dev->rtnl_link_state is set to RTNL_LINK_INITIALIZED.
Fixes: b2acd1dc39 ("openvswitch: Use regular GRE net_device instead of vport")
CC: Thomas Graf <tgraf@suug.ch>
CC: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After ipgre_newlink()/geneve_configure() call, the netdev is registered.
Fixes: 7e059158d5 ("vxlan, gre, geneve: Set a large MTU on ovs-created tunnel devices")
CC: David Wragg <david@weave.works>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a corner case in which udp packets belonging to a same
flow are hashed to different socket when hslot->count changes from 10
to 11:
1) When hslot->count <= 10, __udp_lib_lookup() searches udp_table->hash,
and always passes 'daddr' to udp_ehashfn().
2) When hslot->count > 10, __udp_lib_lookup() searches udp_table->hash2,
but may pass 'INADDR_ANY' to udp_ehashfn() if the sockets are bound to
INADDR_ANY instead of some specific addr.
That means when hslot->count changes from 10 to 11, the hash calculated by
udp_ehashfn() is also changed, and the udp packets belonging to a same
flow will be hashed to different socket.
This is easily reproduced:
1) Create 10 udp sockets and bind all of them to 0.0.0.0:40000.
2) From the same host send udp packets to 127.0.0.1:40000, record the
socket index which receives the packets.
3) Create 1 more udp socket and bind it to 0.0.0.0:44096. The number 44096
is 40000 + UDP_HASH_SIZE(4096), this makes the new socket put into the
same hslot as the aformentioned 10 sockets, and makes the hslot->count
change from 10 to 11.
4) From the same host send udp packets to 127.0.0.1:40000, and the socket
index which receives the packets will be different from the one received
in step 2.
This should not happen as the socket bound to 0.0.0.0:44096 should not
change the behavior of the sockets bound to 0.0.0.0:40000.
It's the same case for IPv6, and this patch also fixes that.
Signed-off-by: Su, Xuemin <suxm@chinanetcenter.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the qdisc is full, we drop a packet at the head of the queue,
queue the current skb and return NET_XMIT_CN
Now we track backlog on upper qdiscs, we need to call
qdisc_tree_reduce_backlog(), even if the qlen did not change.
Fixes: 2ccccf5fb4 ("net_sched: update hierarchical backlog too")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Tom Herbert <tom@herbertland.com>
Fixes: 4068579e1e ("net: Implmement RFC 6936 (zero RX csums for UDP/IPv6")
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Tom Herbert <tom@herbertland.com>
Fixes: 4068579e1e ("net: Implmement RFC 6936 (zero RX csums for UDP/IPv6")
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Florent Fourcot <florent.fourcot@enst-bretagne.fr>
Fixes: 1d13a96c74 ("ipv6: tcp: fix flowlabel value in ACK messages send from TIME_WAIT")
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv6 ping socket error handler doesn't correctly convert the new 32 bit
mtu to host endianness before using.
Cc: Lorenzo Colitti <lorenzo@google.com>
Fixes: 6d0bfe2261 ("net: ipv6: Add IPv6 support to the ping socket.")
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Even with spin_unlock_wait() fixed, nf_conntrack_lock{,_all}() is
borken as it misses a bunch of memory barriers to order the whole
global vs local locks scheme.
Even x86 (and other TSO archs) are affected.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[ Updated the comments. ]
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use the low latency transport workqueue to process the task that is
next in line on the xprt->sending queue.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
rpciod can easily get congested due to the long list of queued rpc_tasks.
Having the receive queue wait in turn for those tasks to complete can
therefore be a bottleneck.
Address the problem by separating the workqueues into:
- rpciod: manages rpc_tasks
- xprtiod: manages transport related work.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The only difference between the two at this point is the reset of
the connection timeout, and since everyone expect tcp ignore that value,
we can just throw it into the generic function.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Update the section comments in ar-internal.h that indicate the locations of
the referenced items to reflect the renames done to the .c files in
net/rxrpc/.
This also involves some rearrangement to reflect keep the sections in order
of filename.
Signed-off-by: David Howells <dhowells@redhat.com>
Rename files matching net/rxrpc/ar-*.c to get rid of the "ar-" prefix.
This will aid splitting those files by making easier to come up with new
names.
Note that the not all files are simply renamed from ar-X.c to X.c. The
following exceptions are made:
(*) ar-call.c -> call_object.c
ar-ack.c -> call_event.c
call_object.c is going to contain the core of the call object
handling. Call event handling is all going to be in call_event.c.
(*) ar-accept.c -> call_accept.c
Incoming call handling is going to be here.
(*) ar-connection.c -> conn_object.c
ar-connevent.c -> conn_event.c
The former file is going to have the basic connection object handling,
but there will likely be some differentiation between client
connections and service connections in additional files later. The
latter file will have all the connection-level event handling.
(*) ar-local.c -> local_object.c
This will have the local endpoint object handling code. The local
endpoint event handling code will later be split out into
local_event.c.
(*) ar-peer.c -> peer_object.c
This will have the peer endpoint object handling code. Peer event
handling code will be placed in peer_event.c (for the moment, there is
none).
(*) ar-error.c -> peer_event.c
This will become the peer event handling code, though for the moment
it's actually driven from the local endpoint's perspective.
Note that I haven't renamed ar-transport.c to transport_object.c as the
intention is to delete it when the rxrpc_transport struct is excised.
The only file that actually has its contents changed is net/rxrpc/Makefile.
net/rxrpc/ar-internal.h will need its section marker comments updating, but
I'll do that in a separate patch to make it easier for git to follow the
history across the rename. I may also want to rename ar-internal.h at some
point - but that would mean updating all the #includes and I'd rather do
that in a separate step.
Signed-off-by: David Howells <dhowells@redhat.com.
sch_atm returns this when TC_ACT_SHOT classification occurs.
But all other schedulers that use tc_classify
(htb, hfsc, drr, fq_codel ...) return NET_XMIT_SUCCESS | __BYPASS
in this case so just do that in atm.
BATMAN uses it as an intermediate return value to signal
forwarding vs. buffering, but it did not return POLICED to
callers outside of BATMAN.
Reviewed-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
At Qdisc creation or change time, prio_tune() creates missing
pfifo qdiscs but does not return an error code if one
qdisc could not be allocated.
Leaving a qdisc in non operational state without telling user
anything about this problem is not good.
Also, testing if we replace something different than noop_qdisc
a second time makes no sense so I removed useless code.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
>> net/ipv4/ipconfig.c:130:15: warning: 'ic_addrservaddr' defined but not used [-Wunused-variable]
static __be32 ic_addrservaddr = NONE; /* IP Address of the IP addresses'server */
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In IPv6 the ToS values are part of the flowlabel in flowi6 and get
extracted during fib rule lookup, but we forgot to correctly initialize
the flowlabel before the routing lookup.
Reported-by: <liam.mcbirnie@boeing.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
__QDISC_STATE_THROTTLED bit manipulation is rather expensive
for HTB and few others.
I already removed it for sch_fq in commit f2600cf02b
("net: sched: avoid costly atomic operation in fq_dequeue()")
and so far nobody complained.
When one ore more packets are stuck in one or more throttled
HTB class, a htb dequeue() performs two atomic operations
to clear/set __QDISC_STATE_THROTTLED bit, while root qdisc
lock is held.
Removing this pair of atomic operations bring me a 8 % performance
increase on 200 TCP_RR tests, in presence of throttled classes.
This patch has no side effect, since nothing actually uses
disc_is_throttled() anymore.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Looks like it is only there as some optimization attempt.
Since __QDISC_STATE_THROTTLED set/unset is way too expensive,
and netem is the last user, just remove this check.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
So far no qdisc ever unset the throttled bit at enqueue() time,
so CBQ usage of qdisc_is_throttled() was flaky.
Since __QDISC_STATE_THROTTLED set/unset is way too expensive
considering that only CBQ was eventually caring for this status,
it would make sense to implement a Qdisc ops ->is_throttled()
if we find that this is needed.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We want to get rid of generic qdisc throttled management,
so this qdisc has to use a private flag.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The symbol ic_addrservaddr is not static, but has no declaration
to match so make it static to fix the following warning:
net/ipv4/ipconfig.c:130:8: warning: symbol 'ic_addrservaddr' was not declared. Should it be static?
Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
The functions inet_diag_msg_common_fill and inet_diag_msg_attrs_fill
seem to have been missed from the include/linux/inet_diag.h header
file. Add them to fix the following warnings:
net/ipv4/inet_diag.c:69:6: warning: symbol 'inet_diag_msg_common_fill' was not declared. Should it be static?
net/ipv4/inet_diag.c:108:5: warning: symbol 'inet_diag_msg_attrs_fill' was not declared. Should it be static?
Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now sctp doesn't change socket state upon shutdown reception. It changes
just the assoc state, even though it's a TCP-style socket.
For some cases, if we really need to check sk->sk_state, it's necessary to
fix this issue, at least when we use ss or netstat to dump, we can get a
more exact information.
As an improvement, we will change sk->sk_state when we change asoc->state
to SHUTDOWN_RECEIVED, and also do it in sctp_shutdown to keep consistent
with sctp_close.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo R. Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* the biggest change is Michał's work on integrating FQ/codel
with the mac80211 internal software queues
* cfg80211 connect result gets clarified for the
"no connection at all" case
* advertisement of per-interface type capabilities, in case
they differ (which makes a lot of sense for some capabilities)
* most of the nl80211 & hwsim unprivileged namespace operation
changes
* human-readable VHT capabilities in debugfs
* some other cleanups, like spelling
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJXWWe/AAoJEGt7eEactAAdXVsQAKGkdXGmUU14tfiRCnZryMEH
GyiVDHQivfcadicbK9599LXadvYc6ATZHfYoSnROwB82NfhKB71N8dUJ4qePxLNa
lDe5uuuAgxtHG63hK1R02flPorkBAEcUcdHsFSuOw7JWag4/49sCqWH8X5K8E9aT
vrhaPeppJPydwnmNzNOBvMsLEJFJdZWaEZupZQ0kiJ/jB/howFdzF75GxGQ2jh+F
dPk22/PtV2igCtntqaty7h057AdZ+znQuiUdVB7eYIOle7veeGyMzFFf0xQ99LAZ
+xcA7GA74u6m8O6SUVw+6nhrUJ5XTsKGUtmKCTVOcUGa5z7XDD7NIxk2SgloCJkC
qPqVU8wyBCxKc+6JsyiVSLcB5MWvWxifvo0OyLsbCqhN50bTtJT96ymKOAx4NeSG
s+TYlV2HgVRNN6PzGvl/ZUo8Mm1UFGWlCBPcyMy9Fwc0jxdhnOAZzOtqV1yVwlCz
M43RKwxBX6MHAtlfwy6g3M5ievAviwY3Kt+yGeLRIVsJJfOUdQxYAa+m7GTeohD/
Tns4IHbtWJiLTKuaTvrs4ec3ycXrv7iibMINcarvkTgbbF9Qvarf/e2RoWO/freY
OUixDdG7HmsY0XIUhSepeMJpn5xIhQ7dkmckhzDEgZqBM6uPJEMB2+O/CGNDsxEp
VjfnwXJ+9PQrjheBvnD2
=FvhQ
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-davem-2016-06-09' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
For the next cycle, we have the following:
* the biggest change is Michał's work on integrating FQ/codel
with the mac80211 internal software queues
* cfg80211 connect result gets clarified for the
"no connection at all" case
* advertisement of per-interface type capabilities, in case
they differ (which makes a lot of sense for some capabilities)
* most of the nl80211 & hwsim unprivileged namespace operation
changes
* human-readable VHT capabilities in debugfs
* some other cleanups, like spelling
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP-NV (New Vegas) is a major update to TCP-Vegas.
An earlier version of NV was presented at 2010's LPC.
It is a delayed based congestion avoidance for the
data center. This version has been tested within a
10G rack where the HW RTTs are 20-50us and with
1 to 400 flows.
A description of TCP-NV, including implementation
details as well as experimental results, can be found at:
http://www.brakmo.org/networking/tcp-nv/TCPNV.html
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add in_flight (bytes in flight when packet was sent) field
to tx component of tcp_skb_cb and make it available to
congestion modules' pkts_acked() function through the
ack_sample function argument.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace open coded conversion between virtio_net_hdr to skb GSO info with
virtio_net_hdr_from_skb
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
alloc_workqueue replaces deprecated create_workqueue().
Since the driver is infiniband which can be used as block device and the
workqueue seems involved in regular operation of the device, so a
dedicated workqueue has been used with WQ_MEM_RECLAIM set to guarantee
forward progress under memory pressure.
Since there are only a fixed number of work items, explicit concurrency
limit is unnecessary here.
Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 8626c56c82 ("bridge: fix potential use-after-free when hook
returns QUEUE or STOLEN verdict") fixed incorrect usage of NF_HOOK's
return value by consuming packets in okfn via br_pass_frame_up().
However, this function re-injects packets to the Rx path with skb->dev
set to the bridge device, which breaks kernel's STP, as all STP packets
appear to originate from the bridge device itself.
Instead, if STP is enabled and bridge isn't a 802.1ad bridge, then learn
packet's SMAC and inject it back to the Rx path for further processing
by the packet handlers.
The patch also makes netfilter's behavior consistent with regards to
packets destined to the Bridge Group Address, as no hook registered at
LOCAL_IN will ever be called, regardless if STP is enabled or not.
Cc: Florian Westphal <fw@strlen.de>
Cc: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Cc: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Fixes: 8626c56c82 ("bridge: fix potential use-after-free when hook returns QUEUE or STOLEN verdict")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We always mixed in the parent pointer into the dentry name hash, but we
did it late at lookup time. It turns out that we can simplify that
lookup-time action by salting the hash with the parent pointer early
instead of late.
A few other users of our string hashes also wanted to mix in their own
pointers into the hash, and those are updated to use the same mechanism.
Hash users that don't have any particular initial salt can just use the
NULL pointer as a no-salt.
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Limit the socket incoming call backlog queue size so that a remote client
can't pump in sufficient new calls that the server runs out of memory. Note
that this is partially theoretical at the moment since whilst the number of
calls is limited, the number of packets trying to set up new calls is not.
This will be addressed in a later patch.
If the caller of listen() specifies a backlog INT_MAX, then they get the
current maximum; anything else greater than max_backlog or anything
negative incurs EINVAL.
The limit on the maximum queue size can be set by:
echo N >/proc/sys/net/rxrpc/max_backlog
where 4<=N<=32.
Further, set the default backlog to 0, requiring listen() to be called
before we start actually queueing new calls. Whilst this kind of is a
change in the UAPI, the caller can't actually *accept* new calls anyway
unless they've first called listen() to put the socket into the LISTENING
state - thus the aforementioned new calls would otherwise just sit there,
eating up kernel memory. (Note that sockets that don't have a non-zero
service ID bound don't get incoming calls anyway.)
Given that the default backlog is now 0, make the AFS filesystem call
kernel_listen() to set the maximum backlog for itself.
Possible improvements include:
(1) Trimming a too-large backlog to max_backlog when listen is called.
(2) Trimming the backlog value whenever the value is used so that changes
to max_backlog are applied to an open socket automatically. Note that
the AFS filesystem opens one socket and keeps it open for extended
periods, so would miss out on changes to max_backlog.
(3) Having a separate setting for the AFS filesystem.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a possibility where the user can just specify the parent and
all filters under that parent are then being purged. Currently,
for example for scripting, one needs to specify pref/prio to have
a well-defined number for 'tc filter del' command for addressing
the previously created instance or additionally filter handle in
case of priorities being the same. Improve usage by allowing the
option for tc to specify the parent and removing the whole chain
for that given parent.
Example usage after patch, no tc changes required:
# tc qdisc replace dev foo clsact
# tc filter add dev foo egress bpf da obj ./bpf.o
# tc filter add dev foo egress bpf da obj ./bpf.o
# tc filter show dev foo egress
filter protocol all pref 49151 bpf
filter protocol all pref 49151 bpf handle 0x1 bpf.o:[classifier] direct-action
filter protocol all pref 49152 bpf
filter protocol all pref 49152 bpf handle 0x1 bpf.o:[classifier] direct-action
# tc filter del dev foo egress
# tc filter show dev foo egress
#
Previously, RTM_DELTFILTER requests with invalid prio of 0 were
rejected, so only netlink requests with RTM_NEWTFILTER and NLM_F_CREATE
flag were allowed where the kernel would auto-generate a pref/prio.
We can piggyback on that and use prio of 0 as a wildcard for
requests of RTM_DELTFILTER.
For notifying tc netlink monitoring users (e.g. libnl uses this
for caching), there are two options, that is, sending individual
tfilter_notify() notifications for each tcf_proto, or sending a
single one indicating wildcard removal. I tried both and there
are pros and cons for each, eventually I decided for sending
individual tfilter_notify(), so that user space can support this
seamlessly and there won't be a mess of changing each and every
application to make sure expectations from the kernel won't break
when they don't understand single notification. Since linear chains
don't really scale, I expect only a handful of classifiers to be
attached at max for a given parent anyway.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a bpf_check_basics_ok() and reject filters that are of invalid
size much earlier, so we don't do any useless work such as invoking
bpf_prog_alloc(). Currently, rejection happens in bpf_check_classic()
only, but it's really unnecessarily late and they should be rejected
at earliest point. While at it, also clean up one bpf_prog_size() to
make it consistent with the remaining invocations.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Respect the stack's xmit_recursion limit for calls into dev_queue_xmit().
Currently, they are not handeled by the limiter when attached to clsact's
egress parent, for example, and a buggy program redirecting it to the
same device again could run into stack overflow eventually. It would be
good if we could notify an admin to give him a chance to react. We reuse
xmit_recursion instead of having one private to eBPF, so that the stack's
current recursion depth will be taken into account as well. Follow-up to
commit 3896d655f4 ("bpf: introduce bpf_clone_redirect() helper") and
27b29f6305 ("bpf: add bpf_redirect() helper").
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch adds a new OVS action, OVS_ACTION_ATTR_TRUNC, in order to
truncate packets. A 'max_len' is added for setting up the maximum
packet size, and a 'cutlen' field is to record the number of bytes
to trim the packet when the packet is outputting to a port, or when
the packet is sent to userspace.
Signed-off-by: William Tu <u9012063@gmail.com>
Cc: Pravin Shelar <pshelar@nicira.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
net/sched/act_police.c
net/sched/sch_drr.c
net/sched/sch_hfsc.c
net/sched/sch_prio.c
net/sched/sch_red.c
net/sched/sch_tbf.c
In net-next the drop methods of the packet schedulers got removed, so
the bug fixes to them in 'net' are irrelevant.
A packet action unload crash fix conflicts with the addition of the
new firstuse timestamp.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) nfnetlink timestamp taken from wrong skb, fix from Florian Westphal.
2) Revert some msleep conversions in rtlwifi as these spots are in
atomic context, from Larry Finger.
3) Validate that NFTA_SET_TABLE attribute is actually specified when we
call nf_tables_getset(). From Phil Turnbull.
4) Don't do mdio_reset in stmmac driver with spinlock held as that can
sleep, from Vincent Palatin.
5) sk_filter() does things other than run a BPF filter, so we should
not elide it's call just because sk->sk_filter is NULL. Fix from
Eric Dumazet.
6) Fix missing backlog updates in several packet schedulers, from Cong
Wang.
7) bnx2x driver should allow VLAN add/remove while the interface is
down, from Michal Schmidt.
8) Several RDS/TCP race fixes from Sowmini Varadhan.
9) fq_codel scheduler doesn't return correct queue length in dumps,
from Eric Dumazet.
10) Fix TCP stats for tail loss probe and early retransmit in ipv6, from
Yuchung Cheng.
11) Properly initialize udp_tunnel_socket_cfg in l2tp_tunnel_create(),
from Guillaume Nault.
12) qfq scheduler leaks SKBs if a kzalloc fails, fix from Florian
Westphal.
13) sock_fprog passed into PACKET_FANOUT_DATA needs compat handling,
from Willem de Bruijn.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (85 commits)
vmxnet3: segCnt can be 1 for LRO packets
packet: compat support for sock_fprog
stmmac: fix parameter to dwmac4_set_umac_addr()
net/mlx5e: Fix blue flame quota logic
net/mlx5e: Use ndo_stop explicitly at shutdown flow
net/mlx5: E-Switch, always set mc_promisc for allmulti vports
net/mlx5: E-Switch, Modify node guid on vf set MAC
net/mlx5: E-Switch, Fix vport enable flow
net/mlx5: E-Switch, Use the correct error check on returned pointers
net/mlx5: E-Switch, Use the correct free() function
net/mlx5: Fix E-Switch flow steering capabilities check
net/mlx5: Fix flow steering NIC capabilities check
net/mlx5: Fix root flow table update
net/mlx5: Fix MLX5_CMD_OP_MAX to be defined correctly
net/mlx5: Fix masking of reserved bits in XRCD number
net/mlx5: Fix the size of modify QP mailbox
mlxsw: spectrum: Don't sleep during ndo_get_phys_port_name()
mlxsw: spectrum: Make split flow match firmware requirements
wext: Fix 32 bit iwpriv compatibility issue with 64 bit Kernel
cfg80211: remove get/set antenna and tx power warnings
...
Socket option PACKET_FANOUT_DATA takes a struct sock_fprog as argument
if PACKET_FANOUT has mode PACKET_FANOUT_CBPF. This structure contains
a pointer into user memory. If userland is 32-bit and kernel is 64-bit
the two disagree about the layout of struct sock_fprog.
Add compat setsockopt support to convert a 32-bit compat_sock_fprog to
a 64-bit sock_fprog. This is analogous to compat_sock_fprog support for
SO_REUSEPORT added in commit 1957598840 ("soreuseport: add compat
case for setsockopt SO_ATTACH_REUSEPORT_CBPF").
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Frank Kellermann reported a kernel crash with 4.5.0 when IPv6 is
disabled at boot using the kernel option ipv6.disable=1. Using
current net-next with the boot option:
$ ip link add red type vrf table 1001
Generates:
[12210.919584] BUG: unable to handle kernel NULL pointer dereference at 0000000000000748
[12210.921341] IP: [<ffffffff814b30e3>] fib6_get_table+0x2c/0x5a
[12210.922537] PGD b79e3067 PUD bb32b067 PMD 0
[12210.923479] Oops: 0000 [#1] SMP
[12210.924001] Modules linked in: ipvlan 8021q garp mrp stp llc
[12210.925130] CPU: 3 PID: 1177 Comm: ip Not tainted 4.7.0-rc1+ #235
[12210.926168] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[12210.928065] task: ffff8800b9ac4640 ti: ffff8800bacac000 task.ti: ffff8800bacac000
[12210.929328] RIP: 0010:[<ffffffff814b30e3>] [<ffffffff814b30e3>] fib6_get_table+0x2c/0x5a
[12210.930697] RSP: 0018:ffff8800bacaf888 EFLAGS: 00010202
[12210.931563] RAX: 0000000000000748 RBX: ffffffff81a9e280 RCX: ffff8800b9ac4e28
[12210.932688] RDX: 00000000000000e9 RSI: 0000000000000002 RDI: 0000000000000286
[12210.933820] RBP: ffff8800bacaf898 R08: ffff8800b9ac4df0 R09: 000000000052001b
[12210.934941] R10: 00000000657c0000 R11: 000000000000c649 R12: 00000000000003e9
[12210.936032] R13: 00000000000003e9 R14: ffff8800bace7800 R15: ffff8800bb3ec000
[12210.937103] FS: 00007faa1766c700(0000) GS:ffff88013ac00000(0000) knlGS:0000000000000000
[12210.938321] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[12210.939166] CR2: 0000000000000748 CR3: 00000000b79d6000 CR4: 00000000000406e0
[12210.940278] Stack:
[12210.940603] ffff8800bb3ec000 ffffffff81a9e280 ffff8800bacaf8c8 ffffffff814b3135
[12210.941818] ffff8800bb3ec000 ffffffff81a9e280 ffffffff81a9e280 ffff8800bace7800
[12210.943040] ffff8800bacaf8f0 ffffffff81397c88 ffff8800bb3ec000 ffffffff81a9e280
[12210.944288] Call Trace:
[12210.944688] [<ffffffff814b3135>] fib6_new_table+0x24/0x8a
[12210.945516] [<ffffffff81397c88>] vrf_dev_init+0xd4/0x162
[12210.946328] [<ffffffff814091e1>] register_netdevice+0x100/0x396
[12210.947209] [<ffffffff8139823d>] vrf_newlink+0x40/0xb3
[12210.948001] [<ffffffff814187f0>] rtnl_newlink+0x5d3/0x6d5
...
The problem above is due to the fact that the fib hash table is not
allocated when IPv6 is disabled at boot.
As for the VRF driver it should not do any IPv6 initializations if IPv6
is disabled, so it needs to know if IPv6 is disabled at boot. The disable
parameter is private to the IPv6 module, so provide an accessor for
modules to determine if IPv6 was disabled at boot time.
Fixes: 35402e3136 ("net: Add IPv6 support to VRF device")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simplify the RxRPC connect() implementation. It will just note the
destination address it is given, and if a sendmsg() comes along with no
address, this will be assigned as the address. No transport struct will be
held internally, which will allow us to remove this later.
Simplify sendmsg() also. Whilst a call is active, userspace refers to it
by a private unique user ID specified in a control message. When sendmsg()
sees a user ID that doesn't map to an extant call, it creates a new call
for that user ID and attempts to add it. If, when we try to add it, the
user ID is now registered, we now reject the message with -EEXIST. We
should never see this situation unless two threads are racing, trying to
create a call with the same ID - which would be an error.
It also isn't required to provide sendmsg() with an address - provided the
control message data holds a user ID that maps to a currently active call.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is time to add netdev_lockdep_set_classes() helper
so that lockdep annotations per device type are easier to manage.
This removes a lot of copies and missing annotations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1) qdisc_run_begin() is really using the equivalent of a trylock.
Instead of using write_seqcount_begin(), use a combination of
raw_write_seqcount_begin() and correct lockdep annotation.
2) sch_direct_xmit() should use regular spin_lock(root_lock)
Fixes: f9eb8aea2a ("net_sched: transform qdisc running bit into a seqcount")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* a fix for a long-standing iwpriv 32/64 compat issue
* two fairly recently introduced (4.6) warning asking for
symmetric operations are erroneous and I remove them
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJXWWVNAAoJEGt7eEactAAd2HMP/RGyAWN/uXJVNySsqvBSqiew
t5ZTopRfL2YEcO1EUF749bncfJo6P9KUtuM72gYYRYART5viucdYFk53Ky8EtKh8
QQ7WTGyiuQP7XvderfXmrjV/J35KFiPTnf83KUfOnBTVOCtw6doIBIMutUmR7FPV
cyH4e56j9agzbBJ7mjeLYGBMmyhq2uZ85vXqk/mnpn5fH9+lGojcoIVZfhXWmMoI
d5O/giARzjDM7LTMIwgepY5w/j8RX1rMJCXfxiZsGMuiSTYH8FETbe3Go9CL/qHS
8cZCCQRMCpjMK+SOmNr7xYx70YjIJQabwo9luYmcdCkhw6VHRxHl47lk/2XLp4F9
G+ReA1A7izV2XSyJIWTl+Or1qzgzsf6SfZHKszSdY47NXzK5aCprRfUvXyxLv6GZ
IGcHzNWOFF/1fKhh3w2NylhTd/omkvtZzt5Xuom94OPXdyxzy/PmViZ+xuhutrGt
dYxhmzPZcw+/OeTGgbt2YfYFUdMZqbHkaHPE0wia03r1yXzQD/x05/QET2vwIA0L
C+ATyaykSCCOm/ovwm5rada151Iai1engCWIVw/pZ3HUj6C8xqwyDS3+xsTRmKzI
oBp1O3K3yR0c0T9KJNmORnnrYQAbSBIQNhdioXVCBpByTOjYfATKVQtaSIV/uu4/
2r1KQmrT4z5OrL6lm2wv
=0I3T
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2016-06-09' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
Two more fixes for now:
* a fix for a long-standing iwpriv 32/64 compat issue
* two fairly recently introduced (4.6) warning asking for
symmetric operations are erroneous and I remove them
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
ipip6_tunnel_xmit() is called immediately after checking that
skb->protocol is htons(ETH_P_IPV6) so there is no need
to check it a second time.
Found by inspection.
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Dinan Gunawardena <dinan.gunawardena@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In cases where the category bitmap is sparse enough that gaps exist
between netlbl_lsm_catmap structs, callers to netlbl_catmap_getlong()
could find themselves prematurely ending their search through the
category bitmap. Further, the methods used to calculate the 'idx'
and 'off' values were incorrect for bitmaps this large. This patch
changes the netlbl_catmap_getlong() behavior so that it always skips
over gaps and calculates the index and offset values correctly.
Signed-off-by: Paul Moore <paul@paul-moore.com>
There is no other limit other than a global
packet count limit when using software queuing.
This means a single flow queue can grow insanely
long. This is particularly bad for TCP congestion
algorithms which requires a little more
sophisticated frame dropping scheme than a mere
headdrop on limit overflow.
Hence apply (a slighly modified, to fit the knobs)
CoDel5 on flow queues. This improves TCP
convergence and stability when combined with
wireless driver which keeps its own tx queue/fifo
at a minimum fill level for given link conditions.
Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This adds a debugfs entry to read and modify some fq parameters.
This makes it easy to debug, test and experiment.
Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
[remove module parameter for now]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
mac80211's software queues were designed to work
very closely with device tx queues. They are
required to make use of 802.11 packet aggregation
easily and efficiently.
Due to the way 802.11 aggregation is designed it
only makes sense to keep fair queuing as close to
hardware as possible to reduce induced latency and
inertia and provide the best flow responsiveness.
This change doesn't translate directly to
immediate and significant gains. End result
depends on driver's induced latency. Best results
can be achieved if driver keeps its own tx
queue/fifo fill level to a minimum.
Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Qdiscs are designed with no regard to 802.11
aggregation requirements and hand out
packet-by-packet with no guarantee they are
destined to the same tid. This does more bad than
good no matter how fairly a given qdisc may behave
on an ethernet interface.
Software queuing used per-AC netdev subqueue
congestion control whenever a global AC limit was
hit. This meant in practice a single station or
tid queue could starve others rather easily. This
could resonate with qdiscs in a bad way or could
just end up with poor aggregation performance.
Increasing the AC limit would increase induced
latency which is also bad.
Disabling qdiscs by default and performing
taildrop instead of netdev subqueue congestion
control on the other hand makes it possible for
tid queues to fill up "in the meantime" while
preventing stations starving each other.
This increases aggregation opportunities and
should allow software queuing based drivers
achieve better performance by utilizing airtime
more efficiently with big aggregates.
Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Setting rd to NULL to avoid freeing it, just to be able to return
from the function in a single place, doesn't make much sense.
Return the set_regdom() return value directly.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This patch fixes the following checkpatch.pl issues -
- Please don't use multiple blank lines
- Blank lines aren't necessary before a close brace
- Missing a blank line after declarations
Reviewed-by: Julian Calaby <julian.calaby@gmail.com>
Signed-off-by: Kirtika Ruchandani <kirtika.ruchandani@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
iwpriv app uses iw_point structure to send data to Kernel. The iw_point
structure holds a pointer. For compatibility Kernel converts the pointer
as required for WEXT IOCTLs (SIOCIWFIRST to SIOCIWLAST). Some drivers
may use iw_handler_def.private_args to populate iwpriv commands instead
of iw_handler_def.private. For those case, the IOCTLs from
SIOCIWFIRSTPRIV to SIOCIWLASTPRIV will follow the path ndo_do_ioctl().
Accordingly when the filled up iw_point structure comes from 32 bit
iwpriv to 64 bit Kernel, Kernel will not convert the pointer and sends
it to driver. So, the driver may get the invalid data.
The pointer conversion for the IOCTLs (SIOCIWFIRSTPRIV to
SIOCIWLASTPRIV), which follow the path ndo_do_ioctl(), is mandatory.
This patch adds pointer conversion from 32 bit to 64 bit and vice versa,
if the ioctl comes from 32 bit iwpriv to 64 bit Kernel.
Cc: stable@vger.kernel.org
Signed-off-by: Prasun Maiti <prasunmaiti87@gmail.com>
Signed-off-by: Ujjal Roy <royujjal@gmail.com>
Tested-by: Dibyajyoti Ghosh <dibyajyotig@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Since set_tx_power and set_antenna are frequently implemented
without the matching get_tx_power/get_antenna, we shouldn't
have added warnings for those. Remove them.
The remaining ones are correct and need to be implemented
symmetrically for correct operation.
Cc: stable@vger.kernel.org
Fixes: de3bb771f4 ("cfg80211: add more warnings for inconsistent ops")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
after removal of TCA_CBQ_OVL_STRATEGY from cbq scheduler, there are no
more callers of ->drop() outside of other ->drop functions, i.e.
nothing calls them.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
After the removal of TCA_CBQ_POLICE in cbq scheduler qdisc->reshape_fail
is always NULL, i.e. qdisc_rehape_fail is now the same as qdisc_drop.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
iproute2 doesn't implement any cbq option that results in this attribute
being sent to kernel.
To make use of it, user would have to
- patch iproute2
- add a class
- attach a qdisc to the class (default pfifo doesn't work as
q->handle is 0 and cbq_set_police() is a no-op in this case)
- re-'add' the same class (tc class change ...) again
- user must also specifiy a defmap (e.g. 'split 1:0 defmap 3f'), since
this 'police' feature relies on its presence
- the added qdisc must be one of bfifo, pfifo or netem
If all of these conditions are met and _some_ leaf qdiscs, namely
p/bfifo, netem, plug or tbf would drop a packet, kernel calls back into
cbq, which will attempt to re-queue the skb into a different class
as indicated by the parents' defmap entry for TC_PRIO_BESTEFFORT.
[ i.e. we behave as if tc_classify returned TC_ACT_RECLASSIFY ].
This feature, which isn't documented or implemented in iproute2,
and isn't implemented consistently (most qdiscs like sfq, codel, etc
drop right away instead of attempting this reclassification) is the
sole reason for the reshape_fail and __parent member in Qdisc struct.
So remove TCA_CBQ_POLICE support from the kernel, reject it via EOPNOTSUPP
so userspace knows we don't support it, and then remove no-longer needed
infrastructure in followup commit.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
since initial revision of cbq in 2004 iproute 2 has never implemented
support for TCA_CBQ_OVL_STRATEGY, which is what needs to be set to
activate the class->drop() call (TC_CBQ_OVL_DROP strategy must be
set by userspace value must be set by userspace).
David Miller says:
It seems really safe to kill this thing off, flag an error if someone
tries to set the attribute, and therefore kill off all of the
non-default cbq_ovl_*() functions.
A followup commit can then remove all .drop qdisc methods since this
removed the only caller.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we need to create a new aggregate to enqueue the skb we call kzalloc.
If that fails we returned ENOBUFS without freeing the skb.
Spotted during code review.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ip6 GRE tap device should not be forced to down state to change
the mac address and should allow live address change for tap device
similar to ipv4 gre.
Signed-off-by: Shweta Choudaha <schoudah@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ip6 GRE tap device should not be forced to down state to change
the mac address and should allow live address change for tap device
similar to ipv4 gre.
Signed-off-by: Shweta Choudaha <schoudah@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Return an error if user requested skip-sw and the underlaying
hardware cannot handle tc offloads (or offloads are disabled).
This patch fixes the knode handling.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Errors reported by u32_replace_hw_hnode() were not propagated.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, VRFs require 1 oif and 1 iif rule per address family per
VRF. As the number of VRF devices increases it brings scalability
issues with the increasing rule list. All of the VRF rules have the
same format with the exception of the specific table id to direct the
lookup. Since the table id is available from the oif or iif in the
loopup, the VRF rules can be consolidated to a single rule that pulls
the table from the VRF device.
This patch introduces a new rule attribute l3mdev. The l3mdev rule
means the table id used for the lookup is pulled from the L3 master
device (e.g., VRF) rather than being statically defined. With the
l3mdev rule all of the basic VRF FIB rules are reduced to 1 l3mdev
rule per address family (IPv4 and IPv6).
If an admin wishes to insert higher priority rules for specific VRFs
those rules will co-exist with the l3mdev rule. This capability means
current VRF scripts will co-exist with this new simpler implementation.
Currently, the rules list for both ipv4 and ipv6 look like this:
$ ip ru ls
1000: from all oif vrf1 lookup 1001
1000: from all iif vrf1 lookup 1001
1000: from all oif vrf2 lookup 1002
1000: from all iif vrf2 lookup 1002
1000: from all oif vrf3 lookup 1003
1000: from all iif vrf3 lookup 1003
1000: from all oif vrf4 lookup 1004
1000: from all iif vrf4 lookup 1004
1000: from all oif vrf5 lookup 1005
1000: from all iif vrf5 lookup 1005
1000: from all oif vrf6 lookup 1006
1000: from all iif vrf6 lookup 1006
1000: from all oif vrf7 lookup 1007
1000: from all iif vrf7 lookup 1007
1000: from all oif vrf8 lookup 1008
1000: from all iif vrf8 lookup 1008
...
32765: from all lookup local
32766: from all lookup main
32767: from all lookup default
With the l3mdev rule the list is just the following regardless of the
number of VRFs:
$ ip ru ls
1000: from all lookup [l3mdev table]
32765: from all lookup local
32766: from all lookup main
32767: from all lookup default
(Note: the above pretty print of the rule is based on an iproute2
prototype. Actual verbage may change)
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The node keepalive interval is recalculated at each timer expiration
to catch any changes in the link tolerance, and stored in a field in
struct tipc_node. We use jiffies as unit for the stored value.
This is suboptimal, because it makes the calculation unnecessary
complex, including two unit conversions. The conversions also lead to
a rounding error that causes the link "abort limit" to be 3 in the
normal case, instead of 4, as intended. This again leads to unnecessary
link resets when the network is pushed close to its limit, e.g., in an
environment with hundreds of nodes or namesapces.
In this commit, we do instead let the keepalive value be calculated and
stored in milliseconds, so that there is only one conversion and the
rounding error is eliminated.
We also remove a redundant "keepalive" field in struct tipc_link. This
is remnant from the previous implementation.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 88e8ac7000 ("tipc: reduce transmission rate of reset messages
when link is down") revealed a flaw in the node FSM, as defined in
the log of commit 66996b6c47 ("tipc: extend node FSM").
We see the following scenario:
1: Node B receives a RESET message from node A before its link endpoint
is fully up, i.e., the node FSM is in state SELF_UP_PEER_COMING. This
event will not change the node FSM state, but the (distinct) link FSM
will move to state RESETTING.
2: As an effect of the previous event, the local endpoint on B will
declare node A lost, and post the event SELF_DOWN to the its node
FSM. This moves the FSM state to SELF_DOWN_PEER_LEAVING, meaning
that no messages will be accepted from A until it receives another
RESET message that confirms that A's endpoint has been reset. This
is wasteful, since we know this as a fact already from the first
received RESET, but worse is that the link instance's FSM has not
wasted this information, but instead moved on to state ESTABLISHING,
meaning that it repeatedly sends out ACTIVATE messages to the reset
peer A.
3: Node A will receive one of the ACTIVATE messages, move its link FSM
to state ESTABLISHED, and start repeatedly sending out STATE messages
to node B.
4: Node B will consistently drop these messages, since it can only accept
accept a RESET according to its node FSM.
5: After four lost STATE messages node A will reset its link and start
repeatedly sending out RESET messages to B.
6: Because of the reduced send rate for RESET messages, it is very
likely that A will receive an ACTIVATE (which is sent out at a much
higher frequency) before it gets the chance to send a RESET, and A
may hence quickly move back to state ESTABLISHED and continue sending
out STATE messages, which will again be dropped by B.
7: GOTO 5.
8: After having repeated the cycle 5-7 a number of times, node A will
by chance get in between with sending a RESET, and the situation is
resolved.
Unfortunately, we have seen that it may take a substantial amount of
time before this vicious loop is broken, sometimes in the order of
minutes.
We correct this by making a small correction to the node FSM: When a
node in state SELF_UP_PEER_COMING receives a SELF_DOWN event, it now
moves directly back to state SELF_DOWN_PEER_DOWN, instead of as now
SELF_DOWN_PEER_LEAVING. This is logically consistent, since we don't
need to wait for RESET confirmation from of an endpoint that we alread
know has been reset. It also means that node B in the scenario above
will not be dropping incoming STATE messages, and the link can come up
immediately.
Finally, a symmetry comparison reveals that the FSM has a similar
error when receiving the event PEER_DOWN in state PEER_UP_SELF_COMING.
Instead of moving to PERR_DOWN_SELF_LEAVING, it should move directly
to SELF_DOWN_PEER_DOWN. Although we have never seen any negative effect
of this logical error, we choose fix this one, too.
The node FSM looks as follows after those changes:
+----------------------------------------+
| PEER_DOWN_EVT|
| |
+------------------------+----------------+ |
|SELF_DOWN_EVT | | |
| | | |
| +-----------+ +-----------+ |
| |NODE_ | |NODE_ | |
| +----------|FAILINGOVER|<---------|SYNCHING |-----------+ |
| |SELF_ +-----------+ FAILOVER_+-----------+ PEER_ | |
| |DOWN_EVT | A BEGIN_EVT A | DOWN_EVT| |
| | | | | | | |
| | | | | | | |
| | |FAILOVER_ |FAILOVER_ |SYNCH_ |SYNCH_ | |
| | |END_EVT |BEGIN_EVT |BEGIN_EVT|END_EVT | |
| | | | | | | |
| | | | | | | |
| | | +--------------+ | | |
| | +-------->| SELF_UP_ |<-------+ | |
| | +-----------------| PEER_UP |----------------+ | |
| | |SELF_DOWN_EVT +--------------+ PEER_DOWN_EVT| | |
| | | A A | | |
| | | | | | | |
| | | PEER_UP_EVT| |SELF_UP_EVT | | |
| | | | | | | |
V V V | | V V V
+------------+ +-----------+ +-----------+ +------------+
|SELF_DOWN_ | |SELF_UP_ | |PEER_UP_ | |PEER_DOWN |
|PEER_LEAVING| |PEER_COMING| |SELF_COMING| |SELF_LEAVING|
+------------+ +-----------+ +-----------+ +------------+
| | A A | |
| | | | | |
| SELF_ | |SELF_ |PEER_ |PEER_ |
| DOWN_EVT| |UP_EVT |UP_EVT |DOWN_EVT |
| | | | | |
| | | | | |
| | +--------------+ | |
|PEER_DOWN_EVT +--->| SELF_DOWN_ |<---+ SELF_DOWN_EVT|
+------------------->| PEER_DOWN |<--------------------+
+--------------+
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that we can properly support multiple distinct trees in the system,
using a global variable: dsa_cpu_port_ethtool_ops is getting clobbered
as soon as the second switch tree gets probed, and we don't want that.
We need to move this to be dynamically allocated, and since we can't
really be comparing addresses anymore to determine first time
initialization versus any other times, just move this to dsa.c and
dsa2.c where the remainder of the dst/ds initialization happens.
The operations teardown restores the master netdev's ethtool_ops to its
original ethtool_ops pointer (typically within the Ethernet driver)
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a helper function: dsa_cpu_port_ethtool_init() which initializes a
custom ethtool_ops structure with custom DSA ethtool operations for CPU
ports. This is a preliminary change to move the initialization outside
of net/dsa/slave.c.
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mimic what net/dsa/dsa.c does and provide a slave MII bus by default
which will be created if the driver implements a phy_read method.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some drivers rely on these two bitmasks to contain the correct values
for them to successfully probe and initialize at drv->setup() time,
calculate correct values to put in both masks as early as possible in
dsa_get_ports_dn().
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case we have multiples trees and switches with the same index, we
need to add another discriminating id: the switch tree.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
"make htmldocs" complains otherwise:
.//net/core/gen_stats.c:168: warning: No description found for parameter 'running'
.//include/linux/netdevice.h:1867: warning: No description found for parameter 'qdisc_running_key'
Fixes: f9eb8aea2a ("net_sched: transform qdisc running bit into a seqcount")
Fixes: edb09eb17e ("net: sched: do not acquire qdisc spinlock in qdisc/class stats dump")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
"make htmldocs" complains otherwise:
.//net/core/gen_stats.c:65: warning: No description found for parameter 'padattr'
.//net/core/gen_stats.c:101: warning: No description found for parameter 'padattr'
Fixes: 9854518ea0 ("sched: align nlattr properly when needed")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
At present we perform an xfrm_lookup() for each UDPv6 message we
send. The lookup involves querying the flow cache (flow_cache_lookup)
and, in case of a cache miss, creating an XFRM bundle.
If we miss the flow cache, we can end up creating a new bundle and
deriving the path MTU (xfrm_init_pmtu) from on an already transformed
dst_entry, which we pass from the socket cache (sk->sk_dst_cache) down
to xfrm_lookup(). This can happen only if we're caching the dst_entry
in the socket, that is when we're using a connected UDP socket.
To put it another way, the path MTU shrinks each time we miss the flow
cache, which later on leads to incorrectly fragmented payload. It can
be observed with ESPv6 in transport mode:
1) Set up a transformation and lower the MTU to trigger fragmentation
# ip xfrm policy add dir out src ::1 dst ::1 \
tmpl src ::1 dst ::1 proto esp spi 1
# ip xfrm state add src ::1 dst ::1 \
proto esp spi 1 enc 'aes' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b
# ip link set dev lo mtu 1500
2) Monitor the packet flow and set up an UDP sink
# tcpdump -ni lo -ttt &
# socat udp6-listen:12345,fork /dev/null &
3) Send a datagram that needs fragmentation with a connected socket
# perl -e 'print "@" x 1470 | socat - udp6:[::1]:12345
2016/06/07 18:52:52 socat[724] E read(3, 0x555bb3d5ba00, 8192): Protocol error
00:00:00.000000 IP6 ::1 > ::1: frag (0|1448) ESP(spi=0x00000001,seq=0x2), length 1448
00:00:00.000014 IP6 ::1 > ::1: frag (1448|32)
00:00:00.000050 IP6 ::1 > ::1: ESP(spi=0x00000001,seq=0x3), length 1272
(^ ICMPv6 Parameter Problem)
00:00:00.000022 IP6 ::1 > ::1: ESP(spi=0x00000001,seq=0x5), length 136
4) Compare it to a non-connected socket
# perl -e 'print "@" x 1500' | socat - udp6-sendto:[::1]:12345
00:00:40.535488 IP6 ::1 > ::1: frag (0|1448) ESP(spi=0x00000001,seq=0x6), length 1448
00:00:00.000010 IP6 ::1 > ::1: frag (1448|64)
What happens in step (3) is:
1) when connecting the socket in __ip6_datagram_connect(), we
perform an XFRM lookup, miss the flow cache, create an XFRM
bundle, and cache the destination,
2) afterwards, when sending the datagram, we perform an XFRM lookup,
again, miss the flow cache (due to mismatch of flowi6_iif and
flowi6_oif, which is an issue of its own), and recreate an XFRM
bundle based on the cached (and already transformed) destination.
To prevent the recreation of an XFRM bundle, avoid an XFRM lookup
altogether whenever we already have a destination entry cached in the
socket. This prevents the path MTU shrinkage and brings us on par with
UDPv4.
The fix also benefits connected PINGv6 sockets, another user of
ip6_sk_dst_lookup_flow(), who also suffer messages being transformed
twice.
Joint work with Hannes Frederic Sowa.
Reported-by: Jan Tluka <jtluka@redhat.com>
Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When in kdump kernel, reduce memory usage by only using a single Queue
Set for multiqueue devices. So make netif_get_num_default_rss_queues()
return one, when in kdump kernel.
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When setting up ILA in a router we noticed that the the encapsulation
is invoked twice: once in the route input path and again upon route
output. To resolve this we add a flag set_csum_neutral for the
ila_update_ipv6_locator. If this flag is set and the checksum
neutral bit is also set we assume that checksum-neutral translation
has already been performed and take no further action. The
flag is set only in ila_output path. The flag is not set for ila_input and
ila_xlat.
Tested:
Used 3 netns to set to emulate a router and two hosts. The router
translates SIR addresses between the two destinations in other two netns.
Verified ping and netperf are functional.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The of_find_net_device_by_node() function is defined in
<linux/of_net.h> but not included in the .c file that
implements it. Fix the following warning by including the
header:
net/core/net-sysfs.c:1494:19: warning: symbol 'of_find_net_device_by_node' was not declared. Should it be static?
Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC 5961 advises to only accept RST packets containing a seq number
matching the next expected seq number instead of the whole receive
window in order to avoid spoofing attacks.
However, this situation is not optimal in the case SACK is in use at the
time the RST is sent. I recently run into a scenario in which packet
losses were high while uploading data to a server, and userspace was
willing to frequently terminate connections by sending a RST. In
this case, the ACK sent on the receiver side (rcv_nxt) is frozen waiting
for a lost packet retransmission and SACK blocks are used to let the
client continue uploading data. At some point later on, the client sends
the RST (snd_nxt), which matches the next expected seq number of the
right-most SACK block on the receiver side which is going forward
receiving data.
In this scenario, as RFC 5961 defines, the RST SEQ doesn't match the
frozen main ACK at receiver side and thus gets dropped and a challenge
ACK is sent, which gets usually lost due to network conditions. The main
consequence is that the connection stays alive for a while even if it
made sense to accept the RST. This can get really bad if lots of
connections like this one are created in few seconds, allocating all the
resources of the server easily.
For security reasons, not all SACK blocks are checked (there could be a
big amount of SACK blocks => acceptable SEQ numbers). Furthermore, it
wouldn't make sense to check for RST in blocks other than the right-most
received one because the sender is not expected to be sending new data
after the RST. For simplicity, only up to the 4 most recently updated
SACK blocks (selective_acks[4] field) are compared to find the
right-most block, as usually those are the ones with bigger probability
to contain it.
This patch was tested in a 3.18 kernel and probed to improve the
situation in the scenario described above.
Signed-off-by: Pau Espin Pedrol <pau.espin@tessares.net>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The missing br_vlan_should_use() test caused creation of an unneeded
local fdb entry on changing mac address of a bridge device when there is
a vlan which is configured on a bridge port but not on the bridge
device.
Fixes: 2594e9064a ("bridge: vlan: add per-vlan struct and move to rhashtables")
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for locally originated traffic to VRF-local IPv6 addresses.
Similar to IPv4 a local dst is set on the skb and the packet is
reinserted with a call to netif_rx. With this patch, ping, tcp and udp
packets to a local IPv6 address are successfully routed:
$ ip addr show dev eth1
4: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000
link/ether 02:e0:f9:1c:b9:74 brd ff:ff:ff:ff:ff:ff
inet 10.100.1.1/24 brd 10.100.1.255 scope global eth1
valid_lft forever preferred_lft forever
inet6 2100:1::1/120 scope global
valid_lft forever preferred_lft forever
inet6 fe80::e0:f9ff:fe1c:b974/64 scope link
valid_lft forever preferred_lft forever
$ ping6 -c1 -I red 2100:1::1
ping6: Warning: source address might be selected on device other than red.
PING 2100:1::1(2100:1::1) from 2100:1::1 red: 56 data bytes
64 bytes from 2100:1::1: icmp_seq=1 ttl=64 time=0.098 ms
ip6_input is exported so the VRF driver can use it for the dst input
function. The dst_alloc function for IPv4 defaults to setting the input and
output functions; IPv6's does not. VRF does not need to duplicate the Rx path
so just export the ipv6 input function.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements direct encapsulation of IPv4 and IPv6 packets
in UDP. This is done a version "1" of GUE and as explained in I-D
draft-ietf-nvo3-gue-03.
Changes here are only in the receive path, fou with IPxIPx already
supports the transmit side. Both the normal receive path and
GRO path are modified to check for GUE version and check for
IP version in the case that GUE version is "1".
Tested:
IPIP with direct GUE encap
1 TCP_STREAM
4530 Mbps
200 TCP_RR
1297625 tps
135/232/444 90/95/99% latencies
IP4IP6 with direct GUE encap
1 TCP_STREAM
4903 Mbps
200 TCP_RR
1184481 tps
149/253/473 90/95/99% latencies
IP6IP6 direct GUE encap
1 TCP_STREAM
5146 Mbps
200 TCP_RR
1202879 tps
146/251/472 90/95/99% latencies
SIT with direct GUE encap
1 TCP_STREAM
6111 Mbps
200 TCP_RR
1250337 tps
139/241/467 90/95/99% latencies
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS fixes for net
The following patchset contains two Netfilter/IPVS fixes for your net
tree, they are:
1) Fix missing alignment in next offset calculation for standard
targets, introduced in the previous merge window, patch from
Florian Westphal.
2) Fix to correct the handling of outgoing connections which use the
SIP-pe such that the binding of a real-server is updated when needed.
This was an omission from changes introduced by Marco Angaroni in
the previous merge window too, to allow handling of outgoing
connections by the SIP-pe. Patch and report came via Simon Horman.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The v6 tcp stats scan do not provide TLP and ER timer information
correctly like the v4 version . This patch fixes that.
Fixes: 6ba8a3b19e ("tcp: Tail loss probe (TLP)")
Fixes: eed530b6c6 ("tcp: early retransmit")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When offloading classifiers such as u32 or flower to hardware, and the
qdisc is clsact (TC_H_CLSACT), then we need to differentiate its classes,
since not all of them handle ingress, therefore we must leave those in
software path. Add a .tcf_cl_offload() callback, so we can generically
handle them, tested on ixgbe.
Fixes: 10cbc68434 ("net/sched: cls_flower: Hardware offloaded filters statistics support")
Fixes: 5b33f48842 ("net/flower: Introduce hardware offload support")
Fixes: a1b7c5fd7f ("net: sched: add cls_u32 offload hooks for netdevs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Large tc dumps (tc -s {qdisc|class} sh dev ethX) done by Google BwE host
agent [1] are problematic at scale :
For each qdisc/class found in the dump, we currently lock the root qdisc
spinlock in order to get stats. Sampling stats every 5 seconds from
thousands of HTB classes is a challenge when the root qdisc spinlock is
under high pressure. Not only the dumps take time, they also slow
down the fast path (queue/dequeue packets) by 10 % to 20 % in some cases.
An audit of existing qdiscs showed that sch_fq_codel is the only qdisc
that might need the qdisc lock in fq_codel_dump_stats() and
fq_codel_dump_class_stats()
In v2 of this patch, I now use the Qdisc running seqcount to provide
consistent reads of packets/bytes counters, regardless of 32/64 bit arches.
I also changed rate estimators to use the same infrastructure
so that they no longer need to lock root qdisc lock.
[1]
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43838.pdf
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Kevin Athey <kda@google.com>
Cc: Xiaotian Pei <xiaotian@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of using a single bit (__QDISC___STATE_RUNNING)
in sch->__state, use a seqcount.
This adds lockdep support, but more importantly it will allow us
to sample qdisc/class statistics without having to grab qdisc root lock.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We properly scan the flow list to count number of packets,
but John passed 0 to gnet_stats_copy_queue() so we report
a zero value to user space instead of the result.
Fixes: 6401585366 ("net: sched: restrict use of qstats qlen")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Return an error if user requested skip-sw and the underlaying
hardware cannot handle tc offloads (or offloads are disabled).
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
'err' variable is not set in this test, we would return whatever
previous test set 'err' to.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Useful to know when the action was first used for accounting
(and debugging)
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to make a filter processed only by hardware, skip_sw flag
should be supplied. This is an addition to the already existing skip_hw
flag (filter will be processed by software only). If no flag is
specified, filter will be processed by both software and hardware.
If only hardware offloaded filters exist, fl_classify() will return
without doing anything.
A following userspace patch will be sent once kernel patch is accepted.
Example:
tc filter add dev enp0s9 protocol ip prio 20 parent ffff: \
flower \
ip_proto 6 \
indev enp0s9 \
skip_sw \
action skbedit mark 0x1234
Signed-off-by: Amir Vadai <amirva@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Note: Tom Herbert posted almost same patch 3 months back, but for
different reasons.
The reasons we want to get rid of this spin_trylock() are :
1) Under high qdisc pressure, the spin_trylock() has almost no
chance to succeed.
2) We loop multiple times in softirq handler, eventually reaching
the max retry count (10), and we schedule ksoftirqd.
Since we want to adhere more strictly to ksoftirqd being waked up in
the future (https://lwn.net/Articles/687617/), better avoid spurious
wakeups.
3) calls to __netif_reschedule() dirty the cache line containing
q->next_sched, slowing down the owner of qdisc.
4) RT kernels can not use the spin_trylock() here.
With help of busylock, we get the qdisc spinlock fast enough, and
the trylock trick brings only performance penalty.
Depending on qdisc setup, I observed a gain of up to 19 % in qdisc
performance (1016600 pps instead of 853400 pps, using prio+tbf+fq_codel)
("mpstat -I SCPU 1" is much happier now)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <tom@herbertland.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/rxrpc/rxkad.c:1165:1-3: WARNING: PTR_ERR_OR_ZERO can be used
Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR
Generated by: scripts/coccinelle/api/ptr_ret.cocci
CC: David Howells <dhowells@redhat.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The send path needs to be quiesced before resetting callbacks from
rds_tcp_accept_one(), and commit eb19284026 ("RDS:TCP: Synchronize
rds_tcp_accept_one with rds_send_xmit when resetting t_sock") achieves
this using the c_state and RDS_IN_XMIT bit following the pattern
used by rds_conn_shutdown(). However this leaves the possibility
of a race window as shown in the sequence below
take t_conn_lock in rds_tcp_conn_connect
send outgoing syn to peer
drop t_conn_lock in rds_tcp_conn_connect
incoming from peer triggers rds_tcp_accept_one, conn is
marked CONNECTING
wait for RDS_IN_XMIT to quiesce any rds_send_xmit threads
call rds_tcp_reset_callbacks
[.. race-window where incoming syn-ack can cause the conn
to be marked UP from rds_tcp_state_change ..]
lock_sock called from rds_tcp_reset_callbacks, and we set
t_sock to null
As soon as the conn is marked UP in the race-window above, rds_send_xmit()
threads will proceed to rds_tcp_xmit and may encounter a null-pointer
deref on the t_sock.
Given that rds_tcp_state_change() is invoked in softirq context, whereas
rds_tcp_reset_callbacks() is in workq context, and testing for RDS_IN_XMIT
after lock_sock could result in a deadlock with tcp_sendmsg, this
commit fixes the race by using a new c_state, RDS_TCP_RESETTING, which
will prevent a transition to RDS_CONN_UP from rds_tcp_state_change().
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we switch a connection's sockets in rds_tcp_rest_callbacks,
any partially sent datagram must be retransmitted on the new
socket so that the receiver can correctly reassmble the RDS
datagram. Use rds_send_reset() which is designed for this purpose.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When rds_tcp_accept_one() has to replace the existing tcp socket
with a newer tcp socket (duelling-syn resolution), it must lock_sock()
to suppress the rds_tcp_data_recv() path while callbacks are being
changed. Also, existing RDS datagram reassembly state must be reset,
so that the next datagram on the new socket does not have corrupted
state. Similarly when resetting the newly accepted socket, appropriate
locks and synchronization is needed.
This commit ensures correct synchronization by invoking
kernel_sock_shutdown to reset a newly accepted sock, and by taking
appropriate lock_sock()s (for old and new sockets) when resetting
existing callbacks.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
My prior attempt to fix the backlogs of parents failed.
If we return NET_XMIT_CN, our parents wont increase their backlog,
so our qdisc_tree_reduce_backlog() should take this into account.
v2: Florian Westphal pointed out that we could drop the packet,
so we need to save qdisc_pkt_len(skb) in a temp variable before
calling fq_codel_drop()
Fixes: 9d18562a22 ("fq_codel: add batch ability to fq_codel_drop()")
Fixes: 2ccccf5fb4 ("net_sched: update hierarchical backlog too")
Reported-by: Stas Nichiporovich <stasn77@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The expectation table is not duplicated per net namespace anymore, so we can move
the expectation table and conntrack table iteration out of the per-net loop.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
checkpatch produces data type 'checks'.
This patch amends them by changing, for example:
uint8_t -> u8
Signed-off-by: Tobin C Harding <me@tobin.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Much like we had to do for AF_BLUETOOTH and AF_ALG, make sure we
properly clone the parent socket's LSM attributes to newly created
child sockets.
Signed-off-by: Paul Moore <paul@paul-moore.com>
For gso_skb we only update qlen, backlog should be updated too.
Note, it is correct to just update these stats at one layer,
because the gso_skb is cached there.
Reported-by: Stas Nichiporovich <stasn77@gmail.com>
Fixes: 2ccccf5fb4 ("net_sched: update hierarchical backlog too")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 538950a1b7 ("soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF")
missed to add the compat case for the SO_ATTACH_REUSEPORT_CBPF option.
Signed-off-by: Helge Deller <deller@gmx.de>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
It seems risky to always rely on the caller to ensure the socket's
address family is correct before passing it to the NetLabel kAPI,
especially since we see at least one LSM which didn't. Add address
family checks to the *_delattr() functions to help prevent future
problems.
Cc: <stable@vger.kernel.org>
Reported-by: Maninder Singh <maninder1.s@samsung.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
The capability check should not be audited since it is only being used
to determine the inode permissions. A failed check does not indicate a
violation of security policy but, when an LSM is enabled, a denial audit
message was being generated.
The denial audit message caused confusion for some application authors
because root-running Go applications always triggered the denial. To
prevent this confusion, the capability check in net_ctl_permissions() is
switched to the noaudit variant.
BugLink: https://launchpad.net/bugs/1465724
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: James Morris <james.l.morris@oracle.com>
Before commit 6d7b857d54 ("net: use lib/percpu_counter API for
fragmentation mem accounting"), setting the reassembly high threshold
to 0 prevented fragment reassembly as first fragment would be always
evicted before second could be added to the queue. While inefficient,
some users apparently relied on this method.
Since the commit mentioned above, a percpu counter is used for
reassembly memory accounting and high batch size avoids taking slow path
in most common scenarios. As a result, a whole full sized packet can be
reassembled without the percpu counter's main counter changing its value
so that even with high_thresh set to 0, fragmented packets can be still
reassembled and processed.
Add explicit check preventing reassembly if high threshold is zero.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previous patch that introduced handling of outgoing packets in SIP
persistent-engine did not call ip_vs_check_template() in case packet was
matching a connection template. Assumption was that real-server was
healthy, since it was sending a packet just in that moment.
There are however real-server fault conditions requiring that association
between call-id and real-server (represented by connection template)
gets updated. Here is an example of the sequence of events:
1) RS1 is a back2back user agent that handled call-id1 and call-id2
2) RS1 is down and was marked as unavailable
3) new message from outside comes to IPVS with call-id1
4) IPVS reschedules the message to RS2, which becomes new call handler
5) RS2 forwards the message outside, translating call-id1 to call-id2
6) inside pe->conn_out() IPVS matches call-id2 with existing template
7) IPVS does not change association call-id2 <-> RS1
8) new message comes from client with call-id2
9) IPVS reschedules the message to a real-server potentially different
from RS2, which is now the correct destination
This patch introduces ip_vs_check_template() call in the handling of
outgoing packets for SIP-pe. And also introduces a second optional
argument for ip_vs_check_template() that allows to check if dest
associated to a connection template is the same dest that was identified
as the source of the packet. This is to change the real-server bound to a
particular call-id independently from its availability status: the idea
is that it's more reliable, for in->out direction (where internal
network can be considered trusted), to always associate a call-id with
the last real-server that used it in one of its messages. Think about
above sequence of events where, just after step 5, RS1 returns instead
to be available.
Comparison of dests is done by simply comparing pointers to struct
ip_vs_dest; there should be no cases where struct ip_vs_dest keeps its
memory address, but represent a different real-server in terms of
ip-address / port.
Fixes: 39b9722315 ("ipvs: handle connections started by real-servers")
Signed-off-by: Marco Angaroni <marcoangaroni@gmail.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
The existing DSA binding has a number of limitations and problems. The
main problem is that it cannot represent a switch as a linux device,
hanging off some bus. It is limited to one CPU port. The DSA platform
device is artificial, and does not really represent hardware.
Implement a new binding which can be embedded into any type of node on
a bus to represent one switch device, and its links to other switches.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The switch may want to instantiate its own MDIO bus. Only do it
centrally if the switch has not already created one, and the read op
is implemented.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace the two switch statements with an array lookup, and store the
result in the dsa tree structure. The drivers no longer need to know
the selected tag protocol, so remove it from the dsa switch structure.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor the code to setup a single DSA/CPU port into a function of
its own, and export it, so it can be used by the new binding.
Similarly, refactor the destroy code into a function. When destroying
the ports, don't put the of node. They should be released at the end
along with the normal ports.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The new binding will not have a chip data structure, it will place the
routing directly into the switch structure. To enable backwards
compatibility, copy the routing from the chip data into the switch
structure.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With a maximum of four switches, the size of the routing table is the
same as the pointer to it. Removing it makes the code simpler.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the port device node structure into the port structure, from the
chip data. This information is needed in the next step of implementing
the new binding.
The chip data structure is used while parsing the whole old binding,
before the individual switch structures exist. With the new bindings,
this is reversed, the switches exist first, and the interconnections
between the switches is derived from the individual switch
bindings. Thus this chip data structure becomes unneeded.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
eviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are going to be more per-port members added to the switch
structure. So add a port structure and move the netdev into it.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The platform data nr_chips is used when validating a received packet,
to ensure it comes from a know switch chip. The number of possible
switches is limited to DSA_MAX_SWITCHES, so use this as the first
validation step. The new binding allows holes in the dst->ds[] array,
so also ensure ensure there is a valid dsa_switch for this packet.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The DSA layer should no longer assume the switch is connected to an
MDIO bus. As a result, we cannot use the address on the MDIO bus when
forming the name of the switches internal MDIO bus for its builtin and
possibly external PHYs. The switch index is sufficient to make the
name unique, so drop the MDIO address.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The new binding does not make use of dsa_chip_data, a.k.a cd. When
retrieving the size of the EEPROM attached to a switch, don't assume
there is a cd attached to the switch structure.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull Ceph fixes from Sage Weil:
"We have a few follow-up fixes for the libceph refactor from Ilya, and
then some cephfs + fscache fixes from Zheng.
The first two FS-Cache patches are acked by David Howells and deemed
trivial enough to go through our tree. The rest fix some issues with
the ceph fscache handling (disable cache for inodes opened for write,
and simplify the revalidation logic accordingly, dropping the
now-unnecessary work queue)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: use i_version to check validity of fscache
ceph: improve fscache revalidation
ceph: disable fscache when inode is opened for write
ceph: avoid unnecessary fscache invalidation/revlidation
ceph: call __fscache_uncache_page() if readpages fails
FS-Cache: make check_consistency callback return int
FS-Cache: wake write waiter after invalidating writes
libceph: use %s instead of %pE in dout()s
libceph: put request only if it's done in handle_reply()
libceph: change ceph_osdmap_flag() to take osdc
size_t objects should be printed with %Z printf format.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the more common kernel logging style and reduce object size.
The logging message prefix changes from a mixture of
"RxRPC:" and "RXRPC:" to "af_rxrpc: ".
$ size net/rxrpc/built-in.o*
text data bss dec hex filename
64172 1972 8304 74448 122d0 net/rxrpc/built-in.o.new
67512 1972 8304 77788 12fdc net/rxrpc/built-in.o.old
Miscellanea:
o Consolidate the ASSERT macros to use a single pr_err call with
decimal and hexadecimal output and a stringified #OP argument
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is useful for debugging packet sizes.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Tested-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SCTP has this pecualiarity that its packets cannot be just segmented to
(P)MTU. Its chunks must be contained in IP segments, padding respected.
So we can't just generate a big skb, set gso_size to the fragmentation
point and deliver it to IP layer.
This patch takes a different approach. SCTP will now build a skb as it
would be if it was received using GRO. That is, there will be a cover
skb with protocol headers and children ones containing the actual
segments, already segmented to a way that respects SCTP RFCs.
With that, we can tell skb_segment() to just split based on frag_list,
trusting its sizes are already in accordance.
This way SCTP can benefit from GSO and instead of passing several
packets through the stack, it can pass a single large packet.
v2:
- Added support for receiving GSO frames, as requested by Dave Miller.
- Clear skb->cb if packet is GSO (otherwise it's not used by SCTP)
- Added heuristics similar to what we have in TCP for not generating
single GSO packets that fills cwnd.
v3:
- consider sctphdr size in skb_gso_transport_seglen()
- rebased due to 5c7cdf339a ("gso: Remove arbitrary checks for
unsupported GSO")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Tested-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is a preparation for the GSO one. In order to successfully
handle GSO packets on rx path we must not call skb_linearize, otherwise
it defeats any gain GSO may have had.
This patch thus delays as much as possible the call to skb_linearize,
leaving it to sctp_inq_pop() moment. For that the sanity checks
performed now know how to deal with fragments.
One positive side-effect of this is that if the socket is backlogged it
will have the chance of doing it on backlog processing instead of
during softirq.
With this move, it's evident that a check for non-linearity in
sctp_inq_pop was ineffective and is now removed. Note that a similar
check is performed a bit below this one.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Tested-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_gso_network_seglen is not enough for checking fragment sizes if
skb is using GSO_BY_FRAGS as we have to check frag per frag.
This patch introduces skb_gso_validate_mtu, based on the former, which
will wrap the use case inside it as all calls to skb_gso_network_seglen
were to validate if it fits on a given TMU, and improve the check.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Tested-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows segmenting a skb based on its frags sizes instead of
based on a fixed value.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Tested-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sctp GSO requires it and sctp can be compiled as a module, so we need to
export this function.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Tested-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes: 2ccccf5fb4 ("net_sched: update hierarchical backlog too")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes: 2ccccf5fb4 ("net_sched: update hierarchical backlog too")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes: 2ccccf5fb4 ("net_sched: update hierarchical backlog too")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to update backlog too when we update qlen.
Joint work with Stas.
Reported-by: Stas Nichiporovich <stasn77@gmail.com>
Tested-by: Stas Nichiporovich <stasn77@gmail.com>
Fixes: 2ccccf5fb4 ("net_sched: update hierarchical backlog too")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
hfsc updates backlog lazily, that is only when we
dump the stats. This is problematic after we begin to
update backlog in qdisc_tree_reduce_backlog().
Reported-by: Stas Nichiporovich <stasn77@gmail.com>
Tested-by: Stas Nichiporovich <stasn77@gmail.com>
Fixes: 2ccccf5fb4 ("net_sched: update hierarchical backlog too")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The last field "flags" of object "minfo" is not initialized.
Copying this object out may leak kernel stack data.
Assign 0 to it to avoid leak.
Signed-off-by: Kangjie Lu <kjlu@gatech.edu>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
link_info.str is a char array of size 60. Memory after the NULL
byte is not initialized. Sending the whole object out can cause
a leak.
Signed-off-by: Kangjie Lu <kjlu@gatech.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paul Moore tracked a regression caused by a recent commit, which
mistakenly assumed that sk_filter() could be avoided if socket
had no current BPF filter.
The intent was to avoid udp_lib_checksum_complete() overhead.
But sk_filter() also checks skb_pfmemalloc() and
security_sock_rcv_skb(), so better call it.
Fixes: e6afc8ace6 ("udp: remove headers from UDP packets before queueing")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Paul Moore <paul@paul-moore.com>
Tested-by: Paul Moore <paul@paul-moore.com>
Tested-by: Stephen Smalley <sds@tycho.nsa.gov>
Cc: samanthakumar <samanthakumar@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Set name_assign_type of internal port to NET_NAME_USER.
Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Quoting John Stultz:
In updating a 32bit arm device from 4.6 to Linus' current HEAD, I
noticed I was having some trouble with networking, and realized that
/proc/net/ip_tables_names was suddenly empty.
Digging through the registration process, it seems we're catching on the:
if (strcmp(t->u.user.name, XT_STANDARD_TARGET) == 0 &&
target_offset + sizeof(struct xt_standard_target) != next_offset)
return -EINVAL;
Where next_offset seems to be 4 bytes larger then the
offset + standard_target struct size.
next_offset needs to be aligned via XT_ALIGN (so we can access all members
of ip(6)t_entry struct).
This problem didn't show up on i686 as it only needs 4-byte alignment for
u64, but iptables userspace on other 32bit arches does insert extra padding.
Reported-by: John Stultz <john.stultz@linaro.org>
Tested-by: John Stultz <john.stultz@linaro.org>
Fixes: 7ed2abddd2 ("netfilter: x_tables: check standard target size too")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for your net tree,
they are:
1) Fix incorrect timestamp in nfnetlink_queue introduced when addressing
y2038 safe timestamp, from Florian Westphal.
2) Get rid of leftover conntrack definition from the previous merge
window, oneliner from Florian.
3) Make nf_queue handler pernet to resolve race on dereferencing the
hook state structure with netns removal, from Eric Biederman.
4) Ensure clean exit on unregistered helper ports, from Taehee Yoo.
5) Restore FLOWI_FLAG_KNOWN_NH in nf_dup_ipv6. This got lost while
generalizing xt_TEE to add packet duplication support in nf_tables,
from Paolo Abeni.
6) Insufficient netlink NFTA_SET_TABLE attribute check in
nf_tables_getset(), from Phil Turnbull.
7) Reject helper registration on duplicated ports via modparams.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Fix negative error code usage in ATM layer, from Stefan Hajnoczi.
2) If CONFIG_SYSCTL is disabled, the default TTL is not initialized
properly. From Ezequiel Garcia.
3) Missing spinlock init in mvneta driver, from Gregory CLEMENT.
4) Missing unlocks in hwmb error paths, also from Gregory CLEMENT.
5) Fix deadlock on team->lock when propagating features, from Ivan
Vecera.
6) Work around buffer offset hw bug in alx chips, from Feng Tang.
7) Fix double listing of SCTP entries in sctp_diag dumps, from Xin
Long.
8) Various statistics bug fixes in mlx4 from Eric Dumazet.
9) Fix some randconfig build errors wrt fou ipv6 from Arnd Bergmann.
10) All of l2tp was namespace aware, but the ipv6 support code was not
doing so. From Shmulik Ladkani.
11) Handle on-stack hrtimers properly in pktgen, from Guenter Roeck.
12) Propagate MAC changes properly through VLAN devices, from Mike
Manning.
13) Fix memory leak in bnx2x_init_one(), from Vitaly Kuznetsov.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (62 commits)
sfc: Track RPS flow IDs per channel instead of per function
usbnet: smsc95xx: fix link detection for disabled autonegotiation
virtio_net: fix virtnet_open and virtnet_probe competing for try_fill_recv
bnx2x: avoid leaking memory on bnx2x_init_one() failures
fou: fix IPv6 Kconfig options
openvswitch: update checksum in {push,pop}_mpls
sctp: sctp_diag should dump sctp socket type
net: fec: update dirty_tx even if no skb
vlan: Propagate MAC address to VLANs
atm: iphase: off by one in rx_pkt()
atm: firestream: add more reserved strings
vxlan: Accept user specified MTU value when create new vxlan link
net: pktgen: Call destroy_hrtimer_on_stack()
timer: Export destroy_hrtimer_on_stack()
net: l2tp: Make l2tp_ip6 namespace aware
Documentation: ip-sysctl.txt: clarify secure_redirects
sfc: use flow dissector helpers for aRFS
ieee802154: fix logic error in ieee802154_llsec_parse_dev_addr
net: nps_enet: Disable interrupts before napi reschedule
net/lapb: tuse %*ph to dump buffers
...
The Kconfig options I added to work around broken compilation ended
up screwing up things more, as I used the wrong symbol to control
compilation of the file, resulting in IPv6 fou support to never be built
into the kernel.
Changing CONFIG_NET_FOU_IPV6_TUNNELS to CONFIG_IPV6_FOU fixes that
problem, I had renamed the symbol in one location but not the other,
and as the file is never being used by other kernel code, this did not
lead to a build failure that I would have caught.
After that fix, another issue with the same patch becomes obvious, as we
'select INET6_TUNNEL', which is related to IPV6_TUNNEL, but not the same,
and this can still cause the original build failure when IPV6_TUNNEL is
not built-in but IPV6_FOU is. The fix is equally trivial, we just need
to select the right symbol.
I have successfully build 350 randconfig kernels with this patch
and verified that the driver is now being built.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reported-by: Valentin Rothberg <valentinrothberg@gmail.com>
Fixes: fabb13db44 ("fou: add Kconfig options for IPv6 support")
Signed-off-by: David S. Miller <davem@davemloft.net>
In the case of CHECKSUM_COMPLETE the skb checksum should be updated in
{push,pop}_mpls() as they the type in the ethernet header.
As suggested by Pravin Shelar.
Cc: Pravin Shelar <pshelar@nicira.com>
Fixes: 25cd9ba0ab ("openvswitch: Add basic MPLS support to kernel")
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now we cannot distinguish that one sk is a udp or sctp style when
we use ss to dump sctp_info. it's necessary to dump it as well.
For sctp_diag, ss support is not officially available, thus there
are no official users of this yet, so we can add this field in the
middle of sctp_info without breaking user API.
v1->v2:
- move 'sctpi_s_type' field to the end of struct sctp_info, so
that it won't cause incompatibility with applications already
built.
- add __reserved3 in sctp_info to make sure sctp_info is 8-byte
alignment.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The MAC address of the physical interface is only copied to the VLAN
when it is first created, resulting in an inconsistency after MAC
address changes of only newly created VLANs having an up-to-date MAC.
The VLANs should continue inheriting the MAC address of the physical
interface until the VLAN MAC address is explicitly set to any value.
This allows IPv6 EUI64 addresses for the VLAN to reflect any changes
to the MAC of the physical interface and thus for DAD to behave as
expected.
Signed-off-by: Mike Manning <mmanning@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If CONFIG_DEBUG_OBJECTS_TIMERS=y, hrtimer_init_on_stack() requires
a matching call to destroy_hrtimer_on_stack() to clean up timer
debug objects.
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The driver extended capabilities may differ for different
interface types which the userspace needs to know (for
example the fine timing measurement initiator and responder
bits might differ for a station and AP). Add a new nl80211
attribute to provide extended capabilities per interface type
to userspace.
Signed-off-by: Vidyullatha Kanchanapally <vkanchan@qti.qualcomm.com>
Reviewed-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Previously, the status parameter to cfg80211_connect_result() was
documented as using WLAN_STATUS_UNSPECIFIED_FAILURE (1) when the real
status code for the failure is not known. This value can be used by an
AP (and often is) and as such, user space cannot distinguish between
explicitly rejected authentication/association and not being able to
even try to associate or not receiving a response from the AP.
Add a new inline function, cfg80211_connect_timeout(), to be used when
the driver knows that the connection attempt failed due to a reason
where connection could not be attempt or no response was received from
the AP. The internal functions now allow a negative status value (-1) to
be used as an indication of this special case. This results in the
NL80211_ATTR_TIMED_OUT to be added to the NL80211_CMD_CONNECT event to
allow user space to determine this case was hit. For backwards
compatibility, NL80211_STATUS_CODE with the value
WLAN_STATUS_UNSPECIFIED_FAILURE is still indicated in the event in such
a case.
Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
[johannes: fix cfg80211_connect_bss() prototype to use int for status,
add cfg80211_connect_timeout() to docbook, fix docbook]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This makes it a lot easier to understand the capabilities used
by the station:
VHT supported
cap: 0x300819b2
MAX-MPDU-11454
80Mhz
RXLDPC
SHORT-GI-80
TXSTBC
RXSTBC_1
SU-BEAMFORMER-CAPABLE
SU-BEAMFORMEE-CAPABLE
BEAMFORMEE-STS: 0x0
SOUNDING-DIMENSIONS: 0x0
MU-BEAMFORMER-CAPABLE
MPDU-LENGTH-EXPONENT: 0x0
LINK-ADAPTATION-VHT-MRQ-MFB: 0x0
RX-ANTENNA-PATTERN
TX-ANTENNA-PATTERN
RX MCS: fffe
TX MCS: fffe
Signed-off-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The header field is defined as u8[] but also accessed as struct
ieee80211_hdr. Enforce an alignment of 2 to prevent unnecessary
unaligned accesses, which can be very harmful for performance on many
platforms.
Fixes: e495c24731 ("mac80211: extend fast-xmit for more ciphers")
Cc: stable@vger.kernel.org
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Don't allow registration of helpers using the same tuple:
{ l3proto, l4proto, src-port }
We lookup for the helper from the packet path using this tuple through
__nf_ct_helper_find(). Therefore, we have to avoid having two helpers
with the same tuple to ensure predictible behaviour.
Don't compare the helper string names anymore since it is valid to
register two helpers with the same name, but using different tuples.
This is also implicitly fixing up duplicated helper registration via
ports= modparam since the name comparison was defeating the tuple
duplication validation.
Reported-by: Feng Gao <gfree.wind@gmail.com>
Reported-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
While a wiphy can be transferred to network namespaces, a process having
CAP_NET_ADMIN in a non-initial user namespace can not administrate such
devices due to the genetlink GENL_ADMIN_PERM restrictions.
For openvswitch having the same issue, a new GENL_UNS_ADMIN_PERM flag has
been introduced, commit 4a92602aa1 ("openvswitch: allow management from
inside user namespaces"). This patch changes all privileged operations
operating on a wiphy, dev or wdev to allow their administration using the
same mechanism. All operations use either NEED_WIPHY, NEED_WDEV or
NEED_NETDEV, which implies a namespace aware lookup of the device. The only
exception is NL80211_CMD_SET_WIPHY, which explicitly uses a namespace aware
phy lookup.
Signed-off-by: Martin Willi <martin@strongswan.org>
[also allow cancel scan, for completeness]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Commit d30291b985 ("libceph: variable-sized ceph_object_id") changed
dout()s in what is now encode_request() and ceph_object_locator_to_pg()
to use %pE, mostly to document that, although all rbd and cephfs object
names are NULL-terminated strings, ceph_object_id will handle any RADOS
object name, including the one containing NULs, just fine.
However, it turns out that vbin_printf() can't handle anything but ints
and %s - all %p suffixes are ignored. The buffer %p** points to isn't
recorded, resulting in trash in the messages if the buffer had been
reused by the time bstr_printf() got to it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
handle_reply() may be called twice on the same request: on ack and then
on commit. This occurs on btrfs-formatted OSDs or if cephfs sync write
path is triggered - CEPH_OSD_FLAG_ACK | CEPH_OSD_FLAG_ONDISK.
handle_reply() handles this with the help of done_request().
Fixes: 5aea3dcd50 ("libceph: a major OSD client update")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
For the benefit of every single caller, take osdc instead of map.
Also, now that osdc->osdmap can't ever be NULL, drop the check.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If the NFTA_SET_TABLE parameter is missing and the NLM_F_DUMP flag is
not set, then a NULL pointer dereference is triggered in
nf_tables_set_lookup because ctx.table is NULL.
Signed-off-by: Phil Turnbull <phil.turnbull@oracle.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
With the commit 48e8aa6e31 ("ipv6: Set FLOWI_FLAG_KNOWN_NH at
flowi6_flags") ip6_pol_route() callers were asked to to set the
FLOWI_FLAG_KNOWN_NH properly and xt_TEE was updated accordingly,
but with the later refactor in commit bbde9fc182 ("netfilter:
factor out packet duplication for IPv4/IPv6") the flowi6_flags
update was lost.
This commit re-add it just before the routing decision.
Fixes: bbde9fc182 ("netfilter: factor out packet duplication for IPv4/IPv6")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
helpers should unregister the only registered ports.
but, helper cannot have correct registered ports value when
failed to register.
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
l2tp_ip6 tunnel and session lookups were still using init_net, although
the l2tp core infrastructure already supports lookups keyed by 'net'.
As a result, l2tp_ip6_recv discarded packets for tunnels/sessions
created in namespaces other than the init_net.
Fix, by using dev_net(skb->dev) or sock_net(sk) where appropriate.
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix a logic error to avoid potential null pointer dereference.
Signed-off-by: Baozeng Ding <sploving1@gmail.com>
Reviewed-by: Stefan Schmidt<stefan@osg.samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use %*ph specifier to dump small buffers in hex format instead doing this
byte-by-byte.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A previous patch added the fou6.ko module, but that failed to link
in a couple of configurations:
net/built-in.o: In function `ip6_tnl_encap_add_fou_ops':
net/ipv6/fou6.c:88: undefined reference to `ip6_tnl_encap_add_ops'
net/ipv6/fou6.c:94: undefined reference to `ip6_tnl_encap_add_ops'
net/ipv6/fou6.c:97: undefined reference to `ip6_tnl_encap_del_ops'
net/built-in.o: In function `ip6_tnl_encap_del_fou_ops':
net/ipv6/fou6.c:106: undefined reference to `ip6_tnl_encap_del_ops'
net/ipv6/fou6.c:107: undefined reference to `ip6_tnl_encap_del_ops'
If CONFIG_IPV6=m, ip6_tnl_encap_add_ops/ip6_tnl_encap_del_ops
are in a module, but fou6.c can still be built-in, and that
obviously fails to link.
Also, if CONFIG_IPV6=y, but CONFIG_IPV6_TUNNEL=m or
CONFIG_IPV6_TUNNEL=n, the same problem happens for a different
reason.
This adds two new silent Kconfig symbols to work around both
problems:
- CONFIG_IPV6_FOU is now always set to 'm' if either CONFIG_NET_FOU=m
or CONFIG_IPV6=m
- CONFIG_IPV6_FOU_TUNNEL is set implicitly when IPV6_FOU is enabled
and NET_FOU_IP_TUNNELS is also turned out, and it will ensure
that CONFIG_IPV6_TUNNEL is also available.
The options could be made user-visible as well, to give additional
room for configuration, but it seems easier not to bother users
with more choice here.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: aa3463d65e ("fou: Add encap ops for IPv6 tunnels")
Signed-off-by: David S. Miller <davem@davemloft.net>
Most users of IS_ERR_VALUE() in the kernel are wrong, as they
pass an 'int' into a function that takes an 'unsigned long'
argument. This happens to work because the type is sign-extended
on 64-bit architectures before it gets converted into an
unsigned type.
However, anything that passes an 'unsigned short' or 'unsigned int'
argument into IS_ERR_VALUE() is guaranteed to be broken, as are
8-bit integers and types that are wider than 'unsigned long'.
Andrzej Hajda has already fixed a lot of the worst abusers that
were causing actual bugs, but it would be nice to prevent any
users that are not passing 'unsigned long' arguments.
This patch changes all users of IS_ERR_VALUE() that I could find
on 32-bit ARM randconfig builds and x86 allmodconfig. For the
moment, this doesn't change the definition of IS_ERR_VALUE()
because there are probably still architecture specific users
elsewhere.
Almost all the warnings I got are for files that are better off
using 'if (err)' or 'if (err < 0)'.
The only legitimate user I could find that we get a warning for
is the (32-bit only) freescale fman driver, so I did not remove
the IS_ERR_VALUE() there but changed the type to 'unsigned long'.
For 9pfs, I just worked around one user whose calling conventions
are so obscure that I did not dare change the behavior.
I was using this definition for testing:
#define IS_ERR_VALUE(x) ((unsigned long*)NULL == (typeof (x)*)NULL && \
unlikely((unsigned long long)(x) >= (unsigned long long)(typeof(x))-MAX_ERRNO))
which ends up making all 16-bit or wider types work correctly with
the most plausible interpretation of what IS_ERR_VALUE() was supposed
to return according to its users, but also causes a compile-time
warning for any users that do not pass an 'unsigned long' argument.
I suggested this approach earlier this year, but back then we ended
up deciding to just fix the users that are obviously broken. After
the initial warning that caused me to get involved in the discussion
(fs/gfs2/dir.c) showed up again in the mainline kernel, Linus
asked me to send the whole thing again.
[ Updated the 9p parts as per Al Viro - Linus ]
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Andrzej Hajda <a.hajda@samsung.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lkml.org/lkml/2016/1/7/363
Link: https://lkml.org/lkml/2016/5/27/486
Acked-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org> # For nvmem part
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull Ceph updates from Sage Weil:
"This changeset has a few main parts:
- Ilya has finished a huge refactoring effort to sync up the
client-side logic in libceph with the user-space client code, which
has evolved significantly over the last couple years, with lots of
additional behaviors (e.g., how requests are handled when cluster
is full and transitions from full to non-full).
This structure of the code is more closely aligned with userspace
now such that it will be much easier to maintain going forward when
behavior changes take place. There are some locking improvements
bundled in as well.
- Zheng adds multi-filesystem support (multiple namespaces within the
same Ceph cluster)
- Zheng has changed the readdir offsets and directory enumeration so
that dentry offsets are hash-based and therefore stable across
directory fragmentation events on the MDS.
- Zheng has a smorgasbord of bug fixes across fs/ceph"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (71 commits)
ceph: fix wake_up_session_cb()
ceph: don't use truncate_pagecache() to invalidate read cache
ceph: SetPageError() for writeback pages if writepages fails
ceph: handle interrupted ceph_writepage()
ceph: make ceph_update_writeable_page() uninterruptible
libceph: make ceph_osdc_wait_request() uninterruptible
ceph: handle -EAGAIN returned by ceph_update_writeable_page()
ceph: make fault/page_mkwrite return VM_FAULT_OOM for -ENOMEM
ceph: block non-fatal signals for fault/page_mkwrite
ceph: make logical calculation functions return bool
ceph: tolerate bad i_size for symlink inode
ceph: improve fragtree change detection
ceph: keep leaf frag when updating fragtree
ceph: fix dir_auth check in ceph_fill_dirfrag()
ceph: don't assume frag tree splits in mds reply are sorted
ceph: fix inode reference leak
ceph: using hash value to compose dentry offset
ceph: don't forbid marking directory complete after forward seek
ceph: record 'offset' for each entry of readdir result
ceph: define 'end/complete' in readdir reply as bit flags
...
Highlights include:
Features:
- Add support for the NFS v4.2 COPY operation
- Add support for NFS/RDMA over IPv6
Bugfixes and cleanups:
- Avoid race that crashes nfs_init_commit()
- Fix oops in callback path
- Fix LOCK/OPEN race when unlinking an open file
- Choose correct stateids when using delegations in setattr, read and write
- Don't send empty SETATTR after OPEN_CREATE
- xprtrdma: Prevent server from writing a reply into memory client has released
- xprtrdma: Support using Read list and Reply chunk in one RPC call
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJXRu76AAoJENfLVL+wpUDrDVoQAKPKv1tEVJMRUQA3UVoKoixd
KjmmZMjl6GfpISwTZl+a8W549jyGuYH7Gl8vSbMaE9/FI+kJW6XZQniTYfFqY8/a
LbMSdNx1+yURisbkyO0vPqqwKw9r6UmsfGeUT8SpS3ff61yp4Oj436ra2qcPJsZ3
cWl/lHItzX7oKFAWmr0Nmq2X8ac/8+NFyK29+V/QGfwtp3qAPbpA8XM5HrHw3rA2
uk5uNSr3hwqz7P3+Hi7ZoO2m4nQTAbQnEunfYpxlOwz4IaM7qcGnntT6Jhwq1pGE
/1YasG7bHeiWjhynmZZ4CWuMkogau2UJ/G68Cz7ehLhPNr8rH/ZFCJZ+XX0e0CgI
1d+AwxZvgszIQVBY3S7sg8ezVSCPBXRFJ8rtzggGscqC53aP7L+rLfUFH+OKrhMg
6n7RQiq4EmGDJGviB/R2HixI9CpdOf2puNhDKSJmPOqiSS7UuHMw8QCq++vdru+1
GLGunGyO7D70yTV92KtsdzJlFlnfa/g+FIJrmaMpL3HH1h0stTctWX5xlTYmqEL3
z3aUuT8RySk2t1FTabSj6KRWqE/krK5BMZbX91kpF27WL4c/olXFaZPqBDsj0q4u
2rm1fIrc8RxLXctJan9ro092s/e9dup/1JxV5XWMq/EGS1ezvf+0XkCOtURaAWp3
2aPHlx7M8iuq2SouL6f7
=QMmY
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.7-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client updates from Anna Schumaker:
"Highlights include:
Features:
- Add support for the NFS v4.2 COPY operation
- Add support for NFS/RDMA over IPv6
Bugfixes and cleanups:
- Avoid race that crashes nfs_init_commit()
- Fix oops in callback path
- Fix LOCK/OPEN race when unlinking an open file
- Choose correct stateids when using delegations in setattr, read and
write
- Don't send empty SETATTR after OPEN_CREATE
- xprtrdma: Prevent server from writing a reply into memory client
has released
- xprtrdma: Support using Read list and Reply chunk in one RPC call"
* tag 'nfs-for-4.7-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (61 commits)
pnfs: pnfs_update_layout needs to consider if strict iomode checking is on
nfs/flexfiles: Use the layout segment for reading unless it a IOMODE_RW and reading is disabled
nfs/flexfiles: Helper function to detect FF_FLAGS_NO_READ_IO
nfs: avoid race that crashes nfs_init_commit
NFS: checking for NULL instead of IS_ERR() in nfs_commit_file()
pnfs: make pnfs_layout_process more robust
pnfs: rework LAYOUTGET retry handling
pnfs: lift retry logic from send_layoutget to pnfs_update_layout
pnfs: fix bad error handling in send_layoutget
flexfiles: add kerneldoc header to nfs4_ff_layout_prepare_ds
flexfiles: remove pointless setting of NFS_LAYOUT_RETURN_REQUESTED
pnfs: only tear down lsegs that precede seqid in LAYOUTRETURN args
pnfs: keep track of the return sequence number in pnfs_layout_hdr
pnfs: record sequence in pnfs_layout_segment when it's created
pnfs: don't merge new ff lsegs with ones that have LAYOUTRETURN bit set
pNFS/flexfiles: When initing reads or writes, we might have to retry connecting to DSes
pNFS/flexfiles: When checking for available DSes, conditionally check for MDS io
pNFS/flexfile: Fix erroneous fall back to read/write through the MDS
NFS: Reclaim writes via writepage are opportunistic
NFSv4: Use the right stateid for delegations in setattr, read and write
...
We have this situation: that EP hash table, contains only the EPs
that are listening, while the transports one, has the opposite.
We have to traverse both to dump all.
But when we traverse the transports one we will also get EPs that are
in the EP hash if they are listening. In this case, the EP is dumped
twice.
We will fix it by checking if the endpoint that is in the endpoint
hash table contains any ep->asoc in there, as it means we will also
find it via transport hash, and thus we can/should skip it, depending
on the filters used, like 'ss -l'.
Still, we should NOT skip it if the user is listing only listening
endpoints, because then we are not traversing the transport hash.
so we have to check idiag_states there also.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ceph_osdc_wait_request() is used when cephfs issues sync IO. In most
cases, the sync IO should be uninterruptible. The fix is use killale
wait function in ceph_osdc_wait_request().
Signed-off-by: Yan, Zheng <zyan@redhat.com>
This patch makes serverl logical caculation functions return bool to
improve readability due to these particular functions only using 0/1
as their return value.
No functional change.
Signed-off-by: Zhang Zhuoyu <zhangzhuoyu@cmss.chinamobile.com>
There is now about a dozen CEPH_OSDMAP_* flags. This is a debugging
interface, so just dump in hex instead of spelling each flag out.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This adds the "map check" infrastructure for sending osdmap version
checks on CALC_TARGET_POOL_DNE and completing in-flight requests with
-ENOENT if the target pool doesn't exist or has just been deleted.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
For map check, we are going to need to send CEPH_MSG_MON_GET_VERSION
messages asynchronously and get a callback on completion. Refactor MON
client to allow firing off generic requests asynchronously and add an
async variant of ceph_monc_get_version(). ceph_monc_do_statfs() is
switched over and remains sync.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Implement ceph_osdc_watch_check() to be able to check on status of
watch. Note that the time it takes for a watch/notify event to get
delivered through the notify_wq is taken into account.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Implement ceph_osdc_notify() for sending notifies.
Due to the fact that the current messenger can't do read-in into
pagelists (it can only do write-out from them), I had to go with a page
vector for a NOTIFY_COMPLETE payload, for now.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This adds support and switches rbd to a new, more reliable version of
watch/notify protocol. As with the OSD client update, this is mostly
about getting the right structures linked into the right places so that
reconnects are properly sent when needed. watch/notify v2 also
requires sending regular pings to the OSDs - send_linger_ping().
A major change from the old watch/notify implementation is the
introduction of ceph_osd_linger_request - linger requests no longer
piggy back on ceph_osd_request. ceph_osd_event has been merged into
ceph_osd_linger_request.
All the details are now hidden within libceph, the interface consists
of a simple pair of watch/unwatch functions and ceph_osdc_notify_ack().
ceph_osdc_watch() does return ceph_osd_linger_request, but only to keep
the lifetime management simple.
ceph_osdc_notify_ack() accepts an optional data payload, which is
relayed back to the notifier.
Portions of this patch are loosely based on work by Douglas Fuller
<dfuller@redhat.com> and Mike Christie <michaelc@cs.wisc.edu>.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The unwatch timeout is currently implemented in rbd. With
watch/unwatch code moving into libceph, we are going to need
a ceph_osdc_wait_request() variant with a timeout.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This is a major sync up, up to ~Jewel. The highlights are:
- per-session request trees (vs a global per-client tree)
- per-session locking (vs a global per-client rwlock)
- homeless OSD session
- no ad-hoc global per-client lists
- support for pool quotas
- foundation for watch/notify v2 support
- foundation for map check (pool deletion detection) support
The switchover is incomplete: lingering requests can be setup and
teared down but aren't ever reestablished. This functionality is
restored with the introduction of the new lingering infrastructure
(ceph_osd_linger_request, linger_work, etc) in a later commit.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
OSD client is getting moved from the big per-client lock to a set of
per-session locks. The big rwlock would only be held for read most of
the time, so a global osdc->osd_lru needs additional protection.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
create_osd() is called way too deep in the stack to be able to error
out in a sane way; a failing create_osd() just messes everything up.
The current req_notarget list solution is broken - the list is never
traversed as it's not entirely clear when to do it, I guess.
If we were to start traversing it at regular intervals and retrying
each request, we wouldn't be far off from what __GFP_NOFAIL is doing,
so allocate OSD sessions with __GFP_NOFAIL, at least until we come up
with a better fix.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Separate osdmap handling from decoding and iterating over a bag of maps
in a fresh MOSDMap message. This sets up the scene for the updated OSD
client.
Of particular importance here is the addition of pi->was_full, which
can be used to answer "did this pool go full -> not-full in this map?".
This is the key bit for supporting pool quotas.
We won't be able to downgrade map_sem for much longer, so drop
downgrade_write().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This leads to a simpler osdmap handling code, particularly when dealing
with pi->was_full, which is introduced in a later commit.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Both homeless OSD sessions and watch/notify v2, introduced in later
commits, require periodic ticks which don't depend on ->num_requests.
Schedule the initial tick from ceph_osdc_init() and reschedule from
handle_timeout() unconditionally.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
ceph_osdc_stop() isn't called if ceph_osdc_init() fails, so we end up
with handle_osds_timeout() running on invalid memory if any one of the
allocations fails. Call schedule_delayed_work() after everything is
setup, just before returning.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If you specify ACK | ONDISK and set ->r_unsafe_callback, both
->r_callback and ->r_unsafe_callback(true) are called on ack. This is
very confusing. Redo this so that only one of them is called:
->r_unsafe_callback(true), on ack
->r_unsafe_callback(false), on commit
or
->r_callback, on ack|commit
Decode everything in decode_MOSDOpReply() to reduce clutter.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
finish_read(), its only user, uses it to get to hdr.data_len, which is
what ->r_result is set to on success. This gains us the ability to
safely call callbacks from contexts other than reply, e.g. map check.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The crux of this is getting rid of ceph_osdc_build_request(), so that
MOSDOp can be encoded not before but after calc_target() calculates the
actual target. Encoding now happens within ceph_osdc_start_request().
Also nuked is the accompanying bunch of pointers into the encoded
buffer that was used to update fields on each send - instead, the
entire front is re-encoded. If we want to support target->name_len !=
base->name_len in the future, there is no other way, because oid is
surrounded by other fields in the encoded buffer.
Encoding OSD ops and adding data items to the request message were
mixed together in osd_req_encode_op(). While we want to re-encode OSD
ops, we don't want to add duplicate data items to the message when
resending, so all call to ceph_osdc_msg_data_add() are factored out
into a new setup_request_data().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Replace __calc_request_pg() and most of __map_request() with
calc_target() and start using req->r_t.
ceph_osdc_build_request() however still encodes base_oid, because it's
called before calc_target() is and target_oid is empty at that point in
time; a printf in osdc_show() also shows base_oid. This is fixed in
"libceph: switch to calc_target(), part 2".
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Introduce ceph_osd_request_target, containing all mapping-related
fields of ceph_osd_request and calc_target() for calculating mappings
and populating it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Add and decode pi->min_size and pi->last_force_request_resend. These
are going to be used by calc_target().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
calc_target() code is going to need to know how to compare PGs. Take
lhs and rhs pgid by const * while at it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Rename ceph_calc_pg_primary() to ceph_pg_to_acting_primary() to
emphasise that it returns acting primary.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Knowning just acting set isn't enough, we need to be able to record up
set as well to detect interval changes. This means returning (up[],
up_len, up_primary, acting[], acting_len, acting_primary) and passing
it around. Introduce and switch to ceph_osds to help with that.
Rename ceph_calc_pg_acting() to ceph_pg_to_up_acting_osds() and return
both up and acting sets from it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Rename ceph_oloc_oid_to_pg() to ceph_object_locator_to_pg(). Emphasise
that returned is raw PG and return -ENOENT instead of -EIO if the pool
doesn't exist.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Given
struct foo {
u64 id;
struct rb_node bar_node;
};
generate insert_bar(), erase_bar() and lookup_bar() functions with
DEFINE_RB_FUNCS(bar, struct foo, id, bar_node)
The key is assumed to be an integer (u64, int, etc), compared with
< and >. nodefld has to be initialized with RB_CLEAR_NODE().
Start using it for MDS, MON and OSD requests and OSD sessions.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Currently ceph_object_id can hold object names of up to 100
(CEPH_MAX_OID_NAME_LEN) characters. This is enough for all use cases,
expect one - long rbd image names:
- a format 1 header is named "<imgname>.rbd"
- an object that points to a format 2 header is named "rbd_id.<imgname>"
We operate on these potentially long-named objects during rbd map, and,
for format 1 images, during header refresh. (A format 2 header name is
a small system-generated string.)
Lift this 100 character limit by making ceph_object_id be able to point
to an externally-allocated string. Apart from being able to work with
almost arbitrarily-long named objects, this allows us to reduce the
size of ceph_object_id from >100 bytes to 64 bytes.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
For a message pool message, preallocate a page, just like we do for
osd_op. For a normal message, take ceph_object_id into account and
don't bother subtracting CEPH_OSD_SLAB_OPS ceph_osd_ops.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The size of ->r_request and ->r_reply messages depends on the size of
the object name (ceph_object_id), while the size of ceph_osd_request is
fixed. Move message allocation into a separate function that would
have to be called after ceph_object_id and ceph_object_locator (which
is also going to become variable in size with RADOS namespaces) have
been filled in:
req = ceph_osdc_alloc_request(...);
<fill in req->r_base_oid>
<fill in req->r_base_oloc>
ceph_osdc_alloc_messages(req);
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
ceph_osdc_build_request() is going away. Grab snapc and initialize
->r_snapid in ceph_osdc_alloc_request().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When hwbm_pool_add exited in error the spinlock was not released. This
patch fixes this issue.
Fixes: 8cb2d8bf57 ("net: add a hardware buffer management helper API")
Reported-by: Jean-Jacques Hiblot <jjhiblot@traphandler.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before calling the nla_parse_nested function, make sure the pointer to the
attribute is not null. This patch fixes several potential null pointer
dereference vulnerabilities in the tipc netlink functions.
Signed-off-by: Baozeng Ding <sploving1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Weber reported:
> Under full load (unshare() in loop -> OOM conditions) we can
> get kernel panic:
>
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
> IP: [<ffffffff81476c85>] nfqnl_nf_hook_drop+0x35/0x70
> [..]
> task: ffff88012dfa3840 ti: ffff88012dffc000 task.ti: ffff88012dffc000
> RIP: 0010:[<ffffffff81476c85>] [<ffffffff81476c85>] nfqnl_nf_hook_drop+0x35/0x70
> RSP: 0000:ffff88012dfffd80 EFLAGS: 00010206
> RAX: 0000000000000008 RBX: ffffffff81add0c0 RCX: ffff88013fd80000
> [..]
> Call Trace:
> [<ffffffff81474d98>] nf_queue_nf_hook_drop+0x18/0x20
> [<ffffffff814738eb>] nf_unregister_net_hook+0xdb/0x150
> [<ffffffff8147398f>] netfilter_net_exit+0x2f/0x60
> [<ffffffff8141b088>] ops_exit_list.isra.4+0x38/0x60
> [<ffffffff8141b652>] setup_net+0xc2/0x120
> [<ffffffff8141bd09>] copy_net_ns+0x79/0x120
> [<ffffffff8106965b>] create_new_namespaces+0x11b/0x1e0
> [<ffffffff810698a7>] unshare_nsproxy_namespaces+0x57/0xa0
> [<ffffffff8104baa2>] SyS_unshare+0x1b2/0x340
> [<ffffffff81608276>] entry_SYSCALL_64_fastpath+0x1e/0xa8
> Code: 65 00 48 89 e5 41 56 41 55 41 54 53 83 e8 01 48 8b 97 70 12 00 00 48 98 49 89 f4 4c 8b 74 c2 18 4d 8d 6e 08 49 81 c6 88 00 00 00 <49> 8b 5d 00 48 85 db 74 1a 48 89 df 4c 89 e2 48 c7 c6 90 68 47
>
The simple fix for this requires a new pernet variable for struct
nf_queue that indicates when it is safe to use the dynamically
allocated nf_queue state.
As we need a variable anyway make nf_register_queue_handler and
nf_unregister_queue_handler pernet. This allows the existing logic of
when it is safe to use the state from the nfnetlink_queue module to be
reused with no changes except for making it per net.
The syncrhonize_rcu from nf_unregister_queue_handler is moved to a new
function nfnl_queue_net_exit_batch so that the worst case of having a
syncrhonize_rcu in the pernet exit path is not experienced in batch
mode.
Reported-by: Florian Westphal <fw@strlen.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Policer was not dumping or updating timestamps
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I found a serious performance bug in packet schedulers using hrtimers.
sch_htb and sch_fq are definitely impacted by this problem.
We constantly rearm high resolution timers if some packets are throttled
in one (or more) class, and other packets are flying through qdisc on
another (non throttled) class.
hrtimer_start() does not have the mod_timer() trick of doing nothing if
expires value does not change :
if (timer_pending(timer) &&
timer->expires == expires)
return 1;
This issue is particularly visible when multiple cpus can queue/dequeue
packets on the same qdisc, as hrtimer code has to lock a remote base.
I used following fix :
1) Change htb to use qdisc_watchdog_schedule_ns() instead of open-coding
it.
2) Cache watchdog prior expiration. hrtimer might provide this, but I
prefer to not rely on some hrtimer internal.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXRL2PAAoJECebzXlCjuG+c34P/1wnkehVxDozBJp7UEzhrsE/
U1dpwfykzVEIMh68TldBvyrt2Lb4ThLPZ7V2dVwNqA831S/VM6fWJyw8WerSgGgU
SUGOzdF04rNfy41lXQNpDiiC417Fbp4Js4O+Q5kd+8kqQbXYqCwz0ce3DVbAT571
JmJgBI8gZLhicyNRDOt0y6C+/3P+0bbXYvS8wkzY+CwbNczHJOCLhwViKzWTptm9
LCSgDGm68ckpR7mZkWfEF3WdiZ9+SxeI+pT9dcomzxNfbv8NluDplYmdLbepA2J8
uWHGprVe9WJMDnw4hJhrI2b3/rHIntpxuZYktmnb/z/ezBTyi3FXYWgAEdE1by+Y
Gf7OewKOp8XcQ/iHRZ8vwXNrheHAr9++SB49mGBZJ3qj6bO+FrISQKX9FRxo6PrJ
SDRgYjt5yUG2oD1AAs1NzuBPqZzR40mA6Yk4zuNAcxzK/S7DdRF/9Kjyk86TVv08
3E3O5i1RyVcU/A7JdnbiyeDFMQoRshdnN0HShIZcSfcfW+qFKghNlO9bFfSl904F
jlG6moNB5OBiV8FNOelY+HGAYoUdw120QxqQMv47oZGKCjv+rfK38aB4GBJ4iEuo
TrGqNmrMrs/AKdL3Sd+8LuJqSfXggrwUDc/KS6CFz/U0eBbp6k0kcd7FEyG/J8kW
JxQ0URgyJ+DHfc60E8LN
=k6RP
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.7' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"A very quiet cycle for nfsd, mainly just an RDMA update from Chuck
Lever"
* tag 'nfsd-4.7' of git://linux-nfs.org/~bfields/linux:
sunrpc: fix stripping of padded MIC tokens
svcrpc: autoload rdma module
svcrdma: Generalize svc_rdma_xdr_decode_req()
svcrdma: Eliminate code duplication in svc_rdma_recvfrom()
svcrdma: Drain QP before freeing svcrdma_xprt
svcrdma: Post Receives only for forward channel requests
svcrdma: Remove superfluous line from rdma_read_chunks()
svcrdma: svc_rdma_put_context() is invoked twice in Send error path
svcrdma: Do not add XDR padding to xdr_buf page vector
svcrdma: Support IPv6 with NFS/RDMA
nfsd: handle seqid wraparound in nfsd4_preprocess_layout_stateid
Remove unnecessary allocation
In gre6 xmit path, we are sending a GRE packet, so set fl6 proto
to IPPROTO_GRE properly.
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When creat an ip6gretap interface with an unreachable route,
the MTU is about 14 bytes larger than what was needed.
If the remote address is reachable:
ping6 2001:0:130::1 -c 2
PING 2001:0:130::1(2001:0:130::1) 56 data bytes
64 bytes from 2001:0:130::1: icmp_seq=1 ttl=64 time=1.46 ms
64 bytes from 2001:0:130::1: icmp_seq=2 ttl=64 time=81.1 ms
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit fa50d974d1 ("ipv4: Namespaceify ip_default_ttl sysctl knob")
moves the default TTL assignment, and as side-effect IPv4 TTL now
has a default value only if sysctl support is enabled (CONFIG_SYSCTL=y).
The sysctl_ip_default_ttl is fundamental for IP to work properly,
as it provides the TTL to be used as default. The defautl TTL may be
used in ip_selected_ttl, through the following flow:
ip_select_ttl
ip4_dst_hoplimit
net->ipv4.sysctl_ip_default_ttl
This commit fixes the issue by assigning net->ipv4.sysctl_ip_default_ttl
in net_init_net, called during ipv4's initialization.
Without this commit, a kernel built without sysctl support will send
all IP packets with zero TTL (unless a TTL is explicitly set, e.g.
with setsockopt).
Given a similar issue might appear on the other knobs that were
namespaceify, this commit also moves them.
Fixes: fa50d974d1 ("ipv4: Namespaceify ip_default_ttl sysctl knob")
Signed-off-by: Ezequiel Garcia <ezequiel@vanguardiasur.com.ar>
Signed-off-by: David S. Miller <davem@davemloft.net>
The sk_err and sk_err_soft fields are positive errno values and
userspace applications rely on this when using getsockopt(SO_ERROR).
ATM code places an -errno into sk_err_soft in sigd_send() and returns it
from svc_addparty()/svc_dropparty().
Although I am not familiar with ATM code I came to this conclusion
because:
1. sigd_send() msg->type cases as_okay and as_error both have:
sk->sk_err = -msg->reply;
while the as_addparty and as_dropparty cases have:
sk->sk_err_soft = msg->reply;
This is the source of the inconsistency.
2. svc_addparty() returns an -errno and assumes sk_err_soft is also an
-errno:
if (flags & O_NONBLOCK) {
error = -EINPROGRESS;
goto out;
}
...
error = xchg(&sk->sk_err_soft, 0);
out:
release_sock(sk);
return error;
This shows that sk_err_soft is indeed being treated as an -errno.
This patch ensures that sk_err_soft is always a positive errno.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The length of the GSS MIC token need not be a multiple of four bytes.
It is then padded by XDR to a multiple of 4 B, but unwrap_integ_data()
would previously only trim mic.len + 4 B. The remaining up to three
bytes would then trigger a check in nfs4svc_decode_compoundargs(),
leading to a "garbage args" error and mount failure:
nfs4svc_decode_compoundargs: compound not properly padded!
nfsd: failed to decode arguments!
This would prevent older clients using the pre-RFC 4121 MIC format
(37-byte MIC including a 9-byte OID) from mounting exports from v3.9+
servers using krb5i.
The trimming was introduced by commit 4c190e2f91 ("sunrpc: trim off
trailing checksum before returning decrypted or integrity authenticated
buffer").
Fixes: 4c190e2f91 "unrpc: trim off trailing checksum..."
Signed-off-by: Tomáš Trnka <ttrnka@mail.muni.cz>
Cc: stable@vger.kernel.org
Acked-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This should fix failures like:
# rpc.nfsd --rdma
rpc.nfsd: Unable to request RDMA services: Protocol not supported
Reported-by: Steve Dickson <steved@redhat.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Here's the large TTY and Serial driver update for 4.7-rc1.
A few new serial drivers are added here, and Peter has fixed a bunch of
long-standing bugs in the tty layer and serial drivers as normal. Full
details in the shortlog.
All of these have been in linux-next for a while with no reported issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iEYEABECAAYFAlc/0/oACgkQMUfUDdst+ynzyQCgsa54VNijdAzU6AA5HEfqmf2M
cGMAn1boH7hUWlAbJmzzihx4JASoGjYW
=V5VH
-----END PGP SIGNATURE-----
Merge tag 'tty-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty and serial driver updates from Greg KH:
"Here's the large TTY and Serial driver update for 4.7-rc1.
A few new serial drivers are added here, and Peter has fixed a bunch
of long-standing bugs in the tty layer and serial drivers as normal.
Full details in the shortlog.
All of these have been in linux-next for a while with no reported
issues"
* tag 'tty-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (88 commits)
MAINTAINERS: 8250: remove website reference
serial: core: Fix port mutex assert if lockdep disabled
serial: 8250_dw: fix wrong logic in dw8250_check_lcr()
tty: vt, finish looping on duplicate
tty: vt, return error when con_startup fails
QE-UART: add "fsl,t1040-ucc-uart" to of_device_id
serial: mctrl_gpio: Drop support for out1-gpios and out2-gpios
serial: 8250dw: Add device HID for future AMD UART controller
Fix OpenSSH pty regression on close
serial: mctrl_gpio: add IRQ locking
serial: 8250: Integrate Fintek into 8250_base
serial: mps2-uart: add support for early console
serial: mps2-uart: add MPS2 UART driver
dt-bindings: document the MPS2 UART bindings
serial: sirf: Use generic uart-has-rtscts DT property
serial: sirf: Introduce helper variable struct device_node *np
serial: mxs-auart: Use generic uart-has-rtscts DT property
serial: imx: Use generic uart-has-rtscts DT property
doc: DT: Add Generic Serial Device Tree Bindings
serial: 8250: of: Make tegra_serial_handle_break() static
...
Pull networking fixes and more updates from David Miller:
1) Tunneling fixes from Tom Herbert and Alexander Duyck.
2) AF_UNIX updates some struct sock bit fields with the socket lock,
whereas setsockopt() sets overlapping ones with locking. Seperate
out the synchronized vs. the AF_UNIX unsynchronized ones to avoid
corruption. From Andrey Ryabinin.
3) Mount BPF filesystem with mount_nodev rather than mount_ns, from
Eric Biederman.
4) A couple kmemdup conversions, from Muhammad Falak R Wani.
5) BPF verifier fixes from Alexei Starovoitov.
6) Don't let tunneled UDP packets get stuck in socket queues, if
something goes wrong during the encapsulation just drop the packet
rather than signalling an error up the call stack. From Hannes
Frederic Sowa.
7) SKB ref after free in batman-adv, from Florian Westphal.
8) TCP iSCSI, ocfs2, rds, and tipc have to disable BH in it's TCP
callbacks since the TCP stack runs pre-emptibly now. From Eric
Dumazet.
9) Fix crash in fixed_phy_add, from Rabin Vincent.
10) Fix length checks in xen-netback, from Paul Durrant.
11) Fix mixup in KEY vs KEYID macsec attributes, from Sabrina Dubroca.
12) RDS connection spamming bug fixes from Sowmini Varadhan
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (152 commits)
net: suppress warnings on dev_alloc_skb
uapi glibc compat: fix compilation when !__USE_MISC in glibc
udp: prevent skbs lingering in tunnel socket queues
bpf: teach verifier to recognize imm += ptr pattern
bpf: support decreasing order in direct packet access
net: usb: ch9200: use kmemdup
ps3_gelic: use kmemdup
net:liquidio: use kmemdup
bpf: Use mount_nodev not mount_ns to mount the bpf filesystem
net: cdc_ncm: update datagram size after changing mtu
tuntap: correctly wake up process during uninit
intel: Add support for IPv6 IP-in-IP offload
ip6_gre: Do not allow segmentation offloads GRE_CSUM is enabled with FOU/GUE
RDS: TCP: Avoid rds connection churn from rogue SYNs
RDS: TCP: rds_tcp_accept_worker() must exit gracefully when terminating rds-tcp
net: sock: move ->sk_shutdown out of bitfields.
ipv6: Don't reset inner headers in ip6_tnl_xmit
ip4ip6: Support for GSO/GRO
ip6ip6: Support for GSO/GRO
ipv6: Set features for IPv6 tunnels
...
In case we find a socket with encapsulation enabled we should call
the encap_recv function even if just a udp header without payload is
available. The callbacks are responsible for correctly verifying and
dropping the packets.
Also, in case the header validation fails for geneve and vxlan we
shouldn't put the skb back into the socket queue, no one will pick
them up there. Instead we can simply discard them in the respective
encap_recv functions.
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch addresses the same issue we had for IPv4 where enabling GRE with
an inner checksum cannot be supported with FOU/GUE due to the fact that
they will jump past the GRE header at it is treated like a tunnel header.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a rogue SYN is received after the connection arbitration
algorithm has converged, the incoming SYN should not needlessly
quiesce the transmit path, and it should not result in needless
TCP connection resets due to re-execution of the connection
arbitration logic.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are two instances where we want to terminate RDS-TCP: when
exiting the netns or during module unload. In either case, the
termination sequence is to stop the listen socket, mark the
rtn->rds_tcp_listen_sock as null, and flush any accept workqs.
Thus any workqs that get flushed at this point will encounter a
null rds_tcp_listen_sock, and must exit gracefully to allow
the RDS-TCP termination to complete successfully.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since iptunnel_handle_offloads() is called in all paths we can
probably drop the block in ip6_tnl_xmit that was checking for
skb->encapsulation and resetting the inner headers.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Need to set dev features, use same values that are used in GREv6.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch add a new fou6 module that provides encapsulation
operations for IPv6.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add encap_hlen and ip_tunnel_encap structure to ip6_tnl. Add functions
for getting encap hlen, setting up encap on a tunnel, performing
encapsulation operation.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds receive path support for IPv6 with fou.
- Add address family to fou structure for open sockets. This supports
AF_INET and AF_INET6. Lookups for fou ports are performed on both the
port number and family.
- In fou and gue receive adjust tot_len in IPv4 header or payload_len
based on address family.
- Allow AF_INET6 in FOU_ATTR_AF netlink attribute.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create __fou_build_header and __gue_build_header. These implement the
protocol generic parts of building the fou and gue header.
fou_build_header and gue_build_header implement the IPv4 specific
functions and call the __*_build_header functions.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use helper function to set up UDP tunnel related information for a fou
socket.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Consolidate all the ip_tunnel_encap definitions in one spot in the
header file. Also, move ip_encap_hlen and ip_tunnel_encap from
ip_tunnel.c to ip_tunnels.h so they call be called without a dependency
on ip_tunnel module. Similarly, move iptun_encaps to ip_tunnel_core.c.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When performing foo-over-UDP, UDP packets are processed by the
encapsulation handler which returns another protocol to process.
This may result in processing two (or more) protocols in the
loop that are marked as INET6_PROTO_FINAL. The actions taken
for hitting a final protocol, in particular the skb_postpull_rcsum
can only be performed once.
This patch set adds a check of a final protocol has been seen. The
rules are:
- If the final protocol has not been seen any protocol is processed
(final and non-final). In the case of a final protocol, the final
actions are taken (like the skb_postpull_rcsum)
- If a final protocol has been seen (e.g. an encapsulating UDP
header) then no further non-final protocols are allowed
(e.g. extension headers). For more final protocols the
final actions are not taken (e.g. skb_postpull_rcsum).
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In ip6_input_finish the nexthdr protocol is retrieved from the
next header offset that is returned in the cb of the skb.
This method does not work for UDP encapsulation that may not
even have a concept of a nexthdr field (e.g. FOU).
This patch checks for a final protocol (INET6_PROTO_FINAL) when a
protocol handler returns > 0. If the protocol is not final then
resubmission is performed on nhoff value. If the protocol is final
then the nexthdr is taken to be the return value.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch defines two new GSO definitions SKB_GSO_IPXIP4 and
SKB_GSO_IPXIP6 along with corresponding NETIF_F_GSO_IPXIP4 and
NETIF_F_GSO_IPXIP6. These are used to described IP in IP
tunnel and what the outer protocol is. The inner protocol
can be deduced from other GSO types (e.g. SKB_GSO_TCPV4 and
SKB_GSO_TCPV6). The GSO types of SKB_GSO_IPIP and SKB_GSO_SIT
are removed (these are both instances of SKB_GSO_IPXIP4).
SKB_GSO_IPXIP6 will be used when support for GSO with IP
encapsulation over IPv6 is added.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In several gso_segment functions there are checks of gso_type against
a seemingly arbitrary list of SKB_GSO_* flags. This seems like an
attempt to identify unsupported GSO types, but since the stack is
the one that set these GSO types in the first place this seems
unnecessary to do. If a combination isn't valid in the first
place that stack should not allow setting it.
This is a code simplication especially for add new GSO types.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Updates to the new Intel X722 iWARP driver
- Updates to the hfi1 driver
- Fixes for the iw_cxgb4 driver
- Misc core fixes
- Generic RDMA READ/WRITE API addition
- SRP updates
- Misc ipoib updates
- Minor mlx5 updates
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXPIAnAAoJELgmozMOVy/dEYUP/0A6NH5ptzUrwPuLrWoz8h+e
KBfJE7H09mfKBx0Rq8YmnU+pz4lk8vrMLLaqpbGN57mwO0a1lK9bgc3E6KUhQPhc
dpGEX/NG1+aILomD7M4l1yAKkG17kxFLD75cLCeaxhO76jBRWsunukqk5mT/u0EG
fUYZs1fRb9t2LDtTNhPfSFR1+dgP5S17xLhpl9ttn87hTmIuiGWR6ig2nTC7azZD
G0d7RVjohfY2sDD28YgQiUEJ+q+1ymp3XTaCZhPCVl9VCRPweEdtLKcbNWZIvClx
ewuTCgADXg8tAL/6zu6bEKqZlC17UrmVJee3csKQLT09PJkSEICeFR/ld6ONUKAF
nDhi3ySa75Xxg9VDcCnuRkXKK+/zi7oDelZuh9mvMG0JJqPK9rTZDD29j2kBf7C0
bdx4R5cI4KJWQ/GlCyi/nLiuYkmAiCugzcGnRho4ub+EJ0yX1w6n8KVYr37kFsFu
q6MCnEfArEgDpbq1wo0+9MWtqBYrnOI/XtG81Zd+6X2MW975qU85wUdUSjg6OOb1
v1osyAmFDy9A0Y80yY+l1HHrSVIvI0IAWZDfxsbCLQY8O03ZNcvxE2RsrzWd5CKL
iZsX24tjV0WR9+lORHLfAKB3DL9CcfHv/tHo7q+5iAHmIuWZGrEN22ELkwS/4X7x
d/V0XDjzs6lgQeTJ7R4B
=e+Zv
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull rdma updates from Doug Ledford:
"Primary 4.7 merge window changes
- Updates to the new Intel X722 iWARP driver
- Updates to the hfi1 driver
- Fixes for the iw_cxgb4 driver
- Misc core fixes
- Generic RDMA READ/WRITE API addition
- SRP updates
- Misc ipoib updates
- Minor mlx5 updates"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (148 commits)
IB/mlx5: Fire the CQ completion handler from tasklet
net/mlx5_core: Use tasklet for user-space CQ completion events
IB/core: Do not require CAP_NET_ADMIN for packet sniffing
IB/mlx4: Fix unaligned access in send_reply_to_slave
IB/mlx5: Report Scatter FCS device capability when supported
IB/mlx5: Add Scatter FCS support for Raw Packet QP
IB/core: Add Scatter FCS create flag
IB/core: Add Raw Scatter FCS device capability
IB/core: Add extended device capability flags
i40iw: pass hw_stats by reference rather than by value
i40iw: Remove unnecessary synchronize_irq() before free_irq()
i40iw: constify i40iw_vf_cqp_ops structure
IB/mlx5: Add UARs write-combining and non-cached mapping
IB/mlx5: Allow mapping the free running counter on PROT_EXEC
IB/mlx4: Use list_for_each_entry_safe
IB/SA: Use correct free function
IB/core: Fix a potential array overrun in CMA and SA agent
IB/core: Remove unnecessary check in ibnl_rcv_msg
IB/IWPM: Fix a potential skb leak
RDMA/nes: replace custom print_hex_dump()
...
Overlayfs uses separate inodes even in the case of hard links on the
underlying filesystems. This is a problem for AF_UNIX socket
implementation which indexes sockets based on the inode. This resulted in
hard linked sockets not working.
The fix is to use the real, underlying inode.
Test case follows:
-- ovl-sock-test.c --
#include <unistd.h>
#include <err.h>
#include <sys/socket.h>
#include <sys/un.h>
#define SOCK "test-sock"
#define SOCK2 "test-sock2"
int main(void)
{
int fd, fd2;
struct sockaddr_un addr = {
.sun_family = AF_UNIX,
.sun_path = SOCK,
};
struct sockaddr_un addr2 = {
.sun_family = AF_UNIX,
.sun_path = SOCK2,
};
unlink(SOCK);
unlink(SOCK2);
if ((fd = socket(AF_UNIX, SOCK_STREAM, 0)) == -1)
err(1, "socket");
if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) == -1)
err(1, "bind");
if (listen(fd, 0) == -1)
err(1, "listen");
if (link(SOCK, SOCK2) == -1)
err(1, "link");
if ((fd2 = socket(AF_UNIX, SOCK_STREAM, 0)) == -1)
err(1, "socket");
if (connect(fd2, (struct sockaddr *) &addr2, sizeof(addr2)) == -1)
err (1, "connect");
return 0;
}
----
Reported-by: Alexander Morozov <alexandr.morozov@docker.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>
Merge updates from Andrew Morton:
- fsnotify fix
- poll() timeout fix
- a few scripts/ tweaks
- debugobjects updates
- the (small) ocfs2 queue
- Minor fixes to kernel/padata.c
- Maybe half of the MM queue
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (117 commits)
mm, page_alloc: restore the original nodemask if the fast path allocation failed
mm, page_alloc: uninline the bad page part of check_new_page()
mm, page_alloc: don't duplicate code in free_pcp_prepare
mm, page_alloc: defer debugging checks of pages allocated from the PCP
mm, page_alloc: defer debugging checks of freed pages until a PCP drain
cpuset: use static key better and convert to new API
mm, page_alloc: inline pageblock lookup in page free fast paths
mm, page_alloc: remove unnecessary variable from free_pcppages_bulk
mm, page_alloc: pull out side effects from free_pages_check
mm, page_alloc: un-inline the bad part of free_pages_check
mm, page_alloc: check multiple page fields with a single branch
mm, page_alloc: remove field from alloc_context
mm, page_alloc: avoid looking up the first zone in a zonelist twice
mm, page_alloc: shortcut watermark checks for order-0 pages
mm, page_alloc: reduce cost of fair zone allocation policy retry
mm, page_alloc: shorten the page allocator fast path
mm, page_alloc: check once if a zone has isolated pageblocks
mm, page_alloc: move __GFP_HARDWALL modifications out of the fastpath
mm, page_alloc: simplify last cpupid reset
mm, page_alloc: remove unnecessary initialisation from __alloc_pages_nodemask()
...
page_reference manipulation functions are introduced to track down
reference count change of the page. Use it instead of direct
modification of _count.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Sunil Goutham <sgoutham@cavium.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
struct timespec is not y2038 safe. Even though timespec might be
sufficient to represent timeouts, use struct timespec64 here as the plan
is to get rid of all timespec reference in the kernel.
The patch transitions the common functions: poll_select_set_timeout()
and select_estimate_accuracy() to use timespec64. And, all the syscalls
that use these functions are transitioned in the same patch.
The restart block parameters for poll uses monotonic time. Use
timespec64 here as well to assign timeout value. This parameter in the
restart block need not change because this only holds the monotonic
timestamp at which timeout should occur. And, unsigned long data type
should be big enough for this timestamp.
The system call interfaces will be handled in a separate series.
Compat interfaces need not change as timespec64 is an alias to struct
timespec on a 64 bit system.
Link: http://lkml.kernel.org/r/1461947989-21926-3-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: John Stultz <john.stultz@linaro.org>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
May, we managed to uncover and fix some important bugs in our
new B.A.T.M.A.N. V algorithm. These are the fixes we came up with
together with others that I collected in the past weeks:
- avoid potential crash due to NULL pointer dereference in
B.A.T.M.A.N. V routine when a neigh_ifinfo object is not found, by
Sven Eckelmann
- avoid use-after-free of skb when counting outgoing bytes, by Florian
Westphal
- fix neigh_ifinfo object reference counting imbalance when using
B.A.T.M.A.N. V, by Sven Eckelmann. Such imbalance may lead to the
impossibility of releasing the related netdev object on shutdown
- avoid invalid memory access in case of error while allocating
bcast_own_sum when a new hard-interface is added, by Sven Eckelmann
- ensure originator address is updated in OMG/ELP packet content upon
primary interface address change, by Antonio Quartulli
- fix integer overflow when computing TQ metric (B.A.T.M.A.N. IV), by
Sven Eckelmann
- avoid race condition while adding new neigh_node which would result
in having two objects mapping to the same physical neighbour, by
Linus Lüssing
- ensure originator address is initialized in ELP packet content on
secondary interfaces, by Marek Lindner
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJXO+ZsAAoJEJ4aZjxxc6bK7UEQALxcPqv2YtQhQCexHAPlmTxg
73uXuXpo0eWPxJh2FElcsHTHSt/l0/m/G8MEWCOfqI8MTeg+UTY5lXCY4iThtiY6
n08nl/CLkPNLuDGzAd7CpCM2jJ1iAzgJEVpUGQ6qoO5ekFvO09U/J/o6E8gDTX4I
yc1kR3hcJeLbF6EUXEfTX/2THdFhkYSgHgZoGyM3eu43Zhv/6v454iEpqNxA/Ubr
UngJTGSfqf20FidUt+xjbfptltTzj25Cp+VQ2DQ/LOuit0ChECjXwMfk0UkDHlKX
bYPm1tcUJ0dzbR7cILMlzzzCSzH00leUk5GYYo59rAxBQiJ2j+92c2K0dcnv8cJJ
q+GsZapyQFBIgPWS/TLtG1II3u8YWYvl+nEQXBDtPeRAjGoO4VwSbuatsf+enM8K
i0c622iEc0b3cVe5H8reUh7SZMfMT/cQkEgpX5/psf8FWvEzeDv4HI3aUfVuSpck
mEZAy59HeLWrF+KTScougv14vyoMmKfUcr5E1Oe4X1Ke9dpVvHGiI9wlqb/J0lfl
rIc/5WI3ae/QAXkcPybanPEO8O0pBMP+Y4oCYpWmDu1qYfwZzDaSMqqEALQVVW2I
8Ta1gbzamgZDslf7YDSWOR0iyFAFcnMhT0qU6gv2+RGbcfwD+AGRLu5ufY/th0N3
L6AuFhvT/bxW6jN+iuAS
=IACg
-----END PGP SIGNATURE-----
Merge tag 'batman-adv-fix-for-davem' of git://git.open-mesh.org/linux-merge
Antonio Quartulli says:
====================
During the Wireless Battle Mesh v9 in Porto (PT) at the beginning of
May, we managed to uncover and fix some important bugs in our
new B.A.T.M.A.N. V algorithm. These are the fixes we came up with
together with others that I collected in the past weeks:
- avoid potential crash due to NULL pointer dereference in
B.A.T.M.A.N. V routine when a neigh_ifinfo object is not found, by
Sven Eckelmann
- avoid use-after-free of skb when counting outgoing bytes, by Florian
Westphal
- fix neigh_ifinfo object reference counting imbalance when using
B.A.T.M.A.N. V, by Sven Eckelmann. Such imbalance may lead to the
impossibility of releasing the related netdev object on shutdown
- avoid invalid memory access in case of error while allocating
bcast_own_sum when a new hard-interface is added, by Sven Eckelmann
- ensure originator address is updated in OMG/ELP packet content upon
primary interface address change, by Antonio Quartulli
- fix integer overflow when computing TQ metric (B.A.T.M.A.N. IV), by
Sven Eckelmann
- avoid race condition while adding new neigh_node which would result
in having two objects mapping to the same physical neighbour, by
Linus Lüssing
- ensure originator address is initialized in ELP packet content on
secondary interfaces, by Marek Lindner
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP stack can now run from process context.
Use read_lock_bh(&sk->sk_callback_lock) variant to restore previous
assumption.
Fixes: 5413d1babe ("net: do not block BH while processing socket backlog")
Fixes: d41a69f1d3 ("tcp: make tcp_sendmsg() aware of socket backlog")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jon Maloy <jon.maloy@ericsson.com>
Cc: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP stack can now run from process context.
Use read_lock_bh(&sk->sk_callback_lock) variant to restore previous
assumption.
Fixes: 5413d1babe ("net: do not block BH while processing socket backlog")
Fixes: d41a69f1d3 ("tcp: make tcp_sendmsg() aware of socket backlog")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_splice_bits() returns int, kcm_splice_read() returns ssize_t,
both are signed.
We may need another patch to make them all ssize_t, but that
deserves a separated patch.
Fixes: 91687355b9 ("kcm: Splice support")
Reported-by: David Binderman <linuxdev.baldrick@gmail.com>
Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull security subsystem updates from James Morris:
"Highlights:
- A new LSM, "LoadPin", from Kees Cook is added, which allows forcing
of modules and firmware to be loaded from a specific device (this
is from ChromeOS, where the device as a whole is verified
cryptographically via dm-verity).
This is disabled by default but can be configured to be enabled by
default (don't do this if you don't know what you're doing).
- Keys: allow authentication data to be stored in an asymmetric key.
Lots of general fixes and updates.
- SELinux: add restrictions for loading of kernel modules via
finit_module(). Distinguish non-init user namespace capability
checks. Apply execstack check on thread stacks"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (48 commits)
LSM: LoadPin: provide enablement CONFIG
Yama: use atomic allocations when reporting
seccomp: Fix comment typo
ima: add support for creating files using the mknodat syscall
ima: fix ima_inode_post_setattr
vfs: forbid write access when reading a file into memory
fs: fix over-zealous use of "const"
selinux: apply execstack check on thread stacks
selinux: distinguish non-init user namespace capability checks
LSM: LoadPin for kernel file loading restrictions
fs: define a string representation of the kernel_read_file_id enumeration
Yama: consolidate error reporting
string_helpers: add kstrdup_quotable_file
string_helpers: add kstrdup_quotable_cmdline
string_helpers: add kstrdup_quotable
selinux: check ss_initialized before revalidating an inode label
selinux: delay inode label lookup as long as possible
selinux: don't revalidate an inode's label when explicitly setting it
selinux: Change bool variable name to index.
KEYS: Add KEYCTL_DH_COMPUTE command
...
This fix prevents nodes to wrongly create a 00:00:00:00:00:00 originator
which can potentially interfere with the rest of the neighbor statistics.
Fixes: d6f94d91f7 ("batman-adv: ELP - adding basic infrastructure")
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The undefined behavior sanatizer detected an signed integer overflow in a
setup with near perfect link quality
UBSAN: Undefined behaviour in net/batman-adv/bat_iv_ogm.c:1246:25
signed integer overflow:
8713350 * 255 cannot be represented in type 'int'
The problems happens because the calculation of mixed unsigned and signed
integers resulted in an integer multiplication.
batadv_ogm_packet::tq (u8 255)
* tq_own (u8 255)
* tq_asym_penalty (int 134; max 255)
* tq_iface_penalty (int 255; max 255)
The tq_iface_penalty, tq_asym_penalty and inv_asym_penalty can just be
changed to unsigned int because they are not expected to become negative.
Fixes: c039876892 ("batman-adv: add WiFi penalty")
Signed-off-by: Sven Eckelmann <sven.eckelmann@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
When the MAC address of the primary interface is changed,
update the originator address in the ELP and OGM skb buffers as
well in order to reflect the change.
Fixes: d6f94d91f7 ("batman-adv: ELP - adding basic infrastructure")
Reported-by: Marek Lindner <marek@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The function batadv_iv_ogm_orig_add_if allocates new buffers for bcast_own
and bcast_own_sum. It is expected that these buffers are unchanged in case
either bcast_own or bcast_own_sum couldn't be resized.
But the error handling of this function frees the already resized buffer
for bcast_own when the allocation of the new bcast_own_sum buffer failed.
This will lead to an invalid memory access when some code will try to
access bcast_own.
Instead the resized new bcast_own buffer has to be kept. This will not lead
to problems because the size of the buffer was only increased and therefore
no user of the buffer will try to access bytes outside of the new buffer.
Fixes: d0015fdd3d ("batman-adv: provide orig_node routing API")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The functions batadv_neigh_ifinfo_get increase the reference counter of the
batadv_neigh_ifinfo. These have to be reduced again when the reference is
not used anymore to correctly free the objects.
Fixes: 9786906022 ("batman-adv: B.A.T.M.A.N. V - implement neighbor comparison API calls")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
batadv_neigh_ifinfo_get can return NULL when it cannot find (even when only
temporarily) anymore the neigh_ifinfo in the list neigh->ifinfo_list. This
has to be checked to avoid kernel Oopses when the ifinfo is dereferenced.
This a situation which isn't expected but is already handled by functions
like batadv_v_neigh_cmp. The same kind of warning is therefore used before
the function returns without dereferencing the pointers.
Fixes: 9786906022 ("batman-adv: B.A.T.M.A.N. V - implement neighbor comparison API calls")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
batadv_send_skb_to_orig() calls dev_queue_xmit() so we can't use skb->len.
Fixes: 953324776d ("batman-adv: network coding - buffer unicast packets before forward")
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Pull networking updates from David Miller:
"Highlights:
1) Support SPI based w5100 devices, from Akinobu Mita.
2) Partial Segmentation Offload, from Alexander Duyck.
3) Add GMAC4 support to stmmac driver, from Alexandre TORGUE.
4) Allow cls_flower stats offload, from Amir Vadai.
5) Implement bpf blinding, from Daniel Borkmann.
6) Optimize _ASYNC_ bit twiddling on sockets, unless the socket is
actually using FASYNC these atomics are superfluous. From Eric
Dumazet.
7) Run TCP more preemptibly, also from Eric Dumazet.
8) Support LED blinking, EEPROM dumps, and rxvlan offloading in mlx5e
driver, from Gal Pressman.
9) Allow creating ppp devices via rtnetlink, from Guillaume Nault.
10) Improve BPF usage documentation, from Jesper Dangaard Brouer.
11) Support tunneling offloads in qed, from Manish Chopra.
12) aRFS offloading in mlx5e, from Maor Gottlieb.
13) Add RFS and RPS support to SCTP protocol, from Marcelo Ricardo
Leitner.
14) Add MSG_EOR support to TCP, this allows controlling packet
coalescing on application record boundaries for more accurate
socket timestamp sampling. From Martin KaFai Lau.
15) Fix alignment of 64-bit netlink attributes across the board, from
Nicolas Dichtel.
16) Per-vlan stats in bridging, from Nikolay Aleksandrov.
17) Several conversions of drivers to ethtool ksettings, from Philippe
Reynes.
18) Checksum neutral ILA in ipv6, from Tom Herbert.
19) Factorize all of the various marvell dsa drivers into one, from
Vivien Didelot
20) Add VF support to qed driver, from Yuval Mintz"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1649 commits)
Revert "phy dp83867: Fix compilation with CONFIG_OF_MDIO=m"
Revert "phy dp83867: Make rgmii parameters optional"
r8169: default to 64-bit DMA on recent PCIe chips
phy dp83867: Make rgmii parameters optional
phy dp83867: Fix compilation with CONFIG_OF_MDIO=m
bpf: arm64: remove callee-save registers use for tmp registers
asix: Fix offset calculation in asix_rx_fixup() causing slow transmissions
switchdev: pass pointer to fib_info instead of copy
net_sched: close another race condition in tcf_mirred_release()
tipc: fix nametable publication field in nl compat
drivers: net: Don't print unpopulated net_device name
qed: add support for dcbx.
ravb: Add missing free_irq() calls to ravb_close()
qed: Remove a stray tab
net: ethernet: fec-mpc52xx: use phy_ethtool_{get|set}_link_ksettings
net: ethernet: fec-mpc52xx: use phydev from struct net_device
bpf, doc: fix typo on bpf_asm descriptions
stmmac: hardware TX COE doesn't work when force_thresh_dma_mode is set
net: ethernet: fs-enet: use phy_ethtool_{get|set}_link_ksettings
net: ethernet: fs-enet: use phydev from struct net_device
...
Pull 'struct path' constification update from Al Viro:
"'struct path' is passed by reference to a bunch of Linux security
methods; in theory, there's nothing to stop them from modifying the
damn thing and LSM community being what it is, sooner or later some
enterprising soul is going to decide that it's a good idea.
Let's remove the temptation and constify all of those..."
* 'work.const-path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
constify ima_d_path()
constify security_sb_pivotroot()
constify security_path_chroot()
constify security_path_{link,rename}
apparmor: remove useless checks for NULL ->mnt
constify security_path_{mkdir,mknod,symlink}
constify security_path_{unlink,rmdir}
apparmor: constify common_perm_...()
apparmor: constify aa_path_link()
apparmor: new helper - common_path_perm()
constify chmod_common/security_path_chmod
constify security_sb_mount()
constify chown_common/security_path_chown
tomoyo: constify assorted struct path *
apparmor_path_truncate(): path->mnt is never NULL
constify vfs_truncate()
constify security_path_truncate()
[apparmor] constify struct path * in a bunch of helpers
Clean up.
After "xprtrdma: Remove ro_unmap() from all registration modes",
there are no longer any sites that take rpcrdma_ia::qplock for read.
The one site that takes it for write is always single-threaded. It
is safe to remove it.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
In a cluster failover scenario, it is desirable for the client to
attempt to reconnect quickly, as an alternate NFS server is already
waiting to take over for the down server. The client can't see that
a server IP address has moved to a new server until the existing
connection is gone.
For fabrics and devices where it is meaningful, set a definite upper
bound on the amount of time before it is determined that a
connection is no longer valid. This allows the RPC client to detect
connection loss in a timely matter, then perform a fresh resolution
of the server GUID in case it has changed (cluster failover).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: The ro_unmap method is no longer used.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
There needs to be a safe method of releasing registered memory
resources when an RPC terminates. Safe can mean a number of things:
+ Doesn't have to sleep
+ Doesn't rely on having a QP in RTS
ro_unmap_safe will be that safe method. It can be used in cases
where synchronous memory invalidation can deadlock, or needs to have
an active QP.
The important case is fencing an RPC's memory regions after it is
signaled (^C) and before it exits. If this is not done, there is a
window where the server can write an RPC reply into memory that the
client has released and re-used for some other purpose.
Note that this is a full solution for FRWR, but FMR and physical
still have some gaps where a particularly bad server can wreak
some havoc on the client. These gaps are not made worse by this
patch and are expected to be exceptionally rare and timing-based.
They are noted in documenting comments.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Separate the DMA unmap operation from freeing the MW. In a
subsequent patch they will not always be done at the same time,
and they are not related operations (except by order; freeing
the MW must be the last step during invalidation).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
In a subsequent patch, the fr_xprt and fr_worker fields will be
needed by another memory registration mode. Move them into the
generic rpcrdma_mw structure that wraps struct rpcrdma_frmr.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Maintain the order of invalidation and DMA unmapping when doing
a background MR reset.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
frwr_op_unmap_sync() is now invoked in a workqueue context, the same
as __frwr_queue_recovery(). There's no need to defer MR reset if
posting LOCAL_INV MRs fails.
This means that even when ib_post_send() fails (which should occur
very rarely) the invalidation and DMA unmapping steps are still done
in the correct order.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Move the the I/O direction field from rpcrdma_mr_seg into the
rpcrdma_frmr.
This makes it possible to DMA-unmap the frwr long after an RPC has
exited and its rpcrdma_mr_seg array has been released and re-used.
This might occur if an RPC times out while waiting for a new
connection to be established.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: Follow same naming convention as other fields in struct
rpcrdma_frwr.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: Replace rpcrdma_flush_cqs() and rpcrdma_clean_cqs() with
the new ib_drain_qp() API.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-By: Leon Romanovsky <leonro@mellanox.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
rpcrdma_create_chunks() has been replaced, and can be removed.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Update documenting comments to reflect code changes over the past
year.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Avoid the latency and interrupt overhead of registering a Write
chunk when handling NFS READ requests of a few hundred bytes or
less.
This change does not interoperate with Linux NFS/RDMA servers
that do not have commit 9d11b51ce7 ('svcrdma: Fix send_reply()
scatter/gather set-up'). Commit 9d11b51ce7 was introduced in v4.3,
and is included in 4.2.y, 4.1.y, and 3.18.y.
Oracle bug 22925946 has been filed to request that the above fix
be included in the Oracle Linux UEK4 NFS/RDMA server.
Red Hat bugzillas 1327280 and 1327554 have been filed to request
that RHEL NFS/RDMA server backports include the above fix.
Workaround: Replace the "proto=rdma,port=20049" mount options
with "proto=tcp" until commit 9d11b51ce7 is applied to your
NFS server.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
When deciding whether to send a Call inline, rpcrdma_marshal_req
doesn't take into account header bytes consumed by chunk lists.
This results in Call messages on the wire that are sometimes larger
than the inline threshold.
Likewise, when a Write list or Reply chunk is in play, the server's
reply has to emit an RDMA Send that includes a larger-than-minimal
RPC-over-RDMA header.
The actual size of a Call message cannot be estimated until after
the chunk lists have been registered. Thus the size of each
RPC-over-RDMA header can be estimated only after chunks are
registered; but the decision to register chunks is based on the size
of that header. Chicken, meet egg.
The best a client can do is estimate header size based on the
largest header that might occur, and then ensure that inline content
is always smaller than that.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Send buffer space is shared between the RPC-over-RDMA header and
an RPC message. A large RPC-over-RDMA header means less space is
available for the associated RPC message, which then has to be
moved via an RDMA Read or Write.
As more segments are added to the chunk lists, the header increases
in size. Typical modern hardware needs only a few segments to
convey the maximum payload size, but some devices and registration
modes may need a lot of segments to convey data payload. Sometimes
so many are needed that the remaining space in the Send buffer is
not enough for the RPC message. Sending such a message usually
fails.
To ensure a transport can always make forward progress, cap the
number of RDMA segments that are allowed in chunk lists. This
prevents less-capable devices and memory registrations from
consuming a large portion of the Send buffer by reducing the
maximum data payload that can be conveyed with such devices.
For now I choose an arbitrary maximum of 8 RDMA segments. This
allows a maximum size RPC-over-RDMA header to fit nicely in the
current 1024 byte inline threshold with over 700 bytes remaining
for an inline RPC message.
The current maximum data payload of NFS READ or WRITE requests is
one megabyte. To convey that payload on a client with 4KB pages,
each chunk segment would need to handle 32 or more data pages. This
is well within the capabilities of FMR. For physical registration,
the maximum payload size on platforms with 4KB pages is reduced to
32KB.
For FRWR, a device's maximum page list depth would need to be at
least 34 to support the maximum 1MB payload. A device with a smaller
maximum page list depth means the maximum data payload is reduced
when using that device.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Currently the sysctls that allow setting the inline threshold allow
any value to be set.
Small values only make the transport run slower. The default 1KB
setting is as low as is reasonable. And the logic that decides how
to divide a Send buffer between RPC-over-RDMA header and RPC message
assumes (but does not check) that the lower bound is not crazy (say,
57 bytes).
Send and receive buffers share a page with some control information.
Values larger than about 3KB can't be supported, currently.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
RPC-over-RDMA transports have a limit on how large a backward
direction (backchannel) RPC message can be. Ensure that the NFSv4.x
CREATE_SESSION operation advertises this limit to servers.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
This merges the Qualcomm SOC tree with the net-next, solving the
merge conflict in the SMD API between the two.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXNk3mAAoJEAsfOT8Nma3FJkEQAI/0hXy5cmACx37crT6Zishn
+IjfnVFY0w+VWX6DdOVvAiNufMEBnvbo6jAGFxYALvZTRyt2kBxa4SgbQfgeDGp2
UIXU5vYjYamggBALcDc9b6xBxxYEdIjGpqueIrGPyLzuAHxC74Ss4a1UYE8kC8R4
lznB+BCuyv4WQL5GgWK2vTcFH6BD9Yw6FARLI2YJaR1oRrvl+3WCnPSKZUYPi31R
vxvGYV5yiZ5YyOyZX4VHpn2vk1+dD7mhzXS/X1cGdV3Ysaqdw3bXqM0C31HW29F5
qxnVNtbnoNMIpQQaFRMT00x7yx+12gCXXgb7Q1CDzvXFQXkGvQRlcHagmvi7dqI9
YrttwbmS9LgezzEuaUTNfOJpMS9x7ghSayzHEmVLn9/7IMxdmqUiu5f9uKMjQHsW
D3hWCTahVXf9Y+UWKOEtvnazyEcdQIuuBEzZQw05cl8OntGk1eWtHLikZVyW+zWO
lgWX2n84Bc3BPz6o0Nu+tW4WbTlo5irnA8YLeBRJNUFgwlwlhm0Io2Q7aVhW27jZ
rDPfil1KduOGgf0T2EMD9XdQf0tXMaJ7y7vzuGtaWy83kYaEDzYOmnLK39LQYlxm
yYoU5Yo9fsoFXTvLyD/3FBD8Wqblv5C8ObIB4Lc3uhmrLBYA2mjFAmsxqDV2Giuz
So/uZlqnJauYFnMBAHUu
=UpGH
-----END PGP SIGNATURE-----
Merge tag 'net-next-qcom-soc-4.7-2-merge' of git://github.com/andersson/kernel
Merge tag 'qcom-soc-for-4.7-2' into net-next
This merges the Qualcomm SOC tree with the net-next, solving the
merge conflict in the SMD API between the two.
Pull parallel filesystem directory handling update from Al Viro.
This is the main parallel directory work by Al that makes the vfs layer
able to do lookup and readdir in parallel within a single directory.
That's a big change, since this used to be all protected by the
directory inode mutex.
The inode mutex is replaced by an rwsem, and serialization of lookups of
a single name is done by a "in-progress" dentry marker.
The series begins with xattr cleanups, and then ends with switching
filesystems over to actually doing the readdir in parallel (switching to
the "iterate_shared()" that only takes the read lock).
A more detailed explanation of the process from Al Viro:
"The xattr work starts with some acl fixes, then switches ->getxattr to
passing inode and dentry separately. This is the point where the
things start to get tricky - that got merged into the very beginning
of the -rc3-based #work.lookups, to allow untangling the
security_d_instantiate() mess. The xattr work itself proceeds to
switch a lot of filesystems to generic_...xattr(); no complications
there.
After that initial xattr work, the series then does the following:
- untangle security_d_instantiate()
- convert a bunch of open-coded lookup_one_len_unlocked() to calls of
that thing; one such place (in overlayfs) actually yields a trivial
conflict with overlayfs fixes later in the cycle - overlayfs ended
up switching to a variant of lookup_one_len_unlocked() sans the
permission checks. I would've dropped that commit (it gets
overridden on merge from #ovl-fixes in #for-next; proper resolution
is to use the variant in mainline fs/overlayfs/super.c), but I
didn't want to rebase the damn thing - it was fairly late in the
cycle...
- some filesystems had managed to depend on lookup/lookup exclusion
for *fs-internal* data structures in a way that would break if we
relaxed the VFS exclusion. Fixing hadn't been hard, fortunately.
- core of that series - parallel lookup machinery, replacing
->i_mutex with rwsem, making lookup_slow() take it only shared. At
that point lookups happen in parallel; lookups on the same name
wait for the in-progress one to be done with that dentry.
Surprisingly little code, at that - almost all of it is in
fs/dcache.c, with fs/namei.c changes limited to lookup_slow() -
making it use the new primitive and actually switching to locking
shared.
- parallel readdir stuff - first of all, we provide the exclusion on
per-struct file basis, same as we do for read() vs lseek() for
regular files. That takes care of most of the needed exclusion in
readdir/readdir; however, these guys are trickier than lookups, so
I went for switching them one-by-one. To do that, a new method
'->iterate_shared()' is added and filesystems are switched to it
as they are either confirmed to be OK with shared lock on directory
or fixed to be OK with that. I hope to kill the original method
come next cycle (almost all in-tree filesystems are switched
already), but it's still not quite finished.
- several filesystems get switched to parallel readdir. The
interesting part here is dealing with dcache preseeding by readdir;
that needs minor adjustment to be safe with directory locked only
shared.
Most of the filesystems doing that got switched to in those
commits. Important exception: NFS. Turns out that NFS folks, with
their, er, insistence on VFS getting the fuck out of the way of the
Smart Filesystem Code That Knows How And What To Lock(tm) have
grown the locking of their own. They had their own homegrown
rwsem, with lookup/readdir/atomic_open being *writers* (sillyunlink
is the reader there). Of course, with VFS getting the fuck out of
the way, as requested, the actual smarts of the smart filesystem
code etc. had become exposed...
- do_last/lookup_open/atomic_open cleanups. As the result, open()
without O_CREAT locks the directory only shared. Including the
->atomic_open() case. Backmerge from #for-linus in the middle of
that - atomic_open() fix got brought in.
- then comes NFS switch to saner (VFS-based ;-) locking, killing the
homegrown "lookup and readdir are writers" kinda-sorta rwsem. All
exclusion for sillyunlink/lookup is done by the parallel lookups
mechanism. Exclusion between sillyunlink and rmdir is a real rwsem
now - rmdir being the writer.
Result: NFS lookups/readdirs/O_CREAT-less opens happen in parallel
now.
- the rest of the series consists of switching a lot of filesystems
to parallel readdir; in a lot of cases ->llseek() gets simplified
as well. One backmerge in there (again, #for-linus - rockridge
fix)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (74 commits)
ext4: switch to ->iterate_shared()
hfs: switch to ->iterate_shared()
hfsplus: switch to ->iterate_shared()
hostfs: switch to ->iterate_shared()
hpfs: switch to ->iterate_shared()
hpfs: handle allocation failures in hpfs_add_pos()
gfs2: switch to ->iterate_shared()
f2fs: switch to ->iterate_shared()
afs: switch to ->iterate_shared()
befs: switch to ->iterate_shared()
befs: constify stuff a bit
isofs: switch to ->iterate_shared()
get_acorn_filename(): deobfuscate a bit
btrfs: switch to ->iterate_shared()
logfs: no need to lock directory in lseek
switch ecryptfs to ->iterate_shared
9p: switch to ->iterate_shared()
fat: switch to ->iterate_shared()
romfs, squashfs: switch to ->iterate_shared()
more trivial ->iterate_shared conversions
...
The problem is that fib_info->nh is [0] so the struct fib_info
allocation size depends on number of nexthops. If we just copy fib_info,
we do not copy the nexthops info and driver accesses memory which is not
ours.
Given the fact that fib4 does not defer operations and therefore it does
not need copy, just pass the pointer down to drivers as it was done
before.
Fixes: 850d0cbc91 ("switchdev: remove pointers from switchdev objects")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We saw the following extra refcount release on veth device:
kernel: [7957821.463992] unregister_netdevice: waiting for mesos50284 to become free. Usage count = -1
Since we heavily use mirred action to redirect packets to veth, I think
this is caused by the following race condition:
CPU0:
tcf_mirred_release(): (in RCU callback)
struct net_device *dev = rcu_dereference_protected(m->tcfm_dev, 1);
CPU1:
mirred_device_event():
spin_lock_bh(&mirred_list_lock);
list_for_each_entry(m, &mirred_list, tcfm_list) {
if (rcu_access_pointer(m->tcfm_dev) == dev) {
dev_put(dev);
/* Note : no rcu grace period necessary, as
* net_device are already rcu protected.
*/
RCU_INIT_POINTER(m->tcfm_dev, NULL);
}
}
spin_unlock_bh(&mirred_list_lock);
CPU0:
tcf_mirred_release():
spin_lock_bh(&mirred_list_lock);
list_del(&m->tcfm_list);
spin_unlock_bh(&mirred_list_lock);
if (dev) // <======== Stil refers to the old m->tcfm_dev
dev_put(dev); // <======== dev_put() is called on it again
The action init code path is good because it is impossible to modify
an action that is being removed.
So, fix this by moving everything under the spinlock.
Fixes: 2ee22a90c7 ("net_sched: act_mirred: remove spinlock in fast path")
Fixes: 6bd00b8506 ("act_mirred: fix a race condition on mirred_list")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The publication field of the old netlink API should contain the
publication key and not the publication reference.
Fixes: 44a8ae94fd (tipc: convert legacy nl name table dump to nl compat)
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Backmerge to resolve a conflict in ovl_lookup_real();
"ovl_lookup_real(): use lookup_one_len_unlocked()" instead,
but it was too late in the cycle to rebase.
When we free cb->skb after a dump, we do it after releasing the
lock. This means that a new dump could have started in the time
being and we'll end up freeing their skb instead of ours.
This patch saves the skb and module before we unlock so we free
the right memory.
Fixes: 16b304f340 ("netlink: Eliminate kmalloc in netlink dump operation.")
Reported-by: Baozeng Ding <sploving1@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make sure the socket for which the user is listing publication exists
before parsing the socket netlink attributes.
Prior to this patch a call without any socket caused a NULL pointer
dereference in tipc_nl_publ_dump().
Tested-and-reported-by: Baozeng Ding <sploving1@gmail.com>
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.cm>
Signed-off-by: David S. Miller <davem@davemloft.net>
memory_usage must be decreased in dequeue_func(), not in
fq_codel_dequeue(), otherwise packets dropped by Codel algo
are missing this decrease.
Also we need to clear memory_usage in fq_codel_reset()
Fixes: 95b58430ab ("fq_codel: add memory limitation per queue")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Follow-up for 8a3a4c6e7b ("net: make sch_handle_ingress() drop
monitor ready") to also make the egress side drop monitor ready.
Also here only TC_ACT_SHOT is a clear indication that something
went wrong. Hence don't provide false positives to drop monitors
such as 'perf record -e skb:kfree_skb ...'.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function setup_timer combines the initialization of a timer with the
initialization of the timer's function and data fields. The mulitiline
code for timer initialization is now replaced with function setup_timer.
Also, quoting the mod_timer() function comment:
-> mod_timer() is a more efficient way to update the expire field of an
active timer (if the timer is inactive it will be activated).
Use setup_timer() and mod_timer() to setup and arm a timer, making the
code compact and aid readablity.
Signed-off-by: Muhammad Falak R Wani <falakreyaz@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth-next 2016-05-14
Here are two more Bluetooth patches for the 4.7 kernel which we wanted
to get into net-next before the merge window opens. Please let me know
if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This work adds a generic facility for use from eBPF JIT compilers
that allows for further hardening of JIT generated images through
blinding constants. In response to the original work on BPF JIT
spraying published by Keegan McAllister [1], most BPF JITs were
changed to make images read-only and start at a randomized offset
in the page, where the rest was filled with trap instructions. We
have this nowadays in x86, arm, arm64 and s390 JIT compilers.
Additionally, later work also made eBPF interpreter images read
only for kernels supporting DEBUG_SET_MODULE_RONX, that is, x86,
arm, arm64 and s390 archs as well currently. This is done by
default for mentioned JITs when JITing is enabled. Furthermore,
we had a generic and configurable constant blinding facility on our
todo for quite some time now to further make spraying harder, and
first implementation since around netconf 2016.
We found that for systems where untrusted users can load cBPF/eBPF
code where JIT is enabled, start offset randomization helps a bit
to make jumps into crafted payload harder, but in case where larger
programs that cross page boundary are injected, we again have some
part of the program opcodes at a page start offset. With improved
guessing and more reliable payload injection, chances can increase
to jump into such payload. Elena Reshetova recently wrote a test
case for it [2, 3]. Moreover, eBPF comes with 64 bit constants, which
can leave some more room for payloads. Note that for all this,
additional bugs in the kernel are still required to make the jump
(and of course to guess right, to not jump into a trap) and naturally
the JIT must be enabled, which is disabled by default.
For helping mitigation, the general idea is to provide an option
bpf_jit_harden that admins can tweak along with bpf_jit_enable, so
that for cases where JIT should be enabled for performance reasons,
the generated image can be further hardened with blinding constants
for unpriviledged users (bpf_jit_harden == 1), with trading off
performance for these, but not for privileged ones. We also added
the option of blinding for all users (bpf_jit_harden == 2), which
is quite helpful for testing f.e. with test_bpf.ko. There are no
further e.g. hardening levels of bpf_jit_harden switch intended,
rationale is to have it dead simple to use as on/off. Since this
functionality would need to be duplicated over and over for JIT
compilers to use, which are already complex enough, we provide a
generic eBPF byte-code level based blinding implementation, which is
then just transparently JITed. JIT compilers need to make only a few
changes to integrate this facility and can be migrated one by one.
This option is for eBPF JITs and will be used in x86, arm64, s390
without too much effort, and soon ppc64 JITs, thus that native eBPF
can be blinded as well as cBPF to eBPF migrations, so that both can
be covered with a single implementation. The rule for JITs is that
bpf_jit_blind_constants() must be called from bpf_int_jit_compile(),
and in case blinding is disabled, we follow normally with JITing the
passed program. In case blinding is enabled and we fail during the
process of blinding itself, we must return with the interpreter.
Similarly, in case the JITing process after the blinding failed, we
return normally to the interpreter with the non-blinded code. Meaning,
interpreter doesn't change in any way and operates on eBPF code as
usual. For doing this pre-JIT blinding step, we need to make use of
a helper/auxiliary register, here BPF_REG_AX. This is strictly internal
to the JIT and not in any way part of the eBPF architecture. Just like
in the same way as JITs internally make use of some helper registers
when emitting code, only that here the helper register is one
abstraction level higher in eBPF bytecode, but nevertheless in JIT
phase. That helper register is needed since f.e. manually written
program can issue loads to all registers of eBPF architecture.
The core concept with the additional register is: blind out all 32
and 64 bit constants by converting BPF_K based instructions into a
small sequence from K_VAL into ((RND ^ K_VAL) ^ RND). Therefore, this
is transformed into: BPF_REG_AX := (RND ^ K_VAL), BPF_REG_AX ^= RND,
and REG <OP> BPF_REG_AX, so actual operation on the target register
is translated from BPF_K into BPF_X one that is operating on
BPF_REG_AX's content. During rewriting phase when blinding, RND is
newly generated via prandom_u32() for each processed instruction.
64 bit loads are split into two 32 bit loads to make translation and
patching not too complex. Only basic thing required by JITs is to
call the helper bpf_jit_blind_constants()/bpf_jit_prog_release_other()
pair, and to map BPF_REG_AX into an unused register.
Small bpf_jit_disasm extract from [2] when applied to x86 JIT:
echo 0 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f5e9 + <x>:
[...]
39: mov $0xa8909090,%eax
3e: mov $0xa8909090,%eax
43: mov $0xa8ff3148,%eax
48: mov $0xa89081b4,%eax
4d: mov $0xa8900bb0,%eax
52: mov $0xa810e0c1,%eax
57: mov $0xa8908eb4,%eax
5c: mov $0xa89020b0,%eax
[...]
echo 1 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f1e5 + <x>:
[...]
39: mov $0xe1192563,%r10d
3f: xor $0x4989b5f3,%r10d
46: mov %r10d,%eax
49: mov $0xb8296d93,%r10d
4f: xor $0x10b9fd03,%r10d
56: mov %r10d,%eax
59: mov $0x8c381146,%r10d
5f: xor $0x24c7200e,%r10d
66: mov %r10d,%eax
69: mov $0xeb2a830e,%r10d
6f: xor $0x43ba02ba,%r10d
76: mov %r10d,%eax
79: mov $0xd9730af,%r10d
7f: xor $0xa5073b1f,%r10d
86: mov %r10d,%eax
89: mov $0x9a45662b,%r10d
8f: xor $0x325586ea,%r10d
96: mov %r10d,%eax
[...]
As can be seen, original constants that carry payload are hidden
when enabled, actual operations are transformed from constant-based
to register-based ones, making jumps into constants ineffective.
Above extract/example uses single BPF load instruction over and
over, but of course all instructions with constants are blinded.
Performance wise, JIT with blinding performs a bit slower than just
JIT and faster than interpreter case. This is expected, since we
still get all the performance benefits from JITing and in normal
use-cases not every single instruction needs to be blinded. Summing
up all 296 test cases averaged over multiple runs from test_bpf.ko
suite, interpreter was 55% slower than JIT only and JIT with blinding
was 8% slower than JIT only. Since there are also some extremes in
the test suite, I expect for ordinary workloads that the performance
for the JIT with blinding case is even closer to JIT only case,
f.e. nmap test case from suite has averaged timings in ns 29 (JIT),
35 (+ blinding), and 151 (interpreter).
BPF test suite, seccomp test suite, eBPF sample code and various
bigger networking eBPF programs have been tested with this and were
running fine. For testing purposes, I also adapted interpreter and
redirected blinded eBPF image to interpreter and also here all tests
pass.
[1] http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html
[2] https://github.com/01org/jit-spray-poc-for-ksp/
[3] http://www.openwall.com/lists/kernel-hardening/2016/05/03/5
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Elena Reshetova <elena.reshetova@intel.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the blinding is strictly only called from inside eBPF JITs,
we need to change signatures for bpf_int_jit_compile() and
bpf_prog_select_runtime() first in order to prepare that the
eBPF program we're dealing with can change underneath. Hence,
for call sites, we need to return the latest prog. No functional
change in this patch.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Split the HAVE_BPF_JIT into two for distinguishing cBPF and eBPF JITs.
Current cBPF ones:
# git grep -n HAVE_CBPF_JIT arch/
arch/arm/Kconfig:44: select HAVE_CBPF_JIT
arch/mips/Kconfig:18: select HAVE_CBPF_JIT if !CPU_MICROMIPS
arch/powerpc/Kconfig:129: select HAVE_CBPF_JIT
arch/sparc/Kconfig:35: select HAVE_CBPF_JIT
Current eBPF ones:
# git grep -n HAVE_EBPF_JIT arch/
arch/arm64/Kconfig:61: select HAVE_EBPF_JIT
arch/s390/Kconfig:126: select HAVE_EBPF_JIT if PACK_STACK && HAVE_MARCH_Z196_FEATURES
arch/x86/Kconfig:94: select HAVE_EBPF_JIT if X86_64
Later code also needs this facility to check for eBPF JITs.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Besides others, remove redundant comments where the code is self
documenting enough, and properly indent various bpf_verifier_ops
and bpf_prog_type_list declarations. Moreover, remove two exports
that actually have no module user.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_hdr() is slightly more expensive than using skb->data in contexts
where we know they point to the same byte.
In receive path, tcp_v4_rcv() and tcp_v6_rcv() are in this situation,
as tcp header has not been pulled yet.
In output path, the same can be said when we just pushed the tcp header
in the skb, in tcp_transmit_skb() and tcp_make_synack()
Also factorize the two checks for tcb->tcp_flags & TCPHDR_SYN in
tcp_transmit_skb() and pass tcp header pointer to tcp_ecn_send(),
so that compiler can further optimize and avoid a reload.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__sock_cmsg_send() might return different error codes, not only -EINVAL.
Fixes: 24025c465f ("ipv4: process socket-level control messages in IPv4")
Fixes: ad1e46a837 ("ipv6: process socket-level control messages in IPv6")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Having multiple loadable modules with the same name cannot work
with modprobe, and having both net/qrtr/smd.ko and drivers/soc/qcom/smd.ko
results in a (somewhat cryptic) build error:
ERROR: "qcom_smd_driver_unregister" [net/qrtr/smd.ko] undefined!
ERROR: "qcom_smd_driver_register" [net/qrtr/smd.ko] undefined!
ERROR: "qcom_smd_set_drvdata" [net/qrtr/smd.ko] undefined!
ERROR: "qcom_smd_send" [net/qrtr/smd.ko] undefined!
ERROR: "qcom_smd_get_drvdata" [net/qrtr/smd.ko] undefined!
ERROR: "qcom_smd_driver_unregister" [drivers/soc/qcom/wcnss_ctrl.ko] undefined!
ERROR: "qcom_smd_driver_register" [drivers/soc/qcom/wcnss_ctrl.ko] undefined!
ERROR: "qcom_smd_set_drvdata" [drivers/soc/qcom/wcnss_ctrl.ko] undefined!
ERROR: "qcom_smd_send" [drivers/soc/qcom/wcnss_ctrl.ko] undefined!
ERROR: "qcom_smd_get_drvdata" [drivers/soc/qcom/wcnss_ctrl.ko] undefined!
Also, the qrtr driver uses the SMD interface and has a Kconfig dependency,
but also allows for compile-testing when SMD is disabled. However, if
with QCOM_SMD=m and COMPILE_TEST=y we can end up with QRTR_SMD=y and
that fails with a related link error.
The changes the dependency so we can still compile-test the driver but
not have it built-in if SMD is a module, to avoid running in the broken
configuration, and changes the Makefile to provide the driver under
a different module name.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: bdabad3e36 ("net: Add Qualcomm IPC router")
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce a new command in ndo_setup_tc() for hardware offloaded
filters, to call the NIC driver, and make it update the statistics.
This will be done before dumping the filter and its statistics.
Signed-off-by: Amir Vadai <amirva@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement the stats_update callback that will be called by NIC drivers
for hardware offloaded filters.
Signed-off-by: Amir Vadai <amirva@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On devices that support TC U32 offloads, this flag enables a filter to be
added only to HW. skip-sw and skip-hw are mutually exclusive flags. By
default without any flags, the filter is added to both HW and SW, but no
error checks are done in case of failure to add to HW. With skip-sw,
failure to add to HW is treated as an error.
Here is a sample script that adds 2 filters, one with skip-sw and the other
with skip-hw flag.
# add ingress qdisc
tc qdisc add dev p4p1 ingress
# enable hw tc offload.
ethtool -K p4p1 hw-tc-offload on
# add u32 filter with skip-sw flag.
tc filter add dev p4p1 parent ffff: protocol ip prio 99 \
handle 800:0:1 u32 ht 800: flowid 800:1 \
skip-sw \
match ip src 192.168.1.0/24 \
action drop
# add u32 filter with skip-hw flag.
tc filter add dev p4p1 parent ffff: protocol ip prio 99 \
handle 800:0:2 u32 ht 800: flowid 800:2 \
skip-hw \
match ip src 192.168.2.0/24 \
action drop
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The nf_conntrack_core.c fix in 'net' is not relevant in 'net-next'
because we no longer have a per-netns conntrack hash.
The ip_gre.c conflict as well as the iwlwifi ones were cases of
overlapping changes.
Conflicts:
drivers/net/wireless/intel/iwlwifi/mvm/tx.c
net/ipv4/ip_gre.c
net/netfilter/nf_conntrack_core.c
Signed-off-by: David S. Miller <davem@davemloft.net>
Switchdev has been around for quite a while now, putting "EXPERIMENTAL"
in the description is no longer accurate, drop it.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, when creating or updating a route, no check is performed
in both ipv4 and ipv6 code to the hoplimit value.
The caller can i.e. set hoplimit to 256, and when such route will
be used, packets will be sent with hoplimit/ttl equal to 0.
This commit adds checks for the RTAX_HOPLIMIT value, in both ipv4
ipv6 route code, substituting any value greater than 255 with 255.
This is consistent with what is currently done for ADVMSS and MTU
in the ipv4 code.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The slab name ends up being visible in the directory structure under
/sys, and even if you don't have access rights to the file you can see
the filenames.
Just use a 64-bit counter instead of the pointer to the 'net' structure
to generate a unique name.
This code will go away in 4.7 when the conntrack code moves to a single
kmemcache, but this is the backportable simple solution to avoiding
leaking kernel pointers to user space.
Fixes: 5b3501faa8 ("netfilter: nf_conntrack: per netns nf_conntrack_cachep")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
* Change SMD callback parameters
* Use writecombine mapping for SMEM
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXFvdZAAoJEFKiBbHx2RXVxPQQAJnEp1+5CDN7ziQ/zJ3juSxb
b7rOr5Z0qMDhh0H9BFdmhGzFctbY0AOqerfh138pXa6KN7maXT/enUE4vvtWUp1B
HCr9xt13waoa8e6umj/D1qU5mMLUX+JraVkIo6Dyd1g6pVmdrbwzP+Z0tW0x7CvW
n5r9eDYts66vGbQ6xzK40rTi22sEN32AY4mezKj40L8o0NvfZCCwb3dfw2RQzLHK
0KnSQ0PKqU1ZWrDc2YuvRsyh5KGv+t8vvnpMVVUi4TRx3qzoGpHpH3fE3evvMGKY
ya6nmFNFPZ+ZlzsGDx10RqLu58NmjS8LXY6O+yCGy9NPuonKpwHgYD2wxut2NR/W
TTtMhvJWKNIwhXR8ZRDaVBWQJBwoLvpNQT1zOIUuUKf+SZ2DbfgVpte7NZvrgEka
aY4nnZMbI0MvpaAHUtllWbC4xqMQJXY+4wrahKQgKF/l5H/0XQ99fHHz86DS9LCO
y+dfmU98dv/1X8qmxNfsULtJwbxi2Xwzt3MMMZb5v138STyMWeqWOLm+KGzgs1hu
Pc/RLQALdJpQEcg74s5xt2yxSqRuzDig7FtRT1a2SX33vi9YohQEgXCDe+IOJC5l
sm1Ppi3oO/yUcocUDFTBz/g6zD4SWywu2gmBcZIh/JVy8Sd053WwlaKs1Nld6ta7
QPx6rga7lgrYW409RS2A
=Jzqj
-----END PGP SIGNATURE-----
Merge tag 'qcom-soc-for-4.7-2' into net-next
This merges the Qualcomm SOC tree with the net-next, solving the
merge conflict in the SMD API between the two.
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Clean up: Pass in just the piece of the svc_rqst that is needed
here.
While we're in the area, add an informative documenting comment.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
If the server has forced a disconnect, the associated QP has not
been moved to the Error state, and thus Receives are still posted.
Ensure Receives (and any other outstanding WRs) are drained to
release resources that can be freed during teardown of the
svcrdma_xprt.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Since backward direction support was added, the rq_depth was
increased to accommodate both forward and backward Receives.
But only forward Receives need to be posted after a connection
has been accepted. Receives for backward replies are posted as
needed by svc_rdma_bc_sendto().
This doesn't break anything, but it means some resources are
wasted.
Fixes: 03fe993153 ('svcrdma: Define maximum number of ...')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Clean up: svc_rdma_get_read_chunk() already returns a pointer
to the Read list. No need to set "ch" again to the value it
already contains.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Get a fresh op_ctxt in send_reply() instead of in svc_rdma_sendto().
This ensures that svc_rdma_put_context() is invoked only once if
send_reply() fails.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
An xdr_buf has a head, a vector of pages, and a tail. Each
RPC request is presented to the NFS server contained in an
xdr_buf.
The RDMA transport would like to supply the NFS server with only
the NFS WRITE payload bytes in the page vector. In some common
cases, that would allow the NFS server to swap those pages right
into the target file's page cache.
Have the transport's RDMA Read logic put XDR pad bytes in the tail
iovec, and not in the pages that hold the data payload.
The NFSv3 WRITE XDR decoder is finicky about the lengths involved,
so make sure it is looking in the correct places when computing
the total length of the incoming NFS WRITE request.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Allow both IPv4 and IPv6 to bind same port at the same time,
restricts use of the IPv6 socket to IPv6 communication.
Changes from v1:
- Check rdma_set_afonly return value (suggested by Leon Romanovsky)
Changes from v2:
- Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Shirley Ma <shirley.ma@oracle.com>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The SRP initiator allows to set max_sectors to a value that exceeds
the largest amount of data that can be mapped at once with an mlx4
HCA using fast registration and a page size of 4 KB. Hence modify
ib_map_mr_sg() such that it can map partial sg-elements. If an
sg-element has been mapped partially, let the caller know
which fraction has been mapped by adjusting *sg_offset.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
With all the latest fixes applied, I am still able to reproduce this
(and other) warning(s):
WARNING: CPU: 1 PID: 19684 at ../kernel/workqueue.c:4092 destroy_workqueue+0x70a/0x770()
...
Call Trace:
[<ffffffff819fee81>] ? dump_stack+0xb3/0x112
[<ffffffff8117377e>] ? warn_slowpath_common+0xde/0x140
[<ffffffff811ce68a>] ? destroy_workqueue+0x70a/0x770
[<ffffffff811739ae>] ? warn_slowpath_null+0x2e/0x40
[<ffffffff811ce68a>] ? destroy_workqueue+0x70a/0x770
[<ffffffffa0c944c9>] ? hci_unregister_dev+0x2a9/0x720 [bluetooth]
[<ffffffffa0b301db>] ? vhci_release+0x7b/0xf0 [hci_vhci]
[<ffffffffa0b30160>] ? vhci_flush+0x50/0x50 [hci_vhci]
[<ffffffff8117cd73>] ? do_exit+0x863/0x2b90
This is due to race present in the hci_unregister_dev path.
hdev->power_on work races with hci_dev_do_close. One tries to open,
the other tries to close, leading to warning like the above. (Another
example is a warning in kobject_get or kobject_put depending on who
wins the race.)
Fix this by switching those two racers to ensure hdev->power_on never
triggers while hci_dev_do_close is in progress.
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
While testing an OpenStack configuration using VXLANs I saw the following
call trace:
RIP: 0010:[<ffffffff815fad49>] udp4_lib_lookup_skb+0x49/0x80
RSP: 0018:ffff88103867bc50 EFLAGS: 00010286
RAX: ffff88103269bf00 RBX: ffff88103269bf00 RCX: 00000000ffffffff
RDX: 0000000000004300 RSI: 0000000000000000 RDI: ffff880f2932e780
RBP: ffff88103867bc60 R08: 0000000000000000 R09: 000000009001a8c0
R10: 0000000000004400 R11: ffffffff81333a58 R12: ffff880f2932e794
R13: 0000000000000014 R14: 0000000000000014 R15: ffffe8efbfd89ca0
FS: 0000000000000000(0000) GS:ffff88103fd80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000488 CR3: 0000000001c06000 CR4: 00000000001426e0
Stack:
ffffffff81576515 ffffffff815733c0 ffff88103867bc98 ffffffff815fcc17
ffff88103269bf00 ffffe8efbfd89ca0 0000000000000014 0000000000000080
ffffe8efbfd89ca0 ffff88103867bcc8 ffffffff815fcf8b ffff880f2932e794
Call Trace:
[<ffffffff81576515>] ? skb_checksum+0x35/0x50
[<ffffffff815733c0>] ? skb_push+0x40/0x40
[<ffffffff815fcc17>] udp_gro_receive+0x57/0x130
[<ffffffff815fcf8b>] udp4_gro_receive+0x10b/0x2c0
[<ffffffff81605863>] inet_gro_receive+0x1d3/0x270
[<ffffffff81589e59>] dev_gro_receive+0x269/0x3b0
[<ffffffff8158a1b8>] napi_gro_receive+0x38/0x120
[<ffffffffa0871297>] gro_cell_poll+0x57/0x80 [vxlan]
[<ffffffff815899d0>] net_rx_action+0x160/0x380
[<ffffffff816965c7>] __do_softirq+0xd7/0x2c5
[<ffffffff8107d969>] run_ksoftirqd+0x29/0x50
[<ffffffff8109a50f>] smpboot_thread_fn+0x10f/0x160
[<ffffffff8109a400>] ? sort_range+0x30/0x30
[<ffffffff81096da8>] kthread+0xd8/0xf0
[<ffffffff81693c82>] ret_from_fork+0x22/0x40
[<ffffffff81096cd0>] ? kthread_park+0x60/0x60
The following trace is seen when receiving a DHCP request over a flow-based
VXLAN tunnel. I believe this is caused by the metadata dst having a NULL
dev value and as a result dev_net(dev) is causing a NULL pointer dereference.
To resolve this I am replacing the check for skb_dst(skb)->dev with just
skb->dev. This makes sense as the callers of this function are usually in
the receive path and as such skb->dev should always be populated. In
addition other functions in the area where these are called are already
using dev_net(skb->dev) to determine the namespace the UDP packet belongs
in.
Fixes: 63058308cd ("udp: Add udp6_lib_lookup_skb and udp4_lib_lookup_skb")
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sunrpc is using SOCKWQ_ASYNC_NOSPACE without setting SOCK_FASYNC,
so the recent optimizations done in sk_set_bit() and sk_clear_bit()
broke it.
There is still the risk that a subsequent sock_fasync() call
would clear SOCK_FASYNC, but sunrpc does not use this yet.
Fixes: 9317bb6982 ("net: SOCKWQ_ASYNC_NOSPACE optimizations")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jiri Pirko <jiri@resnulli.us>
Reported-by: Huang, Ying <ying.huang@intel.com>
Tested-by: Jiri Pirko <jiri@resnulli.us>
Tested-by: Huang, Ying <ying.huang@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When an ACTIVATE or data packet is received in a link in state
ESTABLISHING, the link does not immediately change state to
ESTABLISHED, but does instead return a LINK_UP event to the caller,
which will execute the state change in a different lock context.
This non-atomic approach incurs a low risk that we may have two
LINK_UP events pending simultaneously for the same link, resulting
in the final part of the setup procedure being executed twice. The
only potential harm caused by this it that we may see two LINK_UP
events issued to subsribers of the topology server, something that
may cause confusion.
This commit eliminates this risk by checking if the link is already
up before proceeding with the second half of the setup.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When dealing with WCCP in gre6 tunnel, it sets the wrong tpi->protocol,
that is, ETH_P_IP instead of ETH_P_IPV6 for the encapuslated traffic.
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Do not include attribute IFLA_GRE_TOS.
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* completion and fixups of nla_put_64_64bit() work
* remove a/b/g/n from wext nickname to avoid confusion
with 11ac (which wouldn't even fit fully there due to
string length restrictions)
along with some other minor changes/cleanups.
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJXNGMwAAoJEGt7eEactAAdrRMP/RGmFd1I8Z7gIsKc7pdknf20
4xLA3v8cqD7c9OtqlfVSfIyIosNl+pMatXK4eSaYzJICmdV9lU8VqgW6eH9lmDuM
q8Eis2/i7ZozzYhANVFQxMRvmSED8MeRm49LOzRofuPg4FHzyTYd8+TRfQ+QGRm9
xlrKTJ4BwH5lZgmH20/cphbffo5j8IyjzjKw0MC/tjikIePww6Am/r1jpCnd0Mwz
rWwRa4pQ9JBV7UzzzdcMpdn3KJkteFq/gJPCpZVDo3Zf+L21UavxP99NoUWiUvUU
bEQ9FzcuB7Zyt/lCAyu3ECBGqLvskqseFg3zmDnUoL7GmCZ1PZlfsnfOhQP0HgpI
t/dyw+TYIu6d4ZHbqGM6q6usjIJLltaxATwvREq3VdWFn5fYFu0Em3CyQ0on1XGn
4anNdkilGxaVEHCFBvfLvMTEfsTj7eycsNsDrZEGYV3NwX/wfz+6thqqu8gXFUsC
4MB1fS+YPg7u72ne+oJSPyJyFofYXDSjdNqnZ6HSQEZHinE+FnDCywZlJesxzDsH
q1ABkKLnmMNIwHxbfRBE0eO8sKXo6kUTJeFmWHtxB8/LcsLsV0/8juUy9H5PPZPU
7TJhn4SZrlJb92xq1DQ0t1xsRxTLonbRuu9PtVVZMUhZP69PpzK9toC6eCo9bovn
XFdSQ2c2xRwRAiSjUpQP
=QNxF
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-davem-2016-05-12' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
Some more work for 4.7, notably:
* completion and fixups of nla_put_64_64bit() work
* remove a/b/g/n from wext nickname to avoid confusion
with 11ac (which wouldn't even fit fully there due to
string length restrictions)
along with some other minor changes/cleanups.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Since 4.4 we erronously use timestamp of the netlink skb (which is zero).
Bugzilla: https://bugzilla.netfilter.org/show_bug.cgi?id=1066
Fixes: b28b1e826f ("netfilter: nfnetlink_queue: use y2038 safe timestamp")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When using RSS, frames might not be processed in the correct order,
and thus AP_LINK_PS must be used; most likely with firmware keeping
track of the powersave state, this is the case in iwlwifi now.
In this case, the driver can use ieee80211_sta_ps_transition() to
still have mac80211 manage powersave buffering. However, for U-APSD
and PS-Poll this isn't sufficient. If the device can't manage that
entirely on its own, mac80211's code should be used.
To allow this, export two functions: ieee80211_sta_uapsd_trigger()
and ieee80211_sta_pspoll().
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
There's no harm in having drivers read the list, since they can
use RCU protection or RTNL locking; allow this to not require
each and every driver to also implement its own bookkeeping.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The devlist_mtx mutex was removed about two years ago, in favour of just
using RTNL/RCU protection. Remove the comment still referencing it.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This allows finding vendor IE from a specific vendor.
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Some hardware (iwlwifi an example) de-aggregate AMSDUs and copy the IV
as is to the generated MPDUs, so the same PN appears in multiple
packets without being a replay attack. Allow driver to explicitly
indicate that a frame is allowed to have the same PN as the previous
frame.
Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In some cases, after a sudden AP disappearing and reconnection to
another AP in the same ESS, user space gets the old AP in scan
results (cached). User space may decide to roam to that old AP
which will cause a disconnection and longer recovery.
Remove APs that are probably out of range from BSS table.
Signed-off-by: David Spinadel <david.spinadel@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This is the first NFC pull request for 4.7. With this one we
mainly have:
- Support for NXP's pn532 NFC chipset. The pn532 is based on the same
microcontroller as the pn533, but it talks to the host through i2c
instead of USB. By separating the pn533 driver into core and PHY
parts, we can not add the i2c layer and support the pn532 chipset.
- Support for NCI's loopback mode. This is a testing mode where each
packet received by the NFCC is sent back to the DH, allowing the
host to test that the controller can receive and send data.
- A few ACPI related fixes for the STMicro drivers, in order to match
the device tree naming scheme.
- A bunch of cleanups for the st-nci and the st21nfca STMicro drivers.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXMRA8AAoJEIqAPN1PVmxKknQP/2WyaVYD+jeRkM4pymRQ/6s6
V2+be68JlatvC53D6fbzo9FF8/QO7w++vTF+KI5nV6g/RvkPvKnxg2DC39fob0DA
6x7KaRuUrTwxBx4tfsvtNfUL4idRi3i6/hrYRJyzaNMHKAukDbjJQiCchveUhZeB
NlfYSEcieNnRJkGpdVGny4tea0fFOY6v+Uz74h2HboKbSD3SGNK29v47AqxH/ssg
Y9XycyYWWF1w2uOUfFzJFsvBAYcPlBDF4WqQ8SkamxhaPAB047Ysfp3y9vWGc8By
pdXLNAALzlCIt2JeGsrXhLG+Zf3y3w3G7HlcVbkh5BcB6fN2FYpPZurY8Km8Z6SM
ufgP8MwGOvOJa9NKapSm3FF6Tl1B3gCK3+SWT5N/lSNEFnaoZ6xMNGqXe1RhTght
+oqfuHLNO0YulQG0/ixJJl08p4eb6hhUMO3LKan3AYub9kgk63I43XeKXhIlqmLJ
tEh3xAPzbPyu+nwkrJF0B2tMeUdrCKZaucru/upYcZi7LDbQOvMjHgTS8YmX/aBk
gCOjINs+U1Q9jN2B7fMJe1uP6DJOTVW+xqiaLBLwHawRBhQQPbg+pLK/sHgFWcaa
bdUWASp7n5Bu5pY/rPHL+xvA8O5z5ngJ7ZPjqB2CvHr9XyQDna3Fo9jA/w3hgr/q
DgvCHukjvkDlUochojQZ
=zM22
-----END PGP SIGNATURE-----
Merge tag 'nfc-next-4.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/nfc-next
Samuel Ortiz says:
====================
NFC 4.7 pull request
This is the first NFC pull request for 4.7. With this one we
mainly have:
- Support for NXP's pn532 NFC chipset. The pn532 is based on the same
microcontroller as the pn533, but it talks to the host through i2c
instead of USB. By separating the pn533 driver into core and PHY
parts, we can not add the i2c layer and support the pn532 chipset.
- Support for NCI's loopback mode. This is a testing mode where each
packet received by the NFCC is sent back to the DH, allowing the
host to test that the controller can receive and send data.
- A few ACPI related fixes for the STMicro drivers, in order to match
the device tree naming scheme.
- A bunch of cleanups for the st-nci and the st21nfca STMicro drivers.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The dsa_switch structure contains a dsa_chip_data member called pd.
However in the rest of the code, pd is used for dsa_platform_data.
This is confusing. Rename it cd, which is already often used in dsa.c
and slave.c for this data type.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The switch drivers only use the master_dev member for dev_info()
messages. Now that the device is passed to the old style probe, and
new style drivers are probed as true linux drivers, this is no longer
needed.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Resetting the switch is something the driver does, not the framework.
So move the parsing of this property into the driver.
There are no in kernel users of this property, so moving it does not
break anything. There is however a board which will make use of this
property making its way into the kernel.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Applications such as OSPF and BFD need the original ingress device not
the VRF device; the latter can be derived from the former. To that end
add the skb_iif to inet_skb_parm and set it in ipv4 code after clearing
the skb control buffer similar to IPv6. From there the pktinfo can just
pull it from cb with the PKTINFO_SKB_CB cast.
The previous patch moving the skb->dev change to L3 means nothing else
is needed for IPv6; it just works.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the VRF driver uses the rx_handler to switch the skb device
to the VRF device. Switching the dev prior to the ip / ipv6 layer
means the VRF driver has to duplicate IP/IPv6 processing which adds
overhead and makes features such as retaining the ingress device index
more complicated than necessary.
This patch moves the hook to the L3 layer just after the first NF_HOOK
for PRE_ROUTING. This location makes exposing the original ingress device
trivial (next patch) and allows adding other NF_HOOKs to the VRF driver
in the future.
dev_queue_xmit_nit is exported so that the VRF driver can cycle the skb
with the switched device through the packet taps to maintain current
behavior (tcpdump can be used on either the vrf device or the enslaved
devices).
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Protocol for 4in6 tunnel is IPPROTO_IPIP. This was wrongly changed by
the last cleanup.
CC: Tom Herbert <tom@herbertland.com>
Fixes: 0d3c703a9d ("ipv6: Cleanup IPv6 tunnel receive path")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For ipgre interface in collect metadata mode, it doesn't make sense for the
interface to be of ARPHRD_IPGRE type. The outer header of received packets
is not needed, as all the information from it is present in metadata_dst. We
already don't set ipgre_header_ops for collect metadata interfaces, which is
the only consumer of mac_header pointing to the outer IP header.
Just set the interface type to ARPHRD_NONE in collect metadata mode for
ipgre (not gretap, that still correctly stays ARPHRD_ETHER) and reset
mac_header.
Fixes: a64b04d86d ("gre: do not assign header_ops in collect metadata mode")
Fixes: 2e15ea390e ("ip_gre: Add support to collect tunnel metadata.")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When using conntrack helpers from OVS, a common configuration is to
perform a lookup without specifying a helper, then go through a
firewalling policy, only to decide to attach a helper afterwards.
In this case, the initial lookup will cause a ct entry to be attached to
the skb, then the later commit with helper should attach the helper and
confirm the connection. However, the helper attachment has been missing.
If the user has enabled automatic helper attachment, then this issue
will be masked as it will be applied in init_conntrack(). It is also
masked if the action is executed from ovs_packet_cmd_execute() as that
will construct a fresh skb.
This patch fixes the issue by making an explicit call to try to assign
the helper if there is a discrepancy between the action's helper and the
current skb->nfct.
Fixes: cae3a26275 ("openvswitch: Allow attaching helpers to ct action")
Signed-off-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace 2 arguments (cnt and rtt) in the congestion control modules'
pkts_acked() function with a struct. This will allow adding more
information without having to modify existing congestion control
modules (tcp_nv in particular needs bytes in flight when packet
was sent).
As proposed by Neal Cardwell in his comments to the tcp_nv patch.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The process below was broken and is fixed with this patch.
//add an ife action and give it an instance id of 1
sudo tc actions add action ife encode \
type 0xDEAD allow mark dst 02:15:15:15:15:15 index 1
//create a filter which binds to ife action id 1
sudo tc filter add dev $DEV parent ffff: protocol ip prio 1 u32\
match ip dst 17.0.0.1/32 flowid 1:11 action ife index 1
Message before fix was:
RTNETLINK answers: Invalid argument
We have an error talking to the kernel
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The process below was broken and is fixed with this patch.
//add a skbedit action and give it an instance id of 1
sudo tc actions add action skbedit mark 10 index 1
//create a filter which binds to skbedit action id 1
sudo tc filter add dev $DEV parent ffff: protocol ip prio 1 u32\
match ip dst 17.0.0.1/32 flowid 1:10 action skbedit index 1
Message before fix was:
RTNETLINK answers: Invalid argument
We have an error talking to the kernel
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The process below was broken and is fixed with this patch.
//add a simple action and give it an instance id of 1
sudo tc actions add action simple sdata "foobar" index 1
//create a filter which binds to simple action id 1
sudo tc filter add dev $DEV parent ffff: protocol ip prio 1 u32\
match ip dst 17.0.0.1/32 flowid 1:10 action simple index 1
Message before fix was:
RTNETLINK answers: Invalid argument
We have an error talking to the kernel
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The process below was broken and is fixed with this patch.
//add an mirred action and give it an instance id of 1
sudo tc actions add action mirred egress mirror dev $MDEV index 1
//create a filter which binds to mirred action id 1
sudo tc filter add dev $DEV parent ffff: protocol ip prio 1 u32\
match ip dst 17.0.0.1/32 flowid 1:10 action mirred index 1
Message before bug fix was:
RTNETLINK answers: Invalid argument
We have an error talking to the kernel
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This was broken and is fixed with this patch.
//add an ipt action and give it an instance id of 1
sudo tc actions add action ipt -j mark --set-mark 2 index 1
//create a filter which binds to ipt action id 1
sudo tc filter add dev $DEV parent ffff: protocol ip prio 1 u32\
match ip dst 17.0.0.1/32 flowid 1:10 action ipt index 1
Message before bug fix was:
RTNETLINK answers: Invalid argument
We have an error talking to the kernel
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Late vlan action binding was broken and is fixed with this patch.
//add a vlan action to pop and give it an instance id of 1
sudo tc actions add action vlan pop index 1
//create filter which binds to vlan action id 1
sudo tc filter add dev $DEV parent ffff: protocol ip prio 1 u32 \
match ip dst 17.0.0.1/32 flowid 1:1 action vlan index 1
current message(before bug fix) was:
RTNETLINK answers: Invalid argument
We have an error talking to the kernel
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- remove useless skb size check in batadv_interface_rx
- basic netns support introduced by Andrew Lunn:
- prevent virtual interface from changing netns by setting
NETIF_F_NETNS_LOCAL
- create virtual interface within the netns of the first
hard-interface
- introduce detection of complex bridge loops and report event
to the user (via udev) when the Bridge Loop Avoidance mechanism
can't prevent them
- minor reference counting bugfixes for the hard_iface object that
couldn't make it via the net tree
- use kref_get() instead of kref_get_unless_zero() to make reference
counting bug more visible
- use batadv_compare_eth() all over the code when possible instead of
plain memcmp()
- minor code cleanup and style adjustments
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJXMjUdAAoJEJ4aZjxxc6bKhDsP/i3rRwoppaDK3gxuDkXqrqOj
AYhMJYulcizK2PPCzj24bZSttwcKDNWpjR67rBgXY02NpqSsW+qYFnCm+KokzgC0
3NGV+lC6oCwzowP3FJtW06BCNOybnbR84Fw6JCRL5P4mKoGnhn4jAOIYlYnk4K7z
ByhEallLD84h5oo2Td8i06QzxzaO6lkYaLrAVBQnd+Cs27foxjMOO0gasyCvUkl7
Jsm2n0NSEN6SYv0smLhiPHx9oF66MBAxNRVS6dL/6Ko/vSHNy1AGQgx0SztlHzCQ
K6JCpOZn5CExvKZUMXQXOrSkpdjIxVh+EGnGBAGyNswqBg49Y3K45FqNz4b+R0Pg
ckPzc7UAo8z+ZHzPz4mzx/596sw9xG4jLS6aHkjwet7kLXj/ShP//HqGwfHTeeOY
w4a2ers6ByCq2PwKYz34s/sDvC2+p++ZdyJF+qOjJpNjbH1B+mDId8M1f9H/9rvF
//EZhCcdlpqLIxvQKDMADQRpcwTbAPglsBuHs/1Sk1zjRK1NDRPw+46pUUQOb8ND
ydAs0JVUOgSKHZT+3b5l85poRYpp8gX8nVrvF2DNJBqmBTvSkAbZpNPiSy6fNWQV
he3Q9RndqQq2OPffG1ijte7G1m9J8lCh7Nzl6Ee6vGsw4GzzkKIx9qHHCtXhYpOE
kdg5ttgUK5sfgjsn3ANp
=CB1D
-----END PGP SIGNATURE-----
Merge tag 'batman-adv-for-davem' of git://git.open-mesh.org/linux-merge
Antonio Quartulli says:
====================
Included changes:
- remove useless skb size check in batadv_interface_rx
- basic netns support introduced by Andrew Lunn:
- prevent virtual interface from changing netns by setting
NETIF_F_NETNS_LOCAL
- create virtual interface within the netns of the first
hard-interface
- introduce detection of complex bridge loops and report event
to the user (via udev) when the Bridge Loop Avoidance mechanism
can't prevent them
- minor reference counting bugfixes for the hard_iface object that
couldn't make it via the net tree
- use kref_get() instead of kref_get_unless_zero() to make reference
counting bug more visible
- use batadv_compare_eth() all over the code when possible instead of
plain memcmp()
- minor code cleanup and style adjustments
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
There are two instances of an unused variable, `doff' added by
commit 6fa01ccd88 ("skbuff: Add pskb_extract() helper function")
in pskb_carve_inside_header() and pskb_carve_inside_nonlinear().
Remove these instances, they are not used.
Reported by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The handler 'ila_fill_encap_info' adds two attributes: ILA_ATTR_LOCATOR
and ILA_ATTR_CSUM_MODE.
nla_total_size_64bit() must be use for ILA_ATTR_LOCATOR.
Also, do nla_put_u8 instead of nla_put_u64 for ILA_ATTR_CSUM_MODE.
Fixes: f13a82d87b ("ipv6: use nla_put_u64_64bit()")
Fixes: 90bfe662db ("ila: add checksum neutral ILA translations")
Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the very unlikely case __tcp_retransmit_skb() can not use the cloning
done in tcp_transmit_skb(), we need to refresh skb_mstamp before doing
the copy and transmit, otherwise TCP TS val will be an exact copy of
original transmit.
Fixes: 7faee5c0d5 ("tcp: remove TCP_SKB_CB(skb)->when")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When comparing Ethernet address it is better to use the more
generic batadv_compare_eth. The latter is also optimised for
architectures having a fast unaligned access.
Signed-off-by: Antonio Quartulli <a@unstable.cc>
[sven@narfation.org: fix conflicts with current version]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Reviewed-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
It is easier to understand that the returned value of a specific function
doesn't have to be 0 when the functions was successful when the actual
return type is bool. This is especially true when all surrounding functions
with return type int use negative values to return the error code.
Reported-by: Nicholas Krause <xerofoify@gmail.com>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
_batadv_update_route requires that the caller already has a valid reference
for neigh_node. It is therefore not possible that it has an reference
counter of 0 and was still given to this function
The kref_get function instead WARNs (with debug information) when the
reference counter would still be 0. This makes a bug in batman-adv better
visible because kref_get_unless_zero would have ignored this problem.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The callers of the functions using batadv_hard_iface objects already make
sure that they hold a valid reference. The subfunctions don't have
to check whether the reference counter is > 0 because this was checked by
the callers.
The kref_get function instead WARNs (with debug information) when the
reference counter would still be 0. This makes a bug in batman-adv better
visible because kref_get_unless_zero would have ignored this problem.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
batadv_gw_node_add requires that the caller already has a valid reference
for orig_node. It is therefore not possible that it has an reference
counter of 0 and was still given to this function
The kref_get function instead WARNs (with debug information) when the
reference counter would still be 0. This makes a bug in batman-adv better
visible because kref_get_unless_zero would have ignored this problem.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
batadv_gw_select requires that the caller already has a valid reference for
new_gw_node. It is therefore not possible that it has an reference counter
of 0 and was still given to this function
The kref_get function instead WARNs (with debug information) when the
reference counter would still be 0. This makes a bug in batman-adv better
visible because kref_get_unless_zero would have ignored this problem.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
batadv_nc_get_nc_node requires that the caller already has a valid
reference for orig_neigh_node. It is therefore not possible that it has an
reference counter of 0 and was still given to this function
The kref_get function instead WARNs (with debug information) when the
reference counter would still be 0. This makes a bug in batman-adv better
visible because kref_get_unless_zero would have ignored this problem.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
batadv_tvlv_container_get requires that tvlv.container_list_lock is held by
the caller. It is therefore not possible that an item in
tvlv.container_list has an reference counter of 0 and is still in the list
The kref_get function instead WARNs (with debug information) when the
reference counter would still be 0. This makes a bug in batman-adv better
visible because kref_get_unless_zero would have ignored this problem.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The hard_iface is referenced in the packet_type for batman-adv. Increase
the refcounter of the hard_interface for it to have an explicit reference
for it in case this functionality gets refactorted and the currently
used implicit reference for it will be removed.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The receive function may start processing an incoming packet while the
hard_iface is shut down in a different context. All called functions called
with the batadv_hard_iface object belonging to the incoming interface would
have to check whether the reference counter is still > 0.
This is rather error-prone because this check can be forgotten easily.
Instead check the reference counter when receiving the object to make sure
that all called functions have a valid reference.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The batadv_hardif_list list is checked in many situations and the items
in this list are given to specialized functions to modify the routing
behavior. At the moment each of these called functions has to check
itself whether the received batadv_hard_iface has a refcount > 0 before
it can increase the reference counter and use it in other objects.
This can easily lead to problems because it is not easily visible where
all callers of a function got the batadv_hard_iface object from and
whether they already hold a valid reference.
Checking the reference counter directly before calling a subfunction
with a pointer from the batadv_hardif_list avoids this problem.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
There are network setups where the current bridge loop avoidance can't
detect bridge loops. The minimal setup affected would consist of two
LANs and two separate meshes, connected in a ring like that:
A...(mesh1)...B
| |
(LAN1) (LAN2)
| |
C...(mesh2)...D
Since both the meshes and backbones are separate, the bridge loop
avoidance has not enough information to detect and avoid the loop
in this case. Even if these scenarios can't be fixed easily,
these kind of loops can be detected.
This patch implements a periodic check (running every 60 seconds for
now) which sends a broadcast frame with a random MAC address on
each backbone VLAN. If a broadcast frame with the same MAC address
is received shortly after on the mesh, we know that there must be a
loop and report that incident as well as throw an uevent to let others
handle that problem.
Signed-off-by: Simon Wunderlich <simon.wunderlich@open-mesh.com>
[sven@narfation.org: fix conflicts with current version]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
When creating a soft interface, create it in the same netns as the
hard interface. Replace all references to init_net with the correct
name space for the interface being manipulated.
Suggested-by: Daniel Ehlers <danielehlers@mindeye.net>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The batX soft interface should not be moved between network name
spaces. This is similar to bridges, bonds, tunnels, which are not
allowed to move between network namespaces.
Suggested-by: Daniel Ehlers <danielehlers@mindeye.net>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Antonio Quartulli <a@unstable.cc>
Reviewed-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The callers of batadv_interface_rx have to make sure that enough data can
be pulled from the skb when they read the batman-adv header. The only two
functions using it are either calling pskb_may_pull with hdr_size directly
(batadv_recv_bcast_packet) or indirectly via batadv_check_unicast_packet
(batadv_recv_unicast_packet).
Reported-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
A recent commit introduced an unconditional use of an uninitialized
variable, as reported in this gcc warning:
net/netfilter/nf_conntrack_core.c: In function '__nf_conntrack_confirm':
net/netfilter/nf_conntrack_core.c:632:33: error: 'ctinfo' may be used uninitialized in this function [-Werror=maybe-uninitialized]
bytes = atomic64_read(&counter[CTINFO2DIR(ctinfo)].bytes);
^
net/netfilter/nf_conntrack_core.c:628:26: note: 'ctinfo' was declared here
enum ip_conntrack_info ctinfo;
The problem is that a local variable shadows the function parameter.
This removes the local variable, which looks like what Pablo originally
intended.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 71d8c47fc6 ("netfilter: conntrack: introduce clash resolution on insertion race")
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contain Netfilter simple fixes for your net tree,
two one-liner and one two-liner:
1) Oneliner to fix missing spinlock definition that triggers
'BUG: spinlock bad magic on CPU#' when spinlock debugging is enabled,
from Florian Westphal.
2) Fix missing workqueue cancelation on IDLETIMER removal,
from Liping Zhang.
3) Fix insufficient validation of netlink of NFACCT_QUOTA in
nfnetlink_acct, from Phil Turnbull.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix two spots where o_flags in a tunnel are being compared to GRE_SEQ
instead of TUNNEL_SEQ.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Need to use adjusted protocol value for setting inner protocol.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
GRE for IPv6 does not properly translate for GRE flags to tunnel
flags and vice versa. This patch fixes that.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In ip6gre_tnl_link_config set t->tun_len and t->hlen correctly for the
configuration. For hard_header_len and mtu calculation include
IPv6 header and encapsulation overhead.
In ip6gre_tunnel_init_common set t->tun_len and t->hlen correctly for
the configuration. Revert to setting hard_header_len instead of
needed_headroom.
Tested:
./ip link add name tun8 type ip6gretap remote \
2401:db00:20:911a:face:0:27:0 local \
2401:db00:20:911a:face:0:25:0 ttl 225
Gives MTU of 1434. That is equal to 1500 - 40 - 14 - 4 - 8.
./ip link add name tun8 type ip6gretap remote \
2401:db00:20:911a:face:0:27:0 local \
2401:db00:20:911a:face:0:25:0 ttl 225 okey 123
Gives MTU of 1430. That is equal to 1500 - 40 - 14 - 4 - 8 - 4.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stack object "dte_facilities" is allocated in x25_rx_call_request(),
which is supposed to be initialized in x25_negotiate_facilities.
However, 5 fields (8 bytes in total) are not initialized. This
object is then copied to userland via copy_to_user, thus infoleak
occurs.
Signed-off-by: Kangjie Lu <kjlu@gatech.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow udp and raw sockets to send by oif that is an enslaved interface
versus the l3mdev/VRF device. For example, this allows BFD to use ifindex
from IP_PKTINFO on a receive to send a response without the need to
convert to the VRF index. It also allows ping and ping6 to work when
specifying an enslaved interface (e.g., ping -I swp1 <ip>) which is
a natural use case.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move l3mdev_rt6_dst_by_oif and l3mdev_get_saddr to l3mdev.c. Collapse
l3mdev_get_rt6_dst into l3mdev_rt6_dst_by_oif since it is the only
user and keep the l3mdev_get_rt6_dst name for consistency with other
hooks.
A follow-on patch adds more code to these functions making them long
for inlined functions.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In netdevice.h we removed the structure in net-next that is being
changes in 'net'. In macsec.c and rtnetlink.c we have overlaps
between fixes in 'net' and the u64 attribute changes in 'net-next'.
The mlx5 conflicts have to do with vxlan support dependencies.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following large patchset contains Netfilter updates for your
net-next tree. My initial intention was to send you this in two goes but
when I looked back twice I already had this burden on top of me.
Several updates for IPVS from Marco Angaroni:
1) Allow SIP connections originating from real-servers to be load
balanced by the SIP persistence engine as is already implemented
in the other direction.
2) Release connections immediately for One-packet-scheduling (OPS)
in IPVS, instead of making it via timer and rcu callback.
3) Skip deleting conntracks for each one packet in OPS, and don't call
nf_conntrack_alter_reply() since no reply is expected.
4) Enable drop on exhaustion for OPS + SIP persistence.
Miscelaneous conntrack updates from Florian Westphal, including fix for
hash resize:
5) Move conntrack generation counter out of conntrack pernet structure
since this is only used by the init_ns to allow hash resizing.
6) Use get_random_once() from packet path to collect hash random seed
instead of our compound.
7) Don't disable BH from ____nf_conntrack_find() for statistics,
use NF_CT_STAT_INC_ATOMIC() instead.
8) Fix lookup race during conntrack hash resizing.
9) Introduce clash resolution on conntrack insertion for connectionless
protocol.
Then, Florian's netns rework to get rid of per-netns conntrack table,
thus we use one single table for them all. There was consensus on this
change during the NFWS 2015 and, on top of that, it has recently been
pointed as a source of multiple problems from unpriviledged netns:
11) Use a single conntrack hashtable for all namespaces. Include netns
in object comparisons and make it part of the hash calculation.
Adapt early_drop() to consider netns.
12) Use single expectation and NAT hashtable for all namespaces.
13) Use a single slab cache for all namespaces for conntrack objects.
14) Skip full table scanning from nf_ct_iterate_cleanup() if the pernet
conntrack counter tells us the table is empty (ie. equals zero).
Fixes for nf_tables interval set element handling, support to set
conntrack connlabels and allow set names up to 32 bytes.
15) Parse element flags from element deletion path and pass it up to the
backend set implementation.
16) Allow adjacent intervals in the rbtree set type for dynamic interval
updates.
17) Add support to set connlabel from nf_tables, from Florian Westphal.
18) Allow set names up to 32 bytes in nf_tables.
Several x_tables fixes and updates:
19) Fix incorrect use of IS_ERR_VALUE() in x_tables, original patch
from Andrzej Hajda.
And finally, miscelaneous netfilter updates such as:
20) Disable automatic helper assignment by default. Note this proc knob
was introduced by a900689264 ("netfilter: nf_ct_helper: allow to
disable automatic helper assignment") 4 years ago to start moving
towards explicit conntrack helper configuration via iptables CT
target.
21) Get rid of obsolete and inconsistent debugging instrumentation
in x_tables.
22) Remove unnecessary check for null after ip6_route_output().
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
An earlier patch changed lookup side to also net_eq() namespaces after
obtaining a reference on the conntrack, so a single kmemcache can be used.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We already include netns address in the hash, so we only need to use
net_eq in find_appropriate_src and can then put all entries into
same table.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Will be needed soon when we place all in the same hash table.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add function rpc_lookup_generic_cred, which allows lookups of a generic
credential that's not current_cred().
[jlayton: add gfp_t parm]
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
We need to be able to call the generic_cred creator from different
contexts. Add a gfp_t parm to the crcreate operation and to
rpcauth_lookup_credcache. For now, we just push the gfp_t parms up
one level to the *_lookup_cred functions.
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
An xdr_buf with head[0].iov_len = 0 and page_len = 0 will cause
xdr_init_decode() to incorrectly setup the xdr_stream. Specifically,
xdr->end is never initialized.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
TC_ACT_STOLEN is used when ingress traffic is mirred/redirected
to say ifb.
Packet is not dropped, but consumed.
Only TC_ACT_SHOT is a clear indication something went wrong.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On small embedded routers, one wants to control maximal amount of
memory used by fq_codel, instead of controlling number of packets or
bytes, since GRO/TSO make these not practical.
Assuming skb->truesize is accurate, we have to keep track of
skb->truesize sum for skbs in queue.
This patch adds a new TCA_FQ_CODEL_MEMORY_LIMIT attribute.
I chose a default value of 32 MBytes, which looks reasonable even
for heavy duty usages. (Prior fq_codel users should not be hurt
when they upgrade their kernels)
Two fields are added to tc_fq_codel_qd_stats to report :
- Current memory usage
- Number of drops caused by memory limits
# tc qd replace dev eth1 root est 1sec 4sec fq_codel memory_limit 4M
..
# tc -s -d qd sh dev eth1
qdisc fq_codel 8008: root refcnt 257 limit 10240p flows 1024
quantum 1514 target 5.0ms interval 100.0ms memory_limit 4Mb ecn
Sent 2083566791363 bytes 1376214889 pkt (dropped 4994406, overlimits 0
requeues 21705223)
rate 9841Mbit 812549pps backlog 3906120b 376p requeues 21705223
maxpacket 68130 drop_overlimit 4994406 new_flow_count 28855414
ecn_mark 0 memory_used 4190048 drop_overmemory 4994406
new_flows_len 1 old_flows_len 177
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Dave Täht <dave.taht@gmail.com>
Cc: Sebastian Möller <moeller0@gmx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add an implementation of Qualcomm's IPC router protocol, used to
communicate with service providing remote processors.
Signed-off-by: Courtney Cavin <courtney.cavin@sonymobile.com>
Signed-off-by: Bjorn Andersson <bjorn.andersson@sonymobile.com>
[bjorn: Cope with 0 being a valid node id and implement RTM_NEWADDR]
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
UDP tunnel segmentation code relies on the inner offsets being set for
an UDP tunnel GSO packet, but the inner *_complete() functions will
set the inner offsets only if 'encapsulation' is set before calling
them. Currently, udp_gro_complete() sets 'encapsulation' only after
the inner *_complete() functions are done. This causes the inner
offsets having invalid values after udp_gro_complete() returns, which
in turn will make it impossible to properly segment the packet in case
it needs to be forwarded, which would be visible to the user either as
invalid packets being sent or as packet loss.
This patch fixes this by setting skb's 'encapsulation' in
udp_gro_complete() before calling into the inner complete functions,
and by making each possible UDP tunnel gro_complete() callback set the
inner_mac_header to the beginning of the tunnel payload.
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Reviewed-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The setting of the UDP tunnel GSO type is already performed by
udp[46]_gro_complete().
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
I forgot that ip_send_unicast_reply() is not BH safe (yet).
Disabling preemption before calling it was not a good move.
Fixes: c10d9310ed ("tcp: do not assume TCP code is non preemptible")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andres Lagar-Cavilla <andreslc@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
allow cls_bpf and act_bpf programs access skb->data and skb->data_end pointers.
The bpf helpers that change skb->data need to update data_end pointer as well.
The verifier checks that programs always reload data, data_end pointers
after calls to such bpf helpers.
We cannot add 'data_end' pointer to struct qdisc_skb_cb directly,
since it's embedded as-is by infiniband ipoib, so wrapper struct is needed.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tables have to exist for VRFs to function. Ensure they exist
when VRF device is created.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Responses for packets to unused ports are getting lost with L3 domains.
IPv4 has ip_send_unicast_reply for sending TCP responses which accounts
for L3 domains; update the IPv6 counterpart tcp_v6_send_response.
For icmp the L3 master check needs to be moved up in icmp6_send
to properly respond to UDP packets to a port with no listener.
Fixes: ca254490c8 ("net: Add VRF support to IPv6 stack")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With the newly introduced helper functions the skb pulling is hidden
in the checksumming function - and undone before returning to the
caller.
The IGMP and MLD query parsing functions in the bridge still
assumed that the skb is pointing to the beginning of the IGMP/MLD
message while it is now kept at the beginning of the IPv4/6 header.
If there is a querier somewhere else, then this either causes
the multicast snooping to stay disabled even though it could be
enabled. Or, if we have the querier enabled too, then this can
create unnecessary IGMP / MLD query messages on the link.
Fixing this by taking the offset between IP and IGMP/MLD header into
account, too.
Fixes: 9afd85c9e4 ("net: Export IGMP/MLD message validation code")
Reported-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simon Horman says:
====================
Second Round of IPVS Updates for v4.7
please consider these enhancements to the IPVS. They allow its DoS
mitigation strategy effective in conjunction with the SIP persistence
engine.
====================
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We already include netns address in the hash and compare the netns pointers
during lookup, so even if namespaces have overlapping addresses entries
will be spread across the expectation table.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
DoS protection policy that deletes connections to avoid out of memory is
currently not effective for SIP-pe plus OPS-mode for two reasons:
1) connection templates (holding SIP call-id) are always skipped in
ip_vs_random_dropentry()
2) in_pkts counter (used by drop_entry algorithm) is not incremented
for connection templates
This patch addresses such problems with the following changes:
a) connection templates associated (via their dest) to virtual-services
configured in OPS mode are included in ip_vs_random_dropentry()
monitoring. This applies to SIP-pe over UDP (which requires OPS mode),
but is more general principle: when OPS is controlled by templates
memory can be used only by templates themselves, since OPS conns are
deleted after packet is forwarded.
b) OPS connections, if controlled by a template, cause increment of
in_pkts counter of their template. This is already happening but only
in case director is in master-slave mode (see ip_vs_sync_conn()).
Signed-off-by: Marco Angaroni <marcoangaroni@gmail.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
get_bridge_ifindices() is used from the old "deviceless" bridge ioctl
calls which aren't called with rtnl held. The comment above says that it is
called with rtnl but that is not really the case.
Here's a sample output from a test ASSERT_RTNL() which I put in
get_bridge_ifindices and executed "brctl show":
[ 957.422726] RTNL: assertion failed at net/bridge//br_ioctl.c (30)
[ 957.422925] CPU: 0 PID: 1862 Comm: brctl Tainted: G W O
4.6.0-rc4+ #157
[ 957.423009] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.8.1-20150318_183358- 04/01/2014
[ 957.423009] 0000000000000000 ffff880058adfdf0 ffffffff8138dec5
0000000000000400
[ 957.423009] ffffffff81ce8380 ffff880058adfe58 ffffffffa05ead32
0000000000000001
[ 957.423009] 00007ffec1a444b0 0000000000000400 ffff880053c19130
0000000000008940
[ 957.423009] Call Trace:
[ 957.423009] [<ffffffff8138dec5>] dump_stack+0x85/0xc0
[ 957.423009] [<ffffffffa05ead32>]
br_ioctl_deviceless_stub+0x212/0x2e0 [bridge]
[ 957.423009] [<ffffffff81515beb>] sock_ioctl+0x22b/0x290
[ 957.423009] [<ffffffff8126ba75>] do_vfs_ioctl+0x95/0x700
[ 957.423009] [<ffffffff8126c159>] SyS_ioctl+0x79/0x90
[ 957.423009] [<ffffffff8163a4c0>] entry_SYSCALL_64_fastpath+0x23/0xc1
Since it only reads bridge ifindices, we can use rcu to safely walk the net
device list. Also remove the wrong rtnl comment above.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The peer may be expecting a reply having sent a request and then done a
shutdown(SHUT_WR), so tearing down the whole socket at this point seems
wrong and breaks for me with a client which does a SHUT_WR.
Looking at other socket family's stream_recvmsg callbacks doing a shutdown
here does not seem to be the norm and removing it does not seem to have
had any adverse effects that I can see.
I'm using Stefan's RFC virtio transport patches, I'm unsure of the impact
on the vmci transport.
Signed-off-by: Ian Campbell <ian.campbell@docker.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Cc: Andy King <acking@vmware.com>
Cc: Dmitry Torokhov <dtor@vmware.com>
Cc: Jorgen Hansen <jhansen@vmware.com>
Cc: Adit Ranadive <aditr@vmware.com>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
If a quota bit is set in NFACCT_FLAGS but the NFACCT_QUOTA parameter is
missing then a NULL pointer dereference is triggered. CAP_NET_ADMIN is
required to trigger the bug.
Signed-off-by: Phil Turnbull <phil.turnbull@oracle.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently, we support set names of up to 16 bytes, get this aligned
with the maximum length we can use in ipset to make it easier when
considering migration to nf_tables.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The dprintf() and duprintf() functions are enabled at compile time,
these days we have better runtime debugging through pr_debug() and
static keys.
On top of this, this debugging is so old that I don't expect anyone
using this anymore, so let's get rid of this.
IP_NF_ASSERT() is still left in place, although this needs that
NETFILTER_DEBUG is enabled, I think these assertions provide useful
context information when reading the code.
Note that ARP_NF_ASSERT() has been removed as there is no user of
this.
Kill also DEBUG_ALLOW_ALL and a couple of pr_error() and pr_debug()
spots that are inconsistently placed in the code.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If the protocol is not natively supported, this assigns generic protocol
tracker so we can always assume a valid pointer after these calls.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: Joe Stringer <joe@ovn.org>
This patch introduces nf_ct_resolve_clash() to resolve race condition on
conntrack insertions.
This is particularly a problem for connection-less protocols such as
UDP, with no initial handshake. Two or more packets may race to insert
the entry resulting in packet drops.
Another problematic scenario are packets enqueued to userspace via
NFQUEUE after the raw table, that make it easier to trigger this
race.
To resolve this, the idea is to reset the conntrack entry to the one
that won race. Packet and bytes counters are also merged.
The 'insert_failed' stats still accounts for this situation, after
this patch, the drop counter is bumped whenever we drop packets, so we
can watch for unresolved clashes.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Introduce a helper function to update conntrack counters.
__nf_ct_kill_acct() was unnecessarily subtracting skb_network_offset()
that is expected to be zero from the ipv4/ipv6 hooks.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Remove unnecessary check for non-nul pointer in destroy_conntrack()
given that __nf_ct_l4proto_find() returns the generic protocol tracker
if the protocol is not supported.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When iterating, skip conntrack entries living in a different netns.
We could ignore netns and kill some other non-assured one, but it
has two problems:
- a netns can kill non-assured conntracks in other namespace
- we would start to 'over-subscribe' the affected/overlimit netns.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We already include netns address in the hash and compare the netns pointers
during lookup, so even if namespaces have overlapping addresses entries
will be spread across the table.
Assuming 64k bucket size, this change saves 0.5 mbyte per namespace on a
64bit system.
NAT bysrc and expectation hash is still per namespace, those will
changed too soon.
Future patch will also make conntrack object slab cache global again.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Once we place all conntracks into a global hash table we want them to be
spread across entire hash table, even if namespaces have overlapping ip
addresses.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Once we place all conntracks in the same hash table we must also compare
the netns pointer to skip conntracks that belong to a different namespace.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The iteration process is lockless, so we test if the conntrack object is
eligible for printing (e.g. is AF_INET) after obtaining the reference
count.
Once we put all conntracks into same hash table we might see more
entries that need to be skipped.
So add a helper and first perform the test in a lockless fashion
for fast skip.
Once we obtain the reference count, just repeat the check.
Note that this refactoring also includes a missing check for unconfirmed
conntrack entries due to slab rcu object re-usage, so they need to be
skipped since they are not part of the listing.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This prepares for upcoming change that places all conntracks into a
single, global table. For this to work we will need to also compare
net pointer during lookup. To avoid open-coding such check use the
nf_ct_key_equal helper and then later extend it to also consider net_eq.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Once we place all conntracks into same table iteration becomes more
costly because the table contains conntracks that we are not interested
in (belonging to other netns).
So don't bother scanning if the current namespace has no entries.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When resizing the conntrack hash table at runtime via
echo 42 > /sys/module/nf_conntrack/parameters/hashsize, we are racing with
the conntrack lookup path -- reads can happen in parallel and nothing
prevents readers from observing a the newly allocated hash but the old
size (or vice versa).
So access to hash[bucket] can trigger OOB read access in case the table got
expanded and we saw the new size but the old hash pointer (or it got shrunk
and we got new hash ptr but the size of the old and larger table):
kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] SMP KASAN
CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 4.6.0-rc2+ #107
[..]
Call Trace:
[<ffffffff822c3d6a>] ? nf_conntrack_tuple_taken+0x12a/0xe90
[<ffffffff822c3ac1>] ? nf_ct_invert_tuplepr+0x221/0x3a0
[<ffffffff8230e703>] get_unique_tuple+0xfb3/0x2760
Use generation counter to obtain the address/length of the same table.
Also add a synchronize_net before freeing the old hash.
AFAICS, without it we might access ct_hash[bucket] after ct_hash has been
freed, provided that lockless reader got delayed by another event:
CPU1 CPU2
seq_begin
seq_retry
<delay> resize occurs
free oldhash
for_each(oldhash[size])
Note that resize is only supported in init_netns, it took over 2 minutes
of constant resizing+flooding to produce the warning, so this isn't a
big problem in practice.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
No need to disable BH here anymore:
stats are switched to _ATOMIC variant (== this_cpu_inc()), which
nowadays generates same code as the non _ATOMIC NF_STAT, at least on x86.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Conntrack labels are currently sized depending on the iptables
ruleset, i.e. if we're asked to test or set bits 1, 2, and 65 then we
would allocate enough room to store at least bit 65.
However, with nft, the input is just a register with arbitrary runtime
content.
We therefore ask for the upper ceiling we currently have, which is
enough room to store 128 bits.
Alternatively, we could alter nf_connlabel_replace to increase
net->ct.label_words at run time, but since 128 bits is not that
big we'd only save sizeof(long) so it doesn't seem worth it for now.
This follows a similar approach that xtables 'connlabel'
match uses, so when user inputs
ct label set bar
then we will set the bit used by the 'bar' label and leave the rest alone.
This is done by passing the sreg content to nf_connlabels_replace
as both value and mask argument.
Labels (bits) already set thus cannot be re-set to zero, but
this is not supported by xtables connlabel match either.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
percpu_counter only have protection against preemption.
TCP stack uses them possibly from BH, so we need BH protection
in contexts that could be run in process context
Fixes: c10d9310ed ("tcp: do not assume TCP code is non preemptible")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__inet_twsk_hashdance() might be called from process context,
better block BH before acquiring bind hash and established locks
Fixes: c10d9310ed ("tcp: do not assume TCP code is non preemptible")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_snd_una_update() and tcp_rcv_nxt_update() call
u64_stats_update_begin() either from process context or BH handler.
This triggers a lockdep splat on 32bit & SMP builds.
We could add u64_stats_update_begin_bh() variant but this would
slow down 32bit builds with useless local_disable_bh() and
local_enable_bh() pairs, since we own the socket lock at this point.
I add sock_owned_by_me() helper to have proper lockdep support
even on 64bit builds, and new u64_stats_update_begin_raw()
and u64_stats_update_end_raw methods.
Fixes: c10d9310ed ("tcp: do not assume TCP code is non preemptible")
Reported-by: Fabio Estevam <festevam@gmail.com>
Diagnosed-by: Francois Romieu <romieu@fr.zoreil.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Fabio Estevam <fabio.estevam@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net): ipsec 2016-05-04
1) The flowcache can hit an OOM condition if too
many entries are in the gc_list. Fix this by
counting the entries in the gc_list and refuse
new allocations if the value is too high.
2) The inner headers are invalid after a xfrm transformation,
so reset the skb encapsulation field to ensure nobody tries
access the inner headers. Otherwise tunnel devices stacked
on top of xfrm may build the outer headers based on wrong
informations.
3) Add pmtu handling to vti, we need it to report
pmtu informations for local generated packets.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
- two changes to the MAINTAINERS file where one marks our mailing list
as moderated and the other adds a missing documentation file
- kernel-doc fixes
- code refactoring and various cleanups
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJXKRJdAAoJEJ4aZjxxc6bKSVEP/1Ky6O7+oanpsjjwUiZDMj0W
KPtoPQ/VsJxu51e0OYi78jHtGned7xV+FLyFx1k8BwLPThYtd8ysDVqMqFAmXsRh
JPOT+7Y+lf8/FBUYdKyJcsaoqAeRPXnY+p0vE7woLaxk+GOiWpOIip73nisgu9gy
NxfmgJ77WjEV2v6IiD4djfYmZqOOvCF6IGWkubtc0WZdg5ma/2u7vYEDBy3yjN/b
og/5joT3GZC8K6X8BabxNSLDER+qs489a6rOUGoRK4NCU3LhELAywuAws30nPrB/
vFJ6BvEEzkaGcXJViSFelb9zsi4ngwvY9OPQnFCmOicDzJN7jqdV6yXcnSLurph1
sDR+1+k1f63czCJpG8Uhj+8SaQO7P8T9A5nL1UKwhdCOENCuj8Vtp5y4S2A3bOSe
jEv1dy9FC3yaPvtkyUN+wOuDerPoJr5pFuVRz2RGyeFSMxs+RPBLYf/D0+x1om9A
Vz63ecsygk7S7qGNXHUbQvX5Q5Kv5f4y4XjvmrH3rBq+T/WC6V5vbkTo8L4CapX9
KNffNGl1RWqHz/TVLSrQmlHc9zNM/Rg0am2MIxplGfP0rQSUNob/qjD50KPJSLF/
M8tmOBSCNAxzlfAwcn+VJLq+xt6Mr2mkhwZZPYGQPno8JJqCMq52k4w1AQvbv+eI
sxgFGvTq1WACUDx03vyx
=qoV3
-----END PGP SIGNATURE-----
Merge tag 'batman-adv-for-davem' of git://git.open-mesh.org/linux-merge
Antonio Quartulli says:
====================
pull request: batman-adv 20160504
In this pull request you have:
- two changes to the MAINTAINERS file where one marks our mailing list
as moderated and the other adds a missing documentation file
- kernel-doc fixes
- code refactoring and various cleanups
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The stack object “map” has a total size of 32 bytes. Its last 4
bytes are padding generated by compiler. These padding bytes are
not initialized and sent out via “nla_put”.
Signed-off-by: Kangjie Lu <kjlu@gatech.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
The stack object “info” has a total size of 12 bytes. Its last byte
is padding which is not initialized and leaked via “put_cmsg”.
Signed-off-by: Kangjie Lu <kjlu@gatech.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
previous patches removed all direct accesses to dev->trans_start,
so change the netif_trans_update helper to update trans_start of
netdev queue 0 instead and then remove trans_start from struct net_device.
AFAICS a lot of the netif_trans_update() invocations are now useless
because they occur in ndo_start_xmit and driver doesn't set LLTX
(i.e. stack already took care of the update).
As I can't test any of them it seems better to just leave them alone.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ipv6 gre implementation was cleaned up to share more code
with the ipv4 version, but it can be enabled even when NET_IPGRE_DEMUX
is disabled, resulting in a link error:
net/built-in.o: In function `gre_rcv':
:(.text+0x17f5d0): undefined reference to `gre_parse_header'
ERROR: "gre_parse_header" [net/ipv6/ip6_gre.ko] undefined!
This adds a Kconfig dependency to prevent that now invalid
configuration.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 308edfdf15 ("gre6: Cleanup GREv6 receive path, call common GRE functions")
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For ipgre interfaces in collect metadata mode, receive also traffic with
encapsulated Ethernet headers. The lwtunnel users are supposed to sort this
out correctly. This allows to have mixed Ethernet + L3-only traffic on the
same lwtunnel interface. This is the same way as VXLAN-GPE behaves.
To keep backwards compatibility and prevent any surprises, gretap interfaces
have priority in receiving packets with Ethernet headers.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This will allow to make the pull dependent on the tunnel type.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The call to gre_parse_header is either followed by iptunnel_pull_header, or
in the case of ICMP error path, the actual header is not accessed at all.
In the first case, iptunnel_pull_header will call pskb_may_pull anyway and
it's pointless to do it twice. The only difference is what call will fail
with what error code but the net effect is still the same in all call sites.
In the second case, pskb_may_pull is pointless, as skb->data is at the outer
IP header and not at the GRE header.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change makes it so that we will strip the TSO_MANGLEID bit if TSO is
not present. This way we will also handle ECN correctly of TSO is not
present.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch addresses a possible issue that can occur if we get into any odd
corner cases where we support TSO for a given protocol but not the checksum
or scatter-gather offload. There are few drivers floating around that
setup their tunnels this way and by enforcing the checksum piece we can
avoid mangling any frames.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the event that the number of partial segments is equal to 1 we don't
really need to perform partial segmentation offload. As such we should
skip multiplying the MSS and instead just clear the partial_segs value
since it will not provide any gain to advertise the frame as being GSO when
it is a single frame.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's easier for gre_parse_header to return the header length instead of
filing it into a parameter. That way, the callers that don't care about the
header length can just check whether the returned value is lower than zero.
In gre_err, the tunnel header must not be pulled. See commit b7f8fe251e
("gre: do not pull header in ICMP error processing") for details.
This patch reduces the conflict between the mentioned commit and commit
95f5c64c3c ("gre: Move utility functions to common headers").
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Under high rx pressure, it is possible tcp_sendmsg() never has a
chance to allocate an skb and loop forever as sk_flush_backlog()
would always return true.
Fix this by calling sk_flush_backlog() only if one skb had been
allocated and filled before last backlog check.
Fixes: d41a69f1d3 ("tcp: make tcp_sendmsg() aware of socket backlog")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
net/ipv4/ip_gre.c
Minor conflicts between tunnel bug fixes in net and
ipv6 tunnel cleanups in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
For test purpose, provide the generic nci loopback function.
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
According to NCI specification, destination type and destination
specific parameters shall uniquely identify a single destination
for the Logical Connection.
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
nci_core_conn_close was not retrieving a conn_info using the correct
connection id.
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
NCI_CORE_CONN_CREATE may not have any destination type parameter.
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
The handler 'ila_fill_encap_info' adds one attribute: ILA_ATTR_LOCATOR.
Fixes: 65d7ab8de5 ("net: Identifier Locator Addressing module")
CC: Tom Herbert <tom@herbertland.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the sendmsg function of UDP, raw, ICMP and l2tp sockets, we use local
variables like hlimits, tclass, opt and dontfrag and pass them to corresponding
functions like ip6_make_skb, ip6_append_data and xxx_push_pending_frames.
This is not a good practice and makes it hard to add new parameters.
This fix introduces a new struct ipcm6_cookie similar to ipcm_cookie in
ipv4 and include the above mentioned variables. And we only pass the
pointer to this structure to corresponding functions. This makes it easier
to add new parameters in the future and makes the function cleaner.
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
An arbitration scheme for duelling SYNs is implemented as part of
commit 241b271952 ("RDS-TCP: Reset tcp callbacks if re-using an
outgoing socket in rds_tcp_accept_one()") which ensures that both nodes
involved will arrive at the same arbitration decision. However, this
needs to be synchronized with an outgoing SYN to be generated by
rds_tcp_conn_connect(). This commit achieves the synchronization
through the t_conn_lock mutex in struct rds_tcp_connection.
The rds_conn_state is checked in rds_tcp_conn_connect() after acquiring
the t_conn_lock mutex. A SYN is sent out only if the RDS connection is
not already UP (an UP would indicate that rds_tcp_accept_one() has
completed 3WH, so no SYN needs to be generated).
Similarly, the rds_conn_state is checked in rds_tcp_accept_one() after
acquiring the t_conn_lock mutex. The only acceptable states (to
allow continuation of the arbitration logic) are UP (i.e., outgoing SYN
was SYN-ACKed by peer after it sent us the SYN) or CONNECTING (we sent
outgoing SYN before we saw incoming SYN).
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a race condition between rds_send_xmit -> rds_tcp_xmit
and the code that deals with resolution of duelling syns added
by commit 241b271952 ("RDS-TCP: Reset tcp callbacks if re-using an
outgoing socket in rds_tcp_accept_one()").
Specifically, we may end up derefencing a null pointer in rds_send_xmit
if we have the interleaving sequence:
rds_tcp_accept_one rds_send_xmit
conn is RDS_CONN_UP, so
invoke rds_tcp_xmit
tc = conn->c_transport_data
rds_tcp_restore_callbacks
/* reset t_sock */
null ptr deref from tc->t_sock
The race condition can be avoided without adding the overhead of
additional locking in the xmit path: have rds_tcp_accept_one wait
for rds_tcp_xmit threads to complete before resetting callbacks.
The synchronization can be done in the same manner as rds_conn_shutdown().
First set the rds_conn_state to something other than RDS_CONN_UP
(so that new threads cannot get into rds_tcp_xmit()), then wait for
RDS_IN_XMIT to be cleared in the conn->c_flags indicating that any
threads in rds_tcp_xmit are done.
Fixes: 241b271952 ("RDS-TCP: Reset tcp callbacks if re-using an
outgoing socket in rds_tcp_accept_one()")
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hosts sending lot of ACK packets exhibit high sock_wfree() cost
because of cache line miss to test SOCK_USE_WRITE_QUEUE
We could move this flag close to sk_wmem_alloc but it is better
to perform the atomic_sub_and_test() on a clean cache line,
as it avoid one extra bus transaction.
skb_orphan_partial() can also have a fast track for packets that either
are TCP acks, or already went through another skb_orphan_partial()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the case of the mlx4 and mlx5 driver they do not support IPv6 checksum
offload for tunnels. With this being the case we should disable GSO in
addition to the checksum offload features when we find that a device cannot
perform a checksum on a given packet type.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are two flow control mechanisms in TIPC; one at link level that
handles network congestion, burst control, and retransmission, and one
at connection level which' only remaining task is to prevent overflow
in the receiving socket buffer. In TIPC, the latter task has to be
solved end-to-end because messages can not be thrown away once they
have been accepted and delivered upwards from the link layer, i.e, we
can never permit the receive buffer to overflow.
Currently, this algorithm is message based. A counter in the receiving
socket keeps track of number of consumed messages, and sends a dedicated
acknowledge message back to the sender for each 256 consumed message.
A counter at the sending end keeps track of the sent, not yet
acknowledged messages, and blocks the sender if this number ever reaches
512 unacknowledged messages. When the missing acknowledge arrives, the
socket is then woken up for renewed transmission. This works well for
keeping the message flow running, as it almost never happens that a
sender socket is blocked this way.
A problem with the current mechanism is that it potentially is very
memory consuming. Since we don't distinguish between small and large
messages, we have to dimension the socket receive buffer according
to a worst-case of both. I.e., the window size must be chosen large
enough to sustain a reasonable throughput even for the smallest
messages, while we must still consider a scenario where all messages
are of maximum size. Hence, the current fix window size of 512 messages
and a maximum message size of 66k results in a receive buffer of 66 MB
when truesize(66k) = 131k is taken into account. It is possible to do
much better.
This commit introduces an algorithm where we instead use 1024-byte
blocks as base unit. This unit, always rounded upwards from the
actual message size, is used when we advertise windows as well as when
we count and acknowledge transmitted data. The advertised window is
based on the configured receive buffer size in such a way that even
the worst-case truesize/msgsize ratio always is covered. Since the
smallest possible message size (from a flow control viewpoint) now is
1024 bytes, we can safely assume this ratio to be less than four, which
is the value we are now using.
This way, we have been able to reduce the default receive buffer size
from 66 MB to 2 MB with maintained performance.
In order to keep this solution backwards compatible, we introduce a
new capability bit in the discovery protocol, and use this throughout
the message sending/reception path to always select the right unit.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
During neighbor discovery, nodes advertise their capabilities as a bit
map in a dedicated 16-bit field in the discovery message header. This
bit map has so far only be stored in the node structure on the peer
nodes, but we now see the need to keep a copy even in the socket
structure.
This commit adds this functionality.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the refactoring commit d570d86497 ("tipc: enqueue arrived buffers
in socket in separate function") we did by accident replace the test
if (sk->sk_backlog.len == 0)
atomic_set(&tsk->dupl_rcvcnt, 0);
with
if (sk->sk_backlog.len)
atomic_set(&tsk->dupl_rcvcnt, 0);
This effectively disables the compensation we have for the double
receive buffer accounting that occurs temporarily when buffers are
moved from the backlog to the socket receive queue. Until now, this
has gone unnoticed because of the large receive buffer limits we are
applying, but becomes indispensable when we reduce this buffer limit
later in this series.
We now fix this by inverting the mentioned condition.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
batadv_iv_ogm_orig_del_if handles two different buffers bcast_own and
bcast_own_sum which should be resized. The error handling two for
allocating these buffers causes the complexity of this function. This can
be avoided completely when the function is split into a main function
handling the locking, freeing and call of the subfunctions.
The subfunction can then independently handle the resize of the buffers.
This also allows to easily reuse the old buffer (which always is larger) in
case a smaller buffer could not be allocated without increasing the code
complexity.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Since batadv_v_ogm_orig_update() was only called from one place and the
calling function became very short, merge these two functions together.
This should also reflect the protocol description of B.A.T.M.A.N. V
better.
Signed-off-by: Simon Wunderlich <simon@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
To match our code better to the protocol description of B.A.T.M.A.N. V,
move batadv_v_ogm_forward() out into batadv_v_ogm_process_per_outif()
and move all checks directly deciding whether the OGM should be
forwarded into batadv_v_ogm_forward().
Signed-off-by: Simon Wunderlich <simon@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Structure initialization within the macros should follow the general
coding style used in the kernel: put the initialization of the first
variable and the closing brace on a separate line.
Reported-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Simon Wunderlich <simon.wunderlich@open-mesh.com>
[sven@narfation.org: fix conflicts with current version]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Some really long function names in batman-adv require a newline between
return type and the function name. This has lead to some lines starting
with *batadv_...
This * belongs to the return type and thus should be on the same line as
the return type.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
checkpatch.pl warns about the use of 'unsigned' as a short form for
'unsigned int'.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Use to_delayed_work() instead of open-coding it.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Reviewed-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Use list_for_each_entry_safe() instead of list_for_each_safe() to
simplify the code.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Acked-by: Antonio Quartulli <a@unstable.cc>
Reviewed-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Use a static string when showing table headers rather then
a nonsense parametric one with fixed arguments.
It is easier to grep and it does not need to be recomputed
at runtime each time.
Reported-by: Joe Perches <joe@perches.com>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
[sven@narfation.org: fix conflicts with current version]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
The vsock_transport structure is never modified, so declare it as const.
Done with the help of Coccinelle.
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
In presence of inelastic flows and stress, we can call
fq_codel_drop() for every packet entering fq_codel qdisc.
fq_codel_drop() is quite expensive, as it does a linear scan
of 4 KB of memory to find a fat flow.
Once found, it drops the oldest packet of this flow.
Instead of dropping a single packet, try to drop 50% of the backlog
of this fat flow, with a configurable limit of 64 packets per round.
TCA_FQ_CODEL_DROP_BATCH_SIZE is the new attribute to make this
limit configurable.
With this strategy the 4 KB search is amortized to a single cache line
per drop [1], so fq_codel_drop() no longer appears at the top of kernel
profile in presence of few inelastic flows.
[1] Assuming a 64byte cache line, and 1024 buckets
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dave Taht <dave.taht@gmail.com>
Cc: Jonathan Morton <chromatix99@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Dave Taht
Signed-off-by: David S. Miller <davem@davemloft.net>
This was recently reported to me, and reproduced on the latest net kernel,
when attempting to run netperf from a host that had a netem qdisc attached
to the egress interface:
[ 788.073771] ---------------------[ cut here ]---------------------------
[ 788.096716] WARNING: at net/core/dev.c:2253 skb_warn_bad_offload+0xcd/0xda()
[ 788.129521] bnx2: caps=(0x00000001801949b3, 0x0000000000000000) len=2962
data_len=0 gso_size=1448 gso_type=1 ip_summed=3
[ 788.182150] Modules linked in: sch_netem kvm_amd kvm crc32_pclmul ipmi_ssif
ghash_clmulni_intel sp5100_tco amd64_edac_mod aesni_intel lrw gf128mul
glue_helper ablk_helper edac_mce_amd cryptd pcspkr sg edac_core hpilo ipmi_si
i2c_piix4 k10temp fam15h_power hpwdt ipmi_msghandler shpchp acpi_power_meter
pcc_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c
sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt
i2c_algo_bit drm_kms_helper ahci ata_generic pata_acpi ttm libahci
crct10dif_pclmul pata_atiixp tg3 libata crct10dif_common drm crc32c_intel ptp
serio_raw bnx2 r8169 hpsa pps_core i2c_core mii dm_mirror dm_region_hash dm_log
dm_mod
[ 788.465294] CPU: 16 PID: 0 Comm: swapper/16 Tainted: G W
------------ 3.10.0-327.el7.x86_64 #1
[ 788.511521] Hardware name: HP ProLiant DL385p Gen8, BIOS A28 12/17/2012
[ 788.542260] ffff880437c036b8 f7afc56532a53db9 ffff880437c03670
ffffffff816351f1
[ 788.576332] ffff880437c036a8 ffffffff8107b200 ffff880633e74200
ffff880231674000
[ 788.611943] 0000000000000001 0000000000000003 0000000000000000
ffff880437c03710
[ 788.647241] Call Trace:
[ 788.658817] <IRQ> [<ffffffff816351f1>] dump_stack+0x19/0x1b
[ 788.686193] [<ffffffff8107b200>] warn_slowpath_common+0x70/0xb0
[ 788.713803] [<ffffffff8107b29c>] warn_slowpath_fmt+0x5c/0x80
[ 788.741314] [<ffffffff812f92f3>] ? ___ratelimit+0x93/0x100
[ 788.767018] [<ffffffff81637f49>] skb_warn_bad_offload+0xcd/0xda
[ 788.796117] [<ffffffff8152950c>] skb_checksum_help+0x17c/0x190
[ 788.823392] [<ffffffffa01463a1>] netem_enqueue+0x741/0x7c0 [sch_netem]
[ 788.854487] [<ffffffff8152cb58>] dev_queue_xmit+0x2a8/0x570
[ 788.880870] [<ffffffff8156ae1d>] ip_finish_output+0x53d/0x7d0
...
The problem occurs because netem is not prepared to handle GSO packets (as it
uses skb_checksum_help in its enqueue path, which cannot manipulate these
frames).
The solution I think is to simply segment the skb in a simmilar fashion to the
way we do in __dev_queue_xmit (via validate_xmit_skb), with some minor changes.
When we decide to corrupt an skb, if the frame is GSO, we segment it, corrupt
the first segment, and enqueue the remaining ones.
tested successfully by myself on the latest net kernel, to which this applies
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Jamal Hadi Salim <jhs@mojatatu.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: netem@lists.linux-foundation.org
CC: eric.dumazet@gmail.com
CC: stephen@networkplumber.org
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Locally generated TCP GSO packets having to go through a GRE/SIT/IPIP
tunnel have to go through an expensive skb_unclone()
Reallocating skb->head is a lot of work.
Test should really check if a 'real clone' of the packet was done.
TCP does not care if the original gso_type is changed while the packet
travels in the stack.
This adds skb_header_unclone() which is a variant of skb_clone()
using skb_header_cloned() check instead of skb_cloned().
This variant can probably be used from other points.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- a fix for our Distributed ARP Table that makes sure that the input
provided to the hash function during a query is the same as the one
provided during an insert (so to prevent false negatives), by Antonio
Quartulli
- a fix for our new protocol implementation B.A.T.M.A.N. V that ensures
that a hard interface is properly re-activated when it is brought down
and then up again, by Antonio Quartulli
- two fixes respectively to the reference counting of the tt_local_entry
and neigh_node objects, by Sven Eckelmann. Such bug is rather severe
as it would prevent the netdev objects references by batman-adv from
being released after shutdown.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJXJOPeAAoJEJ4aZjxxc6bKFaoP/jsY4MelcsGUGQhjfEfm/gbo
H7Xd5TUydiq9tfbIGwAjbS4Ti+e69ROolyiNQrvxc5PcJFhpQlsSe17+o0NdnCOE
tBLjuCLjpnj+FzghnQYb54Qb2CEllsSLJcOLh+CnbFiHos+pvhxA/NeC6FufyPMF
Zrdsjf/v4rzghWQerToKEcIgCXcRE3Zo2txunUnFXSzQGai4AJnljD1Hk1YcbQdn
O8+6lXYN+j4Swo6yrPB0URzJRIWdjoQ1OfdvggCDTuMW664jyv9gZmsF/fzL2ksj
SGldxkFOX+4x8NenRxs5OFMXHHAJGu8kU8uoXmOCuv6b59F2KWi27rP1MJqxYDcB
pTpq4nAx3IooNSSvpU97SFW3WBQgIsNHMFZwZbGkxqXP1UhPEoUcsuFTPVj/hqDI
h9xBLK/buNbYnMULTW8hMvxOUHqxjPvr37Vbj1uPdbfmwbrvUvwyMSWFn5k/JmAF
CASMwUC4C7IQtEinVYHmT/+QsPGMcmom1WZ1/OlhlxnmOwAcglI/mZnXl0wD7ptg
3KETNlrsNHC6YuOLKIKI08l3Ke2DOZLHdV5PvHcdPgTy7EYbSvZTaDwK422pSiuy
8kcQjN8g6I81drwJqkEiUkJA6kRxkKbXYxosudbRT07IkzUZo7TPAFv7iMNDSHUW
vuJV/rtYAp3bRDyLrxnb
=ksVs
-----END PGP SIGNATURE-----
Merge tag 'batman-adv-fix-for-davem' of git://git.open-mesh.org/linux-merge
Antonio Quartulli says:
====================
In this small batch of patches you have:
- a fix for our Distributed ARP Table that makes sure that the input
provided to the hash function during a query is the same as the one
provided during an insert (so to prevent false negatives), by Antonio
Quartulli
- a fix for our new protocol implementation B.A.T.M.A.N. V that ensures
that a hard interface is properly re-activated when it is brought down
and then up again, by Antonio Quartulli
- two fixes respectively to the reference counting of the tt_local_entry
and neigh_node objects, by Sven Eckelmann. Such bug is rather severe
as it would prevent the netdev objects references by batman-adv from
being released after shutdown.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new LINK_XSTATS_TYPE_BRIDGE attribute and implement the
RTM_GETSTATS callbacks for IFLA_STATS_LINK_XSTATS (fill_linkxstats and
get_linkxstats_size) in order to export the per-vlan stats.
The paddings were added because soon these fields will be needed for
per-port per-vlan stats (or something else if someone beats me to it) so
avoiding at least a few more netlink attributes.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for per-VLAN Tx/Rx statistics. Every global vlan context gets
allocated a per-cpu stats which is then set in each per-port vlan context
for quick access. The br_allowed_ingress() common function is used to
account for Rx packets and the br_handle_vlan() common function is used
to account for Tx packets. Stats accounting is performed only if the
bridge-wide vlan_stats_enabled option is set either via sysfs or netlink.
A struct hole between vlan_enabled and vlan_proto is used for the new
option so it is in the same cache line. Currently it is binary (on/off)
but it is intentionally restricted to exactly 0 and 1 since other values
will be used in the future for different purposes (e.g. per-port stats).
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add callbacks to calculate the size and fill link extended statistics
which can be split into multiple messages and are dumped via the new
rtnl stats API (RTM_GETSTATS) with the IFLA_STATS_LINK_XSTATS attribute.
Also add that attribute to the idx mask check since it is expected to
be able to save state and resume dumping (e.g. future bridge per-vlan
stats will be dumped via this attribute and callbacks).
Each link type should nest its private attributes under the per-link type
attribute. This allows to have any number of separated private attributes
and to avoid one call to get the dev link type.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The new prividx argument allows the current dumping device to save a
private state counter which would enable it to continue dumping from
where it left off. And the idxattr is used to save the current idx user
so multiple prividx using attributes can be requested at the same time
as suggested by Roopa Prabhu.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Changes in GREv6 transmit path:
- Call gre_checksum, remove gre6_checksum
- Rename ip6gre_xmit2 to __gre6_xmit
- Call gre_build_header utility function
- Call ip6_tnl_xmit common function
- Call ip6_tnl_change_mtu, eliminate ip6gre_tunnel_change_mtu
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A few generic changes to generalize tunnels in IPv6:
- Export ip6_tnl_change_mtu so that it can be called by ip6_gre
- Add tun_hlen to ip6_tnl structure.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create common functions for both IPv4 and IPv6 GRE in transmit. These
are put into gre.h.
Common functions are for:
- GRE checksum calculation. Move gre_checksum to gre.h.
- Building a GRE header. Move GRE build_header and rename
gre_build_header.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch renames ip6_tnl_xmit2 to ip6_tnl_xmit and exports it. Other
users like GRE will be able to call this. The original ip6_tnl_xmit
function is renamed to ip6_tnl_start_xmit (this is an ndo_start_xmit
function).
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Create gre_rcv function. This calls gre_parse_header and ip6gre_rcv.
- Call ip6_tnl_rcv. Doing this and using gre_parse_header eliminates
most of the code in ip6gre_rcv.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Several of the GRE functions defined in net/ipv4/ip_gre.c are usable
for IPv6 GRE implementation (that is they are protocol agnostic).
These include:
- GRE flag handling functions are move to gre.h
- GRE build_header is moved to gre.h and renamed gre_build_header
- parse_gre_header is moved to gre_demux.c and renamed gre_parse_header
- iptunnel_pull_header is taken out of gre_parse_header. This is now
done by caller. The header length is returned from gre_parse_header
in an int* argument.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some basic changes to make IPv6 tunnel receive path look more like
IPv4 path:
- Make ip6_tnl_rcv non-static so that GREv6 and others can call it
- Make ip6_tnl_rcv look like ip_tunnel_rcv
- Switch to gro_cells_receive
- Make ip6_tnl_rcv non-static and export it
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Large sendmsg()/write() hold socket lock for the duration of the call,
unless sk->sk_sndbuf limit is hit. This is bad because incoming packets
are parked into socket backlog for a long time.
Critical decisions like fast retransmit might be delayed.
Receivers have to maintain a big out of order queue with additional cpu
overhead, and also possible stalls in TX once windows are full.
Bidirectional flows are particularly hurt since the backlog can become
quite big if the copy from user space triggers IO (page faults)
Some applications learnt to use sendmsg() (or sendmmsg()) with small
chunks to avoid this issue.
Kernel should know better, right ?
Add a generic sk_flush_backlog() helper and use it right
before a new skb is allocated. Typically we put 64KB of payload
per skb (unless MSG_EOR is requested) and checking socket backlog
every 64KB gives good results.
As a matter of fact, tests with TSO/GSO disabled give very nice
results, as we manage to keep a small write queue and smaller
perceived rtt.
Note that sk_flush_backlog() maintains socket ownership,
so is not equivalent to a {release_sock(sk); lock_sock(sk);},
to ensure implicit atomicity rules that sendmsg() was
giving to (possibly buggy) applications.
In this simple implementation, I chose to not call tcp_release_cb(),
but we might consider this later.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexei Starovoitov <ast@fb.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Socket backlog processing is a major latency source.
With current TCP socket sk_rcvbuf limits, I have sampled __release_sock()
holding cpu for more than 5 ms, and packets being dropped by the NIC
once ring buffer is filled.
All users are now ready to be called from process context,
we can unblock BH and let interrupts be serviced faster.
cond_resched_softirq() could be removed, as it has no more user.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
sctp_inq_push() will soon be called without BH being blocked
when generic socket code flushes the socket backlog.
It is very possible SCTP can be converted to not rely on BH,
but this needs to be done by SCTP experts.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
UDP uses the generic socket backlog code, and this will soon
be changed to not disable BH when protocol is called back.
We need to use appropriate SNMP accessors.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
DCCP uses the generic backlog code, and this will soon
be changed to not disable BH when protocol is called back.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
AFAIK, nothing in current TCP stack absolutely wants BH
being disabled once socket is owned by a thread running in
process context.
As mentioned in my prior patch ("tcp: give prequeue mode some care"),
processing a batch of packets might take time, better not block BH
at all.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We want to to make TCP stack preemptible, as draining prequeue
and backlog queues can take lot of time.
Many SNMP updates were assuming that BH (and preemption) was disabled.
Need to convert some __NET_INC_STATS() calls to NET_INC_STATS()
and some __TCP_INC_STATS() to TCP_INC_STATS()
Before using this_cpu_ptr(net->ipv4.tcp_sk) in tcp_v4_send_reset()
and tcp_v4_send_ack(), we add an explicit preempt disabled section.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since a/b/g/n no longer exist as spec amendements and VHT (ex 802.11ac)
wasn't handled at all, it's better to just remove the amendment strings
to avoid confusion.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Pull networking fixes from David Miller:
1) MODULE_FIRMWARE firmware string not correct for iwlwifi 8000 chips,
from Sara Sharon.
2) Fix SKB size checks in batman-adv stack on receive, from Sven
Eckelmann.
3) Leak fix on mac80211 interface add error paths, from Johannes Berg.
4) Cannot invoke napi_disable() with BH disabled in myri10ge driver,
fix from Stanislaw Gruszka.
5) Fix sign extension problem when computing feature masks in
net_gso_ok(), from Marcelo Ricardo Leitner.
6) lan78xx driver doesn't count packets and packet lengths in its
statistics properly, fix from Woojung Huh.
7) Fix the buffer allocation sizes in pegasus USB driver, from Petko
Manolov.
8) Fix refcount overflows in bpf, from Alexei Starovoitov.
9) Unified dst cache handling introduced a preempt warning in
ip_tunnel, fix by resetting rather then setting the cached route.
From Paolo Abeni.
10) Listener hash collision test fix in soreuseport, from Craig Gallak
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (47 commits)
gre: do not pull header in ICMP error processing
net: Implement net_dbg_ratelimited() for CONFIG_DYNAMIC_DEBUG case
tipc: only process unicast on intended node
cxgb3: fix out of bounds read
net/smscx5xx: use the device tree for mac address
soreuseport: Fix TCP listener hash collision
net: l2tp: fix reversed udp6 checksum flags
ip_tunnel: fix preempt warning in ip tunnel creation/updating
samples/bpf: fix trace_output example
bpf: fix check_map_func_compatibility logic
bpf: fix refcnt overflow
drivers: net: cpsw: use of_phy_connect() in fixed-link case
dt: cpsw: phy-handle, phy_id, and fixed-link are mutually exclusive
drivers: net: cpsw: don't ignore phy-mode if phy-handle is used
drivers: net: cpsw: fix segfault in case of bad phy-handle
drivers: net: cpsw: fix parsing of phy-handle DT property in dual_emac config
MAINTAINERS: net: Change maintainer for GRETH 10/100/1G Ethernet MAC device driver
gre: reject GUE and FOU in collect metadata mode
pegasus: fixes reported packet length
pegasus: fixes URB buffer allocation size;
...
iptunnel_pull_header expects that IP header was already pulled; with this
expectation, it pulls the tunnel header. This is not true in gre_err.
Furthermore, ipv4_update_pmtu and ipv4_redirect expect that skb->data points
to the IP header.
We cannot pull the tunnel header in this path. It's just a matter of not
calling iptunnel_pull_header - we don't need any of its effects.
Fixes: bda7bb4634 ("gre: Allow multiple protocol listener for gre protocol.")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dave Miller pointed out that fb586f2530 ("sctp: delay calls to
sk_data_ready() as much as possible") may insert latency specially if
the receiving application is running on another CPU and that it would be
better if we signalled as early as possible.
This patch thus basically inverts the logic on fb586f2530 and signals
it as early as possible, similar to what we had before.
Fixes: fb586f2530 ("sctp: delay calls to sk_data_ready() as much as possible")
Reported-by: Dave Miller <davem@davemloft.net>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We have observed complete lock up of broadcast-link transmission due to
unacknowledged packets never being removed from the 'transmq' queue. This
is traced to nodes having their ack field set beyond the sequence number
of packets that have actually been transmitted to them.
Consider an example where node 1 has sent 10 packets to node 2 on a
link and node 3 has sent 20 packets to node 2 on another link. We
see examples of an ack from node 2 destined for node 3 being treated as
an ack from node 2 at node 1. This leads to the ack on the node 1 to node
2 link being increased to 20 even though we have only sent 10 packets.
When node 1 does get around to sending further packets, none of the
packets with sequence numbers less than 21 are actually removed from the
transmq.
To resolve this we reinstate some code lost in commit d999297c3d ("tipc:
reduce locking scope during packet reception") which ensures that only
messages destined for the receiving node are processed by that node. This
prevents the sequence numbers from getting out of sync and resolves the
packet leakage, thereby resolving the broadcast-link transmission
lock-ups we observed.
While we are aware that this change only patches over a root problem that
we still haven't identified, this is a sanity test that it is always
legitimate to do. It will remain in the code even after we identify and
fix the real problem.
Reviewed-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
Reviewed-by: John Thompson <john.thompson@alliedtelesis.co.nz>
Signed-off-by: Hamish Martin <hamish.martin@alliedtelesis.co.nz>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we are displaying statistics for the first link established between
two peers, it will always be presented as STANDBY although it in reality
is ACTIVE.
This happens because we forget to set the 'active' flag in the link
instance at the moment it is established. Although this is a bug, it only
has impact on the presentation view of the link, not on its actual
functionality.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I forgot to include a check for listener port equality when deciding
if two sockets should belong to the same reuseport group. This was
not caught previously because it's only necessary when two listening
sockets for the same user happen to hash to the same listener bucket.
The same error does not exist in the UDP path.
Fixes: c125e80b8868("soreuseport: fast reuseport TCP socket selection")
Signed-off-by: Craig Gallek <kraig@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes a bug which causes the behavior of whether to ignore
udp6 checksum of udp6 encapsulated l2tp tunnel contrary to what
userspace program requests.
When the flag `L2TP_ATTR_UDP_ZERO_CSUM6_RX` is set by userspace, it is
expected that udp6 checksums of received packets of the l2tp tunnel
to create should be ignored. In `l2tp_netlink.c`:
`l2tp_nl_cmd_tunnel_create()`, `cfg.udp6_zero_rx_checksums` is set
according to the flag, and then passed to `l2tp_core.c`:
`l2tp_tunnel_create()` and then `l2tp_tunnel_sock_create()`. In
`l2tp_tunnel_sock_create()`, `udp_conf.use_udp6_rx_checksums` is set
the same to `cfg.udp6_zero_rx_checksums`. However, if we want the
checksum to be ignored, `udp_conf.use_udp6_rx_checksums` should be set
to `false`, i.e. be set to the contrary. Similarly, the same should be
done to `udp_conf.use_udp6_tx_checksums`.
Signed-off-by: Miao Wang <shankerwangmiao@gmail.com>
Acked-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace ASYNC_INITIALIZED bit in the tty_port::flags field with
TTY_PORT_INITIALIZED bit in the tty_port::iflags field. Introduce helpers
tty_port_set_initialized() and tty_port_initialized() to abstract
atomic bit ops.
Note: the transforms for test_and_set_bit() and test_and_clear_bit()
are unnecessary as the state transitions are already mutually exclusive;
the tty lock prevents concurrent open/close/hangup.
Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Replace ASYNC_CHECK_CD bit in the tty_port::flags field with
TTY_PORT_CHECK_CD bit in the tty_port::iflags field. Introduce helpers
tty_port_set_check_carrier() and tty_port_check_carrier() to abstract
the atomic bit ops.
Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Replace ASYNC_NORMAL_ACTIVE bit in the tty_port::flags field with
TTY_PORT_ACTIVE bit in the tty_port::iflags field. Introduce helpers
tty_port_set_active() and tty_port_active() to abstract atomic bit ops.
Extract state changes from port lock sections, as this usage is
broken and confused; the state transitions are protected by the
tty lock (which mutually excludes parallel open/close/hangup),
and no user tests the active state while holding the port lock.
Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Replace ASYNC_CTS_FLOW bit in the tty_port::flags field with
TTY_PORT_CTS_FLOW bit in the tty_port::iflags field. Add
tty_port_set_cts_flow() helper to abstract the atomic bit ops.
Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Abstract TTY_THROTTLED bit tests with tty_throttled().
Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Abstract TTY_IO_ERROR status test treewide with tty_io_error().
NB: tty->flags uses atomic bit ops; replace non-atomic bit test
with test_bit().
Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
is_skb_forwardable is not supposed to change anything so constify its
arguments
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After the commit e09acddf87 ("ip_tunnel: replace dst_cache with generic
implementation"), a preemption debug warning is triggered on ip4
tunnels updating; the dst cache helper needs to be invoked in unpreemptible
context.
We don't need to load the cache on tunnel update, so this commit fixes
the warning replacing the load with a dst cache reset, which is
preempt safe.
Fixes: e09acddf87 ("ip_tunnel: replace dst_cache with generic implementation")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Workqueue maybe still in running while we destroy the IDLETIMER target,
thus cause a use after free error, add cancel_work_sync() to avoid such
situation.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The batadv_neigh_node was specific to a batadv_hardif_neigh_node and held
an implicit reference to it. But this reference was never stored in form of
a pointer in the batadv_neigh_node itself. Instead
batadv_neigh_node_release depends on a consistent state of
hard_iface->neigh_list and that batadv_hardif_neigh_get always returns the
batadv_hardif_neigh_node object which it has a reference for. But
batadv_hardif_neigh_get cannot guarantee that because it is working only
with rcu_read_lock on this list. It can therefore happen that a neigh_addr
is in this list twice or that batadv_hardif_neigh_get cannot find the
batadv_hardif_neigh_node for an neigh_addr due to some other list
operations taking place at the same time.
Instead add a batadv_hardif_neigh_node pointer directly in
batadv_neigh_node which will be used for the reference counter decremented
on release of batadv_neigh_node.
Fixes: cef63419f7 ("batman-adv: add list of unique single hop neighbors per hard-interface")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The batadv_tt_local_entry was specific to a batadv_softif_vlan and held an
implicit reference to it. But this reference was never stored in form of a
pointer in the tt_local_entry itself. Instead batadv_tt_local_remove,
batadv_tt_local_table_free and batadv_tt_local_purge_pending_clients depend
on a consistent state of bat_priv->softif_vlan_list and that
batadv_softif_vlan_get always returns the batadv_softif_vlan object which
it has a reference for. But batadv_softif_vlan_get cannot guarantee that
because it is working only with rcu_read_lock on this list. It can
therefore happen that an vid is in this list twice or that
batadv_softif_vlan_get cannot find the batadv_softif_vlan for an vid due to
some other list operations taking place at the same time.
Instead add a batadv_softif_vlan pointer directly in batadv_tt_local_entry
which will be used for the reference counter decremented on release of
batadv_tt_local_entry.
Fixes: 35df3b298f ("batman-adv: fix TT VLAN inconsistency on VLAN re-add")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Acked-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
At the moment there is no explicit reactivation of an hard-interface
upon NETDEV_UP event. In case of B.A.T.M.A.N. IV the interface is
reactivated as soon as the next OGM is scheduled for sending, but this
mechanism does not work with B.A.T.M.A.N. V. The latter does not rely
on the same scheduling mechanism as its predecessor and for this reason
the hard-interface remains deactivated forever after being brought down
once.
This patch fixes the reactivation mechanism by adding a new routing API
which explicitly allows each algorithm to perform any needed operation
upon interface re-activation.
Such API is optional and is implemented by B.A.T.M.A.N. V only and it
just takes care of setting the iface status to ACTIVE
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Now that DAT is VLAN aware, it must use the VID when
computing the DHT address of the candidate nodes where
an entry is going to be stored/retrieved.
Fixes: be1db4f661 ("batman-adv: make the Distributed ARP Table vlan aware")
Signed-off-by: Antonio Quartulli <a@unstable.cc>
[sven@narfation.org: fix conflicts with current version]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Else we get 'BUG: spinlock bad magic on CPU#' on resize when
spin lock debugging is enabled.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This is a forward-port of the original patch from Andrzej Hajda,
he said:
"IS_ERR_VALUE should be used only with unsigned long type.
Otherwise it can work incorrectly. To achieve this function
xt_percpu_counter_alloc is modified to return unsigned long,
and its result is assigned to temporary variable to perform
error checking, before assigning to .pcnt field.
The patch follows conclusion from discussion on LKML [1][2].
[1]: http://permalink.gmane.org/gmane.linux.kernel/2120927
[2]: http://permalink.gmane.org/gmane.linux.kernel/2150581"
Original patch from Andrzej is here:
http://patchwork.ozlabs.org/patch/582970/
This patch has clashed with input validation fixes for x_tables.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pull Ceph fixes from Sage Weil:
"There is a lifecycle fix in the auth code, a fix for a narrow race
condition on map, and a helpful message in the log when there is a
feature mismatch (which happens frequently now that the default
server-side options have changed)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
rbd: report unsupported features to syslog
rbd: fix rbd map vs notify races
libceph: make authorizer destruction independent of ceph_auth_client
This patch overloads the DSA master netdev, aka CPU Ethernet MAC to also
include switch-side statistics, which is useful for debugging purposes,
when the switch is not properly connected to the Ethernet MAC (duplex
mismatch, (RG)MII electrical issues etc.).
We accomplish this by retaining the original copy of the master netdev's
ethtool_ops, and just overload the 3 operations we care about:
get_sset_count, get_strings and get_ethtool_stats so as to intercept
these calls and call into the original master_netdev ethtool_ops, plus
our own.
We take this approach as opposed to providing a set of DSA helper
functions that would retrive the CPU port's statistics, because the
entire purpose of DSA is to allow unmodified Ethernet MAC drivers to be
used as CPU conduit interfaces, therefore, statistics overlay in such
drivers would simply not scale.
The new ethtool -S <iface> output would therefore look like this now:
<iface> statistics
p<2 digits cpu port number>_<switch MIB counter names>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP prequeue goal is to defer processing of incoming packets
to user space thread currently blocked in a recvmsg() system call.
Intent is to spend less time processing these packets on behalf
of softirq handler, as softirq handler is unfair to normal process
scheduler decisions, as it might interrupt threads that do not
even use networking.
Current prequeue implementation has following issues :
1) It only checks size of the prequeue against sk_rcvbuf
It was fine 15 years ago when sk_rcvbuf was in the 64KB vicinity.
But we now have ~8MB values to cope with modern networking needs.
We have to add sk_rmem_alloc in the equation, since out of order
packets can definitely use up to sk_rcvbuf memory themselves.
2) Even with a fixed memory truesize check, prequeue can be filled
by thousands of packets. When prequeue needs to be flushed, either
from sofirq context (in tcp_prequeue() or timer code), or process
context (in tcp_prequeue_process()), this adds a latency spike
which is often not desirable.
I added a fixed limit of 32 packets, as this translated to a max
flush time of 60 us on my test hosts.
Also note that all packets in prequeue are not accounted for tcp_mem,
since they are not charged against sk_forward_alloc at this point.
This is probably not a big deal.
Note that this might increase LINUX_MIB_TCPPREQUEUEDROPPED counts,
which is misnamed, as packets are not dropped at all, but rather pushed
to the stack (where they can be either consumed or dropped)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The collect metadata mode does not support GUE nor FOU. This might be
implemented later; until then, we should reject such config.
I think this is okay to be changed. It's unlikely anyone has such
configuration (as it doesn't work anyway) and we may need a way to
distinguish whether it's supported or not by the kernel later.
For backwards compatibility with iproute2, it's not possible to just check
the attribute presence (iproute2 always includes the attribute), the actual
value has to be checked, too.
Fixes: 2e15ea390e ("ip_gre: Add support to collect tunnel metadata.")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In ipgre (i.e. not gretap) + collect metadata mode, the skb was assumed to
contain Ethernet header and was encapsulated as ETH_P_TEB. This is not the
case, the interface is ARPHRD_IPGRE and the protocol to be used for
encapsulation is skb->protocol.
Fixes: 2e15ea390e ("ip_gre: Add support to collect tunnel metadata.")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In ipgre mode (i.e. not gretap) with collect metadata flag set, the tunnel
is incorrectly assumed to be mGRE in NBMA mode (see commit 6a5f44d7a0).
This is not the case, we're controlling the encapsulation addresses by
lwtunnel metadata. And anyway, assigning dev->header_ops in collect metadata
mode does not make sense.
Although it would be more user firendly to reject requests that specify
both the collect metadata flag and a remote/local IP address, this would
break current users of gretap or introduce ugly code and differences in
handling ipgre and gretap configuration. Keep the current behavior of
remote/local IP address being ignored in such case.
v3: Back to v1, added explanation paragraph.
v2: Reject configuration specifying both remote/local address and collect
metadata flag.
Fixes: 2e15ea390e ("ip_gre: Add support to collect tunnel metadata.")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(root user triggerable) error case.
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJXIH65AAoJEGt7eEactAAdZmwP/R2UAHltBlYhCEMqcM+8VhPD
VDB3LFTYhOVUtVfFwqAzEoxPDjnGyGgZcjO5RxyCZLokm71KbbHAp3h3GnCVQCHd
dnRej6RD+Kl6n0EoTPCLy7ZAjSjpGBWOTy6MEgrAQnTtL+Q7nUch+z5DXIafTg/w
MOYke/WfD1jHbq2eGHu6HkbY3IUwoSKaEoA8qN20ieJRU7jsaG29RiAvBot2IVTI
g3hTL4FPzwSL5XM0qkoxDLPYA5Mo36Cb5sZ9AjkQCaqP/EemOoFxILGWUyi+17nd
zdF3zZB9lj+CdR+0IbjTjz8b457u1g/JW4dLl+iRqv7clynm3gmz7LivVhBcHogx
usg0hW9tDeZ5wzHj8v+e+C+RqyxtgHxVvYtt8Jh6bTqS8aMO8hor7qPFOcpJPmyz
ZbXThJnsvfaYoWAcvIXUa3Q2kwz2myVLDhlQBgwSi5TzgTDqb2GlYnGhvKOnB5cz
6JL3mZt2vi0Yvb7Lk9YzeEYs5cZq4DFVfx3nHgaVwDZ5GPoUTgIjgSnVLFC/J80a
r09Wmigtjy5qftw6w4pSUcf/Fj4L0BD+GhAZ+Hs5xhjUnlAxTlHxDW0bJ/kMmFzy
9B91YSswBc+3IGdjnsN+bNZ6T0XuvapOLRkC1V8fJDrtyy1Tel/LL9Giy/V13XLr
8vgETcgFJyQ3jnkzyRdr
=EWBn
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2016-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
Just a single fix, for a per-CPU memory leak in a
(root user triggerable) error case.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This is never called with a NULL "buf" and anyway, we dereference 's' on
the lines before so it would Oops before we reach the check.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1) check skb size to avoid reading beyond its border when delivering
payloads, by Sven Eckelmann
2) initialize last_seen time in neigh_node object to prevent cleanup
routine from accidentally purge it, by Marek Lindner
3) release "recently added" slave interfaces upon virtual/batman
interface shutdown, by Sven Eckelmann
4) properly decrease router object reference counter upon routing table
update, by Sven Eckelmann
5) release queue slots when purging OGM packets of deactivating slave
interface, by Linus Lüssing
Patch 2 and 3 have no "Fixes:" tag because the offending commits date
back to when batman-adv was not yet officially in the net tree.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJXHt+MAAoJEJ4aZjxxc6bKMEgP/1DZWgQpHs5IM8yW7IQx8CQO
iMkpwfnJcRSOnADC/Z2GtIcz1Df2r+NZcqf5xMMF2CL0xlks024qTHoqeV7Poyel
DmzzETbQFWgdFD22RI70h25T4Yb400PP0saL2TbcVec6CiM57YN3cPbhjZvqzN32
bCIa38kwAGXvNqRzcy5WjDF/rllAoJZ0s055z+kY8WuVOmvOEor+FDmWFr0D8ioP
/utVP9ACA3YHZ39DMDFDsyBp6nMZOgHjpJVfmcubFULHmKvYQ0zMpgX19IVoMsJ6
HEtz9fKN4KPgAFbbPcU0GLg4srsNFmEbTB7Bqhqods+ZYN60M4Z0kexqYz1XuItH
atISvCIe14xHdT6gW32N707yK30DxUKIEpEg5wMXhE+1m041NfrfrcvaEXSLco6d
txsQzd1R4T5ry3V1YXv4znSVPmHvd84ykKrklQZgPA09QIPCCDb7Olp8Lj6mMsmc
OuEYLOfAoOD/KZcRUzY6kWpMRfOJLLXUgwcfSEES8MCaaBGD91YZyrSuHvixmo5V
24JTp0D/X/rkkQjI3a2Pf0dhvdGHAk1g6mElddo86a0UpRbm3qshquAPf+U8QcU0
Kt4rpN9dtOA8yTpnvxG2r04T32yzQQQNIqRGEnugokaJUECFF0mugxENTqcw2vux
uNxMhl36A21czA9s/Iu7
=o059
-----END PGP SIGNATURE-----
Merge tag 'batman-adv-fix-for-davem' of git://git.open-mesh.org/linux-merge
Antonio Quartulli says:
====================
In this patchset you can find the following fixes:
1) check skb size to avoid reading beyond its border when delivering
payloads, by Sven Eckelmann
2) initialize last_seen time in neigh_node object to prevent cleanup
routine from accidentally purge it, by Marek Lindner
3) release "recently added" slave interfaces upon virtual/batman
interface shutdown, by Sven Eckelmann
4) properly decrease router object reference counter upon routing table
update, by Sven Eckelmann
5) release queue slots when purging OGM packets of deactivating slave
interface, by Linus Lüssing
Patch 2 and 3 have no "Fixes:" tag because the offending commits date
back to when batman-adv was not yet officially in the net tree.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
There's no need to calculate rps hash if it was not enabled. So this
patch export rps_needed and check it before trying to get rps
hash. Tests (using pktgen to inject packets to guest) shows this can
improve pps about 13% (when rps is disabled).
Before:
~1150000 pps
After:
~1300000 pps
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
----
Changes from V1:
- Fix build when CONFIG_RPS is not set
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds an eor bit to the TCP_SKB_CB. When MSG_EOR
is passed to tcp_sendmsg, the eor bit will be set at the skb
containing the last byte of the userland's msg. The eor bit
will prevent data from appending to that skb in the future.
The change in do_tcp_sendpages is to honor the eor set
during the previous tcp_sendmsg(MSG_EOR) call.
This patch handles the tcp_sendmsg case. The followup patches
will handle other skb coalescing and fragment cases.
One potential use case is to use MSG_EOR with
SOF_TIMESTAMPING_TX_ACK to get a more accurate
TCP ack timestamping on application protocol with
multiple outgoing response messages (e.g. HTTP2).
Packetdrill script for testing:
~~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
0.200 write(4, ..., 14600) = 14600
0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
0.200 > . 1:7301(7300) ack 1
0.200 > P. 7301:14601(7300) ack 1
0.300 < . 1:1(0) ack 14601 win 257
0.300 > P. 14601:15331(730) ack 1
0.300 > P. 15331:16061(730) ack 1
0.400 < . 1:1(0) ack 16061 win 257
0.400 close(4) = 0
0.400 > F. 16061:16061(0) ack 1
0.400 < F. 1:1(0) ack 16062 win 257
0.400 > . 16062:16062(0) ack 2
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The SKBTX_ACK_TSTAMP flag is set in skb_shinfo->tx_flags when
the timestamp of the TCP acknowledgement should be reported on
error queue. Since accessing skb_shinfo is likely to incur a
cache-line miss at the time of receiving the ack, the
txstamp_ack bit was added in tcp_skb_cb, which is set iff
the SKBTX_ACK_TSTAMP flag is set for an skb. This makes
SKBTX_ACK_TSTAMP flag redundant.
Remove the SKBTX_ACK_TSTAMP and instead use the txstamp_ack bit
everywhere.
Note that this frees one bit in shinfo->tx_flags.
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Suggested-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the redundant check for sk->sk_tsflags in tcp_tx_timestamp.
tcp_tx_timestamp() receives the tsflags as a parameter. As a
result the "sk->sk_tsflags || tsflags" is redundant, since
tsflags already includes sk->sk_tsflags plus overrides from
control messages.
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is nothing related to BH in SNMP counters anymore,
since linux-3.0.
Rename helpers to use __ prefix instead of _BH prefix,
for contexts where preemption is disabled.
This more closely matches convention used to update
percpu variables.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename IP6_UPD_PO_STATS_BH() to __IP6_UPD_PO_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename IP6_INC_STATS_BH() to __IP6_INC_STATS()
and IP6_ADD_STATS_BH() to __IP6_ADD_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename NET_INC_STATS_BH() to __NET_INC_STATS()
and NET_ADD_STATS_BH() to __NET_ADD_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename IP_UPD_PO_STATS_BH() to __IP_UPD_PO_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename ICMP6_INC_STATS_BH() to __ICMP6_INC_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename IP_INC_STATS_BH() to __IP_INC_STATS(), to
better express this is used in non preemptible context.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename SCTP_INC_STATS_BH() to __SCTP_INC_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename UDP_INC_STATS_BH() to __UDP_INC_STATS(),
and UDP6_INC_STATS_BH() to __UDP6_INC_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename ICMP_INC_STATS_BH() to __ICMP_INC_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename DCCP_INC_STATS_BH() to __DCCP_INC_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the old days (before linux-3.0), SNMP counters were duplicated,
one for user context, and one for BH context.
After commit 8f0ea0fe3a ("snmp: reduce percpu needs by 50%")
we have a single copy, and what really matters is preemption being
enabled or disabled, since we use this_cpu_inc() or __this_cpu_inc()
respectively.
We therefore kill SNMP_INC_STATS_USER(), SNMP_ADD_STATS_USER(),
NET_INC_STATS_USER(), NET_ADD_STATS_USER(), SCTP_INC_STATS_USER(),
SNMP_INC_STATS64_USER(), SNMP_ADD_STATS64_USER(), TCP_ADD_STATS_USER(),
UDP_INC_STATS_USER(), UDP6_INC_STATS_USER(), and XFRM_INC_STATS_USER()
Following patches will rename __BH helpers to make clear their
usage is not tied to BH being disabled.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Minor overlapping changes in the conflicts.
In the macsec case, the change of the default ID macro
name overlapped with the 64-bit netlink attribute alignment
fixes in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to 3bfd847203 ("net: Use passed in table for nexthop lookups")
for IPv4, if the route spec contains a table id use that to lookup the
next hop first and fall back to a full lookup if it fails (per the fix
4c9bcd1179 ("net: Fix nexthop lookups")).
Example:
root@kenny:~# ip -6 ro ls table red
local 2100:1::1 dev lo proto none metric 0 pref medium
2100:1::/120 dev eth1 proto kernel metric 256 pref medium
local 2100:2::1 dev lo proto none metric 0 pref medium
2100:2::/120 dev eth2 proto kernel metric 256 pref medium
local fe80::e0:f9ff:fe09:3cac dev lo proto none metric 0 pref medium
local fe80::e0:f9ff:fe1c:b974 dev lo proto none metric 0 pref medium
fe80::/64 dev eth1 proto kernel metric 256 pref medium
fe80::/64 dev eth2 proto kernel metric 256 pref medium
ff00::/8 dev red metric 256 pref medium
ff00::/8 dev eth1 metric 256 pref medium
ff00::/8 dev eth2 metric 256 pref medium
unreachable default dev lo metric 240 error -113 pref medium
root@kenny:~# ip -6 ro add table red 2100:3::/64 via 2100:1::64
RTNETLINK answers: No route to host
Route add fails even though 2100:1::64 is a reachable next hop:
root@kenny:~# ping6 -I red 2100:1::64
ping6: Warning: source address might be selected on device other than red.
PING 2100:1::64(2100:1::64) from 2100:1::1 red: 56 data bytes
64 bytes from 2100:1::64: icmp_seq=1 ttl=64 time=1.33 ms
With this patch:
root@kenny:~# ip -6 ro add table red 2100:3::/64 via 2100:1::64
root@kenny:~# ip -6 ro ls table red
local 2100:1::1 dev lo proto none metric 0 pref medium
2100:1::/120 dev eth1 proto kernel metric 256 pref medium
local 2100:2::1 dev lo proto none metric 0 pref medium
2100:2::/120 dev eth2 proto kernel metric 256 pref medium
2100:3::/64 via 2100:1::64 dev eth1 metric 1024 pref medium
local fe80::e0:f9ff:fe09:3cac dev lo proto none metric 0 pref medium
local fe80::e0:f9ff:fe1c:b974 dev lo proto none metric 0 pref medium
fe80::/64 dev eth1 proto kernel metric 256 pref medium
fe80::/64 dev eth2 proto kernel metric 256 pref medium
ff00::/8 dev red metric 256 pref medium
ff00::/8 dev eth1 metric 256 pref medium
ff00::/8 dev eth2 metric 256 pref medium
unreachable default dev lo metric 240 error -113 pref medium
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nicolas converted most users, but didn't realize some were generated
by macros. Convert those over as well.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In the case that dev_alloc_name() fails, e.g. because the name was
given by the user and already exists, we need to clean up properly
and free the per-CPU statistics. Fix that.
Cc: stable@vger.kernel.org
Fixes: 5a490510ba ("mac80211: use per-CPU TX/RX statistics")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
No more users in the tree, remove NETDEV_TX_LOCKED support.
Adds another hole in softnet_stats struct, but better than keeping
the unused collision counter around.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
For sctp assoc, when rcvbuf_policy is set, it will has it's own
rmem_alloc, when we dump asoc info in sctp_diag, we should use that
value on RMEM_ALLOC as well, just like WMEM_ALLOC.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth-next 2016-04-26
Here's another set of Bluetooth & 802.15.4 patches for the 4.7 kernel:
- Cleanups & refactoring of ieee802154 & 6lowpan code
- Security related additions to ieee802154 and mrf24j40 driver
- Memory corruption fix to Bluetooth 6lowpan code
- Race condition fix in vhci driver
- Enhancements to the atusb 802.15.4 driver
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
I also fix commit 8b32ab9e6ef1: use nla_total_size_64bit() for
OVS_FLOW_ATTR_USED in ovs_flow_cmd_msg_size().
Fixes: 8b32ab9e6ef1 ("ovs: use nla_put_u64_64bit()")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I also fix the value of INET_DIAG_MAX. It's wrong since commit 8f840e47f1
which is only in net-next right now, thus I didn't make a separate patch.
Fixes: 8f840e47f1 ("sctp: add the sctp_diag.c file")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It was a simple idea -- save IPv6 configured addresses on a link down
so that IPv6 behaves similar to IPv4. As always the devil is in the
details and the IPv6 stack as too many behavioral differences from IPv4
making the simple idea more complicated than it needs to be.
The current implementation for keeping IPv6 addresses can panic or spit
out a warning in one of many paths:
1. IPv6 route gets an IPv4 route as its 'next' which causes a panic in
rt6_fill_node while handling a route dump request.
2. rt->dst.obsolete is set to DST_OBSOLETE_DEAD hitting the WARN_ON in
fib6_del
3. Panic in fib6_purge_rt because rt6i_ref count is not 1.
The root cause of all these is references related to the host route for
an address that is retained.
So, this patch deletes the host route every time the ifdown loop runs.
Since the host route is deleted and will be re-generated an up there is
no longer a need for the l3mdev fix up. On the 'admin up' side move
addrconf_permanent_addr into the NETDEV_UP event handling so that it
runs only once versus on UP and CHANGE events.
All of the current panics and warnings appear to be related to
addresses on the loopback device, but given the catastrophic nature when
a bug is triggered this patch takes the conservative approach and evicts
all host routes rather than trying to determine when it can be re-used
and when it can not. That can be a later optimizaton if desired.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 841645b5f2.
Ok, this puts the feature back. I've decided to apply David A.'s
bug fix and run with that rather than make everyone wait another
whole release for this feature.
Signed-off-by: David S. Miller <davem@davemloft.net>
Since cfg80211 maintains separate BSS table entries for APs if the same
BSSID, SSID pair is seen on multiple channels, it is possible that it
can map the current_bss to a BSS entry on the wrong channel. This
current_bss will not get flushed unless disconnected and cfg80211
reports a wrong channel as the associated channel.
Fix this by introducing a new cfg80211_connect_bss() function which is
similar to cfg80211_connect_result(), but it includes an additional
parameter: the bss the STA is connected to. This allows drivers to
provide the exact bss entry that matches the BSS to which the connection
was completed.
Reviewed-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Vidyullatha Kanchanapally <vkanchan@qti.qualcomm.com>
Signed-off-by: Sunil Dutt <usdutt@qti.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add support for the a station statistics netlink attribute:
NL80211_STA_INFO_RX_DURATION.
If present, this attribute contains the aggregate PPDU duration (in
microseconds) for all the frames from the peer. This is useful to
help understand the total time spent transmitting to us by all of
the connected peers.
Signed-off-by: Mohammed Shafi Shajakhan <mohammed@qti.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Support checksum neutral ILA as described in the ILA draft. The low
order 16 bits of the identifier are used to contain the checksum
adjustment value.
The csum-mode parameter is added to described checksum processing. There
are three values:
- adjust transport checksum (previous behavior)
- do checksum neutral mapping
- do nothing
On output the csum-mode in the ila_params is checked and acted on. If
mode is checksum neutral mapping then to mapping and set C-bit.
On input, C-bit is checked. If it is set checksum-netural mapping is
done (regardless of csum-mode in ila params) and C-bit will be cleared.
If it is not set then action in csum-mode is taken.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change model of xlat to be used only for input where lookup is done on
the locator part of an address (comparing to locator_match as key
in rhashtable). This is needed for checksum neutral translation
which obfuscates the low order 16 bits of the identifier. It also
permits hosts to be in muliple ILA domains (each locator can map
to a different SIR address). A check is also added to disallow
translating non-ILA addresses (check of type in identifier).
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add structures for identifiers, locators, and an ila address which
is composed of a locator and identifier and in6_addr can be cast to
it. This includes a three bit type field and enums for the types defined
in ILA I-D.
In ILA lwt don't allow user to set a translation for a non-ILA
address (type of identifier is zero meaning it is an IID). This also
requires that the destination prefix is at least 65 bytes (64
bit locator and first byte of identifier).
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The memcpy of ipv6 header destination address to the skb control block
(sbk->cb) in header_create() results in currupted memory when bt_xmit()
is issued. The skb->cb is "released" in the return of header_create()
making room for lower layer to minipulate the skb->cb.
The value retrieved in bt_xmit is not persistent across header creation
and sending, and the lower layer will overwrite portions of skb->cb,
making the copied destination address wrong.
The memory corruption will lead to non-working multicast as the first 4
bytes of the copied destination address is replaced by a value that
resolves into a non-multicast prefix.
This fix removes the dependency on the skb control block between header
creation and send, by moving the destination address memcpy to the send
function path (setup_create, which is called from bt_xmit).
Signed-off-by: Glenn Ruben Bakke <glenn.ruben.bakke@nordicsemi.no>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Cc: stable@vger.kernel.org # 4.5+
rds-stress experiments with request size 256 bytes, 8K acks,
using 16 threads show a 40% improvment when pskb_extract()
replaces the {skb_clone(..); pskb_pull(..); pskb_trim(..);}
pattern in the Rx path, so we leverage the perf gain with
this commit.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A pattern of skb usage seen in modules such as RDS-TCP is to
extract `to_copy' bytes from the received TCP segment, starting
at some offset `off' into a new skb `clone'. This is done in
the ->data_ready callback, where the clone skb is queued up for rx on
the PF_RDS socket, while the parent TCP segment is returned unchanged
back to the TCP engine.
The existing code uses the sequence
clone = skb_clone(..);
pskb_pull(clone, off, ..);
pskb_trim(clone, to_copy, ..);
with the intention of discarding the first `off' bytes. However,
skb_clone() + pskb_pull() implies pksb_expand_head(), which ends
up doing a redundant memcpy of bytes that will then get discarded
in __pskb_pull_tail().
To avoid this inefficiency, this commit adds pskb_extract() that
creates the clone, and memcpy's only the relevant header/frag/frag_list
to the start of `clone'. pskb_trim() is then invoked to trim clone
down to the requested to_copy bytes.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It was impossible to include codel.h for the
purpose of having access to codel_params or
codel_vars structure definitions and using them
for embedding in other more complex structures.
This splits allows codel.h itself to be treated
like any other header file while codel_qdisc.h and
codel_impl.h contain function definitions with
logic that was previously in codel.h.
This copies over copyrights and doesn't involve
code changes other than adding a few additional
include directives to net/sched/sch*codel.c.
Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This strips out qdisc specific bits from the code
and makes it slightly more reusable. Codel will be
used by wireless/mac80211 in the future.
Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We should call consume_skb(skb) when skb is properly consumed,
or kfree_skb(skb) when skb must be dropped in error case.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We now have proper per-listener but also per network namespace counters
for SYN packets that might be dropped.
We replace the kfree_skb() by consume_skb() to be drop monitor [1]
friendly, and remove an obsolete comment.
FastOpen SYN packets can carry payload in them just fine.
[1] perf record -a -g -e skb:kfree_skb sleep 1; perf report
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts the following three commits:
70af921db6799977d9aaf1705ec197
The feature was ill conceived, has terrible semantics, and has added
nothing but regressions to the already fragile ipv6 stack.
Fixes: f1705ec197 ("net: ipv6: Make address flushing on ifdown optional")
Signed-off-by: David S. Miller <davem@davemloft.net>
Starting the kernel client with cephx disabled and then enabling cephx
and restarting userspace daemons can result in a crash:
[262671.478162] BUG: unable to handle kernel paging request at ffffebe000000000
[262671.531460] IP: [<ffffffff811cd04a>] kfree+0x5a/0x130
[262671.584334] PGD 0
[262671.635847] Oops: 0000 [#1] SMP
[262672.055841] CPU: 22 PID: 2961272 Comm: kworker/22:2 Not tainted 4.2.0-34-generic #39~14.04.1-Ubuntu
[262672.162338] Hardware name: Dell Inc. PowerEdge R720/068CDY, BIOS 2.4.3 07/09/2014
[262672.268937] Workqueue: ceph-msgr con_work [libceph]
[262672.322290] task: ffff88081c2d0dc0 ti: ffff880149ae8000 task.ti: ffff880149ae8000
[262672.428330] RIP: 0010:[<ffffffff811cd04a>] [<ffffffff811cd04a>] kfree+0x5a/0x130
[262672.535880] RSP: 0018:ffff880149aeba58 EFLAGS: 00010286
[262672.589486] RAX: 000001e000000000 RBX: 0000000000000012 RCX: ffff8807e7461018
[262672.695980] RDX: 000077ff80000000 RSI: ffff88081af2be04 RDI: 0000000000000012
[262672.803668] RBP: ffff880149aeba78 R08: 0000000000000000 R09: 0000000000000000
[262672.912299] R10: ffffebe000000000 R11: ffff880819a60e78 R12: ffff8800aec8df40
[262673.021769] R13: ffffffffc035f70f R14: ffff8807e5b138e0 R15: ffff880da9785840
[262673.131722] FS: 0000000000000000(0000) GS:ffff88081fac0000(0000) knlGS:0000000000000000
[262673.245377] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[262673.303281] CR2: ffffebe000000000 CR3: 0000000001c0d000 CR4: 00000000001406e0
[262673.417556] Stack:
[262673.472943] ffff880149aeba88 ffff88081af2be04 ffff8800aec8df40 ffff88081af2be04
[262673.583767] ffff880149aeba98 ffffffffc035f70f ffff880149aebac8 ffff8800aec8df00
[262673.694546] ffff880149aebac8 ffffffffc035c89e ffff8807e5b138e0 ffff8805b047f800
[262673.805230] Call Trace:
[262673.859116] [<ffffffffc035f70f>] ceph_x_destroy_authorizer+0x1f/0x50 [libceph]
[262673.968705] [<ffffffffc035c89e>] ceph_auth_destroy_authorizer+0x3e/0x60 [libceph]
[262674.078852] [<ffffffffc0352805>] put_osd+0x45/0x80 [libceph]
[262674.134249] [<ffffffffc035290e>] remove_osd+0xae/0x140 [libceph]
[262674.189124] [<ffffffffc0352aa3>] __reset_osd+0x103/0x150 [libceph]
[262674.243749] [<ffffffffc0354703>] kick_requests+0x223/0x460 [libceph]
[262674.297485] [<ffffffffc03559e2>] ceph_osdc_handle_map+0x282/0x5e0 [libceph]
[262674.350813] [<ffffffffc035022e>] dispatch+0x4e/0x720 [libceph]
[262674.403312] [<ffffffffc034bd91>] try_read+0x3d1/0x1090 [libceph]
[262674.454712] [<ffffffff810ab7c2>] ? dequeue_entity+0x152/0x690
[262674.505096] [<ffffffffc034cb1b>] con_work+0xcb/0x1300 [libceph]
[262674.555104] [<ffffffff8108fb3e>] process_one_work+0x14e/0x3d0
[262674.604072] [<ffffffff810901ea>] worker_thread+0x11a/0x470
[262674.652187] [<ffffffff810900d0>] ? rescuer_thread+0x310/0x310
[262674.699022] [<ffffffff810957a2>] kthread+0xd2/0xf0
[262674.744494] [<ffffffff810956d0>] ? kthread_create_on_node+0x1c0/0x1c0
[262674.789543] [<ffffffff817bd81f>] ret_from_fork+0x3f/0x70
[262674.834094] [<ffffffff810956d0>] ? kthread_create_on_node+0x1c0/0x1c0
What happens is the following:
(1) new MON session is established
(2) old "none" ac is destroyed
(3) new "cephx" ac is constructed
...
(4) old OSD session (w/ "none" authorizer) is put
ceph_auth_destroy_authorizer(ac, osd->o_auth.authorizer)
osd->o_auth.authorizer in the "none" case is just a bare pointer into
ac, which contains a single static copy for all services. By the time
we get to (4), "none" ac, freed in (2), is long gone. On top of that,
a new vtable installed in (3) points us at ceph_x_destroy_authorizer(),
so we end up trying to destroy a "none" authorizer with a "cephx"
destructor operating on invalid memory!
To fix this, decouple authorizer destruction from ac and do away with
a single static "none" authorizer by making a copy for each OSD or MDS
session. Authorizers themselves are independent of ac and so there is
no reason for destroy_authorizer() to be an ac op. Make it an op on
the authorizer itself by turning ceph_authorizer into a real struct.
Fixes: http://tracker.ceph.com/issues/15447
Reported-by: Alan Zhang <alan.zhang@linux.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
d894ba18d4 ("soreuseport: fix ordering for mixed v4/v6 sockets")
was merged as a bug fix to the net tree. Two conflicting changes
were committed to net-next before the above fix was merged back to
net-next:
ca065d0cf8 ("udp: no longer use SLAB_DESTROY_BY_RCU")
3b24d854cb ("tcp/dccp: do not touch listener sk_refcnt under synflood")
These changes switched the datastructure used for TCP and UDP sockets
from hlist_nulls to hlist. This patch applies the necessary parts
of the net tree fix to net-next which were not automatic as part of the
merge.
Fixes: 1602f49b58 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simon Horman says:
====================
IPVS Updates for v4.7
please consider these enhancements to the IPVS. They allow SIP connections
originating from real-servers to be load balanced by the SIP psersitence
engine as is already implemented in the other direction. And for better one
packet scheduling (OPS) performance.
====================
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
ip6_route_output() will never return a NULL pointer, so there's no need
to check it.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Four years ago we introduced a new sysctl knob to disable automatic
helper assignment in 72110dfaa907 ("netfilter: nf_ct_helper: disable
automatic helper assignment"). This knob kept this behaviour enabled by
default to remain conservative.
This measure was introduced to provide a secure way to configure
iptables and connection tracking helpers through explicit rules.
Give the time we have waited for this, let's turn off this by default
now, worse case users still have a chance to recover the former
behaviour by explicitly enabling this back through sysctl.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch fixes dynamic element updates for adjacent intervals in the
rb-tree representation.
Since elements are sorted in the rb-tree, in case of adjacent nodes with
the same key, the assumption is that an interval end node must be placed
before an interval opening.
In tree lookup operations, the idea is to search for the closer element
that is smaller than the one we're searching for. Given that we'll have
two possible matchings, we have to take the opening interval in case of
adjacent nodes.
Range merges are not trivial with the current representation,
specifically we have to check if node extensions are equal and make sure
we keep the existing internal states around.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Parse flags and pass them to the set via ->deactivate() to check if we
remove the right element from the intervals.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This function parses the set element flags, thus, we can reuse the same
handling when deleting elements.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
As earlier commit removed accessed to the hash from other files we can
also make it static.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Use a private seed and init it using get_random_once.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We only allow rehash in init namespace, so we only use
init_ns.generation. And even if we would allow it, it makes no sense
as the conntrack locks are global; any ongoing rehash prevents insert/
delete.
So make this private to nf_conntrack_core instead.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
After commit fbd40ea018 ("ipv4: Don't do expensive useless work
during inetdev destroy.") when deleting an interface,
fib_del_ifaddr() can be executed without any primary address
present on the dead interface.
The above is safe, but triggers some "bug: prim == NULL" warnings.
This commit avoids warning if the in_dev is dead
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linux TCP stack painfully segments all TSO/GSO packets before retransmits.
This was fine back in the days when TSO/GSO were emerging, with their
bugs, but we believe the dark age is over.
Keeping big packets in write queues, but also in stack traversal
has a lot of benefits.
- Less memory overhead, because write queues have less skbs
- Less cpu overhead at ACK processing.
- Better SACK processing, as lot of studies mentioned how
awful linux was at this ;)
- Less cpu overhead to send the rtx packets
(IP stack traversal, netfilter traversal, drivers...)
- Better latencies in presence of losses.
- Smaller spikes in fq like packet schedulers, as retransmits
are not constrained by TCP Small Queues.
1 % packet losses are common today, and at 100Gbit speeds, this
translates to ~80,000 losses per second.
Losses are often correlated, and we see many retransmit events
leading to 1-MSS train of packets, at the time hosts are already
under stress.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 42b18f605f ("tipc: refactor function tipc_link_timeout()"),
introduced a bug which prevents sending of probe messages during
link synchronization phase. This leads to hanging links, if the
bearer is disabled/enabled after links are up.
In this commit, we send the probe messages correctly.
Fixes: 42b18f605f ("tipc: refactor function tipc_link_timeout()")
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a race-condition when updating the mdb offload flag without using
the mulicast_lock. This reverts commit 9e8430f8d6 ("bridge: mdb:
Passing the port-group pointer to br_mdb module").
This patch marks offloaded MDB entry as "offload" by changing the port-
group flags and marks it as MDB_PG_FLAGS_OFFLOAD.
When switchdev PORT_MDB succeeded and adds a multicast group, a completion
callback is been invoked "br_mdb_complete". The completion function
locks the multicast_lock and finds the right net_bridge_port_group and
marks it as offloaded.
Fixes: 9e8430f8d6 ("bridge: mdb: Passing the port-group pointer to br_mdb module")
Reported-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is duplicate code that translates br_mdb_entry to br_ip let's wrap it
in a common function.
Signed-off-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When using switchdev deferred operation (SWITCHDEV_F_DEFER), the operation
is executed in different context and the application doesn't have any way
to get the operation real status.
Adding a completion callback fixes that. This patch adds fields to
switchdev_attr and switchdev_obj "complete_priv" field which is used by
the "complete" callback.
Application can set a complete function which will be called once the
operation executed.
Signed-off-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When removing a single interface while a broadcast or ogm packet is
still pending then we will free the forward packet without releasing the
queue slots again.
This patch is supposed to fix this issue.
Fixes: 6d5808d4ae ("batman-adv: Add missing hardif_free_ref in forw_packet_free")
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
[sven@narfation.org: fix conflicts with current version]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
_batadv_update_route rcu_derefences orig_ifinfo->router outside of a
spinlock protected region to print some information messages to the debug
log. But this pointer is not checked again when the new pointer is assigned
in the spinlock protected region. Thus is can happen that the value of
orig_ifinfo->router changed in the meantime and thus the reference counter
of the wrong router gets reduced after the spinlock protected region.
Just rcu_dereferencing the value of orig_ifinfo->router inside the spinlock
protected region (which also set the new pointer) is enough to get the
correct old router object.
Fixes: e1a5382f97 ("batman-adv: Make orig_node->router an rcu protected pointer")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The shutdown of an batman-adv interface can happen with one of its slave
interfaces still being in the BATADV_IF_TO_BE_ACTIVATED state. A possible
reason for it is that the routing algorithm BATMAN_V was selected and
batadv_schedule_bat_ogm was not yet called for this interface. This slave
interface still has to be set to BATADV_IF_INACTIVE or the batman-adv
interface will never reduce its usage counter and thus never gets shutdown.
This problem can be simulated via:
$ modprobe dummy
$ modprobe batman-adv routing_algo=BATMAN_V
$ ip link add bat0 type batadv
$ ip link set dummy0 master bat0
$ ip link set dummy0 up
$ ip link del bat0
unregister_netdevice: waiting for bat0 to become free. Usage count = 3
Reported-by: Matthias Schiffer <mschiffer@universe-factory.net>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
[sven@narfation.org: fix conflicts with current version]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The encapsulated ethernet and VLAN header may be outside the received
ethernet frame. Thus the skb buffer size has to be checked before it can be
parsed to find out if it encapsulates another batman-adv packet.
Fixes: 420193573f ("batman-adv: softif bridge loop avoidance")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for your net-next
tree, mostly from Florian Westphal to sort out the lack of sufficient
validation in x_tables and connlabel preparation patches to add
nf_tables support. They are:
1) Ensure we don't go over the ruleset blob boundaries in
mark_source_chains().
2) Validate that target jumps land on an existing xt_entry. This extra
sanitization comes with a performance penalty when loading the ruleset.
3) Introduce xt_check_entry_offsets() and use it from {arp,ip,ip6}tables.
4) Get rid of the smallish check_entry() functions in {arp,ip,ip6}tables.
5) Make sure the minimal possible target size in x_tables.
6) Similar to #3, add xt_compat_check_entry_offsets() for compat code.
7) Check that standard target size is valid.
8) More sanitization to ensure that the target_offset field is correct.
9) Add xt_check_entry_match() to validate that matches are well-formed.
10-12) Three patch to reduce the number of parameters in
translate_compat_table() for {arp,ip,ip6}tables by using a container
structure.
13) No need to return value from xt_compat_match_from_user(), so make
it void.
14) Consolidate translate_table() so it can be used by compat code too.
15) Remove obsolete check for compat code, so we keep consistent with
what was already removed in the native layout code (back in 2007).
16) Get rid of target jump validation from mark_source_chains(),
obsoleted by #2.
17) Introduce xt_copy_counters_from_user() to consolidate counter
copying, and use it from {arp,ip,ip6}tables.
18,22) Get rid of unnecessary explicit inlining in ctnetlink for dump
functions.
19) Move nf_connlabel_match() to xt_connlabel.
20) Skip event notification if connlabel did not change.
21) Update of nf_connlabels_get() to make the upcoming nft connlabel
support easier.
23) Remove spinlock to read protocol state field in conntrack.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
nla_data() is now aligned on a 64-bit area.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
nla_data() is now aligned on a 64-bit area.
A temporary version (nla_put_be64_32bit()) is added for nla_put_net64().
This function is removed in the next patch.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
nla_data() is now aligned on a 64-bit area.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts were two cases of simple overlapping changes,
nothing serious.
In the UDP case, we need to add a hlist_add_tail_rcu()
to linux/rculist.h, because we've moved UDP socket handling
away from using nulls lists.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Fix memory leak in iwlwifi, from Matti Gottlieb.
2) Add missing registration of netfilter arp_tables into initial
namespace, from Florian Westphal.
3) Fix potential NULL deref in DecNET routing code.
4) Restrict NETLINK_URELEASE to truly bound sockets only, from Dmitry
Ivanov.
5) Fix dst ref counting in VRF, from David Ahern.
6) Fix TSO segmenting limits in i40e driver, from Alexander Duyck.
7) Fix heap leak in PACKET_DIAG_MCLIST, from Mathias Krause.
8) Ravalidate IPV6 datagram socket cached routes properly, particularly
with UDP, from Martin KaFai Lau.
9) Fix endian bug in RDS dp_ack_seq handling, from Qing Huang.
10) Fix stats typing in bcmgenet driver, from Eric Dumazet.
11) Openvswitch needs to orphan SKBs before ipv6 fragmentation handing,
from Joe Stringer.
12) SPI device reference leak in spi_ks8895 PHY driver, from Mark Brown.
13) atl2 doesn't actually support scatter-gather, so don't advertise the
feature. From Ben Hucthings.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (72 commits)
openvswitch: use flow protocol when recalculating ipv6 checksums
Driver: Vmxnet3: set CHECKSUM_UNNECESSARY for IPv6 packets
atl2: Disable unimplemented scatter/gather feature
net/mlx4_en: Split SW RX dropped counter per RX ring
net/mlx4_core: Don't allow to VF change global pause settings
net/mlx4_core: Avoid repeated calls to pci enable/disable
net/mlx4_core: Implement pci_resume callback
net: phy: spi_ks8895: Don't leak references to SPI devices
net: ethernet: davinci_emac: Fix platform_data overwrite
net: ethernet: davinci_emac: Fix Unbalanced pm_runtime_enable
qede: Fix single MTU sized packet from firmware GRO flow
qede: Fix setting Skb network header
qede: Fix various memory allocation error flows for fastpath
tcp: Merge tx_flags and tskey in tcp_shifted_skb
tcp: Merge tx_flags and tskey in tcp_collapse_retrans
drivers: net: cpsw: fix wrong regs access in cpsw_ndo_open
tcp: Fix SOF_TIMESTAMPING_TX_ACK when handling dup acks
openvswitch: Orphan skbs before IPv6 defrag
Revert "Prevent NUll pointer dereference with two PHYs on cpsw"
VSOCK: Only check error on skb_recv_datagram when skb is NULL
...
When using masked actions the ipv6_proto field of an action
to set IPv6 fields may be zero rather than the prevailing protocol
which will result in skipping checksum recalculation.
This patch resolves the problem by relying on the protocol
in the flow key rather than that in the set field action.
Fixes: 83d2b9ba1a ("net: openvswitch: Support masked set actions.")
Cc: Jarno Rajahalme <jrajahalme@nicira.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for NETIF_F_TSO_MANGLEID if a given tunnel supports
NETIF_F_TSO. This way if needed a device can then later enable the TSO
with IP ID mangling and the tunnels on top of that device can then also
make use of the IP ID mangling as well.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Removed .type field from NLA to do proper length checking.
Reported by Daniel Borkmann and Julia Lawall.
Signed-off-by: Peter Heise <peter.heise@airbus.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
EXPIRES_IN_MS macro comes from net/ipv4/inet_diag.c and dates
back to before jiffies_to_msecs() has been introduced.
Now we can remove it and use jiffies_to_msecs().
Suggested-by: Jakub Sitnicki <jkbs@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Jakub Sitnicki <jkbs@redhat.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Having the tag protocol in dsa_switch_driver for setup time and in
dsa_switch_tree for runtime is enough. Remove dsa_switch's one.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a new RTM_GETSTATS message to query link stats via netlink
from the kernel. RTM_NEWLINK also dumps stats today, but RTM_NEWLINK
returns a lot more than just stats and is expensive in some cases when
frequent polling for stats from userspace is a common operation.
RTM_GETSTATS is an attempt to provide a light weight netlink message
to explicity query only link stats from the kernel on an interface.
The idea is to also keep it extensible so that new kinds of stats can be
added to it in the future.
This patch adds the following attribute for NETDEV stats:
struct nla_policy ifla_stats_policy[IFLA_STATS_MAX + 1] = {
[IFLA_STATS_LINK_64] = { .len = sizeof(struct rtnl_link_stats64) },
};
Like any other rtnetlink message, RTM_GETSTATS can be used to get stats of
a single interface or all interfaces with NLM_F_DUMP.
Future possible new types of stat attributes:
link af stats:
- IFLA_STATS_LINK_IPV6 (nested. for ipv6 stats)
- IFLA_STATS_LINK_MPLS (nested. for mpls/mdev stats)
extended stats:
- IFLA_STATS_LINK_EXTENDED (nested. extended software netdev stats like bridge,
vlan, vxlan etc)
- IFLA_STATS_LINK_HW_EXTENDED (nested. extended hardware stats which are
available via ethtool today)
This patch also declares a filter mask for all stat attributes.
User has to provide a mask of stats attributes to query. filter mask
can be specified in the new hdr 'struct if_stats_msg' for stats messages.
Other important field in the header is the ifindex.
This api can also include attributes for global stats (eg tcp) in the future.
When global stats are included in a stats msg, the ifindex in the header
must be zero. A single stats message cannot contain both global and
netdev specific stats. To easily distinguish them, netdev specific stat
attributes name are prefixed with IFLA_STATS_LINK_
Without any attributes in the filter_mask, no stats will be returned.
This patch has been tested with mofified iproute2 ifstat.
Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When using OPS mode in conjunction with SIP persistent-engine, packets
originating from the same ip-address/port could be balanced to different
real servers, and (to properly handle SIP responses) OPS connections
are created in the in-out direction too, where ip_vs_update_conntrack()
is called to modify the reply tuple.
As a result, there can be collision of conntrack tuples, causing random
packet drops, as explained below:
conntrack1: orig=CIP->VIP, reply=RIP1->CIP
conntrack2: orig=RIP2->CIP, reply=CIP->VIP
Tuple CIP->VIP is both in orig of conntrack1 and reply of conntrack2.
The collision triggers packet drop inside nf_conntrack processing.
In addition, the current implementation deletes the conntrack object at
every expire of an OPS connection (once every forwarded packet), to have
it recreated from scratch at next packet traversing IPVS.
Since in OPS mode, by definition, we don't expect any associated
response, the choices implemented in this patch are:
a) don't call nf_conntrack_alter_reply() for OPS connections inside
ip_vs_update_conntrack().
b) don't delete the conntrack object at OPS connection expire.
The result is that created conntrack objects for each tuple CIP->VIP,
RIP-N->CIP, etc. are left in UNREPLIED state and not modified by IPVS
OPS connection management. This eliminates packet drops and leaves
a single conntrack object for each tuple packets are sent from.
Signed-off-by: Marco Angaroni <marcoangaroni@gmail.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
One-packet-scheduling is the most expensive mode in IPVS from
performance point of view: for each packet to be processed a new
connection data structure is created and, after packet is sent,
deleted by starting a new timer set to expire immediately.
SIP persistent-engine needs OPS mode to have Call-ID based load
balancing, so OPS mode performance has negative impact in SIP
protocol load balancing.
This patch aims to improve performance of OPS mode by means of the
following changes in the release mechanism of OPS connections:
a) call expire callback ip_vs_conn_expire() directly instead of
starting a timer programmed to fire immediately.
b) avoid call_rcu() overhead inside expire callback, since OPS
connection are not inserted in the hash-table and last just the
time to process the packet, hence there is no concurrent access
to such data structures.
Signed-off-by: Marco Angaroni <marcoangaroni@gmail.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
When using LVS-NAT and SIP persistence-egine over UDP, the following
limitations are present with current implementation:
1) To actually have load-balancing based on Call-ID header, you need to
use one-packet-scheduling mode. But with one-packet-scheduling the
connection is deleted just after packet is forwarded, so SIP responses
coming from real-servers do not match any connection and SNAT is
not applied.
2) If you do not use "-o" option, IPVS behaves as normal UDP load
balancer, so different SIP calls (each one identified by a different
Call-ID) coming from the same ip-address/port go to the same
real-server. So basically you don’t have load-balancing based on
Call-ID as intended.
3) Call-ID is not learned when a new SIP call is started by a real-server
(inside-to-outside direction), but only in the outside-to-inside
direction. This would be a general problem for all SIP servers acting
as Back2BackUserAgent.
This patch aims to solve problems 1) and 3) while keeping OPS mode
mandatory for SIP-UDP, so that 2) is not a problem anymore.
The basic mechanism implemented is to make packets, that do not match any
existent connection but come from real-servers, create new connections
instead of let them pass without any effect.
When such packets pass through ip_vs_out(), if their source ip address and
source port match a configured real-server, a new connection is
automatically created in the same way as it would have happened if the
packet had come from outside-to-inside direction. A new connection template
is created too if the virtual-service is persistent and there is no
matching connection template found. The new connection automatically
created, if the service had "-o" option, is an OPS connection that lasts
only the time to forward the packet, just like it happens on the
ingress side.
The main part of this mechanism is implemented inside a persistent-engine
specific callback (at the moment only SIP persistent engine exists) and
is triggered only for UDP packets, since connection oriented protocols, by
using different set of ports (typically ephemeral ports) to open new
outgoing connections, should not need this feature.
The following requisites are needed for automatic connection creation; if
any is missing the packet simply goes the same way as before.
a) virtual-service is not fwmark based (this is because fwmark services
do not store address and port of the virtual-service, required to
build the connection data).
b) virtual-service and real-servers must not have been configured with
omitted port (this is again to have all data to create the connection).
Signed-off-by: Marco Angaroni <marcoangaroni@gmail.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
If skb_recv_datagram returns an skb, we should ignore the err
value returned. Otherwise, datagram receives will return EAGAIN
when they have to wait for a datagram.
Acked-by: Adit Ranadive <aditr@vmware.com>
Signed-off-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The dsa_slave_priv structure does not need a pointer to its net_device.
Kill it.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a new helper for cls/act programs that can push events
to user space applications. For networking, this can be f.e. for sampling,
debugging, logging purposes or pushing of arbitrary wake-up events. The
idea is similar to a43eec3042 ("bpf: introduce bpf_perf_event_output()
helper") and 39111695b1 ("samples: bpf: add bpf_perf_event_output example").
The eBPF program utilizes a perf event array map that user space populates
with fds from perf_event_open(), the eBPF program calls into the helper
f.e. as skb_event_output(skb, &my_map, BPF_F_CURRENT_CPU, raw, sizeof(raw))
so that the raw data is pushed into the fd f.e. at the map index of the
current CPU.
User space can poll/mmap/etc on this and has a data channel for receiving
events that can be post-processed. The nice thing is that since the eBPF
program and user space application making use of it are tightly coupled,
they can define their own arbitrary raw data format and what/when they
want to push.
While f.e. packet headers could be one part of the meta data that is being
pushed, this is not a substitute for things like packet sockets as whole
packet is not being pushed and push is only done in a single direction.
Intention is more of a generically usable, efficient event pipe to applications.
Workflow is that tc can pin the map and applications can attach themselves
e.g. after cls/act setup to one or multiple map slots, demuxing is done by
the eBPF program.
Adding this facility is with minimal effort, it reuses the helper
introduced in a43eec3042 ("bpf: introduce bpf_perf_event_output() helper")
and we get its functionality for free by overloading its BPF_FUNC_ identifier
for cls/act programs, ctx is currently unused, but will be made use of in
future. Example will be added to iproute2's BPF example files.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Separated from previous patch for readability.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
Struct ctl_table_header holds pointer to sysctl table which could be used
for freeing it after unregistration. IPv4 sysctls already use that.
Remove redundant NULL assignment: ndev allocated using kzalloc.
This also saves some bytes: sysctl table could be shorter than
DEVCONF_MAX+1 if some options are disable in config.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Suggested-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the nlattr header is 4 bytes in size, it can cause the netlink
attribute payload to not be 8-byte aligned.
This is particularly troublesome for IFLA_STATS64 which contains 64-bit
statistic values.
Solve this by creating a dummy IFLA_PAD attribute which has a payload
which is zero bytes in size. When HAVE_EFFICIENT_UNALIGNED_ACCESS is
false, we insert an IFLA_PAD attribute into the netlink response when
necessary such that the IFLA_STATS64 payload will be properly aligned.
With help and suggestions from Eric Dumazet.
Signed-off-by: David S. Miller <davem@davemloft.net>
Calm down gcc warnings:
net/netfilter/nf_conntrack_netlink.c:529:15: warning: 'ctnetlink_proto_size' defined but not used [-Wunused-function]
static size_t ctnetlink_proto_size(const struct nf_conn *ct)
^
net/netfilter/nf_conntrack_netlink.c:546:15: warning: 'ctnetlink_acct_size' defined but not used [-Wunused-function]
static size_t ctnetlink_acct_size(const struct nf_conn *ct)
^
net/netfilter/nf_conntrack_netlink.c:556:12: warning: 'ctnetlink_secctx_size' defined but not used [-Wunused-function]
static int ctnetlink_secctx_size(const struct nf_conn *ct)
^
net/netfilter/nf_conntrack_netlink.c:572:15: warning: 'ctnetlink_timestamp_size' defined but not used [-Wunused-function]
static size_t ctnetlink_timestamp_size(const struct nf_conn *ct)
^
So gcc compiles them out when CONFIG_NF_CONNTRACK_EVENTS and
CONFIG_NETFILTER_NETLINK_GLUE_CT are not set.
Fixes: 4054ff4545 ("netfilter: ctnetlink: remove unnecessary inlining")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
The function convert_legacy_u32_to_link_mode and
convert_link_mode_to_legacy_u32 may be used outside
of ethtool.c. We rename them to ethtool_convert_...
and export them, so we could use them in others
drivers and modules.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
nf_connlabel_set() takes the bit number that we would like to set.
nf_connlabels_get() however took the number of bits that we want to
support.
So e.g. nf_connlabels_get(32) support bits 0 to 31, but not 32.
This changes nf_connlabels_get() to take the highest bit that we want
to set.
Callers then don't have to cope with a potential integer wrap
when using nf_connlabels_get(bit + 1) anymore.
Current callers are fine, this change is only to make folloup
nft ct label set support simpler.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
make the replace function only send a ctnetlink event if the contents
of the new set is different.
Otherwise 'ct label set ct label | bar'
will cause netlink event storm since we "replace" labels for each packet.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently labels can only be set either by iptables connlabel
match or via ctnetlink.
Before adding nftables set support, clean up the clabel core and move
helpers that nft will not need after all to the xtables module.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We deleted a line of code and accidentally made the "return put_user()"
part of the if statement when it's supposed to be unconditional.
Fixes: 9f9a45beaa ('udp: do not expect udp headers on ioctl SIOCINQ')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch passes netlink attr data ptr directly to dev_get_stats
thus elimiating a stats copy.
Suggested-by: David Miller <davem@davemloft.net>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fix spelling typos found in printk
within various part of the kernel sources.
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Change the dsa_switch_driver.probe function to return a const char *.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds code borrowed from bits and pieces of other protocols to
the IPv6 GRE path so that we can support GSO over IPv6 based GRE tunnels.
By adding this support we are able to significantly improve the throughput
for GRE tunnels as we are able to make use of GSO.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since GRE doesn't really care about L3 protocol we can support IPv4 and
IPv6 using the same offloads. With that being the case we can add a call
to register the offloads for IPv6 as a part of our GRE offload
initialization.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for the basic offloads we support on most devices.
Specifically with this patch set we can support checksum offload, basic
scatter-gather, and highdma.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we were creating an ip6gretap interface the MTU was about 6 bytes
short of what was needed. It turns out we were not taking the Ethernet
header into account and as a result we were eating into the 8 bytes
reserved for the encap limit.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch updates the IP tunnel core function iptunnel_handle_offloads so
that we return an int and do not free the skb inside the function. This
actually allows us to clean up several paths in several tunnels so that we
can free the skb at one point in the path without having to have a
secondary path if we are supporting tunnel offloads.
In addition it should resolve some double-free issues I have found in the
tunnels paths as I believe it is possible for us to end up triggering such
an event in the case of fou or gue.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Two different threads with different rds sockets may be in
rds_recv_rcvbuf_delta() via receive path. If their ports
both map to the same word in the congestion map, then
using non-atomic ops to update it could cause the map to
be incorrect. Lets use atomics to avoid such an issue.
Full credit to Wengang <wen.gang.wang@oracle.com> for
finding the issue, analysing it and also pointing out
to offending code with spin lock based fix.
Reviewed-by: Leon Romanovsky <leon@leon.nu>
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dp->dp_ack_seq is used in big endian format. We need to do the
big endianness conversion when we assign a value in host format
to it.
Signed-off-by: Qing Huang <qing.huang@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When __vlan_insert_tag() fails from skb_vlan_push() path due to the
skb_cow_head(), we need to undo the __skb_push() in the error path
as well that was done earlier to move skb->data pointer to mac header.
Moreover, I noticed that when in the non-error path the __skb_pull()
is done and the original offset to mac header was non-zero, we fixup
from a wrong skb->data offset in the checksum complete processing.
So the skb_postpush_rcsum() really needs to be done before __skb_pull()
where skb->data still points to the mac header start and thus operates
under the same conditions as in __vlan_insert_tag().
Fixes: 93515d53b1 ("net: move vlan pop/push functions into common code")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When rhashtable_walk_init return err, no release function should be
called, and when rhashtable_walk_start return err, we should only invoke
rhashtable_walk_exit to release the source.
But now when sctp_transport_walk_start return err, we just call
rhashtable_walk_stop/exit, and never care about if rhashtable_walk_init
or start return err, which is so bad.
We will fix it by calling rhashtable_walk_exit if rhashtable_walk_start
return err in sctp_transport_walk_start, and if sctp_transport_walk_start
return err, we do not need to call sctp_transport_walk_stop any more.
For sctp proc, we will use 'iter->start_fail' to decide if we will call
rhashtable_walk_stop/exit.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In sctp proc, these three functions in remaddrs and assocs are the
same. we should merge them into one.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This one will implement all the interface of inet_diag, inet_diag_handler.
which includes sctp_diag_dump, sctp_diag_dump_one and sctp_diag_get_info.
It will work as a module, and register inet_diag_handler when loading.
v2->v3:
- fix the mistake in inet_assoc_attr_size().
- change inet_diag_msg_laddrs_fill() name to inet_diag_msg_sctpladdrs_fill.
- change inet_diag_msg_paddrs_fill() name to inet_diag_msg_sctpaddrs_fill.
- add inet_diag_msg_sctpinfo_fill() to make asoc/ep fill code clearer.
- add inet_diag_msg_sctpasoc_fill() to make asoc fill code clearer.
- merge inet_asoc_diag_fill() and inet_ep_diag_fill() to
inet_sctp_diag_fill().
- call sctp_diag_get_info() directly, instead by handler, cause the caller
is in the same file with it.
- call lock_sock in sctp_tsp_dump_one() to make sure we call get sctp info
safely.
- after lock_sock(sk), we should check sk != assoc->base.sk.
- change mem[SK_MEMINFO_WMEM_ALLOC] to asoc->sndbuf_used for asoc dump when
asoc->ep->sndbuf_policy is set. don't use INET_DIAG_MEMINFO attr any more.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
inet_diag_msg_common_fill is used to fill the diag msg common info,
we need to use it in sctp_diag as well, so export it.
inet_diag_msg_attrs_fill is used to fill some common attrs info between
sctp diag and tcp diag.
v2->v3:
- do not need to define and export inet_diag_get_handler any more.
cause all the functions in it are in sctp_diag.ko, we just call
them in sctp_diag.ko.
- add inet_diag_msg_attrs_fill to make codes clear.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For some main variables in sctp.ko, we couldn't export it to other modules,
so we have to define some api to access them.
It will include sctp transport and endpoint's traversal.
There are some transport traversal functions for sctp_diag, we can also
use it for sctp_proc. cause they have the similar situation to traversal
transport.
v2->v3:
- rhashtable_walk_init need the parameter gfp, because of recent upstrem
update
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sctp_diag will dump some important details of sctp's assoc or ep, we use
sctp_info to describe them, sctp_get_sctp_info to get them, and export
it to sctp_diag.ko.
v2->v3:
- we will not use list_for_each_safe in sctp_get_sctp_info, cause
all the callers of it will use lock_sock.
- fix the holes in struct sctp_info with __reserved* field.
because sctp_diag is a new feature, and sctp_info is just for now,
it may be changed in the future.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SCTP already serializes access to rcvbuf through its sock lock:
sctp_recvmsg takes it right in the start and release at the end, while
rx path will also take the lock before doing any socket processing. On
sctp_rcv() it will check if there is an user using the socket and, if
there is, it will queue incoming packets to the backlog. The backlog
processing will do the same. Even timers will do such check and
re-schedule if an user is using the socket.
Simplifying this will allow us to remove sctp_skb_list_tail and get ride
of some expensive lockings. The lists that it is used on are also
mangled with functions like __skb_queue_tail and __skb_unlink in the
same context, like on sctp_ulpq_tail_event() and sctp_clear_pd().
sctp_close() will also purge those while using only the sock lock.
Therefore the lockings performed by sctp_skb_list_tail() are not
necessary. This patch removes this function and replaces its calls with
just skb_queue_splice_tail_init() instead.
The biggest gain is at sctp_ulpq_tail_event(), because the events always
contain a list, even if it's queueing a single skb and this was
triggering expensive calls to spin_lock_irqsave/_irqrestore for every
data chunk received.
As SCTP will deliver each data chunk on a corresponding recvmsg, the
more effective the change will be.
Before this patch, with chunks with 30 bytes:
netperf -t SCTP_STREAM -H 192.168.1.2 -cC -l 60 -- -m 30 -S 400000
400000 -s 400000 400000
on a 10Gbit link with 1500 MTU:
SCTP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.1 () port 0 AF_INET
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
425984 425984 30 60.00 137.45 7.34 7.36 52.504 52.608
With it:
SCTP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.1 () port 0 AF_INET
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
425984 425984 30 60.00 179.10 7.97 6.70 43.740 36.788
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for the newer version 1 of the HSR
networking standard. Version 0 is still default and the new
version has to be selected via iproute2.
Main changes are in the supervision frame handling and its
ethertype field.
Signed-off-by: Peter Heise <peter.heise@airbus.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Last known hot point during SYNFLOOD attack is the clearing
of rx_opt.saw_tstamp in tcp_rcv_state_process()
It is not needed for a listener, so we move it where it matters.
Performance while a SYNFLOOD hits a single listener socket
went from 5 Mpps to 6 Mpps on my test server (24 cores, 8 NIC RX queues)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When removing sk_refcnt manipulation on synflood, I missed that
using skb_set_owner_w() was racy, if sk->sk_wmem_alloc had already
transitioned to 0.
We should hold sk_refcnt instead, but this is a big deal under attack.
(Doing so increase performance from 3.2 Mpps to 3.8 Mpps only)
In this patch, I chose to not attach a socket to syncookies skb.
Performance is now 5 Mpps instead of 3.2 Mpps.
Following patch will remove last known false sharing in
tcp_rcv_state_process()
Fixes: 3b24d854cb ("tcp/dccp: do not touch listener sk_refcnt under synflood")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
According to the link FSM, a received traffic packet can take a link
from state ESTABLISHING to ESTABLISHED, but the link can still not be
fully set up in one atomic operation. This means that even if the the
very first packet on the link is a traffic packet with sequence number
1 (one), it has to be dropped and retransmitted.
This can be avoided if we let the mentioned packet be preceded by a
LINK_PROTOCOL/STATE message, which takes up the endpoint before the
arrival of the traffic.
We add this small feature in this commit.
This is a fully compatible change.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In some link establishment scenarios we see that packet #2 may be sent
out before packet #1, forcing the receiver to demand retransmission of
the missing packet. This is harmless, but may cause confusion among
people tracing the packet flow.
Since this is extremely easy to fix, we do so by adding en extra send
call to the bearer immediately after the link has come up.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function tipc_link_timeout() is unnecessary complex, and can
easily be made more readable.
We do that with this commit. The only functional change is that we
remove a redundant test for whether the broadcast link is up or not.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a link is down, it will continuously try to re-establish contact
with the peer by sending out a RESET or an ACTIVATE message at each
timeout interval. The default value for this interval is currently
375 ms. This is wasteful, and may become a problem in very large
clusters with dozens or hundreds of nodes being down simultaneously.
We now introduce a simple backoff algorithm for these cases. The
first five messages are sent at default rate; thereafter a message
is sent only each 16th timer interval.
This will cover the vast majority of link recycling cases, since the
endpoint starting last will transmit at the higher speed, and the link
should normally be established well be before the rate needs to be
reduced.
The only case where we will see a degradation of link re-establishment
times is when the endpoints remain intact, and a glitch in the
transmission media is causing the link reset. We will then experience
a worst-case re-establishing time of 6 seconds, something we deem
acceptable.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a link endpoint is going down locally, e.g., because its interface
is being stopped, it will spontaneously send out a RESET message to
its peer, informing it about this fact. This saves the peer from
detecting the failure via probing, and hence gives both speedier and
less resource consuming failure detection on the peer side.
According to the link FSM, a receiver of a RESET message, ignoring the
reason for it, must now consider the sender ready to come back up, and
starts periodically sending out ACTIVATE messages to the peer in order
to re-establish the link. Also, according to the FSM, the receiver of
an ACTIVATE message can now go directly to state ESTABLISHED and start
sending regular traffic packets. This is a well-proven and robust FSM.
However, in the case of a reboot, there is a small possibilty that link
endpoint on the rebooted node may have been re-created with a new bearer
identity between the moment it sent its (pre-boot) RESET and the moment
it receives the ACTIVATE from the peer. The new bearer identity cannot
be known by the peer according to this scenario, since traffic headers
don't convey such information. This is a problem, because both endpoints
need to know the correct value of the peer's bearer id at any moment in
time in order to be able to produce correct link events for their users.
The only way to guarantee this is to enforce a full setup message
exchange (RESET + ACTIVATE) even after the reboot, since those messages
carry the bearer idientity in their header.
In this commit we do this by introducing and setting a "stopping" bit in
the header of the spontaneously generated RESET messages, informing the
peer that the sender will not be immediately ready to re-establish the
link. A receiver seeing this bit must act as if this were a locally
detected connectivity failure, and hence has to go through a full two-
way setup message exchange before any link can be re-established.
Although never reported, this problem seems to have always been around.
This protocol addition is fully backwards compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch converts all helpers that can use ARG_PTR_TO_RAW_STACK as argument
type. For tc programs this is bpf_skb_load_bytes(), bpf_skb_get_tunnel_key(),
bpf_skb_get_tunnel_opt(). For tracing, this optimizes bpf_get_current_comm()
and bpf_probe_read(). The check in bpf_skb_load_bytes() for MAX_BPF_STACK can
also be removed since the verifier already makes sure we stay within bounds
on stack buffers.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds what's missing to properly support RPS and RFS on SCTP,
as some of it is already implemented in common calls.
Having support for RPS and RFS allows better scaling specially because
not all NICs support hashing SCTP headers.
Save the hash right when we dequeue a skb from inqueue so we do it only
once per skb instead of per chunk. New sockets will then inherit the
hash through sctp_copy_sock().
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skbs given to validate_xmit_skb() should not have a next
pointer anymore.
Also if a packet is dropped, increment dev->tx_dropped
__dev_queue_xmit() no longer has to change tx_dropped in this case.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull crypto fixes from Herbert Xu:
"This fixes an NFS regression caused by the skcipher/hash conversion in
sunrpc. It also fixes a build problem in certain configurations with
bcm63xx"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
hwrng: bcm63xx - fix device tree compilation
sunrpc: Fix skcipher/shash conversion
With the SO_REUSEPORT socket option, it is possible to create sockets
in the AF_INET and AF_INET6 domains which are bound to the same IPv4 address.
This is only possible with SO_REUSEPORT and when not using IPV6_V6ONLY on
the AF_INET6 sockets.
Prior to the commits referenced below, an incoming IPv4 packet would
always be routed to a socket of type AF_INET when this mixed-mode was used.
After those changes, the same packet would be routed to the most recently
bound socket (if this happened to be an AF_INET6 socket, it would
have an IPv4 mapped IPv6 address).
The change in behavior occurred because the recent SO_REUSEPORT optimizations
short-circuit the socket scoring logic as soon as they find a match. They
did not take into account the scoring logic that favors AF_INET sockets
over AF_INET6 sockets in the event of a tie.
To fix this problem, this patch changes the insertion order of AF_INET
and AF_INET6 addresses in the TCP and UDP socket lists when the sockets
have SO_REUSEPORT set. AF_INET sockets will be inserted at the head of the
list and AF_INET6 sockets with SO_REUSEPORT set will always be inserted at
the tail of the list. This will force AF_INET sockets to always be
considered first.
Fixes: e32ea7e747 ("soreuseport: fast reuseport UDP socket selection")
Fixes: 125e80b88687 ("soreuseport: fast reuseport TCP socket selection")
Reported-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
consume_skb() isn't for error cases that kfree_skb() is more proper
one. At this patch, it fixed tpacket_rcv() and packet_rcv() to be
consistent for error or non-error cases letting perf trace its event
properly.
Signed-off-by: Weongyo Jeong <weongyo.linux@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Until now, the requests sent to topology server are queued
to a workqueue by the generic server framework.
These messages are processed by worker threads and trigger the
registered callbacks.
To reduce latency on uniprocessor systems, explicit rescheduling
is performed using cond_resched() after MAX_RECV_MSG_COUNT(25)
messages.
This implementation on SMP systems leads to an subscriber refcnt
error as described below:
When a worker thread yields by calling cond_resched() in a SMP
system, a new worker is created on another CPU to process the
pending workitem. Sometimes the sleeping thread wakes up before
the new thread finishes execution.
This breaks the assumption on ordering and being single threaded.
The fault is more frequent when MAX_RECV_MSG_COUNT is lowered.
If the first thread was processing subscription create and the
second thread processing close(), the close request will free
the subscriber and the create request oops as follows:
[31.224137] WARNING: CPU: 2 PID: 266 at include/linux/kref.h:46 tipc_subscrb_rcv_cb+0x317/0x380 [tipc]
[31.228143] CPU: 2 PID: 266 Comm: kworker/u8:1 Not tainted 4.5.0+ #97
[31.228377] Workqueue: tipc_rcv tipc_recv_work [tipc]
[...]
[31.228377] Call Trace:
[31.228377] [<ffffffff812fbb6b>] dump_stack+0x4d/0x72
[31.228377] [<ffffffff8105a311>] __warn+0xd1/0xf0
[31.228377] [<ffffffff8105a3fd>] warn_slowpath_null+0x1d/0x20
[31.228377] [<ffffffffa0098067>] tipc_subscrb_rcv_cb+0x317/0x380 [tipc]
[31.228377] [<ffffffffa00a4984>] tipc_receive_from_sock+0xd4/0x130 [tipc]
[31.228377] [<ffffffffa00a439b>] tipc_recv_work+0x2b/0x50 [tipc]
[31.228377] [<ffffffff81071925>] process_one_work+0x145/0x3d0
[31.246554] ---[ end trace c3882c9baa05a4fd ]---
[31.248327] BUG: spinlock bad magic on CPU#2, kworker/u8:1/266
[31.249119] BUG: unable to handle kernel NULL pointer dereference at 0000000000000428
[31.249323] IP: [<ffffffff81099d0c>] spin_dump+0x5c/0xe0
[31.249323] PGD 0
[31.249323] Oops: 0000 [#1] SMP
In this commit, we
- rename tipc_conn_shutdown() to tipc_conn_release().
- move connection release callback execution from tipc_close_conn()
to a new function tipc_sock_release(), which is executed before
we free the connection.
Thus we release the subscriber during connection release procedure
rather than connection shutdown procedure.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a release_cb for UDPv6. It does a route lookup
and updates sk->sk_dst_cache if it is needed. It picks up the
left-over job from ip6_sk_update_pmtu() if the sk was owned
by user during the pmtu update.
It takes a rcu_read_lock to protect the __sk_dst_get() operations
because another thread may do ip6_dst_store() without taking the
sk lock (e.g. sendmsg).
Fixes: 45e4fd2668 ("ipv6: Only create RTF_CACHE routes after encountering pmtu exception")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reported-by: Wei Wang <weiwan@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a case in connected UDP socket such that
getsockopt(IPV6_MTU) will return a stale MTU value. The reproducible
sequence could be the following:
1. Create a connected UDP socket
2. Send some datagrams out
3. Receive a ICMPV6_PKT_TOOBIG
4. No new outgoing datagrams to trigger the sk_dst_check()
logic to update the sk->sk_dst_cache.
5. getsockopt(IPV6_MTU) returns the mtu from the invalid
sk->sk_dst_cache instead of the newly created RTF_CACHE clone.
This patch updates the sk->sk_dst_cache for a connected datagram sk
during pmtu-update code path.
Note that the sk->sk_v6_daddr is used to do the route lookup
instead of skb->data (i.e. iph). It is because a UDP socket can become
connected after sending out some datagrams in un-connected state. or
It can be connected multiple times to different destinations. Hence,
iph may not be related to where sk is currently connected to.
It is done under '!sock_owned_by_user(sk)' condition because
the user may make another ip6_datagram_connect() (i.e changing
the sk->sk_v6_daddr) while dst lookup is happening in the pmtu-update
code path.
For the sock_owned_by_user(sk) == true case, the next patch will
introduce a release_cb() which will update the sk->sk_dst_cache.
Test:
Server (Connected UDP Socket):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Route Details:
[root@arch-fb-vm1 ~]# ip -6 r show | egrep '2fac'
2fac::/64 dev eth0 proto kernel metric 256 pref medium
2fac:face::/64 via 2fac::face dev eth0 metric 1024 pref medium
A simple python code to create a connected UDP socket:
import socket
import errno
HOST = '2fac::1'
PORT = 8080
s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
s.bind((HOST, PORT))
s.connect(('2fac:face::face', 53))
print("connected")
while True:
try:
data = s.recv(1024)
except socket.error as se:
if se.errno == errno.EMSGSIZE:
pmtu = s.getsockopt(41, 24)
print("PMTU:%d" % pmtu)
break
s.close()
Python program output after getting a ICMPV6_PKT_TOOBIG:
[root@arch-fb-vm1 ~]# python2 ~/devshare/kernel/tasks/fib6/udp-connect-53-8080.py
connected
PMTU:1300
Cache routes after recieving TOOBIG:
[root@arch-fb-vm1 ~]# ip -6 r show table cache
2fac:face::face via 2fac::face dev eth0 metric 0
cache expires 463sec mtu 1300 pref medium
Client (Send the ICMPV6_PKT_TOOBIG):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
scapy is used to generate the TOOBIG message. Here is the scapy script I have
used:
>>> p=Ether(src='da:75:4d:36:ac:32', dst='52:54:00:12:34:66', type=0x86dd)/IPv6(src='2fac::face', dst='2fac::1')/ICMPv6PacketTooBig(mtu=1300)/IPv6(src='2fac::
1',dst='2fac:face::face', nh='UDP')/UDP(sport=8080,dport=53)
>>> sendp(p, iface='qemubr0')
Fixes: 45e4fd2668 ("ipv6: Only create RTF_CACHE routes after encountering pmtu exception")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reported-by: Wei Wang <weiwan@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch moves the route lookup and update codes for connected
datagram sk to a newly created function ip6_datagram_dst_update()
It will be reused during the pmtu update in the later patch.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move flowi6 init codes for connected datagram sk to a newly created
function ip6_datagram_flow_key_init().
Notes:
1. fl6_flowlabel is used instead of fl6.flowlabel in __ip6_datagram_connect
2. ipv6_addr_is_multicast(&fl6->daddr) is used instead of
(addr_type & IPV6_ADDR_MULTICAST) in ip6_datagram_flow_key_init()
This new function will be reused during pmtu update in the later patch.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for something I am referring to as GSO partial.
The basic idea is that we can support a broader range of devices for
segmentation if we use fixed outer headers and have the hardware only
really deal with segmenting the inner header. The idea behind the naming
is due to the fact that everything before csum_start will be fixed headers,
and everything after will be the region that is handled by hardware.
With the current implementation it allows us to add support for the
following GSO types with an inner TSO_MANGLEID or TSO6 offload:
NETIF_F_GSO_GRE
NETIF_F_GSO_GRE_CSUM
NETIF_F_GSO_IPIP
NETIF_F_GSO_SIT
NETIF_F_UDP_TUNNEL
NETIF_F_UDP_TUNNEL_CSUM
In the case of hardware that already supports tunneling we may be able to
extend this further to support TSO_TCPV4 without TSO_MANGLEID if the
hardware can support updating inner IPv4 headers.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch does two things.
First it allows TCP to aggregate TCP frames with a fixed IPv4 ID field. As
a result we should now be able to aggregate flows that were converted from
IPv6 to IPv4. In addition this allows us more flexibility for future
implementations of segmentation as we may be able to use a fixed IP ID when
segmenting the flow.
The second thing this does is that it places limitations on the outer IPv4
ID header in the case of tunneled frames. Specifically it forces the IP ID
to be incrementing by 1 unless the DF bit is set in the outer IPv4 header.
This way we can avoid creating overlapping series of IP IDs that could
possibly be fragmented if the frame goes through GRO and is then
resegmented via GSO.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for TSO using IPv4 headers with a fixed IP ID
field. This is meant to allow us to do a lossless GRO in the case of TCP
flows that use a fixed IP ID such as those that convert IPv6 header to IPv4
headers.
In addition I am adding a feature that for now I am referring to TSO with
IP ID mangling. Basically when this flag is enabled the device has the
option to either output the flow with incrementing IP IDs or with a fixed
IP ID regardless of what the original IP ID ordering was. This is useful
in cases where the DF bit is set and we do not care if the original IP ID
value is maintained.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The strings were missing for several of the GSO offloads that are
available. This patch provides the missing strings so that we can toggle
or query any of them via the ethtool command.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
User needs to monitor shared buffer occupancy. For that, he issues a
snapshot command in order to instruct hardware to catch current and
maximal occupancy values, and clear command in order to clear the
historical maximal values.
Also port-pool and tc-pool-bind command response messages are extended to
carry occupancy values.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Define userspace API and drivers API for configuration of shared
buffers. Four basic objects are defined:
shared buffer - attributes are size, number of pools and TCs
pool - chunk of sharedbuffer definition, it has some size and either
static or dynamic threshold
port pool threshold - to set per-port threshold for each pool
port tc threshold bind - to bind port and TC to specified pool
with threshold.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
netlink notifier family check to avoid the socket close DoS problem.
-----BEGIN PGP SIGNATURE-----
iQIbBAABCgAGBQJXD0YaAAoJEGt7eEactAAdJqUP9R0Ljud2jnH2JeNRv1qdEAXw
/PKERFsiZjAnU7kXXdOlm5mqYNJILNGbtPaKF5oi6mswzojwgaIQ+rjsHidX0SVs
1F1+VFxt0//pyfp6udEJ3gyx6wxGlncTdWUtxmzikgmMacn20ejT/wJFzF6y8MHO
YSyI8NGNxdTn4eBPLL1QdipsDFYpkS7Z2P8ZUquJmR+uMccjIJ4m4kB2NM8AGQgf
Z6KVjVP5dOvelWoEjSKaw2MyAAN/jRTxc9PFtEARoKMMtGMHBkn9qzin9wDjLA6k
h3M34Aj5i4R8AIFWJmrCw06rDhPT6j02l4g2IXpmZQqvbP7zewyXyRUkmxMiFvQd
n6zpNnQf3hDSNC8f7sBjNM5syi23j9ineJgtXdYtHqyqO0xY2qAVRYm7cSEXxFVk
WaxrWDO/wtGyKm1lTQYudAAsFARIgomra4QJd5FqFdIwX2nuZekuikOUsytUqnEV
ta+cwLJGLmof5jicrkoYD83974awJHcngLYMYLbkZC0kwWEZEiXX58D3o+FoF9ns
4D+qkLPRDAVIJI/W9j5yuIA6/Aj7Cy9kuxtKf5CoqxL35k47lu69aHEXEceB6fDD
M24G8nwEFcMimEWeYz4AARqaEJ08PWcA3/5dgf+N0XDCvQT95OutsZbqAdR2kqzF
ApCKvGmbx/sNeyZtjbs=
=FvAT
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2016-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
This has just the single fix from Dmitry Ivanov, adding the missing
netlink notifier family check to avoid the socket close DoS problem.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
A failure in validate_xmit_skb_list() triggered an unconditional call
to dev_requeue_skb with skb=NULL. This slowly grows the queue
discipline's qlen count until all traffic through the queue stops.
We take the optimistic approach and continue running the queue after a
failure since it is unknown if later packets also will fail in the
validate path.
Fixes: 55a93b3ea7 ("qdisc: validate skb without holding lock")
Signed-off-by: Lars Persson <larper@axis.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Because we miss to wipe the remainder of i->addr[] in packet_mc_add(),
pdiag_put_mclist() leaks uninitialized heap bytes via the
PACKET_DIAG_MCLIST netlink attribute.
Fix this by explicitly memset(0)ing the remaining bytes in i->addr[].
Fixes: eea68e2f1a ("packet: Report socket mclist info via diag module")
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After introduction of ndo_features_check(), we believe that very
specific checks for rare features should not be done in core
networking stack.
No driver uses gso_min_segs yet, so we revert this feature and save
few instructions per tx packet in fast path.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The meta_type_ops structures are never modified, so declare them as const.
Done with the help of Coccinelle.
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
For local routes that require a particular output interface we do not want
to cache the result. Caching the result causes incorrect behaviour when
there are multiple source addresses on the interface. The end result
being that if the intended recipient is waiting on that interface for the
packet he won't receive it because it will be delivered on the loopback
interface and the IP_PKTINFO ipi_ifindex will be set to the loopback
interface as well.
This can be tested by running a program such as "dhcp_release" which
attempts to inject a packet on a particular interface so that it is
received by another program on the same board. The receiving process
should see an IP_PKTINFO ipi_ifndex value of the source interface
(e.g., eth1) instead of the loopback interface (e.g., lo). The packet
will still appear on the loopback interface in tcpdump but the important
aspect is that the CMSG info is correct.
Sample dhcp_release command line:
dhcp_release eth1 192.168.204.222 02:11:33:22:44:66
Signed-off-by: Allain Legacy <allain.legacy@windriver.com>
Signed off-by: Chris Friesen <chris.friesen@windriver.com>
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently processing of multiple chunks in a single SCTP packet leads to
multiple calls to sk_data_ready, causing multiple wake up signals which
are costy and doesn't make it wake up any faster.
With this patch it will note that the wake up is pending and will do it
before leaving the state machine interpreter, latest place possible to
do it realiably and cleanly.
Note that sk_data_ready events are not dependent on asocs, unlike waking
up writers.
v2: series re-checked
v3: use local vars to cleanup the code, suggested by Jakub Sitnicki
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
f1705ec197 added the option to retain user configured addresses on an
admin down. A comment to one of the later revisions suggested using the
IFA_F_PERMANENT flag rather than adding a user_managed boolean to the
ifaddr struct. A side effect of this change is that link local and
loopback addresses are also retained which is not part of the objective
of f1705ec197. Add check to drop those addresses.
Fixes: f1705ec197 ("net: ipv6: Make address flushing on ifdown optional")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now when we change the attributes of bridge or br_port by netlink,
a relevant netlink notification will be sent, but if we change them
by ioctl or sysfs, no notification will be sent.
We should ensure that whenever those attributes change internally or from
sysfs/ioctl, that a netlink notification is sent out to listeners.
Also, NetworkManager will use this in the future to listen for out-of-band
bridge master attribute updates and incorporate them into the runtime
configuration.
This patch is used for ioctl.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now when we change the attributes of bridge or br_port by netlink,
a relevant netlink notification will be sent, but if we change them
by ioctl or sysfs, no notification will be sent.
We should ensure that whenever those attributes change internally or from
sysfs/ioctl, that a netlink notification is sent out to listeners.
Also, NetworkManager will use this in the future to listen for out-of-band
bridge master attribute updates and incorporate them into the runtime
configuration.
This patch is used for br_sysfs_if, and we also move br_ifinfo_notify out
of store_flag.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now when we change the attributes of bridge or br_port by netlink,
a relevant netlink notification will be sent, but if we change them
by ioctl or sysfs, no notification will be sent.
We should ensure that whenever those attributes change internally or from
sysfs/ioctl, that a netlink notification is sent out to listeners.
Also, NetworkManager will use this in the future to listen for out-of-band
bridge master attribute updates and incorporate them into the runtime
configuration.
This patch is used for br_sysfs_br. and we also need to remove some
rtnl_trylock in old functions so that we can call it in a common one.
For group_addr_store, we cannot make it use store_bridge_parm, because
it's not a string-to-long convert, we will add notification on it
individually.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are some repetitive codes in stp_state_store, we can remove
them by calling store_bridge_parm.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are some repetitive codes in forward_delay_store, we can remove
them by calling store_bridge_parm.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are some repetitive codes in flush_store, we can remove
them by calling store_bridge_parm, also, it would send rtnl notification
after we add it in store_bridge_parm in the following patches.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The original tokenized iid support implemented via f53adae4ea ("net: ipv6:
add tokenized interface identifier support") didn't allow for clearing a
device token as it was intended that this addressing mode was the only one
active for globally scoped IPv6 addresses. Later we relaxed that restriction
via 617fe29d45 ("net: ipv6: only invalidate previously tokenized addresses"),
and we should also allow for clearing tokens as there's no good reason why
it shouldn't be allowed.
Fixes: 617fe29d45 ("net: ipv6: only invalidate previously tokenized addresses")
Reported-by: Robin H. Johnson <robbat2@gentoo.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
sock_owned_by_user should not be used without socket lock held. It seems
to be a common practice to check .owned before lock reclassification, so
provide a little help to abstract this check away.
Cc: linux-cifs@vger.kernel.org
Cc: linux-bluetooth@vger.kernel.org
Cc: linux-nfs@vger.kernel.org
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
On udp sockets, recv cmsg IP_CMSG_CHECKSUM returns a checksum over
the packet payload. Since commit e6afc8ace6 pulled the headers,
taking skb->data as the start of transport header is incorrect. Use
the transport header pointer.
Also, when peeking at an offset from the start of the packet, only
return a checksum from the start of the peeked data. Note that the
cmsg does not subtract a tail checkum when reading truncated data.
Fixes: e6afc8ace6 ("udp: remove headers from UDP packets before queueing")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On udp sockets, ioctl SIOCINQ returns the payload size of the first
packet. Since commit e6afc8ace6 pulled the headers, the result is
incorrect when subtracting header length. Remove that operation.
Fixes: e6afc8ace6 ("udp: remove headers from UDP packets before queueing")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for your net tree. More
specifically, they are:
1) Fix missing filter table per-netns registration in arptables, from
Florian Westphal.
2) Resolve out of bound access when parsing TCP options in
nf_conntrack_tcp, patch from Jozsef Kadlecsik.
3) Prefer NFPROTO_BRIDGE extensions over NFPROTO_UNSPEC in ebtables,
this resolves conflict between xt_limit and ebt_limit, from Phil Sutter.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Many of these functions are called from control plane path. Move
ctnetlink_nlmsg_size() under CONFIG_NF_CONNTRACK_EVENTS to avoid a
compilation warning when CONFIG_NF_CONNTRACK_EVENTS=n.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The three variants use same copy&pasted code, condense this into a
helper and use that.
Make sure info.name is 0-terminated.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Since 'netfilter: x_tables: validate targets of jumps' change we
validate that the target aligns exactly with beginning of a rule,
so offset test is now redundant.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
commit 9e67d5a739
("[NETFILTER]: x_tables: remove obsolete overflow check") left the
compat parts alone, but we can kill it there as well.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This looks like refactoring, but its also a bug fix.
Problem is that the compat path (32bit iptables, 64bit kernel) lacks a few
sanity tests that are done in the normal path.
For example, we do not check for underflows and the base chain policies.
While its possible to also add such checks to the compat path, its more
copy&pastry, for instance we cannot reuse check_underflow() helper as
e->target_offset differs in the compat case.
Other problem is that it makes auditing for validation errors harder; two
places need to be checked and kept in sync.
At a high level 32 bit compat works like this:
1- initial pass over blob:
validate match/entry offsets, bounds checking
lookup all matches and targets
do bookkeeping wrt. size delta of 32/64bit structures
assign match/target.u.kernel pointer (points at kernel
implementation, needed to access ->compatsize etc.)
2- allocate memory according to the total bookkeeping size to
contain the translated ruleset
3- second pass over original blob:
for each entry, copy the 32bit representation to the newly allocated
memory. This also does any special match translations (e.g.
adjust 32bit to 64bit longs, etc).
4- check if ruleset is free of loops (chase all jumps)
5-first pass over translated blob:
call the checkentry function of all matches and targets.
The alternative implemented by this patch is to drop steps 3&4 from the
compat process, the translation is changed into an intermediate step
rather than a full 1:1 translate_table replacement.
In the 2nd pass (step #3), change the 64bit ruleset back to a kernel
representation, i.e. put() the kernel pointer and restore ->u.user.name .
This gets us a 64bit ruleset that is in the format generated by a 64bit
iptables userspace -- we can then use translate_table() to get the
'native' sanity checks.
This has two drawbacks:
1. we re-validate all the match and target entry structure sizes even
though compat translation is supposed to never generate bogus offsets.
2. we put and then re-lookup each match and target.
THe upside is that we get all sanity tests and ruleset validations
provided by the normal path and can remove some duplicated compat code.
iptables-restore time of autogenerated ruleset with 300k chains of form
-A CHAIN0001 -m limit --limit 1/s -j CHAIN0002
-A CHAIN0002 -m limit --limit 1/s -j CHAIN0003
shows no noticeable differences in restore times:
old: 0m30.796s
new: 0m31.521s
64bit: 0m25.674s
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Validate that all matches (if any) add up to the beginning of
the target and that each match covers at least the base structure size.
The compat path should be able to safely re-use the function
as the structures only differ in alignment; added a
BUILD_BUG_ON just in case we have an arch that adds padding as well.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We're currently asserting that targetoff + targetsize <= nextoff.
Extend it to also check that targetoff is >= sizeof(xt_entry).
Since this is generic code, add an argument pointing to the start of the
match/target, we can then derive the base structure size from the delta.
We also need the e->elems pointer in a followup change to validate matches.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We have targets and standard targets -- the latter carries a verdict.
The ip/ip6tables validation functions will access t->verdict for the
standard targets to fetch the jump offset or verdict for chainloop
detection, but this happens before the targets get checked/validated.
Thus we also need to check for verdict presence here, else t->verdict
can point right after a blob.
Spotted with UBSAN while testing malformed blobs.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
32bit rulesets have different layout and alignment requirements, so once
more integrity checks get added to xt_check_entry_offsets it will reject
well-formed 32bit rulesets.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The target size includes the size of the xt_entry_target struct.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Once we add more sanity testing to xt_check_entry_offsets it
becomes relvant if we're expecting a 32bit 'config_compat' blob
or a normal one.
Since we already have a lot of similar-named functions (check_entry,
compat_check_entry, find_and_check_entry, etc.) and the current
incarnation is short just fold its contents into the callers.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently arp/ip and ip6tables each implement a short helper to check that
the target offset is large enough to hold one xt_entry_target struct and
that t->u.target_size fits within the current rule.
Unfortunately these checks are not sufficient.
To avoid adding new tests to all of ip/ip6/arptables move the current
checks into a helper, then extend this helper in followup patches.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When we see a jump also check that the offset gets us to beginning of
a rule (an ipt_entry).
The extra overhead is negible, even with absurd cases.
300k custom rules, 300k jumps to 'next' user chain:
[ plus one jump from INPUT to first userchain ]:
Before:
real 0m24.874s
user 0m7.532s
sys 0m16.076s
After:
real 0m27.464s
user 0m7.436s
sys 0m18.840s
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Ben Hawkes says:
In the mark_source_chains function (net/ipv4/netfilter/ip_tables.c) it
is possible for a user-supplied ipt_entry structure to have a large
next_offset field. This field is not bounds checked prior to writing a
counter value at the supplied offset.
Base chains enforce absolute verdict.
User defined chains are supposed to end with an unconditional return,
xtables userspace adds them automatically.
But if such return is missing we will move to non-existent next rule.
Reported-by: Ben Hawkes <hawkes@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The phys in phys_port_mask suggests this mask is about PHYs. In fact,
it means physical ports. Rename to enabled_port_mask, indicating
external enabled ports of the switch, which is hopefully less
confusing.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The drivers now allocate their own memory for private usage. Remove
the allocation from the core code.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now the switch devices have a dev pointer, make use of it for allocating
the drivers private data structures using a devm_kzalloc().
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
By passing a device structure to the switch devices, it allows them
to use devm_* methods for resource management.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
all drivers - removing the duplicated enum ieee80211_band and
replacing it by enum nl80211_band. On top of that, just a small
documentation update.
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJXDp6hAAoJEGt7eEactAAd/xUQAJtKNwp9CLsx+QFx6lMoXX4x
r0XA8DFgLp1BflS9P05/g1m0NiQxm3YuRtpze/FdPglb6AAVjLcqksf+vTkU+Lng
p7rIkb/fQv5s5aoYPxNrD5zgwALVv9y5fI7rV7scj355iesCC0PmAP34own2Dihi
eBVSammsh5ZNTQKLBk8vXECb0UKWsDBMgp4uQc35Bpw8XSx5Nrtl5JI/hMcckte0
a/FQyQKjmjl3O/nRLn3kzGPv1OnRiJOMb5fMWB+Xm2cLtmKPHIErgVk2l/CMaiYj
sRJR8KaZQpQsyWiQU59UNpywlejy7Z1RsSWmuPhm0xTGzIF1wVIgHJSsRI/gNGD2
8Ey1P+RXkM8NVxrQr/0fis9XWyWfE8ne4tFsPiPOD3VmBiStIB9fAukJHLrvTmKU
JrkXCePUkfNY/PqJqlP/RONBcysI253/snVF49oZ7LMBZiGDPhdRcEEcCaS0tmMM
Qa+a78XvaH5xaKuMIDZ4qMdnMMcdv4g8G1DQeA1mb0EIGL1Gtu9BJsu9q8PqmjQU
1ZAf4MlWJWdYk+CtTNT4slSIQVKAN78s6j1HSB/bNcpWk9y93wBhJW0FdP7FtJ1I
pjJGIVcLU98FKdqi2jqPEezbDXXzOz0gNQDbqfJyM9/R7ijnJcaPllviaWjEg/O7
8jMBOg87Hn7kq7JJGpKA
=2xfe
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-davem-2016-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
To synchronize with Kalle, here's just a big change that affects
all drivers - removing the duplicated enum ieee80211_band and
replacing it by enum nl80211_band. On top of that, just a small
documentation update.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
We remove a couple of leftover fields in struct tipc_bearer. Those
were used by the old broadcast implementation, and are not needed
any longer. There is no functional changes in this commit.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case of link-layer specific handling for 802.15.4 we need to cast to
802.15.4 sepcific structures. Simple add this header when include the
6lowpan header.
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Reviewed-by: Stefan Schmidt<stefan@osg.samsung.com>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch adds the lowpan_is_ll function, which can be used to make a
special 6lowpan linklayer handling for a specific 6lowpan linklayer
type.
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Reviewed-by: Stefan Schmidt<stefan@osg.samsung.com>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This function will be use in later functionality in other branches than
generic 6lowpan, so we move it to the global 6lowpan header.
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Reviewed-by: Stefan Schmidt<stefan@osg.samsung.com>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch removes unnecessary zero data for a stack variable.
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch adds a lowpan prefix to each functions which doesn't have
such prefix currently.
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch moves the 802.15.4 link layer specific structures to generic
6lowpan. This is necessary for special 802.15.4 6lowpan handling in
6lowpan generic layer.
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch changes the naming for interface private data for lowpan
intefaces. The current private data scheme is:
-------------------------------------------------
| 6LoWPAN Generic | LinkLayer 6LoWPAN |
-------------------------------------------------
the current naming schemes are:
- 6LoWPAN Generic:
- lowpan_priv
- LinkLayer 6LoWPAN:
- BTLE
- lowpan_dev
- 802.15.4:
- lowpan_dev_info
the new naming scheme with this patch will be:
- 6LoWPAN Generic:
- lowpan_dev
- LinkLayer 6LoWPAN:
- BTLE
- lowpan_btle_dev
- 802.15.4:
- lowpan_802154_dev
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Reviewed-by: Stefan Schmidt<stefan@osg.samsung.com>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The short address is unique in combination with the panid. This patch
will add the panid for generating an ieee802154 address hash.
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The generation of autoconfigured IPv6 link-local addresses starts with a
notification on interface up.
To handle autoconfiguration according to RFC 4944 requires pan id and
short address to generate an autoconfigured link-local address. This
patch will avoid changing of these link-layer address configuration
while lowpan interface is up.
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains the first batch of Netfilter updates for
your net-next tree.
1) Define pr_fmt() in nf_conntrack, from Weongyo Jeong.
2) Define and register netfilter's afinfo for the bridge family,
this comes in preparation for native nfqueue's bridge for nft,
from Stephane Bryant.
3) Add new attributes to store layer 2 and VLAN headers to nfqueue,
also from Stephane Bryant.
4) Parse new NFQA_VLAN and NFQA_L2HDR nfqueue netlink attributes
coming from userspace, from Stephane Bryant.
5) Use net->ipv6.devconf_all->hop_limit instead of hardcoded hop_limit
in IPv6 SYNPROXY, from Liping Zhang.
6) Remove unnecessary check for dst == NULL in nf_reject_ipv6,
from Haishuang Yan.
7) Deinline ctnetlink event report functions, from Florian Westphal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
If a requested extension exists as module and is not loaded,
ebt_check_match() might accidentally use an NFPROTO_UNSPEC one with same
name and fail.
Reproduced with limit match: Given xt_limit and ebt_limit both built as
module, the following would fail:
modprobe xt_limit
ebtables -I INPUT --limit 1/s -j ACCEPT
The fix is to make ebt_check_match() distrust a found NFPROTO_UNSPEC
extension and retry after requesting an appropriate module.
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Phil Sutter <phil@nwl.cc>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Not performance critical, it is only invoked when an expectation is
added/destroyed.
While at it, kill unused nf_ct_expect_event() wrapper.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Way too large; move it to nf_conntrack_ecache.c.
Reduces total object size by 1216 byte on my machine.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Johan Hedberg says:
====================
pull request: bluetooth-next 2016-04-12
Here's a set of Bluetooth & 802.15.4 patches intended for the 4.7 kernel:
- Fix for race condition in vhci driver
- Memory leak fix for ieee802154/adf7242 driver
- Improvements to deal with single-mode (LE-only) Bluetooth controllers
- Fix for allowing the BT_SECURITY_FIPS security level
- New BCM2E71 ACPI ID
- NULL pointer dereference fix fox hci_ldisc driver
Let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This enum is already perfectly aliased to enum nl80211_band, and
the only reason for it is that we get IEEE80211_NUM_BANDS out of
it. There's no really good reason to not declare the number of
bands in nl80211 though, so do that and remove the cfg80211 one.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
A non-privileged user can create a netlink socket with the same port_id as
used by an existing open nl80211 netlink socket (e.g. as used by a hostapd
process) with a different protocol number.
Closing this socket will then lead to the notification going to nl80211's
socket release notification handler, and possibly cause an action such as
removing a virtual interface.
Fix this issue by checking that the netlink protocol is NETLINK_GENERIC.
Since generic netlink has no notifier chain of its own, we can't fix the
problem more generically.
Fixes: 026331c4d9 ("cfg80211/mac80211: allow registering for and sending action frames")
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Ivanov <dima@ubnt.com>
[rewrite commit message]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add a facility whereby proposed new links to be added to a keyring can be
vetted, permitting them to be rejected if necessary. This can be used to
block public keys from which the signature cannot be verified or for which
the signature verification fails. It could also be used to provide
blacklisting.
This affects operations like add_key(), KEYCTL_LINK and KEYCTL_INSTANTIATE.
To this end:
(1) A function pointer is added to the key struct that, if set, points to
the vetting function. This is called as:
int (*restrict_link)(struct key *keyring,
const struct key_type *key_type,
unsigned long key_flags,
const union key_payload *key_payload),
where 'keyring' will be the keyring being added to, key_type and
key_payload will describe the key being added and key_flags[*] can be
AND'ed with KEY_FLAG_TRUSTED.
[*] This parameter will be removed in a later patch when
KEY_FLAG_TRUSTED is removed.
The function should return 0 to allow the link to take place or an
error (typically -ENOKEY, -ENOPKG or -EKEYREJECTED) to reject the
link.
The pointer should not be set directly, but rather should be set
through keyring_alloc().
Note that if called during add_key(), preparse is called before this
method, but a key isn't actually allocated until after this function
is called.
(2) KEY_ALLOC_BYPASS_RESTRICTION is added. This can be passed to
key_create_or_update() or key_instantiate_and_link() to bypass the
restriction check.
(3) KEY_FLAG_TRUSTED_ONLY is removed. The entire contents of a keyring
with this restriction emplaced can be considered 'trustworthy' by
virtue of being in the keyring when that keyring is consulted.
(4) key_alloc() and keyring_alloc() take an extra argument that will be
used to set restrict_link in the new key. This ensures that the
pointer is set before the key is published, thus preventing a window
of unrestrictedness. Normally this argument will be NULL.
(5) As a temporary affair, keyring_restrict_trusted_only() is added. It
should be passed to keyring_alloc() as the extra argument instead of
setting KEY_FLAG_TRUSTED_ONLY on a keyring. This will be replaced in
a later patch with functions that look in the appropriate places for
authoritative keys.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
ifupdown2 found a kernel bug with IPv6 routes and movement from the main
table to the VRF table. Sequence of events:
Create the interface and add addresses:
ip link add dev eth4.105 link eth4 type vlan id 105
ip addr add dev eth4.105 8.105.105.10/24
ip -6 addr add dev eth4.105 2008:105:105::10/64
At this point IPv6 has inserted a prefix route in the main table even
though the interface is 'down'. From there the VRF device is created:
ip link add dev vrf105 type vrf table 105
ip addr add dev vrf105 9.9.105.10/32
ip -6 addr add dev vrf105 2000:9:105::10/128
ip link set vrf105 up
Then the interface is enslaved, while still in the 'down' state:
ip link set dev eth4.105 master vrf105
Since the device is down the VRF driver cycling the device does not
send the NETDEV_UP and NETDEV_DOWN but rather the NETDEV_CHANGE event
which does not flush the routes inserted prior.
When the link is brought up
ip link set dev eth4.105 up
the prefix route is added in the VRF table, but does not remove
the route from the main table.
Fix by handling the NETDEV_CHANGEUPPER event similar what was implemented
for IPv4 in 7f49e7a38b ("net: Flush local routes when device changes vrf
association")
Fixes: 35402e3136 ("net: Add IPv6 support to VRF device")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivek reported a kernel exception deleting a VRF with an active
connection through it. The root cause is that the socket has a cached
reference to a dst that is destroyed. Converting the dst_destroy to
dst_release and letting proper reference counting kick in does not
work as the dst has a reference to the device which needs to be released
as well.
I talked to Hannes about this at netdev and he pointed out the ipv4 and
ipv6 dst handling has dst_ifdown for just this scenario. Rather than
continuing with the reinvented dst wheel in VRF just remove it and
leverage the ipv4 and ipv6 versions.
Fixes: 193125dbd8 ("net: Introduce VRF device driver")
Fixes: 35402e3136 ("net: Add IPv6 support to VRF device")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create a null security type for security index 0 and get rid of all
conditional calls to the security operations. We expect normally to be
using security, so this should be of little negative impact.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Absorb the rxkad security module into the af_rxrpc module so that there's
only one module file. This avoids a circular dependency whereby rxkad pins
af_rxrpc and cached connections pin rxkad but can't be manually evicted
(they will expire eventually and cease pinning).
With this change, af_rxrpc can just be unloaded, despite having cached
connections.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Don't assume transport address family and size when using the peer address
to send a packet. Instead, use the start of the transport address rather
than any particular element of the union and use the transport address
length noted inside the sockaddr_rxrpc struct.
This will be necessary when IPv6 support is introduced.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Don't pass gfp around in incoming call handling functions, but rather hard
code it at the points where we actually need it since the value comes from
within the rxrpc driver and is always the same.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the rxrpc_connection and rxrpc_call structs, there's one field to hold
the abort code, no matter whether that value was generated locally to be
sent or was received from the peer via an abort packet.
Split the abort code fields in two for cleanliness sake and add an error
field to hold the Linux error number to the rxrpc_call struct too
(sometimes this is generated in a context where we can't return it to
userspace directly).
Furthermore, add a skb mark to indicate a packet that caused a local abort
to be generated so that recvmsg() can pick up the correct abort code. A
future addition will need to be to indicate to userspace the difference
between aborts via a control message.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Static arrays of strings should be const char *const[].
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move some miscellaneous bits out into their own file to make it easier to
split the call handling.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Disable a debugging statement that has been left enabled
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit e6afc8ace6 modified the udp receive path by pulling the udp
header before queuing an skbuff onto the receive queue.
Rxrpc also calls skb_recv_datagram to dequeue an skb from a udp
socket. Modify this receive path to also no longer expect udp
headers.
Fixes: e6afc8ace6 ("udp: remove headers from UDP packets before queueing")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Tested-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit e6afc8ace6 modified the udp receive path by pulling the udp
header before queuing an skbuff onto the receive queue.
Sunrpc also calls skb_recv_datagram to dequeue an skb from a udp
socket. Modify this receive path to also no longer expect udp
headers.
Fixes: e6afc8ace6 ("udp: remove headers from UDP packets before queueing")
Reported-by: Franklin S Cooper Jr. <fcooper@ti.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Tested-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a peer node becomes unavailable, in addition to removing the
nametable entries from this node we also need to purge all deferred
updates associated with this node.
Signed-off-by: Erik Hugne <erik.hugne@gmail.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nametable updates received from the network that cannot be applied
immediately are placed on a defer queue. This queue is global to the
TIPC module, which might cause problems when using TIPC in containers.
To prevent nametable updates from escaping into the wrong namespace,
we make the queue pernet instead.
Signed-off-by: Erik Hugne <erik.hugne@gmail.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Multipath route lookups should consider knowledge about next hops and not
select a hop that is known to be failed.
Example:
[h2] [h3] 15.0.0.5
| |
3| 3|
[SP1] [SP2]--+
1 2 1 2
| | /-------------+ |
| \ / |
| X |
| / \ |
| / \---------------\ |
1 2 1 2
12.0.0.2 [TOR1] 3-----------------3 [TOR2] 12.0.0.3
4 4
\ /
\ /
\ /
-------| |-----/
1 2
[TOR3]
3|
|
[h1] 12.0.0.1
host h1 with IP 12.0.0.1 has 2 paths to host h3 at 15.0.0.5:
root@h1:~# ip ro ls
...
12.0.0.0/24 dev swp1 proto kernel scope link src 12.0.0.1
15.0.0.0/16
nexthop via 12.0.0.2 dev swp1 weight 1
nexthop via 12.0.0.3 dev swp1 weight 1
...
If the link between tor3 and tor1 is down and the link between tor1
and tor2 then tor1 is effectively cut-off from h1. Yet the route lookups
in h1 are alternating between the 2 routes: ping 15.0.0.5 gets one and
ssh 15.0.0.5 gets the other. Connections that attempt to use the
12.0.0.2 nexthop fail since that neighbor is not reachable:
root@h1:~# ip neigh show
...
12.0.0.3 dev swp1 lladdr 00:02:00:00:00:1b REACHABLE
12.0.0.2 dev swp1 FAILED
...
The failed path can be avoided by considering known neighbor information
when selecting next hops. If the neighbor lookup fails we have no
knowledge about the nexthop, so give it a shot. If there is an entry
then only select the nexthop if the state is sane. This is similar to
what fib_detect_death does.
To maintain backward compatibility use of the neighbor information is
based on a new sysctl, fib_multipath_use_neigh.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
All existing users of NETLINK_URELEASE use it to clean up resources that
were previously allocated to a socket via some command. As a result, no
users require getting this notification for unbound sockets.
Sending it for unbound sockets, however, is a problem because any user
(including unprivileged users) can create a socket that uses the same ID
as an existing socket. Binding this new socket will fail, but if the
NETLINK_URELEASE notification is generated for such sockets, the users
thereof will be tricked into thinking the socket that they allocated the
resources for is closed.
In the nl80211 case, this will cause destruction of virtual interfaces
that still belong to an existing hostapd process; this is the case that
Dmitry noticed. In the NFC case, it will cause a poll abort. In the case
of netlink log/queue it will cause them to stop reporting events, as if
NFULNL_CFG_CMD_UNBIND/NFQNL_CFG_CMD_UNBIND had been called.
Fix this problem by checking that the socket is bound before generating
the NETLINK_URELEASE notification.
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Ivanov <dima@ubnt.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently on high rate SCTP streams the heartbeat timer refresh can
consume quite a lot of resources as timer updates are costly and it
contains a random factor, which a) is also costly and b) invalidates
mod_timer() optimization for not editing a timer to the same value.
It may even cause the timer to be slightly advanced, for no good reason.
As suggested by David Laight this patch now removes this timer update
from hot path by leaving the timer on and re-evaluating upon its
expiration if the heartbeat is still needed or not, similarly to what is
done for TCP. If it's not needed anymore the timer is re-scheduled to
the new timeout, considering the time already elapsed.
For this, we now record the last tx timestamp per transport, updated in
the same spots as hb timer was restarted on tx. Also split up
sctp_transport_reset_timers into sctp_transport_reset_t3_rtx and
sctp_transport_reset_hb_timer, so we can re-arm T3 without re-arming the
heartbeat one.
On loopback with MTU of 65535 and data chunks with 1636, so that we
have a considerable amount of chunks without stressing system calls,
netperf -t SCTP_STREAM -l 30, perf looked like this before:
Samples: 103K of event 'cpu-clock', Event count (approx.): 25833000000
Overhead Command Shared Object Symbol
+ 6,15% netperf [kernel.vmlinux] [k] copy_user_enhanced_fast_string
- 5,43% netperf [kernel.vmlinux] [k] _raw_write_unlock_irqrestore
- _raw_write_unlock_irqrestore
- 96,54% _raw_spin_unlock_irqrestore
- 36,14% mod_timer
+ 97,24% sctp_transport_reset_timers
+ 2,76% sctp_do_sm
+ 33,65% __wake_up_sync_key
+ 28,77% sctp_ulpq_tail_event
+ 1,40% del_timer
- 1,84% mod_timer
+ 99,03% sctp_transport_reset_timers
+ 0,97% sctp_do_sm
+ 1,50% sctp_ulpq_tail_event
And after this patch, now with netperf -l 60:
Samples: 230K of event 'cpu-clock', Event count (approx.): 57707250000
Overhead Command Shared Object Symbol
+ 5,65% netperf [kernel.vmlinux] [k] memcpy_erms
+ 5,59% netperf [kernel.vmlinux] [k] copy_user_enhanced_fast_string
- 5,05% netperf [kernel.vmlinux] [k] _raw_spin_unlock_irqrestore
- _raw_spin_unlock_irqrestore
+ 49,89% __wake_up_sync_key
+ 45,68% sctp_ulpq_tail_event
- 2,85% mod_timer
+ 76,51% sctp_transport_reset_t3_rtx
+ 23,49% sctp_do_sm
+ 1,55% del_timer
+ 2,50% netperf [sctp] [k] sctp_datamsg_from_user
+ 2,26% netperf [sctp] [k] sctp_sendmsg
Throughput-wise, from 6800mbps without the patch to 7050mbps with it,
~3.7%.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A stupid refactoring bug in inet6_lookup_listener() needs to be fixed
in order to get proper SO_REUSEPORT behavior.
Fixes: 3b24d854cb ("tcp/dccp: do not touch listener sk_refcnt under synflood")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Stale SKB data pointer access across pskb_may_pull() calls in L2TP,
from Haishuang Yan.
2) Fix multicast frame handling in mac80211 AP code, from Felix
Fietkau.
3) mac80211 station hashtable insert errors not handled properly, fix
from Johannes Berg.
4) Fix TX descriptor count limit handling in e1000, from Alexander
Duyck.
5) Revert a buggy netdev refcount fix in netpoll, from Bjorn Helgaas.
6) Must assign rtnl_link_ops of the device before registering it, fix
in ip6_tunnel from Thadeu Lima de Souza Cascardo.
7) Memory leak fix in tc action net exit, from WANG Cong.
8) Add missing AF_KCM entries to name tables, from Dexuan Cui.
9) Fix regression in GRE handling of csums wrt. FOU, from Alexander
Duyck.
10) Fix memory allocation alignment and congestion map corruption in
RDS, from Shamir Rabinovitch.
11) Fix default qdisc regression in tuntap driver, from Jason Wang.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (44 commits)
bridge, netem: mark mailing lists as moderated
tuntap: restore default qdisc
mpls: find_outdev: check for err ptr in addition to NULL check
ipv6: Count in extension headers in skb->network_header
RDS: fix congestion map corruption for PAGE_SIZE > 4k
RDS: memory allocated must be align to 8
GRE: Disable segmentation offloads w/ CSUM and we are encapsulated via FOU
net: add the AF_KCM entries to family name tables
MAINTAINERS: intel-wired-lan list is moderated
lib/test_bpf: Add additional BPF_ADD tests
lib/test_bpf: Add test to check for result of 32-bit add that overflows
lib/test_bpf: Add tests for unsigned BPF_JGT
lib/test_bpf: Fix JMP_JSET tests
VSOCK: Detach QP check should filter out non matching QPs.
stmmac: fix adjust link call in case of a switch is attached
af_packet: tone down the Tx-ring unsupported spew.
net_sched: fix a memory leak in tc action
samples/bpf: Enable powerpc support
samples/bpf: Use llc in PATH, rather than a hardcoded value
samples/bpf: Fix build breakage with map_perf_test_user.c
...
The switchdev design implies that a software error should not happen in
the commit phase since it must have been previously reported in the
prepare phase. If an hardware error occurs during the commit phase,
there is nothing switchdev can do about it.
The DSA layer separates port_vlan_prepare and port_vlan_add for
simplicity and convenience. If an hardware error occurs during the
commit phase, there is no need to report it outside the driver itself.
Make the DSA port_vlan_add routine return void for explicitness.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The switchdev design implies that a software error should not happen in
the commit phase since it must have been previously reported in the
prepare phase. If an hardware error occurs during the commit phase,
there is nothing switchdev can do about it.
The DSA layer separates port_fdb_prepare and port_fdb_add for simplicity
and convenience. If an hardware error occurs during the commit phase,
there is no need to report it outside the DSA driver itself.
Make the DSA port_fdb_add routine return void for explicitness.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The DSA layer doesn't care about the return code of the port_stp_update
routine, so make it void in the layer and the DSA drivers.
Replace the useless dsa_slave_stp_update function with a
dsa_slave_stp_state function used to reply to the switchdev
SWITCHDEV_ATTR_ID_PORT_STP_STATE attribute.
In the meantime, rename port_stp_update to port_stp_state_set to
explicit the state change.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* Bob's mesh mode rhashtable conversion, this includes
the rhashtable API change for allocation flags
* BSSID scan, connect() command reassoc support (Jouni)
* fast (optimised data only) and support for RSS in mac80211 (myself)
* various smaller changes
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJXBQ4GAAoJEGt7eEactAAdWiMP/ibaP3I79NDc0s7wCDA+KRkm
hx0Qx4a0wwm7lDFlnGBjY6yKr+XFDliCvdGX7XGpLSsTioNg7eXPpwx5FQoj6RiV
8+5RKE9fTguN9ofUzqAwHd9sVOaxvdlXbKfb/N93Gzjpw/meYk58wXdF7Almkroa
ukgJeMzIlIh+6D96zFEA+Ofzp5chwh+x2Dn0wXutEe9P9fOERA859veAvx65b+Ql
IRGTqyuY5B/wcbkr4o+DWQwgrdt7Vop9nYVPNWtMHm2JTzfuCSaQ2cD9TnVAK/bg
/vtqC46KKNLyBRGexAPqdftY9PWcfipgE+n7k+Et4iGSmNm7Z3dEyewgXmqli7XJ
X8Uiaq+N6Fpe06DVSU7aSRt8NLV64A44jXSfKRI9U2POUqKMn/PMdm8bhPW8qCdM
ra6myWpQGHWK9e0TQQdShq0NQKGxCZAiSRiiIrbbvXl1CwXxkPCG39wAC3Sh1tEN
ou4lGraeywGnTjaq+mwLEtHLoug8Y2x+Fz+Ze4Cu2enXxna9lp4lr+rFlc+2+0Er
o9oPxkTk8krZGIj9M6PNc5W+InMwchaFX3076n67hnFHzFRlOQzkfffbPYlhKJDQ
f8c9JiNZIoX/fD1TAKsrdO1+EKm/xo7w7pLgbMwQal8Jr88SkITDg0i3oXc56vNQ
ZK2gUzwvrD/jh0AUyDfN
=sj7y
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-davem-2016-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
For the 4.7 cycle, we have a number of changes:
* Bob's mesh mode rhashtable conversion, this includes
the rhashtable API change for allocation flags
* BSSID scan, connect() command reassoc support (Jouni)
* fast (optimised data only) and support for RSS in mac80211 (myself)
* various smaller changes
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
* TDLS fixes from Arik and Ilan
* rhashtable fixes from Ben and myself
* documentation fixes from Luis
* U-APSD fixes from Emmanuel
* a TXQ fix from Felix
* and a compiler warning suppression from Jeff
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJXBQyAAAoJEGt7eEactAAdymsP/i2zU+VQFpkB8RG+nn/AYogY
x2RXgKjCWOs6FwUcu+VHMx0whKuMADjqbMABdlsGGK62Xa5aYYcObvY+CgCUAI+m
unV7kYDIBUHudiTpxXgZYUvylhIvW37VYjc6BDoaq4Jc1rz/L69zSrNHmoNiQv+Y
113T0Ft5EEmEO1LP4s2GLMZTPqwgi2FnaP6UYFdTr2/ZfaRHlj2xRRG62WiT/q1x
DMT2KWHvETCftpK3GwwkMSr0Au8CVV1soiQOoioTPRaevYbBFVi60GVXQeDlvFEV
PqVCOEfsSvsw84phfHrW1bOxBeVsNYHbY/T4eVlC0zssUzz6KNH5jAfpyla1p0lP
WniSqAaWxMcUYWCEBiOLa5LV2XVpXOuTpI84xcgc/BmprgzNyLgDAiCDtehpxALf
Qmhc/rPR5BbLhTNY8z5qANG6mhQCHCo+52ypvBLMhZoajkPjgyBabwoqaRGje2ub
vgzbAfqEguJmCAszw04KZ2UHInBcCDAZ5aOKiinawWvpftkDN0IJmO/HW5vnh4xV
kawpo1eh1JzDcEyYfSjySHHFmR/5qnaDzaPL2cJthcOY/fGywibJHCQmoDPyH5jz
Bkk0F0rMEcdQWs9pJLIMMzkA7BAlxYLYip0J9QImHL77sWK0QwUDCoryrgD6lL1D
v2V31g1TZwPF2Noe9Rk7
=r3Hr
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2016-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
For the current RC series, we have the following fixes:
* TDLS fixes from Arik and Ilan
* rhashtable fixes from Ben and myself
* documentation fixes from Luis
* U-APSD fixes from Emmanuel
* a TXQ fix from Felix
* and a compiler warning suppression from Jeff
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Ptr to devlink structure can be easily obtained from
devlink_port->devlink. So share user_ptr[0] pointer for both and leave
user_ptr[1] free for other users.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As we rely on caller zeroing or correctly set the struct before the call,
this implicit type set is either no-op (DEVLINK_PORT_TYPE_NOTSET is 0)
or it rewrites wanted value. So remove this.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes handling in case of link-local address compression. A
IPv6 link-local address is defined as fe80::/10 prefix which is also
what ipv6_addr_type checks for link-local addresses.
But IPHC compression for link-local addresses are for fe80::/64 types
only. This patch adds additional checks for zero padded bits in case of
link-local address compression to match on a fe80::/64 address only.
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
In case of buggy controllers send advertising packet types that we
don't know of we should simply ignore them instead of trying to react
to them in some (potentially wrong) way.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
If we're dealing with a single-mode controller or BR/EDR is disable
for a dual-mode one, the NO_BREDR flag needs to be unconditionally
present in the advertising data. This patch moves it out from behind
an extra condition to be always set in the create_instance_adv_data()
function if BR/EDR is disabled.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
When sending a UDPv6 message longer than MTU, account for the length
of fragmentable IPv6 extension headers in skb->network_header offset.
Same as we do in alloc_new_skb path in __ip6_append_data().
This ensures that later on __ip6_make_skb() will make space in
headroom for fragmentable extension headers:
/* move skb->data to ip header from ext header */
if (skb->data < skb_network_header(skb))
__skb_pull(skb, skb_network_offset(skb));
Prevents a splat due to skb_under_panic:
skbuff: skb_under_panic: text:ffffffff8143397b len:2126 put:14 \
head:ffff880005bacf50 data:ffff880005bacf4a tail:0x48 end:0xc0 dev:lo
------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:104!
invalid opcode: 0000 [#1] KASAN
CPU: 0 PID: 160 Comm: reproducer Not tainted 4.6.0-rc2 #65
[...]
Call Trace:
[<ffffffff813eb7b9>] skb_push+0x79/0x80
[<ffffffff8143397b>] eth_header+0x2b/0x100
[<ffffffff8141e0d0>] neigh_resolve_output+0x210/0x310
[<ffffffff814eab77>] ip6_finish_output2+0x4a7/0x7c0
[<ffffffff814efe3a>] ip6_output+0x16a/0x280
[<ffffffff815440c1>] ip6_local_out+0xb1/0xf0
[<ffffffff814f1115>] ip6_send_skb+0x45/0xd0
[<ffffffff81518836>] udp_v6_send_skb+0x246/0x5d0
[<ffffffff8151985e>] udpv6_sendmsg+0xa6e/0x1090
[...]
Reported-by: Ji Jianwen <jiji@redhat.com>
Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Resetting a bearer/interface, with the consequence of resetting all its
pertaining links, is not an atomic action. This becomes particularly
evident in very large clusters, where a lot of traffic may happen on the
remaining links while we are busy shutting them down. In extreme cases,
we may even see links being re-created and re-established before we are
finished with the job.
To solve this, we now introduce a solution where we temporarily detach
the bearer from the interface when the bearer is reset. This inhibits
all packet reception, while sending still is possible. For the latter,
we use the fact that the device's user pointer now is zero to filter out
which packets can be sent during this situation; i.e., outgoing RESET
messages only. This filtering serves to speed up the neighbors'
detection of the loss event, and saves us from unnecessary probing.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When enabling a bearer we create a 'neigbor discoverer' instance by
calling the function tipc_disc_create() before the bearer is actually
registered in the list of enabled bearers. Because of this, the very
first discovery broadcast message, created by the mentioned function,
is lost, since it cannot find any valid bearer to use. Furthermore,
the used send function, tipc_bearer_xmit_skb() does not free the given
buffer when it cannot find a bearer, resulting in the leak of exactly
one send buffer each time a bearer is enabled.
This commit fixes this problem by introducing two changes:
1) Instead of attemting to send the discovery message directly, we let
tipc_disc_create() return the discovery buffer to the calling
function, tipc_enable_bearer(), so that the latter can send it
when the enabling sequence is finished.
2) In tipc_bearer_xmit_skb(), as well as in the two other transmit
functions at the bearer layer, we now free the indicated buffer or
buffer chain when a valid bearer cannot be found.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When PAGE_SIZE > 4k single page can contain 2 RDS fragments. If
'rds_ib_cong_recv' ignore the RDS fragment offset in to the page it
then read the data fragment as far congestion map update and lead to
corruption of the RDS connection far congestion map.
Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix issue in 'rds_ib_cong_recv' when accessing unaligned memory
allocated by 'rds_page_remainder_alloc' using uint64_t pointer.
Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes an issue I found in which we were dropping frames if we
had enabled checksums on GRE headers that were encapsulated by either FOU
or GUE. Without this patch I was barely able to get 1 Gb/s of throughput.
With this patch applied I am now at least getting around 6 Gb/s.
The issue is due to the fact that with FOU or GUE applied we do not provide
a transport offset pointing to the GRE header, nor do we offload it in
software as the GRE header is completely skipped by GSO and treated like a
VXLAN or GENEVE type header. As such we need to prevent the stack from
generating it and also prevent GRE from generating it via any interface we
create.
Fixes: c3483384ee ("gro: Allow tunnel stacking in the case of FOU/GUE")
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that the UDP encapsulation GRO functions have been moved to the UDP
socket we not longer need the udp_offload insfrastructure so removing it.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adapt gue_gro_receive, gue_gro_complete to take a socket argument.
Don't set udp_offloads any more.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add gro_receive and gro_complete to struct udp_tunnel_sock_cfg.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds GRO functions (gro_receive and gro_complete) to UDP
sockets. udp_gro_receive is changed to perform socket lookup on a
packet. If a socket is found the related GRO functions are called.
This features obsoletes using UDP offload infrastructure for GRO
(udp_offload). This has the advantage of not being limited to provide
offload on a per port basis, GRO is now applied to whatever individual
UDP sockets are bound to. This also allows the possbility of
"application defined GRO"-- that is we can attach something like
a BPF program to a UDP socket to perfrom GRO on an application
layer protocol.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add externally visible functions to lookup a UDP socket by skb. This
will be used for GRO in UDP sockets. These functions also check
if skb->dst is set, and if it is not skb->dev is used to get dev_net.
This allows calling lookup functions before dst has been set on the
skbuff.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 5a5abb1fa3 ("tun, bpf: fix suspicious RCU usage
in tun_{attach, detach}_filter") and replaces it to use lock_sock around
sk_{attach,detach}_filter. The checks inside filter.c are updated with
lockdep_sock_is_held to check for proper socket locks.
It keeps the code cleaner by ensuring that only one lock governs the
socket filter instead of two independent locks.
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The socket is either locked if we hold the slock spin_lock for
lock_sock_fast and unlock_sock_fast or we own the lock (sk_lock.owned
!= 0). Check for this and at the same time improve that the current
thread/cpu is really holding the lock.
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
During release_sock we use callbacks to finish the processing
of outstanding skbs on the socket. We actually are still locked,
sk_locked.owned == 1, but we already told lockdep that the mutex
is released. This could lead to false positives in lockdep for
lockdep_sock_is_held (we don't hold the slock spinlock during processing
the outstanding skbs).
I took over this patch from Eric Dumazet and tested it.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip6_route_output() never returns NULL, so it is not appropriate to
check if the return value is NULL.
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Baozeng Ding reported a KASAN stack out of bounds issue - it uncovered that
the TCP option parsing routines in netfilter TCP connection tracking could
read one byte out of the buffer of the TCP options. Therefore in the patch
we check that the available data length is large enough to parse both TCP
option code and size.
Reported-by: Baozeng Ding <sploving1@gmail.com>
Tested-by: Baozeng Ding <sploving1@gmail.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
David Ahern reported panics in __inet_hash() caused by my recent commit.
The reason is inet_reuseport_add_sock() was still using
sk_nulls_for_each_rcu() instead of sk_for_each_rcu().
SO_REUSEPORT enabled listeners were causing an instant crash.
While chasing this bug, I found that I forgot to clear SOCK_RCU_FREE
flag, as it is inherited from the parent at clone time.
Fixes: 3b24d854cb ("tcp/dccp: do not touch listener sk_refcnt under synflood")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: David Ahern <dsa@cumulusnetworks.com>
Tested-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
arptables is broken since we didn't register the table anymore --
even 'arptables -L' fails.
Fixes: b9e69e1273 ("netfilter: xtables: don't hook tables by default")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This is for the recent kcm driver, which introduces AF_KCM(41) in
b7ac4eb(kcm: Kernel Connection Multiplexor module).
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Cc: Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow calling of iptunnel_pull_header without special casing ETH_P_TEB inner
protocol.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The check in vmci_transport_peer_detach_cb should only allow a
detach when the qp handle of the transport matches the one in
the detach message.
Testing: Before this change, a detach from a peer on a different
socket would cause an active stream socket to register a detach.
Reviewed-by: George Zhang <georgezhang@vmware.com>
Signed-off-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Trinity and other fuzzers can hit this WARN on far too easily,
resulting in a tainted kernel that hinders automated fuzzing.
Replace it with a rate-limited printk.
Signed-off-by: Dave Jones <davej@codemonkey.org.uk>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes:
net/mac80211/mesh_hwmp.c:603:26: warning: ‘target_metric’ may be used uninitialized in this function
target_metric is only consumed when reply = true so no bug exists here,
but not all versions of gcc realize it. Initialize to 0 to remove the
warning.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If the user space issues a NL80211_CMD_CONNECT with
NL80211_ATTR_PREV_BSSID when there is already a connection, allow this
to proceed as a reassociation instead of rejecting the new connect
command with EALREADY.
Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
[validate prev_bssid]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This extends NL80211_CMD_CONNECT to allow the NL80211_ATTR_PREV_BSSID
attribute to be used similarly to way this was already allowed with
NL80211_CMD_ASSOCIATE. This allows user space to request reassociation
(instead of association) when already connected to an AP. This provides
an option to reassociate within an ESS without having to disconnect and
associate with the AP.
Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Prevents excessive A-MSDU aggregation at low data rates or bad
conditions.
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Requires software tx queueing and fast-xmit support. For good
performance, drivers need frag_list support as well. This avoids the
need for copying data of aggregated frames. Running without it is only
supported for debugging purposes.
To avoid performance and packet size issues, the rate control module or
driver needs to limit the maximum A-MSDU size by setting
max_rc_amsdu_len in struct ieee80211_sta.
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
[fix locking issue]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If the driver advertises the new HW flag USE_RSS, make the
station statistics on the fast-rx path per-CPU. This will
enable calling the RX in parallel, only hitting locking or
shared cachelines when the fast-RX path isn't available.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The regular RX path has a lot of code, but with a few
assumptions on the hardware it's possible to reduce the
amount of code significantly. Currently the assumptions
on the driver are the following:
* hardware/driver reordering buffer (if supporting aggregation)
* hardware/driver decryption & PN checking (if using encryption)
* hardware/driver did de-duplication
* hardware/driver did A-MSDU deaggregation
* AP_LINK_PS is used (in AP mode)
* no client powersave handling in mac80211 (in client mode)
of which some are actually checked per packet:
* de-duplication
* PN checking
* decryption
and additionally packets must
* not be A-MSDU (have been deaggregated by driver/device)
* be data packets
* not be fragmented
* be unicast
* have RFC 1042 header
Additionally dynamically we assume:
* no encryption or CCMP/GCMP, TKIP/WEP/other not allowed
* station must be authorized
* 4-addr format not enabled
Some data needed for the RX path is cached in a new per-station
"fast_rx" structure, so that we only need to look at this and
the packet, no other memory when processing packets on the fast
RX path.
After doing the above per-packet checks, the data path collapses
down to a pretty simple conversion function taking advantage of
the data cached in the small fast_rx struct.
This should speed up the RX processing, and will make it easier
to reason about parallelizing RX (for which statistics will need
to be per-CPU still.)
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
On 32-bit platforms, the 64-bit counters we keep need to be protected
to be consistently read. Use the u64_stats_sync mechanism to do that.
In order to not end up with overly long lines, refactor the tidstats
assignments a bit.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When storing the last_rate_* values in the RX code, there's nothing
to guarantee consistency, so a concurrent reader could see, e.g.
last_rate_idx on the new value, but last_rate_flag still on the old,
getting completely bogus values in the end.
To fix this, I lifted the sta_stats_encode_rate() function from my
old rate statistics code, which encodes the entire rate data into a
single 16-bit value, avoiding the consistency issue.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Instead of touching the rx_stats.last_rx from the status path, introduce
and use a status_stats.last_ack variable. This will make rx_stats.last_rx
indicate when the last frame was received, making it available for real
"last_rx" and statistics gathering; statistics, when done per-CPU, will
need to figure out which place was updated last for those items where the
"last" value is exposed.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
There's no need to update rx_stats.last_rx after allocating
a station since it's already updated during allocation.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Move the averaged values out of rx_stats and into rx_stats_avg,
to cleanly split them out. The averaged ones cannot be supported
for parallel RX in a per-CPU fashion, while the other values can
be collected per CPU and then combined/selected when needed.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Move the semicolon, people typically assume that and
once line already put a semicolon behind the "call".
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
For the RX MSDU statistics, we need to count the number of
MSDUs created and accepted from an A-MSDU. Right now, all
frames in any A-MSDUs were completely ignored. Fix this by
moving the RX MSDU statistics accounting into the deliver
function.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Sometimes drivers already looked up, or know out-of-band
from their device, which station transmitted a given RX
frame. Allow them to pass the station pointer to mac80211
to save the extra lookup.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When signaling that a GRO frame is ready to be processed, the network stack
correctly checks length and aborts processing when a frame is less than 14
bytes. However, such a condition is really indicative of a broken driver,
and should be loudly signaled, rather than silently dropped as the case is
today.
Convert the condition to use net_warn_ratelimited() to ensure the stack
loudly complains about such broken drivers.
Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: David S. Miller <davem@davemloft.net>