RDS ping messages are sent with a non-zero src port to a zero
dst port, so that the rds pong messages can be sent back to the
originators src port. However if a confused/malicious sender
sends a ping with a 0 src port, we'd have an infinite ping-pong
loop. To avoid this, the receiver should ignore ping messages
with a 0 src port.
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When reconnecting, the peer with the smaller IP address will initiate
the reconnect, to avoid needless duelling SYN issues.
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds ->conn_path_connect callbacks in the rds_transport
that are used to set up a single connection path.
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ->sk_user_data contains a pointer to the rds_conn_path
for the socket. Use this consistently in the rds_tcp_data_ready
callbacks to get the rds_conn_path for rds_recv_incoming.
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The socket callbacks should all operate on a struct rds_conn_path,
in preparation for a MP capable RDS-TCP.
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A single rds_connection may have multiple rds_conn_paths that have
to be carefully and correctly destroyed, for both rmmod and
netns-delete cases.
For both cases, we extract a single rds_tcp_connection for
each conn into a temporary list, and then invoke rds_conn_destroy()
which iteratively dismantles every path in the rds_connection.
For the netns deletion case, we additionally have to make sure
that we do not leave a socket in TIME_WAIT state, as this will
hold up the netns deletion. Thus we call rds_tcp_conn_paths_destroy()
to reset state quickly.
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The struct rds_tcp_connection is the transport-specific private
data structure that tracks TCP information per rds_conn_path.
Modify this structure to have a back-pointer to the rds_conn_path
for which it is the ->cp_transport_data.
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The c_passive bit is only intended for the IB transport and will
never be encountered in rds-tcp, so remove the dead logic that
predicates on this bit.
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor code to avoid separate indirections for single-path
and multipath transports. All transports (both single and mp-capable)
will get a pointer to the rds_conn_path, and can trivially derive
the rds_connection from the ->cp_conn.
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adds a bpf helper, bpf_skb_in_cgroup, to decide if a skb->sk
belongs to a descendant of a cgroup2. It is similar to the
feature added in netfilter:
commit c38c4597e4 ("netfilter: implement xt_cgroup cgroup2 path match")
The user is expected to populate a BPF_MAP_TYPE_CGROUP_ARRAY
which will be used by the bpf_skb_in_cgroup.
Modifications to the bpf verifier is to ensure BPF_MAP_TYPE_CGROUP_ARRAY
and bpf_skb_in_cgroup() are always used together.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Alexei Starovoitov <ast@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to commit 9b368814b3 ("net: fix bridge multicast packet checksum validation")
we need to fixup the checksum for CHECKSUM_COMPLETE when
pushing skb on RX path. Otherwise we get similar splats.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
People who use PACKET_FANOUT_HASH want a symmetric hash, meaning that
they want packets going in both directions on a flow to hash to the
same bucket.
The core kernel SKB hash became non-symmetric when the ipv6 flow label
and other entities were incorporated into the standard flow hash order
to increase entropy.
But there are no users of PACKET_FANOUT_HASH who want an assymetric
hash, they all want a symmetric one.
Therefore, use the flow dissector to compute a flat symmetric hash
over only the protocol, addresses and ports. This hash does not get
installed into and override the normal skb hash, so this change has
no effect whatsoever on the rest of the stack.
Reported-by: Eric Leblond <eric@regit.org>
Tested-by: Eric Leblond <eric@regit.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since bpf_prog_get() and program type check is used in a couple of places,
refactor this into a small helper function that we can make use of. Since
the non RO prog->aux part is not used in performance critical paths and a
program destruction via RCU is rather very unlikley when doing the put, we
shouldn't have an issue just doing the bpf_prog_get() + prog->type != type
check, but actually not taking the ref at all (due to being in fdget() /
fdput() section of the bpf fd) is even cleaner and makes the diff smaller
as well, so just go for that. Callsites are changed to make use of the new
helper where possible.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This option was removed in commit 47dcf0cb10 ("[NET]: Rethink mark field
in struct flowi").
Signed-off-by: Moritz Sichert <moritz+linux@sichert.me>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
There are code duplications of a masked ethernet address comparison here
so make it a separate function instead.
Miscellanea:
o Neaten alignment of FWINV macro uses to make it clearer for the reader
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
No need for a special case to handle NF_INET_POST_ROUTING, this is
basically the same handling as for prerouting, input, forward.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Can overflow so we might allocate very small table when bucket count is
high on a 32bit platform.
Note: resize is only possible from init_netns.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch introduces a new event - NETDEV_CHANGE_TX_QUEUE_LEN, this
will be triggered when tx_queue_len. It could be used by net device
who want to do some processing at that time. An example is tun who may
want to resize tx array when tx_queue_len is changed.
Cc: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
cl->cl_vt alone is relative only to the current backlog period, while
the curve operates on cumulative virtual time. This patch adds missing
cl->cl_vtoff.
Signed-off-by: Michal Soltys <soltys@ziu.info>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a class is going passive, it should update its cl_vt first
to be consistent with the last dequeue operation.
Otherwise its cl_vt will be one packet behind and parent's cvtmax might
not be updated as well.
One possible side effect is if some class goes passive and subsequently
goes active /without/ its parent going passive - with cl_vt lagging one
packet behind - comparison made in init_vf() will be affected (same
period).
Signed-off-by: Michal Soltys <soltys@ziu.info>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is update to:
commit a09ceb0e08 ("sched: remove qdisc->drop")
That commit removed qdisc->drop, but left alone dlist and droplist
that no longer serve any meaningful purpose.
Signed-off-by: Michal Soltys <soltys@ziu.info>
Signed-off-by: David S. Miller <davem@davemloft.net>
The condition can only succeed on wrong configurations.
Signed-off-by: Michal Soltys <soltys@ziu.info>
Signed-off-by: David S. Miller <davem@davemloft.net>
Realtime scheduling implemented in HFSC uses head of the queue to make
the decision about which packet to schedule next. But in case of any
head drop, the deadline calculated for the previous head is not
necessarily correct for the next head (unless both packets have the same
length).
Thanks to peek() function used during dequeue - which internally is a
dequeue operation - hfsc is almost safe from this issue, as peek()
dequeues and isolates the head storing it temporarily until the real
dequeue happens.
But there is one exception: if after the class activation a drop happens
before the first dequeue operation, there's never a chance to do the
peek().
Adding peek() call in enqueue - if this is the first packet in a new
backlog period AND the scheduler has realtime curve defined - fixes that
one corner case. The 1st hfsc_dequeue() will use that peeked packet,
similarly as every subsequent hfsc_dequeue() call uses packet peeked by
the previous call.
Signed-off-by: Michal Soltys <soltys@ziu.info>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some arches have virtually mapped kernel stacks, or will soon have.
tcp_md5_hash_header() uses an automatic variable to copy tcp header
before mangling th->check and calling crypto function, which might
be problematic on such arches.
David says that using percpu storage is also problematic on non SMP
builds.
Just use kmalloc() to allocate scratch areas.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a jumbo packet is being split up and processed, the crypto checksum
for each split-out packet is in the jumbo header and needs placing in the
reconstructed packet header.
When the code was changed to keep the stored copy of the packet header in
host byte order, this reconstruction was missed.
Found with sparse with CF=-D__CHECK_ENDIAN__:
../net/rxrpc/input.c:479:33: warning: incorrect type in assignment (different base types)
../net/rxrpc/input.c:479:33: expected unsigned short [unsigned] [usertype] _rsvd
../net/rxrpc/input.c:479:33: got restricted __be16 [addressable] [usertype] _rsvd
Fixes: 0d12f8a402 ("rxrpc: Keep the skb private record of the Rx header in host byte order")
Signed-off-by: David Howells <dhowells@redhat.com>
ip_skb_dst_mtu uses skb->sk, assuming it is an AF_INET socket (e.g. it
calls ip_sk_use_pmtu which casts sk as an inet_sk).
However, in the case of UDP tunneling, the skb->sk is not necessarily an
inet socket (could be AF_PACKET socket, or AF_UNSPEC if arriving from
tun/tap).
OTOH, the sk passed as an argument throughout IP stack's output path is
the one which is of PMTU interest:
- In case of local sockets, sk is same as skb->sk;
- In case of a udp tunnel, sk is the tunneling socket.
Fix, by passing ip_finish_output's sk to ip_skb_dst_mtu.
This augments 7026b1ddb6 'netfilter: Pass socket pointer down through okfn().'
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When adding rule with NLM_F_EXCL flag then check if the same rule exist.
If yes then exit with -EEXIST.
This is already implemented in iproute2:
if (cmd == RTM_NEWRULE) {
req.n.nlmsg_flags |= NLM_F_CREATE|NLM_F_EXCL;
req.r.rtm_type = RTN_UNICAST;
}
Tested ipv4 and ipv6 with net-next linux on qemu x86
expected behavior after patch:
localhost ~ # ip rule
0: from all lookup local
32766: from all lookup main
32767: from all lookup default
localhost ~ # ip rule add from 10.46.177.97 lookup 104 pref 1005
localhost ~ # ip rule add from 10.46.177.97 lookup 104 pref 1005
RTNETLINK answers: File exists
localhost ~ # ip rule
0: from all lookup local
1005: from 10.46.177.97 lookup 104
32766: from all lookup main
32767: from all lookup default
There was already topic regarding this but I don't see any changes
merged and problem still occurs.
https://lkml.kernel.org/r/1135778809.5944.7.camel+%28%29+localhost+%21+localdomain
Signed-off-by: Mateusz Bajorski <mateusz.bajorski@nokia.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We found that sometimes a restored tcp socket doesn't work.
A reason of this bug is incorrect window parameters and in this case
tcp_acceptable_seq() returns tcp_wnd_end(tp) instead of tp->snd_nxt. The
other side drops packets with this seq, because seq is less than
tp->rcv_nxt ( tcp_sequence() ).
Data from a send queue is sent only if there is enough space in a
window, so when we restore unacked data, we need to expand a window to
fit this data.
This was in a first version of this patch:
"tcp: extend window to fit all restored unacked data in a send queue"
Then Alexey recommended me to restore window parameters instead of
adjusted them according with data in a sent queue. This sounds resonable.
rcv_wnd has to be restored, because it was reported to another side
and the offered window is never shrunk.
One of reasons why we need to restore snd_wnd was described above.
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds stats support for the currently used IGMP/MLD types by the
bridge. The stats are per-port (plus one stat per-bridge) and per-direction
(RX/TX). The stats are exported via netlink via the new linkxstats API
(RTM_GETSTATS). In order to minimize the performance impact, a new option
is used to enable/disable the stats - multicast_stats_enabled, similar to
the recent vlan stats. Also in order to avoid multiple IGMP/MLD type
lookups and checks, we make use of the current "igmp" member of the bridge
private skb->cb region to record the type on Rx (both host-generated and
external packets pass by multicast_rcv()). We can do that since the igmp
member was used as a boolean and all the valid IGMP/MLD types are positive
values. The normal bridge fast-path is not affected at all, the only
affected paths are the flooding ones and since we make use of the IGMP/MLD
type, we can quickly determine if the packet should be counted using
cache-hot data (cb's igmp member). We add counters for:
* IGMP Queries
* IGMP Leaves
* IGMP v1/v2/v3 reports
* MLD Queries
* MLD Leaves
* MLD v1/v2 reports
These are invaluable when monitoring or debugging complex multicast setups
with bridges.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for the IFLA_STATS_LINK_XSTATS_SLAVE attribute
which allows to export per-slave statistics if the master device supports
the linkxstats callback. The attribute is passed down to the linkxstats
callback and it is up to the callback user to use it (an example has been
added to the only current user - the bridge). This allows us to query only
specific slaves of master devices like bridge ports and export only what
we're interested in instead of having to dump all ports and searching only
for a single one. This will be used to export per-port IGMP/MLD stats and
also per-port vlan stats in the future, possibly other statistics as well.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We've accumulated a couple of different fixes now to mesh_sta_cleanup()
due to the different paths that user_mpm and !user_mpm cases take -- one
fix to flush nexthop paths and one to fix the counting.
The only caller of mesh_plink_deactivate() is mesh_sta_cleanup(), so we
can push the user_mpm checks down into there in order to share more
code.
In doing so, we can remove an extra call to mesh_path_flush_by_nexthop()
and the (unnecessary) call to mesh_accept_plinks_update(). This will
also ensure the powersaving state code gets called in the user_mpm case.
The only cleanup tasks we need to avoid when MPM is in user-space
are sending the peering frames and stopping the plink timer, so wrap
those in the appropriate check.
Signed-off-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Previously, the action frames to group address was not encrypted. But
[1] "Table 8-38 Category values" indicates "Mesh" and "Multihop" category
action frames should be encrypted (Group addressed privacy == yes). And the
encyption key should be MGTK ([1] 10.13 Group addressed robust management frame
procedures). So this patch modifies the code to make it suitable for spec.
[1] IEEE Std 802.11-2012
Signed-off-by: Masashi Honma <masashi.honma@gmail.com>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
We normally return an uninitialized value, but no one checks it so it
doesn't matter. Anyway, let's silence the static checker warning.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
When building a kernel with W=1, the nl80211.c file causes a number of
warnings, all about the same problem:
net/wireless/nl80211.c: In function 'nl80211_parse_mesh_config':
net/wireless/nl80211.c:5287:103: error: comparison is always false due to limited range of data type [-Werror=type-limits]
net/wireless/nl80211.c:5290:96: error: comparison is always false due to limited range of data type [-Werror=type-limits]
net/wireless/nl80211.c:5293:124: error: comparison is always false due to limited range of data type [-Werror=type-limits]
net/wireless/nl80211.c:5295:148: error: comparison is always false due to limited range of data type [-Werror=type-limits]
net/wireless/nl80211.c:5298:106: error: comparison is always false due to limited range of data type [-Werror=type-limits]
net/wireless/nl80211.c:5305:116: error: comparison is always false due to limited range of data type [-Werror=type-limits]
The problem is that gcc does not notice that the check is generate
by a macro, so it complains about comparing an unsigned type against 0.
I've tried to come up with a way to rephrase that code in a way that
avoids the warnings and otherwise improves the code as well.
This uses a set of new helper functions that perform the range checking,
and should provide slightly better type safety than the older patch,
at the expense of adding 44 lines to the code. Binary code size is
basically unchanged though (20 bytes added to 126561 bytes .text).
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
This work adds a helper for changing skb->pkt_type in a controlled way.
We only allow a subset of possible values and can extend that in future
should other use cases come up. Doing this as a helper has the advantage
that errors can be handeled gracefully and thus helper kept extensible.
It's a write counterpart to pkt_type member we can already read from
struct __sk_buff context. Major use case is to change incoming skbs to
PACKET_HOST in a programmatic way instead of having to recirculate via
redirect(..., BPF_F_INGRESS), for example.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a minimal helper for doing the groundwork of changing
the skb->protocol in a controlled way. Currently supported is v4 to
v6 and vice versa transitions, which allows f.e. for a minimal, static
nat64 implementation where applications in containers that still
require IPv4 can be transparently operated in an IPv6-only environment.
For example, host facing veth of the container can transparently do
the transitions in a programmatic way with the help of clsact qdisc
and cls_bpf.
Idea is to separate concerns for keeping complexity of the helper
lower, which means that the programs utilize bpf_skb_change_proto(),
bpf_skb_store_bytes() and bpf_lX_csum_replace() to get the job done,
instead of doing everything in a single helper (and thus partially
duplicating helper functionality). Also, bpf_skb_change_proto()
shouldn't need to deal with raw packet data as this is done by other
helpers.
bpf_skb_proto_6_to_4() and bpf_skb_proto_4_to_6() unclone the skb to
operate on a private one, push or pop additionally required header
space and migrate the gso/gro meta data from the shared info. We do
mark the gso type as dodgy so that headers are checked and segs
recalculated by the gso/gro engine. The gso_size target is adapted
as well. The flags argument added is currently reserved and can be
used for future extensions.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use smp_processor_id() for the generic helper bpf_get_smp_processor_id()
instead of the raw variant. This allows for preemption checks when we
have DEBUG_PREEMPT, and otherwise uses the raw variant anyway. We only
need to keep the raw variant for socket filters, but we can reuse the
helper that is already there from cBPF side.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Several cases of overlapping changes, except the packet scheduler
conflicts which deal with the addition of the free list parameter
to qdisc_enqueue().
Signed-off-by: David S. Miller <davem@davemloft.net>
The bat_algo.h had some functions declared which were not part of the
bat_algo.c file. These are instead stored in bat_v.c and bat_iv_ogm.c. The
declaration should therefore be also in bat_v.h and bat_iv_ogm,h to make
them easier to find.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
This patch adds a debugfs table with originators and their according
multicast flags to help users figure out why multicast optimizations
might be enabled or disabled for them.
Tested-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
There are several places in batman-adv which provide logging related
functions. These should be grouped together in the log.* files to make them
easier to find.
Reported-by: Markus Pargmann <mpa@pengutronix.de>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
With this patch changes relevant to a node's own multicast flags are
printed to the 'mcast' log level.
Tested-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The bat_algo functionality in main.c is mostly unrelated to the rest of the
content. It still takes up a large portion of this source file (~15%, 103
lines). Moving it to a separate file makes it better visible as a main
component of the batman-adv implementation and hides it less in the other
helper functions in main.c.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
With this patch we are finally able to support multicast optimizations
in bridged setups, too. So far, if a bridge was added on top of a
soft-interface (e.g. bat0) the batman-adv multicast optimizations
needed to be disabled to avoid packetloss.
Current Linux bridge implementations and API can now provide us
with the so far missing information about interested but "remote"
multicast receivers behind bridge ports.
The Linux bridge performs the detection of remote participants
interested in multicast packets with its own and mature so
called IGMP and MLD snooping code and stores that in its
database. With the new API provided by the bridge batman-adv can
now simply hook into this database.
We then reliably announce the gathered multicast listeners to
other nodes through the batman-adv translation table.
Additionally, the Linux bridge provides us with the information about
whether an IGMP/MLD querier exists. If there is none then we need to
disable multicast optimizations as we cannot learn about multicast
listeners on external, bridged-in host then.
Tested-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The tvlv functionality in main.c is mostly unrelated to the rest of the
content. It still takes up a large portion of this source file (~45%, 588
lines). Moving it to a separate file makes it better visible as a main
component of the batman-adv implementation and hides it less in the other
helper functions in main.c
Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
[sven@narfation.org: fix conflicts with current version, fix includes,
rewrote commit message]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
With this patch IGMP or MLD reports are always flooded. This is
necessary for the upcoming bridge integration to function without
multicast packet loss.
With the report handling so far bridges might miss interested multicast
listeners, leading to wrongly excluding ports from multicast packet
forwarding.
Currently we are treating IGMP/MLD reports, the messages bridges use to
learn about interested multicast listeners, just as any other multicast
packet: We try to send them to nodes matching its multicast destination.
Unfortunately, the destination address of reports of the older
IGMPv2/MLDv1 protocol families do not strictly adhere to their own
protocol: More precisely, the interested receiver, an IGMPv2 or MLDv1
querier, itself usually does not listen to the multicast destination
address of any reports.
Therefore with this patch we are simply excluding IGMP/MLD reports from
the multicast forwarding code path and keep flooding them. By that
any bridge receives them and can properly learn about listeners.
To avoid compatibility issues with older nodes not yet implementing this
report handling, we need to force them to flood reports: We do this by
bumping the multicast TVLV version to 2, effectively disabling their
multicast optimization.
Tested-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
It is easier to detect if a include is already there for a used
functionality when the includes are ordered. Using an alphabetic order
together with the grouping in commit 1e2c2a4fe4 ("batman-adv: Add
required includes to all files") makes includes better manageable.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Unfragmented frames which traverse a node have their skb->priority set
by looking at the IP ToS byte, or the 802.1p header. However for
fragments this is not possible, only one of the fragments will contain
the headers. Instead, place the priority into the fragment header and
on receiving a fragment, use this information to set the skb->priority
for when the fragment is forwarded.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
main.h includes statements which (re)define preprocessor variables which
influence the compiled code. This makes it necessary to include it in all
files. For example, it redefines pr_fmt used to the module as prefix for
each pr_* message.
Reported-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
BATMAN will set the skb->priority based on the IP precedence or 802.1q
tag. However, if it needs to fragment the frame, it currently leaves
the fragment skb with the default priority and actually overwrites the
priority in the unfragmented frame. Fix this.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The ELP interval and throughput override interface settings are initialized
with default settings on every time an interface is added to a mesh.
This patch prevents this behavior by moving the configuration init to the
interface detection routine which runs only once per interface.
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
[a@unstable.cc: move initialization to batadv_v_hardif_init]
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
To reduce the field pollution in our main batadv_priv data structure
we've already created some substructures so that we could group fields
in a convenient manner.
However gw_mode and gw_sel_class are still part of the main object.
More both fields to the GW private substructure.
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
the compiler can optimize functions within the same C file and therefore
there is no need to make it explicit.
Remove the useless inline attribute for __batadv_store_uint_attr()
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The ogm_emit and ogm_schedule API calls were rather tight to the
B.A.T.M.A.N. IV logic and therefore rather difficult to use
with other algorithm implementations.
Remove such calls and move the surrounding logic into the
B.A.T.M.A.N. IV specific code.
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
To make it easier to search through the code it is better to print static
strings directly instead of using format strings printing constants.
This was addressed in a previous patch, but the Gateway table header
was not updated accordingly.
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Reviewed-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Pull networking fixes from David Miller:
"I've been traveling so this accumulates more than week or so of bug
fixing. It perhaps looks a little worse than it really is.
1) Fix deadlock in ath10k driver, from Ben Greear.
2) Increase scan timeout in iwlwifi, from Luca Coelho.
3) Unbreak STP by properly reinjecting STP packets back into the
stack. Regression fix from Ido Schimmel.
4) Mediatek driver fixes (missing malloc failure checks, leaking of
scratch memory, wrong indexing when mapping TX buffers, etc.) from
John Crispin.
5) Fix endianness bug in icmpv6_err() handler, from Hannes Frederic
Sowa.
6) Fix hashing of flows in UDP in the ruseport case, from Xuemin Su.
7) Fix netlink notifications in ovs for tunnels, delete link messages
are never emitted because of how the device registry state is
handled. From Nicolas Dichtel.
8) Conntrack module leaks kmemcache on unload, from Florian Westphal.
9) Prevent endless jump loops in nft rules, from Liping Zhang and
Pablo Neira Ayuso.
10) Not early enough spinlock initialization in mlx4, from Eric
Dumazet.
11) Bind refcount leak in act_ipt, from Cong WANG.
12) Missing RCU locking in HTB scheduler, from Florian Westphal.
13) Several small MACSEC bug fixes from Sabrina Dubroca (missing RCU
barrier, using heap for SG and IV, and erroneous use of async flag
when allocating AEAD conext.)
14) RCU handling fix in TIPC, from Ying Xue.
15) Pass correct protocol down into ipv4_{update_pmtu,redirect}() in
SIT driver, from Simon Horman.
16) Socket timer deadlock fix in TIPC from Jon Paul Maloy.
17) Fix potential deadlock in team enslave, from Ido Schimmel.
18) Memory leak in KCM procfs handling, from Jiri Slaby.
19) ESN generation fix in ipv4 ESP, from Herbert Xu.
20) Fix GFP_KERNEL allocations with locks held in act_ife, from Cong
WANG.
21) Use after free in netem, from Eric Dumazet.
22) Uninitialized last assert time in multicast router code, from Tom
Goff.
23) Skip raw sockets in sock_diag destruction broadcast, from Willem
de Bruijn.
24) Fix link status reporting in thunderx, from Sunil Goutham.
25) Limit resegmentation of retransmit queue so that we do not
retransmit too large GSO frames. From Eric Dumazet.
26) Delay bpf program release after grace period, from Daniel
Borkmann"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (141 commits)
openvswitch: fix conntrack netlink event delivery
qed: Protect the doorbell BAR with the write barriers.
neigh: Explicitly declare RCU-bh read side critical section in neigh_xmit()
e1000e: keep VLAN interfaces functional after rxvlan off
cfg80211: fix proto in ieee80211_data_to_8023 for frames without LLC header
qlcnic: use the correct ring in qlcnic_83xx_process_rcv_ring_diag()
bpf, perf: delay release of BPF prog after grace period
net: bridge: fix vlan stats continue counter
tcp: do not send too big packets at retransmit time
ibmvnic: fix to use list_for_each_safe() when delete items
net: thunderx: Fix TL4 configuration for secondary Qsets
net: thunderx: Fix link status reporting
net/mlx5e: Reorganize ethtool statistics
net/mlx5e: Fix number of PFC counters reported to ethtool
net/mlx5e: Prevent adding the same vxlan port
net/mlx5e: Check for BlueFlame capability before allocating SQ uar
net/mlx5e: Change enum to better reflect usage
net/mlx5: Add ConnectX-5 PCIe 4.0 to list of supported devices
net/mlx5: Update command strings
net: marvell: Add separate config ANEG function for Marvell 88E1111
...
* fix mesh peer link counter, decrement wasn't always done at all
* fix ethertype (length) for packets without RFC 1042 or bridge
tunnel header
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJXc5wQAAoJEGt7eEactAAd2HQP/RS3lZaNn97spZp9Vb9Swqd6
UjhT4vHL2wHPMhPals2M2ztRhL8df/55tOhSuCwUki0gTsAfP4+Jz1PZ4okFUSBn
6EHvRtAgSAzjBElZVEsNO2sM/qm0UOnlqm4XpOXcKzpnRzDSlIa6vpnctp2MJMBI
PB46x8BPQoi192aTzleMqgBwjrMX1ohsM6rgOIEf7rvLhftx6LRRm7/uKaiwenVT
ZxkP6jIOs5ofkUHLHS4OPyId4yReO8oMMbRRA4fW1z2T8hME9PH3jTZL3Znaj7VF
s2LLSqwUlv06k0WFu3SzJh750jCJ6cnJiIqggDlPChrb0/DPicqgWR7/tJAvNVfj
WJqugeNtEVI+4ElGNxFzFAvjAKqO8fHY5U8Ko4LtZuwhjpo5GDvA+FErFESRXvCG
c647cDMhn+6OXwpUwsggQK4c+AS4QX/PzHjmhW5tKORPXlYVXAAz+JDXn195WnzV
w4UAO4ZWai9XY6oSc63uudaHMt8Xkmq8PRQsH1hHG20CAJ+7bGugaWVTci5fzeNh
JC4X+Xh6t+5qq1jEWgr+KguVKXtP3pVetqGy9kYQrJov0fTe2Z6deEeY08sH1Ffz
xdnQUsZphvqSkosTw76JtNPXz2ajxZCD3KNWjcpkL/NNs9lnIZPZc/mnzKP7ltro
ZEYw0HbcXVt/uFhHlwdG
=dAbs
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2016-06-29-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
Just two small fixes
* fix mesh peer link counter, decrement wasn't always done at all
* fix ethertype (length) for packets without RFC 1042 or bridge
tunnel header
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Only the first and last netlink message for a particular conntrack are
actually sent. The first message is sent through nf_conntrack_confirm when
the conntrack is committed. The last one is sent when the conntrack is
destroyed on timeout. The other conntrack state change messages are not
advertised.
When the conntrack subsystem is used from netfilter, nf_conntrack_confirm
is called for each packet, from the postrouting hook, which in turn calls
nf_ct_deliver_cached_events to send the state change netlink messages.
This commit fixes the problem by calling nf_ct_deliver_cached_events in the
non-commit case as well.
Fixes: 7f8a436eaa ("openvswitch: Add conntrack action")
CC: Joe Stringer <joestringer@nicira.com>
CC: Justin Pettit <jpettit@nicira.com>
CC: Andy Zhou <azhou@nicira.com>
CC: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Samuel Gauthier <samuel.gauthier@6wind.com>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
If skb_unshare() fails, we call qdisc_drop() with a NULL skb, which
is no longer supported.
Fixes: 520ac30f45 ("net_sched: drop packets after root qdisc lock is released")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
neigh_xmit() expects to be called inside an RCU-bh read side critical
section, and while one of its two current callers gets this right, the
other one doesn't.
More specifically, neigh_xmit() has two callers, mpls_forward() and
mpls_output(), and while both callers call neigh_xmit() under
rcu_read_lock(), this provides sufficient protection for neigh_xmit()
only in the case of mpls_forward(), as that is always called from
softirq context and therefore doesn't need explicit BH protection,
while mpls_output() can be called from process context with softirqs
enabled.
When mpls_output() is called from process context, with softirqs
enabled, we can be preempted by a softirq at any time, and RCU-bh
considers the completion of a softirq as signaling the end of any
pending read-side critical sections, so if we do get a softirq
while we are in the part of neigh_xmit() that expects to be run inside
an RCU-bh read side critical section, we can end up with an unexpected
RCU grace period running right in the middle of that critical section,
making things go boom.
This patch fixes this impedance mismatch in the callee, by making
neigh_xmit() always take rcu_read_{,un}lock_bh() around the code that
expects to be treated as an RCU-bh read side critical section, as this
seems a safer option than fixing it in the callers.
Fixes: 4fd3d7d9e8 ("neigh: Add helper function neigh_xmit")
Signed-off-by: David Barroso <dbarroso@fastly.com>
Signed-off-by: Lennert Buytenhek <lbuytenhek@fastly.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The PDU length of incoming LLC frames is set to the total skb payload size
in __ieee80211_data_to_8023() of net/wireless/util.c which incorrectly
includes the length of the IEEE 802.11 header.
The resulting LLC frame header has a too large PDU length, causing the
llc_fixup_skb() function of net/llc/llc_input.c to reject the incoming
skb, effectively breaking STP.
Solve the problem by properly substracting the IEEE 802.11 frame header size
from the PDU length, allowing the LLC processor to pick up the incoming
control messages.
Special thanks to Gerry Rozema for tracking down the regression and proposing
a suitable patch.
Fixes: 2d1c304cb2 ("cfg80211: add function for 802.3 conversion with separate output buffer")
Cc: stable@vger.kernel.org
Reported-by: Gerry Rozema <gerryr@rozeware.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
I made a dumb off-by-one mistake when I added the vlan stats counter
dumping code. The increment should happen before the check, not after
otherwise we miss one entry when we continue dumping.
Fixes: a60c090361 ("bridge: netlink: export per-vlan stats")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arjun reported a bug in TCP stack and bisected it to a recent commit.
In case where we process SACK, we can coalesce multiple skbs
into fat ones (tcp_shift_skb_data()), to lower write queue
overhead, because we do not expect to retransmit these packets.
However, SACK reneging can happen, forcing the sender to retransmit
all these packets. If skb->len is above 64KB, we then send buggy
IP packets that could hang TSO engine on cxgb4.
Neal suggested to use tcp_tso_autosize() instead of tp->gso_segs
so that we cook packets of optimal size vs TCP/pacing.
Thanks to Arjun for reporting the bug and running the tests !
Fixes: 10d3be5692 ("tcp-tso: do not split TSO packets at retransmit time")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Arjun V <arjun@chelsio.com>
Tested-by: Arjun V <arjun@chelsio.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Context implies that port in struct "udp_media_addr" is referring
to a UDP port.
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The UDP msg2addr function tipc_udp_msg2addr() can return -EINVAL which
prior to this patch was unhanded in the caller.
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The space is missing before the open parenthesis '(', and this
will introduce much more noise when checking patch around.
Signed-off-by: Wei Tang <tangwei@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The untagged vlan object is only destroyed when the interface is removed
via the legacy sysfs interface. But it also has to be destroyed when the
standard rtnl-link interface is used.
Fixes: 5d2c05b213 ("batman-adv: add per VLAN interface attribute framework")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Acked-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The skb_linearize may reallocate the skb. This makes the calculated pointer
for ethhdr invalid. But it the pointer is used later to fill in the RR
field of the batadv_icmp_packet_rr packet.
Instead re-evaluate eth_hdr after the skb_linearize+skb_cow to fix the
pointer and avoid the invalid read.
Fixes: da6b8c20a5 ("batman-adv: generalize batman-adv icmp packet handling")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Each batadv_tt_local_entry hold a single reference to a
batadv_softif_vlan. In case a new entry cannot be added to the hash
table, the error path puts the reference, but the reference will also
now be dropped by batadv_tt_local_entry_release().
Fixes: a33d970d0b ("batman-adv: Fix reference counting of vlan object for tt_local_entry")
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The tt_req_node is added and removed from a list inside a spinlock. But the
locking is sometimes removed even when the object is still referenced and
will be used later via this reference. For example batadv_send_tt_request
can create a new tt_req_node (including add to a list) and later
re-acquires the lock to remove it from the list and to free it. But at this
time another context could have already removed this tt_req_node from the
list and freed it.
CPU#0
batadv_batman_skb_recv from net_device 0
-> batadv_iv_ogm_receive
-> batadv_iv_ogm_process
-> batadv_iv_ogm_process_per_outif
-> batadv_tvlv_ogm_receive
-> batadv_tvlv_ogm_receive
-> batadv_tvlv_containers_process
-> batadv_tvlv_call_handler
-> batadv_tt_tvlv_ogm_handler_v1
-> batadv_tt_update_orig
-> batadv_send_tt_request
-> batadv_tt_req_node_new
spin_lock(...)
allocates new tt_req_node and adds it to list
spin_unlock(...)
return tt_req_node
CPU#1
batadv_batman_skb_recv from net_device 1
-> batadv_recv_unicast_tvlv
-> batadv_tvlv_containers_process
-> batadv_tvlv_call_handler
-> batadv_tt_tvlv_unicast_handler_v1
-> batadv_handle_tt_response
spin_lock(...)
tt_req_node gets removed from list and is freed
spin_unlock(...)
CPU#0
<- returned to batadv_send_tt_request
spin_lock(...)
tt_req_node gets removed from list and is freed
MEMORY CORRUPTION/SEGFAULT/...
spin_unlock(...)
This can only be solved via reference counting to allow multiple contexts
to handle the list manipulation while making sure that only the last
context holding a reference will free the object.
Fixes: a73105b8d4 ("batman-adv: improved client announcement mechanism")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Tested-by: Martin Weinelt <martin@darmstadt.freifunk.net>
Tested-by: Amadeus Alfa <amadeus@chemnitz.freifunk.net>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a VLAN tagged frame is received and the corresponding VLAN is not
configured on the soft interface, it will splat a WARN on every packet
received. This is a quite annoying behaviour for some scenarios, e.g. if
bat0 is bridged with eth0, and there are arbitrary VLAN tagged frames
from Ethernet coming in without having any VLAN configuration on bat0.
The code should probably create vlan objects on the fly and
transparently transport these VLAN-tagged Ethernet frames, but until
this is done, at least the WARN splat should be replaced by a rate
limited output.
Fixes: 354136bcc3 ("batman-adv: fix kernel crash due to missing NULL checks")
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bridge is falsly dropping ipv6 mulitcast packets if there is:
1. No ipv6 address assigned on the brigde.
2. No external mld querier present.
3. The internal querier enabled.
When the bridge fails to build mld queries, because it has no
ipv6 address, it slilently returns, but keeps the local querier enabled.
This specific case causes confusing packet loss.
Ipv6 multicast snooping can only work if:
a) An external querier is present
OR
b) The bridge has an ipv6 address an is capable of sending own queries
Otherwise it has to forward/flood the ipv6 multicast traffic,
because snooping cannot work.
This patch fixes the issue by adding a flag to the bridge struct that
indicates that there is currently no ipv6 address assinged to the bridge
and returns a false state for the local querier in
__br_multicast_querier_exists().
Special thanks to Linus Lüssing.
Fixes: d1d81d4c3d ("bridge: check return value of ipv6_dev_get_saddr()")
Signed-off-by: Daniel Danzberger <daniel@dd-wrt.com>
Acked-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a user space program (e.g., wpa_supplicant) deletes a STA entry that
is currently in NL80211_PLINK_ESTAB state, the number of established
plinks counter was not decremented and this could result in rejecting
new plink establishment before really hitting the real maximum plink
limit. For !user_mpm case, this decrementation is handled by
mesh_plink_deactive().
Fix this by decrementing estab_plinks on STA deletion
(mesh_sta_cleanup() gets called from there) so that the counter has a
correct value and the Beacon frame advertisement in Mesh Configuration
element shows the proper value for capability to accept additional
peers.
Cc: stable@vger.kernel.org
Signed-off-by: Jouni Malinen <j@w1.fi>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Drop redundant include of moduleparam.h
The Coccinelle semantic patch used to make this change is as follows:
@ includesmodule @
@@
#include <linux/module.h>
@ depends on includesmodule @
@@
- #include <linux/moduleparam.h>
Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support to inet_diag facility to filter sockets based on device
index. If an interface index is in the filter only sockets bound
to that index (sk_bound_dev_if) are returned.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes wrong-interface signaling on 32-bit platforms for entries
created when jiffies > 2^31 + MFC_ASSERT_THRESH.
Signed-off-by: Tom Goff <thomas.goff@ll.mit.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use only simple inline assemblies which consist of a single basic
block if the register asm construct is being used.
Otherwise gcc would generate broken code if the compiler option
--sanitize-coverage=trace-pc would be used.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
SMACK uses similar functions to control CIPSO, these are
the equivalent functions for CALIPSO and follow exactly
the same semantics.
int netlbl_cfg_calipso_add(struct calipso_doi *doi_def,
struct netlbl_audit *audit_info)
Adds a CALIPSO doi.
void netlbl_cfg_calipso_del(u32 doi, struct netlbl_audit *audit_info)
Removes a CALIPSO doi.
int netlbl_cfg_calipso_map_add(u32 doi, const char *domain,
const struct in6_addr *addr,
const struct in6_addr *mask,
struct netlbl_audit *audit_info)
Creates a mapping between a domain and a CALIPSO doi. If
addr and mask are non-NULL this creates an address-selector
type mapping.
This also extends netlbl_cfg_map_del() to remove IPv6 address-selector
mappings.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
This works in exactly the same way as the CIPSO label cache.
The idea is to allow the lsm to cache the result of a secattr
lookup so that it doesn't need to perform the lookup for
every skbuff.
It introduces two sysctl controls:
calipso_cache_enable - enables/disables the cache.
calipso_cache_bucket_size - sets the size of a cache bucket.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Lengths, checksum and the DOI are checked. Checking of the
level and categories are left for the socket layer.
CRC validation is performed in the calipso module to avoid
unconditionally linking crc_ccitt() into ipv6.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
This makes it possible to route the error to the appropriate
labelling engine. CALIPSO is far less verbose than CIPSO
when encountering a bogus packet, so there is no need for a
CALIPSO error handler.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
In some cases, the lsm needs to add the label to the skbuff directly.
A NF_INET_LOCAL_OUT IPv6 hook is added to selinux to match the IPv4
behaviour. This allows selinux to label the skbuffs that it requires.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Request sockets need to have a label that takes into account the
incoming connection as well as their parent's label. This is used
for the outgoing SYN-ACK and for their child full-socket.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
If set, these will take precedence over the parent's options during
both sending and child creation. If they're not set, the parent's
options (if any) will be used.
This is to allow the security_inet_conn_request() hook to modify the
IPv6 options in just the same way that it already may do for IPv4.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
CALIPSO is a hop-by-hop IPv6 option. A lot of this patch is based on
the equivalent CISPO code. The main difference is due to manipulating
the options in the hop-by-hop header.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
This is to allow the CALIPSO labelling engine to use these.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
The functionality is equivalent to ipv6_renew_options() except
that the newopt pointer is in kernel, not user, memory
The kernel memory implementation will be used by the CALIPSO network
labelling engine, which needs to be able to set IPv6 hop-by-hop
options.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Remove a specified DOI through the NLBL_CALIPSO_C_REMOVE command.
It requires the attribute:
NLBL_CALIPSO_A_DOI.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
This extends the NLBL_MGMT_C_ADD and NLBL_MGMT_C_ADDDEF commands
to accept CALIPSO protocol DOIs.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Enumerate the DOI list through the NLBL_CALIPSO_C_LISTALL command.
It takes no attributes.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Query a specified DOI through the NLBL_CALIPSO_C_LIST command.
It requires the attribute:
NLBL_CALIPSO_A_DOI.
The reply will contain:
NLBL_CALIPSO_A_MTYPE
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
CALIPSO is a packet labelling protocol for IPv6 which is very similar
to CIPSO. It is specified in RFC 5570. Much of the code is based on
the current CIPSO code.
This adds support for adding passthrough-type CALIPSO DOIs through the
NLBL_CALIPSO_C_ADD command. It requires attributes:
NLBL_CALIPSO_A_TYPE which must be CALIPSO_MAP_PASS.
NLBL_CALIPSO_A_DOI.
In passthrough mode the CALIPSO engine will map MLS secattr levels
and categories directly to the packet label.
At this stage, the major difference between this and the CIPSO
code is that IPv6 may be compiled as a module. To allow for
this the CALIPSO functions are registered at module init time.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
The reason is to allow different labelling protocols for
different address families with the same domain.
This requires the addition of an address family attribute
in the netlink communication protocol. It is used in several
messages:
NLBL_MGMT_C_ADD and NLBL_MGMT_C_ADDDEF take it as an optional
attribute for the unlabelled protocol. It may be one of AF_INET,
AF_INET6 or AF_UNSPEC (to specify both address families). If it
is missing, it defaults to AF_UNSPEC.
NLBL_MGMT_C_LISTALL and NLBL_MGMT_C_LISTDEF return it as part of
the enumeration of each item. Addtionally, it may be sent to
LISTDEF to specify which address family to return.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
This fixes sparse errors of the form:
incompatible types in comparison expression (different address spaces)
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
There are several places where the listener and pending or accept queue
child sockets are accessed at the same time. Lockdep is unhappy that
two locks from the same class are held.
Tell lockdep that it is safe and document the lock ordering.
Originally Claudio Imbrenda <imbrenda@linux.vnet.ibm.com> sent a similar
patch asking whether this is safe. I have audited the code and also
covered the vsock_pending_work() function.
Suggested-by: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
with the commit 8c14586fc3 ("net: ipv6: Use passed in table for
nexthop lookups"), net hop lookup is first performed on route creation
in the passed-in table.
However device match is not enforced in table lookup, so the found
route can be later discarded due to egress device mismatch and no
global lookup will be performed.
This cause the following to fail:
ip link add dummy1 type dummy
ip link add dummy2 type dummy
ip link set dummy1 up
ip link set dummy2 up
ip route add 2001:db8:8086::/48 dev dummy1 metric 20
ip route add 2001:db8:d34d::/64 via 2001:db8:8086::2 dev dummy1 metric 20
ip route add 2001:db8:8086::/48 dev dummy2 metric 21
ip route add 2001:db8:d34d::/64 via 2001:db8:8086::2 dev dummy2 metric 21
RTNETLINK answers: No route to host
This change fixes the issue enforcing device lookup in
ip6_nh_lookup_table()
v1->v2: updated commit message title
Fixes: 8c14586fc3 ("net: ipv6: Use passed in table for nexthop lookups")
Reported-and-tested-by: Beniamino Galvani <bgalvani@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQEcBAABCgAGBQJXa6rGAAoJED07qiWsqSVqzwsH/0g/blikQuNNAB7LzXGTEMn8
ipWCODaYxgWE/V/1A0JjsKpxB1ZL3HE0/xg/OmZ10SRNlz5LJTLkoYZa7xYmyxtT
a89mT9vdnWxwfwbeIhSSjG9W9g44vqrhMIPxffIu5yqSqdIJak0HIiuQizJ0/xjx
xQVx4AykWKQ3u0/Tiyz3ez5yhyMMNEmHxKyDWpgrR6+zXlofwP/Em3NFPwk9gh32
ECAagOUOEvdjXeRs1Yn/CV0FC7Wgs4Hzr048JJ5wOteawBLr+sPOnuWZtDhNrwDK
Eceh7aEeAiFfIW+jCMEe5qrpFHi8hqyqBo4L9TDYsE7T/Z6BPyh+5OUh8+5dtaw=
=TZew
-----END PGP SIGNATURE-----
Merge tag 'linux-can-next-for-4.8-20160623' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next
Marc Kleine-Budde says:
====================
pull-request: can-next 2016-06-17
this is a pull request of 4 patches for net-next/master.
Arnd Bergmann's patch fixes a regresseion in af_can introduced in
linux-can-next-for-4.8-20160617. There are two patches by Ramesh
Shanmugasundaram, which add CAN-2.0 support to the rcar_canfd driver.
And a patch by Ed Spiridonov that adds better error diagnoses messages
to the Ed Spiridonov driver.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace calls to kmalloc followed by a memcpy with a direct call to
kmemdup.
The Coccinelle semantic patch used to make this change is as follows:
@@
expression from,to,size,flag;
statement S;
@@
- to = \(kmalloc\|kzalloc\)(size,flag);
+ to = kmemdup(from,size,flag);
if (to==NULL || ...) S
- memcpy(to, from, size);
Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIVAwUAV2q17PSw1s6N8H32AQKOixAAh+Fw6H7j/biCA73Fi0CtVJpqxvDEoo0d
jGBNtIueiyTnuGpV8yqT23xrgcuaQlLoEPhwQTRPFy/jI2qmah69kuIZzz6ZvNxA
sSsbc4M7PerMwmX/gPwUtvWflal1ECmpc6f+5y3pZBhqowMwm9HwxR0489FCurba
4k1w/OxDQIIH88RIsNcYX129xTIvekzB8bjhkIzfQM3WLfelhYyPr8haTt+CrCuF
gdLB7O+AoCI7rxXuS+7blZq4+AryzNAWjpJQdQXlClF2UJhBDNO6CptmsmL5kpuN
0a6ijumB4Onak6zZhQo5PmvX2UbQbh6QEuGm1ZsdyZoTOFPgynRv0ZVgF4JmBl0t
vhkKrbtcLkSYUHnFUoCBDuJnI9exugUH5a0BjdVPs/J6Zha0eS1pF0IOZEOQzC8j
C4U+dgjKP4OArRLS6rRR5oS99zFijcTR+fNp+0rORwqgiDJhBgkbIO9y2hyDrjr9
OW+Hnbm2EPNF86kLG+zi/OSZ2Af/fX5gtkHvfLETE1rxzZi6G1leBiEOezCVJa1P
/jtr8RR4hDOND8XZ4qvUC0yRBo2ykyW0OjtqZ2No7PuL1z/N6gvaV0PEn62yrP/g
XCcmVxXquTMrQlqu66cQGJ+ZdD6MYa44oAOHGRpX55GCvZyjpMFrLPJ8B9XbXoh3
YXIDu5U8kQs=
=HxJc
-----END PGP SIGNATURE-----
Merge tag 'rxrpc-rewrite-20160622-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc: Get rid of conn bundle and transport structs
Here's the next part of the AF_RXRPC rewrite. The primary purpose of this
set is to get rid of the rxrpc_conn_bundle and rxrpc_transport structs.
This simplifies things for future development of the connection handling.
To this end, the following significant changes are made:
(1) The rxrpc_connection struct is given pointers to the local and peer
endpoints, inside the rxrpc_conn_parameters struct. Pointers to the
transport's copy of these pointers are then redirected to the
connection struct.
(2) Exclusive connection handling is fixed. Exclusive connections should
do just one call and then be retired. They are used in security
negotiations and, I believe, the idea is to avoid reuse of negotiated
security contexts.
The current code is doing a single connection per socket and doing all
the calls over that. With this change it gets a new connection for
each call made.
(3) A new sendmsg() control message marker is added to make individual
calls operate over exclusive connections. This should be used in
future in preference to the sockopt that marks a socket as "exclusive
connection".
(4) IDs for client connections initiated by a machine are now allocated
from a global pool using the IDR facility and are unique across all
client connections, no matter their destination. The IDR facility is
then used to look up a connection on the connection ID alone. Other
parameters are then verified afterwards.
Note that the IDR facility may use a lot of memory if the IDs it holds
are widely scattered. Given this, in a future commit, client
connections will be retired if they are more than a certain distance
from the last ID allocated.
The client epoch is advanced by 1 each time the client ID counter
wraps. Connections outside the current epoch will also be retired in
a future commit.
(5) The connection bundle concept is removed and the client connection
tree is moved into the local endpoint. The queue for waiting for a
call channel is moved to the rxrpc_connection struct as there can only
be one connection for any particular key going to any particular peer
now.
(6) The rxrpc_transport struct is removed and the service connection tree
is moved into the peer struct.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When qdisc bulk dequeue was added in linux-3.18 (commit
5772e9a346 "qdisc: bulk dequeue support for qdiscs
with TCQ_F_ONETXQUEUE"), it was constrained to some
specific qdiscs.
With some extra care, we can extend this to all qdiscs,
so that typical traffic shaping solutions can benefit from
small batches (8 packets in this patch).
For example, HTB is often used on some multi queue device.
And bonding/team are multi queue devices...
Idea is to bulk-dequeue packets mapping to the same transmit queue.
This brings between 35 and 80 % performance increase in HTB setup
under pressure on a bonding setup :
1) NUMA node contention : 610,000 pps -> 1,110,000 pps
2) No node contention : 1,380,000 pps -> 1,930,000 pps
Now we should work to add batches on the enqueue() side ;)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Florian Westphal <fw@strlen.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We already get child qdisc qlen, we also can get its backlog
so that class dumps can report it.
Also replace qstats by a single drop counter, but move it in
a separate cache line so that drops do not dirty useful cache lines.
Tested:
$ tc -s cl sh dev eth0
class htb 1:1 root leaf 3: prio 0 rate 1Gbit ceil 1Gbit burst 500000b cburst 500000b
Sent 2183346912 bytes 9021815 pkt (dropped 2340774, overlimits 0 requeues 0)
rate 1001Mbit 517543pps backlog 120758b 499p requeues 0
lended: 9021770 borrowed: 0 giants: 0
tokens: 9 ctokens: 9
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now we defer skb drops, it makes sense to keep a copy
of skb->truesize in struct codel_skb_cb to avoid one
cache line miss per dropped skb in fq_codel_drop(),
to reduce latencies a bit further.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Qdisc performance suffers when packets are dropped at enqueue()
time because drops (kfree_skb()) are done while qdisc lock is held,
delaying a dequeue() draining the queue.
Nominal throughput can be reduced by 50 % when this happens,
at a time we would like the dequeue() to proceed as fast as possible.
Even FQ is vulnerable to this problem, while one of FQ goals was
to provide some flow isolation.
This patch adds a 'struct sk_buff **to_free' parameter to all
qdisc->enqueue(), and in qdisc_drop() helper.
I measured a performance increase of up to 12 %, but this patch
is a prereq so that future batches in enqueue() can fly.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In ircomm_tty_get_serial_info, struct serial_struct is memset to 0 and
then some members set to 0 explicitly.
Remove the latter as it is obviously superfluous.
And remove the retinfo check against NULL. copy_to_user will take care
of that.
Part of hub6 cleanup series.
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Samuel Ortiz <samuel@sortiz.org>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Only set conntrack mark or labels when the commit flag is specified.
This makes sure we can not set them before the connection has been
persisted, as in that case the mark and labels would be lost in an
event of an userspace upcall.
OVS userspace already requires the commit flag to accept setting
ct_mark and/or ct_labels. Validate for this in the kernel API.
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Set conntrack mark and labels right before committing so that
the initial conntrack NEW event has the mark and labels.
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce a new configuration option for this expression, which allows users
to invert the logic of set lookups.
In _init() we will now return EINVAL if NFT_LOOKUP_F_INV is in anyway
related to a map lookup.
The code in the _eval() function has been untangled and updated to sopport the
XOR of options, as we should consider 4 cases:
* lookup false, invert false -> NFT_BREAK
* lookup false, invert true -> return w/o NFT_BREAK
* lookup true, invert false -> return w/o NFT_BREAK
* lookup true, invert true -> NFT_BREAK
Signed-off-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This flag was introduced to restore rulesets from the new netdev
family, but since 5ebe0b0eec ("netfilter: nf_tables: destroy
basechain and rules on netdevice removal") the ruleset is released
once the netdev is gone.
This also removes nft_register_basechain() and
nft_unregister_basechain() since they have no clients anymore after
this rework.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
No need to restrict this to module parameter.
We export a copy of the real hash size -- when user alters the value we
allocate the new table, copy entries etc before we update the real size
to the requested one.
This is also needed because the real size is used by concurrent readers
and cannot be changed without synchronizing the conntrack generation
seqcnt.
We only allow changing this value from the initial net namespace.
Tested using http-client-benchmark vs. httpterm with concurrent
while true;do
echo $RANDOM > /proc/sys/net/netfilter/nf_conntrack_buckets
done
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
New elements are inactive in the preparation phase, and its
NFT_SET_ELEM_BUSY_MASK flag is set on.
This busy flag doesn't allow us to delete it from the same transaction,
following a sequence like:
begin transaction
add element X
delete element X
end transaction
This sequence is valid and may be triggered by robots. To resolve this
problem, allow deactivating elements that are active in the current
generation (ie. those that has been just added in this batch).
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
set->ops->deactivate() is invoked from nft_del_setelem() that happens
from the transaction path, so we have to check if the object is active
in the next generation, not the current.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch addresses two problems:
1) The netlink dump is inconsistent when interfering with an ongoing
transaction update for several reasons:
1.a) We don't honor the internal NFT_TABLE_INACTIVE flag, and we should
be skipping these inactive objects in the dump.
1.b) We perform speculative deletion during the preparation phase, that
may result in skipping active objects.
1.c) The listing order changes, which generates noise when tracking
incremental ruleset update via tools like git or our own
testsuite.
2) We don't allow to add and to update the object in the same batch,
eg. add table x; add table x { flags dormant\; }.
In order to resolve these problems:
1) If the user requests a deletion, the object becomes inactive in the
next generation. Then, ignore objects that scheduled to be deleted
from the lookup path, as they will be effectively removed in the
next generation.
2) From the get/dump path, if the object is not currently active, we
skip it.
3) Support 'add X -> update X' sequence from a transaction.
After this update, we obtain a consistent list as long as we stay
in the same generation. The userspace side can detect interferences
through the generation counter so it can restart the dumping.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Thus, we can reuse these to check the genmask of any object type, not
only rules. This is required now that tables, chain and sets will get a
generation mask field too in follow up patches.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
li->u.ulog.copy_len is currently ignored by the kernel, we should truncate
the packet to either li->u.ulog.copy_len (if set) or copy_range before
sending it to userspace. 0 is a valid input for copy_len, so add a new
flag to indicate whether this was option was specified by the user or not.
Add two flags to indicate whether nflog-size/copy_len was set or not.
XT_NFLOG_F_COPY_LEN is for XT_NFLOG and NFLOG_F_COPY_LEN for nfnetlink_log
On the userspace side, this was initially represented by the option
nflog-range, this will be replaced by --nflog-size now. --nflog-range would
still exist but does not do anything.
Reported-by: Joe Dollard <jdollard@akamai.com>
Reviewed-by: Josh Hunt <johunt@akamai.com>
Signed-off-by: Vishwanath Pai <vpai@akamai.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
In iptables, if the user add a rule to send tcp RST and specify the
non-TCP protocol, such as UDP, kernel will reject this request. But
in nftables, this validity check only occurs in nft tool, i.e. only
in userspace.
This means that user can add such a rule like follows via nfnetlink:
"nft add rule filter forward ip protocol udp reject with tcp reset"
This will generate some confusing tcp RST packets. So we should send
tcp RST only when it is TCP packet.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Today what is normally called data (the mount options) is not passed
to fill_super through mount_ns.
Pass the mount options and the namespace separately to mount_ns so
that filesystems such as proc that have mount options, can use
mount_ns.
Pass the user namespace to mount_ns so that the standard permission
check that verifies the mounter has permissions over the namespace can
be performed in mount_ns instead of in each filesystems .mount method.
Thus removing the duplication between mqueuefs and proc in terms of
permission checks. The extra permission check does not currently
affect the rpc_pipefs filesystem and the nfsd filesystem as those
filesystems do not currently allow unprivileged mounts. Without
unpvileged mounts it is guaranteed that the caller has already passed
capable(CAP_SYS_ADMIN) which guarantees extra permission check will
pass.
Update rpc_pipefs and the nfsd filesystem to ensure that the network
namespace reference is always taken in fill_super and always put in kill_sb
so that the logic is simpler and so that errors originating inside of
fill_super do not cause a network namespace leak.
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
If the packet was dropped by lower qdisc, then we must not
access it later.
Save qdisc_pkt_len(skb) in a temp variable.
Fixes: 2ccccf5fb4 ("net_sched: update hierarchical backlog too")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexey reported that we have GFP_KERNEL allocation when
holding the spinlock tcf_lock. Actually we don't have
to take that spinlock for all the cases, especially
for the new one we just create. To modify the existing
actions, we still need this spinlock to make sure
the whole update is atomic.
For net-next, we can get rid of this spinlock because
we already hold the RTNL lock on slow path, and on fast
path we can use RCU to protect the metalist.
Joint work with Jamal.
Reported-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Blair Steven noticed that ESN in conjunction with UDP encapsulation
is broken because we set the temporary ESP header to the wrong spot.
This patch fixes this by first of all using the right spot, i.e.,
4 bytes off the real ESP header, and then saving this information
so that after encryption we can restore it properly.
Fixes: 7021b2e1cd ("esp4: Switch to new AEAD interface")
Reported-by: Blair Steven <Blair.Steven@alliedtelesis.co.nz>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When user add a nft rule to set nftrace to zero, for example:
# nft add rule ip filter input nftrace set 0
We should set nf_trace to zero also.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If expr init fails then we need to free it.
So when the user add a nft rule as follows:
# nft add rule filter input tcp dport 22 flow table ssh \
{ ip saddr limit rate 0/second }
memory leak will happen.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Making this work is a little tricky as it really isn't kosher to
change the xt_owner_match_info in a check function.
Without changing xt_owner_match_info we need to know the user
namespace the uids and gids are specified in. In the common case
net->user_ns == current_user_ns(). Verify net->user_ns ==
current_user_ns() in owner_check so we can later assume it in
owner_mt.
In owner_check also verify that all of the uids and gids specified are
in net->user_ns and that the expected min/max relationship exists
between the uids and gids in xt_owner_match_info.
In owner_mt get the network namespace from the outgoing socket, as this
must be the same network namespace as the netfilter rules, and use that
network namespace to find the user namespace the uids and gids in
xt_match_owner_info are encoded in. Then convert from their encoded
from into the kernel internal format for uids and gids and perform the
owner match.
Similar to ping_group_range, this code does not try to detect
noncontiguous UID/GID ranges.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Kevin Cernekee <cernekee@chromium.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Curently we store zone information as a conntrack extension.
This has one drawback: for every lookup we need to fetch the zone data
from the extension area.
This change place the zone data directly into the main conntrack object
structure and then removes the zone conntrack extension.
The zone data is just 4 bytes, it fits into a padding hole before
the tuplehash info, so we do not even increase the nf_conn structure size.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If 'logger' was NULL, there would be a direct jump to the label 'out',
since it has already been checked for NULL, remove this unnecessary
check.
Signed-off-by: Shivani Bhardwaj <shivanib134@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
increases struct size by 32 bytes (288 -> 320), but it is the right thing,
else any attempt to (re-)arrange nf_conn members by cacheline won't work.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Consider such situation, if nf_log_ipv4 kernel module is not installed,
and the user add a following iptables rule:
# iptables -t raw -I PREROUTING -j TRACE
There will be no trace log generated until the user install nf_log_ipv4
module manully. So we should add request related nf_log module
appropriately here.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When we request NFPROTO_INET, it means both NFPROTO_IPV4 and NFPROTO_IPV6.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Since we cannot make sure that the 'hook_mask' will always be none
zero here. If it equals to zero, the num_hooks will be zero too,
and then kmalloc() will return ZERO_SIZE_PTR, which is (void *)16.
Then the following error check will fails:
ops = kmalloc(sizeof(*ops) * num_hooks, GFP_KERNEL);
if (ops == NULL)
return ERR_PTR(-ENOMEM);
So this patch will fix this with just doing the zero check before
kmalloc() is called.
Maybe the case above will never happen here, but in theory.
Signed-off-by: Xiubo Li <lixiubo@cmss.chinamobile.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The change to leave out procfs support in CAN when CONFIG_PROC_FS
is not set was incomplete and leads to a build error:
net/built-in.o: In function `can_init':
:(.init.text+0x9858): undefined reference to `can_stat_update'
ERROR: "can_stat_update" [net/can/can.ko] undefined!
This tries a better approach, encapsulating all of the calls
within IS_ENABLED(), so we also leave out the timer function
from the object file.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: a20fadf853 ("can: build proc support only if CONFIG_PROC_FS is activated")
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
The commit f2a4d086ed ("openvswitch: Add packet truncation support.")
introduces packet truncation before sending to userspace upcall receiver.
This patch passes up the skb->len before truncation so that the upcall
receiver knows the original packet size. Potentially this will be used
by sFlow, where OVS translates sFlow config header=N to a sample action,
truncating packet to N byte in kernel datapath. Thus, only N bytes instead
of full-packet size is copied from kernel to userspace, saving the
kernel-to-userspace bandwidth.
Signed-off-by: William Tu <u9012063@gmail.com>
Cc: Pravin Shelar <pshelar@nicira.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When extracting an individual message from a received "bundle" buffer,
we just create a clone of the base buffer, and adjust it to point into
the right position of the linearized data area of the latter. This works
well for regular message reception, but during periods of extremely high
load it may happen that an extracted buffer, e.g, a connection probe, is
reversed and forwarded through an external interface while the preceding
extracted message is still unhandled. When this happens, the header or
data area of the preceding message will be partially overwritten by a
MAC header, leading to unpredicatable consequences, such as a link
reset.
We now fix this by ensuring that the msg_reverse() function never
returns a cloned buffer, and that the returned buffer always contains
sufficient valid head and tail room to be forwarded.
Reported-by: Erik Hugne <erik.hugne@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Every open of /proc/net/kcm leaks 16 bytes of memory as is reported by
kmemleak:
unreferenced object 0xffff88059c0e3458 (size 192):
comm "cat", pid 1401, jiffies 4294935742 (age 310.720s)
hex dump (first 32 bytes):
28 45 71 96 05 88 ff ff 00 10 00 00 00 00 00 00 (Eq.............
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<ffffffff8156a2de>] kmem_cache_alloc_trace+0x16e/0x230
[<ffffffff8162a479>] seq_open+0x79/0x1d0
[<ffffffffa0578510>] kcm_seq_open+0x0/0x30 [kcm]
[<ffffffff8162a479>] seq_open+0x79/0x1d0
[<ffffffff8162a8cf>] __seq_open_private+0x2f/0xa0
[<ffffffff81712548>] seq_open_net+0x38/0xa0
...
It is caused by a missing free in the ->release path. So fix it by
providing seq_release_net as the ->release method.
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Fixes: cd6e111bf5 (kcm: Add statistics and proc interfaces)
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Tom Herbert <tom@herbertland.com>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
The rxrpc_transport struct is now redundant, given that the rxrpc_peer
struct is now per peer port rather than per peer host, so get rid of it.
Service connection lists are transferred to the rxrpc_peer struct, as is
the conn_lock. Previous patches moved the client connection handling out
of the rxrpc_transport struct and discarded the connection bundling code.
Signed-off-by: David Howells <dhowells@redhat.com>
Kill off the concept of maintaining a bundle of connections to a particular
target service to increase the number of call slots available for any
beyond four for that service (there are four call slots per connection).
This will make cleaning up the connection handling code easier and
facilitate removal of the rxrpc_transport struct. Bundling can be
reintroduced later if necessary.
Signed-off-by: David Howells <dhowells@redhat.com>
Provide refcount helper functions for connections so that the code doesn't
touch local or connection usage counts directly.
Also make it such that local and peer put functions can take a NULL
pointer.
Signed-off-by: David Howells <dhowells@redhat.com>
Make rxrpc_send_packet() take a connection not a transport as part of the
phasing out of the rxrpc_transport struct.
Whilst we're at it, rename the function to rxrpc_send_data_packet() to
differentiate it from the other packet sending functions.
Signed-off-by: David Howells <dhowells@redhat.com>
Allocated rxrpc calls displayed in /proc/net/rxrpc_calls may in future be
on the proc list before they're connected or after they've been
disconnected - in which case they may not have a pointer to a connection
struct that can be used to get data from there.
Deal with this by using stuff from the call struct in preference where
possible and printing "no_connection" rather than a peer address if no
connection is assigned.
This change also has the added bonus that the service ID is now taken from
the call rather the connection which will allow per-call service upgrades
to be shown - something required for AuriStor server compatibility.
Signed-off-by: David Howells <dhowells@redhat.com>
Validate the net address given to rxrpc_kernel_begin_call() before using
it.
Whilst this should be mostly unnecessary for in-kernel users, it does clear
the tail of the address struct in case we want to hash or compare the whole
thing.
Signed-off-by: David Howells <dhowells@redhat.com>
Use the IDR facility to allocate client connection IDs on a machine-wide
basis so that each client connection has a unique identifier. When the
connection ID space wraps, we advance the epoch by 1, thereby effectively
having a 62-bit ID space. The IDR facility is then used to look up client
connections during incoming packet routing instead of using an rbtree
rooted on the transport.
This change allows for the removal of the transport in the future and also
means that client connections can be looked up directly in the data-ready
handler by connection ID.
The ID management code is placed in a new file, conn-client.c, to which all
the client connection-specific code will eventually move.
Note that the IDR tree gets very expensive on memory if the connection IDs
are widely scattered throughout the number space, so we shall need to
retire connections that have, say, an ID more than four times the maximum
number of client conns away from the current allocation point to try and
keep the IDs concentrated. We will also need to retire connections from an
old epoch.
Also note that, for the moment, a pointer to the transport has to be passed
through into the ID allocation function so that we can take a BH lock to
prevent a locking issue against in-BH lookup of client connections. This
will go away later when RCU is used for server connections also.
Signed-off-by: David Howells <dhowells@redhat.com>
rxrpc_connection_lock shouldn't be accessed as a BH-excluding lock. It's
only accessed in a few places and none of those are in BH-context.
rxrpc_transport::conn_lock, however, *is* a BH-excluding lock and should be
accessed so consistently.
Signed-off-by: David Howells <dhowells@redhat.com>
Pass a pointer to struct sk_buff rather than struct rxrpc_host_header to
functions so that they can in the future get at transport protocol parameters
rather than just RxRPC parameters.
Signed-off-by: David Howells <dhowells@redhat.com>
"Exclusive connections" are meant to be used for a single client call and
then scrapped. The idea is to limit the use of the negotiated security
context. The current code, however, isn't doing this: it is instead
restricting the socket to a single virtual connection and doing all the
calls over that.
This is changed such that the socket no longer maintains a special virtual
connection over which it will do all the calls, but rather gets a new one
each time a new exclusive call is made.
Further, using a socket option for this is a poor choice. It should be
done on sendmsg with a control message marker instead so that calls can be
marked exclusive individually. To that end, add RXRPC_EXCLUSIVE_CALL
which, if passed to sendmsg() as a control message element, will cause the
call to be done on an single-use connection.
The socket option (RXRPC_EXCLUSIVE_CONNECTION) still exists and, if set,
will override any lack of RXRPC_EXCLUSIVE_CALL being specified so that
programs using the setsockopt() will appear to work the same.
Signed-off-by: David Howells <dhowells@redhat.com>
Replace accesses of conn->trans->{local,peer} with
conn->params.{local,peer} thus making it easier for a future commit to
remove the rxrpc_transport struct.
This also reduces the number of memory accesses involved.
Signed-off-by: David Howells <dhowells@redhat.com>
Define and use a structure to hold connection parameters. This makes it
easier to pass multiple connection parameters around.
Define and use a structure to hold protocol information used to hash a
connection for lookup on incoming packet. Most of these fields will be
disposed of eventually, including the duplicate local pointer.
Whilst we're at it rename "proto" to "family" when referring to a protocol
family.
Signed-off-by: David Howells <dhowells@redhat.com>
Hashing the peer key was introduced for AF_INET, but gcc
warns about the rxrpc_peer_hash_key function returning uninitialized
data for any other value of srx->transport.family:
net/rxrpc/peer_object.c: In function 'rxrpc_peer_hash_key':
net/rxrpc/peer_object.c:57:15: error: 'p' may be used uninitialized in this function [-Werror=maybe-uninitialized]
Assuming that nothing else can be set here, this changes the
function to just return zero in case of an unknown address
family.
Fixes: be6e6707f6 ("rxrpc: Rework peer object handling to use hash table and RCU")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David Howells <dhowells@redhat.com>
rxrpc_lookup_peer_rcu() and rxrpc_lookup_peer() return NULL on error, never
error pointers, so IS_ERR() can't be used.
Fix three callers of those functions.
Fixes: be6e6707f6 ('rxrpc: Rework peer object handling to use hash table and RCU')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Howells <dhowells@redhat.com>
When receiving an ICMPv4 message containing extensions as
defined in RFC 4884, and translating it to ICMPv6 at SIT
or GRE tunnel, we need some extra manipulation in order
to properly forward the extensions.
This patch only takes care of Time Exceeded messages as they
are the ones that typically carry information from various
routers in a fabric during a traceroute session.
It also avoids complex skb logic if the data_len is not
a multiple of 8.
RFC states :
The "original datagram" field MUST contain at least 128 octets.
If the original datagram did not contain 128 octets, the
"original datagram" field MUST be zero padded to 128 octets.
In practice routers use 128 bytes of original datagram, not more.
Initial translation was added in commit ca15a078bd
("sit: generate icmpv6 error when receiving icmpv4 error")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Oussama Ghorbel <ghorbel@pivasoftware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipgre_err() can call ip6_err_gen_icmpv6_unreach() for proper
support of ipv4+gre+icmp+ipv6+... frames, used for example
by traceroute/mtr.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For better traceroute/mtr support for SIT and GRE tunnels,
we translate IPV4 ICMP ICMP_TIME_EXCEEDED to ICMPV6_TIME_EXCEED
We also have to translate the IPv4 source IP address of ICMP
message to IPv6 v4mapped.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We want to use this helper from GRE as well, so this is
the time to move it in net/ipv6/icmp.c
Also add a @nhs parameter, since SIT and GRE have different
values for the header(s) to skip.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SIT or GRE tunnels might want to translate an IPV4 address
into a v4mapped one when translating ICMP to ICMPv6.
This patch adds the parameter to icmp6_send() but
does not change icmpv6_send() signature.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix coding style issues in the following files:
ib_cm.c: add space
loop.c: convert spaces to tabs
sysctl.c: add space
tcp.h: convert spaces to tabs
tcp_connect.c:remove extra indentation in switch statement
tcp_recv.c: convert spaces to tabs
tcp_send.c: convert spaces to tabs
transport.c: move brace up one line on for statement
Signed-off-by: Joshua Houghton <josh@awful.name>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A socket connection made in ax.25 is not closed when session is
completed. The heartbeat timer is stopped prematurely and this is
where the socket gets closed. Allow heatbeat timer to run to close
socket. Symptom occurs in kernels >= 4.2.0
Originally sent 6/15/2016. Resend with distribution list matching
scripts/maintainer.pl output.
Signed-off-by: Basil Gunn <basil@pacabunga.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The state of the rds_connection after rds_tcp_reset_callbacks() would
be RDS_CONN_RESETTING and this is the value that should be passed
by rds_tcp_accept_one() to rds_connect_path_complete() to transition
the socket to RDS_CONN_UP.
Fixes: b5c21c0947c1 ("RDS: TCP: fix race windows in send-path quiescence
by rds_tcp_accept_one()")
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes the following sparse warnings:
net/rds/tcp.c:59:5: warning:
symbol 'rds_tcp_min_sndbuf' was not declared. Should it be static?
net/rds/tcp.c:60:5: warning:
symbol 'rds_tcp_min_rcvbuf' was not declared. Should it be static?
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQEcBAABCgAGBQJXZAIqAAoJED07qiWsqSVqW0YH/RMgC6CDlJUtHr7+B8YLpi0e
BZOzAHH7mQdP+Z2kIXZp8Dnziq5G8heWNjRwTMPLHvOel+oms+0ZK6VY/kJYArdb
ViGGgi/gQ334JGqJYi2utkyIIIRH7ZxwcblF1aaaFVfFy7tZMVuppIWVzR/V0Gje
/5FftT1f04/6iumEq4es+Jb0OC9azoebSs1DUZTIvYOz3XrnCbB1FdmDN+a3xZKC
Qyav6QVnp/m2InzGSN+Kd/W++EP6YckdBp/++2hizsOvSIOfe8GRqkc0r7fLbfZ8
rucDJNi+GrLat4wNza4t3FKf3rBenBFzii14OEUTE0JgpY90fGUF3n/i0duaB/k=
=90Ns
-----END PGP SIGNATURE-----
Merge tag 'linux-can-next-for-4.8-20160617' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next
Marc Kleine-Budde says:
====================
pull-request: can-next 2016-06-17
this is a pull request of 14 patches for net-next/master.
Geert Uytterhoeven contributes a patch that adds a file patterns for
CAN device tree bindings to MAINTAINERS. A patch by Alexander Aring
fixes warnings when building without proc support. A patch by me
improves the sample point calculation. Marek Vasut's patch converts
the slcan driver to use CAN_MTU. A patch by William Breathitt Gray
converts the tscan1 driver to use module_isa_driver.
Two patches by Maximilian Schneider for the gs_usb driver fix coding
style and add support for set_phys_id callback. 5 patches by Oliver
Hartkopp add support for CANFD to the bcm. And finally two patches
by Ramesh Shanmugasundaram, which add support for the rcar_canfd
driver.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
We sometimes observe a 'deadly embrace' type deadlock occurring
between mutually connected sockets on the same node. This happens
when the one-hour peer supervision timers happen to expire
simultaneously in both sockets.
The scenario is as follows:
CPU 1: CPU 2:
-------- --------
tipc_sk_timeout(sk1) tipc_sk_timeout(sk2)
lock(sk1.slock) lock(sk2.slock)
msg_create(probe) msg_create(probe)
unlock(sk1.slock) unlock(sk2.slock)
tipc_node_xmit_skb() tipc_node_xmit_skb()
tipc_node_xmit() tipc_node_xmit()
tipc_sk_rcv(sk2) tipc_sk_rcv(sk1)
lock(sk2.slock) lock((sk1.slock)
filter_rcv() filter_rcv()
tipc_sk_proto_rcv() tipc_sk_proto_rcv()
msg_create(probe_rsp) msg_create(probe_rsp)
tipc_sk_respond() tipc_sk_respond()
tipc_node_xmit_skb() tipc_node_xmit_skb()
tipc_node_xmit() tipc_node_xmit()
tipc_sk_rcv(sk1) tipc_sk_rcv(sk2)
lock((sk1.slock) lock((sk2.slock)
===> DEADLOCK ===> DEADLOCK
Further analysis reveals that there are three different locations in the
socket code where tipc_sk_respond() is called within the context of the
socket lock, with ensuing risk of similar deadlocks.
We now solve this by passing a buffer queue along with all upcalls where
sk_lock.slock may potentially be held. Response or rejected message
buffers are accumulated into this queue instead of being sent out
directly, and only sent once we know we are safely outside the slock
context.
Reported-by: GUNA <gbalasun@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
"up_map" is a u64 type but we're not using the high 32 bits.
Fixes: 35c55c9877 ('tipc: add neighbor monitoring framework')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv6 version of 3f2fb9a834 ("net: l3mdev: address selection should only
consider devices in L3 domain") and the follow up commit, a17b693cdd876
("net: l3mdev: prefer VRF master for source address selection").
That is, if outbound device is given then the address preference order
is an address from that device, an address from the master device if it
is enslaved, and then an address from a device in the same L3 domain.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv6 source address selection needs to consider the real egress route.
Similar to IPv4 implement a get_saddr6 method which is called if
source address has not been set. The get_saddr6 method does a full
lookup which means pulling a route from the VRF FIB table and properly
considering linklocal/multicast destination addresses. Lookup failures
(eg., unreachable) then cause the source address selection to fail
which gets propagated back to the caller.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
VRF driver needs access to ip6_route_get_saddr code. Since it does
little beyond ipv6_dev_get_saddr and ipv6_dev_get_saddr is already
exported for modules move ip6_route_get_saddr to the header as an
inline.
Code move only; no functional change.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that we have all the drivers using udp_tunnel_get_rx_ports,
ndo_add_udp_enc_rx_port, and ndo_del_udp_enc_rx_port we can drop the
function calls that were specific to VXLAN and GENEVE.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch merges the notifiers for VXLAN and GENEVE into a single UDP
tunnel notifier. The idea is that we will want to only have to make one
notifier call to receive the list of ports for VXLAN and GENEVE tunnels
that need to be offloaded.
In addition we add a new set of ndo functions named ndo_udp_tunnel_add and
ndo_udp_tunnel_del that are meant to allow us to track the tunnel meta-data
such as port and address family as tunnels are added and removed. The
tunnel meta-data is now transported in a structure named udp_tunnel_info
which for now carries the type, address family, and port number. In the
future this could be updated so that we can include a tuple of values
including things such as the destination IP address and other fields.
I also ended up going with a naming scheme that consisted of using the
prefix udp_tunnel on function names. I applied this to the notifier and
ndo ops as well so that it hopefully points to the fact that these are
primarily used in the udp_tunnel functions.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch merges the GENEVE and VXLAN code so that both functions pass
through a shared code path. This way we can start the effort of using a
single function on the network device drivers to handle both of these
tunnel types.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for your net tree,
they are rather small patches but fixing several outstanding bugs in
nf_conntrack and nf_tables, as well as minor problems with missing
SYNPROXY header uapi installation:
1) Oneliner not to leak conntrack kmemcache on module removal, this
problem was introduced in the previous merge window, patch from
Florian Westphal.
2) Two fixes for insufficient ruleset loop validation, one due to
incorrect flag check in nf_tables_bind_set() and another related to
silly wrong generation mask logic from the walk path, from Liping
Zhang.
3) Fix double-free of anonymous sets on error, this fix simplifies the
code to let the abort path take care of releasing the set object,
also from Liping Zhang.
4) The introduction of helper function for transactions broke the skip
inactive rules logic from the nft_do_chain(), again from Liping
Zhang.
5) Two patches to install uapi xt_SYNPROXY.h header and calm down
kbuild robot due to missing #include <linux/types.h>.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The programming API of the CAN_BCM depends on struct can_frame which is
given as array directly behind the bcm_msg_head structure. To follow this
schema for the CAN FD frames a new flag 'CAN_FD_FRAME' in the bcm_msg_head
flags indicates that the concatenated CAN frame structures behind the
bcm_msg_head are defined as struct canfd_frame.
This patch adds the support to handle CAN and CAN FD frames on a per BCM-op
base. Main changes:
- generally use struct canfd_frames instead if struct can_frames
- use canfd_frame.flags instead of can_frame.can_dlc for private BCM flags
- make all CAN frame sizes depending on the new CAN_FD_FRAME flags
- separate between CAN and CAN FD when sending/receiving frames
Due to the dependence of the CAN_FD_FRAME flag the former binary interface
for classic CAN frames remains stable.
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
can_frame is the name of the struct can_frame which is not meant in
the corrected comments.
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
When building can subsystem with CONFIG_PROC_FS=n I detected some unused
variables warning by using proc functions. In CAN the proc handling is
nicely placed in one object file. This patch adds simple add a
dependency on CONFIG_PROC_FS for CAN's proc.o file and corresponding
static inline no-op functions.
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
[mkl: provide static inline noops instead of using #ifdefs]
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
When we check for RTM_DELTFILTER, we should also reject the request
for deleting all filters under a given parent when TCA_KIND attribute
is present. If present, it's currently just ignored but there's also
no point to let it pass in the first place either since this doesn't
have any meaning with wild-card removal.
Fixes: ea7f8277f9 ("net, cls: allow for deleting all filters for given parent")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Modern C standards expect the '__inline__' keyword to come before the return
type in a declaration, and we get a couple of warnings for this with "make W=1"
in the xfrm{4,6}_policy.c files:
net/ipv6/xfrm6_policy.c:369:1: error: 'inline' is not at beginning of declaration [-Werror=old-style-declaration]
static int inline xfrm6_net_sysctl_init(struct net *net)
net/ipv6/xfrm6_policy.c:374:1: error: 'inline' is not at beginning of declaration [-Werror=old-style-declaration]
static void inline xfrm6_net_sysctl_exit(struct net *net)
net/ipv4/xfrm4_policy.c:339:1: error: 'inline' is not at beginning of declaration [-Werror=old-style-declaration]
static int inline xfrm4_net_sysctl_init(struct net *net)
net/ipv4/xfrm4_policy.c:344:1: error: 'inline' is not at beginning of declaration [-Werror=old-style-declaration]
static void inline xfrm4_net_sysctl_exit(struct net *net)
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
to the big nfs4_lock_state removal around 3.17 (but that were also
probably hard to reproduce before client changes in 3.20 allowed the
client to perform parallel opens).
Also fix a 4.1 backchannel crash due to rpc multipath changes in 4.6.
Trond acked the client-side rpc fixes going through my tree.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXYsiKAAoJECebzXlCjuG+PK8P/jGBS+h7Zf4GWOOsWN5cbEs/
8VTn83KXFp/feFhGikUIcAZQJRynDK+tD9Vh2FznC2zKDTLFPS0mAvL3tZyQhUO2
nEWaCUOFR+sB3aTPlMGwxbGc7NHNQg1hKqKgqcLEEqtozxhFQye3WW0MZNfFCiUZ
qpq2tK1OGGhJVIp7wWSa8+B2nGFMuasPaGM2OVJrebip49yTG/tT3rwKxKMoB8kS
i8BwNejoP1KRD6LqvpgdV1ESzkdyokDxKXCrdY/j2lMdp2YRe+cWmX239ojjvm8G
n9Ow8DYCefuiKiF6iCLZfxpX8dcmVJvT6g+k+9V63A4YCyuGhy/CneA3MO4QyLhq
yfe2zviJ2kZVz+1Ih3v9kD7ZkyK1hjrxXx/VPrI5CBIXE5eVXin2ZDvTCSoV491h
g1zscPc9Thgk6gKXsvkaVOXxLHBoUzXeSRbNqVXXZfjl+s4TXLNJ0lcaBYkzh74/
SypiFeNHjsjNpJYz5GptlbMUpaEoeyH0Y+OiH8d5Jf8hCcQ+CLjKgKSuCH5zrypt
Lx3U5QWHTT3IXH4QS/njcTSfSDu7BUip4RTLzw6C/ZJf7hd6SS4Xv72J6ZmeDSmg
146MpAYty8HB04KQWpYx0DGI7UEPlubfRHSF9XzsSitbRtNGr6xvIug8fkKBlXDB
aHtr+/gI7UvrmnXnlGdD
=aNkl
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.7-1' of git://linux-nfs.org/~bfields/linux
Pull nfsd bugfixes from Bruce Fields:
"Oleg Drokin found and fixed races in the nfsd4 state code that go back
to the big nfs4_lock_state removal around 3.17 (but that were also
probably hard to reproduce before client changes in 3.20 allowed the
client to perform parallel opens).
Also fix a 4.1 backchannel crash due to rpc multipath changes in 4.6.
Trond acked the client-side rpc fixes going through my tree"
* tag 'nfsd-4.7-1' of git://linux-nfs.org/~bfields/linux:
nfsd: Make init_open_stateid() a bit more whole
nfsd: Extend the mutex holding region around in nfsd4_process_open2()
nfsd: Always lock state exclusively.
rpc: share one xps between all backchannels
nfsd4/rpc: move backchannel create logic into rpc code
SUNRPC: fix xprt leak on xps allocation failure
nfsd: Fix NFSD_MDS_PR_KEY on 32-bit by adding ULL postfix
Pull overlayfs fixes from Miklos Szeredi:
"This contains two regression fixes: one for the xattr API update and
one for using the mounter's creds in file creation in overlayfs.
There's also a fix for a bug in handling hard linked AF_UNIX sockets
that's been there from day one. This fix is overlayfs only despite
the fact that it touches code outside the overlay filesystem: d_real()
is an identity function for all except overlay dentries"
* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: fix uid/gid when creating over whiteout
ovl: xattr filter fix
af_unix: fix hard linked sockets on overlay
vfs: add d_real_inode() helper
The space is missing after ',', and this will introduce much more
noise when checking patch around.
Signed-off-by: Wei Tang <tangwei@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes the checkpatch.pl error to dev.c:
ERROR: do not initialise statics to 0
Signed-off-by: Wei Tang <tangwei@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This appears to be necessary and sufficient to provide
MPLS in GRE (RFC4023) support.
This can be used by establishing an ipgre tunnel device
and then routing MPLS over it.
The following example will forward MPLS frames received with an outermost
MPLS label 100 over tun1, a GRE tunnel. The forwarded packet will have the
outermost MPLS LSE removed and two new LSEs added with labels 200
(outermost) and 300 (next).
ip link add name tun1 type gre remote 10.0.99.193 local 10.0.99.192 ttl 225
ip link set up dev tun1
ip addr add 10.0.98.192/24 dev tun1
ip route sh
echo 1 > /proc/sys/net/mpls/conf/eth0/input
echo 101 > /proc/sys/net/mpls/platform_labels
ip -f mpls route add 100 as 200/300 via inet 10.0.98.193
ip -f mpls route sh
Also remove unnecessary braces.
Reviewed-by: Dinan Gunawardena <dinan.gunawardena@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since 32b8a8e59c ("sit: add IPv4 over IPv4 support")
ipip6_err() may be called for packets whose IP protocol is
IPPROTO_IPIP as well as those whose IP protocol is IPPROTO_IPV6.
In the case of IPPROTO_IPIP packets the correct protocol value is not
passed to ipv4_update_pmtu() or ipv4_redirect().
This patch resolves this problem by using the IP protocol of the packet
rather than a hard-coded value. This appears to be consistent
with the usage of the protocol of a packet by icmp_socket_deliver()
the caller of ipip6_err().
I was able to exercise the redirect case by using a setup where an ICMP
redirect was received for the destination of the encapsulated packet.
However, it appears that although incorrect the protocol field is not used
in this case and thus no problem manifests. On inspection it does not
appear that a problem will manifest in the fragmentation needed/update pmtu
case either.
In short I believe this is a cosmetic fix. None the less, the use of
IPPROTO_IPV6 seems wrong and confusing.
Reviewed-by: Dinan Gunawardena <dinan.gunawardena@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit d46e416c11 ("sctp: sctp should change socket state when
shutdown is received") may set sk_state CLOSING in sctp_sock_migrate,
but inet_accept doesn't allow the sk_state other than ESTABLISHED/
CLOSED for sctp. So we will change sk_state to CLOSED, instead of
CLOSING, as actually sk is closed already there.
Fixes: d46e416c11 ("sctp: sctp should change socket state when shutdown is received")
Reported-by: Ye Xiaolong <xiaolong.ye@intel.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ctx structure passed into bpf programs is different depending on bpf
program type. The verifier incorrectly marked ctx->data and ctx->data_end
access based on ctx offset only. That caused loads in tracing programs
int bpf_prog(struct pt_regs *ctx) { .. ctx->ax .. }
to be incorrectly marked as PTR_TO_PACKET which later caused verifier
to reject the program that was actually valid in tracing context.
Fix this by doing program type specific matching of ctx offsets.
Fixes: 969bf05eb3 ("bpf: direct packet access")
Reported-by: Sasha Goldshtein <goldshtn@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIVAwUAV2F33fSw1s6N8H32AQLahQ//XgRCyVXPhaOhkiJoHW/xOlWwSDKJg5y2
c8HEThJFO0rajne83b1mX9ijryVvGw7R/2nVExTrOTvDwf/nZm/v5LBz85a/EYPE
r47yTc8OienUFJBhAFzm346mkpACHUFJ+ihX5tdUpi5SqZ4IEMplfi4NezuapBFl
sLl0aucMQKwGb5Q/nWBu+ku6rg5qPBUkhgQ33gAlcqHvcj8oRc232Wl0LGIp6oIc
1ylUj/hHY64YseFHJE4CFHojYxyg7Gyp/8YNcffIdbmzu1B6L6myll38FlMmkPsc
uAgS5MzWMRJ0qUeYLJJuTH0lRZxPV6KWaBPCHL9RwTOmbYT7fgBPy41l4kORBI84
zRBU86oQLV5oNHTubb/CnpY2xH/pCc73m2d2c6HDsDHB80UL4GQhOpVKc9f/eQbA
/QCcl98wSoyQ8No7J9Px+WH7BKijytadj7M2RILQfjSuEKWmlAah3zs/398KdZBj
tMqq5prEevluIq/DINnNaRKa+726v/VQmhuBazIs9FYhcYaAG6oZzqbZh4M8NzAa
/vlXbkSkPDgo/PF7TGIRTYsNftMxNN56tG5Lv8ti0cU7gbPPn3IC61meKrWE3BbR
S1a2rLigS3+iULYDsKDbSg8R2MlBxQrMkzxDXF7mGSAu3qPeArFoAD7ySxVcjScX
lmDzMCaDdnk=
=HOTh
-----END PGP SIGNATURE-----
Merge tag 'rxrpc-rewrite-20160615' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc: Rework endpoint record handling
Here's the next part of the AF_RXRPC rewrite. In this set I rework
endpoint record handling. There are two types of endpoint record, local
and peer. The local endpoint record is used as an anchor for the transport
socket that AF_RXRPC uses (at the moment a UDP socket). Local endpoints
can be shared between AF_RXRPC sockets under certain restricted
circumstances.
The peer endpoint is a record of the remote end. It is (or will be) used
to keep track MTU and RTT values and, with these changes, is used to find
the call(s) to abort when a network error occurs.
The following significant changes are made:
(1) The local endpoint event handling code is split out into its own file.
(2) The local endpoint list bottom half-excluding spinlock is removed as
things are arranged such that sk_user_data will not change whilst the
transport socket callbacks are in progress.
(3) Local endpoints can now only be shared if they have the same transport
address (as before) and have a local service ID of 0 (ie. they're not
listening for incoming calls). This prevents callbacks from a server
to one process being picked up by another process.
(4) Local endpoint destruction is now accomplished by the same work item
as processes events, meaning that the destructor doesn't need to wait
for the event processor.
(5) Peer endpoints are now held in a hash table rather than a flat list.
(6) Peer endpoints are now destroyed by RCU rather than by work item.
(7) Peer endpoints are now differentiated by local endpoint and remote
transport port in addition to remote transport address and transport
type and family.
This means that a firewall that excludes access between a particular
local port and remote port won't cause calls to be aborted that use a
different port pair.
(8) Error report handling now no longer assumes that the source is always
an IPv4 ICMP message from a UDP port and has assumptions that an ICMP
message comes from an IPv4 socket removed. At some point IPv6 support
will be added.
(9) Peer endpoints rather than local endpoints are now the anchor point
for distributing network error reports.
(10) Both types of endpoint records are now disposed of as soon as all
references to them are gone. There is less hanging around and once
their usage counts hit zero, records can no longer be resurrected.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
1) gre_parse_header() can be called from gre_err()
At this point transport header points to ICMP header, not the inner
header.
2) We can not really change transport header as ipgre_err() will later
assume transport header still points to ICMP header (using icmp_hdr())
3) pskb_may_pull() logic in gre_parse_header() really works
if we are interested at zone pointed by skb->data
4) As Jiri explained in commit b7f8fe251e ("gre: do not pull header in
ICMP error processing") we should not pull headers in error handler.
So this fix :
A) changes gre_parse_header() to use skb->data instead of
skb_transport_header()
B) Adds a nhs parameter to gre_parse_header() so that we can skip the
not pulled IP header from error path.
This offset is 0 for normal receive path.
C) remove obsolete IPV6 includes
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <tom@herbertland.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/tipc/link.c: In function ‘tipc_link_timeout’:
net/tipc/link.c:744:28: warning: ‘mtyp’ may be used uninitialized in this function [-Wuninitialized]
Fixes: 42b18f605f ("tipc: refactor function tipc_link_timeout()")
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The algorithm for checksum neutral mapping is incorrect. This problem
was being hidden since we were previously always performing checksum
offload on the translated addresses and only with IPv6 HW csum.
Enabling an ILA router shows the issue.
Corrected algorithm:
old_loc is the original locator in the packet, new_loc is the value
to overwrite with and is found in the lookup table. old_flag is
the old flag value (zero of CSUM_NEUTRAL_FLAG) and new_flag is
then (old_flag ^ CSUM_NEUTRAL_FLAG) & CSUM_NEUTRAL_FLAG.
Need SUM(new_id + new_flag + diff) == SUM(old_id + old_flag) for
checksum neutral translation.
Solving for diff gives:
diff = (old_id - new_id) + (old_flag - new_flag)
compute_csum_diff8(new_id, old_id) gives old_id - new_id
If old_flag is set
old_flag - new_flag = old_flag = CSUM_NEUTRAL_FLAG
Else
old_flag - new_flag = -new_flag = ~CSUM_NEUTRAL_FLAG
Tested:
- Implemented a user space program that creates random addresses
and random locators to overwrite. Compares the checksum over
the address before and after translation (must always be equal)
- Enabled ILA router and showed proper operation.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the presence of firewalls which improperly block ICMP Unreachable
(including Fragmentation Required) messages, Path MTU Discovery is
prevented from working.
A workaround is to handle IPv4 payloads opaquely, ignoring the DF bit--as
is done for other payloads like AppleTalk--and doing transparent
fragmentation and reassembly.
Redux includes the enforcement of mutual exclusion between this feature
and Path MTU Discovery as suggested by Alexander Duyck.
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Philip Prindeville <philipp@redfish-solutions.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds necessary handling for use the short address for
802.15.4 6lowpan. It contains support for IPHC address compression
and new matching algorithmn to decide which link layer address will be
used for 802.15.4 frame.
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case of sending RA messages we need some way to get the short address
from an 802.15.4 6LoWPAN interface. This patch will add a temporary
debugfs entry for experimental userspace api.
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduce different 6lowpan handling for receive and transmit
NS/NA messages for the ipv6 neighbour discovery. The first use-case is
for supporting 802.15.4 short addresses inside the option fields and
handling for RFC6775 6CO option field as userspace option.
Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch exports some neighbour discovery functions which can be used
by 6lowpan neighbour discovery ops functionality then.
Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces neighbour discovery ops callback structure. The
idea is to separate the handling for 6LoWPAN into the 6lowpan module.
These callback offers 6lowpan different handling, such as 802.15.4 short
address handling or RFC6775 (Neighbor Discovery Optimization for IPv6
over 6LoWPANs).
Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch moves the functionality to add a RA PIO prefix generated
address in an own function. This move prepares to add a hook for
adding a second address for a second link-layer address. E.g. short
address for 802.15.4 6LoWPAN.
Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds __ndisc_fill_addr_option as low-level function for
ndisc_fill_addr_option which doesn't depend on net_device parameter.
Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since we use exported function from ipv6 kernel module we don't need to
request the module anymore to have ipv6 functionality.
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the autoconfiguration if a valid 802.15.4 short address
is available for 802.15.4 6LoWPAN interfaces.
Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch will introduce a 6lowpan neighbour private data. Like the
interface private data we handle private data for generic 6lowpan and
for link-layer specific 6lowpan.
The current first use case if to save the short address for a 802.15.4
6lowpan neighbour.
Cc: David S. Miller <davem@davemloft.net>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
saw a debug splat:
net/include/net/sch_generic.h:287 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 1, debug_locks = 0
2 locks held by kworker/2:1/710:
#0: ("events"){.+.+.+}, at: [<ffffffff8106ca1d>]
#1: ((&q->work)){+.+...}, at: [<ffffffff8106ca1d>] process_one_work+0x14d/0x690
Workqueue: events htb_work_func
Call Trace:
[<ffffffff812dc763>] dump_stack+0x85/0xc2
[<ffffffff8109fee7>] lockdep_rcu_suspicious+0xe7/0x120
[<ffffffff814ced47>] htb_work_func+0x67/0x70
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sfq_reset() can use rtnl_kfree_skbs() instead of kfree_skb()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
pie_change() can use rtnl_qdisc_drop() to benefit from
deferred freeing.
pie_reset() is already using qdisc_reset_queue()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rtnl_kfree_skbs() can be used in tfifo_reset()
It would be nice if we could iterate through rb tree instead
of removing one skb at a time, and build a single skb chain.
But this is left for a future patch.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Both htb_reset() and htb_destroy() can use __qdisc_reset_queue()
instead of __skb_queue_purge() to defer skb freeing of internal
queues.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Both hhf_reset() and hhf_change() can use rtnl_kfree_skbs()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Both fq_codel_change() and fq_codel_reset() can use rtnl_kfree_skbs()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Both fq_change() and fq_reset() can use rtnl_kfree_skbs()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
codel_change() can use rtnl_qdisc_drop()
to defer expensive skb freeing after locks are released.
codel_reset() already has support for deferred skb freeing
because it uses qdisc_reset_queue()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
choke_reset() and choke_change() can use rtnl_qdisc_drop()
to defer expensive skb freeing after locks are released.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
qdisc are changed under RTNL protection and often
while blocking BH and root qdisc spinlock.
When lots of skbs need to be dropped, we free
them under these locks causing TX/RX freezes,
and more generally latency spikes.
This commit adds rtnl_kfree_skbs(), used to queue
skbs for deferred freeing.
Actual freeing happens right after RTNL is released,
with appropriate scheduling points.
rtnl_qdisc_drop() can also be used in place
of disc_drop() when RTNL is held.
qdisc_reset_queue() and __qdisc_reset_queue() get
the new behavior, so standard qdiscs like pfifo, pfifo_fast...
have their ->reset() method automatically handled.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TIPC based clusters are by default set up with full-mesh link
connectivity between all nodes. Those links are expected to provide
a short failure detection time, by default set to 1500 ms. Because
of this, the background load for neighbor monitoring in an N-node
cluster increases with a factor N on each node, while the overall
monitoring traffic through the network infrastructure increases at
a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
scale well beyond ~100 nodes unless we significantly increase failure
discovery tolerance.
This commit introduces a framework and an algorithm that drastically
reduces this background load, while basically maintaining the original
failure detection times across the whole cluster. Using this algorithm,
background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
now have to actively monitor 38 neighbors in a 400-node cluster, instead
of as before 399.
This "Overlapping Ring Supervision Algorithm" is completely distributed
and employs no centralized or coordinated state. It goes as follows:
- Each node makes up a linearly ascending, circular list of all its N
known neighbors, based on their TIPC node identity. This algorithm
must be the same on all nodes.
- The node then selects the next M = sqrt(N) - 1 nodes downstream from
itself in the list, and chooses to actively monitor those. This is
called its "local monitoring domain".
- It creates a domain record describing the monitoring domain, and
piggy-backs this in the data area of all neighbor monitoring messages
(LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
the cluster eventually (default within 400 ms) will learn about
its monitoring domain.
- Whenever a node discovers a change in its local domain, e.g., a node
has been added or has gone down, it creates and sends out a new
version of its node record to inform all neighbors about the change.
- A node receiving a domain record from anybody outside its local domain
matches this against its own list (which may not look the same), and
chooses to not actively monitor those members of the received domain
record that are also present in its own list. Instead, it relies on
indications from the direct monitoring nodes if an indirectly
monitored node has gone up or down. If a node is indicated lost, the
receiving node temporarily activates its own direct monitoring towards
that node in order to confirm, or not, that it is actually gone.
- Since each node is actively monitoring sqrt(N) downstream neighbors,
each node is also actively monitored by the same number of upstream
neighbors. This means that all non-direct monitoring nodes normally
will receive sqrt(N) indications that a node is gone.
- A major drawback with ring monitoring is how it handles failures that
cause massive network partitionings. If both a lost node and all its
direct monitoring neighbors are inside the lost partition, the nodes in
the remaining partition will never receive indications about the loss.
To overcome this, each node also chooses to actively monitor some
nodes outside its local domain. Those nodes are called remote domain
"heads", and are selected in such a way that no node in the cluster
will be more than two direct monitoring hops away. Because of this,
each node, apart from monitoring the member of its local domain, will
also typically monitor sqrt(N) remote head nodes.
- As an optimization, local list status, domain status and domain
records are marked with a generation number. This saves senders from
unnecessarily conveying unaltered domain records, and receivers from
performing unneeded re-adaptations of their node monitoring list, such
as re-assigning domain heads.
- As a measure of caution we have added the possibility to disable the
new algorithm through configuration. We do this by keeping a threshold
value for the cluster size; a cluster that grows beyond this value
will switch from full-mesh to ring monitoring, and vice versa when
it shrinks below the value. This means that if the threshold is set to
a value larger than any anticipated cluster size (default size is 32)
the new algorithm is effectively disabled. A patch set for altering the
threshold value and for listing the table contents will follow shortly.
- This change is fully backwards compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This refers to commands to direct action access as follows:
sudo tc actions add action drop index 12
sudo tc actions add action pipe index 10
And then dumping them like so:
sudo tc actions ls action gact
iproute2 worked because it depended on absence of TCA_ACT_TAB TLV
as end of message.
This fix has been tested with iproute2 and is backward compatible.
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv6 multicast and link-local addresses require special handling by the
VRF driver:
1. Rather than using the VRF device index and full FIB lookups,
packets to/from these addresses should use direct FIB lookups based on
the VRF device table.
2. fail sends/receives on a VRF device to/from a multicast address
(e.g, make ping6 ff02::1%<vrf> fail)
3. move the setting of the flow oif to the first dst lookup and revert
the change in icmpv6_echo_reply made in ca254490c8 ("net: Add VRF
support to IPv6 stack"). Linklocal/mcast addresses require use of the
skb->dev.
With this change connections into and out of a VRF enslaved device work
for multicast and link-local addresses work (icmp, tcp, and udp)
e.g.,
1. packets into VM with VRF config:
ping6 -c3 fe80::e0:f9ff:fe1c:b974%br1
ping6 -c3 ff02::1%br1
ssh -6 fe80::e0:f9ff:fe1c:b974%br1
2. packets going out a VRF enslaved device:
ping6 -c3 fe80::18f8:83ff:fe4b:7a2e%eth1
ping6 -c3 ff02::1%eth1
ssh -6 root@fe80::18f8:83ff:fe4b:7a2e%eth1
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
L3 master devices are virtual devices similar to the loopback
device. Link local and multicast routes for these devices do
not make sense. The ipv6 addrconf code already skips adding a
linklocal address; do the same for the mcast route.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow drivers to pass flow arg to functions where the arg is not const
and allow the driver to make updates as needed (eg., setting oif).
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
And avoid calling tcf_hash_check() twice.
Fixes: a57f19d30b ("net sched: ipt action fix late binding")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now prio_init() can return -ENOMEM, it also has to make sure
any allocated qdiscs are freed, since the caller (qdisc_create()) wont
call ->destroy() handler for us.
More generally, we want a transactional behavior for "tc qdisc
change ...", so prio_tune() should not make modifications if
any error is returned.
It means that we must validate parameters and allocate missing qdisc(s)
before taking root qdisc lock exactly once, to not leave the prio qdisc
in an intermediate state.
Fixes: cbdf451164 ("net_sched: prio: properly report out of memory errors")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When an inbound message is bigger than a page, allocate a paged SKB,
and subsequently use IUCV receive primitive with IPBUFLST flag.
This relaxes the pressure to allocate big contiguous kernel buffers.
Signed-off-by: Eugene Crosser <Eugene.Crosser@ru.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before introducing paged skbs in the receive path, get rid of the
function `iucv_fragment_skb()` that replaces one large linear skb
with several smaller linear skbs.
Signed-off-by: Eugene Crosser <Eugene.Crosser@ru.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When an outbound message is bigger than a page, allocate and fill
a paged SKB, and subsequently use IUCV send primitive with IPBUFLST
flag. This relaxes the pressure to allocate big contiguous kernel
buffers.
Signed-off-by: Eugene Crosser <Eugene.Crosser@ru.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rework the local RxRPC endpoint management.
Local endpoint objects are maintained in a flat list as before. This
should be okay as there shouldn't be more than one per open AF_RXRPC socket
(there can be fewer as local endpoints can be shared if their local service
ID is 0 and they share the same local transport parameters).
Changes:
(1) Local endpoints may now only be shared if they have local service ID 0
(ie. they're not being used for listening).
This prevents a scenario where process A is listening of the Cache
Manager port and process B contacts a fileserver - which may then
attempt to send CM requests back to B. But if A and B are sharing a
local endpoint, A will get the CM requests meant for B.
(2) We use a mutex to handle lookups and don't provide RCU-only lookups
since we only expect to access the list when opening a socket or
destroying an endpoint.
The local endpoint object is pointed to by the transport socket's
sk_user_data for the life of the transport socket - allowing us to
refer to it directly from the sk_data_ready and sk_error_report
callbacks.
(3) atomic_inc_not_zero() now exists and can be used to only share a local
endpoint if the last reference hasn't yet gone.
(4) We can remove rxrpc_local_lock - a spinlock that had to be taken with
BH processing disabled given that we assume sk_user_data won't change
under us.
(5) The transport socket is shut down before we clear the sk_user_data
pointer so that we can be sure that the transport socket's callbacks
won't be invoked once the RCU destruction is scheduled.
(6) Local endpoints have a work item that handles both destruction and
event processing. The means that destruction doesn't then need to
wait for event processing. The event queues can then be cleared after
the transport socket is shut down.
(7) Local endpoints are no longer available for resurrection beyond the
life of the sockets that had them open. As soon as their last ref
goes, they are scheduled for destruction and may not have their usage
count moved from 0.
Signed-off-by: David Howells <dhowells@redhat.com>
Separate local endpoint event handling out into its own file preparatory to
overhauling the object management aspect (which remains in the original
file).
Signed-off-by: David Howells <dhowells@redhat.com>
The spec allows backchannels for multiple clients to share the same tcp
connection. When that happens, we need to use the same xprt for all of
them. Similarly, we need the same xps.
This fixes list corruption introduced by the multipath code.
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Acked-by: Trond Myklebust <trondmy@primarydata.com>
Also simplify the logic a bit.
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Acked-by: Trond Myklebust <trondmy@primarydata.com>
Callers of rpc_create_xprt expect it to put the xprt on success and
failure.
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Acked-by: Trond Myklebust <trondmy@primarydata.com>
nft_genmask_cur has already done left-shift operator on the gencursor,
so there's no need to do left-shift operator on it again.
Fixes: ea4bd995b0 ("netfilter: nf_tables: add transaction helper functions")
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When we add a nft rule like follows:
# nft add rule filter test tcp dport vmap {1: jump test}
-ELOOP error will be returned, and the anonymous set will be
destroyed.
But after that, nf_tables_abort will also try to remove the
element and destroy the set, which was already destroyed and
freed.
If we add a nft wrong rule, nft_tables_abort will do the cleanup
work rightly, so nf_tables_set_destroy call here is redundant and
wrong, remove it.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Liping Zhang says:
"Users may add such a wrong nft rules successfully, which will cause an
endless jump loop:
# nft add rule filter test tcp dport vmap {1: jump test}
This is because before we commit, the element in the current anonymous
set is inactive, so osp->walk will skip this element and miss the
validate check."
To resolve this problem, this patch passes the generation mask to the
walk function through the iter container structure depending on the code
path:
1) If we're dumping the elements, then we have to check if the element
is active in the current generation. Thus, we check for the current
bit in the genmask.
2) If we're checking for loops, then we have to check if the element is
active in the next generation, as we're in the middle of a
transaction. Thus, we check for the next bit in the genmask.
Based on original patch from Liping Zhang.
Reported-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Tested-by: Liping Zhang <liping.zhang@spreadtrum.com>
We should check "i" is used as a dictionary or not, "binding" is already
checked before.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
I forgot to move the kmem_cache_destroy into the exit path.
Fixes: 0c5366b3a8 ("netfilter: conntrack: use single slab cache)
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Use the peer record to distribute network errors rather than the transport
object (which I want to get rid of). An error from a particular peer
terminates all calls on that peer.
For future consideration:
(1) For ICMP-induced errors it might be worth trying to extract the RxRPC
header from the offending packet, if one is returned attached to the
ICMP packet, to better direct the error.
This may be overkill, though, since an ICMP packet would be expected
to be relating to the destination port, machine or network. RxRPC
ABORT and BUSY packets give notice at RxRPC level.
(2) To also abort connection-level communications (such as CHALLENGE
packets) where indicted by an error - but that requires some revamping
of the connection event handling first.
Signed-off-by: David Howells <dhowells@redhat.com>
Don't assume anything about the address in an ICMP packet in
rxrpc_error_report() as the address may not be IPv4 in future, especially
since we're just printing these details.
Signed-off-by: David Howells <dhowells@redhat.com>
Break MTU determination from ICMP out into its own function to reduce the
complexity of the error report handler.
Signed-off-by: David Howells <dhowells@redhat.com>
Rename rxrpc_UDP_error_report() to rxrpc_error_report() as it might get
called for something other than UDP.
Signed-off-by: David Howells <dhowells@redhat.com>
Rework peer object handling to use a hash table instead of a flat list and
to use RCU. Peer objects are no longer destroyed by passing them to a
workqueue to process, but rather are just passed to the RCU garbage
collector as kfree'able objects.
The hash function uses the local endpoint plus all the components of the
remote address, except for the RxRPC service ID. Peers thus represent a
UDP port on the remote machine as contacted by a UDP port on this machine.
The RCU read lock is used to handle non-creating lookups so that they can
be called from bottom half context in the sk_error_report handler without
having to lock the hash table against modification.
rxrpc_lookup_peer_rcu() *does* take a reference on the peer object as in
the future, this will be passed to a work item for error distribution in
the error_report path and this function will cease being used in the
data_ready path.
Creating lookups are done under spinlock rather than mutex as they might be
set up due to an external stimulus if the local endpoint is a server.
Captured network error messages (ICMP) are handled with respect to this
struct and MTU size and RTT are cached here.
Signed-off-by: David Howells <dhowells@redhat.com>
This function is just ->init(), rename it to make it obvious.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These should be gone when we removed CONFIG_NET_CLS_POLICE.
We can not totally remove them since they are exposed
to userspace.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor rds_conn_destroy() so that the per-path dismantling
is done in rds_conn_path_destroy, and then iterate as needed
over rds_conn_path_destroy().
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit changes rds_conn_shutdown to take a rds_conn_path *
argument, allowing it to shutdown paths other than c_path[0] for
MP-capable transports.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a for() loop in __rds_conn_create to initialize all the
conn_paths, in preparate for MP capable transports.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rds_conn_path_error() is the MP-aware analog of rds_conn_error,
to be used by multipath-capable callers.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit updates the callbacks related to the rds-info command
so that they walk through all the rds_conn_path structures and
report the requested info.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rds_conn_path_connect_if_down() works on the rds_conn_path
that it is passed. Callers who are not t_m_capable may continue
calling rds_conn_connect_if_down, which will invoke
rds_conn_path_connect_if_down() with the default c_path[0].
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit allows rds_send_pong() callers to send back
the rds pong message on some path other than c_path[0] by
passing in a struct rds_conn_path * argument. It also
removes the last dependency on the #defines in rds_single.h
from send.c
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Explicitly set up rds_conn_path, either from i_conn_path (for
MP capable transpots) or as c_path[0], and use this in
rds_send_drop_to()
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass a struct rds_conn_path to rds_send_xmit so that MP capable
transports can transmit packets on something other than c_path[0].
The eventual goal for MP capable transports is to hash the rds
socket to a path based on the bound local address/port, and use
this path as the argument to rds_send_xmit()
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass the rds_conn_path to rds_send_queue_rm, and use it to initialize
the i_conn_path field in struct rds_incoming. This commit also makes
rds_send_queue_rm() MP capable, because it now takes locks
specific to the rds_conn_path passed in, instead of defaulting to
the c_path[0] based defines from rds_single_path.h
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The only caller of rds_send_get_message() was
rds_iw_send_cq_comp_handler() which was removed as part of
commit dcdede0406 ("RDS: Drop stale iWARP RDMA transport"),
so remove rds_send_get_message() for the same reason.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rds_send_path_drop_acked() is the path-specific version of
rds_send_drop_acked() to be invoked by MP capable callers.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rds_send_path_reset() is the path specific version of rds_send_reset()
intended for MP capable callers.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
t_mp_capable transports can use rds_inc_path_init to initialize
all fields in struct rds_incoming, including the i_conn_path.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Transports that are t_mp_capable should set the rds_conn_path
on which the datagram was recived in the ->i_conn_path field
of struct rds_incoming.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The t_mp_capable bit will be used in the core rds module
to support multipathing logic when the transport supports it.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for multipath RDS, split the rds_connection
structure into a base structure, and a per-path struct rds_conn_path.
The base structure tracks information and locks common to all
paths. The workqs for send/recv/shutdown etc are tracked per
rds_conn_path. Thus the workq callbacks now work with rds_conn_path.
This commit allows for one rds_conn_path per rds_connection, and will
be extended into multiple conn_paths in subsequent commits.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make sure that dctcp_get_info() returns only the size of the
info->dctcp struct that it zeroes out and fills in. Previously it had
been returning the size of the enclosing tcp_cc_info union,
sizeof(*info). There is no problem yet, but that union that may one
day be larger than struct tcp_dctcp_info, in which case the
TCP_CC_INFO code might accidentally copy uninitialized bytes from the
stack.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix to return a negative error code from the error handling
case instead of 0, as done elsewhere in this function.
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Acked-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIVAwUAV16wsPSw1s6N8H32AQLGqA/+MC6/oufeV5WF/csMucx6I5wauPDQND0b
E1bOgzceGckwY3NUh14sYo8o/JVxzkKCWrWb0bNBSdqUup6vTMxCBgkSUxvaCbN4
RIcqu8LDB3vCrzMobMWijZDGqCpF8ZprUcvfCVj1tdrnEbK3njoYJjJ9UZN/lVPw
Tw+ojKpAty7r/JtZUkWEBhlZZtG26Ko7t+mDbgISOGulPNnw0HSsodMCDb1iSMRS
MZESOqQ5Kb8vhNOQ0zn+fNWB5d94yLZIxmt/siDXuetopHWuD1u8imqz+lhQ1BTU
7tOO1PKGxnsalw7FlgNL1cNmOJi/kEKISURG82K1u3pTk2Bf9vtiEPpLnC6p0Qy2
WwIQVJWewSCYsLiujqhXwmTGHC1m53979VtETHZoKhj/Jm8t3S38XN92Kt/P9y9g
z1r3EBs6+WzTt4zaw5rjVBY33reeDgEo2RYusLxMOgvPsmX5KMNvpov9R04+JthC
Qp1zWsE2FerxRJ+CNFBNl0ei4NVMsds3OHuEfgAiEedRowjGokDRQd6stVEev05S
S2XggUvt011Fvhud0UJWH9koeVcHA85KNCK8V1aj9xH3zsQDizMYhZMhaL34wnu+
z9XSv6DjZzJVWxC4gyQ/bsvdg34Uerv6g2BwRgqDCfh90Sa5U2n1CL8b3LwTKFU3
3R3s0l3R5gg=
=FXID
-----END PGP SIGNATURE-----
Merge tag 'rxrpc-rewrite-20160613' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc: Rename rxrpc source files
Here's the next part of the AF_RXRPC rewrite. In this set I rename some of
the files in the net/rxrpc/ directory and adjust the Makefile and
ar-internal.h to reflect the changes.
The aim is twofold:
(1) Remove the "ar-" prefix on those files that have it as it's not really
useful, especially now that I'm building rxkad in.
(2) To aid splitting the local, peer, connection and call handling code
into separate files for object and event handling in future patches by
making it easier to come up with new filenames.
There are two commits:
(1) The first commit does a bunch of renames of .c files and alters the
Makefile. ar-internal.h isn't renamed at this time to avoid having to
change the contents of the files being renamed.
(2) The second commit changes the section label comments in ar-internal.h
to reflect the changed filenames and reorders the file so that the
sections are back in filename order.
The patches can be found here also:
http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-rewrite
Tagged thusly:
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
rxrpc-rewrite-20160613
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When skip_sw is set and hardware fails to apply filter, return error to
user. This will make error propagation logic similar to the one
currently used in u32 classifier.
Also, changed code to use tc_skip_sw() utility function.
Signed-off-by: Amir Vadai <amirva@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function gretap_fb_dev_create() (only used by ovs) never calls
rtnl_configure_link(). The consequence is that dev->rtnl_link_state is
never set to RTNL_LINK_INITIALIZED.
During the deletion phase, the function rollback_registered_many() sends
a RTM_DELLINK only if dev->rtnl_link_state is set to RTNL_LINK_INITIALIZED.
Fixes: b2acd1dc39 ("openvswitch: Use regular GRE net_device instead of vport")
CC: Thomas Graf <tgraf@suug.ch>
CC: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After ipgre_newlink()/geneve_configure() call, the netdev is registered.
Fixes: 7e059158d5 ("vxlan, gre, geneve: Set a large MTU on ovs-created tunnel devices")
CC: David Wragg <david@weave.works>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a corner case in which udp packets belonging to a same
flow are hashed to different socket when hslot->count changes from 10
to 11:
1) When hslot->count <= 10, __udp_lib_lookup() searches udp_table->hash,
and always passes 'daddr' to udp_ehashfn().
2) When hslot->count > 10, __udp_lib_lookup() searches udp_table->hash2,
but may pass 'INADDR_ANY' to udp_ehashfn() if the sockets are bound to
INADDR_ANY instead of some specific addr.
That means when hslot->count changes from 10 to 11, the hash calculated by
udp_ehashfn() is also changed, and the udp packets belonging to a same
flow will be hashed to different socket.
This is easily reproduced:
1) Create 10 udp sockets and bind all of them to 0.0.0.0:40000.
2) From the same host send udp packets to 127.0.0.1:40000, record the
socket index which receives the packets.
3) Create 1 more udp socket and bind it to 0.0.0.0:44096. The number 44096
is 40000 + UDP_HASH_SIZE(4096), this makes the new socket put into the
same hslot as the aformentioned 10 sockets, and makes the hslot->count
change from 10 to 11.
4) From the same host send udp packets to 127.0.0.1:40000, and the socket
index which receives the packets will be different from the one received
in step 2.
This should not happen as the socket bound to 0.0.0.0:44096 should not
change the behavior of the sockets bound to 0.0.0.0:40000.
It's the same case for IPv6, and this patch also fixes that.
Signed-off-by: Su, Xuemin <suxm@chinanetcenter.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the qdisc is full, we drop a packet at the head of the queue,
queue the current skb and return NET_XMIT_CN
Now we track backlog on upper qdiscs, we need to call
qdisc_tree_reduce_backlog(), even if the qlen did not change.
Fixes: 2ccccf5fb4 ("net_sched: update hierarchical backlog too")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Tom Herbert <tom@herbertland.com>
Fixes: 4068579e1e ("net: Implmement RFC 6936 (zero RX csums for UDP/IPv6")
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Tom Herbert <tom@herbertland.com>
Fixes: 4068579e1e ("net: Implmement RFC 6936 (zero RX csums for UDP/IPv6")
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Florent Fourcot <florent.fourcot@enst-bretagne.fr>
Fixes: 1d13a96c74 ("ipv6: tcp: fix flowlabel value in ACK messages send from TIME_WAIT")
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv6 ping socket error handler doesn't correctly convert the new 32 bit
mtu to host endianness before using.
Cc: Lorenzo Colitti <lorenzo@google.com>
Fixes: 6d0bfe2261 ("net: ipv6: Add IPv6 support to the ping socket.")
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Even with spin_unlock_wait() fixed, nf_conntrack_lock{,_all}() is
borken as it misses a bunch of memory barriers to order the whole
global vs local locks scheme.
Even x86 (and other TSO archs) are affected.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[ Updated the comments. ]
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use the low latency transport workqueue to process the task that is
next in line on the xprt->sending queue.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
rpciod can easily get congested due to the long list of queued rpc_tasks.
Having the receive queue wait in turn for those tasks to complete can
therefore be a bottleneck.
Address the problem by separating the workqueues into:
- rpciod: manages rpc_tasks
- xprtiod: manages transport related work.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The only difference between the two at this point is the reset of
the connection timeout, and since everyone expect tcp ignore that value,
we can just throw it into the generic function.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Update the section comments in ar-internal.h that indicate the locations of
the referenced items to reflect the renames done to the .c files in
net/rxrpc/.
This also involves some rearrangement to reflect keep the sections in order
of filename.
Signed-off-by: David Howells <dhowells@redhat.com>
Rename files matching net/rxrpc/ar-*.c to get rid of the "ar-" prefix.
This will aid splitting those files by making easier to come up with new
names.
Note that the not all files are simply renamed from ar-X.c to X.c. The
following exceptions are made:
(*) ar-call.c -> call_object.c
ar-ack.c -> call_event.c
call_object.c is going to contain the core of the call object
handling. Call event handling is all going to be in call_event.c.
(*) ar-accept.c -> call_accept.c
Incoming call handling is going to be here.
(*) ar-connection.c -> conn_object.c
ar-connevent.c -> conn_event.c
The former file is going to have the basic connection object handling,
but there will likely be some differentiation between client
connections and service connections in additional files later. The
latter file will have all the connection-level event handling.
(*) ar-local.c -> local_object.c
This will have the local endpoint object handling code. The local
endpoint event handling code will later be split out into
local_event.c.
(*) ar-peer.c -> peer_object.c
This will have the peer endpoint object handling code. Peer event
handling code will be placed in peer_event.c (for the moment, there is
none).
(*) ar-error.c -> peer_event.c
This will become the peer event handling code, though for the moment
it's actually driven from the local endpoint's perspective.
Note that I haven't renamed ar-transport.c to transport_object.c as the
intention is to delete it when the rxrpc_transport struct is excised.
The only file that actually has its contents changed is net/rxrpc/Makefile.
net/rxrpc/ar-internal.h will need its section marker comments updating, but
I'll do that in a separate patch to make it easier for git to follow the
history across the rename. I may also want to rename ar-internal.h at some
point - but that would mean updating all the #includes and I'd rather do
that in a separate step.
Signed-off-by: David Howells <dhowells@redhat.com.
sch_atm returns this when TC_ACT_SHOT classification occurs.
But all other schedulers that use tc_classify
(htb, hfsc, drr, fq_codel ...) return NET_XMIT_SUCCESS | __BYPASS
in this case so just do that in atm.
BATMAN uses it as an intermediate return value to signal
forwarding vs. buffering, but it did not return POLICED to
callers outside of BATMAN.
Reviewed-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
At Qdisc creation or change time, prio_tune() creates missing
pfifo qdiscs but does not return an error code if one
qdisc could not be allocated.
Leaving a qdisc in non operational state without telling user
anything about this problem is not good.
Also, testing if we replace something different than noop_qdisc
a second time makes no sense so I removed useless code.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
>> net/ipv4/ipconfig.c:130:15: warning: 'ic_addrservaddr' defined but not used [-Wunused-variable]
static __be32 ic_addrservaddr = NONE; /* IP Address of the IP addresses'server */
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In IPv6 the ToS values are part of the flowlabel in flowi6 and get
extracted during fib rule lookup, but we forgot to correctly initialize
the flowlabel before the routing lookup.
Reported-by: <liam.mcbirnie@boeing.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
__QDISC_STATE_THROTTLED bit manipulation is rather expensive
for HTB and few others.
I already removed it for sch_fq in commit f2600cf02b
("net: sched: avoid costly atomic operation in fq_dequeue()")
and so far nobody complained.
When one ore more packets are stuck in one or more throttled
HTB class, a htb dequeue() performs two atomic operations
to clear/set __QDISC_STATE_THROTTLED bit, while root qdisc
lock is held.
Removing this pair of atomic operations bring me a 8 % performance
increase on 200 TCP_RR tests, in presence of throttled classes.
This patch has no side effect, since nothing actually uses
disc_is_throttled() anymore.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Looks like it is only there as some optimization attempt.
Since __QDISC_STATE_THROTTLED set/unset is way too expensive,
and netem is the last user, just remove this check.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
So far no qdisc ever unset the throttled bit at enqueue() time,
so CBQ usage of qdisc_is_throttled() was flaky.
Since __QDISC_STATE_THROTTLED set/unset is way too expensive
considering that only CBQ was eventually caring for this status,
it would make sense to implement a Qdisc ops ->is_throttled()
if we find that this is needed.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We want to get rid of generic qdisc throttled management,
so this qdisc has to use a private flag.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The symbol ic_addrservaddr is not static, but has no declaration
to match so make it static to fix the following warning:
net/ipv4/ipconfig.c:130:8: warning: symbol 'ic_addrservaddr' was not declared. Should it be static?
Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
The functions inet_diag_msg_common_fill and inet_diag_msg_attrs_fill
seem to have been missed from the include/linux/inet_diag.h header
file. Add them to fix the following warnings:
net/ipv4/inet_diag.c:69:6: warning: symbol 'inet_diag_msg_common_fill' was not declared. Should it be static?
net/ipv4/inet_diag.c:108:5: warning: symbol 'inet_diag_msg_attrs_fill' was not declared. Should it be static?
Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now sctp doesn't change socket state upon shutdown reception. It changes
just the assoc state, even though it's a TCP-style socket.
For some cases, if we really need to check sk->sk_state, it's necessary to
fix this issue, at least when we use ss or netstat to dump, we can get a
more exact information.
As an improvement, we will change sk->sk_state when we change asoc->state
to SHUTDOWN_RECEIVED, and also do it in sctp_shutdown to keep consistent
with sctp_close.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo R. Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* the biggest change is Michał's work on integrating FQ/codel
with the mac80211 internal software queues
* cfg80211 connect result gets clarified for the
"no connection at all" case
* advertisement of per-interface type capabilities, in case
they differ (which makes a lot of sense for some capabilities)
* most of the nl80211 & hwsim unprivileged namespace operation
changes
* human-readable VHT capabilities in debugfs
* some other cleanups, like spelling
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJXWWe/AAoJEGt7eEactAAdXVsQAKGkdXGmUU14tfiRCnZryMEH
GyiVDHQivfcadicbK9599LXadvYc6ATZHfYoSnROwB82NfhKB71N8dUJ4qePxLNa
lDe5uuuAgxtHG63hK1R02flPorkBAEcUcdHsFSuOw7JWag4/49sCqWH8X5K8E9aT
vrhaPeppJPydwnmNzNOBvMsLEJFJdZWaEZupZQ0kiJ/jB/howFdzF75GxGQ2jh+F
dPk22/PtV2igCtntqaty7h057AdZ+znQuiUdVB7eYIOle7veeGyMzFFf0xQ99LAZ
+xcA7GA74u6m8O6SUVw+6nhrUJ5XTsKGUtmKCTVOcUGa5z7XDD7NIxk2SgloCJkC
qPqVU8wyBCxKc+6JsyiVSLcB5MWvWxifvo0OyLsbCqhN50bTtJT96ymKOAx4NeSG
s+TYlV2HgVRNN6PzGvl/ZUo8Mm1UFGWlCBPcyMy9Fwc0jxdhnOAZzOtqV1yVwlCz
M43RKwxBX6MHAtlfwy6g3M5ievAviwY3Kt+yGeLRIVsJJfOUdQxYAa+m7GTeohD/
Tns4IHbtWJiLTKuaTvrs4ec3ycXrv7iibMINcarvkTgbbF9Qvarf/e2RoWO/freY
OUixDdG7HmsY0XIUhSepeMJpn5xIhQ7dkmckhzDEgZqBM6uPJEMB2+O/CGNDsxEp
VjfnwXJ+9PQrjheBvnD2
=FvhQ
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-davem-2016-06-09' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
For the next cycle, we have the following:
* the biggest change is Michał's work on integrating FQ/codel
with the mac80211 internal software queues
* cfg80211 connect result gets clarified for the
"no connection at all" case
* advertisement of per-interface type capabilities, in case
they differ (which makes a lot of sense for some capabilities)
* most of the nl80211 & hwsim unprivileged namespace operation
changes
* human-readable VHT capabilities in debugfs
* some other cleanups, like spelling
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP-NV (New Vegas) is a major update to TCP-Vegas.
An earlier version of NV was presented at 2010's LPC.
It is a delayed based congestion avoidance for the
data center. This version has been tested within a
10G rack where the HW RTTs are 20-50us and with
1 to 400 flows.
A description of TCP-NV, including implementation
details as well as experimental results, can be found at:
http://www.brakmo.org/networking/tcp-nv/TCPNV.html
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add in_flight (bytes in flight when packet was sent) field
to tx component of tcp_skb_cb and make it available to
congestion modules' pkts_acked() function through the
ack_sample function argument.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace open coded conversion between virtio_net_hdr to skb GSO info with
virtio_net_hdr_from_skb
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
alloc_workqueue replaces deprecated create_workqueue().
Since the driver is infiniband which can be used as block device and the
workqueue seems involved in regular operation of the device, so a
dedicated workqueue has been used with WQ_MEM_RECLAIM set to guarantee
forward progress under memory pressure.
Since there are only a fixed number of work items, explicit concurrency
limit is unnecessary here.
Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 8626c56c82 ("bridge: fix potential use-after-free when hook
returns QUEUE or STOLEN verdict") fixed incorrect usage of NF_HOOK's
return value by consuming packets in okfn via br_pass_frame_up().
However, this function re-injects packets to the Rx path with skb->dev
set to the bridge device, which breaks kernel's STP, as all STP packets
appear to originate from the bridge device itself.
Instead, if STP is enabled and bridge isn't a 802.1ad bridge, then learn
packet's SMAC and inject it back to the Rx path for further processing
by the packet handlers.
The patch also makes netfilter's behavior consistent with regards to
packets destined to the Bridge Group Address, as no hook registered at
LOCAL_IN will ever be called, regardless if STP is enabled or not.
Cc: Florian Westphal <fw@strlen.de>
Cc: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Cc: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Fixes: 8626c56c82 ("bridge: fix potential use-after-free when hook returns QUEUE or STOLEN verdict")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We always mixed in the parent pointer into the dentry name hash, but we
did it late at lookup time. It turns out that we can simplify that
lookup-time action by salting the hash with the parent pointer early
instead of late.
A few other users of our string hashes also wanted to mix in their own
pointers into the hash, and those are updated to use the same mechanism.
Hash users that don't have any particular initial salt can just use the
NULL pointer as a no-salt.
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Limit the socket incoming call backlog queue size so that a remote client
can't pump in sufficient new calls that the server runs out of memory. Note
that this is partially theoretical at the moment since whilst the number of
calls is limited, the number of packets trying to set up new calls is not.
This will be addressed in a later patch.
If the caller of listen() specifies a backlog INT_MAX, then they get the
current maximum; anything else greater than max_backlog or anything
negative incurs EINVAL.
The limit on the maximum queue size can be set by:
echo N >/proc/sys/net/rxrpc/max_backlog
where 4<=N<=32.
Further, set the default backlog to 0, requiring listen() to be called
before we start actually queueing new calls. Whilst this kind of is a
change in the UAPI, the caller can't actually *accept* new calls anyway
unless they've first called listen() to put the socket into the LISTENING
state - thus the aforementioned new calls would otherwise just sit there,
eating up kernel memory. (Note that sockets that don't have a non-zero
service ID bound don't get incoming calls anyway.)
Given that the default backlog is now 0, make the AFS filesystem call
kernel_listen() to set the maximum backlog for itself.
Possible improvements include:
(1) Trimming a too-large backlog to max_backlog when listen is called.
(2) Trimming the backlog value whenever the value is used so that changes
to max_backlog are applied to an open socket automatically. Note that
the AFS filesystem opens one socket and keeps it open for extended
periods, so would miss out on changes to max_backlog.
(3) Having a separate setting for the AFS filesystem.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a possibility where the user can just specify the parent and
all filters under that parent are then being purged. Currently,
for example for scripting, one needs to specify pref/prio to have
a well-defined number for 'tc filter del' command for addressing
the previously created instance or additionally filter handle in
case of priorities being the same. Improve usage by allowing the
option for tc to specify the parent and removing the whole chain
for that given parent.
Example usage after patch, no tc changes required:
# tc qdisc replace dev foo clsact
# tc filter add dev foo egress bpf da obj ./bpf.o
# tc filter add dev foo egress bpf da obj ./bpf.o
# tc filter show dev foo egress
filter protocol all pref 49151 bpf
filter protocol all pref 49151 bpf handle 0x1 bpf.o:[classifier] direct-action
filter protocol all pref 49152 bpf
filter protocol all pref 49152 bpf handle 0x1 bpf.o:[classifier] direct-action
# tc filter del dev foo egress
# tc filter show dev foo egress
#
Previously, RTM_DELTFILTER requests with invalid prio of 0 were
rejected, so only netlink requests with RTM_NEWTFILTER and NLM_F_CREATE
flag were allowed where the kernel would auto-generate a pref/prio.
We can piggyback on that and use prio of 0 as a wildcard for
requests of RTM_DELTFILTER.
For notifying tc netlink monitoring users (e.g. libnl uses this
for caching), there are two options, that is, sending individual
tfilter_notify() notifications for each tcf_proto, or sending a
single one indicating wildcard removal. I tried both and there
are pros and cons for each, eventually I decided for sending
individual tfilter_notify(), so that user space can support this
seamlessly and there won't be a mess of changing each and every
application to make sure expectations from the kernel won't break
when they don't understand single notification. Since linear chains
don't really scale, I expect only a handful of classifiers to be
attached at max for a given parent anyway.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a bpf_check_basics_ok() and reject filters that are of invalid
size much earlier, so we don't do any useless work such as invoking
bpf_prog_alloc(). Currently, rejection happens in bpf_check_classic()
only, but it's really unnecessarily late and they should be rejected
at earliest point. While at it, also clean up one bpf_prog_size() to
make it consistent with the remaining invocations.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Respect the stack's xmit_recursion limit for calls into dev_queue_xmit().
Currently, they are not handeled by the limiter when attached to clsact's
egress parent, for example, and a buggy program redirecting it to the
same device again could run into stack overflow eventually. It would be
good if we could notify an admin to give him a chance to react. We reuse
xmit_recursion instead of having one private to eBPF, so that the stack's
current recursion depth will be taken into account as well. Follow-up to
commit 3896d655f4 ("bpf: introduce bpf_clone_redirect() helper") and
27b29f6305 ("bpf: add bpf_redirect() helper").
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch adds a new OVS action, OVS_ACTION_ATTR_TRUNC, in order to
truncate packets. A 'max_len' is added for setting up the maximum
packet size, and a 'cutlen' field is to record the number of bytes
to trim the packet when the packet is outputting to a port, or when
the packet is sent to userspace.
Signed-off-by: William Tu <u9012063@gmail.com>
Cc: Pravin Shelar <pshelar@nicira.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
net/sched/act_police.c
net/sched/sch_drr.c
net/sched/sch_hfsc.c
net/sched/sch_prio.c
net/sched/sch_red.c
net/sched/sch_tbf.c
In net-next the drop methods of the packet schedulers got removed, so
the bug fixes to them in 'net' are irrelevant.
A packet action unload crash fix conflicts with the addition of the
new firstuse timestamp.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) nfnetlink timestamp taken from wrong skb, fix from Florian Westphal.
2) Revert some msleep conversions in rtlwifi as these spots are in
atomic context, from Larry Finger.
3) Validate that NFTA_SET_TABLE attribute is actually specified when we
call nf_tables_getset(). From Phil Turnbull.
4) Don't do mdio_reset in stmmac driver with spinlock held as that can
sleep, from Vincent Palatin.
5) sk_filter() does things other than run a BPF filter, so we should
not elide it's call just because sk->sk_filter is NULL. Fix from
Eric Dumazet.
6) Fix missing backlog updates in several packet schedulers, from Cong
Wang.
7) bnx2x driver should allow VLAN add/remove while the interface is
down, from Michal Schmidt.
8) Several RDS/TCP race fixes from Sowmini Varadhan.
9) fq_codel scheduler doesn't return correct queue length in dumps,
from Eric Dumazet.
10) Fix TCP stats for tail loss probe and early retransmit in ipv6, from
Yuchung Cheng.
11) Properly initialize udp_tunnel_socket_cfg in l2tp_tunnel_create(),
from Guillaume Nault.
12) qfq scheduler leaks SKBs if a kzalloc fails, fix from Florian
Westphal.
13) sock_fprog passed into PACKET_FANOUT_DATA needs compat handling,
from Willem de Bruijn.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (85 commits)
vmxnet3: segCnt can be 1 for LRO packets
packet: compat support for sock_fprog
stmmac: fix parameter to dwmac4_set_umac_addr()
net/mlx5e: Fix blue flame quota logic
net/mlx5e: Use ndo_stop explicitly at shutdown flow
net/mlx5: E-Switch, always set mc_promisc for allmulti vports
net/mlx5: E-Switch, Modify node guid on vf set MAC
net/mlx5: E-Switch, Fix vport enable flow
net/mlx5: E-Switch, Use the correct error check on returned pointers
net/mlx5: E-Switch, Use the correct free() function
net/mlx5: Fix E-Switch flow steering capabilities check
net/mlx5: Fix flow steering NIC capabilities check
net/mlx5: Fix root flow table update
net/mlx5: Fix MLX5_CMD_OP_MAX to be defined correctly
net/mlx5: Fix masking of reserved bits in XRCD number
net/mlx5: Fix the size of modify QP mailbox
mlxsw: spectrum: Don't sleep during ndo_get_phys_port_name()
mlxsw: spectrum: Make split flow match firmware requirements
wext: Fix 32 bit iwpriv compatibility issue with 64 bit Kernel
cfg80211: remove get/set antenna and tx power warnings
...
Socket option PACKET_FANOUT_DATA takes a struct sock_fprog as argument
if PACKET_FANOUT has mode PACKET_FANOUT_CBPF. This structure contains
a pointer into user memory. If userland is 32-bit and kernel is 64-bit
the two disagree about the layout of struct sock_fprog.
Add compat setsockopt support to convert a 32-bit compat_sock_fprog to
a 64-bit sock_fprog. This is analogous to compat_sock_fprog support for
SO_REUSEPORT added in commit 1957598840 ("soreuseport: add compat
case for setsockopt SO_ATTACH_REUSEPORT_CBPF").
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Frank Kellermann reported a kernel crash with 4.5.0 when IPv6 is
disabled at boot using the kernel option ipv6.disable=1. Using
current net-next with the boot option:
$ ip link add red type vrf table 1001
Generates:
[12210.919584] BUG: unable to handle kernel NULL pointer dereference at 0000000000000748
[12210.921341] IP: [<ffffffff814b30e3>] fib6_get_table+0x2c/0x5a
[12210.922537] PGD b79e3067 PUD bb32b067 PMD 0
[12210.923479] Oops: 0000 [#1] SMP
[12210.924001] Modules linked in: ipvlan 8021q garp mrp stp llc
[12210.925130] CPU: 3 PID: 1177 Comm: ip Not tainted 4.7.0-rc1+ #235
[12210.926168] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[12210.928065] task: ffff8800b9ac4640 ti: ffff8800bacac000 task.ti: ffff8800bacac000
[12210.929328] RIP: 0010:[<ffffffff814b30e3>] [<ffffffff814b30e3>] fib6_get_table+0x2c/0x5a
[12210.930697] RSP: 0018:ffff8800bacaf888 EFLAGS: 00010202
[12210.931563] RAX: 0000000000000748 RBX: ffffffff81a9e280 RCX: ffff8800b9ac4e28
[12210.932688] RDX: 00000000000000e9 RSI: 0000000000000002 RDI: 0000000000000286
[12210.933820] RBP: ffff8800bacaf898 R08: ffff8800b9ac4df0 R09: 000000000052001b
[12210.934941] R10: 00000000657c0000 R11: 000000000000c649 R12: 00000000000003e9
[12210.936032] R13: 00000000000003e9 R14: ffff8800bace7800 R15: ffff8800bb3ec000
[12210.937103] FS: 00007faa1766c700(0000) GS:ffff88013ac00000(0000) knlGS:0000000000000000
[12210.938321] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[12210.939166] CR2: 0000000000000748 CR3: 00000000b79d6000 CR4: 00000000000406e0
[12210.940278] Stack:
[12210.940603] ffff8800bb3ec000 ffffffff81a9e280 ffff8800bacaf8c8 ffffffff814b3135
[12210.941818] ffff8800bb3ec000 ffffffff81a9e280 ffffffff81a9e280 ffff8800bace7800
[12210.943040] ffff8800bacaf8f0 ffffffff81397c88 ffff8800bb3ec000 ffffffff81a9e280
[12210.944288] Call Trace:
[12210.944688] [<ffffffff814b3135>] fib6_new_table+0x24/0x8a
[12210.945516] [<ffffffff81397c88>] vrf_dev_init+0xd4/0x162
[12210.946328] [<ffffffff814091e1>] register_netdevice+0x100/0x396
[12210.947209] [<ffffffff8139823d>] vrf_newlink+0x40/0xb3
[12210.948001] [<ffffffff814187f0>] rtnl_newlink+0x5d3/0x6d5
...
The problem above is due to the fact that the fib hash table is not
allocated when IPv6 is disabled at boot.
As for the VRF driver it should not do any IPv6 initializations if IPv6
is disabled, so it needs to know if IPv6 is disabled at boot. The disable
parameter is private to the IPv6 module, so provide an accessor for
modules to determine if IPv6 was disabled at boot time.
Fixes: 35402e3136 ("net: Add IPv6 support to VRF device")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simplify the RxRPC connect() implementation. It will just note the
destination address it is given, and if a sendmsg() comes along with no
address, this will be assigned as the address. No transport struct will be
held internally, which will allow us to remove this later.
Simplify sendmsg() also. Whilst a call is active, userspace refers to it
by a private unique user ID specified in a control message. When sendmsg()
sees a user ID that doesn't map to an extant call, it creates a new call
for that user ID and attempts to add it. If, when we try to add it, the
user ID is now registered, we now reject the message with -EEXIST. We
should never see this situation unless two threads are racing, trying to
create a call with the same ID - which would be an error.
It also isn't required to provide sendmsg() with an address - provided the
control message data holds a user ID that maps to a currently active call.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is time to add netdev_lockdep_set_classes() helper
so that lockdep annotations per device type are easier to manage.
This removes a lot of copies and missing annotations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1) qdisc_run_begin() is really using the equivalent of a trylock.
Instead of using write_seqcount_begin(), use a combination of
raw_write_seqcount_begin() and correct lockdep annotation.
2) sch_direct_xmit() should use regular spin_lock(root_lock)
Fixes: f9eb8aea2a ("net_sched: transform qdisc running bit into a seqcount")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* a fix for a long-standing iwpriv 32/64 compat issue
* two fairly recently introduced (4.6) warning asking for
symmetric operations are erroneous and I remove them
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJXWWVNAAoJEGt7eEactAAd2HMP/RGyAWN/uXJVNySsqvBSqiew
t5ZTopRfL2YEcO1EUF749bncfJo6P9KUtuM72gYYRYART5viucdYFk53Ky8EtKh8
QQ7WTGyiuQP7XvderfXmrjV/J35KFiPTnf83KUfOnBTVOCtw6doIBIMutUmR7FPV
cyH4e56j9agzbBJ7mjeLYGBMmyhq2uZ85vXqk/mnpn5fH9+lGojcoIVZfhXWmMoI
d5O/giARzjDM7LTMIwgepY5w/j8RX1rMJCXfxiZsGMuiSTYH8FETbe3Go9CL/qHS
8cZCCQRMCpjMK+SOmNr7xYx70YjIJQabwo9luYmcdCkhw6VHRxHl47lk/2XLp4F9
G+ReA1A7izV2XSyJIWTl+Or1qzgzsf6SfZHKszSdY47NXzK5aCprRfUvXyxLv6GZ
IGcHzNWOFF/1fKhh3w2NylhTd/omkvtZzt5Xuom94OPXdyxzy/PmViZ+xuhutrGt
dYxhmzPZcw+/OeTGgbt2YfYFUdMZqbHkaHPE0wia03r1yXzQD/x05/QET2vwIA0L
C+ATyaykSCCOm/ovwm5rada151Iai1engCWIVw/pZ3HUj6C8xqwyDS3+xsTRmKzI
oBp1O3K3yR0c0T9KJNmORnnrYQAbSBIQNhdioXVCBpByTOjYfATKVQtaSIV/uu4/
2r1KQmrT4z5OrL6lm2wv
=0I3T
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2016-06-09' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
Two more fixes for now:
* a fix for a long-standing iwpriv 32/64 compat issue
* two fairly recently introduced (4.6) warning asking for
symmetric operations are erroneous and I remove them
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
ipip6_tunnel_xmit() is called immediately after checking that
skb->protocol is htons(ETH_P_IPV6) so there is no need
to check it a second time.
Found by inspection.
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Dinan Gunawardena <dinan.gunawardena@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In cases where the category bitmap is sparse enough that gaps exist
between netlbl_lsm_catmap structs, callers to netlbl_catmap_getlong()
could find themselves prematurely ending their search through the
category bitmap. Further, the methods used to calculate the 'idx'
and 'off' values were incorrect for bitmaps this large. This patch
changes the netlbl_catmap_getlong() behavior so that it always skips
over gaps and calculates the index and offset values correctly.
Signed-off-by: Paul Moore <paul@paul-moore.com>
There is no other limit other than a global
packet count limit when using software queuing.
This means a single flow queue can grow insanely
long. This is particularly bad for TCP congestion
algorithms which requires a little more
sophisticated frame dropping scheme than a mere
headdrop on limit overflow.
Hence apply (a slighly modified, to fit the knobs)
CoDel5 on flow queues. This improves TCP
convergence and stability when combined with
wireless driver which keeps its own tx queue/fifo
at a minimum fill level for given link conditions.
Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This adds a debugfs entry to read and modify some fq parameters.
This makes it easy to debug, test and experiment.
Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
[remove module parameter for now]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
mac80211's software queues were designed to work
very closely with device tx queues. They are
required to make use of 802.11 packet aggregation
easily and efficiently.
Due to the way 802.11 aggregation is designed it
only makes sense to keep fair queuing as close to
hardware as possible to reduce induced latency and
inertia and provide the best flow responsiveness.
This change doesn't translate directly to
immediate and significant gains. End result
depends on driver's induced latency. Best results
can be achieved if driver keeps its own tx
queue/fifo fill level to a minimum.
Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Qdiscs are designed with no regard to 802.11
aggregation requirements and hand out
packet-by-packet with no guarantee they are
destined to the same tid. This does more bad than
good no matter how fairly a given qdisc may behave
on an ethernet interface.
Software queuing used per-AC netdev subqueue
congestion control whenever a global AC limit was
hit. This meant in practice a single station or
tid queue could starve others rather easily. This
could resonate with qdiscs in a bad way or could
just end up with poor aggregation performance.
Increasing the AC limit would increase induced
latency which is also bad.
Disabling qdiscs by default and performing
taildrop instead of netdev subqueue congestion
control on the other hand makes it possible for
tid queues to fill up "in the meantime" while
preventing stations starving each other.
This increases aggregation opportunities and
should allow software queuing based drivers
achieve better performance by utilizing airtime
more efficiently with big aggregates.
Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Setting rd to NULL to avoid freeing it, just to be able to return
from the function in a single place, doesn't make much sense.
Return the set_regdom() return value directly.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This patch fixes the following checkpatch.pl issues -
- Please don't use multiple blank lines
- Blank lines aren't necessary before a close brace
- Missing a blank line after declarations
Reviewed-by: Julian Calaby <julian.calaby@gmail.com>
Signed-off-by: Kirtika Ruchandani <kirtika.ruchandani@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
iwpriv app uses iw_point structure to send data to Kernel. The iw_point
structure holds a pointer. For compatibility Kernel converts the pointer
as required for WEXT IOCTLs (SIOCIWFIRST to SIOCIWLAST). Some drivers
may use iw_handler_def.private_args to populate iwpriv commands instead
of iw_handler_def.private. For those case, the IOCTLs from
SIOCIWFIRSTPRIV to SIOCIWLASTPRIV will follow the path ndo_do_ioctl().
Accordingly when the filled up iw_point structure comes from 32 bit
iwpriv to 64 bit Kernel, Kernel will not convert the pointer and sends
it to driver. So, the driver may get the invalid data.
The pointer conversion for the IOCTLs (SIOCIWFIRSTPRIV to
SIOCIWLASTPRIV), which follow the path ndo_do_ioctl(), is mandatory.
This patch adds pointer conversion from 32 bit to 64 bit and vice versa,
if the ioctl comes from 32 bit iwpriv to 64 bit Kernel.
Cc: stable@vger.kernel.org
Signed-off-by: Prasun Maiti <prasunmaiti87@gmail.com>
Signed-off-by: Ujjal Roy <royujjal@gmail.com>
Tested-by: Dibyajyoti Ghosh <dibyajyotig@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Since set_tx_power and set_antenna are frequently implemented
without the matching get_tx_power/get_antenna, we shouldn't
have added warnings for those. Remove them.
The remaining ones are correct and need to be implemented
symmetrically for correct operation.
Cc: stable@vger.kernel.org
Fixes: de3bb771f4 ("cfg80211: add more warnings for inconsistent ops")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
after removal of TCA_CBQ_OVL_STRATEGY from cbq scheduler, there are no
more callers of ->drop() outside of other ->drop functions, i.e.
nothing calls them.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
After the removal of TCA_CBQ_POLICE in cbq scheduler qdisc->reshape_fail
is always NULL, i.e. qdisc_rehape_fail is now the same as qdisc_drop.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
iproute2 doesn't implement any cbq option that results in this attribute
being sent to kernel.
To make use of it, user would have to
- patch iproute2
- add a class
- attach a qdisc to the class (default pfifo doesn't work as
q->handle is 0 and cbq_set_police() is a no-op in this case)
- re-'add' the same class (tc class change ...) again
- user must also specifiy a defmap (e.g. 'split 1:0 defmap 3f'), since
this 'police' feature relies on its presence
- the added qdisc must be one of bfifo, pfifo or netem
If all of these conditions are met and _some_ leaf qdiscs, namely
p/bfifo, netem, plug or tbf would drop a packet, kernel calls back into
cbq, which will attempt to re-queue the skb into a different class
as indicated by the parents' defmap entry for TC_PRIO_BESTEFFORT.
[ i.e. we behave as if tc_classify returned TC_ACT_RECLASSIFY ].
This feature, which isn't documented or implemented in iproute2,
and isn't implemented consistently (most qdiscs like sfq, codel, etc
drop right away instead of attempting this reclassification) is the
sole reason for the reshape_fail and __parent member in Qdisc struct.
So remove TCA_CBQ_POLICE support from the kernel, reject it via EOPNOTSUPP
so userspace knows we don't support it, and then remove no-longer needed
infrastructure in followup commit.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
since initial revision of cbq in 2004 iproute 2 has never implemented
support for TCA_CBQ_OVL_STRATEGY, which is what needs to be set to
activate the class->drop() call (TC_CBQ_OVL_DROP strategy must be
set by userspace value must be set by userspace).
David Miller says:
It seems really safe to kill this thing off, flag an error if someone
tries to set the attribute, and therefore kill off all of the
non-default cbq_ovl_*() functions.
A followup commit can then remove all .drop qdisc methods since this
removed the only caller.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we need to create a new aggregate to enqueue the skb we call kzalloc.
If that fails we returned ENOBUFS without freeing the skb.
Spotted during code review.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ip6 GRE tap device should not be forced to down state to change
the mac address and should allow live address change for tap device
similar to ipv4 gre.
Signed-off-by: Shweta Choudaha <schoudah@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ip6 GRE tap device should not be forced to down state to change
the mac address and should allow live address change for tap device
similar to ipv4 gre.
Signed-off-by: Shweta Choudaha <schoudah@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Return an error if user requested skip-sw and the underlaying
hardware cannot handle tc offloads (or offloads are disabled).
This patch fixes the knode handling.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Errors reported by u32_replace_hw_hnode() were not propagated.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, VRFs require 1 oif and 1 iif rule per address family per
VRF. As the number of VRF devices increases it brings scalability
issues with the increasing rule list. All of the VRF rules have the
same format with the exception of the specific table id to direct the
lookup. Since the table id is available from the oif or iif in the
loopup, the VRF rules can be consolidated to a single rule that pulls
the table from the VRF device.
This patch introduces a new rule attribute l3mdev. The l3mdev rule
means the table id used for the lookup is pulled from the L3 master
device (e.g., VRF) rather than being statically defined. With the
l3mdev rule all of the basic VRF FIB rules are reduced to 1 l3mdev
rule per address family (IPv4 and IPv6).
If an admin wishes to insert higher priority rules for specific VRFs
those rules will co-exist with the l3mdev rule. This capability means
current VRF scripts will co-exist with this new simpler implementation.
Currently, the rules list for both ipv4 and ipv6 look like this:
$ ip ru ls
1000: from all oif vrf1 lookup 1001
1000: from all iif vrf1 lookup 1001
1000: from all oif vrf2 lookup 1002
1000: from all iif vrf2 lookup 1002
1000: from all oif vrf3 lookup 1003
1000: from all iif vrf3 lookup 1003
1000: from all oif vrf4 lookup 1004
1000: from all iif vrf4 lookup 1004
1000: from all oif vrf5 lookup 1005
1000: from all iif vrf5 lookup 1005
1000: from all oif vrf6 lookup 1006
1000: from all iif vrf6 lookup 1006
1000: from all oif vrf7 lookup 1007
1000: from all iif vrf7 lookup 1007
1000: from all oif vrf8 lookup 1008
1000: from all iif vrf8 lookup 1008
...
32765: from all lookup local
32766: from all lookup main
32767: from all lookup default
With the l3mdev rule the list is just the following regardless of the
number of VRFs:
$ ip ru ls
1000: from all lookup [l3mdev table]
32765: from all lookup local
32766: from all lookup main
32767: from all lookup default
(Note: the above pretty print of the rule is based on an iproute2
prototype. Actual verbage may change)
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The node keepalive interval is recalculated at each timer expiration
to catch any changes in the link tolerance, and stored in a field in
struct tipc_node. We use jiffies as unit for the stored value.
This is suboptimal, because it makes the calculation unnecessary
complex, including two unit conversions. The conversions also lead to
a rounding error that causes the link "abort limit" to be 3 in the
normal case, instead of 4, as intended. This again leads to unnecessary
link resets when the network is pushed close to its limit, e.g., in an
environment with hundreds of nodes or namesapces.
In this commit, we do instead let the keepalive value be calculated and
stored in milliseconds, so that there is only one conversion and the
rounding error is eliminated.
We also remove a redundant "keepalive" field in struct tipc_link. This
is remnant from the previous implementation.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 88e8ac7000 ("tipc: reduce transmission rate of reset messages
when link is down") revealed a flaw in the node FSM, as defined in
the log of commit 66996b6c47 ("tipc: extend node FSM").
We see the following scenario:
1: Node B receives a RESET message from node A before its link endpoint
is fully up, i.e., the node FSM is in state SELF_UP_PEER_COMING. This
event will not change the node FSM state, but the (distinct) link FSM
will move to state RESETTING.
2: As an effect of the previous event, the local endpoint on B will
declare node A lost, and post the event SELF_DOWN to the its node
FSM. This moves the FSM state to SELF_DOWN_PEER_LEAVING, meaning
that no messages will be accepted from A until it receives another
RESET message that confirms that A's endpoint has been reset. This
is wasteful, since we know this as a fact already from the first
received RESET, but worse is that the link instance's FSM has not
wasted this information, but instead moved on to state ESTABLISHING,
meaning that it repeatedly sends out ACTIVATE messages to the reset
peer A.
3: Node A will receive one of the ACTIVATE messages, move its link FSM
to state ESTABLISHED, and start repeatedly sending out STATE messages
to node B.
4: Node B will consistently drop these messages, since it can only accept
accept a RESET according to its node FSM.
5: After four lost STATE messages node A will reset its link and start
repeatedly sending out RESET messages to B.
6: Because of the reduced send rate for RESET messages, it is very
likely that A will receive an ACTIVATE (which is sent out at a much
higher frequency) before it gets the chance to send a RESET, and A
may hence quickly move back to state ESTABLISHED and continue sending
out STATE messages, which will again be dropped by B.
7: GOTO 5.
8: After having repeated the cycle 5-7 a number of times, node A will
by chance get in between with sending a RESET, and the situation is
resolved.
Unfortunately, we have seen that it may take a substantial amount of
time before this vicious loop is broken, sometimes in the order of
minutes.
We correct this by making a small correction to the node FSM: When a
node in state SELF_UP_PEER_COMING receives a SELF_DOWN event, it now
moves directly back to state SELF_DOWN_PEER_DOWN, instead of as now
SELF_DOWN_PEER_LEAVING. This is logically consistent, since we don't
need to wait for RESET confirmation from of an endpoint that we alread
know has been reset. It also means that node B in the scenario above
will not be dropping incoming STATE messages, and the link can come up
immediately.
Finally, a symmetry comparison reveals that the FSM has a similar
error when receiving the event PEER_DOWN in state PEER_UP_SELF_COMING.
Instead of moving to PERR_DOWN_SELF_LEAVING, it should move directly
to SELF_DOWN_PEER_DOWN. Although we have never seen any negative effect
of this logical error, we choose fix this one, too.
The node FSM looks as follows after those changes:
+----------------------------------------+
| PEER_DOWN_EVT|
| |
+------------------------+----------------+ |
|SELF_DOWN_EVT | | |
| | | |
| +-----------+ +-----------+ |
| |NODE_ | |NODE_ | |
| +----------|FAILINGOVER|<---------|SYNCHING |-----------+ |
| |SELF_ +-----------+ FAILOVER_+-----------+ PEER_ | |
| |DOWN_EVT | A BEGIN_EVT A | DOWN_EVT| |
| | | | | | | |
| | | | | | | |
| | |FAILOVER_ |FAILOVER_ |SYNCH_ |SYNCH_ | |
| | |END_EVT |BEGIN_EVT |BEGIN_EVT|END_EVT | |
| | | | | | | |
| | | | | | | |
| | | +--------------+ | | |
| | +-------->| SELF_UP_ |<-------+ | |
| | +-----------------| PEER_UP |----------------+ | |
| | |SELF_DOWN_EVT +--------------+ PEER_DOWN_EVT| | |
| | | A A | | |
| | | | | | | |
| | | PEER_UP_EVT| |SELF_UP_EVT | | |
| | | | | | | |
V V V | | V V V
+------------+ +-----------+ +-----------+ +------------+
|SELF_DOWN_ | |SELF_UP_ | |PEER_UP_ | |PEER_DOWN |
|PEER_LEAVING| |PEER_COMING| |SELF_COMING| |SELF_LEAVING|
+------------+ +-----------+ +-----------+ +------------+
| | A A | |
| | | | | |
| SELF_ | |SELF_ |PEER_ |PEER_ |
| DOWN_EVT| |UP_EVT |UP_EVT |DOWN_EVT |
| | | | | |
| | | | | |
| | +--------------+ | |
|PEER_DOWN_EVT +--->| SELF_DOWN_ |<---+ SELF_DOWN_EVT|
+------------------->| PEER_DOWN |<--------------------+
+--------------+
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that we can properly support multiple distinct trees in the system,
using a global variable: dsa_cpu_port_ethtool_ops is getting clobbered
as soon as the second switch tree gets probed, and we don't want that.
We need to move this to be dynamically allocated, and since we can't
really be comparing addresses anymore to determine first time
initialization versus any other times, just move this to dsa.c and
dsa2.c where the remainder of the dst/ds initialization happens.
The operations teardown restores the master netdev's ethtool_ops to its
original ethtool_ops pointer (typically within the Ethernet driver)
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a helper function: dsa_cpu_port_ethtool_init() which initializes a
custom ethtool_ops structure with custom DSA ethtool operations for CPU
ports. This is a preliminary change to move the initialization outside
of net/dsa/slave.c.
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mimic what net/dsa/dsa.c does and provide a slave MII bus by default
which will be created if the driver implements a phy_read method.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some drivers rely on these two bitmasks to contain the correct values
for them to successfully probe and initialize at drv->setup() time,
calculate correct values to put in both masks as early as possible in
dsa_get_ports_dn().
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case we have multiples trees and switches with the same index, we
need to add another discriminating id: the switch tree.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
"make htmldocs" complains otherwise:
.//net/core/gen_stats.c:168: warning: No description found for parameter 'running'
.//include/linux/netdevice.h:1867: warning: No description found for parameter 'qdisc_running_key'
Fixes: f9eb8aea2a ("net_sched: transform qdisc running bit into a seqcount")
Fixes: edb09eb17e ("net: sched: do not acquire qdisc spinlock in qdisc/class stats dump")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
"make htmldocs" complains otherwise:
.//net/core/gen_stats.c:65: warning: No description found for parameter 'padattr'
.//net/core/gen_stats.c:101: warning: No description found for parameter 'padattr'
Fixes: 9854518ea0 ("sched: align nlattr properly when needed")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
At present we perform an xfrm_lookup() for each UDPv6 message we
send. The lookup involves querying the flow cache (flow_cache_lookup)
and, in case of a cache miss, creating an XFRM bundle.
If we miss the flow cache, we can end up creating a new bundle and
deriving the path MTU (xfrm_init_pmtu) from on an already transformed
dst_entry, which we pass from the socket cache (sk->sk_dst_cache) down
to xfrm_lookup(). This can happen only if we're caching the dst_entry
in the socket, that is when we're using a connected UDP socket.
To put it another way, the path MTU shrinks each time we miss the flow
cache, which later on leads to incorrectly fragmented payload. It can
be observed with ESPv6 in transport mode:
1) Set up a transformation and lower the MTU to trigger fragmentation
# ip xfrm policy add dir out src ::1 dst ::1 \
tmpl src ::1 dst ::1 proto esp spi 1
# ip xfrm state add src ::1 dst ::1 \
proto esp spi 1 enc 'aes' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b
# ip link set dev lo mtu 1500
2) Monitor the packet flow and set up an UDP sink
# tcpdump -ni lo -ttt &
# socat udp6-listen:12345,fork /dev/null &
3) Send a datagram that needs fragmentation with a connected socket
# perl -e 'print "@" x 1470 | socat - udp6:[::1]:12345
2016/06/07 18:52:52 socat[724] E read(3, 0x555bb3d5ba00, 8192): Protocol error
00:00:00.000000 IP6 ::1 > ::1: frag (0|1448) ESP(spi=0x00000001,seq=0x2), length 1448
00:00:00.000014 IP6 ::1 > ::1: frag (1448|32)
00:00:00.000050 IP6 ::1 > ::1: ESP(spi=0x00000001,seq=0x3), length 1272
(^ ICMPv6 Parameter Problem)
00:00:00.000022 IP6 ::1 > ::1: ESP(spi=0x00000001,seq=0x5), length 136
4) Compare it to a non-connected socket
# perl -e 'print "@" x 1500' | socat - udp6-sendto:[::1]:12345
00:00:40.535488 IP6 ::1 > ::1: frag (0|1448) ESP(spi=0x00000001,seq=0x6), length 1448
00:00:00.000010 IP6 ::1 > ::1: frag (1448|64)
What happens in step (3) is:
1) when connecting the socket in __ip6_datagram_connect(), we
perform an XFRM lookup, miss the flow cache, create an XFRM
bundle, and cache the destination,
2) afterwards, when sending the datagram, we perform an XFRM lookup,
again, miss the flow cache (due to mismatch of flowi6_iif and
flowi6_oif, which is an issue of its own), and recreate an XFRM
bundle based on the cached (and already transformed) destination.
To prevent the recreation of an XFRM bundle, avoid an XFRM lookup
altogether whenever we already have a destination entry cached in the
socket. This prevents the path MTU shrinkage and brings us on par with
UDPv4.
The fix also benefits connected PINGv6 sockets, another user of
ip6_sk_dst_lookup_flow(), who also suffer messages being transformed
twice.
Joint work with Hannes Frederic Sowa.
Reported-by: Jan Tluka <jtluka@redhat.com>
Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When in kdump kernel, reduce memory usage by only using a single Queue
Set for multiqueue devices. So make netif_get_num_default_rss_queues()
return one, when in kdump kernel.
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When setting up ILA in a router we noticed that the the encapsulation
is invoked twice: once in the route input path and again upon route
output. To resolve this we add a flag set_csum_neutral for the
ila_update_ipv6_locator. If this flag is set and the checksum
neutral bit is also set we assume that checksum-neutral translation
has already been performed and take no further action. The
flag is set only in ila_output path. The flag is not set for ila_input and
ila_xlat.
Tested:
Used 3 netns to set to emulate a router and two hosts. The router
translates SIR addresses between the two destinations in other two netns.
Verified ping and netperf are functional.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The of_find_net_device_by_node() function is defined in
<linux/of_net.h> but not included in the .c file that
implements it. Fix the following warning by including the
header:
net/core/net-sysfs.c:1494:19: warning: symbol 'of_find_net_device_by_node' was not declared. Should it be static?
Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC 5961 advises to only accept RST packets containing a seq number
matching the next expected seq number instead of the whole receive
window in order to avoid spoofing attacks.
However, this situation is not optimal in the case SACK is in use at the
time the RST is sent. I recently run into a scenario in which packet
losses were high while uploading data to a server, and userspace was
willing to frequently terminate connections by sending a RST. In
this case, the ACK sent on the receiver side (rcv_nxt) is frozen waiting
for a lost packet retransmission and SACK blocks are used to let the
client continue uploading data. At some point later on, the client sends
the RST (snd_nxt), which matches the next expected seq number of the
right-most SACK block on the receiver side which is going forward
receiving data.
In this scenario, as RFC 5961 defines, the RST SEQ doesn't match the
frozen main ACK at receiver side and thus gets dropped and a challenge
ACK is sent, which gets usually lost due to network conditions. The main
consequence is that the connection stays alive for a while even if it
made sense to accept the RST. This can get really bad if lots of
connections like this one are created in few seconds, allocating all the
resources of the server easily.
For security reasons, not all SACK blocks are checked (there could be a
big amount of SACK blocks => acceptable SEQ numbers). Furthermore, it
wouldn't make sense to check for RST in blocks other than the right-most
received one because the sender is not expected to be sending new data
after the RST. For simplicity, only up to the 4 most recently updated
SACK blocks (selective_acks[4] field) are compared to find the
right-most block, as usually those are the ones with bigger probability
to contain it.
This patch was tested in a 3.18 kernel and probed to improve the
situation in the scenario described above.
Signed-off-by: Pau Espin Pedrol <pau.espin@tessares.net>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The missing br_vlan_should_use() test caused creation of an unneeded
local fdb entry on changing mac address of a bridge device when there is
a vlan which is configured on a bridge port but not on the bridge
device.
Fixes: 2594e9064a ("bridge: vlan: add per-vlan struct and move to rhashtables")
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for locally originated traffic to VRF-local IPv6 addresses.
Similar to IPv4 a local dst is set on the skb and the packet is
reinserted with a call to netif_rx. With this patch, ping, tcp and udp
packets to a local IPv6 address are successfully routed:
$ ip addr show dev eth1
4: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000
link/ether 02:e0:f9:1c:b9:74 brd ff:ff:ff:ff:ff:ff
inet 10.100.1.1/24 brd 10.100.1.255 scope global eth1
valid_lft forever preferred_lft forever
inet6 2100:1::1/120 scope global
valid_lft forever preferred_lft forever
inet6 fe80::e0:f9ff:fe1c:b974/64 scope link
valid_lft forever preferred_lft forever
$ ping6 -c1 -I red 2100:1::1
ping6: Warning: source address might be selected on device other than red.
PING 2100:1::1(2100:1::1) from 2100:1::1 red: 56 data bytes
64 bytes from 2100:1::1: icmp_seq=1 ttl=64 time=0.098 ms
ip6_input is exported so the VRF driver can use it for the dst input
function. The dst_alloc function for IPv4 defaults to setting the input and
output functions; IPv6's does not. VRF does not need to duplicate the Rx path
so just export the ipv6 input function.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements direct encapsulation of IPv4 and IPv6 packets
in UDP. This is done a version "1" of GUE and as explained in I-D
draft-ietf-nvo3-gue-03.
Changes here are only in the receive path, fou with IPxIPx already
supports the transmit side. Both the normal receive path and
GRO path are modified to check for GUE version and check for
IP version in the case that GUE version is "1".
Tested:
IPIP with direct GUE encap
1 TCP_STREAM
4530 Mbps
200 TCP_RR
1297625 tps
135/232/444 90/95/99% latencies
IP4IP6 with direct GUE encap
1 TCP_STREAM
4903 Mbps
200 TCP_RR
1184481 tps
149/253/473 90/95/99% latencies
IP6IP6 direct GUE encap
1 TCP_STREAM
5146 Mbps
200 TCP_RR
1202879 tps
146/251/472 90/95/99% latencies
SIT with direct GUE encap
1 TCP_STREAM
6111 Mbps
200 TCP_RR
1250337 tps
139/241/467 90/95/99% latencies
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS fixes for net
The following patchset contains two Netfilter/IPVS fixes for your net
tree, they are:
1) Fix missing alignment in next offset calculation for standard
targets, introduced in the previous merge window, patch from
Florian Westphal.
2) Fix to correct the handling of outgoing connections which use the
SIP-pe such that the binding of a real-server is updated when needed.
This was an omission from changes introduced by Marco Angaroni in
the previous merge window too, to allow handling of outgoing
connections by the SIP-pe. Patch and report came via Simon Horman.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The v6 tcp stats scan do not provide TLP and ER timer information
correctly like the v4 version . This patch fixes that.
Fixes: 6ba8a3b19e ("tcp: Tail loss probe (TLP)")
Fixes: eed530b6c6 ("tcp: early retransmit")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When offloading classifiers such as u32 or flower to hardware, and the
qdisc is clsact (TC_H_CLSACT), then we need to differentiate its classes,
since not all of them handle ingress, therefore we must leave those in
software path. Add a .tcf_cl_offload() callback, so we can generically
handle them, tested on ixgbe.
Fixes: 10cbc68434 ("net/sched: cls_flower: Hardware offloaded filters statistics support")
Fixes: 5b33f48842 ("net/flower: Introduce hardware offload support")
Fixes: a1b7c5fd7f ("net: sched: add cls_u32 offload hooks for netdevs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Large tc dumps (tc -s {qdisc|class} sh dev ethX) done by Google BwE host
agent [1] are problematic at scale :
For each qdisc/class found in the dump, we currently lock the root qdisc
spinlock in order to get stats. Sampling stats every 5 seconds from
thousands of HTB classes is a challenge when the root qdisc spinlock is
under high pressure. Not only the dumps take time, they also slow
down the fast path (queue/dequeue packets) by 10 % to 20 % in some cases.
An audit of existing qdiscs showed that sch_fq_codel is the only qdisc
that might need the qdisc lock in fq_codel_dump_stats() and
fq_codel_dump_class_stats()
In v2 of this patch, I now use the Qdisc running seqcount to provide
consistent reads of packets/bytes counters, regardless of 32/64 bit arches.
I also changed rate estimators to use the same infrastructure
so that they no longer need to lock root qdisc lock.
[1]
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43838.pdf
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Kevin Athey <kda@google.com>
Cc: Xiaotian Pei <xiaotian@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of using a single bit (__QDISC___STATE_RUNNING)
in sch->__state, use a seqcount.
This adds lockdep support, but more importantly it will allow us
to sample qdisc/class statistics without having to grab qdisc root lock.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We properly scan the flow list to count number of packets,
but John passed 0 to gnet_stats_copy_queue() so we report
a zero value to user space instead of the result.
Fixes: 6401585366 ("net: sched: restrict use of qstats qlen")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Return an error if user requested skip-sw and the underlaying
hardware cannot handle tc offloads (or offloads are disabled).
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
'err' variable is not set in this test, we would return whatever
previous test set 'err' to.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Useful to know when the action was first used for accounting
(and debugging)
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to make a filter processed only by hardware, skip_sw flag
should be supplied. This is an addition to the already existing skip_hw
flag (filter will be processed by software only). If no flag is
specified, filter will be processed by both software and hardware.
If only hardware offloaded filters exist, fl_classify() will return
without doing anything.
A following userspace patch will be sent once kernel patch is accepted.
Example:
tc filter add dev enp0s9 protocol ip prio 20 parent ffff: \
flower \
ip_proto 6 \
indev enp0s9 \
skip_sw \
action skbedit mark 0x1234
Signed-off-by: Amir Vadai <amirva@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Note: Tom Herbert posted almost same patch 3 months back, but for
different reasons.
The reasons we want to get rid of this spin_trylock() are :
1) Under high qdisc pressure, the spin_trylock() has almost no
chance to succeed.
2) We loop multiple times in softirq handler, eventually reaching
the max retry count (10), and we schedule ksoftirqd.
Since we want to adhere more strictly to ksoftirqd being waked up in
the future (https://lwn.net/Articles/687617/), better avoid spurious
wakeups.
3) calls to __netif_reschedule() dirty the cache line containing
q->next_sched, slowing down the owner of qdisc.
4) RT kernels can not use the spin_trylock() here.
With help of busylock, we get the qdisc spinlock fast enough, and
the trylock trick brings only performance penalty.
Depending on qdisc setup, I observed a gain of up to 19 % in qdisc
performance (1016600 pps instead of 853400 pps, using prio+tbf+fq_codel)
("mpstat -I SCPU 1" is much happier now)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <tom@herbertland.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/rxrpc/rxkad.c:1165:1-3: WARNING: PTR_ERR_OR_ZERO can be used
Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR
Generated by: scripts/coccinelle/api/ptr_ret.cocci
CC: David Howells <dhowells@redhat.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The send path needs to be quiesced before resetting callbacks from
rds_tcp_accept_one(), and commit eb19284026 ("RDS:TCP: Synchronize
rds_tcp_accept_one with rds_send_xmit when resetting t_sock") achieves
this using the c_state and RDS_IN_XMIT bit following the pattern
used by rds_conn_shutdown(). However this leaves the possibility
of a race window as shown in the sequence below
take t_conn_lock in rds_tcp_conn_connect
send outgoing syn to peer
drop t_conn_lock in rds_tcp_conn_connect
incoming from peer triggers rds_tcp_accept_one, conn is
marked CONNECTING
wait for RDS_IN_XMIT to quiesce any rds_send_xmit threads
call rds_tcp_reset_callbacks
[.. race-window where incoming syn-ack can cause the conn
to be marked UP from rds_tcp_state_change ..]
lock_sock called from rds_tcp_reset_callbacks, and we set
t_sock to null
As soon as the conn is marked UP in the race-window above, rds_send_xmit()
threads will proceed to rds_tcp_xmit and may encounter a null-pointer
deref on the t_sock.
Given that rds_tcp_state_change() is invoked in softirq context, whereas
rds_tcp_reset_callbacks() is in workq context, and testing for RDS_IN_XMIT
after lock_sock could result in a deadlock with tcp_sendmsg, this
commit fixes the race by using a new c_state, RDS_TCP_RESETTING, which
will prevent a transition to RDS_CONN_UP from rds_tcp_state_change().
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we switch a connection's sockets in rds_tcp_rest_callbacks,
any partially sent datagram must be retransmitted on the new
socket so that the receiver can correctly reassmble the RDS
datagram. Use rds_send_reset() which is designed for this purpose.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When rds_tcp_accept_one() has to replace the existing tcp socket
with a newer tcp socket (duelling-syn resolution), it must lock_sock()
to suppress the rds_tcp_data_recv() path while callbacks are being
changed. Also, existing RDS datagram reassembly state must be reset,
so that the next datagram on the new socket does not have corrupted
state. Similarly when resetting the newly accepted socket, appropriate
locks and synchronization is needed.
This commit ensures correct synchronization by invoking
kernel_sock_shutdown to reset a newly accepted sock, and by taking
appropriate lock_sock()s (for old and new sockets) when resetting
existing callbacks.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
My prior attempt to fix the backlogs of parents failed.
If we return NET_XMIT_CN, our parents wont increase their backlog,
so our qdisc_tree_reduce_backlog() should take this into account.
v2: Florian Westphal pointed out that we could drop the packet,
so we need to save qdisc_pkt_len(skb) in a temp variable before
calling fq_codel_drop()
Fixes: 9d18562a22 ("fq_codel: add batch ability to fq_codel_drop()")
Fixes: 2ccccf5fb4 ("net_sched: update hierarchical backlog too")
Reported-by: Stas Nichiporovich <stasn77@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The expectation table is not duplicated per net namespace anymore, so we can move
the expectation table and conntrack table iteration out of the per-net loop.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
checkpatch produces data type 'checks'.
This patch amends them by changing, for example:
uint8_t -> u8
Signed-off-by: Tobin C Harding <me@tobin.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Much like we had to do for AF_BLUETOOTH and AF_ALG, make sure we
properly clone the parent socket's LSM attributes to newly created
child sockets.
Signed-off-by: Paul Moore <paul@paul-moore.com>
For gso_skb we only update qlen, backlog should be updated too.
Note, it is correct to just update these stats at one layer,
because the gso_skb is cached there.
Reported-by: Stas Nichiporovich <stasn77@gmail.com>
Fixes: 2ccccf5fb4 ("net_sched: update hierarchical backlog too")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 538950a1b7 ("soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF")
missed to add the compat case for the SO_ATTACH_REUSEPORT_CBPF option.
Signed-off-by: Helge Deller <deller@gmx.de>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
It seems risky to always rely on the caller to ensure the socket's
address family is correct before passing it to the NetLabel kAPI,
especially since we see at least one LSM which didn't. Add address
family checks to the *_delattr() functions to help prevent future
problems.
Cc: <stable@vger.kernel.org>
Reported-by: Maninder Singh <maninder1.s@samsung.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
The capability check should not be audited since it is only being used
to determine the inode permissions. A failed check does not indicate a
violation of security policy but, when an LSM is enabled, a denial audit
message was being generated.
The denial audit message caused confusion for some application authors
because root-running Go applications always triggered the denial. To
prevent this confusion, the capability check in net_ctl_permissions() is
switched to the noaudit variant.
BugLink: https://launchpad.net/bugs/1465724
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: James Morris <james.l.morris@oracle.com>
Before commit 6d7b857d54 ("net: use lib/percpu_counter API for
fragmentation mem accounting"), setting the reassembly high threshold
to 0 prevented fragment reassembly as first fragment would be always
evicted before second could be added to the queue. While inefficient,
some users apparently relied on this method.
Since the commit mentioned above, a percpu counter is used for
reassembly memory accounting and high batch size avoids taking slow path
in most common scenarios. As a result, a whole full sized packet can be
reassembled without the percpu counter's main counter changing its value
so that even with high_thresh set to 0, fragmented packets can be still
reassembled and processed.
Add explicit check preventing reassembly if high threshold is zero.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previous patch that introduced handling of outgoing packets in SIP
persistent-engine did not call ip_vs_check_template() in case packet was
matching a connection template. Assumption was that real-server was
healthy, since it was sending a packet just in that moment.
There are however real-server fault conditions requiring that association
between call-id and real-server (represented by connection template)
gets updated. Here is an example of the sequence of events:
1) RS1 is a back2back user agent that handled call-id1 and call-id2
2) RS1 is down and was marked as unavailable
3) new message from outside comes to IPVS with call-id1
4) IPVS reschedules the message to RS2, which becomes new call handler
5) RS2 forwards the message outside, translating call-id1 to call-id2
6) inside pe->conn_out() IPVS matches call-id2 with existing template
7) IPVS does not change association call-id2 <-> RS1
8) new message comes from client with call-id2
9) IPVS reschedules the message to a real-server potentially different
from RS2, which is now the correct destination
This patch introduces ip_vs_check_template() call in the handling of
outgoing packets for SIP-pe. And also introduces a second optional
argument for ip_vs_check_template() that allows to check if dest
associated to a connection template is the same dest that was identified
as the source of the packet. This is to change the real-server bound to a
particular call-id independently from its availability status: the idea
is that it's more reliable, for in->out direction (where internal
network can be considered trusted), to always associate a call-id with
the last real-server that used it in one of its messages. Think about
above sequence of events where, just after step 5, RS1 returns instead
to be available.
Comparison of dests is done by simply comparing pointers to struct
ip_vs_dest; there should be no cases where struct ip_vs_dest keeps its
memory address, but represent a different real-server in terms of
ip-address / port.
Fixes: 39b9722315 ("ipvs: handle connections started by real-servers")
Signed-off-by: Marco Angaroni <marcoangaroni@gmail.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
The existing DSA binding has a number of limitations and problems. The
main problem is that it cannot represent a switch as a linux device,
hanging off some bus. It is limited to one CPU port. The DSA platform
device is artificial, and does not really represent hardware.
Implement a new binding which can be embedded into any type of node on
a bus to represent one switch device, and its links to other switches.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The switch may want to instantiate its own MDIO bus. Only do it
centrally if the switch has not already created one, and the read op
is implemented.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace the two switch statements with an array lookup, and store the
result in the dsa tree structure. The drivers no longer need to know
the selected tag protocol, so remove it from the dsa switch structure.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor the code to setup a single DSA/CPU port into a function of
its own, and export it, so it can be used by the new binding.
Similarly, refactor the destroy code into a function. When destroying
the ports, don't put the of node. They should be released at the end
along with the normal ports.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The new binding will not have a chip data structure, it will place the
routing directly into the switch structure. To enable backwards
compatibility, copy the routing from the chip data into the switch
structure.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With a maximum of four switches, the size of the routing table is the
same as the pointer to it. Removing it makes the code simpler.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the port device node structure into the port structure, from the
chip data. This information is needed in the next step of implementing
the new binding.
The chip data structure is used while parsing the whole old binding,
before the individual switch structures exist. With the new bindings,
this is reversed, the switches exist first, and the interconnections
between the switches is derived from the individual switch
bindings. Thus this chip data structure becomes unneeded.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
eviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are going to be more per-port members added to the switch
structure. So add a port structure and move the netdev into it.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The platform data nr_chips is used when validating a received packet,
to ensure it comes from a know switch chip. The number of possible
switches is limited to DSA_MAX_SWITCHES, so use this as the first
validation step. The new binding allows holes in the dst->ds[] array,
so also ensure ensure there is a valid dsa_switch for this packet.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The DSA layer should no longer assume the switch is connected to an
MDIO bus. As a result, we cannot use the address on the MDIO bus when
forming the name of the switches internal MDIO bus for its builtin and
possibly external PHYs. The switch index is sufficient to make the
name unique, so drop the MDIO address.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The new binding does not make use of dsa_chip_data, a.k.a cd. When
retrieving the size of the EEPROM attached to a switch, don't assume
there is a cd attached to the switch structure.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull Ceph fixes from Sage Weil:
"We have a few follow-up fixes for the libceph refactor from Ilya, and
then some cephfs + fscache fixes from Zheng.
The first two FS-Cache patches are acked by David Howells and deemed
trivial enough to go through our tree. The rest fix some issues with
the ceph fscache handling (disable cache for inodes opened for write,
and simplify the revalidation logic accordingly, dropping the
now-unnecessary work queue)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: use i_version to check validity of fscache
ceph: improve fscache revalidation
ceph: disable fscache when inode is opened for write
ceph: avoid unnecessary fscache invalidation/revlidation
ceph: call __fscache_uncache_page() if readpages fails
FS-Cache: make check_consistency callback return int
FS-Cache: wake write waiter after invalidating writes
libceph: use %s instead of %pE in dout()s
libceph: put request only if it's done in handle_reply()
libceph: change ceph_osdmap_flag() to take osdc
size_t objects should be printed with %Z printf format.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the more common kernel logging style and reduce object size.
The logging message prefix changes from a mixture of
"RxRPC:" and "RXRPC:" to "af_rxrpc: ".
$ size net/rxrpc/built-in.o*
text data bss dec hex filename
64172 1972 8304 74448 122d0 net/rxrpc/built-in.o.new
67512 1972 8304 77788 12fdc net/rxrpc/built-in.o.old
Miscellanea:
o Consolidate the ASSERT macros to use a single pr_err call with
decimal and hexadecimal output and a stringified #OP argument
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is useful for debugging packet sizes.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Tested-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SCTP has this pecualiarity that its packets cannot be just segmented to
(P)MTU. Its chunks must be contained in IP segments, padding respected.
So we can't just generate a big skb, set gso_size to the fragmentation
point and deliver it to IP layer.
This patch takes a different approach. SCTP will now build a skb as it
would be if it was received using GRO. That is, there will be a cover
skb with protocol headers and children ones containing the actual
segments, already segmented to a way that respects SCTP RFCs.
With that, we can tell skb_segment() to just split based on frag_list,
trusting its sizes are already in accordance.
This way SCTP can benefit from GSO and instead of passing several
packets through the stack, it can pass a single large packet.
v2:
- Added support for receiving GSO frames, as requested by Dave Miller.
- Clear skb->cb if packet is GSO (otherwise it's not used by SCTP)
- Added heuristics similar to what we have in TCP for not generating
single GSO packets that fills cwnd.
v3:
- consider sctphdr size in skb_gso_transport_seglen()
- rebased due to 5c7cdf339a ("gso: Remove arbitrary checks for
unsupported GSO")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Tested-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is a preparation for the GSO one. In order to successfully
handle GSO packets on rx path we must not call skb_linearize, otherwise
it defeats any gain GSO may have had.
This patch thus delays as much as possible the call to skb_linearize,
leaving it to sctp_inq_pop() moment. For that the sanity checks
performed now know how to deal with fragments.
One positive side-effect of this is that if the socket is backlogged it
will have the chance of doing it on backlog processing instead of
during softirq.
With this move, it's evident that a check for non-linearity in
sctp_inq_pop was ineffective and is now removed. Note that a similar
check is performed a bit below this one.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Tested-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_gso_network_seglen is not enough for checking fragment sizes if
skb is using GSO_BY_FRAGS as we have to check frag per frag.
This patch introduces skb_gso_validate_mtu, based on the former, which
will wrap the use case inside it as all calls to skb_gso_network_seglen
were to validate if it fits on a given TMU, and improve the check.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Tested-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows segmenting a skb based on its frags sizes instead of
based on a fixed value.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Tested-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sctp GSO requires it and sctp can be compiled as a module, so we need to
export this function.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Tested-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes: 2ccccf5fb4 ("net_sched: update hierarchical backlog too")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes: 2ccccf5fb4 ("net_sched: update hierarchical backlog too")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes: 2ccccf5fb4 ("net_sched: update hierarchical backlog too")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to update backlog too when we update qlen.
Joint work with Stas.
Reported-by: Stas Nichiporovich <stasn77@gmail.com>
Tested-by: Stas Nichiporovich <stasn77@gmail.com>
Fixes: 2ccccf5fb4 ("net_sched: update hierarchical backlog too")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
hfsc updates backlog lazily, that is only when we
dump the stats. This is problematic after we begin to
update backlog in qdisc_tree_reduce_backlog().
Reported-by: Stas Nichiporovich <stasn77@gmail.com>
Tested-by: Stas Nichiporovich <stasn77@gmail.com>
Fixes: 2ccccf5fb4 ("net_sched: update hierarchical backlog too")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The last field "flags" of object "minfo" is not initialized.
Copying this object out may leak kernel stack data.
Assign 0 to it to avoid leak.
Signed-off-by: Kangjie Lu <kjlu@gatech.edu>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
link_info.str is a char array of size 60. Memory after the NULL
byte is not initialized. Sending the whole object out can cause
a leak.
Signed-off-by: Kangjie Lu <kjlu@gatech.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paul Moore tracked a regression caused by a recent commit, which
mistakenly assumed that sk_filter() could be avoided if socket
had no current BPF filter.
The intent was to avoid udp_lib_checksum_complete() overhead.
But sk_filter() also checks skb_pfmemalloc() and
security_sock_rcv_skb(), so better call it.
Fixes: e6afc8ace6 ("udp: remove headers from UDP packets before queueing")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Paul Moore <paul@paul-moore.com>
Tested-by: Paul Moore <paul@paul-moore.com>
Tested-by: Stephen Smalley <sds@tycho.nsa.gov>
Cc: samanthakumar <samanthakumar@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Set name_assign_type of internal port to NET_NAME_USER.
Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Quoting John Stultz:
In updating a 32bit arm device from 4.6 to Linus' current HEAD, I
noticed I was having some trouble with networking, and realized that
/proc/net/ip_tables_names was suddenly empty.
Digging through the registration process, it seems we're catching on the:
if (strcmp(t->u.user.name, XT_STANDARD_TARGET) == 0 &&
target_offset + sizeof(struct xt_standard_target) != next_offset)
return -EINVAL;
Where next_offset seems to be 4 bytes larger then the
offset + standard_target struct size.
next_offset needs to be aligned via XT_ALIGN (so we can access all members
of ip(6)t_entry struct).
This problem didn't show up on i686 as it only needs 4-byte alignment for
u64, but iptables userspace on other 32bit arches does insert extra padding.
Reported-by: John Stultz <john.stultz@linaro.org>
Tested-by: John Stultz <john.stultz@linaro.org>
Fixes: 7ed2abddd2 ("netfilter: x_tables: check standard target size too")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for your net tree,
they are:
1) Fix incorrect timestamp in nfnetlink_queue introduced when addressing
y2038 safe timestamp, from Florian Westphal.
2) Get rid of leftover conntrack definition from the previous merge
window, oneliner from Florian.
3) Make nf_queue handler pernet to resolve race on dereferencing the
hook state structure with netns removal, from Eric Biederman.
4) Ensure clean exit on unregistered helper ports, from Taehee Yoo.
5) Restore FLOWI_FLAG_KNOWN_NH in nf_dup_ipv6. This got lost while
generalizing xt_TEE to add packet duplication support in nf_tables,
from Paolo Abeni.
6) Insufficient netlink NFTA_SET_TABLE attribute check in
nf_tables_getset(), from Phil Turnbull.
7) Reject helper registration on duplicated ports via modparams.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Fix negative error code usage in ATM layer, from Stefan Hajnoczi.
2) If CONFIG_SYSCTL is disabled, the default TTL is not initialized
properly. From Ezequiel Garcia.
3) Missing spinlock init in mvneta driver, from Gregory CLEMENT.
4) Missing unlocks in hwmb error paths, also from Gregory CLEMENT.
5) Fix deadlock on team->lock when propagating features, from Ivan
Vecera.
6) Work around buffer offset hw bug in alx chips, from Feng Tang.
7) Fix double listing of SCTP entries in sctp_diag dumps, from Xin
Long.
8) Various statistics bug fixes in mlx4 from Eric Dumazet.
9) Fix some randconfig build errors wrt fou ipv6 from Arnd Bergmann.
10) All of l2tp was namespace aware, but the ipv6 support code was not
doing so. From Shmulik Ladkani.
11) Handle on-stack hrtimers properly in pktgen, from Guenter Roeck.
12) Propagate MAC changes properly through VLAN devices, from Mike
Manning.
13) Fix memory leak in bnx2x_init_one(), from Vitaly Kuznetsov.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (62 commits)
sfc: Track RPS flow IDs per channel instead of per function
usbnet: smsc95xx: fix link detection for disabled autonegotiation
virtio_net: fix virtnet_open and virtnet_probe competing for try_fill_recv
bnx2x: avoid leaking memory on bnx2x_init_one() failures
fou: fix IPv6 Kconfig options
openvswitch: update checksum in {push,pop}_mpls
sctp: sctp_diag should dump sctp socket type
net: fec: update dirty_tx even if no skb
vlan: Propagate MAC address to VLANs
atm: iphase: off by one in rx_pkt()
atm: firestream: add more reserved strings
vxlan: Accept user specified MTU value when create new vxlan link
net: pktgen: Call destroy_hrtimer_on_stack()
timer: Export destroy_hrtimer_on_stack()
net: l2tp: Make l2tp_ip6 namespace aware
Documentation: ip-sysctl.txt: clarify secure_redirects
sfc: use flow dissector helpers for aRFS
ieee802154: fix logic error in ieee802154_llsec_parse_dev_addr
net: nps_enet: Disable interrupts before napi reschedule
net/lapb: tuse %*ph to dump buffers
...
The Kconfig options I added to work around broken compilation ended
up screwing up things more, as I used the wrong symbol to control
compilation of the file, resulting in IPv6 fou support to never be built
into the kernel.
Changing CONFIG_NET_FOU_IPV6_TUNNELS to CONFIG_IPV6_FOU fixes that
problem, I had renamed the symbol in one location but not the other,
and as the file is never being used by other kernel code, this did not
lead to a build failure that I would have caught.
After that fix, another issue with the same patch becomes obvious, as we
'select INET6_TUNNEL', which is related to IPV6_TUNNEL, but not the same,
and this can still cause the original build failure when IPV6_TUNNEL is
not built-in but IPV6_FOU is. The fix is equally trivial, we just need
to select the right symbol.
I have successfully build 350 randconfig kernels with this patch
and verified that the driver is now being built.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reported-by: Valentin Rothberg <valentinrothberg@gmail.com>
Fixes: fabb13db44 ("fou: add Kconfig options for IPv6 support")
Signed-off-by: David S. Miller <davem@davemloft.net>
In the case of CHECKSUM_COMPLETE the skb checksum should be updated in
{push,pop}_mpls() as they the type in the ethernet header.
As suggested by Pravin Shelar.
Cc: Pravin Shelar <pshelar@nicira.com>
Fixes: 25cd9ba0ab ("openvswitch: Add basic MPLS support to kernel")
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now we cannot distinguish that one sk is a udp or sctp style when
we use ss to dump sctp_info. it's necessary to dump it as well.
For sctp_diag, ss support is not officially available, thus there
are no official users of this yet, so we can add this field in the
middle of sctp_info without breaking user API.
v1->v2:
- move 'sctpi_s_type' field to the end of struct sctp_info, so
that it won't cause incompatibility with applications already
built.
- add __reserved3 in sctp_info to make sure sctp_info is 8-byte
alignment.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The MAC address of the physical interface is only copied to the VLAN
when it is first created, resulting in an inconsistency after MAC
address changes of only newly created VLANs having an up-to-date MAC.
The VLANs should continue inheriting the MAC address of the physical
interface until the VLAN MAC address is explicitly set to any value.
This allows IPv6 EUI64 addresses for the VLAN to reflect any changes
to the MAC of the physical interface and thus for DAD to behave as
expected.
Signed-off-by: Mike Manning <mmanning@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If CONFIG_DEBUG_OBJECTS_TIMERS=y, hrtimer_init_on_stack() requires
a matching call to destroy_hrtimer_on_stack() to clean up timer
debug objects.
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The driver extended capabilities may differ for different
interface types which the userspace needs to know (for
example the fine timing measurement initiator and responder
bits might differ for a station and AP). Add a new nl80211
attribute to provide extended capabilities per interface type
to userspace.
Signed-off-by: Vidyullatha Kanchanapally <vkanchan@qti.qualcomm.com>
Reviewed-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Previously, the status parameter to cfg80211_connect_result() was
documented as using WLAN_STATUS_UNSPECIFIED_FAILURE (1) when the real
status code for the failure is not known. This value can be used by an
AP (and often is) and as such, user space cannot distinguish between
explicitly rejected authentication/association and not being able to
even try to associate or not receiving a response from the AP.
Add a new inline function, cfg80211_connect_timeout(), to be used when
the driver knows that the connection attempt failed due to a reason
where connection could not be attempt or no response was received from
the AP. The internal functions now allow a negative status value (-1) to
be used as an indication of this special case. This results in the
NL80211_ATTR_TIMED_OUT to be added to the NL80211_CMD_CONNECT event to
allow user space to determine this case was hit. For backwards
compatibility, NL80211_STATUS_CODE with the value
WLAN_STATUS_UNSPECIFIED_FAILURE is still indicated in the event in such
a case.
Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
[johannes: fix cfg80211_connect_bss() prototype to use int for status,
add cfg80211_connect_timeout() to docbook, fix docbook]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This makes it a lot easier to understand the capabilities used
by the station:
VHT supported
cap: 0x300819b2
MAX-MPDU-11454
80Mhz
RXLDPC
SHORT-GI-80
TXSTBC
RXSTBC_1
SU-BEAMFORMER-CAPABLE
SU-BEAMFORMEE-CAPABLE
BEAMFORMEE-STS: 0x0
SOUNDING-DIMENSIONS: 0x0
MU-BEAMFORMER-CAPABLE
MPDU-LENGTH-EXPONENT: 0x0
LINK-ADAPTATION-VHT-MRQ-MFB: 0x0
RX-ANTENNA-PATTERN
TX-ANTENNA-PATTERN
RX MCS: fffe
TX MCS: fffe
Signed-off-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The header field is defined as u8[] but also accessed as struct
ieee80211_hdr. Enforce an alignment of 2 to prevent unnecessary
unaligned accesses, which can be very harmful for performance on many
platforms.
Fixes: e495c24731 ("mac80211: extend fast-xmit for more ciphers")
Cc: stable@vger.kernel.org
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Don't allow registration of helpers using the same tuple:
{ l3proto, l4proto, src-port }
We lookup for the helper from the packet path using this tuple through
__nf_ct_helper_find(). Therefore, we have to avoid having two helpers
with the same tuple to ensure predictible behaviour.
Don't compare the helper string names anymore since it is valid to
register two helpers with the same name, but using different tuples.
This is also implicitly fixing up duplicated helper registration via
ports= modparam since the name comparison was defeating the tuple
duplication validation.
Reported-by: Feng Gao <gfree.wind@gmail.com>
Reported-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
While a wiphy can be transferred to network namespaces, a process having
CAP_NET_ADMIN in a non-initial user namespace can not administrate such
devices due to the genetlink GENL_ADMIN_PERM restrictions.
For openvswitch having the same issue, a new GENL_UNS_ADMIN_PERM flag has
been introduced, commit 4a92602aa1 ("openvswitch: allow management from
inside user namespaces"). This patch changes all privileged operations
operating on a wiphy, dev or wdev to allow their administration using the
same mechanism. All operations use either NEED_WIPHY, NEED_WDEV or
NEED_NETDEV, which implies a namespace aware lookup of the device. The only
exception is NL80211_CMD_SET_WIPHY, which explicitly uses a namespace aware
phy lookup.
Signed-off-by: Martin Willi <martin@strongswan.org>
[also allow cancel scan, for completeness]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Commit d30291b985 ("libceph: variable-sized ceph_object_id") changed
dout()s in what is now encode_request() and ceph_object_locator_to_pg()
to use %pE, mostly to document that, although all rbd and cephfs object
names are NULL-terminated strings, ceph_object_id will handle any RADOS
object name, including the one containing NULs, just fine.
However, it turns out that vbin_printf() can't handle anything but ints
and %s - all %p suffixes are ignored. The buffer %p** points to isn't
recorded, resulting in trash in the messages if the buffer had been
reused by the time bstr_printf() got to it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
handle_reply() may be called twice on the same request: on ack and then
on commit. This occurs on btrfs-formatted OSDs or if cephfs sync write
path is triggered - CEPH_OSD_FLAG_ACK | CEPH_OSD_FLAG_ONDISK.
handle_reply() handles this with the help of done_request().
Fixes: 5aea3dcd50 ("libceph: a major OSD client update")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
For the benefit of every single caller, take osdc instead of map.
Also, now that osdc->osdmap can't ever be NULL, drop the check.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If the NFTA_SET_TABLE parameter is missing and the NLM_F_DUMP flag is
not set, then a NULL pointer dereference is triggered in
nf_tables_set_lookup because ctx.table is NULL.
Signed-off-by: Phil Turnbull <phil.turnbull@oracle.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
With the commit 48e8aa6e31 ("ipv6: Set FLOWI_FLAG_KNOWN_NH at
flowi6_flags") ip6_pol_route() callers were asked to to set the
FLOWI_FLAG_KNOWN_NH properly and xt_TEE was updated accordingly,
but with the later refactor in commit bbde9fc182 ("netfilter:
factor out packet duplication for IPv4/IPv6") the flowi6_flags
update was lost.
This commit re-add it just before the routing decision.
Fixes: bbde9fc182 ("netfilter: factor out packet duplication for IPv4/IPv6")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
helpers should unregister the only registered ports.
but, helper cannot have correct registered ports value when
failed to register.
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
l2tp_ip6 tunnel and session lookups were still using init_net, although
the l2tp core infrastructure already supports lookups keyed by 'net'.
As a result, l2tp_ip6_recv discarded packets for tunnels/sessions
created in namespaces other than the init_net.
Fix, by using dev_net(skb->dev) or sock_net(sk) where appropriate.
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix a logic error to avoid potential null pointer dereference.
Signed-off-by: Baozeng Ding <sploving1@gmail.com>
Reviewed-by: Stefan Schmidt<stefan@osg.samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use %*ph specifier to dump small buffers in hex format instead doing this
byte-by-byte.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A previous patch added the fou6.ko module, but that failed to link
in a couple of configurations:
net/built-in.o: In function `ip6_tnl_encap_add_fou_ops':
net/ipv6/fou6.c:88: undefined reference to `ip6_tnl_encap_add_ops'
net/ipv6/fou6.c:94: undefined reference to `ip6_tnl_encap_add_ops'
net/ipv6/fou6.c:97: undefined reference to `ip6_tnl_encap_del_ops'
net/built-in.o: In function `ip6_tnl_encap_del_fou_ops':
net/ipv6/fou6.c:106: undefined reference to `ip6_tnl_encap_del_ops'
net/ipv6/fou6.c:107: undefined reference to `ip6_tnl_encap_del_ops'
If CONFIG_IPV6=m, ip6_tnl_encap_add_ops/ip6_tnl_encap_del_ops
are in a module, but fou6.c can still be built-in, and that
obviously fails to link.
Also, if CONFIG_IPV6=y, but CONFIG_IPV6_TUNNEL=m or
CONFIG_IPV6_TUNNEL=n, the same problem happens for a different
reason.
This adds two new silent Kconfig symbols to work around both
problems:
- CONFIG_IPV6_FOU is now always set to 'm' if either CONFIG_NET_FOU=m
or CONFIG_IPV6=m
- CONFIG_IPV6_FOU_TUNNEL is set implicitly when IPV6_FOU is enabled
and NET_FOU_IP_TUNNELS is also turned out, and it will ensure
that CONFIG_IPV6_TUNNEL is also available.
The options could be made user-visible as well, to give additional
room for configuration, but it seems easier not to bother users
with more choice here.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: aa3463d65e ("fou: Add encap ops for IPv6 tunnels")
Signed-off-by: David S. Miller <davem@davemloft.net>
Most users of IS_ERR_VALUE() in the kernel are wrong, as they
pass an 'int' into a function that takes an 'unsigned long'
argument. This happens to work because the type is sign-extended
on 64-bit architectures before it gets converted into an
unsigned type.
However, anything that passes an 'unsigned short' or 'unsigned int'
argument into IS_ERR_VALUE() is guaranteed to be broken, as are
8-bit integers and types that are wider than 'unsigned long'.
Andrzej Hajda has already fixed a lot of the worst abusers that
were causing actual bugs, but it would be nice to prevent any
users that are not passing 'unsigned long' arguments.
This patch changes all users of IS_ERR_VALUE() that I could find
on 32-bit ARM randconfig builds and x86 allmodconfig. For the
moment, this doesn't change the definition of IS_ERR_VALUE()
because there are probably still architecture specific users
elsewhere.
Almost all the warnings I got are for files that are better off
using 'if (err)' or 'if (err < 0)'.
The only legitimate user I could find that we get a warning for
is the (32-bit only) freescale fman driver, so I did not remove
the IS_ERR_VALUE() there but changed the type to 'unsigned long'.
For 9pfs, I just worked around one user whose calling conventions
are so obscure that I did not dare change the behavior.
I was using this definition for testing:
#define IS_ERR_VALUE(x) ((unsigned long*)NULL == (typeof (x)*)NULL && \
unlikely((unsigned long long)(x) >= (unsigned long long)(typeof(x))-MAX_ERRNO))
which ends up making all 16-bit or wider types work correctly with
the most plausible interpretation of what IS_ERR_VALUE() was supposed
to return according to its users, but also causes a compile-time
warning for any users that do not pass an 'unsigned long' argument.
I suggested this approach earlier this year, but back then we ended
up deciding to just fix the users that are obviously broken. After
the initial warning that caused me to get involved in the discussion
(fs/gfs2/dir.c) showed up again in the mainline kernel, Linus
asked me to send the whole thing again.
[ Updated the 9p parts as per Al Viro - Linus ]
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Andrzej Hajda <a.hajda@samsung.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lkml.org/lkml/2016/1/7/363
Link: https://lkml.org/lkml/2016/5/27/486
Acked-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org> # For nvmem part
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull Ceph updates from Sage Weil:
"This changeset has a few main parts:
- Ilya has finished a huge refactoring effort to sync up the
client-side logic in libceph with the user-space client code, which
has evolved significantly over the last couple years, with lots of
additional behaviors (e.g., how requests are handled when cluster
is full and transitions from full to non-full).
This structure of the code is more closely aligned with userspace
now such that it will be much easier to maintain going forward when
behavior changes take place. There are some locking improvements
bundled in as well.
- Zheng adds multi-filesystem support (multiple namespaces within the
same Ceph cluster)
- Zheng has changed the readdir offsets and directory enumeration so
that dentry offsets are hash-based and therefore stable across
directory fragmentation events on the MDS.
- Zheng has a smorgasbord of bug fixes across fs/ceph"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (71 commits)
ceph: fix wake_up_session_cb()
ceph: don't use truncate_pagecache() to invalidate read cache
ceph: SetPageError() for writeback pages if writepages fails
ceph: handle interrupted ceph_writepage()
ceph: make ceph_update_writeable_page() uninterruptible
libceph: make ceph_osdc_wait_request() uninterruptible
ceph: handle -EAGAIN returned by ceph_update_writeable_page()
ceph: make fault/page_mkwrite return VM_FAULT_OOM for -ENOMEM
ceph: block non-fatal signals for fault/page_mkwrite
ceph: make logical calculation functions return bool
ceph: tolerate bad i_size for symlink inode
ceph: improve fragtree change detection
ceph: keep leaf frag when updating fragtree
ceph: fix dir_auth check in ceph_fill_dirfrag()
ceph: don't assume frag tree splits in mds reply are sorted
ceph: fix inode reference leak
ceph: using hash value to compose dentry offset
ceph: don't forbid marking directory complete after forward seek
ceph: record 'offset' for each entry of readdir result
ceph: define 'end/complete' in readdir reply as bit flags
...
Highlights include:
Features:
- Add support for the NFS v4.2 COPY operation
- Add support for NFS/RDMA over IPv6
Bugfixes and cleanups:
- Avoid race that crashes nfs_init_commit()
- Fix oops in callback path
- Fix LOCK/OPEN race when unlinking an open file
- Choose correct stateids when using delegations in setattr, read and write
- Don't send empty SETATTR after OPEN_CREATE
- xprtrdma: Prevent server from writing a reply into memory client has released
- xprtrdma: Support using Read list and Reply chunk in one RPC call
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJXRu76AAoJENfLVL+wpUDrDVoQAKPKv1tEVJMRUQA3UVoKoixd
KjmmZMjl6GfpISwTZl+a8W549jyGuYH7Gl8vSbMaE9/FI+kJW6XZQniTYfFqY8/a
LbMSdNx1+yURisbkyO0vPqqwKw9r6UmsfGeUT8SpS3ff61yp4Oj436ra2qcPJsZ3
cWl/lHItzX7oKFAWmr0Nmq2X8ac/8+NFyK29+V/QGfwtp3qAPbpA8XM5HrHw3rA2
uk5uNSr3hwqz7P3+Hi7ZoO2m4nQTAbQnEunfYpxlOwz4IaM7qcGnntT6Jhwq1pGE
/1YasG7bHeiWjhynmZZ4CWuMkogau2UJ/G68Cz7ehLhPNr8rH/ZFCJZ+XX0e0CgI
1d+AwxZvgszIQVBY3S7sg8ezVSCPBXRFJ8rtzggGscqC53aP7L+rLfUFH+OKrhMg
6n7RQiq4EmGDJGviB/R2HixI9CpdOf2puNhDKSJmPOqiSS7UuHMw8QCq++vdru+1
GLGunGyO7D70yTV92KtsdzJlFlnfa/g+FIJrmaMpL3HH1h0stTctWX5xlTYmqEL3
z3aUuT8RySk2t1FTabSj6KRWqE/krK5BMZbX91kpF27WL4c/olXFaZPqBDsj0q4u
2rm1fIrc8RxLXctJan9ro092s/e9dup/1JxV5XWMq/EGS1ezvf+0XkCOtURaAWp3
2aPHlx7M8iuq2SouL6f7
=QMmY
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.7-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client updates from Anna Schumaker:
"Highlights include:
Features:
- Add support for the NFS v4.2 COPY operation
- Add support for NFS/RDMA over IPv6
Bugfixes and cleanups:
- Avoid race that crashes nfs_init_commit()
- Fix oops in callback path
- Fix LOCK/OPEN race when unlinking an open file
- Choose correct stateids when using delegations in setattr, read and
write
- Don't send empty SETATTR after OPEN_CREATE
- xprtrdma: Prevent server from writing a reply into memory client
has released
- xprtrdma: Support using Read list and Reply chunk in one RPC call"
* tag 'nfs-for-4.7-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (61 commits)
pnfs: pnfs_update_layout needs to consider if strict iomode checking is on
nfs/flexfiles: Use the layout segment for reading unless it a IOMODE_RW and reading is disabled
nfs/flexfiles: Helper function to detect FF_FLAGS_NO_READ_IO
nfs: avoid race that crashes nfs_init_commit
NFS: checking for NULL instead of IS_ERR() in nfs_commit_file()
pnfs: make pnfs_layout_process more robust
pnfs: rework LAYOUTGET retry handling
pnfs: lift retry logic from send_layoutget to pnfs_update_layout
pnfs: fix bad error handling in send_layoutget
flexfiles: add kerneldoc header to nfs4_ff_layout_prepare_ds
flexfiles: remove pointless setting of NFS_LAYOUT_RETURN_REQUESTED
pnfs: only tear down lsegs that precede seqid in LAYOUTRETURN args
pnfs: keep track of the return sequence number in pnfs_layout_hdr
pnfs: record sequence in pnfs_layout_segment when it's created
pnfs: don't merge new ff lsegs with ones that have LAYOUTRETURN bit set
pNFS/flexfiles: When initing reads or writes, we might have to retry connecting to DSes
pNFS/flexfiles: When checking for available DSes, conditionally check for MDS io
pNFS/flexfile: Fix erroneous fall back to read/write through the MDS
NFS: Reclaim writes via writepage are opportunistic
NFSv4: Use the right stateid for delegations in setattr, read and write
...
We have this situation: that EP hash table, contains only the EPs
that are listening, while the transports one, has the opposite.
We have to traverse both to dump all.
But when we traverse the transports one we will also get EPs that are
in the EP hash if they are listening. In this case, the EP is dumped
twice.
We will fix it by checking if the endpoint that is in the endpoint
hash table contains any ep->asoc in there, as it means we will also
find it via transport hash, and thus we can/should skip it, depending
on the filters used, like 'ss -l'.
Still, we should NOT skip it if the user is listing only listening
endpoints, because then we are not traversing the transport hash.
so we have to check idiag_states there also.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ceph_osdc_wait_request() is used when cephfs issues sync IO. In most
cases, the sync IO should be uninterruptible. The fix is use killale
wait function in ceph_osdc_wait_request().
Signed-off-by: Yan, Zheng <zyan@redhat.com>
This patch makes serverl logical caculation functions return bool to
improve readability due to these particular functions only using 0/1
as their return value.
No functional change.
Signed-off-by: Zhang Zhuoyu <zhangzhuoyu@cmss.chinamobile.com>
There is now about a dozen CEPH_OSDMAP_* flags. This is a debugging
interface, so just dump in hex instead of spelling each flag out.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This adds the "map check" infrastructure for sending osdmap version
checks on CALC_TARGET_POOL_DNE and completing in-flight requests with
-ENOENT if the target pool doesn't exist or has just been deleted.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
For map check, we are going to need to send CEPH_MSG_MON_GET_VERSION
messages asynchronously and get a callback on completion. Refactor MON
client to allow firing off generic requests asynchronously and add an
async variant of ceph_monc_get_version(). ceph_monc_do_statfs() is
switched over and remains sync.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Implement ceph_osdc_watch_check() to be able to check on status of
watch. Note that the time it takes for a watch/notify event to get
delivered through the notify_wq is taken into account.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Implement ceph_osdc_notify() for sending notifies.
Due to the fact that the current messenger can't do read-in into
pagelists (it can only do write-out from them), I had to go with a page
vector for a NOTIFY_COMPLETE payload, for now.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This adds support and switches rbd to a new, more reliable version of
watch/notify protocol. As with the OSD client update, this is mostly
about getting the right structures linked into the right places so that
reconnects are properly sent when needed. watch/notify v2 also
requires sending regular pings to the OSDs - send_linger_ping().
A major change from the old watch/notify implementation is the
introduction of ceph_osd_linger_request - linger requests no longer
piggy back on ceph_osd_request. ceph_osd_event has been merged into
ceph_osd_linger_request.
All the details are now hidden within libceph, the interface consists
of a simple pair of watch/unwatch functions and ceph_osdc_notify_ack().
ceph_osdc_watch() does return ceph_osd_linger_request, but only to keep
the lifetime management simple.
ceph_osdc_notify_ack() accepts an optional data payload, which is
relayed back to the notifier.
Portions of this patch are loosely based on work by Douglas Fuller
<dfuller@redhat.com> and Mike Christie <michaelc@cs.wisc.edu>.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The unwatch timeout is currently implemented in rbd. With
watch/unwatch code moving into libceph, we are going to need
a ceph_osdc_wait_request() variant with a timeout.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This is a major sync up, up to ~Jewel. The highlights are:
- per-session request trees (vs a global per-client tree)
- per-session locking (vs a global per-client rwlock)
- homeless OSD session
- no ad-hoc global per-client lists
- support for pool quotas
- foundation for watch/notify v2 support
- foundation for map check (pool deletion detection) support
The switchover is incomplete: lingering requests can be setup and
teared down but aren't ever reestablished. This functionality is
restored with the introduction of the new lingering infrastructure
(ceph_osd_linger_request, linger_work, etc) in a later commit.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
OSD client is getting moved from the big per-client lock to a set of
per-session locks. The big rwlock would only be held for read most of
the time, so a global osdc->osd_lru needs additional protection.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
create_osd() is called way too deep in the stack to be able to error
out in a sane way; a failing create_osd() just messes everything up.
The current req_notarget list solution is broken - the list is never
traversed as it's not entirely clear when to do it, I guess.
If we were to start traversing it at regular intervals and retrying
each request, we wouldn't be far off from what __GFP_NOFAIL is doing,
so allocate OSD sessions with __GFP_NOFAIL, at least until we come up
with a better fix.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Separate osdmap handling from decoding and iterating over a bag of maps
in a fresh MOSDMap message. This sets up the scene for the updated OSD
client.
Of particular importance here is the addition of pi->was_full, which
can be used to answer "did this pool go full -> not-full in this map?".
This is the key bit for supporting pool quotas.
We won't be able to downgrade map_sem for much longer, so drop
downgrade_write().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This leads to a simpler osdmap handling code, particularly when dealing
with pi->was_full, which is introduced in a later commit.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Both homeless OSD sessions and watch/notify v2, introduced in later
commits, require periodic ticks which don't depend on ->num_requests.
Schedule the initial tick from ceph_osdc_init() and reschedule from
handle_timeout() unconditionally.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
ceph_osdc_stop() isn't called if ceph_osdc_init() fails, so we end up
with handle_osds_timeout() running on invalid memory if any one of the
allocations fails. Call schedule_delayed_work() after everything is
setup, just before returning.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If you specify ACK | ONDISK and set ->r_unsafe_callback, both
->r_callback and ->r_unsafe_callback(true) are called on ack. This is
very confusing. Redo this so that only one of them is called:
->r_unsafe_callback(true), on ack
->r_unsafe_callback(false), on commit
or
->r_callback, on ack|commit
Decode everything in decode_MOSDOpReply() to reduce clutter.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
finish_read(), its only user, uses it to get to hdr.data_len, which is
what ->r_result is set to on success. This gains us the ability to
safely call callbacks from contexts other than reply, e.g. map check.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The crux of this is getting rid of ceph_osdc_build_request(), so that
MOSDOp can be encoded not before but after calc_target() calculates the
actual target. Encoding now happens within ceph_osdc_start_request().
Also nuked is the accompanying bunch of pointers into the encoded
buffer that was used to update fields on each send - instead, the
entire front is re-encoded. If we want to support target->name_len !=
base->name_len in the future, there is no other way, because oid is
surrounded by other fields in the encoded buffer.
Encoding OSD ops and adding data items to the request message were
mixed together in osd_req_encode_op(). While we want to re-encode OSD
ops, we don't want to add duplicate data items to the message when
resending, so all call to ceph_osdc_msg_data_add() are factored out
into a new setup_request_data().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Replace __calc_request_pg() and most of __map_request() with
calc_target() and start using req->r_t.
ceph_osdc_build_request() however still encodes base_oid, because it's
called before calc_target() is and target_oid is empty at that point in
time; a printf in osdc_show() also shows base_oid. This is fixed in
"libceph: switch to calc_target(), part 2".
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Introduce ceph_osd_request_target, containing all mapping-related
fields of ceph_osd_request and calc_target() for calculating mappings
and populating it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Add and decode pi->min_size and pi->last_force_request_resend. These
are going to be used by calc_target().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
calc_target() code is going to need to know how to compare PGs. Take
lhs and rhs pgid by const * while at it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Rename ceph_calc_pg_primary() to ceph_pg_to_acting_primary() to
emphasise that it returns acting primary.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Knowning just acting set isn't enough, we need to be able to record up
set as well to detect interval changes. This means returning (up[],
up_len, up_primary, acting[], acting_len, acting_primary) and passing
it around. Introduce and switch to ceph_osds to help with that.
Rename ceph_calc_pg_acting() to ceph_pg_to_up_acting_osds() and return
both up and acting sets from it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Rename ceph_oloc_oid_to_pg() to ceph_object_locator_to_pg(). Emphasise
that returned is raw PG and return -ENOENT instead of -EIO if the pool
doesn't exist.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Given
struct foo {
u64 id;
struct rb_node bar_node;
};
generate insert_bar(), erase_bar() and lookup_bar() functions with
DEFINE_RB_FUNCS(bar, struct foo, id, bar_node)
The key is assumed to be an integer (u64, int, etc), compared with
< and >. nodefld has to be initialized with RB_CLEAR_NODE().
Start using it for MDS, MON and OSD requests and OSD sessions.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Currently ceph_object_id can hold object names of up to 100
(CEPH_MAX_OID_NAME_LEN) characters. This is enough for all use cases,
expect one - long rbd image names:
- a format 1 header is named "<imgname>.rbd"
- an object that points to a format 2 header is named "rbd_id.<imgname>"
We operate on these potentially long-named objects during rbd map, and,
for format 1 images, during header refresh. (A format 2 header name is
a small system-generated string.)
Lift this 100 character limit by making ceph_object_id be able to point
to an externally-allocated string. Apart from being able to work with
almost arbitrarily-long named objects, this allows us to reduce the
size of ceph_object_id from >100 bytes to 64 bytes.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
For a message pool message, preallocate a page, just like we do for
osd_op. For a normal message, take ceph_object_id into account and
don't bother subtracting CEPH_OSD_SLAB_OPS ceph_osd_ops.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The size of ->r_request and ->r_reply messages depends on the size of
the object name (ceph_object_id), while the size of ceph_osd_request is
fixed. Move message allocation into a separate function that would
have to be called after ceph_object_id and ceph_object_locator (which
is also going to become variable in size with RADOS namespaces) have
been filled in:
req = ceph_osdc_alloc_request(...);
<fill in req->r_base_oid>
<fill in req->r_base_oloc>
ceph_osdc_alloc_messages(req);
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
ceph_osdc_build_request() is going away. Grab snapc and initialize
->r_snapid in ceph_osdc_alloc_request().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When hwbm_pool_add exited in error the spinlock was not released. This
patch fixes this issue.
Fixes: 8cb2d8bf57 ("net: add a hardware buffer management helper API")
Reported-by: Jean-Jacques Hiblot <jjhiblot@traphandler.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before calling the nla_parse_nested function, make sure the pointer to the
attribute is not null. This patch fixes several potential null pointer
dereference vulnerabilities in the tipc netlink functions.
Signed-off-by: Baozeng Ding <sploving1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Weber reported:
> Under full load (unshare() in loop -> OOM conditions) we can
> get kernel panic:
>
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
> IP: [<ffffffff81476c85>] nfqnl_nf_hook_drop+0x35/0x70
> [..]
> task: ffff88012dfa3840 ti: ffff88012dffc000 task.ti: ffff88012dffc000
> RIP: 0010:[<ffffffff81476c85>] [<ffffffff81476c85>] nfqnl_nf_hook_drop+0x35/0x70
> RSP: 0000:ffff88012dfffd80 EFLAGS: 00010206
> RAX: 0000000000000008 RBX: ffffffff81add0c0 RCX: ffff88013fd80000
> [..]
> Call Trace:
> [<ffffffff81474d98>] nf_queue_nf_hook_drop+0x18/0x20
> [<ffffffff814738eb>] nf_unregister_net_hook+0xdb/0x150
> [<ffffffff8147398f>] netfilter_net_exit+0x2f/0x60
> [<ffffffff8141b088>] ops_exit_list.isra.4+0x38/0x60
> [<ffffffff8141b652>] setup_net+0xc2/0x120
> [<ffffffff8141bd09>] copy_net_ns+0x79/0x120
> [<ffffffff8106965b>] create_new_namespaces+0x11b/0x1e0
> [<ffffffff810698a7>] unshare_nsproxy_namespaces+0x57/0xa0
> [<ffffffff8104baa2>] SyS_unshare+0x1b2/0x340
> [<ffffffff81608276>] entry_SYSCALL_64_fastpath+0x1e/0xa8
> Code: 65 00 48 89 e5 41 56 41 55 41 54 53 83 e8 01 48 8b 97 70 12 00 00 48 98 49 89 f4 4c 8b 74 c2 18 4d 8d 6e 08 49 81 c6 88 00 00 00 <49> 8b 5d 00 48 85 db 74 1a 48 89 df 4c 89 e2 48 c7 c6 90 68 47
>
The simple fix for this requires a new pernet variable for struct
nf_queue that indicates when it is safe to use the dynamically
allocated nf_queue state.
As we need a variable anyway make nf_register_queue_handler and
nf_unregister_queue_handler pernet. This allows the existing logic of
when it is safe to use the state from the nfnetlink_queue module to be
reused with no changes except for making it per net.
The syncrhonize_rcu from nf_unregister_queue_handler is moved to a new
function nfnl_queue_net_exit_batch so that the worst case of having a
syncrhonize_rcu in the pernet exit path is not experienced in batch
mode.
Reported-by: Florian Westphal <fw@strlen.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Policer was not dumping or updating timestamps
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I found a serious performance bug in packet schedulers using hrtimers.
sch_htb and sch_fq are definitely impacted by this problem.
We constantly rearm high resolution timers if some packets are throttled
in one (or more) class, and other packets are flying through qdisc on
another (non throttled) class.
hrtimer_start() does not have the mod_timer() trick of doing nothing if
expires value does not change :
if (timer_pending(timer) &&
timer->expires == expires)
return 1;
This issue is particularly visible when multiple cpus can queue/dequeue
packets on the same qdisc, as hrtimer code has to lock a remote base.
I used following fix :
1) Change htb to use qdisc_watchdog_schedule_ns() instead of open-coding
it.
2) Cache watchdog prior expiration. hrtimer might provide this, but I
prefer to not rely on some hrtimer internal.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXRL2PAAoJECebzXlCjuG+c34P/1wnkehVxDozBJp7UEzhrsE/
U1dpwfykzVEIMh68TldBvyrt2Lb4ThLPZ7V2dVwNqA831S/VM6fWJyw8WerSgGgU
SUGOzdF04rNfy41lXQNpDiiC417Fbp4Js4O+Q5kd+8kqQbXYqCwz0ce3DVbAT571
JmJgBI8gZLhicyNRDOt0y6C+/3P+0bbXYvS8wkzY+CwbNczHJOCLhwViKzWTptm9
LCSgDGm68ckpR7mZkWfEF3WdiZ9+SxeI+pT9dcomzxNfbv8NluDplYmdLbepA2J8
uWHGprVe9WJMDnw4hJhrI2b3/rHIntpxuZYktmnb/z/ezBTyi3FXYWgAEdE1by+Y
Gf7OewKOp8XcQ/iHRZ8vwXNrheHAr9++SB49mGBZJ3qj6bO+FrISQKX9FRxo6PrJ
SDRgYjt5yUG2oD1AAs1NzuBPqZzR40mA6Yk4zuNAcxzK/S7DdRF/9Kjyk86TVv08
3E3O5i1RyVcU/A7JdnbiyeDFMQoRshdnN0HShIZcSfcfW+qFKghNlO9bFfSl904F
jlG6moNB5OBiV8FNOelY+HGAYoUdw120QxqQMv47oZGKCjv+rfK38aB4GBJ4iEuo
TrGqNmrMrs/AKdL3Sd+8LuJqSfXggrwUDc/KS6CFz/U0eBbp6k0kcd7FEyG/J8kW
JxQ0URgyJ+DHfc60E8LN
=k6RP
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.7' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"A very quiet cycle for nfsd, mainly just an RDMA update from Chuck
Lever"
* tag 'nfsd-4.7' of git://linux-nfs.org/~bfields/linux:
sunrpc: fix stripping of padded MIC tokens
svcrpc: autoload rdma module
svcrdma: Generalize svc_rdma_xdr_decode_req()
svcrdma: Eliminate code duplication in svc_rdma_recvfrom()
svcrdma: Drain QP before freeing svcrdma_xprt
svcrdma: Post Receives only for forward channel requests
svcrdma: Remove superfluous line from rdma_read_chunks()
svcrdma: svc_rdma_put_context() is invoked twice in Send error path
svcrdma: Do not add XDR padding to xdr_buf page vector
svcrdma: Support IPv6 with NFS/RDMA
nfsd: handle seqid wraparound in nfsd4_preprocess_layout_stateid
Remove unnecessary allocation
In gre6 xmit path, we are sending a GRE packet, so set fl6 proto
to IPPROTO_GRE properly.
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When creat an ip6gretap interface with an unreachable route,
the MTU is about 14 bytes larger than what was needed.
If the remote address is reachable:
ping6 2001:0:130::1 -c 2
PING 2001:0:130::1(2001:0:130::1) 56 data bytes
64 bytes from 2001:0:130::1: icmp_seq=1 ttl=64 time=1.46 ms
64 bytes from 2001:0:130::1: icmp_seq=2 ttl=64 time=81.1 ms
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit fa50d974d1 ("ipv4: Namespaceify ip_default_ttl sysctl knob")
moves the default TTL assignment, and as side-effect IPv4 TTL now
has a default value only if sysctl support is enabled (CONFIG_SYSCTL=y).
The sysctl_ip_default_ttl is fundamental for IP to work properly,
as it provides the TTL to be used as default. The defautl TTL may be
used in ip_selected_ttl, through the following flow:
ip_select_ttl
ip4_dst_hoplimit
net->ipv4.sysctl_ip_default_ttl
This commit fixes the issue by assigning net->ipv4.sysctl_ip_default_ttl
in net_init_net, called during ipv4's initialization.
Without this commit, a kernel built without sysctl support will send
all IP packets with zero TTL (unless a TTL is explicitly set, e.g.
with setsockopt).
Given a similar issue might appear on the other knobs that were
namespaceify, this commit also moves them.
Fixes: fa50d974d1 ("ipv4: Namespaceify ip_default_ttl sysctl knob")
Signed-off-by: Ezequiel Garcia <ezequiel@vanguardiasur.com.ar>
Signed-off-by: David S. Miller <davem@davemloft.net>
The sk_err and sk_err_soft fields are positive errno values and
userspace applications rely on this when using getsockopt(SO_ERROR).
ATM code places an -errno into sk_err_soft in sigd_send() and returns it
from svc_addparty()/svc_dropparty().
Although I am not familiar with ATM code I came to this conclusion
because:
1. sigd_send() msg->type cases as_okay and as_error both have:
sk->sk_err = -msg->reply;
while the as_addparty and as_dropparty cases have:
sk->sk_err_soft = msg->reply;
This is the source of the inconsistency.
2. svc_addparty() returns an -errno and assumes sk_err_soft is also an
-errno:
if (flags & O_NONBLOCK) {
error = -EINPROGRESS;
goto out;
}
...
error = xchg(&sk->sk_err_soft, 0);
out:
release_sock(sk);
return error;
This shows that sk_err_soft is indeed being treated as an -errno.
This patch ensures that sk_err_soft is always a positive errno.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The length of the GSS MIC token need not be a multiple of four bytes.
It is then padded by XDR to a multiple of 4 B, but unwrap_integ_data()
would previously only trim mic.len + 4 B. The remaining up to three
bytes would then trigger a check in nfs4svc_decode_compoundargs(),
leading to a "garbage args" error and mount failure:
nfs4svc_decode_compoundargs: compound not properly padded!
nfsd: failed to decode arguments!
This would prevent older clients using the pre-RFC 4121 MIC format
(37-byte MIC including a 9-byte OID) from mounting exports from v3.9+
servers using krb5i.
The trimming was introduced by commit 4c190e2f91 ("sunrpc: trim off
trailing checksum before returning decrypted or integrity authenticated
buffer").
Fixes: 4c190e2f91 "unrpc: trim off trailing checksum..."
Signed-off-by: Tomáš Trnka <ttrnka@mail.muni.cz>
Cc: stable@vger.kernel.org
Acked-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This should fix failures like:
# rpc.nfsd --rdma
rpc.nfsd: Unable to request RDMA services: Protocol not supported
Reported-by: Steve Dickson <steved@redhat.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Here's the large TTY and Serial driver update for 4.7-rc1.
A few new serial drivers are added here, and Peter has fixed a bunch of
long-standing bugs in the tty layer and serial drivers as normal. Full
details in the shortlog.
All of these have been in linux-next for a while with no reported issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iEYEABECAAYFAlc/0/oACgkQMUfUDdst+ynzyQCgsa54VNijdAzU6AA5HEfqmf2M
cGMAn1boH7hUWlAbJmzzihx4JASoGjYW
=V5VH
-----END PGP SIGNATURE-----
Merge tag 'tty-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty and serial driver updates from Greg KH:
"Here's the large TTY and Serial driver update for 4.7-rc1.
A few new serial drivers are added here, and Peter has fixed a bunch
of long-standing bugs in the tty layer and serial drivers as normal.
Full details in the shortlog.
All of these have been in linux-next for a while with no reported
issues"
* tag 'tty-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (88 commits)
MAINTAINERS: 8250: remove website reference
serial: core: Fix port mutex assert if lockdep disabled
serial: 8250_dw: fix wrong logic in dw8250_check_lcr()
tty: vt, finish looping on duplicate
tty: vt, return error when con_startup fails
QE-UART: add "fsl,t1040-ucc-uart" to of_device_id
serial: mctrl_gpio: Drop support for out1-gpios and out2-gpios
serial: 8250dw: Add device HID for future AMD UART controller
Fix OpenSSH pty regression on close
serial: mctrl_gpio: add IRQ locking
serial: 8250: Integrate Fintek into 8250_base
serial: mps2-uart: add support for early console
serial: mps2-uart: add MPS2 UART driver
dt-bindings: document the MPS2 UART bindings
serial: sirf: Use generic uart-has-rtscts DT property
serial: sirf: Introduce helper variable struct device_node *np
serial: mxs-auart: Use generic uart-has-rtscts DT property
serial: imx: Use generic uart-has-rtscts DT property
doc: DT: Add Generic Serial Device Tree Bindings
serial: 8250: of: Make tegra_serial_handle_break() static
...
Pull networking fixes and more updates from David Miller:
1) Tunneling fixes from Tom Herbert and Alexander Duyck.
2) AF_UNIX updates some struct sock bit fields with the socket lock,
whereas setsockopt() sets overlapping ones with locking. Seperate
out the synchronized vs. the AF_UNIX unsynchronized ones to avoid
corruption. From Andrey Ryabinin.
3) Mount BPF filesystem with mount_nodev rather than mount_ns, from
Eric Biederman.
4) A couple kmemdup conversions, from Muhammad Falak R Wani.
5) BPF verifier fixes from Alexei Starovoitov.
6) Don't let tunneled UDP packets get stuck in socket queues, if
something goes wrong during the encapsulation just drop the packet
rather than signalling an error up the call stack. From Hannes
Frederic Sowa.
7) SKB ref after free in batman-adv, from Florian Westphal.
8) TCP iSCSI, ocfs2, rds, and tipc have to disable BH in it's TCP
callbacks since the TCP stack runs pre-emptibly now. From Eric
Dumazet.
9) Fix crash in fixed_phy_add, from Rabin Vincent.
10) Fix length checks in xen-netback, from Paul Durrant.
11) Fix mixup in KEY vs KEYID macsec attributes, from Sabrina Dubroca.
12) RDS connection spamming bug fixes from Sowmini Varadhan
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (152 commits)
net: suppress warnings on dev_alloc_skb
uapi glibc compat: fix compilation when !__USE_MISC in glibc
udp: prevent skbs lingering in tunnel socket queues
bpf: teach verifier to recognize imm += ptr pattern
bpf: support decreasing order in direct packet access
net: usb: ch9200: use kmemdup
ps3_gelic: use kmemdup
net:liquidio: use kmemdup
bpf: Use mount_nodev not mount_ns to mount the bpf filesystem
net: cdc_ncm: update datagram size after changing mtu
tuntap: correctly wake up process during uninit
intel: Add support for IPv6 IP-in-IP offload
ip6_gre: Do not allow segmentation offloads GRE_CSUM is enabled with FOU/GUE
RDS: TCP: Avoid rds connection churn from rogue SYNs
RDS: TCP: rds_tcp_accept_worker() must exit gracefully when terminating rds-tcp
net: sock: move ->sk_shutdown out of bitfields.
ipv6: Don't reset inner headers in ip6_tnl_xmit
ip4ip6: Support for GSO/GRO
ip6ip6: Support for GSO/GRO
ipv6: Set features for IPv6 tunnels
...
In case we find a socket with encapsulation enabled we should call
the encap_recv function even if just a udp header without payload is
available. The callbacks are responsible for correctly verifying and
dropping the packets.
Also, in case the header validation fails for geneve and vxlan we
shouldn't put the skb back into the socket queue, no one will pick
them up there. Instead we can simply discard them in the respective
encap_recv functions.
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch addresses the same issue we had for IPv4 where enabling GRE with
an inner checksum cannot be supported with FOU/GUE due to the fact that
they will jump past the GRE header at it is treated like a tunnel header.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a rogue SYN is received after the connection arbitration
algorithm has converged, the incoming SYN should not needlessly
quiesce the transmit path, and it should not result in needless
TCP connection resets due to re-execution of the connection
arbitration logic.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are two instances where we want to terminate RDS-TCP: when
exiting the netns or during module unload. In either case, the
termination sequence is to stop the listen socket, mark the
rtn->rds_tcp_listen_sock as null, and flush any accept workqs.
Thus any workqs that get flushed at this point will encounter a
null rds_tcp_listen_sock, and must exit gracefully to allow
the RDS-TCP termination to complete successfully.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since iptunnel_handle_offloads() is called in all paths we can
probably drop the block in ip6_tnl_xmit that was checking for
skb->encapsulation and resetting the inner headers.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Need to set dev features, use same values that are used in GREv6.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch add a new fou6 module that provides encapsulation
operations for IPv6.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add encap_hlen and ip_tunnel_encap structure to ip6_tnl. Add functions
for getting encap hlen, setting up encap on a tunnel, performing
encapsulation operation.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds receive path support for IPv6 with fou.
- Add address family to fou structure for open sockets. This supports
AF_INET and AF_INET6. Lookups for fou ports are performed on both the
port number and family.
- In fou and gue receive adjust tot_len in IPv4 header or payload_len
based on address family.
- Allow AF_INET6 in FOU_ATTR_AF netlink attribute.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create __fou_build_header and __gue_build_header. These implement the
protocol generic parts of building the fou and gue header.
fou_build_header and gue_build_header implement the IPv4 specific
functions and call the __*_build_header functions.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use helper function to set up UDP tunnel related information for a fou
socket.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Consolidate all the ip_tunnel_encap definitions in one spot in the
header file. Also, move ip_encap_hlen and ip_tunnel_encap from
ip_tunnel.c to ip_tunnels.h so they call be called without a dependency
on ip_tunnel module. Similarly, move iptun_encaps to ip_tunnel_core.c.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When performing foo-over-UDP, UDP packets are processed by the
encapsulation handler which returns another protocol to process.
This may result in processing two (or more) protocols in the
loop that are marked as INET6_PROTO_FINAL. The actions taken
for hitting a final protocol, in particular the skb_postpull_rcsum
can only be performed once.
This patch set adds a check of a final protocol has been seen. The
rules are:
- If the final protocol has not been seen any protocol is processed
(final and non-final). In the case of a final protocol, the final
actions are taken (like the skb_postpull_rcsum)
- If a final protocol has been seen (e.g. an encapsulating UDP
header) then no further non-final protocols are allowed
(e.g. extension headers). For more final protocols the
final actions are not taken (e.g. skb_postpull_rcsum).
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In ip6_input_finish the nexthdr protocol is retrieved from the
next header offset that is returned in the cb of the skb.
This method does not work for UDP encapsulation that may not
even have a concept of a nexthdr field (e.g. FOU).
This patch checks for a final protocol (INET6_PROTO_FINAL) when a
protocol handler returns > 0. If the protocol is not final then
resubmission is performed on nhoff value. If the protocol is final
then the nexthdr is taken to be the return value.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch defines two new GSO definitions SKB_GSO_IPXIP4 and
SKB_GSO_IPXIP6 along with corresponding NETIF_F_GSO_IPXIP4 and
NETIF_F_GSO_IPXIP6. These are used to described IP in IP
tunnel and what the outer protocol is. The inner protocol
can be deduced from other GSO types (e.g. SKB_GSO_TCPV4 and
SKB_GSO_TCPV6). The GSO types of SKB_GSO_IPIP and SKB_GSO_SIT
are removed (these are both instances of SKB_GSO_IPXIP4).
SKB_GSO_IPXIP6 will be used when support for GSO with IP
encapsulation over IPv6 is added.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In several gso_segment functions there are checks of gso_type against
a seemingly arbitrary list of SKB_GSO_* flags. This seems like an
attempt to identify unsupported GSO types, but since the stack is
the one that set these GSO types in the first place this seems
unnecessary to do. If a combination isn't valid in the first
place that stack should not allow setting it.
This is a code simplication especially for add new GSO types.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Updates to the new Intel X722 iWARP driver
- Updates to the hfi1 driver
- Fixes for the iw_cxgb4 driver
- Misc core fixes
- Generic RDMA READ/WRITE API addition
- SRP updates
- Misc ipoib updates
- Minor mlx5 updates
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXPIAnAAoJELgmozMOVy/dEYUP/0A6NH5ptzUrwPuLrWoz8h+e
KBfJE7H09mfKBx0Rq8YmnU+pz4lk8vrMLLaqpbGN57mwO0a1lK9bgc3E6KUhQPhc
dpGEX/NG1+aILomD7M4l1yAKkG17kxFLD75cLCeaxhO76jBRWsunukqk5mT/u0EG
fUYZs1fRb9t2LDtTNhPfSFR1+dgP5S17xLhpl9ttn87hTmIuiGWR6ig2nTC7azZD
G0d7RVjohfY2sDD28YgQiUEJ+q+1ymp3XTaCZhPCVl9VCRPweEdtLKcbNWZIvClx
ewuTCgADXg8tAL/6zu6bEKqZlC17UrmVJee3csKQLT09PJkSEICeFR/ld6ONUKAF
nDhi3ySa75Xxg9VDcCnuRkXKK+/zi7oDelZuh9mvMG0JJqPK9rTZDD29j2kBf7C0
bdx4R5cI4KJWQ/GlCyi/nLiuYkmAiCugzcGnRho4ub+EJ0yX1w6n8KVYr37kFsFu
q6MCnEfArEgDpbq1wo0+9MWtqBYrnOI/XtG81Zd+6X2MW975qU85wUdUSjg6OOb1
v1osyAmFDy9A0Y80yY+l1HHrSVIvI0IAWZDfxsbCLQY8O03ZNcvxE2RsrzWd5CKL
iZsX24tjV0WR9+lORHLfAKB3DL9CcfHv/tHo7q+5iAHmIuWZGrEN22ELkwS/4X7x
d/V0XDjzs6lgQeTJ7R4B
=e+Zv
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull rdma updates from Doug Ledford:
"Primary 4.7 merge window changes
- Updates to the new Intel X722 iWARP driver
- Updates to the hfi1 driver
- Fixes for the iw_cxgb4 driver
- Misc core fixes
- Generic RDMA READ/WRITE API addition
- SRP updates
- Misc ipoib updates
- Minor mlx5 updates"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (148 commits)
IB/mlx5: Fire the CQ completion handler from tasklet
net/mlx5_core: Use tasklet for user-space CQ completion events
IB/core: Do not require CAP_NET_ADMIN for packet sniffing
IB/mlx4: Fix unaligned access in send_reply_to_slave
IB/mlx5: Report Scatter FCS device capability when supported
IB/mlx5: Add Scatter FCS support for Raw Packet QP
IB/core: Add Scatter FCS create flag
IB/core: Add Raw Scatter FCS device capability
IB/core: Add extended device capability flags
i40iw: pass hw_stats by reference rather than by value
i40iw: Remove unnecessary synchronize_irq() before free_irq()
i40iw: constify i40iw_vf_cqp_ops structure
IB/mlx5: Add UARs write-combining and non-cached mapping
IB/mlx5: Allow mapping the free running counter on PROT_EXEC
IB/mlx4: Use list_for_each_entry_safe
IB/SA: Use correct free function
IB/core: Fix a potential array overrun in CMA and SA agent
IB/core: Remove unnecessary check in ibnl_rcv_msg
IB/IWPM: Fix a potential skb leak
RDMA/nes: replace custom print_hex_dump()
...
Overlayfs uses separate inodes even in the case of hard links on the
underlying filesystems. This is a problem for AF_UNIX socket
implementation which indexes sockets based on the inode. This resulted in
hard linked sockets not working.
The fix is to use the real, underlying inode.
Test case follows:
-- ovl-sock-test.c --
#include <unistd.h>
#include <err.h>
#include <sys/socket.h>
#include <sys/un.h>
#define SOCK "test-sock"
#define SOCK2 "test-sock2"
int main(void)
{
int fd, fd2;
struct sockaddr_un addr = {
.sun_family = AF_UNIX,
.sun_path = SOCK,
};
struct sockaddr_un addr2 = {
.sun_family = AF_UNIX,
.sun_path = SOCK2,
};
unlink(SOCK);
unlink(SOCK2);
if ((fd = socket(AF_UNIX, SOCK_STREAM, 0)) == -1)
err(1, "socket");
if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) == -1)
err(1, "bind");
if (listen(fd, 0) == -1)
err(1, "listen");
if (link(SOCK, SOCK2) == -1)
err(1, "link");
if ((fd2 = socket(AF_UNIX, SOCK_STREAM, 0)) == -1)
err(1, "socket");
if (connect(fd2, (struct sockaddr *) &addr2, sizeof(addr2)) == -1)
err (1, "connect");
return 0;
}
----
Reported-by: Alexander Morozov <alexandr.morozov@docker.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>
Merge updates from Andrew Morton:
- fsnotify fix
- poll() timeout fix
- a few scripts/ tweaks
- debugobjects updates
- the (small) ocfs2 queue
- Minor fixes to kernel/padata.c
- Maybe half of the MM queue
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (117 commits)
mm, page_alloc: restore the original nodemask if the fast path allocation failed
mm, page_alloc: uninline the bad page part of check_new_page()
mm, page_alloc: don't duplicate code in free_pcp_prepare
mm, page_alloc: defer debugging checks of pages allocated from the PCP
mm, page_alloc: defer debugging checks of freed pages until a PCP drain
cpuset: use static key better and convert to new API
mm, page_alloc: inline pageblock lookup in page free fast paths
mm, page_alloc: remove unnecessary variable from free_pcppages_bulk
mm, page_alloc: pull out side effects from free_pages_check
mm, page_alloc: un-inline the bad part of free_pages_check
mm, page_alloc: check multiple page fields with a single branch
mm, page_alloc: remove field from alloc_context
mm, page_alloc: avoid looking up the first zone in a zonelist twice
mm, page_alloc: shortcut watermark checks for order-0 pages
mm, page_alloc: reduce cost of fair zone allocation policy retry
mm, page_alloc: shorten the page allocator fast path
mm, page_alloc: check once if a zone has isolated pageblocks
mm, page_alloc: move __GFP_HARDWALL modifications out of the fastpath
mm, page_alloc: simplify last cpupid reset
mm, page_alloc: remove unnecessary initialisation from __alloc_pages_nodemask()
...
page_reference manipulation functions are introduced to track down
reference count change of the page. Use it instead of direct
modification of _count.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Sunil Goutham <sgoutham@cavium.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
struct timespec is not y2038 safe. Even though timespec might be
sufficient to represent timeouts, use struct timespec64 here as the plan
is to get rid of all timespec reference in the kernel.
The patch transitions the common functions: poll_select_set_timeout()
and select_estimate_accuracy() to use timespec64. And, all the syscalls
that use these functions are transitioned in the same patch.
The restart block parameters for poll uses monotonic time. Use
timespec64 here as well to assign timeout value. This parameter in the
restart block need not change because this only holds the monotonic
timestamp at which timeout should occur. And, unsigned long data type
should be big enough for this timestamp.
The system call interfaces will be handled in a separate series.
Compat interfaces need not change as timespec64 is an alias to struct
timespec on a 64 bit system.
Link: http://lkml.kernel.org/r/1461947989-21926-3-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: John Stultz <john.stultz@linaro.org>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
May, we managed to uncover and fix some important bugs in our
new B.A.T.M.A.N. V algorithm. These are the fixes we came up with
together with others that I collected in the past weeks:
- avoid potential crash due to NULL pointer dereference in
B.A.T.M.A.N. V routine when a neigh_ifinfo object is not found, by
Sven Eckelmann
- avoid use-after-free of skb when counting outgoing bytes, by Florian
Westphal
- fix neigh_ifinfo object reference counting imbalance when using
B.A.T.M.A.N. V, by Sven Eckelmann. Such imbalance may lead to the
impossibility of releasing the related netdev object on shutdown
- avoid invalid memory access in case of error while allocating
bcast_own_sum when a new hard-interface is added, by Sven Eckelmann
- ensure originator address is updated in OMG/ELP packet content upon
primary interface address change, by Antonio Quartulli
- fix integer overflow when computing TQ metric (B.A.T.M.A.N. IV), by
Sven Eckelmann
- avoid race condition while adding new neigh_node which would result
in having two objects mapping to the same physical neighbour, by
Linus Lüssing
- ensure originator address is initialized in ELP packet content on
secondary interfaces, by Marek Lindner
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJXO+ZsAAoJEJ4aZjxxc6bK7UEQALxcPqv2YtQhQCexHAPlmTxg
73uXuXpo0eWPxJh2FElcsHTHSt/l0/m/G8MEWCOfqI8MTeg+UTY5lXCY4iThtiY6
n08nl/CLkPNLuDGzAd7CpCM2jJ1iAzgJEVpUGQ6qoO5ekFvO09U/J/o6E8gDTX4I
yc1kR3hcJeLbF6EUXEfTX/2THdFhkYSgHgZoGyM3eu43Zhv/6v454iEpqNxA/Ubr
UngJTGSfqf20FidUt+xjbfptltTzj25Cp+VQ2DQ/LOuit0ChECjXwMfk0UkDHlKX
bYPm1tcUJ0dzbR7cILMlzzzCSzH00leUk5GYYo59rAxBQiJ2j+92c2K0dcnv8cJJ
q+GsZapyQFBIgPWS/TLtG1II3u8YWYvl+nEQXBDtPeRAjGoO4VwSbuatsf+enM8K
i0c622iEc0b3cVe5H8reUh7SZMfMT/cQkEgpX5/psf8FWvEzeDv4HI3aUfVuSpck
mEZAy59HeLWrF+KTScougv14vyoMmKfUcr5E1Oe4X1Ke9dpVvHGiI9wlqb/J0lfl
rIc/5WI3ae/QAXkcPybanPEO8O0pBMP+Y4oCYpWmDu1qYfwZzDaSMqqEALQVVW2I
8Ta1gbzamgZDslf7YDSWOR0iyFAFcnMhT0qU6gv2+RGbcfwD+AGRLu5ufY/th0N3
L6AuFhvT/bxW6jN+iuAS
=IACg
-----END PGP SIGNATURE-----
Merge tag 'batman-adv-fix-for-davem' of git://git.open-mesh.org/linux-merge
Antonio Quartulli says:
====================
During the Wireless Battle Mesh v9 in Porto (PT) at the beginning of
May, we managed to uncover and fix some important bugs in our
new B.A.T.M.A.N. V algorithm. These are the fixes we came up with
together with others that I collected in the past weeks:
- avoid potential crash due to NULL pointer dereference in
B.A.T.M.A.N. V routine when a neigh_ifinfo object is not found, by
Sven Eckelmann
- avoid use-after-free of skb when counting outgoing bytes, by Florian
Westphal
- fix neigh_ifinfo object reference counting imbalance when using
B.A.T.M.A.N. V, by Sven Eckelmann. Such imbalance may lead to the
impossibility of releasing the related netdev object on shutdown
- avoid invalid memory access in case of error while allocating
bcast_own_sum when a new hard-interface is added, by Sven Eckelmann
- ensure originator address is updated in OMG/ELP packet content upon
primary interface address change, by Antonio Quartulli
- fix integer overflow when computing TQ metric (B.A.T.M.A.N. IV), by
Sven Eckelmann
- avoid race condition while adding new neigh_node which would result
in having two objects mapping to the same physical neighbour, by
Linus Lüssing
- ensure originator address is initialized in ELP packet content on
secondary interfaces, by Marek Lindner
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP stack can now run from process context.
Use read_lock_bh(&sk->sk_callback_lock) variant to restore previous
assumption.
Fixes: 5413d1babe ("net: do not block BH while processing socket backlog")
Fixes: d41a69f1d3 ("tcp: make tcp_sendmsg() aware of socket backlog")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jon Maloy <jon.maloy@ericsson.com>
Cc: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP stack can now run from process context.
Use read_lock_bh(&sk->sk_callback_lock) variant to restore previous
assumption.
Fixes: 5413d1babe ("net: do not block BH while processing socket backlog")
Fixes: d41a69f1d3 ("tcp: make tcp_sendmsg() aware of socket backlog")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_splice_bits() returns int, kcm_splice_read() returns ssize_t,
both are signed.
We may need another patch to make them all ssize_t, but that
deserves a separated patch.
Fixes: 91687355b9 ("kcm: Splice support")
Reported-by: David Binderman <linuxdev.baldrick@gmail.com>
Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull security subsystem updates from James Morris:
"Highlights:
- A new LSM, "LoadPin", from Kees Cook is added, which allows forcing
of modules and firmware to be loaded from a specific device (this
is from ChromeOS, where the device as a whole is verified
cryptographically via dm-verity).
This is disabled by default but can be configured to be enabled by
default (don't do this if you don't know what you're doing).
- Keys: allow authentication data to be stored in an asymmetric key.
Lots of general fixes and updates.
- SELinux: add restrictions for loading of kernel modules via
finit_module(). Distinguish non-init user namespace capability
checks. Apply execstack check on thread stacks"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (48 commits)
LSM: LoadPin: provide enablement CONFIG
Yama: use atomic allocations when reporting
seccomp: Fix comment typo
ima: add support for creating files using the mknodat syscall
ima: fix ima_inode_post_setattr
vfs: forbid write access when reading a file into memory
fs: fix over-zealous use of "const"
selinux: apply execstack check on thread stacks
selinux: distinguish non-init user namespace capability checks
LSM: LoadPin for kernel file loading restrictions
fs: define a string representation of the kernel_read_file_id enumeration
Yama: consolidate error reporting
string_helpers: add kstrdup_quotable_file
string_helpers: add kstrdup_quotable_cmdline
string_helpers: add kstrdup_quotable
selinux: check ss_initialized before revalidating an inode label
selinux: delay inode label lookup as long as possible
selinux: don't revalidate an inode's label when explicitly setting it
selinux: Change bool variable name to index.
KEYS: Add KEYCTL_DH_COMPUTE command
...
This fix prevents nodes to wrongly create a 00:00:00:00:00:00 originator
which can potentially interfere with the rest of the neighbor statistics.
Fixes: d6f94d91f7 ("batman-adv: ELP - adding basic infrastructure")
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The undefined behavior sanatizer detected an signed integer overflow in a
setup with near perfect link quality
UBSAN: Undefined behaviour in net/batman-adv/bat_iv_ogm.c:1246:25
signed integer overflow:
8713350 * 255 cannot be represented in type 'int'
The problems happens because the calculation of mixed unsigned and signed
integers resulted in an integer multiplication.
batadv_ogm_packet::tq (u8 255)
* tq_own (u8 255)
* tq_asym_penalty (int 134; max 255)
* tq_iface_penalty (int 255; max 255)
The tq_iface_penalty, tq_asym_penalty and inv_asym_penalty can just be
changed to unsigned int because they are not expected to become negative.
Fixes: c039876892 ("batman-adv: add WiFi penalty")
Signed-off-by: Sven Eckelmann <sven.eckelmann@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
When the MAC address of the primary interface is changed,
update the originator address in the ELP and OGM skb buffers as
well in order to reflect the change.
Fixes: d6f94d91f7 ("batman-adv: ELP - adding basic infrastructure")
Reported-by: Marek Lindner <marek@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The function batadv_iv_ogm_orig_add_if allocates new buffers for bcast_own
and bcast_own_sum. It is expected that these buffers are unchanged in case
either bcast_own or bcast_own_sum couldn't be resized.
But the error handling of this function frees the already resized buffer
for bcast_own when the allocation of the new bcast_own_sum buffer failed.
This will lead to an invalid memory access when some code will try to
access bcast_own.
Instead the resized new bcast_own buffer has to be kept. This will not lead
to problems because the size of the buffer was only increased and therefore
no user of the buffer will try to access bytes outside of the new buffer.
Fixes: d0015fdd3d ("batman-adv: provide orig_node routing API")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The functions batadv_neigh_ifinfo_get increase the reference counter of the
batadv_neigh_ifinfo. These have to be reduced again when the reference is
not used anymore to correctly free the objects.
Fixes: 9786906022 ("batman-adv: B.A.T.M.A.N. V - implement neighbor comparison API calls")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
batadv_neigh_ifinfo_get can return NULL when it cannot find (even when only
temporarily) anymore the neigh_ifinfo in the list neigh->ifinfo_list. This
has to be checked to avoid kernel Oopses when the ifinfo is dereferenced.
This a situation which isn't expected but is already handled by functions
like batadv_v_neigh_cmp. The same kind of warning is therefore used before
the function returns without dereferencing the pointers.
Fixes: 9786906022 ("batman-adv: B.A.T.M.A.N. V - implement neighbor comparison API calls")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
batadv_send_skb_to_orig() calls dev_queue_xmit() so we can't use skb->len.
Fixes: 953324776d ("batman-adv: network coding - buffer unicast packets before forward")
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Pull networking updates from David Miller:
"Highlights:
1) Support SPI based w5100 devices, from Akinobu Mita.
2) Partial Segmentation Offload, from Alexander Duyck.
3) Add GMAC4 support to stmmac driver, from Alexandre TORGUE.
4) Allow cls_flower stats offload, from Amir Vadai.
5) Implement bpf blinding, from Daniel Borkmann.
6) Optimize _ASYNC_ bit twiddling on sockets, unless the socket is
actually using FASYNC these atomics are superfluous. From Eric
Dumazet.
7) Run TCP more preemptibly, also from Eric Dumazet.
8) Support LED blinking, EEPROM dumps, and rxvlan offloading in mlx5e
driver, from Gal Pressman.
9) Allow creating ppp devices via rtnetlink, from Guillaume Nault.
10) Improve BPF usage documentation, from Jesper Dangaard Brouer.
11) Support tunneling offloads in qed, from Manish Chopra.
12) aRFS offloading in mlx5e, from Maor Gottlieb.
13) Add RFS and RPS support to SCTP protocol, from Marcelo Ricardo
Leitner.
14) Add MSG_EOR support to TCP, this allows controlling packet
coalescing on application record boundaries for more accurate
socket timestamp sampling. From Martin KaFai Lau.
15) Fix alignment of 64-bit netlink attributes across the board, from
Nicolas Dichtel.
16) Per-vlan stats in bridging, from Nikolay Aleksandrov.
17) Several conversions of drivers to ethtool ksettings, from Philippe
Reynes.
18) Checksum neutral ILA in ipv6, from Tom Herbert.
19) Factorize all of the various marvell dsa drivers into one, from
Vivien Didelot
20) Add VF support to qed driver, from Yuval Mintz"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1649 commits)
Revert "phy dp83867: Fix compilation with CONFIG_OF_MDIO=m"
Revert "phy dp83867: Make rgmii parameters optional"
r8169: default to 64-bit DMA on recent PCIe chips
phy dp83867: Make rgmii parameters optional
phy dp83867: Fix compilation with CONFIG_OF_MDIO=m
bpf: arm64: remove callee-save registers use for tmp registers
asix: Fix offset calculation in asix_rx_fixup() causing slow transmissions
switchdev: pass pointer to fib_info instead of copy
net_sched: close another race condition in tcf_mirred_release()
tipc: fix nametable publication field in nl compat
drivers: net: Don't print unpopulated net_device name
qed: add support for dcbx.
ravb: Add missing free_irq() calls to ravb_close()
qed: Remove a stray tab
net: ethernet: fec-mpc52xx: use phy_ethtool_{get|set}_link_ksettings
net: ethernet: fec-mpc52xx: use phydev from struct net_device
bpf, doc: fix typo on bpf_asm descriptions
stmmac: hardware TX COE doesn't work when force_thresh_dma_mode is set
net: ethernet: fs-enet: use phy_ethtool_{get|set}_link_ksettings
net: ethernet: fs-enet: use phydev from struct net_device
...
Pull 'struct path' constification update from Al Viro:
"'struct path' is passed by reference to a bunch of Linux security
methods; in theory, there's nothing to stop them from modifying the
damn thing and LSM community being what it is, sooner or later some
enterprising soul is going to decide that it's a good idea.
Let's remove the temptation and constify all of those..."
* 'work.const-path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
constify ima_d_path()
constify security_sb_pivotroot()
constify security_path_chroot()
constify security_path_{link,rename}
apparmor: remove useless checks for NULL ->mnt
constify security_path_{mkdir,mknod,symlink}
constify security_path_{unlink,rmdir}
apparmor: constify common_perm_...()
apparmor: constify aa_path_link()
apparmor: new helper - common_path_perm()
constify chmod_common/security_path_chmod
constify security_sb_mount()
constify chown_common/security_path_chown
tomoyo: constify assorted struct path *
apparmor_path_truncate(): path->mnt is never NULL
constify vfs_truncate()
constify security_path_truncate()
[apparmor] constify struct path * in a bunch of helpers
Clean up.
After "xprtrdma: Remove ro_unmap() from all registration modes",
there are no longer any sites that take rpcrdma_ia::qplock for read.
The one site that takes it for write is always single-threaded. It
is safe to remove it.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
In a cluster failover scenario, it is desirable for the client to
attempt to reconnect quickly, as an alternate NFS server is already
waiting to take over for the down server. The client can't see that
a server IP address has moved to a new server until the existing
connection is gone.
For fabrics and devices where it is meaningful, set a definite upper
bound on the amount of time before it is determined that a
connection is no longer valid. This allows the RPC client to detect
connection loss in a timely matter, then perform a fresh resolution
of the server GUID in case it has changed (cluster failover).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: The ro_unmap method is no longer used.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
There needs to be a safe method of releasing registered memory
resources when an RPC terminates. Safe can mean a number of things:
+ Doesn't have to sleep
+ Doesn't rely on having a QP in RTS
ro_unmap_safe will be that safe method. It can be used in cases
where synchronous memory invalidation can deadlock, or needs to have
an active QP.
The important case is fencing an RPC's memory regions after it is
signaled (^C) and before it exits. If this is not done, there is a
window where the server can write an RPC reply into memory that the
client has released and re-used for some other purpose.
Note that this is a full solution for FRWR, but FMR and physical
still have some gaps where a particularly bad server can wreak
some havoc on the client. These gaps are not made worse by this
patch and are expected to be exceptionally rare and timing-based.
They are noted in documenting comments.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Separate the DMA unmap operation from freeing the MW. In a
subsequent patch they will not always be done at the same time,
and they are not related operations (except by order; freeing
the MW must be the last step during invalidation).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
In a subsequent patch, the fr_xprt and fr_worker fields will be
needed by another memory registration mode. Move them into the
generic rpcrdma_mw structure that wraps struct rpcrdma_frmr.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Maintain the order of invalidation and DMA unmapping when doing
a background MR reset.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
frwr_op_unmap_sync() is now invoked in a workqueue context, the same
as __frwr_queue_recovery(). There's no need to defer MR reset if
posting LOCAL_INV MRs fails.
This means that even when ib_post_send() fails (which should occur
very rarely) the invalidation and DMA unmapping steps are still done
in the correct order.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Move the the I/O direction field from rpcrdma_mr_seg into the
rpcrdma_frmr.
This makes it possible to DMA-unmap the frwr long after an RPC has
exited and its rpcrdma_mr_seg array has been released and re-used.
This might occur if an RPC times out while waiting for a new
connection to be established.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: Follow same naming convention as other fields in struct
rpcrdma_frwr.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: Replace rpcrdma_flush_cqs() and rpcrdma_clean_cqs() with
the new ib_drain_qp() API.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-By: Leon Romanovsky <leonro@mellanox.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
rpcrdma_create_chunks() has been replaced, and can be removed.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Update documenting comments to reflect code changes over the past
year.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Avoid the latency and interrupt overhead of registering a Write
chunk when handling NFS READ requests of a few hundred bytes or
less.
This change does not interoperate with Linux NFS/RDMA servers
that do not have commit 9d11b51ce7 ('svcrdma: Fix send_reply()
scatter/gather set-up'). Commit 9d11b51ce7 was introduced in v4.3,
and is included in 4.2.y, 4.1.y, and 3.18.y.
Oracle bug 22925946 has been filed to request that the above fix
be included in the Oracle Linux UEK4 NFS/RDMA server.
Red Hat bugzillas 1327280 and 1327554 have been filed to request
that RHEL NFS/RDMA server backports include the above fix.
Workaround: Replace the "proto=rdma,port=20049" mount options
with "proto=tcp" until commit 9d11b51ce7 is applied to your
NFS server.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
When deciding whether to send a Call inline, rpcrdma_marshal_req
doesn't take into account header bytes consumed by chunk lists.
This results in Call messages on the wire that are sometimes larger
than the inline threshold.
Likewise, when a Write list or Reply chunk is in play, the server's
reply has to emit an RDMA Send that includes a larger-than-minimal
RPC-over-RDMA header.
The actual size of a Call message cannot be estimated until after
the chunk lists have been registered. Thus the size of each
RPC-over-RDMA header can be estimated only after chunks are
registered; but the decision to register chunks is based on the size
of that header. Chicken, meet egg.
The best a client can do is estimate header size based on the
largest header that might occur, and then ensure that inline content
is always smaller than that.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Send buffer space is shared between the RPC-over-RDMA header and
an RPC message. A large RPC-over-RDMA header means less space is
available for the associated RPC message, which then has to be
moved via an RDMA Read or Write.
As more segments are added to the chunk lists, the header increases
in size. Typical modern hardware needs only a few segments to
convey the maximum payload size, but some devices and registration
modes may need a lot of segments to convey data payload. Sometimes
so many are needed that the remaining space in the Send buffer is
not enough for the RPC message. Sending such a message usually
fails.
To ensure a transport can always make forward progress, cap the
number of RDMA segments that are allowed in chunk lists. This
prevents less-capable devices and memory registrations from
consuming a large portion of the Send buffer by reducing the
maximum data payload that can be conveyed with such devices.
For now I choose an arbitrary maximum of 8 RDMA segments. This
allows a maximum size RPC-over-RDMA header to fit nicely in the
current 1024 byte inline threshold with over 700 bytes remaining
for an inline RPC message.
The current maximum data payload of NFS READ or WRITE requests is
one megabyte. To convey that payload on a client with 4KB pages,
each chunk segment would need to handle 32 or more data pages. This
is well within the capabilities of FMR. For physical registration,
the maximum payload size on platforms with 4KB pages is reduced to
32KB.
For FRWR, a device's maximum page list depth would need to be at
least 34 to support the maximum 1MB payload. A device with a smaller
maximum page list depth means the maximum data payload is reduced
when using that device.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Currently the sysctls that allow setting the inline threshold allow
any value to be set.
Small values only make the transport run slower. The default 1KB
setting is as low as is reasonable. And the logic that decides how
to divide a Send buffer between RPC-over-RDMA header and RPC message
assumes (but does not check) that the lower bound is not crazy (say,
57 bytes).
Send and receive buffers share a page with some control information.
Values larger than about 3KB can't be supported, currently.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
RPC-over-RDMA transports have a limit on how large a backward
direction (backchannel) RPC message can be. Ensure that the NFSv4.x
CREATE_SESSION operation advertises this limit to servers.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
This merges the Qualcomm SOC tree with the net-next, solving the
merge conflict in the SMD API between the two.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXNk3mAAoJEAsfOT8Nma3FJkEQAI/0hXy5cmACx37crT6Zishn
+IjfnVFY0w+VWX6DdOVvAiNufMEBnvbo6jAGFxYALvZTRyt2kBxa4SgbQfgeDGp2
UIXU5vYjYamggBALcDc9b6xBxxYEdIjGpqueIrGPyLzuAHxC74Ss4a1UYE8kC8R4
lznB+BCuyv4WQL5GgWK2vTcFH6BD9Yw6FARLI2YJaR1oRrvl+3WCnPSKZUYPi31R
vxvGYV5yiZ5YyOyZX4VHpn2vk1+dD7mhzXS/X1cGdV3Ysaqdw3bXqM0C31HW29F5
qxnVNtbnoNMIpQQaFRMT00x7yx+12gCXXgb7Q1CDzvXFQXkGvQRlcHagmvi7dqI9
YrttwbmS9LgezzEuaUTNfOJpMS9x7ghSayzHEmVLn9/7IMxdmqUiu5f9uKMjQHsW
D3hWCTahVXf9Y+UWKOEtvnazyEcdQIuuBEzZQw05cl8OntGk1eWtHLikZVyW+zWO
lgWX2n84Bc3BPz6o0Nu+tW4WbTlo5irnA8YLeBRJNUFgwlwlhm0Io2Q7aVhW27jZ
rDPfil1KduOGgf0T2EMD9XdQf0tXMaJ7y7vzuGtaWy83kYaEDzYOmnLK39LQYlxm
yYoU5Yo9fsoFXTvLyD/3FBD8Wqblv5C8ObIB4Lc3uhmrLBYA2mjFAmsxqDV2Giuz
So/uZlqnJauYFnMBAHUu
=UpGH
-----END PGP SIGNATURE-----
Merge tag 'net-next-qcom-soc-4.7-2-merge' of git://github.com/andersson/kernel
Merge tag 'qcom-soc-for-4.7-2' into net-next
This merges the Qualcomm SOC tree with the net-next, solving the
merge conflict in the SMD API between the two.
Pull parallel filesystem directory handling update from Al Viro.
This is the main parallel directory work by Al that makes the vfs layer
able to do lookup and readdir in parallel within a single directory.
That's a big change, since this used to be all protected by the
directory inode mutex.
The inode mutex is replaced by an rwsem, and serialization of lookups of
a single name is done by a "in-progress" dentry marker.
The series begins with xattr cleanups, and then ends with switching
filesystems over to actually doing the readdir in parallel (switching to
the "iterate_shared()" that only takes the read lock).
A more detailed explanation of the process from Al Viro:
"The xattr work starts with some acl fixes, then switches ->getxattr to
passing inode and dentry separately. This is the point where the
things start to get tricky - that got merged into the very beginning
of the -rc3-based #work.lookups, to allow untangling the
security_d_instantiate() mess. The xattr work itself proceeds to
switch a lot of filesystems to generic_...xattr(); no complications
there.
After that initial xattr work, the series then does the following:
- untangle security_d_instantiate()
- convert a bunch of open-coded lookup_one_len_unlocked() to calls of
that thing; one such place (in overlayfs) actually yields a trivial
conflict with overlayfs fixes later in the cycle - overlayfs ended
up switching to a variant of lookup_one_len_unlocked() sans the
permission checks. I would've dropped that commit (it gets
overridden on merge from #ovl-fixes in #for-next; proper resolution
is to use the variant in mainline fs/overlayfs/super.c), but I
didn't want to rebase the damn thing - it was fairly late in the
cycle...
- some filesystems had managed to depend on lookup/lookup exclusion
for *fs-internal* data structures in a way that would break if we
relaxed the VFS exclusion. Fixing hadn't been hard, fortunately.
- core of that series - parallel lookup machinery, replacing
->i_mutex with rwsem, making lookup_slow() take it only shared. At
that point lookups happen in parallel; lookups on the same name
wait for the in-progress one to be done with that dentry.
Surprisingly little code, at that - almost all of it is in
fs/dcache.c, with fs/namei.c changes limited to lookup_slow() -
making it use the new primitive and actually switching to locking
shared.
- parallel readdir stuff - first of all, we provide the exclusion on
per-struct file basis, same as we do for read() vs lseek() for
regular files. That takes care of most of the needed exclusion in
readdir/readdir; however, these guys are trickier than lookups, so
I went for switching them one-by-one. To do that, a new method
'->iterate_shared()' is added and filesystems are switched to it
as they are either confirmed to be OK with shared lock on directory
or fixed to be OK with that. I hope to kill the original method
come next cycle (almost all in-tree filesystems are switched
already), but it's still not quite finished.
- several filesystems get switched to parallel readdir. The
interesting part here is dealing with dcache preseeding by readdir;
that needs minor adjustment to be safe with directory locked only
shared.
Most of the filesystems doing that got switched to in those
commits. Important exception: NFS. Turns out that NFS folks, with
their, er, insistence on VFS getting the fuck out of the way of the
Smart Filesystem Code That Knows How And What To Lock(tm) have
grown the locking of their own. They had their own homegrown
rwsem, with lookup/readdir/atomic_open being *writers* (sillyunlink
is the reader there). Of course, with VFS getting the fuck out of
the way, as requested, the actual smarts of the smart filesystem
code etc. had become exposed...
- do_last/lookup_open/atomic_open cleanups. As the result, open()
without O_CREAT locks the directory only shared. Including the
->atomic_open() case. Backmerge from #for-linus in the middle of
that - atomic_open() fix got brought in.
- then comes NFS switch to saner (VFS-based ;-) locking, killing the
homegrown "lookup and readdir are writers" kinda-sorta rwsem. All
exclusion for sillyunlink/lookup is done by the parallel lookups
mechanism. Exclusion between sillyunlink and rmdir is a real rwsem
now - rmdir being the writer.
Result: NFS lookups/readdirs/O_CREAT-less opens happen in parallel
now.
- the rest of the series consists of switching a lot of filesystems
to parallel readdir; in a lot of cases ->llseek() gets simplified
as well. One backmerge in there (again, #for-linus - rockridge
fix)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (74 commits)
ext4: switch to ->iterate_shared()
hfs: switch to ->iterate_shared()
hfsplus: switch to ->iterate_shared()
hostfs: switch to ->iterate_shared()
hpfs: switch to ->iterate_shared()
hpfs: handle allocation failures in hpfs_add_pos()
gfs2: switch to ->iterate_shared()
f2fs: switch to ->iterate_shared()
afs: switch to ->iterate_shared()
befs: switch to ->iterate_shared()
befs: constify stuff a bit
isofs: switch to ->iterate_shared()
get_acorn_filename(): deobfuscate a bit
btrfs: switch to ->iterate_shared()
logfs: no need to lock directory in lseek
switch ecryptfs to ->iterate_shared
9p: switch to ->iterate_shared()
fat: switch to ->iterate_shared()
romfs, squashfs: switch to ->iterate_shared()
more trivial ->iterate_shared conversions
...
The problem is that fib_info->nh is [0] so the struct fib_info
allocation size depends on number of nexthops. If we just copy fib_info,
we do not copy the nexthops info and driver accesses memory which is not
ours.
Given the fact that fib4 does not defer operations and therefore it does
not need copy, just pass the pointer down to drivers as it was done
before.
Fixes: 850d0cbc91 ("switchdev: remove pointers from switchdev objects")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We saw the following extra refcount release on veth device:
kernel: [7957821.463992] unregister_netdevice: waiting for mesos50284 to become free. Usage count = -1
Since we heavily use mirred action to redirect packets to veth, I think
this is caused by the following race condition:
CPU0:
tcf_mirred_release(): (in RCU callback)
struct net_device *dev = rcu_dereference_protected(m->tcfm_dev, 1);
CPU1:
mirred_device_event():
spin_lock_bh(&mirred_list_lock);
list_for_each_entry(m, &mirred_list, tcfm_list) {
if (rcu_access_pointer(m->tcfm_dev) == dev) {
dev_put(dev);
/* Note : no rcu grace period necessary, as
* net_device are already rcu protected.
*/
RCU_INIT_POINTER(m->tcfm_dev, NULL);
}
}
spin_unlock_bh(&mirred_list_lock);
CPU0:
tcf_mirred_release():
spin_lock_bh(&mirred_list_lock);
list_del(&m->tcfm_list);
spin_unlock_bh(&mirred_list_lock);
if (dev) // <======== Stil refers to the old m->tcfm_dev
dev_put(dev); // <======== dev_put() is called on it again
The action init code path is good because it is impossible to modify
an action that is being removed.
So, fix this by moving everything under the spinlock.
Fixes: 2ee22a90c7 ("net_sched: act_mirred: remove spinlock in fast path")
Fixes: 6bd00b8506 ("act_mirred: fix a race condition on mirred_list")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The publication field of the old netlink API should contain the
publication key and not the publication reference.
Fixes: 44a8ae94fd (tipc: convert legacy nl name table dump to nl compat)
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Backmerge to resolve a conflict in ovl_lookup_real();
"ovl_lookup_real(): use lookup_one_len_unlocked()" instead,
but it was too late in the cycle to rebase.
When we free cb->skb after a dump, we do it after releasing the
lock. This means that a new dump could have started in the time
being and we'll end up freeing their skb instead of ours.
This patch saves the skb and module before we unlock so we free
the right memory.
Fixes: 16b304f340 ("netlink: Eliminate kmalloc in netlink dump operation.")
Reported-by: Baozeng Ding <sploving1@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make sure the socket for which the user is listing publication exists
before parsing the socket netlink attributes.
Prior to this patch a call without any socket caused a NULL pointer
dereference in tipc_nl_publ_dump().
Tested-and-reported-by: Baozeng Ding <sploving1@gmail.com>
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.cm>
Signed-off-by: David S. Miller <davem@davemloft.net>
memory_usage must be decreased in dequeue_func(), not in
fq_codel_dequeue(), otherwise packets dropped by Codel algo
are missing this decrease.
Also we need to clear memory_usage in fq_codel_reset()
Fixes: 95b58430ab ("fq_codel: add memory limitation per queue")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Follow-up for 8a3a4c6e7b ("net: make sch_handle_ingress() drop
monitor ready") to also make the egress side drop monitor ready.
Also here only TC_ACT_SHOT is a clear indication that something
went wrong. Hence don't provide false positives to drop monitors
such as 'perf record -e skb:kfree_skb ...'.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function setup_timer combines the initialization of a timer with the
initialization of the timer's function and data fields. The mulitiline
code for timer initialization is now replaced with function setup_timer.
Also, quoting the mod_timer() function comment:
-> mod_timer() is a more efficient way to update the expire field of an
active timer (if the timer is inactive it will be activated).
Use setup_timer() and mod_timer() to setup and arm a timer, making the
code compact and aid readablity.
Signed-off-by: Muhammad Falak R Wani <falakreyaz@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth-next 2016-05-14
Here are two more Bluetooth patches for the 4.7 kernel which we wanted
to get into net-next before the merge window opens. Please let me know
if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This work adds a generic facility for use from eBPF JIT compilers
that allows for further hardening of JIT generated images through
blinding constants. In response to the original work on BPF JIT
spraying published by Keegan McAllister [1], most BPF JITs were
changed to make images read-only and start at a randomized offset
in the page, where the rest was filled with trap instructions. We
have this nowadays in x86, arm, arm64 and s390 JIT compilers.
Additionally, later work also made eBPF interpreter images read
only for kernels supporting DEBUG_SET_MODULE_RONX, that is, x86,
arm, arm64 and s390 archs as well currently. This is done by
default for mentioned JITs when JITing is enabled. Furthermore,
we had a generic and configurable constant blinding facility on our
todo for quite some time now to further make spraying harder, and
first implementation since around netconf 2016.
We found that for systems where untrusted users can load cBPF/eBPF
code where JIT is enabled, start offset randomization helps a bit
to make jumps into crafted payload harder, but in case where larger
programs that cross page boundary are injected, we again have some
part of the program opcodes at a page start offset. With improved
guessing and more reliable payload injection, chances can increase
to jump into such payload. Elena Reshetova recently wrote a test
case for it [2, 3]. Moreover, eBPF comes with 64 bit constants, which
can leave some more room for payloads. Note that for all this,
additional bugs in the kernel are still required to make the jump
(and of course to guess right, to not jump into a trap) and naturally
the JIT must be enabled, which is disabled by default.
For helping mitigation, the general idea is to provide an option
bpf_jit_harden that admins can tweak along with bpf_jit_enable, so
that for cases where JIT should be enabled for performance reasons,
the generated image can be further hardened with blinding constants
for unpriviledged users (bpf_jit_harden == 1), with trading off
performance for these, but not for privileged ones. We also added
the option of blinding for all users (bpf_jit_harden == 2), which
is quite helpful for testing f.e. with test_bpf.ko. There are no
further e.g. hardening levels of bpf_jit_harden switch intended,
rationale is to have it dead simple to use as on/off. Since this
functionality would need to be duplicated over and over for JIT
compilers to use, which are already complex enough, we provide a
generic eBPF byte-code level based blinding implementation, which is
then just transparently JITed. JIT compilers need to make only a few
changes to integrate this facility and can be migrated one by one.
This option is for eBPF JITs and will be used in x86, arm64, s390
without too much effort, and soon ppc64 JITs, thus that native eBPF
can be blinded as well as cBPF to eBPF migrations, so that both can
be covered with a single implementation. The rule for JITs is that
bpf_jit_blind_constants() must be called from bpf_int_jit_compile(),
and in case blinding is disabled, we follow normally with JITing the
passed program. In case blinding is enabled and we fail during the
process of blinding itself, we must return with the interpreter.
Similarly, in case the JITing process after the blinding failed, we
return normally to the interpreter with the non-blinded code. Meaning,
interpreter doesn't change in any way and operates on eBPF code as
usual. For doing this pre-JIT blinding step, we need to make use of
a helper/auxiliary register, here BPF_REG_AX. This is strictly internal
to the JIT and not in any way part of the eBPF architecture. Just like
in the same way as JITs internally make use of some helper registers
when emitting code, only that here the helper register is one
abstraction level higher in eBPF bytecode, but nevertheless in JIT
phase. That helper register is needed since f.e. manually written
program can issue loads to all registers of eBPF architecture.
The core concept with the additional register is: blind out all 32
and 64 bit constants by converting BPF_K based instructions into a
small sequence from K_VAL into ((RND ^ K_VAL) ^ RND). Therefore, this
is transformed into: BPF_REG_AX := (RND ^ K_VAL), BPF_REG_AX ^= RND,
and REG <OP> BPF_REG_AX, so actual operation on the target register
is translated from BPF_K into BPF_X one that is operating on
BPF_REG_AX's content. During rewriting phase when blinding, RND is
newly generated via prandom_u32() for each processed instruction.
64 bit loads are split into two 32 bit loads to make translation and
patching not too complex. Only basic thing required by JITs is to
call the helper bpf_jit_blind_constants()/bpf_jit_prog_release_other()
pair, and to map BPF_REG_AX into an unused register.
Small bpf_jit_disasm extract from [2] when applied to x86 JIT:
echo 0 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f5e9 + <x>:
[...]
39: mov $0xa8909090,%eax
3e: mov $0xa8909090,%eax
43: mov $0xa8ff3148,%eax
48: mov $0xa89081b4,%eax
4d: mov $0xa8900bb0,%eax
52: mov $0xa810e0c1,%eax
57: mov $0xa8908eb4,%eax
5c: mov $0xa89020b0,%eax
[...]
echo 1 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f1e5 + <x>:
[...]
39: mov $0xe1192563,%r10d
3f: xor $0x4989b5f3,%r10d
46: mov %r10d,%eax
49: mov $0xb8296d93,%r10d
4f: xor $0x10b9fd03,%r10d
56: mov %r10d,%eax
59: mov $0x8c381146,%r10d
5f: xor $0x24c7200e,%r10d
66: mov %r10d,%eax
69: mov $0xeb2a830e,%r10d
6f: xor $0x43ba02ba,%r10d
76: mov %r10d,%eax
79: mov $0xd9730af,%r10d
7f: xor $0xa5073b1f,%r10d
86: mov %r10d,%eax
89: mov $0x9a45662b,%r10d
8f: xor $0x325586ea,%r10d
96: mov %r10d,%eax
[...]
As can be seen, original constants that carry payload are hidden
when enabled, actual operations are transformed from constant-based
to register-based ones, making jumps into constants ineffective.
Above extract/example uses single BPF load instruction over and
over, but of course all instructions with constants are blinded.
Performance wise, JIT with blinding performs a bit slower than just
JIT and faster than interpreter case. This is expected, since we
still get all the performance benefits from JITing and in normal
use-cases not every single instruction needs to be blinded. Summing
up all 296 test cases averaged over multiple runs from test_bpf.ko
suite, interpreter was 55% slower than JIT only and JIT with blinding
was 8% slower than JIT only. Since there are also some extremes in
the test suite, I expect for ordinary workloads that the performance
for the JIT with blinding case is even closer to JIT only case,
f.e. nmap test case from suite has averaged timings in ns 29 (JIT),
35 (+ blinding), and 151 (interpreter).
BPF test suite, seccomp test suite, eBPF sample code and various
bigger networking eBPF programs have been tested with this and were
running fine. For testing purposes, I also adapted interpreter and
redirected blinded eBPF image to interpreter and also here all tests
pass.
[1] http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html
[2] https://github.com/01org/jit-spray-poc-for-ksp/
[3] http://www.openwall.com/lists/kernel-hardening/2016/05/03/5
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Elena Reshetova <elena.reshetova@intel.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the blinding is strictly only called from inside eBPF JITs,
we need to change signatures for bpf_int_jit_compile() and
bpf_prog_select_runtime() first in order to prepare that the
eBPF program we're dealing with can change underneath. Hence,
for call sites, we need to return the latest prog. No functional
change in this patch.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Split the HAVE_BPF_JIT into two for distinguishing cBPF and eBPF JITs.
Current cBPF ones:
# git grep -n HAVE_CBPF_JIT arch/
arch/arm/Kconfig:44: select HAVE_CBPF_JIT
arch/mips/Kconfig:18: select HAVE_CBPF_JIT if !CPU_MICROMIPS
arch/powerpc/Kconfig:129: select HAVE_CBPF_JIT
arch/sparc/Kconfig:35: select HAVE_CBPF_JIT
Current eBPF ones:
# git grep -n HAVE_EBPF_JIT arch/
arch/arm64/Kconfig:61: select HAVE_EBPF_JIT
arch/s390/Kconfig:126: select HAVE_EBPF_JIT if PACK_STACK && HAVE_MARCH_Z196_FEATURES
arch/x86/Kconfig:94: select HAVE_EBPF_JIT if X86_64
Later code also needs this facility to check for eBPF JITs.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Besides others, remove redundant comments where the code is self
documenting enough, and properly indent various bpf_verifier_ops
and bpf_prog_type_list declarations. Moreover, remove two exports
that actually have no module user.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_hdr() is slightly more expensive than using skb->data in contexts
where we know they point to the same byte.
In receive path, tcp_v4_rcv() and tcp_v6_rcv() are in this situation,
as tcp header has not been pulled yet.
In output path, the same can be said when we just pushed the tcp header
in the skb, in tcp_transmit_skb() and tcp_make_synack()
Also factorize the two checks for tcb->tcp_flags & TCPHDR_SYN in
tcp_transmit_skb() and pass tcp header pointer to tcp_ecn_send(),
so that compiler can further optimize and avoid a reload.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__sock_cmsg_send() might return different error codes, not only -EINVAL.
Fixes: 24025c465f ("ipv4: process socket-level control messages in IPv4")
Fixes: ad1e46a837 ("ipv6: process socket-level control messages in IPv6")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Having multiple loadable modules with the same name cannot work
with modprobe, and having both net/qrtr/smd.ko and drivers/soc/qcom/smd.ko
results in a (somewhat cryptic) build error:
ERROR: "qcom_smd_driver_unregister" [net/qrtr/smd.ko] undefined!
ERROR: "qcom_smd_driver_register" [net/qrtr/smd.ko] undefined!
ERROR: "qcom_smd_set_drvdata" [net/qrtr/smd.ko] undefined!
ERROR: "qcom_smd_send" [net/qrtr/smd.ko] undefined!
ERROR: "qcom_smd_get_drvdata" [net/qrtr/smd.ko] undefined!
ERROR: "qcom_smd_driver_unregister" [drivers/soc/qcom/wcnss_ctrl.ko] undefined!
ERROR: "qcom_smd_driver_register" [drivers/soc/qcom/wcnss_ctrl.ko] undefined!
ERROR: "qcom_smd_set_drvdata" [drivers/soc/qcom/wcnss_ctrl.ko] undefined!
ERROR: "qcom_smd_send" [drivers/soc/qcom/wcnss_ctrl.ko] undefined!
ERROR: "qcom_smd_get_drvdata" [drivers/soc/qcom/wcnss_ctrl.ko] undefined!
Also, the qrtr driver uses the SMD interface and has a Kconfig dependency,
but also allows for compile-testing when SMD is disabled. However, if
with QCOM_SMD=m and COMPILE_TEST=y we can end up with QRTR_SMD=y and
that fails with a related link error.
The changes the dependency so we can still compile-test the driver but
not have it built-in if SMD is a module, to avoid running in the broken
configuration, and changes the Makefile to provide the driver under
a different module name.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: bdabad3e36 ("net: Add Qualcomm IPC router")
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce a new command in ndo_setup_tc() for hardware offloaded
filters, to call the NIC driver, and make it update the statistics.
This will be done before dumping the filter and its statistics.
Signed-off-by: Amir Vadai <amirva@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement the stats_update callback that will be called by NIC drivers
for hardware offloaded filters.
Signed-off-by: Amir Vadai <amirva@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On devices that support TC U32 offloads, this flag enables a filter to be
added only to HW. skip-sw and skip-hw are mutually exclusive flags. By
default without any flags, the filter is added to both HW and SW, but no
error checks are done in case of failure to add to HW. With skip-sw,
failure to add to HW is treated as an error.
Here is a sample script that adds 2 filters, one with skip-sw and the other
with skip-hw flag.
# add ingress qdisc
tc qdisc add dev p4p1 ingress
# enable hw tc offload.
ethtool -K p4p1 hw-tc-offload on
# add u32 filter with skip-sw flag.
tc filter add dev p4p1 parent ffff: protocol ip prio 99 \
handle 800:0:1 u32 ht 800: flowid 800:1 \
skip-sw \
match ip src 192.168.1.0/24 \
action drop
# add u32 filter with skip-hw flag.
tc filter add dev p4p1 parent ffff: protocol ip prio 99 \
handle 800:0:2 u32 ht 800: flowid 800:2 \
skip-hw \
match ip src 192.168.2.0/24 \
action drop
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The nf_conntrack_core.c fix in 'net' is not relevant in 'net-next'
because we no longer have a per-netns conntrack hash.
The ip_gre.c conflict as well as the iwlwifi ones were cases of
overlapping changes.
Conflicts:
drivers/net/wireless/intel/iwlwifi/mvm/tx.c
net/ipv4/ip_gre.c
net/netfilter/nf_conntrack_core.c
Signed-off-by: David S. Miller <davem@davemloft.net>
Switchdev has been around for quite a while now, putting "EXPERIMENTAL"
in the description is no longer accurate, drop it.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, when creating or updating a route, no check is performed
in both ipv4 and ipv6 code to the hoplimit value.
The caller can i.e. set hoplimit to 256, and when such route will
be used, packets will be sent with hoplimit/ttl equal to 0.
This commit adds checks for the RTAX_HOPLIMIT value, in both ipv4
ipv6 route code, substituting any value greater than 255 with 255.
This is consistent with what is currently done for ADVMSS and MTU
in the ipv4 code.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The slab name ends up being visible in the directory structure under
/sys, and even if you don't have access rights to the file you can see
the filenames.
Just use a 64-bit counter instead of the pointer to the 'net' structure
to generate a unique name.
This code will go away in 4.7 when the conntrack code moves to a single
kmemcache, but this is the backportable simple solution to avoiding
leaking kernel pointers to user space.
Fixes: 5b3501faa8 ("netfilter: nf_conntrack: per netns nf_conntrack_cachep")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
* Change SMD callback parameters
* Use writecombine mapping for SMEM
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXFvdZAAoJEFKiBbHx2RXVxPQQAJnEp1+5CDN7ziQ/zJ3juSxb
b7rOr5Z0qMDhh0H9BFdmhGzFctbY0AOqerfh138pXa6KN7maXT/enUE4vvtWUp1B
HCr9xt13waoa8e6umj/D1qU5mMLUX+JraVkIo6Dyd1g6pVmdrbwzP+Z0tW0x7CvW
n5r9eDYts66vGbQ6xzK40rTi22sEN32AY4mezKj40L8o0NvfZCCwb3dfw2RQzLHK
0KnSQ0PKqU1ZWrDc2YuvRsyh5KGv+t8vvnpMVVUi4TRx3qzoGpHpH3fE3evvMGKY
ya6nmFNFPZ+ZlzsGDx10RqLu58NmjS8LXY6O+yCGy9NPuonKpwHgYD2wxut2NR/W
TTtMhvJWKNIwhXR8ZRDaVBWQJBwoLvpNQT1zOIUuUKf+SZ2DbfgVpte7NZvrgEka
aY4nnZMbI0MvpaAHUtllWbC4xqMQJXY+4wrahKQgKF/l5H/0XQ99fHHz86DS9LCO
y+dfmU98dv/1X8qmxNfsULtJwbxi2Xwzt3MMMZb5v138STyMWeqWOLm+KGzgs1hu
Pc/RLQALdJpQEcg74s5xt2yxSqRuzDig7FtRT1a2SX33vi9YohQEgXCDe+IOJC5l
sm1Ppi3oO/yUcocUDFTBz/g6zD4SWywu2gmBcZIh/JVy8Sd053WwlaKs1Nld6ta7
QPx6rga7lgrYW409RS2A
=Jzqj
-----END PGP SIGNATURE-----
Merge tag 'qcom-soc-for-4.7-2' into net-next
This merges the Qualcomm SOC tree with the net-next, solving the
merge conflict in the SMD API between the two.
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Clean up: Pass in just the piece of the svc_rqst that is needed
here.
While we're in the area, add an informative documenting comment.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
If the server has forced a disconnect, the associated QP has not
been moved to the Error state, and thus Receives are still posted.
Ensure Receives (and any other outstanding WRs) are drained to
release resources that can be freed during teardown of the
svcrdma_xprt.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Since backward direction support was added, the rq_depth was
increased to accommodate both forward and backward Receives.
But only forward Receives need to be posted after a connection
has been accepted. Receives for backward replies are posted as
needed by svc_rdma_bc_sendto().
This doesn't break anything, but it means some resources are
wasted.
Fixes: 03fe993153 ('svcrdma: Define maximum number of ...')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Clean up: svc_rdma_get_read_chunk() already returns a pointer
to the Read list. No need to set "ch" again to the value it
already contains.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Get a fresh op_ctxt in send_reply() instead of in svc_rdma_sendto().
This ensures that svc_rdma_put_context() is invoked only once if
send_reply() fails.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
An xdr_buf has a head, a vector of pages, and a tail. Each
RPC request is presented to the NFS server contained in an
xdr_buf.
The RDMA transport would like to supply the NFS server with only
the NFS WRITE payload bytes in the page vector. In some common
cases, that would allow the NFS server to swap those pages right
into the target file's page cache.
Have the transport's RDMA Read logic put XDR pad bytes in the tail
iovec, and not in the pages that hold the data payload.
The NFSv3 WRITE XDR decoder is finicky about the lengths involved,
so make sure it is looking in the correct places when computing
the total length of the incoming NFS WRITE request.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Allow both IPv4 and IPv6 to bind same port at the same time,
restricts use of the IPv6 socket to IPv6 communication.
Changes from v1:
- Check rdma_set_afonly return value (suggested by Leon Romanovsky)
Changes from v2:
- Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Shirley Ma <shirley.ma@oracle.com>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The SRP initiator allows to set max_sectors to a value that exceeds
the largest amount of data that can be mapped at once with an mlx4
HCA using fast registration and a page size of 4 KB. Hence modify
ib_map_mr_sg() such that it can map partial sg-elements. If an
sg-element has been mapped partially, let the caller know
which fraction has been mapped by adjusting *sg_offset.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
With all the latest fixes applied, I am still able to reproduce this
(and other) warning(s):
WARNING: CPU: 1 PID: 19684 at ../kernel/workqueue.c:4092 destroy_workqueue+0x70a/0x770()
...
Call Trace:
[<ffffffff819fee81>] ? dump_stack+0xb3/0x112
[<ffffffff8117377e>] ? warn_slowpath_common+0xde/0x140
[<ffffffff811ce68a>] ? destroy_workqueue+0x70a/0x770
[<ffffffff811739ae>] ? warn_slowpath_null+0x2e/0x40
[<ffffffff811ce68a>] ? destroy_workqueue+0x70a/0x770
[<ffffffffa0c944c9>] ? hci_unregister_dev+0x2a9/0x720 [bluetooth]
[<ffffffffa0b301db>] ? vhci_release+0x7b/0xf0 [hci_vhci]
[<ffffffffa0b30160>] ? vhci_flush+0x50/0x50 [hci_vhci]
[<ffffffff8117cd73>] ? do_exit+0x863/0x2b90
This is due to race present in the hci_unregister_dev path.
hdev->power_on work races with hci_dev_do_close. One tries to open,
the other tries to close, leading to warning like the above. (Another
example is a warning in kobject_get or kobject_put depending on who
wins the race.)
Fix this by switching those two racers to ensure hdev->power_on never
triggers while hci_dev_do_close is in progress.
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
While testing an OpenStack configuration using VXLANs I saw the following
call trace:
RIP: 0010:[<ffffffff815fad49>] udp4_lib_lookup_skb+0x49/0x80
RSP: 0018:ffff88103867bc50 EFLAGS: 00010286
RAX: ffff88103269bf00 RBX: ffff88103269bf00 RCX: 00000000ffffffff
RDX: 0000000000004300 RSI: 0000000000000000 RDI: ffff880f2932e780
RBP: ffff88103867bc60 R08: 0000000000000000 R09: 000000009001a8c0
R10: 0000000000004400 R11: ffffffff81333a58 R12: ffff880f2932e794
R13: 0000000000000014 R14: 0000000000000014 R15: ffffe8efbfd89ca0
FS: 0000000000000000(0000) GS:ffff88103fd80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000488 CR3: 0000000001c06000 CR4: 00000000001426e0
Stack:
ffffffff81576515 ffffffff815733c0 ffff88103867bc98 ffffffff815fcc17
ffff88103269bf00 ffffe8efbfd89ca0 0000000000000014 0000000000000080
ffffe8efbfd89ca0 ffff88103867bcc8 ffffffff815fcf8b ffff880f2932e794
Call Trace:
[<ffffffff81576515>] ? skb_checksum+0x35/0x50
[<ffffffff815733c0>] ? skb_push+0x40/0x40
[<ffffffff815fcc17>] udp_gro_receive+0x57/0x130
[<ffffffff815fcf8b>] udp4_gro_receive+0x10b/0x2c0
[<ffffffff81605863>] inet_gro_receive+0x1d3/0x270
[<ffffffff81589e59>] dev_gro_receive+0x269/0x3b0
[<ffffffff8158a1b8>] napi_gro_receive+0x38/0x120
[<ffffffffa0871297>] gro_cell_poll+0x57/0x80 [vxlan]
[<ffffffff815899d0>] net_rx_action+0x160/0x380
[<ffffffff816965c7>] __do_softirq+0xd7/0x2c5
[<ffffffff8107d969>] run_ksoftirqd+0x29/0x50
[<ffffffff8109a50f>] smpboot_thread_fn+0x10f/0x160
[<ffffffff8109a400>] ? sort_range+0x30/0x30
[<ffffffff81096da8>] kthread+0xd8/0xf0
[<ffffffff81693c82>] ret_from_fork+0x22/0x40
[<ffffffff81096cd0>] ? kthread_park+0x60/0x60
The following trace is seen when receiving a DHCP request over a flow-based
VXLAN tunnel. I believe this is caused by the metadata dst having a NULL
dev value and as a result dev_net(dev) is causing a NULL pointer dereference.
To resolve this I am replacing the check for skb_dst(skb)->dev with just
skb->dev. This makes sense as the callers of this function are usually in
the receive path and as such skb->dev should always be populated. In
addition other functions in the area where these are called are already
using dev_net(skb->dev) to determine the namespace the UDP packet belongs
in.
Fixes: 63058308cd ("udp: Add udp6_lib_lookup_skb and udp4_lib_lookup_skb")
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sunrpc is using SOCKWQ_ASYNC_NOSPACE without setting SOCK_FASYNC,
so the recent optimizations done in sk_set_bit() and sk_clear_bit()
broke it.
There is still the risk that a subsequent sock_fasync() call
would clear SOCK_FASYNC, but sunrpc does not use this yet.
Fixes: 9317bb6982 ("net: SOCKWQ_ASYNC_NOSPACE optimizations")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jiri Pirko <jiri@resnulli.us>
Reported-by: Huang, Ying <ying.huang@intel.com>
Tested-by: Jiri Pirko <jiri@resnulli.us>
Tested-by: Huang, Ying <ying.huang@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When an ACTIVATE or data packet is received in a link in state
ESTABLISHING, the link does not immediately change state to
ESTABLISHED, but does instead return a LINK_UP event to the caller,
which will execute the state change in a different lock context.
This non-atomic approach incurs a low risk that we may have two
LINK_UP events pending simultaneously for the same link, resulting
in the final part of the setup procedure being executed twice. The
only potential harm caused by this it that we may see two LINK_UP
events issued to subsribers of the topology server, something that
may cause confusion.
This commit eliminates this risk by checking if the link is already
up before proceeding with the second half of the setup.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When dealing with WCCP in gre6 tunnel, it sets the wrong tpi->protocol,
that is, ETH_P_IP instead of ETH_P_IPV6 for the encapuslated traffic.
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Do not include attribute IFLA_GRE_TOS.
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* completion and fixups of nla_put_64_64bit() work
* remove a/b/g/n from wext nickname to avoid confusion
with 11ac (which wouldn't even fit fully there due to
string length restrictions)
along with some other minor changes/cleanups.
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJXNGMwAAoJEGt7eEactAAdrRMP/RGmFd1I8Z7gIsKc7pdknf20
4xLA3v8cqD7c9OtqlfVSfIyIosNl+pMatXK4eSaYzJICmdV9lU8VqgW6eH9lmDuM
q8Eis2/i7ZozzYhANVFQxMRvmSED8MeRm49LOzRofuPg4FHzyTYd8+TRfQ+QGRm9
xlrKTJ4BwH5lZgmH20/cphbffo5j8IyjzjKw0MC/tjikIePww6Am/r1jpCnd0Mwz
rWwRa4pQ9JBV7UzzzdcMpdn3KJkteFq/gJPCpZVDo3Zf+L21UavxP99NoUWiUvUU
bEQ9FzcuB7Zyt/lCAyu3ECBGqLvskqseFg3zmDnUoL7GmCZ1PZlfsnfOhQP0HgpI
t/dyw+TYIu6d4ZHbqGM6q6usjIJLltaxATwvREq3VdWFn5fYFu0Em3CyQ0on1XGn
4anNdkilGxaVEHCFBvfLvMTEfsTj7eycsNsDrZEGYV3NwX/wfz+6thqqu8gXFUsC
4MB1fS+YPg7u72ne+oJSPyJyFofYXDSjdNqnZ6HSQEZHinE+FnDCywZlJesxzDsH
q1ABkKLnmMNIwHxbfRBE0eO8sKXo6kUTJeFmWHtxB8/LcsLsV0/8juUy9H5PPZPU
7TJhn4SZrlJb92xq1DQ0t1xsRxTLonbRuu9PtVVZMUhZP69PpzK9toC6eCo9bovn
XFdSQ2c2xRwRAiSjUpQP
=QNxF
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-davem-2016-05-12' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
Some more work for 4.7, notably:
* completion and fixups of nla_put_64_64bit() work
* remove a/b/g/n from wext nickname to avoid confusion
with 11ac (which wouldn't even fit fully there due to
string length restrictions)
along with some other minor changes/cleanups.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Since 4.4 we erronously use timestamp of the netlink skb (which is zero).
Bugzilla: https://bugzilla.netfilter.org/show_bug.cgi?id=1066
Fixes: b28b1e826f ("netfilter: nfnetlink_queue: use y2038 safe timestamp")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When using RSS, frames might not be processed in the correct order,
and thus AP_LINK_PS must be used; most likely with firmware keeping
track of the powersave state, this is the case in iwlwifi now.
In this case, the driver can use ieee80211_sta_ps_transition() to
still have mac80211 manage powersave buffering. However, for U-APSD
and PS-Poll this isn't sufficient. If the device can't manage that
entirely on its own, mac80211's code should be used.
To allow this, export two functions: ieee80211_sta_uapsd_trigger()
and ieee80211_sta_pspoll().
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
There's no harm in having drivers read the list, since they can
use RCU protection or RTNL locking; allow this to not require
each and every driver to also implement its own bookkeeping.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The devlist_mtx mutex was removed about two years ago, in favour of just
using RTNL/RCU protection. Remove the comment still referencing it.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This allows finding vendor IE from a specific vendor.
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Some hardware (iwlwifi an example) de-aggregate AMSDUs and copy the IV
as is to the generated MPDUs, so the same PN appears in multiple
packets without being a replay attack. Allow driver to explicitly
indicate that a frame is allowed to have the same PN as the previous
frame.
Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In some cases, after a sudden AP disappearing and reconnection to
another AP in the same ESS, user space gets the old AP in scan
results (cached). User space may decide to roam to that old AP
which will cause a disconnection and longer recovery.
Remove APs that are probably out of range from BSS table.
Signed-off-by: David Spinadel <david.spinadel@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This is the first NFC pull request for 4.7. With this one we
mainly have:
- Support for NXP's pn532 NFC chipset. The pn532 is based on the same
microcontroller as the pn533, but it talks to the host through i2c
instead of USB. By separating the pn533 driver into core and PHY
parts, we can not add the i2c layer and support the pn532 chipset.
- Support for NCI's loopback mode. This is a testing mode where each
packet received by the NFCC is sent back to the DH, allowing the
host to test that the controller can receive and send data.
- A few ACPI related fixes for the STMicro drivers, in order to match
the device tree naming scheme.
- A bunch of cleanups for the st-nci and the st21nfca STMicro drivers.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXMRA8AAoJEIqAPN1PVmxKknQP/2WyaVYD+jeRkM4pymRQ/6s6
V2+be68JlatvC53D6fbzo9FF8/QO7w++vTF+KI5nV6g/RvkPvKnxg2DC39fob0DA
6x7KaRuUrTwxBx4tfsvtNfUL4idRi3i6/hrYRJyzaNMHKAukDbjJQiCchveUhZeB
NlfYSEcieNnRJkGpdVGny4tea0fFOY6v+Uz74h2HboKbSD3SGNK29v47AqxH/ssg
Y9XycyYWWF1w2uOUfFzJFsvBAYcPlBDF4WqQ8SkamxhaPAB047Ysfp3y9vWGc8By
pdXLNAALzlCIt2JeGsrXhLG+Zf3y3w3G7HlcVbkh5BcB6fN2FYpPZurY8Km8Z6SM
ufgP8MwGOvOJa9NKapSm3FF6Tl1B3gCK3+SWT5N/lSNEFnaoZ6xMNGqXe1RhTght
+oqfuHLNO0YulQG0/ixJJl08p4eb6hhUMO3LKan3AYub9kgk63I43XeKXhIlqmLJ
tEh3xAPzbPyu+nwkrJF0B2tMeUdrCKZaucru/upYcZi7LDbQOvMjHgTS8YmX/aBk
gCOjINs+U1Q9jN2B7fMJe1uP6DJOTVW+xqiaLBLwHawRBhQQPbg+pLK/sHgFWcaa
bdUWASp7n5Bu5pY/rPHL+xvA8O5z5ngJ7ZPjqB2CvHr9XyQDna3Fo9jA/w3hgr/q
DgvCHukjvkDlUochojQZ
=zM22
-----END PGP SIGNATURE-----
Merge tag 'nfc-next-4.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/nfc-next
Samuel Ortiz says:
====================
NFC 4.7 pull request
This is the first NFC pull request for 4.7. With this one we
mainly have:
- Support for NXP's pn532 NFC chipset. The pn532 is based on the same
microcontroller as the pn533, but it talks to the host through i2c
instead of USB. By separating the pn533 driver into core and PHY
parts, we can not add the i2c layer and support the pn532 chipset.
- Support for NCI's loopback mode. This is a testing mode where each
packet received by the NFCC is sent back to the DH, allowing the
host to test that the controller can receive and send data.
- A few ACPI related fixes for the STMicro drivers, in order to match
the device tree naming scheme.
- A bunch of cleanups for the st-nci and the st21nfca STMicro drivers.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The dsa_switch structure contains a dsa_chip_data member called pd.
However in the rest of the code, pd is used for dsa_platform_data.
This is confusing. Rename it cd, which is already often used in dsa.c
and slave.c for this data type.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The switch drivers only use the master_dev member for dev_info()
messages. Now that the device is passed to the old style probe, and
new style drivers are probed as true linux drivers, this is no longer
needed.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Resetting the switch is something the driver does, not the framework.
So move the parsing of this property into the driver.
There are no in kernel users of this property, so moving it does not
break anything. There is however a board which will make use of this
property making its way into the kernel.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Applications such as OSPF and BFD need the original ingress device not
the VRF device; the latter can be derived from the former. To that end
add the skb_iif to inet_skb_parm and set it in ipv4 code after clearing
the skb control buffer similar to IPv6. From there the pktinfo can just
pull it from cb with the PKTINFO_SKB_CB cast.
The previous patch moving the skb->dev change to L3 means nothing else
is needed for IPv6; it just works.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the VRF driver uses the rx_handler to switch the skb device
to the VRF device. Switching the dev prior to the ip / ipv6 layer
means the VRF driver has to duplicate IP/IPv6 processing which adds
overhead and makes features such as retaining the ingress device index
more complicated than necessary.
This patch moves the hook to the L3 layer just after the first NF_HOOK
for PRE_ROUTING. This location makes exposing the original ingress device
trivial (next patch) and allows adding other NF_HOOKs to the VRF driver
in the future.
dev_queue_xmit_nit is exported so that the VRF driver can cycle the skb
with the switched device through the packet taps to maintain current
behavior (tcpdump can be used on either the vrf device or the enslaved
devices).
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Protocol for 4in6 tunnel is IPPROTO_IPIP. This was wrongly changed by
the last cleanup.
CC: Tom Herbert <tom@herbertland.com>
Fixes: 0d3c703a9d ("ipv6: Cleanup IPv6 tunnel receive path")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For ipgre interface in collect metadata mode, it doesn't make sense for the
interface to be of ARPHRD_IPGRE type. The outer header of received packets
is not needed, as all the information from it is present in metadata_dst. We
already don't set ipgre_header_ops for collect metadata interfaces, which is
the only consumer of mac_header pointing to the outer IP header.
Just set the interface type to ARPHRD_NONE in collect metadata mode for
ipgre (not gretap, that still correctly stays ARPHRD_ETHER) and reset
mac_header.
Fixes: a64b04d86d ("gre: do not assign header_ops in collect metadata mode")
Fixes: 2e15ea390e ("ip_gre: Add support to collect tunnel metadata.")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When using conntrack helpers from OVS, a common configuration is to
perform a lookup without specifying a helper, then go through a
firewalling policy, only to decide to attach a helper afterwards.
In this case, the initial lookup will cause a ct entry to be attached to
the skb, then the later commit with helper should attach the helper and
confirm the connection. However, the helper attachment has been missing.
If the user has enabled automatic helper attachment, then this issue
will be masked as it will be applied in init_conntrack(). It is also
masked if the action is executed from ovs_packet_cmd_execute() as that
will construct a fresh skb.
This patch fixes the issue by making an explicit call to try to assign
the helper if there is a discrepancy between the action's helper and the
current skb->nfct.
Fixes: cae3a26275 ("openvswitch: Allow attaching helpers to ct action")
Signed-off-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace 2 arguments (cnt and rtt) in the congestion control modules'
pkts_acked() function with a struct. This will allow adding more
information without having to modify existing congestion control
modules (tcp_nv in particular needs bytes in flight when packet
was sent).
As proposed by Neal Cardwell in his comments to the tcp_nv patch.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The process below was broken and is fixed with this patch.
//add an ife action and give it an instance id of 1
sudo tc actions add action ife encode \
type 0xDEAD allow mark dst 02:15:15:15:15:15 index 1
//create a filter which binds to ife action id 1
sudo tc filter add dev $DEV parent ffff: protocol ip prio 1 u32\
match ip dst 17.0.0.1/32 flowid 1:11 action ife index 1
Message before fix was:
RTNETLINK answers: Invalid argument
We have an error talking to the kernel
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The process below was broken and is fixed with this patch.
//add a skbedit action and give it an instance id of 1
sudo tc actions add action skbedit mark 10 index 1
//create a filter which binds to skbedit action id 1
sudo tc filter add dev $DEV parent ffff: protocol ip prio 1 u32\
match ip dst 17.0.0.1/32 flowid 1:10 action skbedit index 1
Message before fix was:
RTNETLINK answers: Invalid argument
We have an error talking to the kernel
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The process below was broken and is fixed with this patch.
//add a simple action and give it an instance id of 1
sudo tc actions add action simple sdata "foobar" index 1
//create a filter which binds to simple action id 1
sudo tc filter add dev $DEV parent ffff: protocol ip prio 1 u32\
match ip dst 17.0.0.1/32 flowid 1:10 action simple index 1
Message before fix was:
RTNETLINK answers: Invalid argument
We have an error talking to the kernel
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The process below was broken and is fixed with this patch.
//add an mirred action and give it an instance id of 1
sudo tc actions add action mirred egress mirror dev $MDEV index 1
//create a filter which binds to mirred action id 1
sudo tc filter add dev $DEV parent ffff: protocol ip prio 1 u32\
match ip dst 17.0.0.1/32 flowid 1:10 action mirred index 1
Message before bug fix was:
RTNETLINK answers: Invalid argument
We have an error talking to the kernel
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This was broken and is fixed with this patch.
//add an ipt action and give it an instance id of 1
sudo tc actions add action ipt -j mark --set-mark 2 index 1
//create a filter which binds to ipt action id 1
sudo tc filter add dev $DEV parent ffff: protocol ip prio 1 u32\
match ip dst 17.0.0.1/32 flowid 1:10 action ipt index 1
Message before bug fix was:
RTNETLINK answers: Invalid argument
We have an error talking to the kernel
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Late vlan action binding was broken and is fixed with this patch.
//add a vlan action to pop and give it an instance id of 1
sudo tc actions add action vlan pop index 1
//create filter which binds to vlan action id 1
sudo tc filter add dev $DEV parent ffff: protocol ip prio 1 u32 \
match ip dst 17.0.0.1/32 flowid 1:1 action vlan index 1
current message(before bug fix) was:
RTNETLINK answers: Invalid argument
We have an error talking to the kernel
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- remove useless skb size check in batadv_interface_rx
- basic netns support introduced by Andrew Lunn:
- prevent virtual interface from changing netns by setting
NETIF_F_NETNS_LOCAL
- create virtual interface within the netns of the first
hard-interface
- introduce detection of complex bridge loops and report event
to the user (via udev) when the Bridge Loop Avoidance mechanism
can't prevent them
- minor reference counting bugfixes for the hard_iface object that
couldn't make it via the net tree
- use kref_get() instead of kref_get_unless_zero() to make reference
counting bug more visible
- use batadv_compare_eth() all over the code when possible instead of
plain memcmp()
- minor code cleanup and style adjustments
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJXMjUdAAoJEJ4aZjxxc6bKhDsP/i3rRwoppaDK3gxuDkXqrqOj
AYhMJYulcizK2PPCzj24bZSttwcKDNWpjR67rBgXY02NpqSsW+qYFnCm+KokzgC0
3NGV+lC6oCwzowP3FJtW06BCNOybnbR84Fw6JCRL5P4mKoGnhn4jAOIYlYnk4K7z
ByhEallLD84h5oo2Td8i06QzxzaO6lkYaLrAVBQnd+Cs27foxjMOO0gasyCvUkl7
Jsm2n0NSEN6SYv0smLhiPHx9oF66MBAxNRVS6dL/6Ko/vSHNy1AGQgx0SztlHzCQ
K6JCpOZn5CExvKZUMXQXOrSkpdjIxVh+EGnGBAGyNswqBg49Y3K45FqNz4b+R0Pg
ckPzc7UAo8z+ZHzPz4mzx/596sw9xG4jLS6aHkjwet7kLXj/ShP//HqGwfHTeeOY
w4a2ers6ByCq2PwKYz34s/sDvC2+p++ZdyJF+qOjJpNjbH1B+mDId8M1f9H/9rvF
//EZhCcdlpqLIxvQKDMADQRpcwTbAPglsBuHs/1Sk1zjRK1NDRPw+46pUUQOb8ND
ydAs0JVUOgSKHZT+3b5l85poRYpp8gX8nVrvF2DNJBqmBTvSkAbZpNPiSy6fNWQV
he3Q9RndqQq2OPffG1ijte7G1m9J8lCh7Nzl6Ee6vGsw4GzzkKIx9qHHCtXhYpOE
kdg5ttgUK5sfgjsn3ANp
=CB1D
-----END PGP SIGNATURE-----
Merge tag 'batman-adv-for-davem' of git://git.open-mesh.org/linux-merge
Antonio Quartulli says:
====================
Included changes:
- remove useless skb size check in batadv_interface_rx
- basic netns support introduced by Andrew Lunn:
- prevent virtual interface from changing netns by setting
NETIF_F_NETNS_LOCAL
- create virtual interface within the netns of the first
hard-interface
- introduce detection of complex bridge loops and report event
to the user (via udev) when the Bridge Loop Avoidance mechanism
can't prevent them
- minor reference counting bugfixes for the hard_iface object that
couldn't make it via the net tree
- use kref_get() instead of kref_get_unless_zero() to make reference
counting bug more visible
- use batadv_compare_eth() all over the code when possible instead of
plain memcmp()
- minor code cleanup and style adjustments
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
There are two instances of an unused variable, `doff' added by
commit 6fa01ccd88 ("skbuff: Add pskb_extract() helper function")
in pskb_carve_inside_header() and pskb_carve_inside_nonlinear().
Remove these instances, they are not used.
Reported by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The handler 'ila_fill_encap_info' adds two attributes: ILA_ATTR_LOCATOR
and ILA_ATTR_CSUM_MODE.
nla_total_size_64bit() must be use for ILA_ATTR_LOCATOR.
Also, do nla_put_u8 instead of nla_put_u64 for ILA_ATTR_CSUM_MODE.
Fixes: f13a82d87b ("ipv6: use nla_put_u64_64bit()")
Fixes: 90bfe662db ("ila: add checksum neutral ILA translations")
Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the very unlikely case __tcp_retransmit_skb() can not use the cloning
done in tcp_transmit_skb(), we need to refresh skb_mstamp before doing
the copy and transmit, otherwise TCP TS val will be an exact copy of
original transmit.
Fixes: 7faee5c0d5 ("tcp: remove TCP_SKB_CB(skb)->when")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When comparing Ethernet address it is better to use the more
generic batadv_compare_eth. The latter is also optimised for
architectures having a fast unaligned access.
Signed-off-by: Antonio Quartulli <a@unstable.cc>
[sven@narfation.org: fix conflicts with current version]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Reviewed-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
It is easier to understand that the returned value of a specific function
doesn't have to be 0 when the functions was successful when the actual
return type is bool. This is especially true when all surrounding functions
with return type int use negative values to return the error code.
Reported-by: Nicholas Krause <xerofoify@gmail.com>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
_batadv_update_route requires that the caller already has a valid reference
for neigh_node. It is therefore not possible that it has an reference
counter of 0 and was still given to this function
The kref_get function instead WARNs (with debug information) when the
reference counter would still be 0. This makes a bug in batman-adv better
visible because kref_get_unless_zero would have ignored this problem.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The callers of the functions using batadv_hard_iface objects already make
sure that they hold a valid reference. The subfunctions don't have
to check whether the reference counter is > 0 because this was checked by
the callers.
The kref_get function instead WARNs (with debug information) when the
reference counter would still be 0. This makes a bug in batman-adv better
visible because kref_get_unless_zero would have ignored this problem.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
batadv_gw_node_add requires that the caller already has a valid reference
for orig_node. It is therefore not possible that it has an reference
counter of 0 and was still given to this function
The kref_get function instead WARNs (with debug information) when the
reference counter would still be 0. This makes a bug in batman-adv better
visible because kref_get_unless_zero would have ignored this problem.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
batadv_gw_select requires that the caller already has a valid reference for
new_gw_node. It is therefore not possible that it has an reference counter
of 0 and was still given to this function
The kref_get function instead WARNs (with debug information) when the
reference counter would still be 0. This makes a bug in batman-adv better
visible because kref_get_unless_zero would have ignored this problem.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
batadv_nc_get_nc_node requires that the caller already has a valid
reference for orig_neigh_node. It is therefore not possible that it has an
reference counter of 0 and was still given to this function
The kref_get function instead WARNs (with debug information) when the
reference counter would still be 0. This makes a bug in batman-adv better
visible because kref_get_unless_zero would have ignored this problem.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
batadv_tvlv_container_get requires that tvlv.container_list_lock is held by
the caller. It is therefore not possible that an item in
tvlv.container_list has an reference counter of 0 and is still in the list
The kref_get function instead WARNs (with debug information) when the
reference counter would still be 0. This makes a bug in batman-adv better
visible because kref_get_unless_zero would have ignored this problem.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The hard_iface is referenced in the packet_type for batman-adv. Increase
the refcounter of the hard_interface for it to have an explicit reference
for it in case this functionality gets refactorted and the currently
used implicit reference for it will be removed.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The receive function may start processing an incoming packet while the
hard_iface is shut down in a different context. All called functions called
with the batadv_hard_iface object belonging to the incoming interface would
have to check whether the reference counter is still > 0.
This is rather error-prone because this check can be forgotten easily.
Instead check the reference counter when receiving the object to make sure
that all called functions have a valid reference.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The batadv_hardif_list list is checked in many situations and the items
in this list are given to specialized functions to modify the routing
behavior. At the moment each of these called functions has to check
itself whether the received batadv_hard_iface has a refcount > 0 before
it can increase the reference counter and use it in other objects.
This can easily lead to problems because it is not easily visible where
all callers of a function got the batadv_hard_iface object from and
whether they already hold a valid reference.
Checking the reference counter directly before calling a subfunction
with a pointer from the batadv_hardif_list avoids this problem.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
There are network setups where the current bridge loop avoidance can't
detect bridge loops. The minimal setup affected would consist of two
LANs and two separate meshes, connected in a ring like that:
A...(mesh1)...B
| |
(LAN1) (LAN2)
| |
C...(mesh2)...D
Since both the meshes and backbones are separate, the bridge loop
avoidance has not enough information to detect and avoid the loop
in this case. Even if these scenarios can't be fixed easily,
these kind of loops can be detected.
This patch implements a periodic check (running every 60 seconds for
now) which sends a broadcast frame with a random MAC address on
each backbone VLAN. If a broadcast frame with the same MAC address
is received shortly after on the mesh, we know that there must be a
loop and report that incident as well as throw an uevent to let others
handle that problem.
Signed-off-by: Simon Wunderlich <simon.wunderlich@open-mesh.com>
[sven@narfation.org: fix conflicts with current version]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
When creating a soft interface, create it in the same netns as the
hard interface. Replace all references to init_net with the correct
name space for the interface being manipulated.
Suggested-by: Daniel Ehlers <danielehlers@mindeye.net>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The batX soft interface should not be moved between network name
spaces. This is similar to bridges, bonds, tunnels, which are not
allowed to move between network namespaces.
Suggested-by: Daniel Ehlers <danielehlers@mindeye.net>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Antonio Quartulli <a@unstable.cc>
Reviewed-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The callers of batadv_interface_rx have to make sure that enough data can
be pulled from the skb when they read the batman-adv header. The only two
functions using it are either calling pskb_may_pull with hdr_size directly
(batadv_recv_bcast_packet) or indirectly via batadv_check_unicast_packet
(batadv_recv_unicast_packet).
Reported-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
A recent commit introduced an unconditional use of an uninitialized
variable, as reported in this gcc warning:
net/netfilter/nf_conntrack_core.c: In function '__nf_conntrack_confirm':
net/netfilter/nf_conntrack_core.c:632:33: error: 'ctinfo' may be used uninitialized in this function [-Werror=maybe-uninitialized]
bytes = atomic64_read(&counter[CTINFO2DIR(ctinfo)].bytes);
^
net/netfilter/nf_conntrack_core.c:628:26: note: 'ctinfo' was declared here
enum ip_conntrack_info ctinfo;
The problem is that a local variable shadows the function parameter.
This removes the local variable, which looks like what Pablo originally
intended.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 71d8c47fc6 ("netfilter: conntrack: introduce clash resolution on insertion race")
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contain Netfilter simple fixes for your net tree,
two one-liner and one two-liner:
1) Oneliner to fix missing spinlock definition that triggers
'BUG: spinlock bad magic on CPU#' when spinlock debugging is enabled,
from Florian Westphal.
2) Fix missing workqueue cancelation on IDLETIMER removal,
from Liping Zhang.
3) Fix insufficient validation of netlink of NFACCT_QUOTA in
nfnetlink_acct, from Phil Turnbull.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix two spots where o_flags in a tunnel are being compared to GRE_SEQ
instead of TUNNEL_SEQ.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Need to use adjusted protocol value for setting inner protocol.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
GRE for IPv6 does not properly translate for GRE flags to tunnel
flags and vice versa. This patch fixes that.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In ip6gre_tnl_link_config set t->tun_len and t->hlen correctly for the
configuration. For hard_header_len and mtu calculation include
IPv6 header and encapsulation overhead.
In ip6gre_tunnel_init_common set t->tun_len and t->hlen correctly for
the configuration. Revert to setting hard_header_len instead of
needed_headroom.
Tested:
./ip link add name tun8 type ip6gretap remote \
2401:db00:20:911a:face:0:27:0 local \
2401:db00:20:911a:face:0:25:0 ttl 225
Gives MTU of 1434. That is equal to 1500 - 40 - 14 - 4 - 8.
./ip link add name tun8 type ip6gretap remote \
2401:db00:20:911a:face:0:27:0 local \
2401:db00:20:911a:face:0:25:0 ttl 225 okey 123
Gives MTU of 1430. That is equal to 1500 - 40 - 14 - 4 - 8 - 4.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stack object "dte_facilities" is allocated in x25_rx_call_request(),
which is supposed to be initialized in x25_negotiate_facilities.
However, 5 fields (8 bytes in total) are not initialized. This
object is then copied to userland via copy_to_user, thus infoleak
occurs.
Signed-off-by: Kangjie Lu <kjlu@gatech.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow udp and raw sockets to send by oif that is an enslaved interface
versus the l3mdev/VRF device. For example, this allows BFD to use ifindex
from IP_PKTINFO on a receive to send a response without the need to
convert to the VRF index. It also allows ping and ping6 to work when
specifying an enslaved interface (e.g., ping -I swp1 <ip>) which is
a natural use case.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move l3mdev_rt6_dst_by_oif and l3mdev_get_saddr to l3mdev.c. Collapse
l3mdev_get_rt6_dst into l3mdev_rt6_dst_by_oif since it is the only
user and keep the l3mdev_get_rt6_dst name for consistency with other
hooks.
A follow-on patch adds more code to these functions making them long
for inlined functions.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In netdevice.h we removed the structure in net-next that is being
changes in 'net'. In macsec.c and rtnetlink.c we have overlaps
between fixes in 'net' and the u64 attribute changes in 'net-next'.
The mlx5 conflicts have to do with vxlan support dependencies.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following large patchset contains Netfilter updates for your
net-next tree. My initial intention was to send you this in two goes but
when I looked back twice I already had this burden on top of me.
Several updates for IPVS from Marco Angaroni:
1) Allow SIP connections originating from real-servers to be load
balanced by the SIP persistence engine as is already implemented
in the other direction.
2) Release connections immediately for One-packet-scheduling (OPS)
in IPVS, instead of making it via timer and rcu callback.
3) Skip deleting conntracks for each one packet in OPS, and don't call
nf_conntrack_alter_reply() since no reply is expected.
4) Enable drop on exhaustion for OPS + SIP persistence.
Miscelaneous conntrack updates from Florian Westphal, including fix for
hash resize:
5) Move conntrack generation counter out of conntrack pernet structure
since this is only used by the init_ns to allow hash resizing.
6) Use get_random_once() from packet path to collect hash random seed
instead of our compound.
7) Don't disable BH from ____nf_conntrack_find() for statistics,
use NF_CT_STAT_INC_ATOMIC() instead.
8) Fix lookup race during conntrack hash resizing.
9) Introduce clash resolution on conntrack insertion for connectionless
protocol.
Then, Florian's netns rework to get rid of per-netns conntrack table,
thus we use one single table for them all. There was consensus on this
change during the NFWS 2015 and, on top of that, it has recently been
pointed as a source of multiple problems from unpriviledged netns:
11) Use a single conntrack hashtable for all namespaces. Include netns
in object comparisons and make it part of the hash calculation.
Adapt early_drop() to consider netns.
12) Use single expectation and NAT hashtable for all namespaces.
13) Use a single slab cache for all namespaces for conntrack objects.
14) Skip full table scanning from nf_ct_iterate_cleanup() if the pernet
conntrack counter tells us the table is empty (ie. equals zero).
Fixes for nf_tables interval set element handling, support to set
conntrack connlabels and allow set names up to 32 bytes.
15) Parse element flags from element deletion path and pass it up to the
backend set implementation.
16) Allow adjacent intervals in the rbtree set type for dynamic interval
updates.
17) Add support to set connlabel from nf_tables, from Florian Westphal.
18) Allow set names up to 32 bytes in nf_tables.
Several x_tables fixes and updates:
19) Fix incorrect use of IS_ERR_VALUE() in x_tables, original patch
from Andrzej Hajda.
And finally, miscelaneous netfilter updates such as:
20) Disable automatic helper assignment by default. Note this proc knob
was introduced by a900689264 ("netfilter: nf_ct_helper: allow to
disable automatic helper assignment") 4 years ago to start moving
towards explicit conntrack helper configuration via iptables CT
target.
21) Get rid of obsolete and inconsistent debugging instrumentation
in x_tables.
22) Remove unnecessary check for null after ip6_route_output().
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
An earlier patch changed lookup side to also net_eq() namespaces after
obtaining a reference on the conntrack, so a single kmemcache can be used.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We already include netns address in the hash, so we only need to use
net_eq in find_appropriate_src and can then put all entries into
same table.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Will be needed soon when we place all in the same hash table.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add function rpc_lookup_generic_cred, which allows lookups of a generic
credential that's not current_cred().
[jlayton: add gfp_t parm]
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
We need to be able to call the generic_cred creator from different
contexts. Add a gfp_t parm to the crcreate operation and to
rpcauth_lookup_credcache. For now, we just push the gfp_t parms up
one level to the *_lookup_cred functions.
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
An xdr_buf with head[0].iov_len = 0 and page_len = 0 will cause
xdr_init_decode() to incorrectly setup the xdr_stream. Specifically,
xdr->end is never initialized.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
TC_ACT_STOLEN is used when ingress traffic is mirred/redirected
to say ifb.
Packet is not dropped, but consumed.
Only TC_ACT_SHOT is a clear indication something went wrong.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On small embedded routers, one wants to control maximal amount of
memory used by fq_codel, instead of controlling number of packets or
bytes, since GRO/TSO make these not practical.
Assuming skb->truesize is accurate, we have to keep track of
skb->truesize sum for skbs in queue.
This patch adds a new TCA_FQ_CODEL_MEMORY_LIMIT attribute.
I chose a default value of 32 MBytes, which looks reasonable even
for heavy duty usages. (Prior fq_codel users should not be hurt
when they upgrade their kernels)
Two fields are added to tc_fq_codel_qd_stats to report :
- Current memory usage
- Number of drops caused by memory limits
# tc qd replace dev eth1 root est 1sec 4sec fq_codel memory_limit 4M
..
# tc -s -d qd sh dev eth1
qdisc fq_codel 8008: root refcnt 257 limit 10240p flows 1024
quantum 1514 target 5.0ms interval 100.0ms memory_limit 4Mb ecn
Sent 2083566791363 bytes 1376214889 pkt (dropped 4994406, overlimits 0
requeues 21705223)
rate 9841Mbit 812549pps backlog 3906120b 376p requeues 21705223
maxpacket 68130 drop_overlimit 4994406 new_flow_count 28855414
ecn_mark 0 memory_used 4190048 drop_overmemory 4994406
new_flows_len 1 old_flows_len 177
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Dave Täht <dave.taht@gmail.com>
Cc: Sebastian Möller <moeller0@gmx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add an implementation of Qualcomm's IPC router protocol, used to
communicate with service providing remote processors.
Signed-off-by: Courtney Cavin <courtney.cavin@sonymobile.com>
Signed-off-by: Bjorn Andersson <bjorn.andersson@sonymobile.com>
[bjorn: Cope with 0 being a valid node id and implement RTM_NEWADDR]
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
UDP tunnel segmentation code relies on the inner offsets being set for
an UDP tunnel GSO packet, but the inner *_complete() functions will
set the inner offsets only if 'encapsulation' is set before calling
them. Currently, udp_gro_complete() sets 'encapsulation' only after
the inner *_complete() functions are done. This causes the inner
offsets having invalid values after udp_gro_complete() returns, which
in turn will make it impossible to properly segment the packet in case
it needs to be forwarded, which would be visible to the user either as
invalid packets being sent or as packet loss.
This patch fixes this by setting skb's 'encapsulation' in
udp_gro_complete() before calling into the inner complete functions,
and by making each possible UDP tunnel gro_complete() callback set the
inner_mac_header to the beginning of the tunnel payload.
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Reviewed-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The setting of the UDP tunnel GSO type is already performed by
udp[46]_gro_complete().
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
I forgot that ip_send_unicast_reply() is not BH safe (yet).
Disabling preemption before calling it was not a good move.
Fixes: c10d9310ed ("tcp: do not assume TCP code is non preemptible")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andres Lagar-Cavilla <andreslc@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
allow cls_bpf and act_bpf programs access skb->data and skb->data_end pointers.
The bpf helpers that change skb->data need to update data_end pointer as well.
The verifier checks that programs always reload data, data_end pointers
after calls to such bpf helpers.
We cannot add 'data_end' pointer to struct qdisc_skb_cb directly,
since it's embedded as-is by infiniband ipoib, so wrapper struct is needed.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tables have to exist for VRFs to function. Ensure they exist
when VRF device is created.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Responses for packets to unused ports are getting lost with L3 domains.
IPv4 has ip_send_unicast_reply for sending TCP responses which accounts
for L3 domains; update the IPv6 counterpart tcp_v6_send_response.
For icmp the L3 master check needs to be moved up in icmp6_send
to properly respond to UDP packets to a port with no listener.
Fixes: ca254490c8 ("net: Add VRF support to IPv6 stack")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With the newly introduced helper functions the skb pulling is hidden
in the checksumming function - and undone before returning to the
caller.
The IGMP and MLD query parsing functions in the bridge still
assumed that the skb is pointing to the beginning of the IGMP/MLD
message while it is now kept at the beginning of the IPv4/6 header.
If there is a querier somewhere else, then this either causes
the multicast snooping to stay disabled even though it could be
enabled. Or, if we have the querier enabled too, then this can
create unnecessary IGMP / MLD query messages on the link.
Fixing this by taking the offset between IP and IGMP/MLD header into
account, too.
Fixes: 9afd85c9e4 ("net: Export IGMP/MLD message validation code")
Reported-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simon Horman says:
====================
Second Round of IPVS Updates for v4.7
please consider these enhancements to the IPVS. They allow its DoS
mitigation strategy effective in conjunction with the SIP persistence
engine.
====================
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We already include netns address in the hash and compare the netns pointers
during lookup, so even if namespaces have overlapping addresses entries
will be spread across the expectation table.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
DoS protection policy that deletes connections to avoid out of memory is
currently not effective for SIP-pe plus OPS-mode for two reasons:
1) connection templates (holding SIP call-id) are always skipped in
ip_vs_random_dropentry()
2) in_pkts counter (used by drop_entry algorithm) is not incremented
for connection templates
This patch addresses such problems with the following changes:
a) connection templates associated (via their dest) to virtual-services
configured in OPS mode are included in ip_vs_random_dropentry()
monitoring. This applies to SIP-pe over UDP (which requires OPS mode),
but is more general principle: when OPS is controlled by templates
memory can be used only by templates themselves, since OPS conns are
deleted after packet is forwarded.
b) OPS connections, if controlled by a template, cause increment of
in_pkts counter of their template. This is already happening but only
in case director is in master-slave mode (see ip_vs_sync_conn()).
Signed-off-by: Marco Angaroni <marcoangaroni@gmail.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
get_bridge_ifindices() is used from the old "deviceless" bridge ioctl
calls which aren't called with rtnl held. The comment above says that it is
called with rtnl but that is not really the case.
Here's a sample output from a test ASSERT_RTNL() which I put in
get_bridge_ifindices and executed "brctl show":
[ 957.422726] RTNL: assertion failed at net/bridge//br_ioctl.c (30)
[ 957.422925] CPU: 0 PID: 1862 Comm: brctl Tainted: G W O
4.6.0-rc4+ #157
[ 957.423009] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.8.1-20150318_183358- 04/01/2014
[ 957.423009] 0000000000000000 ffff880058adfdf0 ffffffff8138dec5
0000000000000400
[ 957.423009] ffffffff81ce8380 ffff880058adfe58 ffffffffa05ead32
0000000000000001
[ 957.423009] 00007ffec1a444b0 0000000000000400 ffff880053c19130
0000000000008940
[ 957.423009] Call Trace:
[ 957.423009] [<ffffffff8138dec5>] dump_stack+0x85/0xc0
[ 957.423009] [<ffffffffa05ead32>]
br_ioctl_deviceless_stub+0x212/0x2e0 [bridge]
[ 957.423009] [<ffffffff81515beb>] sock_ioctl+0x22b/0x290
[ 957.423009] [<ffffffff8126ba75>] do_vfs_ioctl+0x95/0x700
[ 957.423009] [<ffffffff8126c159>] SyS_ioctl+0x79/0x90
[ 957.423009] [<ffffffff8163a4c0>] entry_SYSCALL_64_fastpath+0x23/0xc1
Since it only reads bridge ifindices, we can use rcu to safely walk the net
device list. Also remove the wrong rtnl comment above.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The peer may be expecting a reply having sent a request and then done a
shutdown(SHUT_WR), so tearing down the whole socket at this point seems
wrong and breaks for me with a client which does a SHUT_WR.
Looking at other socket family's stream_recvmsg callbacks doing a shutdown
here does not seem to be the norm and removing it does not seem to have
had any adverse effects that I can see.
I'm using Stefan's RFC virtio transport patches, I'm unsure of the impact
on the vmci transport.
Signed-off-by: Ian Campbell <ian.campbell@docker.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Cc: Andy King <acking@vmware.com>
Cc: Dmitry Torokhov <dtor@vmware.com>
Cc: Jorgen Hansen <jhansen@vmware.com>
Cc: Adit Ranadive <aditr@vmware.com>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
If a quota bit is set in NFACCT_FLAGS but the NFACCT_QUOTA parameter is
missing then a NULL pointer dereference is triggered. CAP_NET_ADMIN is
required to trigger the bug.
Signed-off-by: Phil Turnbull <phil.turnbull@oracle.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently, we support set names of up to 16 bytes, get this aligned
with the maximum length we can use in ipset to make it easier when
considering migration to nf_tables.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The dprintf() and duprintf() functions are enabled at compile time,
these days we have better runtime debugging through pr_debug() and
static keys.
On top of this, this debugging is so old that I don't expect anyone
using this anymore, so let's get rid of this.
IP_NF_ASSERT() is still left in place, although this needs that
NETFILTER_DEBUG is enabled, I think these assertions provide useful
context information when reading the code.
Note that ARP_NF_ASSERT() has been removed as there is no user of
this.
Kill also DEBUG_ALLOW_ALL and a couple of pr_error() and pr_debug()
spots that are inconsistently placed in the code.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If the protocol is not natively supported, this assigns generic protocol
tracker so we can always assume a valid pointer after these calls.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: Joe Stringer <joe@ovn.org>
This patch introduces nf_ct_resolve_clash() to resolve race condition on
conntrack insertions.
This is particularly a problem for connection-less protocols such as
UDP, with no initial handshake. Two or more packets may race to insert
the entry resulting in packet drops.
Another problematic scenario are packets enqueued to userspace via
NFQUEUE after the raw table, that make it easier to trigger this
race.
To resolve this, the idea is to reset the conntrack entry to the one
that won race. Packet and bytes counters are also merged.
The 'insert_failed' stats still accounts for this situation, after
this patch, the drop counter is bumped whenever we drop packets, so we
can watch for unresolved clashes.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Introduce a helper function to update conntrack counters.
__nf_ct_kill_acct() was unnecessarily subtracting skb_network_offset()
that is expected to be zero from the ipv4/ipv6 hooks.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Remove unnecessary check for non-nul pointer in destroy_conntrack()
given that __nf_ct_l4proto_find() returns the generic protocol tracker
if the protocol is not supported.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When iterating, skip conntrack entries living in a different netns.
We could ignore netns and kill some other non-assured one, but it
has two problems:
- a netns can kill non-assured conntracks in other namespace
- we would start to 'over-subscribe' the affected/overlimit netns.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We already include netns address in the hash and compare the netns pointers
during lookup, so even if namespaces have overlapping addresses entries
will be spread across the table.
Assuming 64k bucket size, this change saves 0.5 mbyte per namespace on a
64bit system.
NAT bysrc and expectation hash is still per namespace, those will
changed too soon.
Future patch will also make conntrack object slab cache global again.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Once we place all conntracks into a global hash table we want them to be
spread across entire hash table, even if namespaces have overlapping ip
addresses.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Once we place all conntracks in the same hash table we must also compare
the netns pointer to skip conntracks that belong to a different namespace.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The iteration process is lockless, so we test if the conntrack object is
eligible for printing (e.g. is AF_INET) after obtaining the reference
count.
Once we put all conntracks into same hash table we might see more
entries that need to be skipped.
So add a helper and first perform the test in a lockless fashion
for fast skip.
Once we obtain the reference count, just repeat the check.
Note that this refactoring also includes a missing check for unconfirmed
conntrack entries due to slab rcu object re-usage, so they need to be
skipped since they are not part of the listing.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This prepares for upcoming change that places all conntracks into a
single, global table. For this to work we will need to also compare
net pointer during lookup. To avoid open-coding such check use the
nf_ct_key_equal helper and then later extend it to also consider net_eq.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Once we place all conntracks into same table iteration becomes more
costly because the table contains conntracks that we are not interested
in (belonging to other netns).
So don't bother scanning if the current namespace has no entries.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When resizing the conntrack hash table at runtime via
echo 42 > /sys/module/nf_conntrack/parameters/hashsize, we are racing with
the conntrack lookup path -- reads can happen in parallel and nothing
prevents readers from observing a the newly allocated hash but the old
size (or vice versa).
So access to hash[bucket] can trigger OOB read access in case the table got
expanded and we saw the new size but the old hash pointer (or it got shrunk
and we got new hash ptr but the size of the old and larger table):
kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] SMP KASAN
CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 4.6.0-rc2+ #107
[..]
Call Trace:
[<ffffffff822c3d6a>] ? nf_conntrack_tuple_taken+0x12a/0xe90
[<ffffffff822c3ac1>] ? nf_ct_invert_tuplepr+0x221/0x3a0
[<ffffffff8230e703>] get_unique_tuple+0xfb3/0x2760
Use generation counter to obtain the address/length of the same table.
Also add a synchronize_net before freeing the old hash.
AFAICS, without it we might access ct_hash[bucket] after ct_hash has been
freed, provided that lockless reader got delayed by another event:
CPU1 CPU2
seq_begin
seq_retry
<delay> resize occurs
free oldhash
for_each(oldhash[size])
Note that resize is only supported in init_netns, it took over 2 minutes
of constant resizing+flooding to produce the warning, so this isn't a
big problem in practice.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
No need to disable BH here anymore:
stats are switched to _ATOMIC variant (== this_cpu_inc()), which
nowadays generates same code as the non _ATOMIC NF_STAT, at least on x86.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Conntrack labels are currently sized depending on the iptables
ruleset, i.e. if we're asked to test or set bits 1, 2, and 65 then we
would allocate enough room to store at least bit 65.
However, with nft, the input is just a register with arbitrary runtime
content.
We therefore ask for the upper ceiling we currently have, which is
enough room to store 128 bits.
Alternatively, we could alter nf_connlabel_replace to increase
net->ct.label_words at run time, but since 128 bits is not that
big we'd only save sizeof(long) so it doesn't seem worth it for now.
This follows a similar approach that xtables 'connlabel'
match uses, so when user inputs
ct label set bar
then we will set the bit used by the 'bar' label and leave the rest alone.
This is done by passing the sreg content to nf_connlabels_replace
as both value and mask argument.
Labels (bits) already set thus cannot be re-set to zero, but
this is not supported by xtables connlabel match either.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
percpu_counter only have protection against preemption.
TCP stack uses them possibly from BH, so we need BH protection
in contexts that could be run in process context
Fixes: c10d9310ed ("tcp: do not assume TCP code is non preemptible")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__inet_twsk_hashdance() might be called from process context,
better block BH before acquiring bind hash and established locks
Fixes: c10d9310ed ("tcp: do not assume TCP code is non preemptible")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_snd_una_update() and tcp_rcv_nxt_update() call
u64_stats_update_begin() either from process context or BH handler.
This triggers a lockdep splat on 32bit & SMP builds.
We could add u64_stats_update_begin_bh() variant but this would
slow down 32bit builds with useless local_disable_bh() and
local_enable_bh() pairs, since we own the socket lock at this point.
I add sock_owned_by_me() helper to have proper lockdep support
even on 64bit builds, and new u64_stats_update_begin_raw()
and u64_stats_update_end_raw methods.
Fixes: c10d9310ed ("tcp: do not assume TCP code is non preemptible")
Reported-by: Fabio Estevam <festevam@gmail.com>
Diagnosed-by: Francois Romieu <romieu@fr.zoreil.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Fabio Estevam <fabio.estevam@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net): ipsec 2016-05-04
1) The flowcache can hit an OOM condition if too
many entries are in the gc_list. Fix this by
counting the entries in the gc_list and refuse
new allocations if the value is too high.
2) The inner headers are invalid after a xfrm transformation,
so reset the skb encapsulation field to ensure nobody tries
access the inner headers. Otherwise tunnel devices stacked
on top of xfrm may build the outer headers based on wrong
informations.
3) Add pmtu handling to vti, we need it to report
pmtu informations for local generated packets.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
- two changes to the MAINTAINERS file where one marks our mailing list
as moderated and the other adds a missing documentation file
- kernel-doc fixes
- code refactoring and various cleanups
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJXKRJdAAoJEJ4aZjxxc6bKSVEP/1Ky6O7+oanpsjjwUiZDMj0W
KPtoPQ/VsJxu51e0OYi78jHtGned7xV+FLyFx1k8BwLPThYtd8ysDVqMqFAmXsRh
JPOT+7Y+lf8/FBUYdKyJcsaoqAeRPXnY+p0vE7woLaxk+GOiWpOIip73nisgu9gy
NxfmgJ77WjEV2v6IiD4djfYmZqOOvCF6IGWkubtc0WZdg5ma/2u7vYEDBy3yjN/b
og/5joT3GZC8K6X8BabxNSLDER+qs489a6rOUGoRK4NCU3LhELAywuAws30nPrB/
vFJ6BvEEzkaGcXJViSFelb9zsi4ngwvY9OPQnFCmOicDzJN7jqdV6yXcnSLurph1
sDR+1+k1f63czCJpG8Uhj+8SaQO7P8T9A5nL1UKwhdCOENCuj8Vtp5y4S2A3bOSe
jEv1dy9FC3yaPvtkyUN+wOuDerPoJr5pFuVRz2RGyeFSMxs+RPBLYf/D0+x1om9A
Vz63ecsygk7S7qGNXHUbQvX5Q5Kv5f4y4XjvmrH3rBq+T/WC6V5vbkTo8L4CapX9
KNffNGl1RWqHz/TVLSrQmlHc9zNM/Rg0am2MIxplGfP0rQSUNob/qjD50KPJSLF/
M8tmOBSCNAxzlfAwcn+VJLq+xt6Mr2mkhwZZPYGQPno8JJqCMq52k4w1AQvbv+eI
sxgFGvTq1WACUDx03vyx
=qoV3
-----END PGP SIGNATURE-----
Merge tag 'batman-adv-for-davem' of git://git.open-mesh.org/linux-merge
Antonio Quartulli says:
====================
pull request: batman-adv 20160504
In this pull request you have:
- two changes to the MAINTAINERS file where one marks our mailing list
as moderated and the other adds a missing documentation file
- kernel-doc fixes
- code refactoring and various cleanups
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The stack object “map” has a total size of 32 bytes. Its last 4
bytes are padding generated by compiler. These padding bytes are
not initialized and sent out via “nla_put”.
Signed-off-by: Kangjie Lu <kjlu@gatech.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
The stack object “info” has a total size of 12 bytes. Its last byte
is padding which is not initialized and leaked via “put_cmsg”.
Signed-off-by: Kangjie Lu <kjlu@gatech.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
previous patches removed all direct accesses to dev->trans_start,
so change the netif_trans_update helper to update trans_start of
netdev queue 0 instead and then remove trans_start from struct net_device.
AFAICS a lot of the netif_trans_update() invocations are now useless
because they occur in ndo_start_xmit and driver doesn't set LLTX
(i.e. stack already took care of the update).
As I can't test any of them it seems better to just leave them alone.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ipv6 gre implementation was cleaned up to share more code
with the ipv4 version, but it can be enabled even when NET_IPGRE_DEMUX
is disabled, resulting in a link error:
net/built-in.o: In function `gre_rcv':
:(.text+0x17f5d0): undefined reference to `gre_parse_header'
ERROR: "gre_parse_header" [net/ipv6/ip6_gre.ko] undefined!
This adds a Kconfig dependency to prevent that now invalid
configuration.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 308edfdf15 ("gre6: Cleanup GREv6 receive path, call common GRE functions")
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For ipgre interfaces in collect metadata mode, receive also traffic with
encapsulated Ethernet headers. The lwtunnel users are supposed to sort this
out correctly. This allows to have mixed Ethernet + L3-only traffic on the
same lwtunnel interface. This is the same way as VXLAN-GPE behaves.
To keep backwards compatibility and prevent any surprises, gretap interfaces
have priority in receiving packets with Ethernet headers.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This will allow to make the pull dependent on the tunnel type.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The call to gre_parse_header is either followed by iptunnel_pull_header, or
in the case of ICMP error path, the actual header is not accessed at all.
In the first case, iptunnel_pull_header will call pskb_may_pull anyway and
it's pointless to do it twice. The only difference is what call will fail
with what error code but the net effect is still the same in all call sites.
In the second case, pskb_may_pull is pointless, as skb->data is at the outer
IP header and not at the GRE header.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change makes it so that we will strip the TSO_MANGLEID bit if TSO is
not present. This way we will also handle ECN correctly of TSO is not
present.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch addresses a possible issue that can occur if we get into any odd
corner cases where we support TSO for a given protocol but not the checksum
or scatter-gather offload. There are few drivers floating around that
setup their tunnels this way and by enforcing the checksum piece we can
avoid mangling any frames.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the event that the number of partial segments is equal to 1 we don't
really need to perform partial segmentation offload. As such we should
skip multiplying the MSS and instead just clear the partial_segs value
since it will not provide any gain to advertise the frame as being GSO when
it is a single frame.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's easier for gre_parse_header to return the header length instead of
filing it into a parameter. That way, the callers that don't care about the
header length can just check whether the returned value is lower than zero.
In gre_err, the tunnel header must not be pulled. See commit b7f8fe251e
("gre: do not pull header in ICMP error processing") for details.
This patch reduces the conflict between the mentioned commit and commit
95f5c64c3c ("gre: Move utility functions to common headers").
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Under high rx pressure, it is possible tcp_sendmsg() never has a
chance to allocate an skb and loop forever as sk_flush_backlog()
would always return true.
Fix this by calling sk_flush_backlog() only if one skb had been
allocated and filled before last backlog check.
Fixes: d41a69f1d3 ("tcp: make tcp_sendmsg() aware of socket backlog")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
net/ipv4/ip_gre.c
Minor conflicts between tunnel bug fixes in net and
ipv6 tunnel cleanups in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
For test purpose, provide the generic nci loopback function.
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
According to NCI specification, destination type and destination
specific parameters shall uniquely identify a single destination
for the Logical Connection.
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
nci_core_conn_close was not retrieving a conn_info using the correct
connection id.
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
NCI_CORE_CONN_CREATE may not have any destination type parameter.
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
The handler 'ila_fill_encap_info' adds one attribute: ILA_ATTR_LOCATOR.
Fixes: 65d7ab8de5 ("net: Identifier Locator Addressing module")
CC: Tom Herbert <tom@herbertland.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the sendmsg function of UDP, raw, ICMP and l2tp sockets, we use local
variables like hlimits, tclass, opt and dontfrag and pass them to corresponding
functions like ip6_make_skb, ip6_append_data and xxx_push_pending_frames.
This is not a good practice and makes it hard to add new parameters.
This fix introduces a new struct ipcm6_cookie similar to ipcm_cookie in
ipv4 and include the above mentioned variables. And we only pass the
pointer to this structure to corresponding functions. This makes it easier
to add new parameters in the future and makes the function cleaner.
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
An arbitration scheme for duelling SYNs is implemented as part of
commit 241b271952 ("RDS-TCP: Reset tcp callbacks if re-using an
outgoing socket in rds_tcp_accept_one()") which ensures that both nodes
involved will arrive at the same arbitration decision. However, this
needs to be synchronized with an outgoing SYN to be generated by
rds_tcp_conn_connect(). This commit achieves the synchronization
through the t_conn_lock mutex in struct rds_tcp_connection.
The rds_conn_state is checked in rds_tcp_conn_connect() after acquiring
the t_conn_lock mutex. A SYN is sent out only if the RDS connection is
not already UP (an UP would indicate that rds_tcp_accept_one() has
completed 3WH, so no SYN needs to be generated).
Similarly, the rds_conn_state is checked in rds_tcp_accept_one() after
acquiring the t_conn_lock mutex. The only acceptable states (to
allow continuation of the arbitration logic) are UP (i.e., outgoing SYN
was SYN-ACKed by peer after it sent us the SYN) or CONNECTING (we sent
outgoing SYN before we saw incoming SYN).
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a race condition between rds_send_xmit -> rds_tcp_xmit
and the code that deals with resolution of duelling syns added
by commit 241b271952 ("RDS-TCP: Reset tcp callbacks if re-using an
outgoing socket in rds_tcp_accept_one()").
Specifically, we may end up derefencing a null pointer in rds_send_xmit
if we have the interleaving sequence:
rds_tcp_accept_one rds_send_xmit
conn is RDS_CONN_UP, so
invoke rds_tcp_xmit
tc = conn->c_transport_data
rds_tcp_restore_callbacks
/* reset t_sock */
null ptr deref from tc->t_sock
The race condition can be avoided without adding the overhead of
additional locking in the xmit path: have rds_tcp_accept_one wait
for rds_tcp_xmit threads to complete before resetting callbacks.
The synchronization can be done in the same manner as rds_conn_shutdown().
First set the rds_conn_state to something other than RDS_CONN_UP
(so that new threads cannot get into rds_tcp_xmit()), then wait for
RDS_IN_XMIT to be cleared in the conn->c_flags indicating that any
threads in rds_tcp_xmit are done.
Fixes: 241b271952 ("RDS-TCP: Reset tcp callbacks if re-using an
outgoing socket in rds_tcp_accept_one()")
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hosts sending lot of ACK packets exhibit high sock_wfree() cost
because of cache line miss to test SOCK_USE_WRITE_QUEUE
We could move this flag close to sk_wmem_alloc but it is better
to perform the atomic_sub_and_test() on a clean cache line,
as it avoid one extra bus transaction.
skb_orphan_partial() can also have a fast track for packets that either
are TCP acks, or already went through another skb_orphan_partial()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the case of the mlx4 and mlx5 driver they do not support IPv6 checksum
offload for tunnels. With this being the case we should disable GSO in
addition to the checksum offload features when we find that a device cannot
perform a checksum on a given packet type.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are two flow control mechanisms in TIPC; one at link level that
handles network congestion, burst control, and retransmission, and one
at connection level which' only remaining task is to prevent overflow
in the receiving socket buffer. In TIPC, the latter task has to be
solved end-to-end because messages can not be thrown away once they
have been accepted and delivered upwards from the link layer, i.e, we
can never permit the receive buffer to overflow.
Currently, this algorithm is message based. A counter in the receiving
socket keeps track of number of consumed messages, and sends a dedicated
acknowledge message back to the sender for each 256 consumed message.
A counter at the sending end keeps track of the sent, not yet
acknowledged messages, and blocks the sender if this number ever reaches
512 unacknowledged messages. When the missing acknowledge arrives, the
socket is then woken up for renewed transmission. This works well for
keeping the message flow running, as it almost never happens that a
sender socket is blocked this way.
A problem with the current mechanism is that it potentially is very
memory consuming. Since we don't distinguish between small and large
messages, we have to dimension the socket receive buffer according
to a worst-case of both. I.e., the window size must be chosen large
enough to sustain a reasonable throughput even for the smallest
messages, while we must still consider a scenario where all messages
are of maximum size. Hence, the current fix window size of 512 messages
and a maximum message size of 66k results in a receive buffer of 66 MB
when truesize(66k) = 131k is taken into account. It is possible to do
much better.
This commit introduces an algorithm where we instead use 1024-byte
blocks as base unit. This unit, always rounded upwards from the
actual message size, is used when we advertise windows as well as when
we count and acknowledge transmitted data. The advertised window is
based on the configured receive buffer size in such a way that even
the worst-case truesize/msgsize ratio always is covered. Since the
smallest possible message size (from a flow control viewpoint) now is
1024 bytes, we can safely assume this ratio to be less than four, which
is the value we are now using.
This way, we have been able to reduce the default receive buffer size
from 66 MB to 2 MB with maintained performance.
In order to keep this solution backwards compatible, we introduce a
new capability bit in the discovery protocol, and use this throughout
the message sending/reception path to always select the right unit.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
During neighbor discovery, nodes advertise their capabilities as a bit
map in a dedicated 16-bit field in the discovery message header. This
bit map has so far only be stored in the node structure on the peer
nodes, but we now see the need to keep a copy even in the socket
structure.
This commit adds this functionality.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the refactoring commit d570d86497 ("tipc: enqueue arrived buffers
in socket in separate function") we did by accident replace the test
if (sk->sk_backlog.len == 0)
atomic_set(&tsk->dupl_rcvcnt, 0);
with
if (sk->sk_backlog.len)
atomic_set(&tsk->dupl_rcvcnt, 0);
This effectively disables the compensation we have for the double
receive buffer accounting that occurs temporarily when buffers are
moved from the backlog to the socket receive queue. Until now, this
has gone unnoticed because of the large receive buffer limits we are
applying, but becomes indispensable when we reduce this buffer limit
later in this series.
We now fix this by inverting the mentioned condition.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
batadv_iv_ogm_orig_del_if handles two different buffers bcast_own and
bcast_own_sum which should be resized. The error handling two for
allocating these buffers causes the complexity of this function. This can
be avoided completely when the function is split into a main function
handling the locking, freeing and call of the subfunctions.
The subfunction can then independently handle the resize of the buffers.
This also allows to easily reuse the old buffer (which always is larger) in
case a smaller buffer could not be allocated without increasing the code
complexity.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Since batadv_v_ogm_orig_update() was only called from one place and the
calling function became very short, merge these two functions together.
This should also reflect the protocol description of B.A.T.M.A.N. V
better.
Signed-off-by: Simon Wunderlich <simon@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
To match our code better to the protocol description of B.A.T.M.A.N. V,
move batadv_v_ogm_forward() out into batadv_v_ogm_process_per_outif()
and move all checks directly deciding whether the OGM should be
forwarded into batadv_v_ogm_forward().
Signed-off-by: Simon Wunderlich <simon@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Structure initialization within the macros should follow the general
coding style used in the kernel: put the initialization of the first
variable and the closing brace on a separate line.
Reported-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Simon Wunderlich <simon.wunderlich@open-mesh.com>
[sven@narfation.org: fix conflicts with current version]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Some really long function names in batman-adv require a newline between
return type and the function name. This has lead to some lines starting
with *batadv_...
This * belongs to the return type and thus should be on the same line as
the return type.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
checkpatch.pl warns about the use of 'unsigned' as a short form for
'unsigned int'.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Use to_delayed_work() instead of open-coding it.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Reviewed-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Use list_for_each_entry_safe() instead of list_for_each_safe() to
simplify the code.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Acked-by: Antonio Quartulli <a@unstable.cc>
Reviewed-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Use a static string when showing table headers rather then
a nonsense parametric one with fixed arguments.
It is easier to grep and it does not need to be recomputed
at runtime each time.
Reported-by: Joe Perches <joe@perches.com>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
[sven@narfation.org: fix conflicts with current version]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
The vsock_transport structure is never modified, so declare it as const.
Done with the help of Coccinelle.
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
In presence of inelastic flows and stress, we can call
fq_codel_drop() for every packet entering fq_codel qdisc.
fq_codel_drop() is quite expensive, as it does a linear scan
of 4 KB of memory to find a fat flow.
Once found, it drops the oldest packet of this flow.
Instead of dropping a single packet, try to drop 50% of the backlog
of this fat flow, with a configurable limit of 64 packets per round.
TCA_FQ_CODEL_DROP_BATCH_SIZE is the new attribute to make this
limit configurable.
With this strategy the 4 KB search is amortized to a single cache line
per drop [1], so fq_codel_drop() no longer appears at the top of kernel
profile in presence of few inelastic flows.
[1] Assuming a 64byte cache line, and 1024 buckets
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dave Taht <dave.taht@gmail.com>
Cc: Jonathan Morton <chromatix99@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Dave Taht
Signed-off-by: David S. Miller <davem@davemloft.net>
This was recently reported to me, and reproduced on the latest net kernel,
when attempting to run netperf from a host that had a netem qdisc attached
to the egress interface:
[ 788.073771] ---------------------[ cut here ]---------------------------
[ 788.096716] WARNING: at net/core/dev.c:2253 skb_warn_bad_offload+0xcd/0xda()
[ 788.129521] bnx2: caps=(0x00000001801949b3, 0x0000000000000000) len=2962
data_len=0 gso_size=1448 gso_type=1 ip_summed=3
[ 788.182150] Modules linked in: sch_netem kvm_amd kvm crc32_pclmul ipmi_ssif
ghash_clmulni_intel sp5100_tco amd64_edac_mod aesni_intel lrw gf128mul
glue_helper ablk_helper edac_mce_amd cryptd pcspkr sg edac_core hpilo ipmi_si
i2c_piix4 k10temp fam15h_power hpwdt ipmi_msghandler shpchp acpi_power_meter
pcc_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c
sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt
i2c_algo_bit drm_kms_helper ahci ata_generic pata_acpi ttm libahci
crct10dif_pclmul pata_atiixp tg3 libata crct10dif_common drm crc32c_intel ptp
serio_raw bnx2 r8169 hpsa pps_core i2c_core mii dm_mirror dm_region_hash dm_log
dm_mod
[ 788.465294] CPU: 16 PID: 0 Comm: swapper/16 Tainted: G W
------------ 3.10.0-327.el7.x86_64 #1
[ 788.511521] Hardware name: HP ProLiant DL385p Gen8, BIOS A28 12/17/2012
[ 788.542260] ffff880437c036b8 f7afc56532a53db9 ffff880437c03670
ffffffff816351f1
[ 788.576332] ffff880437c036a8 ffffffff8107b200 ffff880633e74200
ffff880231674000
[ 788.611943] 0000000000000001 0000000000000003 0000000000000000
ffff880437c03710
[ 788.647241] Call Trace:
[ 788.658817] <IRQ> [<ffffffff816351f1>] dump_stack+0x19/0x1b
[ 788.686193] [<ffffffff8107b200>] warn_slowpath_common+0x70/0xb0
[ 788.713803] [<ffffffff8107b29c>] warn_slowpath_fmt+0x5c/0x80
[ 788.741314] [<ffffffff812f92f3>] ? ___ratelimit+0x93/0x100
[ 788.767018] [<ffffffff81637f49>] skb_warn_bad_offload+0xcd/0xda
[ 788.796117] [<ffffffff8152950c>] skb_checksum_help+0x17c/0x190
[ 788.823392] [<ffffffffa01463a1>] netem_enqueue+0x741/0x7c0 [sch_netem]
[ 788.854487] [<ffffffff8152cb58>] dev_queue_xmit+0x2a8/0x570
[ 788.880870] [<ffffffff8156ae1d>] ip_finish_output+0x53d/0x7d0
...
The problem occurs because netem is not prepared to handle GSO packets (as it
uses skb_checksum_help in its enqueue path, which cannot manipulate these
frames).
The solution I think is to simply segment the skb in a simmilar fashion to the
way we do in __dev_queue_xmit (via validate_xmit_skb), with some minor changes.
When we decide to corrupt an skb, if the frame is GSO, we segment it, corrupt
the first segment, and enqueue the remaining ones.
tested successfully by myself on the latest net kernel, to which this applies
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Jamal Hadi Salim <jhs@mojatatu.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: netem@lists.linux-foundation.org
CC: eric.dumazet@gmail.com
CC: stephen@networkplumber.org
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Locally generated TCP GSO packets having to go through a GRE/SIT/IPIP
tunnel have to go through an expensive skb_unclone()
Reallocating skb->head is a lot of work.
Test should really check if a 'real clone' of the packet was done.
TCP does not care if the original gso_type is changed while the packet
travels in the stack.
This adds skb_header_unclone() which is a variant of skb_clone()
using skb_header_cloned() check instead of skb_cloned().
This variant can probably be used from other points.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- a fix for our Distributed ARP Table that makes sure that the input
provided to the hash function during a query is the same as the one
provided during an insert (so to prevent false negatives), by Antonio
Quartulli
- a fix for our new protocol implementation B.A.T.M.A.N. V that ensures
that a hard interface is properly re-activated when it is brought down
and then up again, by Antonio Quartulli
- two fixes respectively to the reference counting of the tt_local_entry
and neigh_node objects, by Sven Eckelmann. Such bug is rather severe
as it would prevent the netdev objects references by batman-adv from
being released after shutdown.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJXJOPeAAoJEJ4aZjxxc6bKFaoP/jsY4MelcsGUGQhjfEfm/gbo
H7Xd5TUydiq9tfbIGwAjbS4Ti+e69ROolyiNQrvxc5PcJFhpQlsSe17+o0NdnCOE
tBLjuCLjpnj+FzghnQYb54Qb2CEllsSLJcOLh+CnbFiHos+pvhxA/NeC6FufyPMF
Zrdsjf/v4rzghWQerToKEcIgCXcRE3Zo2txunUnFXSzQGai4AJnljD1Hk1YcbQdn
O8+6lXYN+j4Swo6yrPB0URzJRIWdjoQ1OfdvggCDTuMW664jyv9gZmsF/fzL2ksj
SGldxkFOX+4x8NenRxs5OFMXHHAJGu8kU8uoXmOCuv6b59F2KWi27rP1MJqxYDcB
pTpq4nAx3IooNSSvpU97SFW3WBQgIsNHMFZwZbGkxqXP1UhPEoUcsuFTPVj/hqDI
h9xBLK/buNbYnMULTW8hMvxOUHqxjPvr37Vbj1uPdbfmwbrvUvwyMSWFn5k/JmAF
CASMwUC4C7IQtEinVYHmT/+QsPGMcmom1WZ1/OlhlxnmOwAcglI/mZnXl0wD7ptg
3KETNlrsNHC6YuOLKIKI08l3Ke2DOZLHdV5PvHcdPgTy7EYbSvZTaDwK422pSiuy
8kcQjN8g6I81drwJqkEiUkJA6kRxkKbXYxosudbRT07IkzUZo7TPAFv7iMNDSHUW
vuJV/rtYAp3bRDyLrxnb
=ksVs
-----END PGP SIGNATURE-----
Merge tag 'batman-adv-fix-for-davem' of git://git.open-mesh.org/linux-merge
Antonio Quartulli says:
====================
In this small batch of patches you have:
- a fix for our Distributed ARP Table that makes sure that the input
provided to the hash function during a query is the same as the one
provided during an insert (so to prevent false negatives), by Antonio
Quartulli
- a fix for our new protocol implementation B.A.T.M.A.N. V that ensures
that a hard interface is properly re-activated when it is brought down
and then up again, by Antonio Quartulli
- two fixes respectively to the reference counting of the tt_local_entry
and neigh_node objects, by Sven Eckelmann. Such bug is rather severe
as it would prevent the netdev objects references by batman-adv from
being released after shutdown.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new LINK_XSTATS_TYPE_BRIDGE attribute and implement the
RTM_GETSTATS callbacks for IFLA_STATS_LINK_XSTATS (fill_linkxstats and
get_linkxstats_size) in order to export the per-vlan stats.
The paddings were added because soon these fields will be needed for
per-port per-vlan stats (or something else if someone beats me to it) so
avoiding at least a few more netlink attributes.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for per-VLAN Tx/Rx statistics. Every global vlan context gets
allocated a per-cpu stats which is then set in each per-port vlan context
for quick access. The br_allowed_ingress() common function is used to
account for Rx packets and the br_handle_vlan() common function is used
to account for Tx packets. Stats accounting is performed only if the
bridge-wide vlan_stats_enabled option is set either via sysfs or netlink.
A struct hole between vlan_enabled and vlan_proto is used for the new
option so it is in the same cache line. Currently it is binary (on/off)
but it is intentionally restricted to exactly 0 and 1 since other values
will be used in the future for different purposes (e.g. per-port stats).
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add callbacks to calculate the size and fill link extended statistics
which can be split into multiple messages and are dumped via the new
rtnl stats API (RTM_GETSTATS) with the IFLA_STATS_LINK_XSTATS attribute.
Also add that attribute to the idx mask check since it is expected to
be able to save state and resume dumping (e.g. future bridge per-vlan
stats will be dumped via this attribute and callbacks).
Each link type should nest its private attributes under the per-link type
attribute. This allows to have any number of separated private attributes
and to avoid one call to get the dev link type.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The new prividx argument allows the current dumping device to save a
private state counter which would enable it to continue dumping from
where it left off. And the idxattr is used to save the current idx user
so multiple prividx using attributes can be requested at the same time
as suggested by Roopa Prabhu.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Changes in GREv6 transmit path:
- Call gre_checksum, remove gre6_checksum
- Rename ip6gre_xmit2 to __gre6_xmit
- Call gre_build_header utility function
- Call ip6_tnl_xmit common function
- Call ip6_tnl_change_mtu, eliminate ip6gre_tunnel_change_mtu
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A few generic changes to generalize tunnels in IPv6:
- Export ip6_tnl_change_mtu so that it can be called by ip6_gre
- Add tun_hlen to ip6_tnl structure.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create common functions for both IPv4 and IPv6 GRE in transmit. These
are put into gre.h.
Common functions are for:
- GRE checksum calculation. Move gre_checksum to gre.h.
- Building a GRE header. Move GRE build_header and rename
gre_build_header.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch renames ip6_tnl_xmit2 to ip6_tnl_xmit and exports it. Other
users like GRE will be able to call this. The original ip6_tnl_xmit
function is renamed to ip6_tnl_start_xmit (this is an ndo_start_xmit
function).
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Create gre_rcv function. This calls gre_parse_header and ip6gre_rcv.
- Call ip6_tnl_rcv. Doing this and using gre_parse_header eliminates
most of the code in ip6gre_rcv.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Several of the GRE functions defined in net/ipv4/ip_gre.c are usable
for IPv6 GRE implementation (that is they are protocol agnostic).
These include:
- GRE flag handling functions are move to gre.h
- GRE build_header is moved to gre.h and renamed gre_build_header
- parse_gre_header is moved to gre_demux.c and renamed gre_parse_header
- iptunnel_pull_header is taken out of gre_parse_header. This is now
done by caller. The header length is returned from gre_parse_header
in an int* argument.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some basic changes to make IPv6 tunnel receive path look more like
IPv4 path:
- Make ip6_tnl_rcv non-static so that GREv6 and others can call it
- Make ip6_tnl_rcv look like ip_tunnel_rcv
- Switch to gro_cells_receive
- Make ip6_tnl_rcv non-static and export it
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Large sendmsg()/write() hold socket lock for the duration of the call,
unless sk->sk_sndbuf limit is hit. This is bad because incoming packets
are parked into socket backlog for a long time.
Critical decisions like fast retransmit might be delayed.
Receivers have to maintain a big out of order queue with additional cpu
overhead, and also possible stalls in TX once windows are full.
Bidirectional flows are particularly hurt since the backlog can become
quite big if the copy from user space triggers IO (page faults)
Some applications learnt to use sendmsg() (or sendmmsg()) with small
chunks to avoid this issue.
Kernel should know better, right ?
Add a generic sk_flush_backlog() helper and use it right
before a new skb is allocated. Typically we put 64KB of payload
per skb (unless MSG_EOR is requested) and checking socket backlog
every 64KB gives good results.
As a matter of fact, tests with TSO/GSO disabled give very nice
results, as we manage to keep a small write queue and smaller
perceived rtt.
Note that sk_flush_backlog() maintains socket ownership,
so is not equivalent to a {release_sock(sk); lock_sock(sk);},
to ensure implicit atomicity rules that sendmsg() was
giving to (possibly buggy) applications.
In this simple implementation, I chose to not call tcp_release_cb(),
but we might consider this later.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexei Starovoitov <ast@fb.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Socket backlog processing is a major latency source.
With current TCP socket sk_rcvbuf limits, I have sampled __release_sock()
holding cpu for more than 5 ms, and packets being dropped by the NIC
once ring buffer is filled.
All users are now ready to be called from process context,
we can unblock BH and let interrupts be serviced faster.
cond_resched_softirq() could be removed, as it has no more user.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
sctp_inq_push() will soon be called without BH being blocked
when generic socket code flushes the socket backlog.
It is very possible SCTP can be converted to not rely on BH,
but this needs to be done by SCTP experts.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
UDP uses the generic socket backlog code, and this will soon
be changed to not disable BH when protocol is called back.
We need to use appropriate SNMP accessors.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
DCCP uses the generic backlog code, and this will soon
be changed to not disable BH when protocol is called back.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
AFAIK, nothing in current TCP stack absolutely wants BH
being disabled once socket is owned by a thread running in
process context.
As mentioned in my prior patch ("tcp: give prequeue mode some care"),
processing a batch of packets might take time, better not block BH
at all.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We want to to make TCP stack preemptible, as draining prequeue
and backlog queues can take lot of time.
Many SNMP updates were assuming that BH (and preemption) was disabled.
Need to convert some __NET_INC_STATS() calls to NET_INC_STATS()
and some __TCP_INC_STATS() to TCP_INC_STATS()
Before using this_cpu_ptr(net->ipv4.tcp_sk) in tcp_v4_send_reset()
and tcp_v4_send_ack(), we add an explicit preempt disabled section.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since a/b/g/n no longer exist as spec amendements and VHT (ex 802.11ac)
wasn't handled at all, it's better to just remove the amendment strings
to avoid confusion.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Pull networking fixes from David Miller:
1) MODULE_FIRMWARE firmware string not correct for iwlwifi 8000 chips,
from Sara Sharon.
2) Fix SKB size checks in batman-adv stack on receive, from Sven
Eckelmann.
3) Leak fix on mac80211 interface add error paths, from Johannes Berg.
4) Cannot invoke napi_disable() with BH disabled in myri10ge driver,
fix from Stanislaw Gruszka.
5) Fix sign extension problem when computing feature masks in
net_gso_ok(), from Marcelo Ricardo Leitner.
6) lan78xx driver doesn't count packets and packet lengths in its
statistics properly, fix from Woojung Huh.
7) Fix the buffer allocation sizes in pegasus USB driver, from Petko
Manolov.
8) Fix refcount overflows in bpf, from Alexei Starovoitov.
9) Unified dst cache handling introduced a preempt warning in
ip_tunnel, fix by resetting rather then setting the cached route.
From Paolo Abeni.
10) Listener hash collision test fix in soreuseport, from Craig Gallak
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (47 commits)
gre: do not pull header in ICMP error processing
net: Implement net_dbg_ratelimited() for CONFIG_DYNAMIC_DEBUG case
tipc: only process unicast on intended node
cxgb3: fix out of bounds read
net/smscx5xx: use the device tree for mac address
soreuseport: Fix TCP listener hash collision
net: l2tp: fix reversed udp6 checksum flags
ip_tunnel: fix preempt warning in ip tunnel creation/updating
samples/bpf: fix trace_output example
bpf: fix check_map_func_compatibility logic
bpf: fix refcnt overflow
drivers: net: cpsw: use of_phy_connect() in fixed-link case
dt: cpsw: phy-handle, phy_id, and fixed-link are mutually exclusive
drivers: net: cpsw: don't ignore phy-mode if phy-handle is used
drivers: net: cpsw: fix segfault in case of bad phy-handle
drivers: net: cpsw: fix parsing of phy-handle DT property in dual_emac config
MAINTAINERS: net: Change maintainer for GRETH 10/100/1G Ethernet MAC device driver
gre: reject GUE and FOU in collect metadata mode
pegasus: fixes reported packet length
pegasus: fixes URB buffer allocation size;
...
iptunnel_pull_header expects that IP header was already pulled; with this
expectation, it pulls the tunnel header. This is not true in gre_err.
Furthermore, ipv4_update_pmtu and ipv4_redirect expect that skb->data points
to the IP header.
We cannot pull the tunnel header in this path. It's just a matter of not
calling iptunnel_pull_header - we don't need any of its effects.
Fixes: bda7bb4634 ("gre: Allow multiple protocol listener for gre protocol.")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dave Miller pointed out that fb586f2530 ("sctp: delay calls to
sk_data_ready() as much as possible") may insert latency specially if
the receiving application is running on another CPU and that it would be
better if we signalled as early as possible.
This patch thus basically inverts the logic on fb586f2530 and signals
it as early as possible, similar to what we had before.
Fixes: fb586f2530 ("sctp: delay calls to sk_data_ready() as much as possible")
Reported-by: Dave Miller <davem@davemloft.net>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We have observed complete lock up of broadcast-link transmission due to
unacknowledged packets never being removed from the 'transmq' queue. This
is traced to nodes having their ack field set beyond the sequence number
of packets that have actually been transmitted to them.
Consider an example where node 1 has sent 10 packets to node 2 on a
link and node 3 has sent 20 packets to node 2 on another link. We
see examples of an ack from node 2 destined for node 3 being treated as
an ack from node 2 at node 1. This leads to the ack on the node 1 to node
2 link being increased to 20 even though we have only sent 10 packets.
When node 1 does get around to sending further packets, none of the
packets with sequence numbers less than 21 are actually removed from the
transmq.
To resolve this we reinstate some code lost in commit d999297c3d ("tipc:
reduce locking scope during packet reception") which ensures that only
messages destined for the receiving node are processed by that node. This
prevents the sequence numbers from getting out of sync and resolves the
packet leakage, thereby resolving the broadcast-link transmission
lock-ups we observed.
While we are aware that this change only patches over a root problem that
we still haven't identified, this is a sanity test that it is always
legitimate to do. It will remain in the code even after we identify and
fix the real problem.
Reviewed-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
Reviewed-by: John Thompson <john.thompson@alliedtelesis.co.nz>
Signed-off-by: Hamish Martin <hamish.martin@alliedtelesis.co.nz>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we are displaying statistics for the first link established between
two peers, it will always be presented as STANDBY although it in reality
is ACTIVE.
This happens because we forget to set the 'active' flag in the link
instance at the moment it is established. Although this is a bug, it only
has impact on the presentation view of the link, not on its actual
functionality.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I forgot to include a check for listener port equality when deciding
if two sockets should belong to the same reuseport group. This was
not caught previously because it's only necessary when two listening
sockets for the same user happen to hash to the same listener bucket.
The same error does not exist in the UDP path.
Fixes: c125e80b8868("soreuseport: fast reuseport TCP socket selection")
Signed-off-by: Craig Gallek <kraig@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes a bug which causes the behavior of whether to ignore
udp6 checksum of udp6 encapsulated l2tp tunnel contrary to what
userspace program requests.
When the flag `L2TP_ATTR_UDP_ZERO_CSUM6_RX` is set by userspace, it is
expected that udp6 checksums of received packets of the l2tp tunnel
to create should be ignored. In `l2tp_netlink.c`:
`l2tp_nl_cmd_tunnel_create()`, `cfg.udp6_zero_rx_checksums` is set
according to the flag, and then passed to `l2tp_core.c`:
`l2tp_tunnel_create()` and then `l2tp_tunnel_sock_create()`. In
`l2tp_tunnel_sock_create()`, `udp_conf.use_udp6_rx_checksums` is set
the same to `cfg.udp6_zero_rx_checksums`. However, if we want the
checksum to be ignored, `udp_conf.use_udp6_rx_checksums` should be set
to `false`, i.e. be set to the contrary. Similarly, the same should be
done to `udp_conf.use_udp6_tx_checksums`.
Signed-off-by: Miao Wang <shankerwangmiao@gmail.com>
Acked-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace ASYNC_INITIALIZED bit in the tty_port::flags field with
TTY_PORT_INITIALIZED bit in the tty_port::iflags field. Introduce helpers
tty_port_set_initialized() and tty_port_initialized() to abstract
atomic bit ops.
Note: the transforms for test_and_set_bit() and test_and_clear_bit()
are unnecessary as the state transitions are already mutually exclusive;
the tty lock prevents concurrent open/close/hangup.
Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Replace ASYNC_CHECK_CD bit in the tty_port::flags field with
TTY_PORT_CHECK_CD bit in the tty_port::iflags field. Introduce helpers
tty_port_set_check_carrier() and tty_port_check_carrier() to abstract
the atomic bit ops.
Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Replace ASYNC_NORMAL_ACTIVE bit in the tty_port::flags field with
TTY_PORT_ACTIVE bit in the tty_port::iflags field. Introduce helpers
tty_port_set_active() and tty_port_active() to abstract atomic bit ops.
Extract state changes from port lock sections, as this usage is
broken and confused; the state transitions are protected by the
tty lock (which mutually excludes parallel open/close/hangup),
and no user tests the active state while holding the port lock.
Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Replace ASYNC_CTS_FLOW bit in the tty_port::flags field with
TTY_PORT_CTS_FLOW bit in the tty_port::iflags field. Add
tty_port_set_cts_flow() helper to abstract the atomic bit ops.
Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Abstract TTY_THROTTLED bit tests with tty_throttled().
Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Abstract TTY_IO_ERROR status test treewide with tty_io_error().
NB: tty->flags uses atomic bit ops; replace non-atomic bit test
with test_bit().
Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
is_skb_forwardable is not supposed to change anything so constify its
arguments
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After the commit e09acddf87 ("ip_tunnel: replace dst_cache with generic
implementation"), a preemption debug warning is triggered on ip4
tunnels updating; the dst cache helper needs to be invoked in unpreemptible
context.
We don't need to load the cache on tunnel update, so this commit fixes
the warning replacing the load with a dst cache reset, which is
preempt safe.
Fixes: e09acddf87 ("ip_tunnel: replace dst_cache with generic implementation")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Workqueue maybe still in running while we destroy the IDLETIMER target,
thus cause a use after free error, add cancel_work_sync() to avoid such
situation.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The batadv_neigh_node was specific to a batadv_hardif_neigh_node and held
an implicit reference to it. But this reference was never stored in form of
a pointer in the batadv_neigh_node itself. Instead
batadv_neigh_node_release depends on a consistent state of
hard_iface->neigh_list and that batadv_hardif_neigh_get always returns the
batadv_hardif_neigh_node object which it has a reference for. But
batadv_hardif_neigh_get cannot guarantee that because it is working only
with rcu_read_lock on this list. It can therefore happen that a neigh_addr
is in this list twice or that batadv_hardif_neigh_get cannot find the
batadv_hardif_neigh_node for an neigh_addr due to some other list
operations taking place at the same time.
Instead add a batadv_hardif_neigh_node pointer directly in
batadv_neigh_node which will be used for the reference counter decremented
on release of batadv_neigh_node.
Fixes: cef63419f7 ("batman-adv: add list of unique single hop neighbors per hard-interface")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The batadv_tt_local_entry was specific to a batadv_softif_vlan and held an
implicit reference to it. But this reference was never stored in form of a
pointer in the tt_local_entry itself. Instead batadv_tt_local_remove,
batadv_tt_local_table_free and batadv_tt_local_purge_pending_clients depend
on a consistent state of bat_priv->softif_vlan_list and that
batadv_softif_vlan_get always returns the batadv_softif_vlan object which
it has a reference for. But batadv_softif_vlan_get cannot guarantee that
because it is working only with rcu_read_lock on this list. It can
therefore happen that an vid is in this list twice or that
batadv_softif_vlan_get cannot find the batadv_softif_vlan for an vid due to
some other list operations taking place at the same time.
Instead add a batadv_softif_vlan pointer directly in batadv_tt_local_entry
which will be used for the reference counter decremented on release of
batadv_tt_local_entry.
Fixes: 35df3b298f ("batman-adv: fix TT VLAN inconsistency on VLAN re-add")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Acked-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
At the moment there is no explicit reactivation of an hard-interface
upon NETDEV_UP event. In case of B.A.T.M.A.N. IV the interface is
reactivated as soon as the next OGM is scheduled for sending, but this
mechanism does not work with B.A.T.M.A.N. V. The latter does not rely
on the same scheduling mechanism as its predecessor and for this reason
the hard-interface remains deactivated forever after being brought down
once.
This patch fixes the reactivation mechanism by adding a new routing API
which explicitly allows each algorithm to perform any needed operation
upon interface re-activation.
Such API is optional and is implemented by B.A.T.M.A.N. V only and it
just takes care of setting the iface status to ACTIVE
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Now that DAT is VLAN aware, it must use the VID when
computing the DHT address of the candidate nodes where
an entry is going to be stored/retrieved.
Fixes: be1db4f661 ("batman-adv: make the Distributed ARP Table vlan aware")
Signed-off-by: Antonio Quartulli <a@unstable.cc>
[sven@narfation.org: fix conflicts with current version]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Else we get 'BUG: spinlock bad magic on CPU#' on resize when
spin lock debugging is enabled.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This is a forward-port of the original patch from Andrzej Hajda,
he said:
"IS_ERR_VALUE should be used only with unsigned long type.
Otherwise it can work incorrectly. To achieve this function
xt_percpu_counter_alloc is modified to return unsigned long,
and its result is assigned to temporary variable to perform
error checking, before assigning to .pcnt field.
The patch follows conclusion from discussion on LKML [1][2].
[1]: http://permalink.gmane.org/gmane.linux.kernel/2120927
[2]: http://permalink.gmane.org/gmane.linux.kernel/2150581"
Original patch from Andrzej is here:
http://patchwork.ozlabs.org/patch/582970/
This patch has clashed with input validation fixes for x_tables.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pull Ceph fixes from Sage Weil:
"There is a lifecycle fix in the auth code, a fix for a narrow race
condition on map, and a helpful message in the log when there is a
feature mismatch (which happens frequently now that the default
server-side options have changed)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
rbd: report unsupported features to syslog
rbd: fix rbd map vs notify races
libceph: make authorizer destruction independent of ceph_auth_client
This patch overloads the DSA master netdev, aka CPU Ethernet MAC to also
include switch-side statistics, which is useful for debugging purposes,
when the switch is not properly connected to the Ethernet MAC (duplex
mismatch, (RG)MII electrical issues etc.).
We accomplish this by retaining the original copy of the master netdev's
ethtool_ops, and just overload the 3 operations we care about:
get_sset_count, get_strings and get_ethtool_stats so as to intercept
these calls and call into the original master_netdev ethtool_ops, plus
our own.
We take this approach as opposed to providing a set of DSA helper
functions that would retrive the CPU port's statistics, because the
entire purpose of DSA is to allow unmodified Ethernet MAC drivers to be
used as CPU conduit interfaces, therefore, statistics overlay in such
drivers would simply not scale.
The new ethtool -S <iface> output would therefore look like this now:
<iface> statistics
p<2 digits cpu port number>_<switch MIB counter names>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP prequeue goal is to defer processing of incoming packets
to user space thread currently blocked in a recvmsg() system call.
Intent is to spend less time processing these packets on behalf
of softirq handler, as softirq handler is unfair to normal process
scheduler decisions, as it might interrupt threads that do not
even use networking.
Current prequeue implementation has following issues :
1) It only checks size of the prequeue against sk_rcvbuf
It was fine 15 years ago when sk_rcvbuf was in the 64KB vicinity.
But we now have ~8MB values to cope with modern networking needs.
We have to add sk_rmem_alloc in the equation, since out of order
packets can definitely use up to sk_rcvbuf memory themselves.
2) Even with a fixed memory truesize check, prequeue can be filled
by thousands of packets. When prequeue needs to be flushed, either
from sofirq context (in tcp_prequeue() or timer code), or process
context (in tcp_prequeue_process()), this adds a latency spike
which is often not desirable.
I added a fixed limit of 32 packets, as this translated to a max
flush time of 60 us on my test hosts.
Also note that all packets in prequeue are not accounted for tcp_mem,
since they are not charged against sk_forward_alloc at this point.
This is probably not a big deal.
Note that this might increase LINUX_MIB_TCPPREQUEUEDROPPED counts,
which is misnamed, as packets are not dropped at all, but rather pushed
to the stack (where they can be either consumed or dropped)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The collect metadata mode does not support GUE nor FOU. This might be
implemented later; until then, we should reject such config.
I think this is okay to be changed. It's unlikely anyone has such
configuration (as it doesn't work anyway) and we may need a way to
distinguish whether it's supported or not by the kernel later.
For backwards compatibility with iproute2, it's not possible to just check
the attribute presence (iproute2 always includes the attribute), the actual
value has to be checked, too.
Fixes: 2e15ea390e ("ip_gre: Add support to collect tunnel metadata.")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In ipgre (i.e. not gretap) + collect metadata mode, the skb was assumed to
contain Ethernet header and was encapsulated as ETH_P_TEB. This is not the
case, the interface is ARPHRD_IPGRE and the protocol to be used for
encapsulation is skb->protocol.
Fixes: 2e15ea390e ("ip_gre: Add support to collect tunnel metadata.")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In ipgre mode (i.e. not gretap) with collect metadata flag set, the tunnel
is incorrectly assumed to be mGRE in NBMA mode (see commit 6a5f44d7a0).
This is not the case, we're controlling the encapsulation addresses by
lwtunnel metadata. And anyway, assigning dev->header_ops in collect metadata
mode does not make sense.
Although it would be more user firendly to reject requests that specify
both the collect metadata flag and a remote/local IP address, this would
break current users of gretap or introduce ugly code and differences in
handling ipgre and gretap configuration. Keep the current behavior of
remote/local IP address being ignored in such case.
v3: Back to v1, added explanation paragraph.
v2: Reject configuration specifying both remote/local address and collect
metadata flag.
Fixes: 2e15ea390e ("ip_gre: Add support to collect tunnel metadata.")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(root user triggerable) error case.
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJXIH65AAoJEGt7eEactAAdZmwP/R2UAHltBlYhCEMqcM+8VhPD
VDB3LFTYhOVUtVfFwqAzEoxPDjnGyGgZcjO5RxyCZLokm71KbbHAp3h3GnCVQCHd
dnRej6RD+Kl6n0EoTPCLy7ZAjSjpGBWOTy6MEgrAQnTtL+Q7nUch+z5DXIafTg/w
MOYke/WfD1jHbq2eGHu6HkbY3IUwoSKaEoA8qN20ieJRU7jsaG29RiAvBot2IVTI
g3hTL4FPzwSL5XM0qkoxDLPYA5Mo36Cb5sZ9AjkQCaqP/EemOoFxILGWUyi+17nd
zdF3zZB9lj+CdR+0IbjTjz8b457u1g/JW4dLl+iRqv7clynm3gmz7LivVhBcHogx
usg0hW9tDeZ5wzHj8v+e+C+RqyxtgHxVvYtt8Jh6bTqS8aMO8hor7qPFOcpJPmyz
ZbXThJnsvfaYoWAcvIXUa3Q2kwz2myVLDhlQBgwSi5TzgTDqb2GlYnGhvKOnB5cz
6JL3mZt2vi0Yvb7Lk9YzeEYs5cZq4DFVfx3nHgaVwDZ5GPoUTgIjgSnVLFC/J80a
r09Wmigtjy5qftw6w4pSUcf/Fj4L0BD+GhAZ+Hs5xhjUnlAxTlHxDW0bJ/kMmFzy
9B91YSswBc+3IGdjnsN+bNZ6T0XuvapOLRkC1V8fJDrtyy1Tel/LL9Giy/V13XLr
8vgETcgFJyQ3jnkzyRdr
=EWBn
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2016-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
Just a single fix, for a per-CPU memory leak in a
(root user triggerable) error case.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This is never called with a NULL "buf" and anyway, we dereference 's' on
the lines before so it would Oops before we reach the check.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1) check skb size to avoid reading beyond its border when delivering
payloads, by Sven Eckelmann
2) initialize last_seen time in neigh_node object to prevent cleanup
routine from accidentally purge it, by Marek Lindner
3) release "recently added" slave interfaces upon virtual/batman
interface shutdown, by Sven Eckelmann
4) properly decrease router object reference counter upon routing table
update, by Sven Eckelmann
5) release queue slots when purging OGM packets of deactivating slave
interface, by Linus Lüssing
Patch 2 and 3 have no "Fixes:" tag because the offending commits date
back to when batman-adv was not yet officially in the net tree.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJXHt+MAAoJEJ4aZjxxc6bKMEgP/1DZWgQpHs5IM8yW7IQx8CQO
iMkpwfnJcRSOnADC/Z2GtIcz1Df2r+NZcqf5xMMF2CL0xlks024qTHoqeV7Poyel
DmzzETbQFWgdFD22RI70h25T4Yb400PP0saL2TbcVec6CiM57YN3cPbhjZvqzN32
bCIa38kwAGXvNqRzcy5WjDF/rllAoJZ0s055z+kY8WuVOmvOEor+FDmWFr0D8ioP
/utVP9ACA3YHZ39DMDFDsyBp6nMZOgHjpJVfmcubFULHmKvYQ0zMpgX19IVoMsJ6
HEtz9fKN4KPgAFbbPcU0GLg4srsNFmEbTB7Bqhqods+ZYN60M4Z0kexqYz1XuItH
atISvCIe14xHdT6gW32N707yK30DxUKIEpEg5wMXhE+1m041NfrfrcvaEXSLco6d
txsQzd1R4T5ry3V1YXv4znSVPmHvd84ykKrklQZgPA09QIPCCDb7Olp8Lj6mMsmc
OuEYLOfAoOD/KZcRUzY6kWpMRfOJLLXUgwcfSEES8MCaaBGD91YZyrSuHvixmo5V
24JTp0D/X/rkkQjI3a2Pf0dhvdGHAk1g6mElddo86a0UpRbm3qshquAPf+U8QcU0
Kt4rpN9dtOA8yTpnvxG2r04T32yzQQQNIqRGEnugokaJUECFF0mugxENTqcw2vux
uNxMhl36A21czA9s/Iu7
=o059
-----END PGP SIGNATURE-----
Merge tag 'batman-adv-fix-for-davem' of git://git.open-mesh.org/linux-merge
Antonio Quartulli says:
====================
In this patchset you can find the following fixes:
1) check skb size to avoid reading beyond its border when delivering
payloads, by Sven Eckelmann
2) initialize last_seen time in neigh_node object to prevent cleanup
routine from accidentally purge it, by Marek Lindner
3) release "recently added" slave interfaces upon virtual/batman
interface shutdown, by Sven Eckelmann
4) properly decrease router object reference counter upon routing table
update, by Sven Eckelmann
5) release queue slots when purging OGM packets of deactivating slave
interface, by Linus Lüssing
Patch 2 and 3 have no "Fixes:" tag because the offending commits date
back to when batman-adv was not yet officially in the net tree.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
There's no need to calculate rps hash if it was not enabled. So this
patch export rps_needed and check it before trying to get rps
hash. Tests (using pktgen to inject packets to guest) shows this can
improve pps about 13% (when rps is disabled).
Before:
~1150000 pps
After:
~1300000 pps
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
----
Changes from V1:
- Fix build when CONFIG_RPS is not set
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds an eor bit to the TCP_SKB_CB. When MSG_EOR
is passed to tcp_sendmsg, the eor bit will be set at the skb
containing the last byte of the userland's msg. The eor bit
will prevent data from appending to that skb in the future.
The change in do_tcp_sendpages is to honor the eor set
during the previous tcp_sendmsg(MSG_EOR) call.
This patch handles the tcp_sendmsg case. The followup patches
will handle other skb coalescing and fragment cases.
One potential use case is to use MSG_EOR with
SOF_TIMESTAMPING_TX_ACK to get a more accurate
TCP ack timestamping on application protocol with
multiple outgoing response messages (e.g. HTTP2).
Packetdrill script for testing:
~~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
0.200 write(4, ..., 14600) = 14600
0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
0.200 > . 1:7301(7300) ack 1
0.200 > P. 7301:14601(7300) ack 1
0.300 < . 1:1(0) ack 14601 win 257
0.300 > P. 14601:15331(730) ack 1
0.300 > P. 15331:16061(730) ack 1
0.400 < . 1:1(0) ack 16061 win 257
0.400 close(4) = 0
0.400 > F. 16061:16061(0) ack 1
0.400 < F. 1:1(0) ack 16062 win 257
0.400 > . 16062:16062(0) ack 2
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The SKBTX_ACK_TSTAMP flag is set in skb_shinfo->tx_flags when
the timestamp of the TCP acknowledgement should be reported on
error queue. Since accessing skb_shinfo is likely to incur a
cache-line miss at the time of receiving the ack, the
txstamp_ack bit was added in tcp_skb_cb, which is set iff
the SKBTX_ACK_TSTAMP flag is set for an skb. This makes
SKBTX_ACK_TSTAMP flag redundant.
Remove the SKBTX_ACK_TSTAMP and instead use the txstamp_ack bit
everywhere.
Note that this frees one bit in shinfo->tx_flags.
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Suggested-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the redundant check for sk->sk_tsflags in tcp_tx_timestamp.
tcp_tx_timestamp() receives the tsflags as a parameter. As a
result the "sk->sk_tsflags || tsflags" is redundant, since
tsflags already includes sk->sk_tsflags plus overrides from
control messages.
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is nothing related to BH in SNMP counters anymore,
since linux-3.0.
Rename helpers to use __ prefix instead of _BH prefix,
for contexts where preemption is disabled.
This more closely matches convention used to update
percpu variables.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename IP6_UPD_PO_STATS_BH() to __IP6_UPD_PO_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename IP6_INC_STATS_BH() to __IP6_INC_STATS()
and IP6_ADD_STATS_BH() to __IP6_ADD_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename NET_INC_STATS_BH() to __NET_INC_STATS()
and NET_ADD_STATS_BH() to __NET_ADD_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename IP_UPD_PO_STATS_BH() to __IP_UPD_PO_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename ICMP6_INC_STATS_BH() to __ICMP6_INC_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename IP_INC_STATS_BH() to __IP_INC_STATS(), to
better express this is used in non preemptible context.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename SCTP_INC_STATS_BH() to __SCTP_INC_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename UDP_INC_STATS_BH() to __UDP_INC_STATS(),
and UDP6_INC_STATS_BH() to __UDP6_INC_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename ICMP_INC_STATS_BH() to __ICMP_INC_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename DCCP_INC_STATS_BH() to __DCCP_INC_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the old days (before linux-3.0), SNMP counters were duplicated,
one for user context, and one for BH context.
After commit 8f0ea0fe3a ("snmp: reduce percpu needs by 50%")
we have a single copy, and what really matters is preemption being
enabled or disabled, since we use this_cpu_inc() or __this_cpu_inc()
respectively.
We therefore kill SNMP_INC_STATS_USER(), SNMP_ADD_STATS_USER(),
NET_INC_STATS_USER(), NET_ADD_STATS_USER(), SCTP_INC_STATS_USER(),
SNMP_INC_STATS64_USER(), SNMP_ADD_STATS64_USER(), TCP_ADD_STATS_USER(),
UDP_INC_STATS_USER(), UDP6_INC_STATS_USER(), and XFRM_INC_STATS_USER()
Following patches will rename __BH helpers to make clear their
usage is not tied to BH being disabled.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Minor overlapping changes in the conflicts.
In the macsec case, the change of the default ID macro
name overlapped with the 64-bit netlink attribute alignment
fixes in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to 3bfd847203 ("net: Use passed in table for nexthop lookups")
for IPv4, if the route spec contains a table id use that to lookup the
next hop first and fall back to a full lookup if it fails (per the fix
4c9bcd1179 ("net: Fix nexthop lookups")).
Example:
root@kenny:~# ip -6 ro ls table red
local 2100:1::1 dev lo proto none metric 0 pref medium
2100:1::/120 dev eth1 proto kernel metric 256 pref medium
local 2100:2::1 dev lo proto none metric 0 pref medium
2100:2::/120 dev eth2 proto kernel metric 256 pref medium
local fe80::e0:f9ff:fe09:3cac dev lo proto none metric 0 pref medium
local fe80::e0:f9ff:fe1c:b974 dev lo proto none metric 0 pref medium
fe80::/64 dev eth1 proto kernel metric 256 pref medium
fe80::/64 dev eth2 proto kernel metric 256 pref medium
ff00::/8 dev red metric 256 pref medium
ff00::/8 dev eth1 metric 256 pref medium
ff00::/8 dev eth2 metric 256 pref medium
unreachable default dev lo metric 240 error -113 pref medium
root@kenny:~# ip -6 ro add table red 2100:3::/64 via 2100:1::64
RTNETLINK answers: No route to host
Route add fails even though 2100:1::64 is a reachable next hop:
root@kenny:~# ping6 -I red 2100:1::64
ping6: Warning: source address might be selected on device other than red.
PING 2100:1::64(2100:1::64) from 2100:1::1 red: 56 data bytes
64 bytes from 2100:1::64: icmp_seq=1 ttl=64 time=1.33 ms
With this patch:
root@kenny:~# ip -6 ro add table red 2100:3::/64 via 2100:1::64
root@kenny:~# ip -6 ro ls table red
local 2100:1::1 dev lo proto none metric 0 pref medium
2100:1::/120 dev eth1 proto kernel metric 256 pref medium
local 2100:2::1 dev lo proto none metric 0 pref medium
2100:2::/120 dev eth2 proto kernel metric 256 pref medium
2100:3::/64 via 2100:1::64 dev eth1 metric 1024 pref medium
local fe80::e0:f9ff:fe09:3cac dev lo proto none metric 0 pref medium
local fe80::e0:f9ff:fe1c:b974 dev lo proto none metric 0 pref medium
fe80::/64 dev eth1 proto kernel metric 256 pref medium
fe80::/64 dev eth2 proto kernel metric 256 pref medium
ff00::/8 dev red metric 256 pref medium
ff00::/8 dev eth1 metric 256 pref medium
ff00::/8 dev eth2 metric 256 pref medium
unreachable default dev lo metric 240 error -113 pref medium
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nicolas converted most users, but didn't realize some were generated
by macros. Convert those over as well.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In the case that dev_alloc_name() fails, e.g. because the name was
given by the user and already exists, we need to clean up properly
and free the per-CPU statistics. Fix that.
Cc: stable@vger.kernel.org
Fixes: 5a490510ba ("mac80211: use per-CPU TX/RX statistics")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
No more users in the tree, remove NETDEV_TX_LOCKED support.
Adds another hole in softnet_stats struct, but better than keeping
the unused collision counter around.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
For sctp assoc, when rcvbuf_policy is set, it will has it's own
rmem_alloc, when we dump asoc info in sctp_diag, we should use that
value on RMEM_ALLOC as well, just like WMEM_ALLOC.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth-next 2016-04-26
Here's another set of Bluetooth & 802.15.4 patches for the 4.7 kernel:
- Cleanups & refactoring of ieee802154 & 6lowpan code
- Security related additions to ieee802154 and mrf24j40 driver
- Memory corruption fix to Bluetooth 6lowpan code
- Race condition fix in vhci driver
- Enhancements to the atusb 802.15.4 driver
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
I also fix commit 8b32ab9e6ef1: use nla_total_size_64bit() for
OVS_FLOW_ATTR_USED in ovs_flow_cmd_msg_size().
Fixes: 8b32ab9e6ef1 ("ovs: use nla_put_u64_64bit()")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I also fix the value of INET_DIAG_MAX. It's wrong since commit 8f840e47f1
which is only in net-next right now, thus I didn't make a separate patch.
Fixes: 8f840e47f1 ("sctp: add the sctp_diag.c file")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It was a simple idea -- save IPv6 configured addresses on a link down
so that IPv6 behaves similar to IPv4. As always the devil is in the
details and the IPv6 stack as too many behavioral differences from IPv4
making the simple idea more complicated than it needs to be.
The current implementation for keeping IPv6 addresses can panic or spit
out a warning in one of many paths:
1. IPv6 route gets an IPv4 route as its 'next' which causes a panic in
rt6_fill_node while handling a route dump request.
2. rt->dst.obsolete is set to DST_OBSOLETE_DEAD hitting the WARN_ON in
fib6_del
3. Panic in fib6_purge_rt because rt6i_ref count is not 1.
The root cause of all these is references related to the host route for
an address that is retained.
So, this patch deletes the host route every time the ifdown loop runs.
Since the host route is deleted and will be re-generated an up there is
no longer a need for the l3mdev fix up. On the 'admin up' side move
addrconf_permanent_addr into the NETDEV_UP event handling so that it
runs only once versus on UP and CHANGE events.
All of the current panics and warnings appear to be related to
addresses on the loopback device, but given the catastrophic nature when
a bug is triggered this patch takes the conservative approach and evicts
all host routes rather than trying to determine when it can be re-used
and when it can not. That can be a later optimizaton if desired.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 841645b5f2.
Ok, this puts the feature back. I've decided to apply David A.'s
bug fix and run with that rather than make everyone wait another
whole release for this feature.
Signed-off-by: David S. Miller <davem@davemloft.net>
Since cfg80211 maintains separate BSS table entries for APs if the same
BSSID, SSID pair is seen on multiple channels, it is possible that it
can map the current_bss to a BSS entry on the wrong channel. This
current_bss will not get flushed unless disconnected and cfg80211
reports a wrong channel as the associated channel.
Fix this by introducing a new cfg80211_connect_bss() function which is
similar to cfg80211_connect_result(), but it includes an additional
parameter: the bss the STA is connected to. This allows drivers to
provide the exact bss entry that matches the BSS to which the connection
was completed.
Reviewed-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Vidyullatha Kanchanapally <vkanchan@qti.qualcomm.com>
Signed-off-by: Sunil Dutt <usdutt@qti.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add support for the a station statistics netlink attribute:
NL80211_STA_INFO_RX_DURATION.
If present, this attribute contains the aggregate PPDU duration (in
microseconds) for all the frames from the peer. This is useful to
help understand the total time spent transmitting to us by all of
the connected peers.
Signed-off-by: Mohammed Shafi Shajakhan <mohammed@qti.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Support checksum neutral ILA as described in the ILA draft. The low
order 16 bits of the identifier are used to contain the checksum
adjustment value.
The csum-mode parameter is added to described checksum processing. There
are three values:
- adjust transport checksum (previous behavior)
- do checksum neutral mapping
- do nothing
On output the csum-mode in the ila_params is checked and acted on. If
mode is checksum neutral mapping then to mapping and set C-bit.
On input, C-bit is checked. If it is set checksum-netural mapping is
done (regardless of csum-mode in ila params) and C-bit will be cleared.
If it is not set then action in csum-mode is taken.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change model of xlat to be used only for input where lookup is done on
the locator part of an address (comparing to locator_match as key
in rhashtable). This is needed for checksum neutral translation
which obfuscates the low order 16 bits of the identifier. It also
permits hosts to be in muliple ILA domains (each locator can map
to a different SIR address). A check is also added to disallow
translating non-ILA addresses (check of type in identifier).
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add structures for identifiers, locators, and an ila address which
is composed of a locator and identifier and in6_addr can be cast to
it. This includes a three bit type field and enums for the types defined
in ILA I-D.
In ILA lwt don't allow user to set a translation for a non-ILA
address (type of identifier is zero meaning it is an IID). This also
requires that the destination prefix is at least 65 bytes (64
bit locator and first byte of identifier).
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The memcpy of ipv6 header destination address to the skb control block
(sbk->cb) in header_create() results in currupted memory when bt_xmit()
is issued. The skb->cb is "released" in the return of header_create()
making room for lower layer to minipulate the skb->cb.
The value retrieved in bt_xmit is not persistent across header creation
and sending, and the lower layer will overwrite portions of skb->cb,
making the copied destination address wrong.
The memory corruption will lead to non-working multicast as the first 4
bytes of the copied destination address is replaced by a value that
resolves into a non-multicast prefix.
This fix removes the dependency on the skb control block between header
creation and send, by moving the destination address memcpy to the send
function path (setup_create, which is called from bt_xmit).
Signed-off-by: Glenn Ruben Bakke <glenn.ruben.bakke@nordicsemi.no>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Cc: stable@vger.kernel.org # 4.5+
rds-stress experiments with request size 256 bytes, 8K acks,
using 16 threads show a 40% improvment when pskb_extract()
replaces the {skb_clone(..); pskb_pull(..); pskb_trim(..);}
pattern in the Rx path, so we leverage the perf gain with
this commit.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A pattern of skb usage seen in modules such as RDS-TCP is to
extract `to_copy' bytes from the received TCP segment, starting
at some offset `off' into a new skb `clone'. This is done in
the ->data_ready callback, where the clone skb is queued up for rx on
the PF_RDS socket, while the parent TCP segment is returned unchanged
back to the TCP engine.
The existing code uses the sequence
clone = skb_clone(..);
pskb_pull(clone, off, ..);
pskb_trim(clone, to_copy, ..);
with the intention of discarding the first `off' bytes. However,
skb_clone() + pskb_pull() implies pksb_expand_head(), which ends
up doing a redundant memcpy of bytes that will then get discarded
in __pskb_pull_tail().
To avoid this inefficiency, this commit adds pskb_extract() that
creates the clone, and memcpy's only the relevant header/frag/frag_list
to the start of `clone'. pskb_trim() is then invoked to trim clone
down to the requested to_copy bytes.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It was impossible to include codel.h for the
purpose of having access to codel_params or
codel_vars structure definitions and using them
for embedding in other more complex structures.
This splits allows codel.h itself to be treated
like any other header file while codel_qdisc.h and
codel_impl.h contain function definitions with
logic that was previously in codel.h.
This copies over copyrights and doesn't involve
code changes other than adding a few additional
include directives to net/sched/sch*codel.c.
Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This strips out qdisc specific bits from the code
and makes it slightly more reusable. Codel will be
used by wireless/mac80211 in the future.
Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We should call consume_skb(skb) when skb is properly consumed,
or kfree_skb(skb) when skb must be dropped in error case.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We now have proper per-listener but also per network namespace counters
for SYN packets that might be dropped.
We replace the kfree_skb() by consume_skb() to be drop monitor [1]
friendly, and remove an obsolete comment.
FastOpen SYN packets can carry payload in them just fine.
[1] perf record -a -g -e skb:kfree_skb sleep 1; perf report
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts the following three commits:
70af921db6799977d9aaf1705ec197
The feature was ill conceived, has terrible semantics, and has added
nothing but regressions to the already fragile ipv6 stack.
Fixes: f1705ec197 ("net: ipv6: Make address flushing on ifdown optional")
Signed-off-by: David S. Miller <davem@davemloft.net>
Starting the kernel client with cephx disabled and then enabling cephx
and restarting userspace daemons can result in a crash:
[262671.478162] BUG: unable to handle kernel paging request at ffffebe000000000
[262671.531460] IP: [<ffffffff811cd04a>] kfree+0x5a/0x130
[262671.584334] PGD 0
[262671.635847] Oops: 0000 [#1] SMP
[262672.055841] CPU: 22 PID: 2961272 Comm: kworker/22:2 Not tainted 4.2.0-34-generic #39~14.04.1-Ubuntu
[262672.162338] Hardware name: Dell Inc. PowerEdge R720/068CDY, BIOS 2.4.3 07/09/2014
[262672.268937] Workqueue: ceph-msgr con_work [libceph]
[262672.322290] task: ffff88081c2d0dc0 ti: ffff880149ae8000 task.ti: ffff880149ae8000
[262672.428330] RIP: 0010:[<ffffffff811cd04a>] [<ffffffff811cd04a>] kfree+0x5a/0x130
[262672.535880] RSP: 0018:ffff880149aeba58 EFLAGS: 00010286
[262672.589486] RAX: 000001e000000000 RBX: 0000000000000012 RCX: ffff8807e7461018
[262672.695980] RDX: 000077ff80000000 RSI: ffff88081af2be04 RDI: 0000000000000012
[262672.803668] RBP: ffff880149aeba78 R08: 0000000000000000 R09: 0000000000000000
[262672.912299] R10: ffffebe000000000 R11: ffff880819a60e78 R12: ffff8800aec8df40
[262673.021769] R13: ffffffffc035f70f R14: ffff8807e5b138e0 R15: ffff880da9785840
[262673.131722] FS: 0000000000000000(0000) GS:ffff88081fac0000(0000) knlGS:0000000000000000
[262673.245377] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[262673.303281] CR2: ffffebe000000000 CR3: 0000000001c0d000 CR4: 00000000001406e0
[262673.417556] Stack:
[262673.472943] ffff880149aeba88 ffff88081af2be04 ffff8800aec8df40 ffff88081af2be04
[262673.583767] ffff880149aeba98 ffffffffc035f70f ffff880149aebac8 ffff8800aec8df00
[262673.694546] ffff880149aebac8 ffffffffc035c89e ffff8807e5b138e0 ffff8805b047f800
[262673.805230] Call Trace:
[262673.859116] [<ffffffffc035f70f>] ceph_x_destroy_authorizer+0x1f/0x50 [libceph]
[262673.968705] [<ffffffffc035c89e>] ceph_auth_destroy_authorizer+0x3e/0x60 [libceph]
[262674.078852] [<ffffffffc0352805>] put_osd+0x45/0x80 [libceph]
[262674.134249] [<ffffffffc035290e>] remove_osd+0xae/0x140 [libceph]
[262674.189124] [<ffffffffc0352aa3>] __reset_osd+0x103/0x150 [libceph]
[262674.243749] [<ffffffffc0354703>] kick_requests+0x223/0x460 [libceph]
[262674.297485] [<ffffffffc03559e2>] ceph_osdc_handle_map+0x282/0x5e0 [libceph]
[262674.350813] [<ffffffffc035022e>] dispatch+0x4e/0x720 [libceph]
[262674.403312] [<ffffffffc034bd91>] try_read+0x3d1/0x1090 [libceph]
[262674.454712] [<ffffffff810ab7c2>] ? dequeue_entity+0x152/0x690
[262674.505096] [<ffffffffc034cb1b>] con_work+0xcb/0x1300 [libceph]
[262674.555104] [<ffffffff8108fb3e>] process_one_work+0x14e/0x3d0
[262674.604072] [<ffffffff810901ea>] worker_thread+0x11a/0x470
[262674.652187] [<ffffffff810900d0>] ? rescuer_thread+0x310/0x310
[262674.699022] [<ffffffff810957a2>] kthread+0xd2/0xf0
[262674.744494] [<ffffffff810956d0>] ? kthread_create_on_node+0x1c0/0x1c0
[262674.789543] [<ffffffff817bd81f>] ret_from_fork+0x3f/0x70
[262674.834094] [<ffffffff810956d0>] ? kthread_create_on_node+0x1c0/0x1c0
What happens is the following:
(1) new MON session is established
(2) old "none" ac is destroyed
(3) new "cephx" ac is constructed
...
(4) old OSD session (w/ "none" authorizer) is put
ceph_auth_destroy_authorizer(ac, osd->o_auth.authorizer)
osd->o_auth.authorizer in the "none" case is just a bare pointer into
ac, which contains a single static copy for all services. By the time
we get to (4), "none" ac, freed in (2), is long gone. On top of that,
a new vtable installed in (3) points us at ceph_x_destroy_authorizer(),
so we end up trying to destroy a "none" authorizer with a "cephx"
destructor operating on invalid memory!
To fix this, decouple authorizer destruction from ac and do away with
a single static "none" authorizer by making a copy for each OSD or MDS
session. Authorizers themselves are independent of ac and so there is
no reason for destroy_authorizer() to be an ac op. Make it an op on
the authorizer itself by turning ceph_authorizer into a real struct.
Fixes: http://tracker.ceph.com/issues/15447
Reported-by: Alan Zhang <alan.zhang@linux.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
d894ba18d4 ("soreuseport: fix ordering for mixed v4/v6 sockets")
was merged as a bug fix to the net tree. Two conflicting changes
were committed to net-next before the above fix was merged back to
net-next:
ca065d0cf8 ("udp: no longer use SLAB_DESTROY_BY_RCU")
3b24d854cb ("tcp/dccp: do not touch listener sk_refcnt under synflood")
These changes switched the datastructure used for TCP and UDP sockets
from hlist_nulls to hlist. This patch applies the necessary parts
of the net tree fix to net-next which were not automatic as part of the
merge.
Fixes: 1602f49b58 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simon Horman says:
====================
IPVS Updates for v4.7
please consider these enhancements to the IPVS. They allow SIP connections
originating from real-servers to be load balanced by the SIP psersitence
engine as is already implemented in the other direction. And for better one
packet scheduling (OPS) performance.
====================
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
ip6_route_output() will never return a NULL pointer, so there's no need
to check it.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Four years ago we introduced a new sysctl knob to disable automatic
helper assignment in 72110dfaa907 ("netfilter: nf_ct_helper: disable
automatic helper assignment"). This knob kept this behaviour enabled by
default to remain conservative.
This measure was introduced to provide a secure way to configure
iptables and connection tracking helpers through explicit rules.
Give the time we have waited for this, let's turn off this by default
now, worse case users still have a chance to recover the former
behaviour by explicitly enabling this back through sysctl.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch fixes dynamic element updates for adjacent intervals in the
rb-tree representation.
Since elements are sorted in the rb-tree, in case of adjacent nodes with
the same key, the assumption is that an interval end node must be placed
before an interval opening.
In tree lookup operations, the idea is to search for the closer element
that is smaller than the one we're searching for. Given that we'll have
two possible matchings, we have to take the opening interval in case of
adjacent nodes.
Range merges are not trivial with the current representation,
specifically we have to check if node extensions are equal and make sure
we keep the existing internal states around.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Parse flags and pass them to the set via ->deactivate() to check if we
remove the right element from the intervals.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This function parses the set element flags, thus, we can reuse the same
handling when deleting elements.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
As earlier commit removed accessed to the hash from other files we can
also make it static.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Use a private seed and init it using get_random_once.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We only allow rehash in init namespace, so we only use
init_ns.generation. And even if we would allow it, it makes no sense
as the conntrack locks are global; any ongoing rehash prevents insert/
delete.
So make this private to nf_conntrack_core instead.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
After commit fbd40ea018 ("ipv4: Don't do expensive useless work
during inetdev destroy.") when deleting an interface,
fib_del_ifaddr() can be executed without any primary address
present on the dead interface.
The above is safe, but triggers some "bug: prim == NULL" warnings.
This commit avoids warning if the in_dev is dead
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linux TCP stack painfully segments all TSO/GSO packets before retransmits.
This was fine back in the days when TSO/GSO were emerging, with their
bugs, but we believe the dark age is over.
Keeping big packets in write queues, but also in stack traversal
has a lot of benefits.
- Less memory overhead, because write queues have less skbs
- Less cpu overhead at ACK processing.
- Better SACK processing, as lot of studies mentioned how
awful linux was at this ;)
- Less cpu overhead to send the rtx packets
(IP stack traversal, netfilter traversal, drivers...)
- Better latencies in presence of losses.
- Smaller spikes in fq like packet schedulers, as retransmits
are not constrained by TCP Small Queues.
1 % packet losses are common today, and at 100Gbit speeds, this
translates to ~80,000 losses per second.
Losses are often correlated, and we see many retransmit events
leading to 1-MSS train of packets, at the time hosts are already
under stress.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 42b18f605f ("tipc: refactor function tipc_link_timeout()"),
introduced a bug which prevents sending of probe messages during
link synchronization phase. This leads to hanging links, if the
bearer is disabled/enabled after links are up.
In this commit, we send the probe messages correctly.
Fixes: 42b18f605f ("tipc: refactor function tipc_link_timeout()")
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a race-condition when updating the mdb offload flag without using
the mulicast_lock. This reverts commit 9e8430f8d6 ("bridge: mdb:
Passing the port-group pointer to br_mdb module").
This patch marks offloaded MDB entry as "offload" by changing the port-
group flags and marks it as MDB_PG_FLAGS_OFFLOAD.
When switchdev PORT_MDB succeeded and adds a multicast group, a completion
callback is been invoked "br_mdb_complete". The completion function
locks the multicast_lock and finds the right net_bridge_port_group and
marks it as offloaded.
Fixes: 9e8430f8d6 ("bridge: mdb: Passing the port-group pointer to br_mdb module")
Reported-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is duplicate code that translates br_mdb_entry to br_ip let's wrap it
in a common function.
Signed-off-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When using switchdev deferred operation (SWITCHDEV_F_DEFER), the operation
is executed in different context and the application doesn't have any way
to get the operation real status.
Adding a completion callback fixes that. This patch adds fields to
switchdev_attr and switchdev_obj "complete_priv" field which is used by
the "complete" callback.
Application can set a complete function which will be called once the
operation executed.
Signed-off-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When removing a single interface while a broadcast or ogm packet is
still pending then we will free the forward packet without releasing the
queue slots again.
This patch is supposed to fix this issue.
Fixes: 6d5808d4ae ("batman-adv: Add missing hardif_free_ref in forw_packet_free")
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
[sven@narfation.org: fix conflicts with current version]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
_batadv_update_route rcu_derefences orig_ifinfo->router outside of a
spinlock protected region to print some information messages to the debug
log. But this pointer is not checked again when the new pointer is assigned
in the spinlock protected region. Thus is can happen that the value of
orig_ifinfo->router changed in the meantime and thus the reference counter
of the wrong router gets reduced after the spinlock protected region.
Just rcu_dereferencing the value of orig_ifinfo->router inside the spinlock
protected region (which also set the new pointer) is enough to get the
correct old router object.
Fixes: e1a5382f97 ("batman-adv: Make orig_node->router an rcu protected pointer")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The shutdown of an batman-adv interface can happen with one of its slave
interfaces still being in the BATADV_IF_TO_BE_ACTIVATED state. A possible
reason for it is that the routing algorithm BATMAN_V was selected and
batadv_schedule_bat_ogm was not yet called for this interface. This slave
interface still has to be set to BATADV_IF_INACTIVE or the batman-adv
interface will never reduce its usage counter and thus never gets shutdown.
This problem can be simulated via:
$ modprobe dummy
$ modprobe batman-adv routing_algo=BATMAN_V
$ ip link add bat0 type batadv
$ ip link set dummy0 master bat0
$ ip link set dummy0 up
$ ip link del bat0
unregister_netdevice: waiting for bat0 to become free. Usage count = 3
Reported-by: Matthias Schiffer <mschiffer@universe-factory.net>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
[sven@narfation.org: fix conflicts with current version]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The encapsulated ethernet and VLAN header may be outside the received
ethernet frame. Thus the skb buffer size has to be checked before it can be
parsed to find out if it encapsulates another batman-adv packet.
Fixes: 420193573f ("batman-adv: softif bridge loop avoidance")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for your net-next
tree, mostly from Florian Westphal to sort out the lack of sufficient
validation in x_tables and connlabel preparation patches to add
nf_tables support. They are:
1) Ensure we don't go over the ruleset blob boundaries in
mark_source_chains().
2) Validate that target jumps land on an existing xt_entry. This extra
sanitization comes with a performance penalty when loading the ruleset.
3) Introduce xt_check_entry_offsets() and use it from {arp,ip,ip6}tables.
4) Get rid of the smallish check_entry() functions in {arp,ip,ip6}tables.
5) Make sure the minimal possible target size in x_tables.
6) Similar to #3, add xt_compat_check_entry_offsets() for compat code.
7) Check that standard target size is valid.
8) More sanitization to ensure that the target_offset field is correct.
9) Add xt_check_entry_match() to validate that matches are well-formed.
10-12) Three patch to reduce the number of parameters in
translate_compat_table() for {arp,ip,ip6}tables by using a container
structure.
13) No need to return value from xt_compat_match_from_user(), so make
it void.
14) Consolidate translate_table() so it can be used by compat code too.
15) Remove obsolete check for compat code, so we keep consistent with
what was already removed in the native layout code (back in 2007).
16) Get rid of target jump validation from mark_source_chains(),
obsoleted by #2.
17) Introduce xt_copy_counters_from_user() to consolidate counter
copying, and use it from {arp,ip,ip6}tables.
18,22) Get rid of unnecessary explicit inlining in ctnetlink for dump
functions.
19) Move nf_connlabel_match() to xt_connlabel.
20) Skip event notification if connlabel did not change.
21) Update of nf_connlabels_get() to make the upcoming nft connlabel
support easier.
23) Remove spinlock to read protocol state field in conntrack.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
nla_data() is now aligned on a 64-bit area.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
nla_data() is now aligned on a 64-bit area.
A temporary version (nla_put_be64_32bit()) is added for nla_put_net64().
This function is removed in the next patch.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
nla_data() is now aligned on a 64-bit area.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts were two cases of simple overlapping changes,
nothing serious.
In the UDP case, we need to add a hlist_add_tail_rcu()
to linux/rculist.h, because we've moved UDP socket handling
away from using nulls lists.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Fix memory leak in iwlwifi, from Matti Gottlieb.
2) Add missing registration of netfilter arp_tables into initial
namespace, from Florian Westphal.
3) Fix potential NULL deref in DecNET routing code.
4) Restrict NETLINK_URELEASE to truly bound sockets only, from Dmitry
Ivanov.
5) Fix dst ref counting in VRF, from David Ahern.
6) Fix TSO segmenting limits in i40e driver, from Alexander Duyck.
7) Fix heap leak in PACKET_DIAG_MCLIST, from Mathias Krause.
8) Ravalidate IPV6 datagram socket cached routes properly, particularly
with UDP, from Martin KaFai Lau.
9) Fix endian bug in RDS dp_ack_seq handling, from Qing Huang.
10) Fix stats typing in bcmgenet driver, from Eric Dumazet.
11) Openvswitch needs to orphan SKBs before ipv6 fragmentation handing,
from Joe Stringer.
12) SPI device reference leak in spi_ks8895 PHY driver, from Mark Brown.
13) atl2 doesn't actually support scatter-gather, so don't advertise the
feature. From Ben Hucthings.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (72 commits)
openvswitch: use flow protocol when recalculating ipv6 checksums
Driver: Vmxnet3: set CHECKSUM_UNNECESSARY for IPv6 packets
atl2: Disable unimplemented scatter/gather feature
net/mlx4_en: Split SW RX dropped counter per RX ring
net/mlx4_core: Don't allow to VF change global pause settings
net/mlx4_core: Avoid repeated calls to pci enable/disable
net/mlx4_core: Implement pci_resume callback
net: phy: spi_ks8895: Don't leak references to SPI devices
net: ethernet: davinci_emac: Fix platform_data overwrite
net: ethernet: davinci_emac: Fix Unbalanced pm_runtime_enable
qede: Fix single MTU sized packet from firmware GRO flow
qede: Fix setting Skb network header
qede: Fix various memory allocation error flows for fastpath
tcp: Merge tx_flags and tskey in tcp_shifted_skb
tcp: Merge tx_flags and tskey in tcp_collapse_retrans
drivers: net: cpsw: fix wrong regs access in cpsw_ndo_open
tcp: Fix SOF_TIMESTAMPING_TX_ACK when handling dup acks
openvswitch: Orphan skbs before IPv6 defrag
Revert "Prevent NUll pointer dereference with two PHYs on cpsw"
VSOCK: Only check error on skb_recv_datagram when skb is NULL
...
When using masked actions the ipv6_proto field of an action
to set IPv6 fields may be zero rather than the prevailing protocol
which will result in skipping checksum recalculation.
This patch resolves the problem by relying on the protocol
in the flow key rather than that in the set field action.
Fixes: 83d2b9ba1a ("net: openvswitch: Support masked set actions.")
Cc: Jarno Rajahalme <jrajahalme@nicira.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for NETIF_F_TSO_MANGLEID if a given tunnel supports
NETIF_F_TSO. This way if needed a device can then later enable the TSO
with IP ID mangling and the tunnels on top of that device can then also
make use of the IP ID mangling as well.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Removed .type field from NLA to do proper length checking.
Reported by Daniel Borkmann and Julia Lawall.
Signed-off-by: Peter Heise <peter.heise@airbus.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
EXPIRES_IN_MS macro comes from net/ipv4/inet_diag.c and dates
back to before jiffies_to_msecs() has been introduced.
Now we can remove it and use jiffies_to_msecs().
Suggested-by: Jakub Sitnicki <jkbs@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Jakub Sitnicki <jkbs@redhat.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Having the tag protocol in dsa_switch_driver for setup time and in
dsa_switch_tree for runtime is enough. Remove dsa_switch's one.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a new RTM_GETSTATS message to query link stats via netlink
from the kernel. RTM_NEWLINK also dumps stats today, but RTM_NEWLINK
returns a lot more than just stats and is expensive in some cases when
frequent polling for stats from userspace is a common operation.
RTM_GETSTATS is an attempt to provide a light weight netlink message
to explicity query only link stats from the kernel on an interface.
The idea is to also keep it extensible so that new kinds of stats can be
added to it in the future.
This patch adds the following attribute for NETDEV stats:
struct nla_policy ifla_stats_policy[IFLA_STATS_MAX + 1] = {
[IFLA_STATS_LINK_64] = { .len = sizeof(struct rtnl_link_stats64) },
};
Like any other rtnetlink message, RTM_GETSTATS can be used to get stats of
a single interface or all interfaces with NLM_F_DUMP.
Future possible new types of stat attributes:
link af stats:
- IFLA_STATS_LINK_IPV6 (nested. for ipv6 stats)
- IFLA_STATS_LINK_MPLS (nested. for mpls/mdev stats)
extended stats:
- IFLA_STATS_LINK_EXTENDED (nested. extended software netdev stats like bridge,
vlan, vxlan etc)
- IFLA_STATS_LINK_HW_EXTENDED (nested. extended hardware stats which are
available via ethtool today)
This patch also declares a filter mask for all stat attributes.
User has to provide a mask of stats attributes to query. filter mask
can be specified in the new hdr 'struct if_stats_msg' for stats messages.
Other important field in the header is the ifindex.
This api can also include attributes for global stats (eg tcp) in the future.
When global stats are included in a stats msg, the ifindex in the header
must be zero. A single stats message cannot contain both global and
netdev specific stats. To easily distinguish them, netdev specific stat
attributes name are prefixed with IFLA_STATS_LINK_
Without any attributes in the filter_mask, no stats will be returned.
This patch has been tested with mofified iproute2 ifstat.
Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When using OPS mode in conjunction with SIP persistent-engine, packets
originating from the same ip-address/port could be balanced to different
real servers, and (to properly handle SIP responses) OPS connections
are created in the in-out direction too, where ip_vs_update_conntrack()
is called to modify the reply tuple.
As a result, there can be collision of conntrack tuples, causing random
packet drops, as explained below:
conntrack1: orig=CIP->VIP, reply=RIP1->CIP
conntrack2: orig=RIP2->CIP, reply=CIP->VIP
Tuple CIP->VIP is both in orig of conntrack1 and reply of conntrack2.
The collision triggers packet drop inside nf_conntrack processing.
In addition, the current implementation deletes the conntrack object at
every expire of an OPS connection (once every forwarded packet), to have
it recreated from scratch at next packet traversing IPVS.
Since in OPS mode, by definition, we don't expect any associated
response, the choices implemented in this patch are:
a) don't call nf_conntrack_alter_reply() for OPS connections inside
ip_vs_update_conntrack().
b) don't delete the conntrack object at OPS connection expire.
The result is that created conntrack objects for each tuple CIP->VIP,
RIP-N->CIP, etc. are left in UNREPLIED state and not modified by IPVS
OPS connection management. This eliminates packet drops and leaves
a single conntrack object for each tuple packets are sent from.
Signed-off-by: Marco Angaroni <marcoangaroni@gmail.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
One-packet-scheduling is the most expensive mode in IPVS from
performance point of view: for each packet to be processed a new
connection data structure is created and, after packet is sent,
deleted by starting a new timer set to expire immediately.
SIP persistent-engine needs OPS mode to have Call-ID based load
balancing, so OPS mode performance has negative impact in SIP
protocol load balancing.
This patch aims to improve performance of OPS mode by means of the
following changes in the release mechanism of OPS connections:
a) call expire callback ip_vs_conn_expire() directly instead of
starting a timer programmed to fire immediately.
b) avoid call_rcu() overhead inside expire callback, since OPS
connection are not inserted in the hash-table and last just the
time to process the packet, hence there is no concurrent access
to such data structures.
Signed-off-by: Marco Angaroni <marcoangaroni@gmail.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
When using LVS-NAT and SIP persistence-egine over UDP, the following
limitations are present with current implementation:
1) To actually have load-balancing based on Call-ID header, you need to
use one-packet-scheduling mode. But with one-packet-scheduling the
connection is deleted just after packet is forwarded, so SIP responses
coming from real-servers do not match any connection and SNAT is
not applied.
2) If you do not use "-o" option, IPVS behaves as normal UDP load
balancer, so different SIP calls (each one identified by a different
Call-ID) coming from the same ip-address/port go to the same
real-server. So basically you don’t have load-balancing based on
Call-ID as intended.
3) Call-ID is not learned when a new SIP call is started by a real-server
(inside-to-outside direction), but only in the outside-to-inside
direction. This would be a general problem for all SIP servers acting
as Back2BackUserAgent.
This patch aims to solve problems 1) and 3) while keeping OPS mode
mandatory for SIP-UDP, so that 2) is not a problem anymore.
The basic mechanism implemented is to make packets, that do not match any
existent connection but come from real-servers, create new connections
instead of let them pass without any effect.
When such packets pass through ip_vs_out(), if their source ip address and
source port match a configured real-server, a new connection is
automatically created in the same way as it would have happened if the
packet had come from outside-to-inside direction. A new connection template
is created too if the virtual-service is persistent and there is no
matching connection template found. The new connection automatically
created, if the service had "-o" option, is an OPS connection that lasts
only the time to forward the packet, just like it happens on the
ingress side.
The main part of this mechanism is implemented inside a persistent-engine
specific callback (at the moment only SIP persistent engine exists) and
is triggered only for UDP packets, since connection oriented protocols, by
using different set of ports (typically ephemeral ports) to open new
outgoing connections, should not need this feature.
The following requisites are needed for automatic connection creation; if
any is missing the packet simply goes the same way as before.
a) virtual-service is not fwmark based (this is because fwmark services
do not store address and port of the virtual-service, required to
build the connection data).
b) virtual-service and real-servers must not have been configured with
omitted port (this is again to have all data to create the connection).
Signed-off-by: Marco Angaroni <marcoangaroni@gmail.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
If skb_recv_datagram returns an skb, we should ignore the err
value returned. Otherwise, datagram receives will return EAGAIN
when they have to wait for a datagram.
Acked-by: Adit Ranadive <aditr@vmware.com>
Signed-off-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The dsa_slave_priv structure does not need a pointer to its net_device.
Kill it.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a new helper for cls/act programs that can push events
to user space applications. For networking, this can be f.e. for sampling,
debugging, logging purposes or pushing of arbitrary wake-up events. The
idea is similar to a43eec3042 ("bpf: introduce bpf_perf_event_output()
helper") and 39111695b1 ("samples: bpf: add bpf_perf_event_output example").
The eBPF program utilizes a perf event array map that user space populates
with fds from perf_event_open(), the eBPF program calls into the helper
f.e. as skb_event_output(skb, &my_map, BPF_F_CURRENT_CPU, raw, sizeof(raw))
so that the raw data is pushed into the fd f.e. at the map index of the
current CPU.
User space can poll/mmap/etc on this and has a data channel for receiving
events that can be post-processed. The nice thing is that since the eBPF
program and user space application making use of it are tightly coupled,
they can define their own arbitrary raw data format and what/when they
want to push.
While f.e. packet headers could be one part of the meta data that is being
pushed, this is not a substitute for things like packet sockets as whole
packet is not being pushed and push is only done in a single direction.
Intention is more of a generically usable, efficient event pipe to applications.
Workflow is that tc can pin the map and applications can attach themselves
e.g. after cls/act setup to one or multiple map slots, demuxing is done by
the eBPF program.
Adding this facility is with minimal effort, it reuses the helper
introduced in a43eec3042 ("bpf: introduce bpf_perf_event_output() helper")
and we get its functionality for free by overloading its BPF_FUNC_ identifier
for cls/act programs, ctx is currently unused, but will be made use of in
future. Example will be added to iproute2's BPF example files.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Separated from previous patch for readability.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
Struct ctl_table_header holds pointer to sysctl table which could be used
for freeing it after unregistration. IPv4 sysctls already use that.
Remove redundant NULL assignment: ndev allocated using kzalloc.
This also saves some bytes: sysctl table could be shorter than
DEVCONF_MAX+1 if some options are disable in config.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Suggested-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the nlattr header is 4 bytes in size, it can cause the netlink
attribute payload to not be 8-byte aligned.
This is particularly troublesome for IFLA_STATS64 which contains 64-bit
statistic values.
Solve this by creating a dummy IFLA_PAD attribute which has a payload
which is zero bytes in size. When HAVE_EFFICIENT_UNALIGNED_ACCESS is
false, we insert an IFLA_PAD attribute into the netlink response when
necessary such that the IFLA_STATS64 payload will be properly aligned.
With help and suggestions from Eric Dumazet.
Signed-off-by: David S. Miller <davem@davemloft.net>
Calm down gcc warnings:
net/netfilter/nf_conntrack_netlink.c:529:15: warning: 'ctnetlink_proto_size' defined but not used [-Wunused-function]
static size_t ctnetlink_proto_size(const struct nf_conn *ct)
^
net/netfilter/nf_conntrack_netlink.c:546:15: warning: 'ctnetlink_acct_size' defined but not used [-Wunused-function]
static size_t ctnetlink_acct_size(const struct nf_conn *ct)
^
net/netfilter/nf_conntrack_netlink.c:556:12: warning: 'ctnetlink_secctx_size' defined but not used [-Wunused-function]
static int ctnetlink_secctx_size(const struct nf_conn *ct)
^
net/netfilter/nf_conntrack_netlink.c:572:15: warning: 'ctnetlink_timestamp_size' defined but not used [-Wunused-function]
static size_t ctnetlink_timestamp_size(const struct nf_conn *ct)
^
So gcc compiles them out when CONFIG_NF_CONNTRACK_EVENTS and
CONFIG_NETFILTER_NETLINK_GLUE_CT are not set.
Fixes: 4054ff4545 ("netfilter: ctnetlink: remove unnecessary inlining")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
The function convert_legacy_u32_to_link_mode and
convert_link_mode_to_legacy_u32 may be used outside
of ethtool.c. We rename them to ethtool_convert_...
and export them, so we could use them in others
drivers and modules.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
nf_connlabel_set() takes the bit number that we would like to set.
nf_connlabels_get() however took the number of bits that we want to
support.
So e.g. nf_connlabels_get(32) support bits 0 to 31, but not 32.
This changes nf_connlabels_get() to take the highest bit that we want
to set.
Callers then don't have to cope with a potential integer wrap
when using nf_connlabels_get(bit + 1) anymore.
Current callers are fine, this change is only to make folloup
nft ct label set support simpler.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
make the replace function only send a ctnetlink event if the contents
of the new set is different.
Otherwise 'ct label set ct label | bar'
will cause netlink event storm since we "replace" labels for each packet.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently labels can only be set either by iptables connlabel
match or via ctnetlink.
Before adding nftables set support, clean up the clabel core and move
helpers that nft will not need after all to the xtables module.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We deleted a line of code and accidentally made the "return put_user()"
part of the if statement when it's supposed to be unconditional.
Fixes: 9f9a45beaa ('udp: do not expect udp headers on ioctl SIOCINQ')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch passes netlink attr data ptr directly to dev_get_stats
thus elimiating a stats copy.
Suggested-by: David Miller <davem@davemloft.net>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fix spelling typos found in printk
within various part of the kernel sources.
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Change the dsa_switch_driver.probe function to return a const char *.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds code borrowed from bits and pieces of other protocols to
the IPv6 GRE path so that we can support GSO over IPv6 based GRE tunnels.
By adding this support we are able to significantly improve the throughput
for GRE tunnels as we are able to make use of GSO.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since GRE doesn't really care about L3 protocol we can support IPv4 and
IPv6 using the same offloads. With that being the case we can add a call
to register the offloads for IPv6 as a part of our GRE offload
initialization.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for the basic offloads we support on most devices.
Specifically with this patch set we can support checksum offload, basic
scatter-gather, and highdma.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we were creating an ip6gretap interface the MTU was about 6 bytes
short of what was needed. It turns out we were not taking the Ethernet
header into account and as a result we were eating into the 8 bytes
reserved for the encap limit.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch updates the IP tunnel core function iptunnel_handle_offloads so
that we return an int and do not free the skb inside the function. This
actually allows us to clean up several paths in several tunnels so that we
can free the skb at one point in the path without having to have a
secondary path if we are supporting tunnel offloads.
In addition it should resolve some double-free issues I have found in the
tunnels paths as I believe it is possible for us to end up triggering such
an event in the case of fou or gue.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Two different threads with different rds sockets may be in
rds_recv_rcvbuf_delta() via receive path. If their ports
both map to the same word in the congestion map, then
using non-atomic ops to update it could cause the map to
be incorrect. Lets use atomics to avoid such an issue.
Full credit to Wengang <wen.gang.wang@oracle.com> for
finding the issue, analysing it and also pointing out
to offending code with spin lock based fix.
Reviewed-by: Leon Romanovsky <leon@leon.nu>
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dp->dp_ack_seq is used in big endian format. We need to do the
big endianness conversion when we assign a value in host format
to it.
Signed-off-by: Qing Huang <qing.huang@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When __vlan_insert_tag() fails from skb_vlan_push() path due to the
skb_cow_head(), we need to undo the __skb_push() in the error path
as well that was done earlier to move skb->data pointer to mac header.
Moreover, I noticed that when in the non-error path the __skb_pull()
is done and the original offset to mac header was non-zero, we fixup
from a wrong skb->data offset in the checksum complete processing.
So the skb_postpush_rcsum() really needs to be done before __skb_pull()
where skb->data still points to the mac header start and thus operates
under the same conditions as in __vlan_insert_tag().
Fixes: 93515d53b1 ("net: move vlan pop/push functions into common code")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When rhashtable_walk_init return err, no release function should be
called, and when rhashtable_walk_start return err, we should only invoke
rhashtable_walk_exit to release the source.
But now when sctp_transport_walk_start return err, we just call
rhashtable_walk_stop/exit, and never care about if rhashtable_walk_init
or start return err, which is so bad.
We will fix it by calling rhashtable_walk_exit if rhashtable_walk_start
return err in sctp_transport_walk_start, and if sctp_transport_walk_start
return err, we do not need to call sctp_transport_walk_stop any more.
For sctp proc, we will use 'iter->start_fail' to decide if we will call
rhashtable_walk_stop/exit.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In sctp proc, these three functions in remaddrs and assocs are the
same. we should merge them into one.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This one will implement all the interface of inet_diag, inet_diag_handler.
which includes sctp_diag_dump, sctp_diag_dump_one and sctp_diag_get_info.
It will work as a module, and register inet_diag_handler when loading.
v2->v3:
- fix the mistake in inet_assoc_attr_size().
- change inet_diag_msg_laddrs_fill() name to inet_diag_msg_sctpladdrs_fill.
- change inet_diag_msg_paddrs_fill() name to inet_diag_msg_sctpaddrs_fill.
- add inet_diag_msg_sctpinfo_fill() to make asoc/ep fill code clearer.
- add inet_diag_msg_sctpasoc_fill() to make asoc fill code clearer.
- merge inet_asoc_diag_fill() and inet_ep_diag_fill() to
inet_sctp_diag_fill().
- call sctp_diag_get_info() directly, instead by handler, cause the caller
is in the same file with it.
- call lock_sock in sctp_tsp_dump_one() to make sure we call get sctp info
safely.
- after lock_sock(sk), we should check sk != assoc->base.sk.
- change mem[SK_MEMINFO_WMEM_ALLOC] to asoc->sndbuf_used for asoc dump when
asoc->ep->sndbuf_policy is set. don't use INET_DIAG_MEMINFO attr any more.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
inet_diag_msg_common_fill is used to fill the diag msg common info,
we need to use it in sctp_diag as well, so export it.
inet_diag_msg_attrs_fill is used to fill some common attrs info between
sctp diag and tcp diag.
v2->v3:
- do not need to define and export inet_diag_get_handler any more.
cause all the functions in it are in sctp_diag.ko, we just call
them in sctp_diag.ko.
- add inet_diag_msg_attrs_fill to make codes clear.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For some main variables in sctp.ko, we couldn't export it to other modules,
so we have to define some api to access them.
It will include sctp transport and endpoint's traversal.
There are some transport traversal functions for sctp_diag, we can also
use it for sctp_proc. cause they have the similar situation to traversal
transport.
v2->v3:
- rhashtable_walk_init need the parameter gfp, because of recent upstrem
update
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sctp_diag will dump some important details of sctp's assoc or ep, we use
sctp_info to describe them, sctp_get_sctp_info to get them, and export
it to sctp_diag.ko.
v2->v3:
- we will not use list_for_each_safe in sctp_get_sctp_info, cause
all the callers of it will use lock_sock.
- fix the holes in struct sctp_info with __reserved* field.
because sctp_diag is a new feature, and sctp_info is just for now,
it may be changed in the future.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SCTP already serializes access to rcvbuf through its sock lock:
sctp_recvmsg takes it right in the start and release at the end, while
rx path will also take the lock before doing any socket processing. On
sctp_rcv() it will check if there is an user using the socket and, if
there is, it will queue incoming packets to the backlog. The backlog
processing will do the same. Even timers will do such check and
re-schedule if an user is using the socket.
Simplifying this will allow us to remove sctp_skb_list_tail and get ride
of some expensive lockings. The lists that it is used on are also
mangled with functions like __skb_queue_tail and __skb_unlink in the
same context, like on sctp_ulpq_tail_event() and sctp_clear_pd().
sctp_close() will also purge those while using only the sock lock.
Therefore the lockings performed by sctp_skb_list_tail() are not
necessary. This patch removes this function and replaces its calls with
just skb_queue_splice_tail_init() instead.
The biggest gain is at sctp_ulpq_tail_event(), because the events always
contain a list, even if it's queueing a single skb and this was
triggering expensive calls to spin_lock_irqsave/_irqrestore for every
data chunk received.
As SCTP will deliver each data chunk on a corresponding recvmsg, the
more effective the change will be.
Before this patch, with chunks with 30 bytes:
netperf -t SCTP_STREAM -H 192.168.1.2 -cC -l 60 -- -m 30 -S 400000
400000 -s 400000 400000
on a 10Gbit link with 1500 MTU:
SCTP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.1 () port 0 AF_INET
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
425984 425984 30 60.00 137.45 7.34 7.36 52.504 52.608
With it:
SCTP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.1 () port 0 AF_INET
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
425984 425984 30 60.00 179.10 7.97 6.70 43.740 36.788
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for the newer version 1 of the HSR
networking standard. Version 0 is still default and the new
version has to be selected via iproute2.
Main changes are in the supervision frame handling and its
ethertype field.
Signed-off-by: Peter Heise <peter.heise@airbus.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Last known hot point during SYNFLOOD attack is the clearing
of rx_opt.saw_tstamp in tcp_rcv_state_process()
It is not needed for a listener, so we move it where it matters.
Performance while a SYNFLOOD hits a single listener socket
went from 5 Mpps to 6 Mpps on my test server (24 cores, 8 NIC RX queues)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When removing sk_refcnt manipulation on synflood, I missed that
using skb_set_owner_w() was racy, if sk->sk_wmem_alloc had already
transitioned to 0.
We should hold sk_refcnt instead, but this is a big deal under attack.
(Doing so increase performance from 3.2 Mpps to 3.8 Mpps only)
In this patch, I chose to not attach a socket to syncookies skb.
Performance is now 5 Mpps instead of 3.2 Mpps.
Following patch will remove last known false sharing in
tcp_rcv_state_process()
Fixes: 3b24d854cb ("tcp/dccp: do not touch listener sk_refcnt under synflood")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
According to the link FSM, a received traffic packet can take a link
from state ESTABLISHING to ESTABLISHED, but the link can still not be
fully set up in one atomic operation. This means that even if the the
very first packet on the link is a traffic packet with sequence number
1 (one), it has to be dropped and retransmitted.
This can be avoided if we let the mentioned packet be preceded by a
LINK_PROTOCOL/STATE message, which takes up the endpoint before the
arrival of the traffic.
We add this small feature in this commit.
This is a fully compatible change.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In some link establishment scenarios we see that packet #2 may be sent
out before packet #1, forcing the receiver to demand retransmission of
the missing packet. This is harmless, but may cause confusion among
people tracing the packet flow.
Since this is extremely easy to fix, we do so by adding en extra send
call to the bearer immediately after the link has come up.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function tipc_link_timeout() is unnecessary complex, and can
easily be made more readable.
We do that with this commit. The only functional change is that we
remove a redundant test for whether the broadcast link is up or not.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a link is down, it will continuously try to re-establish contact
with the peer by sending out a RESET or an ACTIVATE message at each
timeout interval. The default value for this interval is currently
375 ms. This is wasteful, and may become a problem in very large
clusters with dozens or hundreds of nodes being down simultaneously.
We now introduce a simple backoff algorithm for these cases. The
first five messages are sent at default rate; thereafter a message
is sent only each 16th timer interval.
This will cover the vast majority of link recycling cases, since the
endpoint starting last will transmit at the higher speed, and the link
should normally be established well be before the rate needs to be
reduced.
The only case where we will see a degradation of link re-establishment
times is when the endpoints remain intact, and a glitch in the
transmission media is causing the link reset. We will then experience
a worst-case re-establishing time of 6 seconds, something we deem
acceptable.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a link endpoint is going down locally, e.g., because its interface
is being stopped, it will spontaneously send out a RESET message to
its peer, informing it about this fact. This saves the peer from
detecting the failure via probing, and hence gives both speedier and
less resource consuming failure detection on the peer side.
According to the link FSM, a receiver of a RESET message, ignoring the
reason for it, must now consider the sender ready to come back up, and
starts periodically sending out ACTIVATE messages to the peer in order
to re-establish the link. Also, according to the FSM, the receiver of
an ACTIVATE message can now go directly to state ESTABLISHED and start
sending regular traffic packets. This is a well-proven and robust FSM.
However, in the case of a reboot, there is a small possibilty that link
endpoint on the rebooted node may have been re-created with a new bearer
identity between the moment it sent its (pre-boot) RESET and the moment
it receives the ACTIVATE from the peer. The new bearer identity cannot
be known by the peer according to this scenario, since traffic headers
don't convey such information. This is a problem, because both endpoints
need to know the correct value of the peer's bearer id at any moment in
time in order to be able to produce correct link events for their users.
The only way to guarantee this is to enforce a full setup message
exchange (RESET + ACTIVATE) even after the reboot, since those messages
carry the bearer idientity in their header.
In this commit we do this by introducing and setting a "stopping" bit in
the header of the spontaneously generated RESET messages, informing the
peer that the sender will not be immediately ready to re-establish the
link. A receiver seeing this bit must act as if this were a locally
detected connectivity failure, and hence has to go through a full two-
way setup message exchange before any link can be re-established.
Although never reported, this problem seems to have always been around.
This protocol addition is fully backwards compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch converts all helpers that can use ARG_PTR_TO_RAW_STACK as argument
type. For tc programs this is bpf_skb_load_bytes(), bpf_skb_get_tunnel_key(),
bpf_skb_get_tunnel_opt(). For tracing, this optimizes bpf_get_current_comm()
and bpf_probe_read(). The check in bpf_skb_load_bytes() for MAX_BPF_STACK can
also be removed since the verifier already makes sure we stay within bounds
on stack buffers.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds what's missing to properly support RPS and RFS on SCTP,
as some of it is already implemented in common calls.
Having support for RPS and RFS allows better scaling specially because
not all NICs support hashing SCTP headers.
Save the hash right when we dequeue a skb from inqueue so we do it only
once per skb instead of per chunk. New sockets will then inherit the
hash through sctp_copy_sock().
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skbs given to validate_xmit_skb() should not have a next
pointer anymore.
Also if a packet is dropped, increment dev->tx_dropped
__dev_queue_xmit() no longer has to change tx_dropped in this case.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull crypto fixes from Herbert Xu:
"This fixes an NFS regression caused by the skcipher/hash conversion in
sunrpc. It also fixes a build problem in certain configurations with
bcm63xx"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
hwrng: bcm63xx - fix device tree compilation
sunrpc: Fix skcipher/shash conversion
With the SO_REUSEPORT socket option, it is possible to create sockets
in the AF_INET and AF_INET6 domains which are bound to the same IPv4 address.
This is only possible with SO_REUSEPORT and when not using IPV6_V6ONLY on
the AF_INET6 sockets.
Prior to the commits referenced below, an incoming IPv4 packet would
always be routed to a socket of type AF_INET when this mixed-mode was used.
After those changes, the same packet would be routed to the most recently
bound socket (if this happened to be an AF_INET6 socket, it would
have an IPv4 mapped IPv6 address).
The change in behavior occurred because the recent SO_REUSEPORT optimizations
short-circuit the socket scoring logic as soon as they find a match. They
did not take into account the scoring logic that favors AF_INET sockets
over AF_INET6 sockets in the event of a tie.
To fix this problem, this patch changes the insertion order of AF_INET
and AF_INET6 addresses in the TCP and UDP socket lists when the sockets
have SO_REUSEPORT set. AF_INET sockets will be inserted at the head of the
list and AF_INET6 sockets with SO_REUSEPORT set will always be inserted at
the tail of the list. This will force AF_INET sockets to always be
considered first.
Fixes: e32ea7e747 ("soreuseport: fast reuseport UDP socket selection")
Fixes: 125e80b88687 ("soreuseport: fast reuseport TCP socket selection")
Reported-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
consume_skb() isn't for error cases that kfree_skb() is more proper
one. At this patch, it fixed tpacket_rcv() and packet_rcv() to be
consistent for error or non-error cases letting perf trace its event
properly.
Signed-off-by: Weongyo Jeong <weongyo.linux@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Until now, the requests sent to topology server are queued
to a workqueue by the generic server framework.
These messages are processed by worker threads and trigger the
registered callbacks.
To reduce latency on uniprocessor systems, explicit rescheduling
is performed using cond_resched() after MAX_RECV_MSG_COUNT(25)
messages.
This implementation on SMP systems leads to an subscriber refcnt
error as described below:
When a worker thread yields by calling cond_resched() in a SMP
system, a new worker is created on another CPU to process the
pending workitem. Sometimes the sleeping thread wakes up before
the new thread finishes execution.
This breaks the assumption on ordering and being single threaded.
The fault is more frequent when MAX_RECV_MSG_COUNT is lowered.
If the first thread was processing subscription create and the
second thread processing close(), the close request will free
the subscriber and the create request oops as follows:
[31.224137] WARNING: CPU: 2 PID: 266 at include/linux/kref.h:46 tipc_subscrb_rcv_cb+0x317/0x380 [tipc]
[31.228143] CPU: 2 PID: 266 Comm: kworker/u8:1 Not tainted 4.5.0+ #97
[31.228377] Workqueue: tipc_rcv tipc_recv_work [tipc]
[...]
[31.228377] Call Trace:
[31.228377] [<ffffffff812fbb6b>] dump_stack+0x4d/0x72
[31.228377] [<ffffffff8105a311>] __warn+0xd1/0xf0
[31.228377] [<ffffffff8105a3fd>] warn_slowpath_null+0x1d/0x20
[31.228377] [<ffffffffa0098067>] tipc_subscrb_rcv_cb+0x317/0x380 [tipc]
[31.228377] [<ffffffffa00a4984>] tipc_receive_from_sock+0xd4/0x130 [tipc]
[31.228377] [<ffffffffa00a439b>] tipc_recv_work+0x2b/0x50 [tipc]
[31.228377] [<ffffffff81071925>] process_one_work+0x145/0x3d0
[31.246554] ---[ end trace c3882c9baa05a4fd ]---
[31.248327] BUG: spinlock bad magic on CPU#2, kworker/u8:1/266
[31.249119] BUG: unable to handle kernel NULL pointer dereference at 0000000000000428
[31.249323] IP: [<ffffffff81099d0c>] spin_dump+0x5c/0xe0
[31.249323] PGD 0
[31.249323] Oops: 0000 [#1] SMP
In this commit, we
- rename tipc_conn_shutdown() to tipc_conn_release().
- move connection release callback execution from tipc_close_conn()
to a new function tipc_sock_release(), which is executed before
we free the connection.
Thus we release the subscriber during connection release procedure
rather than connection shutdown procedure.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a release_cb for UDPv6. It does a route lookup
and updates sk->sk_dst_cache if it is needed. It picks up the
left-over job from ip6_sk_update_pmtu() if the sk was owned
by user during the pmtu update.
It takes a rcu_read_lock to protect the __sk_dst_get() operations
because another thread may do ip6_dst_store() without taking the
sk lock (e.g. sendmsg).
Fixes: 45e4fd2668 ("ipv6: Only create RTF_CACHE routes after encountering pmtu exception")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reported-by: Wei Wang <weiwan@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a case in connected UDP socket such that
getsockopt(IPV6_MTU) will return a stale MTU value. The reproducible
sequence could be the following:
1. Create a connected UDP socket
2. Send some datagrams out
3. Receive a ICMPV6_PKT_TOOBIG
4. No new outgoing datagrams to trigger the sk_dst_check()
logic to update the sk->sk_dst_cache.
5. getsockopt(IPV6_MTU) returns the mtu from the invalid
sk->sk_dst_cache instead of the newly created RTF_CACHE clone.
This patch updates the sk->sk_dst_cache for a connected datagram sk
during pmtu-update code path.
Note that the sk->sk_v6_daddr is used to do the route lookup
instead of skb->data (i.e. iph). It is because a UDP socket can become
connected after sending out some datagrams in un-connected state. or
It can be connected multiple times to different destinations. Hence,
iph may not be related to where sk is currently connected to.
It is done under '!sock_owned_by_user(sk)' condition because
the user may make another ip6_datagram_connect() (i.e changing
the sk->sk_v6_daddr) while dst lookup is happening in the pmtu-update
code path.
For the sock_owned_by_user(sk) == true case, the next patch will
introduce a release_cb() which will update the sk->sk_dst_cache.
Test:
Server (Connected UDP Socket):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Route Details:
[root@arch-fb-vm1 ~]# ip -6 r show | egrep '2fac'
2fac::/64 dev eth0 proto kernel metric 256 pref medium
2fac:face::/64 via 2fac::face dev eth0 metric 1024 pref medium
A simple python code to create a connected UDP socket:
import socket
import errno
HOST = '2fac::1'
PORT = 8080
s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
s.bind((HOST, PORT))
s.connect(('2fac:face::face', 53))
print("connected")
while True:
try:
data = s.recv(1024)
except socket.error as se:
if se.errno == errno.EMSGSIZE:
pmtu = s.getsockopt(41, 24)
print("PMTU:%d" % pmtu)
break
s.close()
Python program output after getting a ICMPV6_PKT_TOOBIG:
[root@arch-fb-vm1 ~]# python2 ~/devshare/kernel/tasks/fib6/udp-connect-53-8080.py
connected
PMTU:1300
Cache routes after recieving TOOBIG:
[root@arch-fb-vm1 ~]# ip -6 r show table cache
2fac:face::face via 2fac::face dev eth0 metric 0
cache expires 463sec mtu 1300 pref medium
Client (Send the ICMPV6_PKT_TOOBIG):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
scapy is used to generate the TOOBIG message. Here is the scapy script I have
used:
>>> p=Ether(src='da:75:4d:36:ac:32', dst='52:54:00:12:34:66', type=0x86dd)/IPv6(src='2fac::face', dst='2fac::1')/ICMPv6PacketTooBig(mtu=1300)/IPv6(src='2fac::
1',dst='2fac:face::face', nh='UDP')/UDP(sport=8080,dport=53)
>>> sendp(p, iface='qemubr0')
Fixes: 45e4fd2668 ("ipv6: Only create RTF_CACHE routes after encountering pmtu exception")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reported-by: Wei Wang <weiwan@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch moves the route lookup and update codes for connected
datagram sk to a newly created function ip6_datagram_dst_update()
It will be reused during the pmtu update in the later patch.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move flowi6 init codes for connected datagram sk to a newly created
function ip6_datagram_flow_key_init().
Notes:
1. fl6_flowlabel is used instead of fl6.flowlabel in __ip6_datagram_connect
2. ipv6_addr_is_multicast(&fl6->daddr) is used instead of
(addr_type & IPV6_ADDR_MULTICAST) in ip6_datagram_flow_key_init()
This new function will be reused during pmtu update in the later patch.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for something I am referring to as GSO partial.
The basic idea is that we can support a broader range of devices for
segmentation if we use fixed outer headers and have the hardware only
really deal with segmenting the inner header. The idea behind the naming
is due to the fact that everything before csum_start will be fixed headers,
and everything after will be the region that is handled by hardware.
With the current implementation it allows us to add support for the
following GSO types with an inner TSO_MANGLEID or TSO6 offload:
NETIF_F_GSO_GRE
NETIF_F_GSO_GRE_CSUM
NETIF_F_GSO_IPIP
NETIF_F_GSO_SIT
NETIF_F_UDP_TUNNEL
NETIF_F_UDP_TUNNEL_CSUM
In the case of hardware that already supports tunneling we may be able to
extend this further to support TSO_TCPV4 without TSO_MANGLEID if the
hardware can support updating inner IPv4 headers.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch does two things.
First it allows TCP to aggregate TCP frames with a fixed IPv4 ID field. As
a result we should now be able to aggregate flows that were converted from
IPv6 to IPv4. In addition this allows us more flexibility for future
implementations of segmentation as we may be able to use a fixed IP ID when
segmenting the flow.
The second thing this does is that it places limitations on the outer IPv4
ID header in the case of tunneled frames. Specifically it forces the IP ID
to be incrementing by 1 unless the DF bit is set in the outer IPv4 header.
This way we can avoid creating overlapping series of IP IDs that could
possibly be fragmented if the frame goes through GRO and is then
resegmented via GSO.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for TSO using IPv4 headers with a fixed IP ID
field. This is meant to allow us to do a lossless GRO in the case of TCP
flows that use a fixed IP ID such as those that convert IPv6 header to IPv4
headers.
In addition I am adding a feature that for now I am referring to TSO with
IP ID mangling. Basically when this flag is enabled the device has the
option to either output the flow with incrementing IP IDs or with a fixed
IP ID regardless of what the original IP ID ordering was. This is useful
in cases where the DF bit is set and we do not care if the original IP ID
value is maintained.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The strings were missing for several of the GSO offloads that are
available. This patch provides the missing strings so that we can toggle
or query any of them via the ethtool command.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
User needs to monitor shared buffer occupancy. For that, he issues a
snapshot command in order to instruct hardware to catch current and
maximal occupancy values, and clear command in order to clear the
historical maximal values.
Also port-pool and tc-pool-bind command response messages are extended to
carry occupancy values.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Define userspace API and drivers API for configuration of shared
buffers. Four basic objects are defined:
shared buffer - attributes are size, number of pools and TCs
pool - chunk of sharedbuffer definition, it has some size and either
static or dynamic threshold
port pool threshold - to set per-port threshold for each pool
port tc threshold bind - to bind port and TC to specified pool
with threshold.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
netlink notifier family check to avoid the socket close DoS problem.
-----BEGIN PGP SIGNATURE-----
iQIbBAABCgAGBQJXD0YaAAoJEGt7eEactAAdJqUP9R0Ljud2jnH2JeNRv1qdEAXw
/PKERFsiZjAnU7kXXdOlm5mqYNJILNGbtPaKF5oi6mswzojwgaIQ+rjsHidX0SVs
1F1+VFxt0//pyfp6udEJ3gyx6wxGlncTdWUtxmzikgmMacn20ejT/wJFzF6y8MHO
YSyI8NGNxdTn4eBPLL1QdipsDFYpkS7Z2P8ZUquJmR+uMccjIJ4m4kB2NM8AGQgf
Z6KVjVP5dOvelWoEjSKaw2MyAAN/jRTxc9PFtEARoKMMtGMHBkn9qzin9wDjLA6k
h3M34Aj5i4R8AIFWJmrCw06rDhPT6j02l4g2IXpmZQqvbP7zewyXyRUkmxMiFvQd
n6zpNnQf3hDSNC8f7sBjNM5syi23j9ineJgtXdYtHqyqO0xY2qAVRYm7cSEXxFVk
WaxrWDO/wtGyKm1lTQYudAAsFARIgomra4QJd5FqFdIwX2nuZekuikOUsytUqnEV
ta+cwLJGLmof5jicrkoYD83974awJHcngLYMYLbkZC0kwWEZEiXX58D3o+FoF9ns
4D+qkLPRDAVIJI/W9j5yuIA6/Aj7Cy9kuxtKf5CoqxL35k47lu69aHEXEceB6fDD
M24G8nwEFcMimEWeYz4AARqaEJ08PWcA3/5dgf+N0XDCvQT95OutsZbqAdR2kqzF
ApCKvGmbx/sNeyZtjbs=
=FvAT
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2016-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
This has just the single fix from Dmitry Ivanov, adding the missing
netlink notifier family check to avoid the socket close DoS problem.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
A failure in validate_xmit_skb_list() triggered an unconditional call
to dev_requeue_skb with skb=NULL. This slowly grows the queue
discipline's qlen count until all traffic through the queue stops.
We take the optimistic approach and continue running the queue after a
failure since it is unknown if later packets also will fail in the
validate path.
Fixes: 55a93b3ea7 ("qdisc: validate skb without holding lock")
Signed-off-by: Lars Persson <larper@axis.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Because we miss to wipe the remainder of i->addr[] in packet_mc_add(),
pdiag_put_mclist() leaks uninitialized heap bytes via the
PACKET_DIAG_MCLIST netlink attribute.
Fix this by explicitly memset(0)ing the remaining bytes in i->addr[].
Fixes: eea68e2f1a ("packet: Report socket mclist info via diag module")
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After introduction of ndo_features_check(), we believe that very
specific checks for rare features should not be done in core
networking stack.
No driver uses gso_min_segs yet, so we revert this feature and save
few instructions per tx packet in fast path.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The meta_type_ops structures are never modified, so declare them as const.
Done with the help of Coccinelle.
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
For local routes that require a particular output interface we do not want
to cache the result. Caching the result causes incorrect behaviour when
there are multiple source addresses on the interface. The end result
being that if the intended recipient is waiting on that interface for the
packet he won't receive it because it will be delivered on the loopback
interface and the IP_PKTINFO ipi_ifindex will be set to the loopback
interface as well.
This can be tested by running a program such as "dhcp_release" which
attempts to inject a packet on a particular interface so that it is
received by another program on the same board. The receiving process
should see an IP_PKTINFO ipi_ifndex value of the source interface
(e.g., eth1) instead of the loopback interface (e.g., lo). The packet
will still appear on the loopback interface in tcpdump but the important
aspect is that the CMSG info is correct.
Sample dhcp_release command line:
dhcp_release eth1 192.168.204.222 02:11:33:22:44:66
Signed-off-by: Allain Legacy <allain.legacy@windriver.com>
Signed off-by: Chris Friesen <chris.friesen@windriver.com>
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently processing of multiple chunks in a single SCTP packet leads to
multiple calls to sk_data_ready, causing multiple wake up signals which
are costy and doesn't make it wake up any faster.
With this patch it will note that the wake up is pending and will do it
before leaving the state machine interpreter, latest place possible to
do it realiably and cleanly.
Note that sk_data_ready events are not dependent on asocs, unlike waking
up writers.
v2: series re-checked
v3: use local vars to cleanup the code, suggested by Jakub Sitnicki
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
f1705ec197 added the option to retain user configured addresses on an
admin down. A comment to one of the later revisions suggested using the
IFA_F_PERMANENT flag rather than adding a user_managed boolean to the
ifaddr struct. A side effect of this change is that link local and
loopback addresses are also retained which is not part of the objective
of f1705ec197. Add check to drop those addresses.
Fixes: f1705ec197 ("net: ipv6: Make address flushing on ifdown optional")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now when we change the attributes of bridge or br_port by netlink,
a relevant netlink notification will be sent, but if we change them
by ioctl or sysfs, no notification will be sent.
We should ensure that whenever those attributes change internally or from
sysfs/ioctl, that a netlink notification is sent out to listeners.
Also, NetworkManager will use this in the future to listen for out-of-band
bridge master attribute updates and incorporate them into the runtime
configuration.
This patch is used for ioctl.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now when we change the attributes of bridge or br_port by netlink,
a relevant netlink notification will be sent, but if we change them
by ioctl or sysfs, no notification will be sent.
We should ensure that whenever those attributes change internally or from
sysfs/ioctl, that a netlink notification is sent out to listeners.
Also, NetworkManager will use this in the future to listen for out-of-band
bridge master attribute updates and incorporate them into the runtime
configuration.
This patch is used for br_sysfs_if, and we also move br_ifinfo_notify out
of store_flag.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now when we change the attributes of bridge or br_port by netlink,
a relevant netlink notification will be sent, but if we change them
by ioctl or sysfs, no notification will be sent.
We should ensure that whenever those attributes change internally or from
sysfs/ioctl, that a netlink notification is sent out to listeners.
Also, NetworkManager will use this in the future to listen for out-of-band
bridge master attribute updates and incorporate them into the runtime
configuration.
This patch is used for br_sysfs_br. and we also need to remove some
rtnl_trylock in old functions so that we can call it in a common one.
For group_addr_store, we cannot make it use store_bridge_parm, because
it's not a string-to-long convert, we will add notification on it
individually.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are some repetitive codes in stp_state_store, we can remove
them by calling store_bridge_parm.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are some repetitive codes in forward_delay_store, we can remove
them by calling store_bridge_parm.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are some repetitive codes in flush_store, we can remove
them by calling store_bridge_parm, also, it would send rtnl notification
after we add it in store_bridge_parm in the following patches.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The original tokenized iid support implemented via f53adae4ea ("net: ipv6:
add tokenized interface identifier support") didn't allow for clearing a
device token as it was intended that this addressing mode was the only one
active for globally scoped IPv6 addresses. Later we relaxed that restriction
via 617fe29d45 ("net: ipv6: only invalidate previously tokenized addresses"),
and we should also allow for clearing tokens as there's no good reason why
it shouldn't be allowed.
Fixes: 617fe29d45 ("net: ipv6: only invalidate previously tokenized addresses")
Reported-by: Robin H. Johnson <robbat2@gentoo.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
sock_owned_by_user should not be used without socket lock held. It seems
to be a common practice to check .owned before lock reclassification, so
provide a little help to abstract this check away.
Cc: linux-cifs@vger.kernel.org
Cc: linux-bluetooth@vger.kernel.org
Cc: linux-nfs@vger.kernel.org
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
On udp sockets, recv cmsg IP_CMSG_CHECKSUM returns a checksum over
the packet payload. Since commit e6afc8ace6 pulled the headers,
taking skb->data as the start of transport header is incorrect. Use
the transport header pointer.
Also, when peeking at an offset from the start of the packet, only
return a checksum from the start of the peeked data. Note that the
cmsg does not subtract a tail checkum when reading truncated data.
Fixes: e6afc8ace6 ("udp: remove headers from UDP packets before queueing")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On udp sockets, ioctl SIOCINQ returns the payload size of the first
packet. Since commit e6afc8ace6 pulled the headers, the result is
incorrect when subtracting header length. Remove that operation.
Fixes: e6afc8ace6 ("udp: remove headers from UDP packets before queueing")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for your net tree. More
specifically, they are:
1) Fix missing filter table per-netns registration in arptables, from
Florian Westphal.
2) Resolve out of bound access when parsing TCP options in
nf_conntrack_tcp, patch from Jozsef Kadlecsik.
3) Prefer NFPROTO_BRIDGE extensions over NFPROTO_UNSPEC in ebtables,
this resolves conflict between xt_limit and ebt_limit, from Phil Sutter.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Many of these functions are called from control plane path. Move
ctnetlink_nlmsg_size() under CONFIG_NF_CONNTRACK_EVENTS to avoid a
compilation warning when CONFIG_NF_CONNTRACK_EVENTS=n.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The three variants use same copy&pasted code, condense this into a
helper and use that.
Make sure info.name is 0-terminated.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Since 'netfilter: x_tables: validate targets of jumps' change we
validate that the target aligns exactly with beginning of a rule,
so offset test is now redundant.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
commit 9e67d5a739
("[NETFILTER]: x_tables: remove obsolete overflow check") left the
compat parts alone, but we can kill it there as well.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This looks like refactoring, but its also a bug fix.
Problem is that the compat path (32bit iptables, 64bit kernel) lacks a few
sanity tests that are done in the normal path.
For example, we do not check for underflows and the base chain policies.
While its possible to also add such checks to the compat path, its more
copy&pastry, for instance we cannot reuse check_underflow() helper as
e->target_offset differs in the compat case.
Other problem is that it makes auditing for validation errors harder; two
places need to be checked and kept in sync.
At a high level 32 bit compat works like this:
1- initial pass over blob:
validate match/entry offsets, bounds checking
lookup all matches and targets
do bookkeeping wrt. size delta of 32/64bit structures
assign match/target.u.kernel pointer (points at kernel
implementation, needed to access ->compatsize etc.)
2- allocate memory according to the total bookkeeping size to
contain the translated ruleset
3- second pass over original blob:
for each entry, copy the 32bit representation to the newly allocated
memory. This also does any special match translations (e.g.
adjust 32bit to 64bit longs, etc).
4- check if ruleset is free of loops (chase all jumps)
5-first pass over translated blob:
call the checkentry function of all matches and targets.
The alternative implemented by this patch is to drop steps 3&4 from the
compat process, the translation is changed into an intermediate step
rather than a full 1:1 translate_table replacement.
In the 2nd pass (step #3), change the 64bit ruleset back to a kernel
representation, i.e. put() the kernel pointer and restore ->u.user.name .
This gets us a 64bit ruleset that is in the format generated by a 64bit
iptables userspace -- we can then use translate_table() to get the
'native' sanity checks.
This has two drawbacks:
1. we re-validate all the match and target entry structure sizes even
though compat translation is supposed to never generate bogus offsets.
2. we put and then re-lookup each match and target.
THe upside is that we get all sanity tests and ruleset validations
provided by the normal path and can remove some duplicated compat code.
iptables-restore time of autogenerated ruleset with 300k chains of form
-A CHAIN0001 -m limit --limit 1/s -j CHAIN0002
-A CHAIN0002 -m limit --limit 1/s -j CHAIN0003
shows no noticeable differences in restore times:
old: 0m30.796s
new: 0m31.521s
64bit: 0m25.674s
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Validate that all matches (if any) add up to the beginning of
the target and that each match covers at least the base structure size.
The compat path should be able to safely re-use the function
as the structures only differ in alignment; added a
BUILD_BUG_ON just in case we have an arch that adds padding as well.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We're currently asserting that targetoff + targetsize <= nextoff.
Extend it to also check that targetoff is >= sizeof(xt_entry).
Since this is generic code, add an argument pointing to the start of the
match/target, we can then derive the base structure size from the delta.
We also need the e->elems pointer in a followup change to validate matches.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We have targets and standard targets -- the latter carries a verdict.
The ip/ip6tables validation functions will access t->verdict for the
standard targets to fetch the jump offset or verdict for chainloop
detection, but this happens before the targets get checked/validated.
Thus we also need to check for verdict presence here, else t->verdict
can point right after a blob.
Spotted with UBSAN while testing malformed blobs.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
32bit rulesets have different layout and alignment requirements, so once
more integrity checks get added to xt_check_entry_offsets it will reject
well-formed 32bit rulesets.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The target size includes the size of the xt_entry_target struct.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Once we add more sanity testing to xt_check_entry_offsets it
becomes relvant if we're expecting a 32bit 'config_compat' blob
or a normal one.
Since we already have a lot of similar-named functions (check_entry,
compat_check_entry, find_and_check_entry, etc.) and the current
incarnation is short just fold its contents into the callers.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently arp/ip and ip6tables each implement a short helper to check that
the target offset is large enough to hold one xt_entry_target struct and
that t->u.target_size fits within the current rule.
Unfortunately these checks are not sufficient.
To avoid adding new tests to all of ip/ip6/arptables move the current
checks into a helper, then extend this helper in followup patches.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When we see a jump also check that the offset gets us to beginning of
a rule (an ipt_entry).
The extra overhead is negible, even with absurd cases.
300k custom rules, 300k jumps to 'next' user chain:
[ plus one jump from INPUT to first userchain ]:
Before:
real 0m24.874s
user 0m7.532s
sys 0m16.076s
After:
real 0m27.464s
user 0m7.436s
sys 0m18.840s
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Ben Hawkes says:
In the mark_source_chains function (net/ipv4/netfilter/ip_tables.c) it
is possible for a user-supplied ipt_entry structure to have a large
next_offset field. This field is not bounds checked prior to writing a
counter value at the supplied offset.
Base chains enforce absolute verdict.
User defined chains are supposed to end with an unconditional return,
xtables userspace adds them automatically.
But if such return is missing we will move to non-existent next rule.
Reported-by: Ben Hawkes <hawkes@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The phys in phys_port_mask suggests this mask is about PHYs. In fact,
it means physical ports. Rename to enabled_port_mask, indicating
external enabled ports of the switch, which is hopefully less
confusing.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The drivers now allocate their own memory for private usage. Remove
the allocation from the core code.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now the switch devices have a dev pointer, make use of it for allocating
the drivers private data structures using a devm_kzalloc().
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
By passing a device structure to the switch devices, it allows them
to use devm_* methods for resource management.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
all drivers - removing the duplicated enum ieee80211_band and
replacing it by enum nl80211_band. On top of that, just a small
documentation update.
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJXDp6hAAoJEGt7eEactAAd/xUQAJtKNwp9CLsx+QFx6lMoXX4x
r0XA8DFgLp1BflS9P05/g1m0NiQxm3YuRtpze/FdPglb6AAVjLcqksf+vTkU+Lng
p7rIkb/fQv5s5aoYPxNrD5zgwALVv9y5fI7rV7scj355iesCC0PmAP34own2Dihi
eBVSammsh5ZNTQKLBk8vXECb0UKWsDBMgp4uQc35Bpw8XSx5Nrtl5JI/hMcckte0
a/FQyQKjmjl3O/nRLn3kzGPv1OnRiJOMb5fMWB+Xm2cLtmKPHIErgVk2l/CMaiYj
sRJR8KaZQpQsyWiQU59UNpywlejy7Z1RsSWmuPhm0xTGzIF1wVIgHJSsRI/gNGD2
8Ey1P+RXkM8NVxrQr/0fis9XWyWfE8ne4tFsPiPOD3VmBiStIB9fAukJHLrvTmKU
JrkXCePUkfNY/PqJqlP/RONBcysI253/snVF49oZ7LMBZiGDPhdRcEEcCaS0tmMM
Qa+a78XvaH5xaKuMIDZ4qMdnMMcdv4g8G1DQeA1mb0EIGL1Gtu9BJsu9q8PqmjQU
1ZAf4MlWJWdYk+CtTNT4slSIQVKAN78s6j1HSB/bNcpWk9y93wBhJW0FdP7FtJ1I
pjJGIVcLU98FKdqi2jqPEezbDXXzOz0gNQDbqfJyM9/R7ijnJcaPllviaWjEg/O7
8jMBOg87Hn7kq7JJGpKA
=2xfe
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-davem-2016-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
To synchronize with Kalle, here's just a big change that affects
all drivers - removing the duplicated enum ieee80211_band and
replacing it by enum nl80211_band. On top of that, just a small
documentation update.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
We remove a couple of leftover fields in struct tipc_bearer. Those
were used by the old broadcast implementation, and are not needed
any longer. There is no functional changes in this commit.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case of link-layer specific handling for 802.15.4 we need to cast to
802.15.4 sepcific structures. Simple add this header when include the
6lowpan header.
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Reviewed-by: Stefan Schmidt<stefan@osg.samsung.com>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch adds the lowpan_is_ll function, which can be used to make a
special 6lowpan linklayer handling for a specific 6lowpan linklayer
type.
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Reviewed-by: Stefan Schmidt<stefan@osg.samsung.com>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This function will be use in later functionality in other branches than
generic 6lowpan, so we move it to the global 6lowpan header.
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Reviewed-by: Stefan Schmidt<stefan@osg.samsung.com>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch removes unnecessary zero data for a stack variable.
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch adds a lowpan prefix to each functions which doesn't have
such prefix currently.
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch moves the 802.15.4 link layer specific structures to generic
6lowpan. This is necessary for special 802.15.4 6lowpan handling in
6lowpan generic layer.
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch changes the naming for interface private data for lowpan
intefaces. The current private data scheme is:
-------------------------------------------------
| 6LoWPAN Generic | LinkLayer 6LoWPAN |
-------------------------------------------------
the current naming schemes are:
- 6LoWPAN Generic:
- lowpan_priv
- LinkLayer 6LoWPAN:
- BTLE
- lowpan_dev
- 802.15.4:
- lowpan_dev_info
the new naming scheme with this patch will be:
- 6LoWPAN Generic:
- lowpan_dev
- LinkLayer 6LoWPAN:
- BTLE
- lowpan_btle_dev
- 802.15.4:
- lowpan_802154_dev
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Reviewed-by: Stefan Schmidt<stefan@osg.samsung.com>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The short address is unique in combination with the panid. This patch
will add the panid for generating an ieee802154 address hash.
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The generation of autoconfigured IPv6 link-local addresses starts with a
notification on interface up.
To handle autoconfiguration according to RFC 4944 requires pan id and
short address to generate an autoconfigured link-local address. This
patch will avoid changing of these link-layer address configuration
while lowpan interface is up.
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains the first batch of Netfilter updates for
your net-next tree.
1) Define pr_fmt() in nf_conntrack, from Weongyo Jeong.
2) Define and register netfilter's afinfo for the bridge family,
this comes in preparation for native nfqueue's bridge for nft,
from Stephane Bryant.
3) Add new attributes to store layer 2 and VLAN headers to nfqueue,
also from Stephane Bryant.
4) Parse new NFQA_VLAN and NFQA_L2HDR nfqueue netlink attributes
coming from userspace, from Stephane Bryant.
5) Use net->ipv6.devconf_all->hop_limit instead of hardcoded hop_limit
in IPv6 SYNPROXY, from Liping Zhang.
6) Remove unnecessary check for dst == NULL in nf_reject_ipv6,
from Haishuang Yan.
7) Deinline ctnetlink event report functions, from Florian Westphal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
If a requested extension exists as module and is not loaded,
ebt_check_match() might accidentally use an NFPROTO_UNSPEC one with same
name and fail.
Reproduced with limit match: Given xt_limit and ebt_limit both built as
module, the following would fail:
modprobe xt_limit
ebtables -I INPUT --limit 1/s -j ACCEPT
The fix is to make ebt_check_match() distrust a found NFPROTO_UNSPEC
extension and retry after requesting an appropriate module.
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Phil Sutter <phil@nwl.cc>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Not performance critical, it is only invoked when an expectation is
added/destroyed.
While at it, kill unused nf_ct_expect_event() wrapper.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Way too large; move it to nf_conntrack_ecache.c.
Reduces total object size by 1216 byte on my machine.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Johan Hedberg says:
====================
pull request: bluetooth-next 2016-04-12
Here's a set of Bluetooth & 802.15.4 patches intended for the 4.7 kernel:
- Fix for race condition in vhci driver
- Memory leak fix for ieee802154/adf7242 driver
- Improvements to deal with single-mode (LE-only) Bluetooth controllers
- Fix for allowing the BT_SECURITY_FIPS security level
- New BCM2E71 ACPI ID
- NULL pointer dereference fix fox hci_ldisc driver
Let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This enum is already perfectly aliased to enum nl80211_band, and
the only reason for it is that we get IEEE80211_NUM_BANDS out of
it. There's no really good reason to not declare the number of
bands in nl80211 though, so do that and remove the cfg80211 one.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
A non-privileged user can create a netlink socket with the same port_id as
used by an existing open nl80211 netlink socket (e.g. as used by a hostapd
process) with a different protocol number.
Closing this socket will then lead to the notification going to nl80211's
socket release notification handler, and possibly cause an action such as
removing a virtual interface.
Fix this issue by checking that the netlink protocol is NETLINK_GENERIC.
Since generic netlink has no notifier chain of its own, we can't fix the
problem more generically.
Fixes: 026331c4d9 ("cfg80211/mac80211: allow registering for and sending action frames")
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Ivanov <dima@ubnt.com>
[rewrite commit message]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add a facility whereby proposed new links to be added to a keyring can be
vetted, permitting them to be rejected if necessary. This can be used to
block public keys from which the signature cannot be verified or for which
the signature verification fails. It could also be used to provide
blacklisting.
This affects operations like add_key(), KEYCTL_LINK and KEYCTL_INSTANTIATE.
To this end:
(1) A function pointer is added to the key struct that, if set, points to
the vetting function. This is called as:
int (*restrict_link)(struct key *keyring,
const struct key_type *key_type,
unsigned long key_flags,
const union key_payload *key_payload),
where 'keyring' will be the keyring being added to, key_type and
key_payload will describe the key being added and key_flags[*] can be
AND'ed with KEY_FLAG_TRUSTED.
[*] This parameter will be removed in a later patch when
KEY_FLAG_TRUSTED is removed.
The function should return 0 to allow the link to take place or an
error (typically -ENOKEY, -ENOPKG or -EKEYREJECTED) to reject the
link.
The pointer should not be set directly, but rather should be set
through keyring_alloc().
Note that if called during add_key(), preparse is called before this
method, but a key isn't actually allocated until after this function
is called.
(2) KEY_ALLOC_BYPASS_RESTRICTION is added. This can be passed to
key_create_or_update() or key_instantiate_and_link() to bypass the
restriction check.
(3) KEY_FLAG_TRUSTED_ONLY is removed. The entire contents of a keyring
with this restriction emplaced can be considered 'trustworthy' by
virtue of being in the keyring when that keyring is consulted.
(4) key_alloc() and keyring_alloc() take an extra argument that will be
used to set restrict_link in the new key. This ensures that the
pointer is set before the key is published, thus preventing a window
of unrestrictedness. Normally this argument will be NULL.
(5) As a temporary affair, keyring_restrict_trusted_only() is added. It
should be passed to keyring_alloc() as the extra argument instead of
setting KEY_FLAG_TRUSTED_ONLY on a keyring. This will be replaced in
a later patch with functions that look in the appropriate places for
authoritative keys.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
ifupdown2 found a kernel bug with IPv6 routes and movement from the main
table to the VRF table. Sequence of events:
Create the interface and add addresses:
ip link add dev eth4.105 link eth4 type vlan id 105
ip addr add dev eth4.105 8.105.105.10/24
ip -6 addr add dev eth4.105 2008:105:105::10/64
At this point IPv6 has inserted a prefix route in the main table even
though the interface is 'down'. From there the VRF device is created:
ip link add dev vrf105 type vrf table 105
ip addr add dev vrf105 9.9.105.10/32
ip -6 addr add dev vrf105 2000:9:105::10/128
ip link set vrf105 up
Then the interface is enslaved, while still in the 'down' state:
ip link set dev eth4.105 master vrf105
Since the device is down the VRF driver cycling the device does not
send the NETDEV_UP and NETDEV_DOWN but rather the NETDEV_CHANGE event
which does not flush the routes inserted prior.
When the link is brought up
ip link set dev eth4.105 up
the prefix route is added in the VRF table, but does not remove
the route from the main table.
Fix by handling the NETDEV_CHANGEUPPER event similar what was implemented
for IPv4 in 7f49e7a38b ("net: Flush local routes when device changes vrf
association")
Fixes: 35402e3136 ("net: Add IPv6 support to VRF device")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivek reported a kernel exception deleting a VRF with an active
connection through it. The root cause is that the socket has a cached
reference to a dst that is destroyed. Converting the dst_destroy to
dst_release and letting proper reference counting kick in does not
work as the dst has a reference to the device which needs to be released
as well.
I talked to Hannes about this at netdev and he pointed out the ipv4 and
ipv6 dst handling has dst_ifdown for just this scenario. Rather than
continuing with the reinvented dst wheel in VRF just remove it and
leverage the ipv4 and ipv6 versions.
Fixes: 193125dbd8 ("net: Introduce VRF device driver")
Fixes: 35402e3136 ("net: Add IPv6 support to VRF device")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create a null security type for security index 0 and get rid of all
conditional calls to the security operations. We expect normally to be
using security, so this should be of little negative impact.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Absorb the rxkad security module into the af_rxrpc module so that there's
only one module file. This avoids a circular dependency whereby rxkad pins
af_rxrpc and cached connections pin rxkad but can't be manually evicted
(they will expire eventually and cease pinning).
With this change, af_rxrpc can just be unloaded, despite having cached
connections.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Don't assume transport address family and size when using the peer address
to send a packet. Instead, use the start of the transport address rather
than any particular element of the union and use the transport address
length noted inside the sockaddr_rxrpc struct.
This will be necessary when IPv6 support is introduced.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Don't pass gfp around in incoming call handling functions, but rather hard
code it at the points where we actually need it since the value comes from
within the rxrpc driver and is always the same.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the rxrpc_connection and rxrpc_call structs, there's one field to hold
the abort code, no matter whether that value was generated locally to be
sent or was received from the peer via an abort packet.
Split the abort code fields in two for cleanliness sake and add an error
field to hold the Linux error number to the rxrpc_call struct too
(sometimes this is generated in a context where we can't return it to
userspace directly).
Furthermore, add a skb mark to indicate a packet that caused a local abort
to be generated so that recvmsg() can pick up the correct abort code. A
future addition will need to be to indicate to userspace the difference
between aborts via a control message.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Static arrays of strings should be const char *const[].
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move some miscellaneous bits out into their own file to make it easier to
split the call handling.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Disable a debugging statement that has been left enabled
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit e6afc8ace6 modified the udp receive path by pulling the udp
header before queuing an skbuff onto the receive queue.
Rxrpc also calls skb_recv_datagram to dequeue an skb from a udp
socket. Modify this receive path to also no longer expect udp
headers.
Fixes: e6afc8ace6 ("udp: remove headers from UDP packets before queueing")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Tested-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit e6afc8ace6 modified the udp receive path by pulling the udp
header before queuing an skbuff onto the receive queue.
Sunrpc also calls skb_recv_datagram to dequeue an skb from a udp
socket. Modify this receive path to also no longer expect udp
headers.
Fixes: e6afc8ace6 ("udp: remove headers from UDP packets before queueing")
Reported-by: Franklin S Cooper Jr. <fcooper@ti.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Tested-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a peer node becomes unavailable, in addition to removing the
nametable entries from this node we also need to purge all deferred
updates associated with this node.
Signed-off-by: Erik Hugne <erik.hugne@gmail.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nametable updates received from the network that cannot be applied
immediately are placed on a defer queue. This queue is global to the
TIPC module, which might cause problems when using TIPC in containers.
To prevent nametable updates from escaping into the wrong namespace,
we make the queue pernet instead.
Signed-off-by: Erik Hugne <erik.hugne@gmail.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Multipath route lookups should consider knowledge about next hops and not
select a hop that is known to be failed.
Example:
[h2] [h3] 15.0.0.5
| |
3| 3|
[SP1] [SP2]--+
1 2 1 2
| | /-------------+ |
| \ / |
| X |
| / \ |
| / \---------------\ |
1 2 1 2
12.0.0.2 [TOR1] 3-----------------3 [TOR2] 12.0.0.3
4 4
\ /
\ /
\ /
-------| |-----/
1 2
[TOR3]
3|
|
[h1] 12.0.0.1
host h1 with IP 12.0.0.1 has 2 paths to host h3 at 15.0.0.5:
root@h1:~# ip ro ls
...
12.0.0.0/24 dev swp1 proto kernel scope link src 12.0.0.1
15.0.0.0/16
nexthop via 12.0.0.2 dev swp1 weight 1
nexthop via 12.0.0.3 dev swp1 weight 1
...
If the link between tor3 and tor1 is down and the link between tor1
and tor2 then tor1 is effectively cut-off from h1. Yet the route lookups
in h1 are alternating between the 2 routes: ping 15.0.0.5 gets one and
ssh 15.0.0.5 gets the other. Connections that attempt to use the
12.0.0.2 nexthop fail since that neighbor is not reachable:
root@h1:~# ip neigh show
...
12.0.0.3 dev swp1 lladdr 00:02:00:00:00:1b REACHABLE
12.0.0.2 dev swp1 FAILED
...
The failed path can be avoided by considering known neighbor information
when selecting next hops. If the neighbor lookup fails we have no
knowledge about the nexthop, so give it a shot. If there is an entry
then only select the nexthop if the state is sane. This is similar to
what fib_detect_death does.
To maintain backward compatibility use of the neighbor information is
based on a new sysctl, fib_multipath_use_neigh.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
All existing users of NETLINK_URELEASE use it to clean up resources that
were previously allocated to a socket via some command. As a result, no
users require getting this notification for unbound sockets.
Sending it for unbound sockets, however, is a problem because any user
(including unprivileged users) can create a socket that uses the same ID
as an existing socket. Binding this new socket will fail, but if the
NETLINK_URELEASE notification is generated for such sockets, the users
thereof will be tricked into thinking the socket that they allocated the
resources for is closed.
In the nl80211 case, this will cause destruction of virtual interfaces
that still belong to an existing hostapd process; this is the case that
Dmitry noticed. In the NFC case, it will cause a poll abort. In the case
of netlink log/queue it will cause them to stop reporting events, as if
NFULNL_CFG_CMD_UNBIND/NFQNL_CFG_CMD_UNBIND had been called.
Fix this problem by checking that the socket is bound before generating
the NETLINK_URELEASE notification.
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Ivanov <dima@ubnt.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently on high rate SCTP streams the heartbeat timer refresh can
consume quite a lot of resources as timer updates are costly and it
contains a random factor, which a) is also costly and b) invalidates
mod_timer() optimization for not editing a timer to the same value.
It may even cause the timer to be slightly advanced, for no good reason.
As suggested by David Laight this patch now removes this timer update
from hot path by leaving the timer on and re-evaluating upon its
expiration if the heartbeat is still needed or not, similarly to what is
done for TCP. If it's not needed anymore the timer is re-scheduled to
the new timeout, considering the time already elapsed.
For this, we now record the last tx timestamp per transport, updated in
the same spots as hb timer was restarted on tx. Also split up
sctp_transport_reset_timers into sctp_transport_reset_t3_rtx and
sctp_transport_reset_hb_timer, so we can re-arm T3 without re-arming the
heartbeat one.
On loopback with MTU of 65535 and data chunks with 1636, so that we
have a considerable amount of chunks without stressing system calls,
netperf -t SCTP_STREAM -l 30, perf looked like this before:
Samples: 103K of event 'cpu-clock', Event count (approx.): 25833000000
Overhead Command Shared Object Symbol
+ 6,15% netperf [kernel.vmlinux] [k] copy_user_enhanced_fast_string
- 5,43% netperf [kernel.vmlinux] [k] _raw_write_unlock_irqrestore
- _raw_write_unlock_irqrestore
- 96,54% _raw_spin_unlock_irqrestore
- 36,14% mod_timer
+ 97,24% sctp_transport_reset_timers
+ 2,76% sctp_do_sm
+ 33,65% __wake_up_sync_key
+ 28,77% sctp_ulpq_tail_event
+ 1,40% del_timer
- 1,84% mod_timer
+ 99,03% sctp_transport_reset_timers
+ 0,97% sctp_do_sm
+ 1,50% sctp_ulpq_tail_event
And after this patch, now with netperf -l 60:
Samples: 230K of event 'cpu-clock', Event count (approx.): 57707250000
Overhead Command Shared Object Symbol
+ 5,65% netperf [kernel.vmlinux] [k] memcpy_erms
+ 5,59% netperf [kernel.vmlinux] [k] copy_user_enhanced_fast_string
- 5,05% netperf [kernel.vmlinux] [k] _raw_spin_unlock_irqrestore
- _raw_spin_unlock_irqrestore
+ 49,89% __wake_up_sync_key
+ 45,68% sctp_ulpq_tail_event
- 2,85% mod_timer
+ 76,51% sctp_transport_reset_t3_rtx
+ 23,49% sctp_do_sm
+ 1,55% del_timer
+ 2,50% netperf [sctp] [k] sctp_datamsg_from_user
+ 2,26% netperf [sctp] [k] sctp_sendmsg
Throughput-wise, from 6800mbps without the patch to 7050mbps with it,
~3.7%.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A stupid refactoring bug in inet6_lookup_listener() needs to be fixed
in order to get proper SO_REUSEPORT behavior.
Fixes: 3b24d854cb ("tcp/dccp: do not touch listener sk_refcnt under synflood")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Stale SKB data pointer access across pskb_may_pull() calls in L2TP,
from Haishuang Yan.
2) Fix multicast frame handling in mac80211 AP code, from Felix
Fietkau.
3) mac80211 station hashtable insert errors not handled properly, fix
from Johannes Berg.
4) Fix TX descriptor count limit handling in e1000, from Alexander
Duyck.
5) Revert a buggy netdev refcount fix in netpoll, from Bjorn Helgaas.
6) Must assign rtnl_link_ops of the device before registering it, fix
in ip6_tunnel from Thadeu Lima de Souza Cascardo.
7) Memory leak fix in tc action net exit, from WANG Cong.
8) Add missing AF_KCM entries to name tables, from Dexuan Cui.
9) Fix regression in GRE handling of csums wrt. FOU, from Alexander
Duyck.
10) Fix memory allocation alignment and congestion map corruption in
RDS, from Shamir Rabinovitch.
11) Fix default qdisc regression in tuntap driver, from Jason Wang.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (44 commits)
bridge, netem: mark mailing lists as moderated
tuntap: restore default qdisc
mpls: find_outdev: check for err ptr in addition to NULL check
ipv6: Count in extension headers in skb->network_header
RDS: fix congestion map corruption for PAGE_SIZE > 4k
RDS: memory allocated must be align to 8
GRE: Disable segmentation offloads w/ CSUM and we are encapsulated via FOU
net: add the AF_KCM entries to family name tables
MAINTAINERS: intel-wired-lan list is moderated
lib/test_bpf: Add additional BPF_ADD tests
lib/test_bpf: Add test to check for result of 32-bit add that overflows
lib/test_bpf: Add tests for unsigned BPF_JGT
lib/test_bpf: Fix JMP_JSET tests
VSOCK: Detach QP check should filter out non matching QPs.
stmmac: fix adjust link call in case of a switch is attached
af_packet: tone down the Tx-ring unsupported spew.
net_sched: fix a memory leak in tc action
samples/bpf: Enable powerpc support
samples/bpf: Use llc in PATH, rather than a hardcoded value
samples/bpf: Fix build breakage with map_perf_test_user.c
...
The switchdev design implies that a software error should not happen in
the commit phase since it must have been previously reported in the
prepare phase. If an hardware error occurs during the commit phase,
there is nothing switchdev can do about it.
The DSA layer separates port_vlan_prepare and port_vlan_add for
simplicity and convenience. If an hardware error occurs during the
commit phase, there is no need to report it outside the driver itself.
Make the DSA port_vlan_add routine return void for explicitness.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The switchdev design implies that a software error should not happen in
the commit phase since it must have been previously reported in the
prepare phase. If an hardware error occurs during the commit phase,
there is nothing switchdev can do about it.
The DSA layer separates port_fdb_prepare and port_fdb_add for simplicity
and convenience. If an hardware error occurs during the commit phase,
there is no need to report it outside the DSA driver itself.
Make the DSA port_fdb_add routine return void for explicitness.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The DSA layer doesn't care about the return code of the port_stp_update
routine, so make it void in the layer and the DSA drivers.
Replace the useless dsa_slave_stp_update function with a
dsa_slave_stp_state function used to reply to the switchdev
SWITCHDEV_ATTR_ID_PORT_STP_STATE attribute.
In the meantime, rename port_stp_update to port_stp_state_set to
explicit the state change.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* Bob's mesh mode rhashtable conversion, this includes
the rhashtable API change for allocation flags
* BSSID scan, connect() command reassoc support (Jouni)
* fast (optimised data only) and support for RSS in mac80211 (myself)
* various smaller changes
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJXBQ4GAAoJEGt7eEactAAdWiMP/ibaP3I79NDc0s7wCDA+KRkm
hx0Qx4a0wwm7lDFlnGBjY6yKr+XFDliCvdGX7XGpLSsTioNg7eXPpwx5FQoj6RiV
8+5RKE9fTguN9ofUzqAwHd9sVOaxvdlXbKfb/N93Gzjpw/meYk58wXdF7Almkroa
ukgJeMzIlIh+6D96zFEA+Ofzp5chwh+x2Dn0wXutEe9P9fOERA859veAvx65b+Ql
IRGTqyuY5B/wcbkr4o+DWQwgrdt7Vop9nYVPNWtMHm2JTzfuCSaQ2cD9TnVAK/bg
/vtqC46KKNLyBRGexAPqdftY9PWcfipgE+n7k+Et4iGSmNm7Z3dEyewgXmqli7XJ
X8Uiaq+N6Fpe06DVSU7aSRt8NLV64A44jXSfKRI9U2POUqKMn/PMdm8bhPW8qCdM
ra6myWpQGHWK9e0TQQdShq0NQKGxCZAiSRiiIrbbvXl1CwXxkPCG39wAC3Sh1tEN
ou4lGraeywGnTjaq+mwLEtHLoug8Y2x+Fz+Ze4Cu2enXxna9lp4lr+rFlc+2+0Er
o9oPxkTk8krZGIj9M6PNc5W+InMwchaFX3076n67hnFHzFRlOQzkfffbPYlhKJDQ
f8c9JiNZIoX/fD1TAKsrdO1+EKm/xo7w7pLgbMwQal8Jr88SkITDg0i3oXc56vNQ
ZK2gUzwvrD/jh0AUyDfN
=sj7y
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-davem-2016-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
For the 4.7 cycle, we have a number of changes:
* Bob's mesh mode rhashtable conversion, this includes
the rhashtable API change for allocation flags
* BSSID scan, connect() command reassoc support (Jouni)
* fast (optimised data only) and support for RSS in mac80211 (myself)
* various smaller changes
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
* TDLS fixes from Arik and Ilan
* rhashtable fixes from Ben and myself
* documentation fixes from Luis
* U-APSD fixes from Emmanuel
* a TXQ fix from Felix
* and a compiler warning suppression from Jeff
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJXBQyAAAoJEGt7eEactAAdymsP/i2zU+VQFpkB8RG+nn/AYogY
x2RXgKjCWOs6FwUcu+VHMx0whKuMADjqbMABdlsGGK62Xa5aYYcObvY+CgCUAI+m
unV7kYDIBUHudiTpxXgZYUvylhIvW37VYjc6BDoaq4Jc1rz/L69zSrNHmoNiQv+Y
113T0Ft5EEmEO1LP4s2GLMZTPqwgi2FnaP6UYFdTr2/ZfaRHlj2xRRG62WiT/q1x
DMT2KWHvETCftpK3GwwkMSr0Au8CVV1soiQOoioTPRaevYbBFVi60GVXQeDlvFEV
PqVCOEfsSvsw84phfHrW1bOxBeVsNYHbY/T4eVlC0zssUzz6KNH5jAfpyla1p0lP
WniSqAaWxMcUYWCEBiOLa5LV2XVpXOuTpI84xcgc/BmprgzNyLgDAiCDtehpxALf
Qmhc/rPR5BbLhTNY8z5qANG6mhQCHCo+52ypvBLMhZoajkPjgyBabwoqaRGje2ub
vgzbAfqEguJmCAszw04KZ2UHInBcCDAZ5aOKiinawWvpftkDN0IJmO/HW5vnh4xV
kawpo1eh1JzDcEyYfSjySHHFmR/5qnaDzaPL2cJthcOY/fGywibJHCQmoDPyH5jz
Bkk0F0rMEcdQWs9pJLIMMzkA7BAlxYLYip0J9QImHL77sWK0QwUDCoryrgD6lL1D
v2V31g1TZwPF2Noe9Rk7
=r3Hr
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2016-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
For the current RC series, we have the following fixes:
* TDLS fixes from Arik and Ilan
* rhashtable fixes from Ben and myself
* documentation fixes from Luis
* U-APSD fixes from Emmanuel
* a TXQ fix from Felix
* and a compiler warning suppression from Jeff
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Ptr to devlink structure can be easily obtained from
devlink_port->devlink. So share user_ptr[0] pointer for both and leave
user_ptr[1] free for other users.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As we rely on caller zeroing or correctly set the struct before the call,
this implicit type set is either no-op (DEVLINK_PORT_TYPE_NOTSET is 0)
or it rewrites wanted value. So remove this.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes handling in case of link-local address compression. A
IPv6 link-local address is defined as fe80::/10 prefix which is also
what ipv6_addr_type checks for link-local addresses.
But IPHC compression for link-local addresses are for fe80::/64 types
only. This patch adds additional checks for zero padded bits in case of
link-local address compression to match on a fe80::/64 address only.
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
In case of buggy controllers send advertising packet types that we
don't know of we should simply ignore them instead of trying to react
to them in some (potentially wrong) way.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
If we're dealing with a single-mode controller or BR/EDR is disable
for a dual-mode one, the NO_BREDR flag needs to be unconditionally
present in the advertising data. This patch moves it out from behind
an extra condition to be always set in the create_instance_adv_data()
function if BR/EDR is disabled.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
When sending a UDPv6 message longer than MTU, account for the length
of fragmentable IPv6 extension headers in skb->network_header offset.
Same as we do in alloc_new_skb path in __ip6_append_data().
This ensures that later on __ip6_make_skb() will make space in
headroom for fragmentable extension headers:
/* move skb->data to ip header from ext header */
if (skb->data < skb_network_header(skb))
__skb_pull(skb, skb_network_offset(skb));
Prevents a splat due to skb_under_panic:
skbuff: skb_under_panic: text:ffffffff8143397b len:2126 put:14 \
head:ffff880005bacf50 data:ffff880005bacf4a tail:0x48 end:0xc0 dev:lo
------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:104!
invalid opcode: 0000 [#1] KASAN
CPU: 0 PID: 160 Comm: reproducer Not tainted 4.6.0-rc2 #65
[...]
Call Trace:
[<ffffffff813eb7b9>] skb_push+0x79/0x80
[<ffffffff8143397b>] eth_header+0x2b/0x100
[<ffffffff8141e0d0>] neigh_resolve_output+0x210/0x310
[<ffffffff814eab77>] ip6_finish_output2+0x4a7/0x7c0
[<ffffffff814efe3a>] ip6_output+0x16a/0x280
[<ffffffff815440c1>] ip6_local_out+0xb1/0xf0
[<ffffffff814f1115>] ip6_send_skb+0x45/0xd0
[<ffffffff81518836>] udp_v6_send_skb+0x246/0x5d0
[<ffffffff8151985e>] udpv6_sendmsg+0xa6e/0x1090
[...]
Reported-by: Ji Jianwen <jiji@redhat.com>
Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Resetting a bearer/interface, with the consequence of resetting all its
pertaining links, is not an atomic action. This becomes particularly
evident in very large clusters, where a lot of traffic may happen on the
remaining links while we are busy shutting them down. In extreme cases,
we may even see links being re-created and re-established before we are
finished with the job.
To solve this, we now introduce a solution where we temporarily detach
the bearer from the interface when the bearer is reset. This inhibits
all packet reception, while sending still is possible. For the latter,
we use the fact that the device's user pointer now is zero to filter out
which packets can be sent during this situation; i.e., outgoing RESET
messages only. This filtering serves to speed up the neighbors'
detection of the loss event, and saves us from unnecessary probing.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When enabling a bearer we create a 'neigbor discoverer' instance by
calling the function tipc_disc_create() before the bearer is actually
registered in the list of enabled bearers. Because of this, the very
first discovery broadcast message, created by the mentioned function,
is lost, since it cannot find any valid bearer to use. Furthermore,
the used send function, tipc_bearer_xmit_skb() does not free the given
buffer when it cannot find a bearer, resulting in the leak of exactly
one send buffer each time a bearer is enabled.
This commit fixes this problem by introducing two changes:
1) Instead of attemting to send the discovery message directly, we let
tipc_disc_create() return the discovery buffer to the calling
function, tipc_enable_bearer(), so that the latter can send it
when the enabling sequence is finished.
2) In tipc_bearer_xmit_skb(), as well as in the two other transmit
functions at the bearer layer, we now free the indicated buffer or
buffer chain when a valid bearer cannot be found.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When PAGE_SIZE > 4k single page can contain 2 RDS fragments. If
'rds_ib_cong_recv' ignore the RDS fragment offset in to the page it
then read the data fragment as far congestion map update and lead to
corruption of the RDS connection far congestion map.
Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix issue in 'rds_ib_cong_recv' when accessing unaligned memory
allocated by 'rds_page_remainder_alloc' using uint64_t pointer.
Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes an issue I found in which we were dropping frames if we
had enabled checksums on GRE headers that were encapsulated by either FOU
or GUE. Without this patch I was barely able to get 1 Gb/s of throughput.
With this patch applied I am now at least getting around 6 Gb/s.
The issue is due to the fact that with FOU or GUE applied we do not provide
a transport offset pointing to the GRE header, nor do we offload it in
software as the GRE header is completely skipped by GSO and treated like a
VXLAN or GENEVE type header. As such we need to prevent the stack from
generating it and also prevent GRE from generating it via any interface we
create.
Fixes: c3483384ee ("gro: Allow tunnel stacking in the case of FOU/GUE")
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that the UDP encapsulation GRO functions have been moved to the UDP
socket we not longer need the udp_offload insfrastructure so removing it.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adapt gue_gro_receive, gue_gro_complete to take a socket argument.
Don't set udp_offloads any more.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add gro_receive and gro_complete to struct udp_tunnel_sock_cfg.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds GRO functions (gro_receive and gro_complete) to UDP
sockets. udp_gro_receive is changed to perform socket lookup on a
packet. If a socket is found the related GRO functions are called.
This features obsoletes using UDP offload infrastructure for GRO
(udp_offload). This has the advantage of not being limited to provide
offload on a per port basis, GRO is now applied to whatever individual
UDP sockets are bound to. This also allows the possbility of
"application defined GRO"-- that is we can attach something like
a BPF program to a UDP socket to perfrom GRO on an application
layer protocol.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add externally visible functions to lookup a UDP socket by skb. This
will be used for GRO in UDP sockets. These functions also check
if skb->dst is set, and if it is not skb->dev is used to get dev_net.
This allows calling lookup functions before dst has been set on the
skbuff.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 5a5abb1fa3 ("tun, bpf: fix suspicious RCU usage
in tun_{attach, detach}_filter") and replaces it to use lock_sock around
sk_{attach,detach}_filter. The checks inside filter.c are updated with
lockdep_sock_is_held to check for proper socket locks.
It keeps the code cleaner by ensuring that only one lock governs the
socket filter instead of two independent locks.
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The socket is either locked if we hold the slock spin_lock for
lock_sock_fast and unlock_sock_fast or we own the lock (sk_lock.owned
!= 0). Check for this and at the same time improve that the current
thread/cpu is really holding the lock.
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
During release_sock we use callbacks to finish the processing
of outstanding skbs on the socket. We actually are still locked,
sk_locked.owned == 1, but we already told lockdep that the mutex
is released. This could lead to false positives in lockdep for
lockdep_sock_is_held (we don't hold the slock spinlock during processing
the outstanding skbs).
I took over this patch from Eric Dumazet and tested it.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip6_route_output() never returns NULL, so it is not appropriate to
check if the return value is NULL.
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Baozeng Ding reported a KASAN stack out of bounds issue - it uncovered that
the TCP option parsing routines in netfilter TCP connection tracking could
read one byte out of the buffer of the TCP options. Therefore in the patch
we check that the available data length is large enough to parse both TCP
option code and size.
Reported-by: Baozeng Ding <sploving1@gmail.com>
Tested-by: Baozeng Ding <sploving1@gmail.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
David Ahern reported panics in __inet_hash() caused by my recent commit.
The reason is inet_reuseport_add_sock() was still using
sk_nulls_for_each_rcu() instead of sk_for_each_rcu().
SO_REUSEPORT enabled listeners were causing an instant crash.
While chasing this bug, I found that I forgot to clear SOCK_RCU_FREE
flag, as it is inherited from the parent at clone time.
Fixes: 3b24d854cb ("tcp/dccp: do not touch listener sk_refcnt under synflood")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: David Ahern <dsa@cumulusnetworks.com>
Tested-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
arptables is broken since we didn't register the table anymore --
even 'arptables -L' fails.
Fixes: b9e69e1273 ("netfilter: xtables: don't hook tables by default")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This is for the recent kcm driver, which introduces AF_KCM(41) in
b7ac4eb(kcm: Kernel Connection Multiplexor module).
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Cc: Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow calling of iptunnel_pull_header without special casing ETH_P_TEB inner
protocol.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The check in vmci_transport_peer_detach_cb should only allow a
detach when the qp handle of the transport matches the one in
the detach message.
Testing: Before this change, a detach from a peer on a different
socket would cause an active stream socket to register a detach.
Reviewed-by: George Zhang <georgezhang@vmware.com>
Signed-off-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Trinity and other fuzzers can hit this WARN on far too easily,
resulting in a tainted kernel that hinders automated fuzzing.
Replace it with a rate-limited printk.
Signed-off-by: Dave Jones <davej@codemonkey.org.uk>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes:
net/mac80211/mesh_hwmp.c:603:26: warning: ‘target_metric’ may be used uninitialized in this function
target_metric is only consumed when reply = true so no bug exists here,
but not all versions of gcc realize it. Initialize to 0 to remove the
warning.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If the user space issues a NL80211_CMD_CONNECT with
NL80211_ATTR_PREV_BSSID when there is already a connection, allow this
to proceed as a reassociation instead of rejecting the new connect
command with EALREADY.
Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
[validate prev_bssid]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This extends NL80211_CMD_CONNECT to allow the NL80211_ATTR_PREV_BSSID
attribute to be used similarly to way this was already allowed with
NL80211_CMD_ASSOCIATE. This allows user space to request reassociation
(instead of association) when already connected to an AP. This provides
an option to reassociate within an ESS without having to disconnect and
associate with the AP.
Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Prevents excessive A-MSDU aggregation at low data rates or bad
conditions.
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Requires software tx queueing and fast-xmit support. For good
performance, drivers need frag_list support as well. This avoids the
need for copying data of aggregated frames. Running without it is only
supported for debugging purposes.
To avoid performance and packet size issues, the rate control module or
driver needs to limit the maximum A-MSDU size by setting
max_rc_amsdu_len in struct ieee80211_sta.
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
[fix locking issue]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If the driver advertises the new HW flag USE_RSS, make the
station statistics on the fast-rx path per-CPU. This will
enable calling the RX in parallel, only hitting locking or
shared cachelines when the fast-RX path isn't available.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The regular RX path has a lot of code, but with a few
assumptions on the hardware it's possible to reduce the
amount of code significantly. Currently the assumptions
on the driver are the following:
* hardware/driver reordering buffer (if supporting aggregation)
* hardware/driver decryption & PN checking (if using encryption)
* hardware/driver did de-duplication
* hardware/driver did A-MSDU deaggregation
* AP_LINK_PS is used (in AP mode)
* no client powersave handling in mac80211 (in client mode)
of which some are actually checked per packet:
* de-duplication
* PN checking
* decryption
and additionally packets must
* not be A-MSDU (have been deaggregated by driver/device)
* be data packets
* not be fragmented
* be unicast
* have RFC 1042 header
Additionally dynamically we assume:
* no encryption or CCMP/GCMP, TKIP/WEP/other not allowed
* station must be authorized
* 4-addr format not enabled
Some data needed for the RX path is cached in a new per-station
"fast_rx" structure, so that we only need to look at this and
the packet, no other memory when processing packets on the fast
RX path.
After doing the above per-packet checks, the data path collapses
down to a pretty simple conversion function taking advantage of
the data cached in the small fast_rx struct.
This should speed up the RX processing, and will make it easier
to reason about parallelizing RX (for which statistics will need
to be per-CPU still.)
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
On 32-bit platforms, the 64-bit counters we keep need to be protected
to be consistently read. Use the u64_stats_sync mechanism to do that.
In order to not end up with overly long lines, refactor the tidstats
assignments a bit.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When storing the last_rate_* values in the RX code, there's nothing
to guarantee consistency, so a concurrent reader could see, e.g.
last_rate_idx on the new value, but last_rate_flag still on the old,
getting completely bogus values in the end.
To fix this, I lifted the sta_stats_encode_rate() function from my
old rate statistics code, which encodes the entire rate data into a
single 16-bit value, avoiding the consistency issue.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Instead of touching the rx_stats.last_rx from the status path, introduce
and use a status_stats.last_ack variable. This will make rx_stats.last_rx
indicate when the last frame was received, making it available for real
"last_rx" and statistics gathering; statistics, when done per-CPU, will
need to figure out which place was updated last for those items where the
"last" value is exposed.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
There's no need to update rx_stats.last_rx after allocating
a station since it's already updated during allocation.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Move the averaged values out of rx_stats and into rx_stats_avg,
to cleanly split them out. The averaged ones cannot be supported
for parallel RX in a per-CPU fashion, while the other values can
be collected per CPU and then combined/selected when needed.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Move the semicolon, people typically assume that and
once line already put a semicolon behind the "call".
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
For the RX MSDU statistics, we need to count the number of
MSDUs created and accepted from an A-MSDU. Right now, all
frames in any A-MSDUs were completely ignored. Fix this by
moving the RX MSDU statistics accounting into the deliver
function.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Sometimes drivers already looked up, or know out-of-band
from their device, which station transmitted a given RX
frame. Allow them to pass the station pointer to mac80211
to save the extra lookup.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When signaling that a GRO frame is ready to be processed, the network stack
correctly checks length and aborts processing when a frame is less than 14
bytes. However, such a condition is really indicative of a broken driver,
and should be loudly signaled, rather than silently dropped as the case is
today.
Convert the condition to use net_warn_ratelimited() to ensure the stack
loudly complains about such broken drivers.
Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When creating an ip6tnl tunnel with ip tunnel, rtnl_link_ops is not set
before ip6_tnl_create2 is called. When register_netdevice is called, there
is no linkinfo attribute in the NEWLINK message because of that.
Setting rtnl_link_ops before calling register_netdevice fixes that.
Fixes: 0b11245722 ("ip6tnl: add support of link creation via rtnl")
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 543e3a8da5.
Direct callers of __netpoll_setup() depend on it to set np->dev,
so we can't simply move that assignment up to netpoll_stup().
Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Enable peeking at UDP datagrams at the offset specified with socket
option SOL_SOCKET/SO_PEEK_OFF. Peek at any datagram in the queue, up
to the end of the given datagram.
Implement the SO_PEEK_OFF semantics introduced in commit ef64a54f6e
("sock: Introduce the SO_PEEK_OFF sock option"). Increase the offset
on peek, decrease it on regular reads.
When peeking, always checksum the packet immediately, to avoid
recomputation on subsequent peeks and final read.
The socket lock is not held for the duration of udp_recvmsg, so
peek and read operations can run concurrently. Only the last store
to sk_peek_off is preserved.
Signed-off-by: Sam Kumar <samanthakumar@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove UDP transport headers before queueing packets for reception.
This change simplifies a follow-up patch to add MSG_PEEK support.
Signed-off-by: Sam Kumar <samanthakumar@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We try to be clever and set large chunks of the bitmap at once, when
possible; unfortunately we weren't very clever when we wrote the code
and messed up the if-conditional. Fix this bug and restore proper
operation.
Signed-off-by: Janak Desai <Janak.Desai@gtri.gatech.edu>
Signed-off-by: Paul Moore <paul@paul-moore.com>
There is no point on delaying the packet if we can't fit a single byte
of data on it anymore. So lets just reduce the threshold by the amount
that a data chunk with 4 bytes (rounding) would use.
v2: based on the right tree
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lockdep warned of a lock dependency between the mesh_plink lock
and the internal lock for the rhashtable. The problem is that
the rhashtable code uses a spin lock with softirqs enabled, while
mesh_plink_timer executes a walk (to flush paths on a state change)
inside a softirq with the plink lock held.
This leads to the following deadlock if the timer fires while rht
lock is held on this CPU, and plink lock is held on another CPU:
CPU0 CPU1
---- ----
lock(&(&ht->lock)->rlock);
local_irq_disable();
lock(&(&sta->mesh->plink_lock)->rlock);
lock(&(&ht->lock)->rlock);
<Interrupt>
lock(&(&sta->mesh->plink_lock)->rlock);
*** DEADLOCK ***
Fix by waiting until we drop the plink lock to flush paths.
Fixes: d48a1b7cd439 ("mac80211: mesh: convert path table to rhashtable")
Signed-off-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Several of the mesh path fields are undocumented and some
of the documentation is no longer correct or relevant after
the switch to rhashtable. Clean up the kernel doc
accordingly and reorder some fields to match the structure
layout.
Signed-off-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reduce padding waste in struct mesh_table and struct rmc_entry by
moving the smaller fields to the end.
Signed-off-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Since we have converted the mesh path tables to rhashtable, we are
no longer swapping out the entire mesh_pathtbl pointer with RCU.
As a result, we no longer need indirection to the hlist head for
the gates list and can simply embed it, saving a pair of
pointer-sized allocations.
Signed-off-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The RMC cache has 256 list heads plus a u32, which puts it at the
unfortunate size of 4104 bytes with padding. kmalloc() will then
round this up to the next power-of-two, so we wind up actually
using two pages here where most of the second is wasted.
Switch to hlist heads here to reduce the structure size down to
fit within a page.
Signed-off-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In the unlikely case that mesh_rmc_init() fails with -ENOMEM,
the rmc pointer will be left as NULL but the interface is still
operational because ieee80211_mesh_init_sdata() is not allowed
to fail.
If this happens, we would blindly dereference rmc when checking
whether a multicast frame is in the cache. Instead just drop the
frames in the forwarding path.
Signed-off-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Legacy clients don't support P2P power save mechanism, and thus if a P2P GO
has a legacy client connected to it, it should disable P2P PS mechanisms.
Let the driver know about this with a new bss_conf parameter.
Signed-off-by: Ayala Beker <ayala.beker@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Legacy clients don't support P2P power save mechanisms, and thus
if a P2P GO has a legacy client connected to it, it has to make
some changes in the PS behavior.
To handle this, add an attribute to specify whether a station supports
P2P PS or not. If the attribute was not specified cfg80211 will assume
that station supports it for P2P GO interface, and does NOT support it
for AP interface, matching the current assumptions in the code.
Signed-off-by: Ayala Beker <ayala.beker@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In the likely case that probe_count is 0, don't write to the
memory there.
Also use ifmgd consistently in the function, instead of using
sdata->u.mgd as well.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The code is only used with iwlwifi, but still should have proper
mac80211 naming scheme; fix that.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Avoid the really strange %s%s%s expression, use an array
of flag names and check that all flags are present.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If the device implements dynamic PS itself, there's no need
to ever start the dynamic powersave timer on RX.
While at it, fix up some indentation in this code.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Since the previous patch, the struct only has a single member,
so remove the struct and leave just the single member.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Remove unused variable in per STA debugfs structure, 'commit 34e895075e
("mac80211: allow station add/remove to sleep")' removed the only user of
'add_has_run'.
Signed-off-by: Mohammed Shafi Shajakhan <mohammed@qti.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently the debugfs entry for starting aggregation session
starts it with timeout of 5 seconds. Allow opening a session
with a custom timeout (according to spec 0 is no timeout).
while at it, refactor the function and remove the magic numbers.
Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
NETIF_F_RXCSUM is not in the white list, though some
drivers may want to set it in order to enable seeing the
actual RX checksum status in ethtool.
Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Allow publishing RRM capabilities for features that are not
HW dependent.
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
There were a few issues that were slowing down the process of finding
the optimal rate, especially on devices with multi-rate retry
limitations:
When max_tp_rate[0] was slower than max_tp_rate[1], the code did not
sample max_tp_rate[1], which would often allow it to switch places with
max_tp_rate[0] (e.g. if only the first sampling attempts were bad, but the
rate is otherwise good).
Also, sample attempts of rates between max_tp_rate[0] and [1] were being
ignored in this case, because the code only checked if the rate was
slower than [1].
Fix this by checking against the fastest / second fastest max_tp_rate
instead of assuming a specific order between the two.
In my tests this patch significantly reduces the time until minstrel_ht
finds the optimal rate right after assoc
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Since we enqueued the frame that was supposed to be sent
during the SP, and that frame may very well cary the
IEEE80211_TX_STATUS_EOSP bit, we may never close the SP
(WLAN_STA_SP will never be cleared). If that happens, we
will not open any new SP and will never respond to any poll
frame from the client.
Clear WLAN_STA_SP manually if a frame that was polled during
the SP is queued because of a starting A-MPDU session. The
client may not see the EOSP bit, but it will at least be
able to poll new frames in another SP.
Reported-by: Alesya Shapira <alesya.shapira@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
[remove erroneous comment]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
It is possible that the station is connected to an AP
with bandwidth of 80+80MHz or 160MHz. In such cases
there is no need to perform an upgrade as the maximal
supported bandwidth is 80MHz.
In addition, when upgrading and setting center_freq1
and bandwidth to 80MHz also set center_freq2 to 0.
Fixes: 0fabfaafec ("mac80211: upgrade BW of TDLS peers when possible"
Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Frames that are sent between
ampdu_action(IEEE80211_AMPDU_TX_START) and the move to the
HT_AGG_STATE_OPERATIONAL state are buffered.
If we try to start an A-MPDU session while the peer is
sleeping and polling frames with U-APSD, we may have frames
that will be buffered by ieee80211_tx_prep_agg. These frames
have IEEE80211_TX_CTL_NO_PS_BUFFER set since they are sent to
a sleeping client and possibly IEEE80211_TX_STATUS_EOSP.
If the frame is buffered, we need clear these two flags
since they will be re-sent after the move to
HT_AGG_STATE_OPERATIONAL state which is very likely to
happen after the SP ends.
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Commit 976bd9efda ("mac80211: move beacon_loss_count into ifmgd")
removed the member from the sta_info struct but the description stayed
lingering. Remove it.
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
By default, the rhashtable logic will fail to insert
objects if the key-chains are too long and un-balanced.
In the degenerate case where mac80211 is creating many
virtual interfaces connected to the same peer(s), this
case can happen.
St insecure_elasticity to true to allow chains to grow
as long as needed.
Signed-off-by: Ben Greear <greearb@candelatech.com>
[remove message, change commit message slightly]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The original hand-implemented hash-table in mac80211 couldn't result
in insertion errors, and while converting to rhashtable I evidently
forgot to check the errors.
This surfaced now only because Ben is adding many identical keys and
that resulted in hidden insertion errors.
Cc: stable@vger.kernel.org
Fixes: 7bedd0cfad ("mac80211: use rhashtable for station table")
Reported-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Fall back to rate control if the requested bitrate was not found.
Fixes: dfdfc2beb0 ("mac80211: Parse legacy and HT rate in injected frames")
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The MCS bandwidth part of the radiotap header is 2 bits wide. The full 2
bit have to compared against IEEE80211_RADIOTAP_MCS_BW_40 and not only if
the first bit is set. Otherwise IEEE80211_RADIOTAP_MCS_BW_40 can be
confused with IEEE80211_RADIOTAP_MCS_BW_20U.
Fixes: dfdfc2beb0 ("mac80211: Parse legacy and HT rate in injected frames")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Introducing a new feature that the driver can use to
indicate the driver/firmware supports configuration of BSS
selection criteria upon CONNECT command. This can be useful
when multiple BSS-es are found belonging to the same ESS,
ie. Infra-BSS with same SSID. The criteria can then be used to
offload selection of a preferred BSS.
Reviewed-by: Hante Meuleman <meuleman@broadcom.com>
Reviewed-by: Franky (Zhenhui) Lin <frankyl@broadcom.com>
Reviewed-by: Pieter-Paul Giesberts <pieterpg@broadcom.com>
Reviewed-by: Lei Zhang <leizh@broadcom.com>
Signed-off-by: Arend van Spriel <arend@broadcom.com>
[move wiphy support check into parse_bss_select()]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Some devices, like iwlwifi, have RSS queues. This may cause a
situation where a disassociation is handled in control path and
results in station removal while there are prior RX frames
that were still not processed in other queues. When they will
be processed the station will be gone, and the frames will be
dropped.
Add a synchronization interface to avoid that. When driver returns
from the synchronization mac80211 may remove the station.
Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In the time since the mesh path table was implemented as an
RCU-traversable, dynamically growing hash table, a generic RCU
hashtable implementation was added to the kernel.
Switch the mesh path table over to rhashtable to remove some code
and also gain some features like automatic shrinking.
Cc: Thomas Graf <tgraf@suug.ch>
Cc: netdev@vger.kernel.org
Signed-off-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In certain cases, the 802.11 mesh pathtable code wants to
iterate over all of the entries in the forwarding table from
the receive path, which is inside an RCU read-side critical
section. Enable walks inside atomic sections by allowing
GFP_ATOMIC allocations for the walker state.
Change all existing callsites to pass in GFP_KERNEL.
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Bob Copeland <me@bobcopeland.com>
[also adjust gfs2/glock.c and rhashtable tests]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The mesh path table uses a struct mesh_node in its hlists in
order to support a resizable hash table: the mesh_node provides
an indirection to the actual mesh path so that two different
bucket lists can point to the same path entry.
However, for the known gates list, we don't need this indirection
because there is ever only one list. So we can just embed the
hlist_node in the mesh path itself, which simplifies things a bit
and saves a linear search whenever we need to find an item in
the list.
Signed-off-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Remove duplicate code to allocate and initialize a mesh
path or mesh proxy path.
Signed-off-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Now that the sdata pointer is the same for all entries of a
path table, hashing it is pointless, so hash only the address.
Signed-off-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The mesh path and mesh gate hashtables are global, containing
all of the mpaths for every mesh interface, but the paths are
all tied logically to a single interface. The common case is
just a single mesh interface, so optimize for that by moving
the global hashtable into the per-interface struct.
Doing so allows us to drop sdata pointer comparisons inside
the lookups and also saves a few bytes of BSS and data.
Signed-off-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If the cfg80211 scan trigger operation specifies a single BSSID, use
that value instead of the wildcard BSSID in the Probe Request frames.
Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This allows scans for a specific BSSID to be optimized by the user space
application by requesting the driver to set the Probe Request frame
BSSID field (Address 3) to the specified BSSID instead of the wildcard
BSSID. This prevents other APs from replying which reduces airtime need
and latency in getting the response from the target AP through.
This is an optimization and as such, it is acceptable for some of the
drivers not to support the mechanism. If not supported, the wildcard
BSSID will be used and more responses may be received.
Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The min_def chanctx is affected not only by the current chandef, but
sometimes also by other stations on the vif. There's a valid scenario
where a TDLS peer can widen its BW, thereby causing the min_def
to increase.
Signed-off-by: Arik Nemtsov <arikx.nemtsov@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The previous approach simply ignored chandef restrictions when calculating
the appropriate peer BW for a WIDER_BW peer. This could result in a
regulatory violation if both peers indicated 80MHz support, but the
regdomain forbade it.
Change the approach to setting a WIDER_BW peer's BW. Don't exempt it from
the chandef width at first. If during TDLS negotiation the chandef width
is upgraded, update the peer's BW to match.
Fixes: 0fabfaafec ("mac80211: upgrade BW of TDLS peers when possible")
Signed-off-by: Arik Nemtsov <arikx.nemtsov@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Even if the current chandef width is equal to the station's max-BW, it
doesn't mean it's a valid width for TDLS. Make sure to always check
regulatory constraints in these cases.
Fixes: 0fabfaafec ("mac80211: upgrade BW of TDLS peers when possible")
Signed-off-by: Arik Nemtsov <arikx.nemtsov@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Buffered multicast frames must be passed to the driver directly via
drv_tx instead of going through the txq, otherwise they cannot easily be
scheduled to be sent after DTIM.
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When HW crypto is used, there's no need for the CCMP/GCMP MIC to
be available to mac80211, and the hardware might have removed it
already after checking. The MIC is also useless to have when the
frame is already decrypted, so allow indicating that it's not
present.
Since we are running out of bits in mac80211_rx_flags, make
the flags field a u64.
Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This was requested by Android, and the appropriate cfg80211 API
had been added by Dmitry. Support it in mac80211, allowing drivers
to provide the timestamp.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add VHT radiotap parsing support to ieee80211_parse_tx_radiotap().
That capability has been tested using a d-link dir-860l rev b1 running
OpenWrt trunk and mt76 driver
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi83@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Using a switch to handle different ev.op values in rfkill_fop_write()
makes the code easier to extend, as out-of-range values can always be
handled by the default case.
Signed-off-by: João Paulo Rechi Vita <jprvita@endlessm.com>
[roll in fix for RFKILL_OP_CHANGE from Jouni]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If register_netdevice_notifier() fails (which in practice it can't
right now), we should call unregister_pernet_subsys(). Do that.
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Attackers like to use SYNFLOOD targeting one 5-tuple, as they
hit a single RX queue (and cpu) on the victim.
If they use random sequence numbers in their SYN, we detect
they do not match the expected window and send back an ACK.
This patch adds a rate limitation, so that the effect of such
attacks is limited to ingress only.
We roughly double our ability to absorb such attacks.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP uses per cpu 'sockets' to send some packets :
- RST packets ( tcp_v4_send_reset()) )
- ACK packets for SYN_RECV and TIMEWAIT sockets
By setting SOCK_USE_WRITE_QUEUE flag, we tell sock_wfree()
to not call sk_write_space() since these internal sockets
do not care.
This gives a small performance improvement, merely by allowing
cpu to properly predict the sock_wfree() conditional branch,
and avoiding one atomic operation.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Goal: packets dropped by a listener are accounted for.
This adds tcp_listendrop() helper, and clears sk_drops in sk_clone_lock()
so that children do not inherit their parent drop count.
Note that we no longer increment LINUX_MIB_LISTENDROPS counter when
sending a SYNCOOKIE, since the SYN packet generated a SYNACK.
We already have a separate LINUX_MIB_SYNCOOKIESSENT
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now ss can report sk_drops, we can instruct TCP to increment
this per socket counter when it drops an incoming frame, to refine
monitoring and debugging.
Following patch takes care of listeners drops.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reporting sk_drops to user space was available for UDP
sockets using /proc interface.
Add this to sock_diag, so that we can have the same information
available to ss users, and we'll be able to add sk_drops
indications for TCP sockets as well.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a SYNFLOOD targets a non SO_REUSEPORT listener, multiple
cpus contend on sk->sk_refcnt and sk->sk_wmem_alloc changes.
By letting listeners use SOCK_RCU_FREE infrastructure,
we can relax TCP_LISTEN lookup rules and avoid touching sk_refcnt
Note that we still use SLAB_DESTROY_BY_RCU rules for other sockets,
only listeners are impacted by this change.
Peak performance under SYNFLOOD is increased by ~33% :
On my test machine, I could process 3.2 Mpps instead of 2.4 Mpps
Most consuming functions are now skb_set_owner_w() and sock_wfree()
contending on sk->sk_wmem_alloc when cooking SYNACK and freeing them.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RX packet processing holds rcu_read_lock(), so we can remove
pairs of rcu_read_lock()/rcu_read_unlock() in lookup functions
if inet_diag also holds rcu before calling them.
This is needed anyway as __inet_lookup_listener() and
inet6_lookup_listener() will soon no longer increment
refcount on the found listener.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since linux 2.6.29, lookups only use rcu locking.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert would like not touching UDP socket refcnt for encapsulated
traffic. For this to happen, we need to use normal RCU rules, with a grace
period before freeing a socket. UDP sockets are not short lived in the
high usage case, so the added cost of call_rcu() should not be a concern.
This actually removes a lot of complexity in UDP stack.
Multicast receives no longer need to hold a bucket spinlock.
Note that ip early demux still needs to take a reference on the socket.
Same remark for functions used by xt_socket and xt_PROXY netfilter modules,
but this might be changed later.
Performance for a single UDP socket receiving flood traffic from
many RX queues/cpus.
Simple udp_rx using simple recvfrom() loop :
438 kpps instead of 374 kpps : 17 % increase of the peak rate.
v2: Addressed Willem de Bruijn feedback in multicast handling
- keep early demux break in __udp4_lib_demux_lookup()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <tom@herbertland.com>
Cc: Willem de Bruijn <willemb@google.com>
Tested-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We want a generic way to insert an RCU grace period before socket
freeing for cases where RCU_SLAB_DESTROY_BY_RCU is adding too
much overhead.
SLAB_DESTROY_BY_RCU strict rules force us to take a reference
on the socket sk_refcnt, and it is a performance problem for UDP
encapsulation, or TCP synflood behavior, as many CPUs might
attempt the atomic operations on a shared sk_refcnt
UDP sockets and TCP listeners can set SOCK_RCU_FREE so that their
lookup can use traditional RCU rules, without refcount changes.
They can set the flag only once hashed and visible by other cpus.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <tom@herbertland.com>
Tested-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes the incorrect variable assignment on error path in
br_sysfs_addbr for when the call to kobject_create_and_add
fails to assign the value of -EINVAL to the returned variable of
err rather then incorrectly return zero making callers think this
function has succeededed due to the previous assignment being
assigned zero when assigning it the successful return value of
the call to sysfs_create_group which is zero.
Signed-off-by: Bastien Philbert <bastienphilbert@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
pskb_may_pull() can change skb->data, so we have to load ptr/optr at the
right place.
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
pskb_may_pull() can change skb->data, so we have to load ptr/optr at the
right place.
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, SOL_TIMESTAMPING can only be enabled using setsockopt.
This is very costly when users want to sample writes to gather
tx timestamps.
Add support for enabling SO_TIMESTAMPING via control messages by
using tsflags added in `struct sockcm_cookie` (added in the previous
patches in this series) to set the tx_flags of the last skb created in
a sendmsg. With this patch, the timestamp recording bits in tx_flags
of the skbuff is overridden if SO_TIMESTAMPING is passed in a cmsg.
Please note that this is only effective for overriding the recording
timestamps flags. Users should enable timestamp reporting (e.g.,
SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID) using
socket options and then should ask for SOF_TIMESTAMPING_TX_*
using control messages per sendmsg to sample timestamps for each
write.
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Process socket-level control messages by invoking
__sock_cmsg_send in ip6_datagram_send_ctl for control messages on
the SOL_SOCKET layer.
This makes sure whenever ip6_datagram_send_ctl is called for
udp and raw, we also process socket-level control messages.
This is a bit uglier than IPv4, since IPv6 does not have
something like ipcm_cookie. Perhaps we can later create
a control message cookie for IPv6?
Note that this commit interprets new control messages that
were ignored before. As such, this commit does not change
the behavior of IPv6 control messages.
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Process socket-level control messages by invoking
__sock_cmsg_send in ip_cmsg_send for control messages on
the SOL_SOCKET layer.
This makes sure whenever ip_cmsg_send is called in udp, icmp,
and raw, we also process socket-level control messages.
Note that this commit interprets new control messages that
were ignored before. As such, this commit does not change
the behavior of IPv4 control messages.
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Accept SO_TIMESTAMPING in control messages of the SOL_SOCKET level
as a basis to accept timestamping requests per write.
This implementation only accepts TX recording flags (i.e.,
SOF_TIMESTAMPING_TX_HARDWARE, SOF_TIMESTAMPING_TX_SOFTWARE,
SOF_TIMESTAMPING_TX_SCHED, and SOF_TIMESTAMPING_TX_ACK) in
control messages. Users need to set reporting flags (e.g.,
SOF_TIMESTAMPING_OPT_ID) per socket via socket options.
This commit adds a tsflags field in sockcm_cookie which is
set in __sock_cmsg_send. It only override the SOF_TIMESTAMPING_TX_*
bits in sockcm_cookie.tsflags allowing the control message
to override the recording behavior per write, yet maintaining
the value of other flags.
This patch implements validating the control message and setting
tsflags in struct sockcm_cookie. Next commits in this series will
actually implement timestamping per write for different protocols.
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, to avoid a cache line miss for accessing skb_shinfo,
tcp_ack_tstamp skips socket that do not have
SOF_TIMESTAMPING_TX_ACK bit set in sk_tsflags. This is
implemented based on an implicit assumption that the
SOF_TIMESTAMPING_TX_ACK is set via socket options for the
duration that ACK timestamps are needed.
To implement per-write timestamps, this check should be
removed and replaced with a per-packet alternative that
quickly skips packets missing ACK timestamps marks without
a cache-line miss.
To enable per-packet marking without a cache line miss, use
one bit in TCP_SKB_CB to mark a whether a SKB might need a
ack tx timestamp or not. Further checks in tcp_ack_tstamp are not
modified and work as before.
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SOF_TIMESTAMPING_OPT_ID is set to get data-independent IDs
to associate timestamps with send calls. For TCP connections,
tp->snd_una is used as the starting point to calculate
relative IDs.
This socket option will fail if set before the handshake on a
passive TCP fast open connection with data in SYN or SYN/ACK,
since setsockopt requires the connection to be in the
ESTABLISHED state.
To address these, instead of limiting the option to the
ESTABLISHED state, accept the SOF_TIMESTAMPING_OPT_ID option as
long as the connection is not in LISTEN or CLOSE states.
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To process cmsg's of the SOL_SOCKET level in addition to
cmsgs of another level, protocols can call sock_cmsg_send().
This causes a double walk on the cmsghdr list, one for SOL_SOCKET
and one for the other level.
Extract the inner demultiplex logic from the loop that walks the list,
to allow having this called directly from a walker in the protocol
specific code.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge PAGE_CACHE_SIZE removal patches from Kirill Shutemov:
"PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
Let's stop pretending that pages in page cache are special. They are
not.
The first patch with most changes has been done with coccinelle. The
second is manual fixups on top.
The third patch removes macros definition"
[ I was planning to apply this just before rc2, but then I spaced out,
so here it is right _after_ rc2 instead.
As Kirill suggested as a possibility, I could have decided to only
merge the first two patches, and leave the old interfaces for
compatibility, but I'd rather get it all done and any out-of-tree
modules and patches can trivially do the converstion while still also
working with older kernels, so there is little reason to try to
maintain the redundant legacy model. - Linus ]
* PAGE_CACHE_SIZE-removal:
mm: drop PAGE_CACHE_* and page_cache_{get,release} definition
mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
Mostly direct substitution with occasional adjustment or removing
outdated comments.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The skcpiher/shash conversion introduced a number of bugs in the
sunrpc code:
1) Missing calls to skcipher_request_set_tfm lead to crashes.
2) The allocation size of shash_desc is too small which leads to
memory corruption.
Fixes: 3b5cf20cf4 ("sunrpc: Use skcipher and ahash/shash")
Reported-by: J. Bruce Fields <bfields@fieldses.org>
Tested-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Since nla_get_in_addr and nla_put_in_addr were implemented,
so use them appropriately.
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For non-SACK connections, cwnd is lowered to inflight plus 3 packets
when the recovery ends. This is an optional feature in the NewReno
RFC 2582 to reduce the potential burst when cwnd is "re-opened"
after recovery and inflight is low.
This feature is questionably effective because of PRR: when
the recovery ends (i.e., snd_una == high_seq) NewReno holds the
CA_Recovery state for another round trip to prevent false fast
retransmits. But if the inflight is low, PRR will overwrite the
moderated cwnd in tcp_cwnd_reduction() later regardlessly. So if a
receiver responds bogus ACKs (i.e., acking future data) to speed up
transfer after recovery, it can only induce a burst up to a window
worth of data packets by acking up to SND.NXT. A restart from (short)
idle or receiving streched ACKs can both cause such bursts as well.
On the other hand, if the recovery ends because the sender
detects the losses were spurious (e.g., reordering). This feature
unconditionally lowers a reverted cwnd even though nothing
was lost.
By principle loss recovery module should not update cwnd. Further
pacing is much more effective to reduce burst. Hence this patch
removes the cwnd moderation feature.
v2 changes: revised commit message on bogus ACKs and burst, and
missing signature
Signed-off-by: Matt Mathis <mattmathis@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Missing device reference in IPSEC input path results in crashes
during device unregistration. From Subash Abhinov Kasiviswanathan.
2) Per-queue ISR register writes not being done properly in macb
driver, from Cyrille Pitchen.
3) Stats accounting bugs in bcmgenet, from Patri Gynther.
4) Lightweight tunnel's TTL and TOS were swapped in netlink dumps, from
Quentin Armitage.
5) SXGBE driver has off-by-one in probe error paths, from Rasmus
Villemoes.
6) Fix race in save/swap/delete options in netfilter ipset, from
Vishwanath Pai.
7) Ageing time of bridge not set properly when not operating over a
switchdev device. Fix from Haishuang Yan.
8) Fix GRO regression wrt nested FOU/GUE based tunnels, from Alexander
Duyck.
9) IPV6 UDP code bumps wrong stats, from Eric Dumazet.
10) FEC driver should only access registers that actually exist on the
given chipset, fix from Fabio Estevam.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (73 commits)
net: mvneta: fix changing MTU when using per-cpu processing
stmmac: fix MDIO settings
Revert "stmmac: Fix 'eth0: No PHY found' regression"
stmmac: fix TX normal DESC
net: mvneta: use cache_line_size() to get cacheline size
net: mvpp2: use cache_line_size() to get cacheline size
net: mvpp2: fix maybe-uninitialized warning
tun, bpf: fix suspicious RCU usage in tun_{attach, detach}_filter
net: usb: cdc_ncm: adding Telit LE910 V2 mobile broadband card
rtnl: fix msg size calculation in if_nlmsg_size()
fec: Do not access unexisting register in Coldfire
net: mvneta: replace MVNETA_CPU_D_CACHE_LINE_SIZE with L1_CACHE_BYTES
net: mvpp2: replace MVPP2_CPU_D_CACHE_LINE_SIZE with L1_CACHE_BYTES
net: dsa: mv88e6xxx: Clear the PDOWN bit on setup
net: dsa: mv88e6xxx: Introduce _mv88e6xxx_phy_page_{read, write}
bpf: make padding in bpf_tunnel_key explicit
ipv6: udp: fix UDP_MIB_IGNOREDMULTI updates
bnxt_en: Fix ethtool -a reporting.
bnxt_en: Fix typo in bnxt_hwrm_set_pause_common().
bnxt_en: Implement proper firmware message padding.
...
Sasha Levin reported a suspicious rcu_dereference_protected() warning
found while fuzzing with trinity that is similar to this one:
[ 52.765684] net/core/filter.c:2262 suspicious rcu_dereference_protected() usage!
[ 52.765688] other info that might help us debug this:
[ 52.765695] rcu_scheduler_active = 1, debug_locks = 1
[ 52.765701] 1 lock held by a.out/1525:
[ 52.765704] #0: (rtnl_mutex){+.+.+.}, at: [<ffffffff816a64b7>] rtnl_lock+0x17/0x20
[ 52.765721] stack backtrace:
[ 52.765728] CPU: 1 PID: 1525 Comm: a.out Not tainted 4.5.0+ #264
[...]
[ 52.765768] Call Trace:
[ 52.765775] [<ffffffff813e488d>] dump_stack+0x85/0xc8
[ 52.765784] [<ffffffff810f2fa5>] lockdep_rcu_suspicious+0xd5/0x110
[ 52.765792] [<ffffffff816afdc2>] sk_detach_filter+0x82/0x90
[ 52.765801] [<ffffffffa0883425>] tun_detach_filter+0x35/0x90 [tun]
[ 52.765810] [<ffffffffa0884ed4>] __tun_chr_ioctl+0x354/0x1130 [tun]
[ 52.765818] [<ffffffff8136fed0>] ? selinux_file_ioctl+0x130/0x210
[ 52.765827] [<ffffffffa0885ce3>] tun_chr_ioctl+0x13/0x20 [tun]
[ 52.765834] [<ffffffff81260ea6>] do_vfs_ioctl+0x96/0x690
[ 52.765843] [<ffffffff81364af3>] ? security_file_ioctl+0x43/0x60
[ 52.765850] [<ffffffff81261519>] SyS_ioctl+0x79/0x90
[ 52.765858] [<ffffffff81003ba2>] do_syscall_64+0x62/0x140
[ 52.765866] [<ffffffff817d563f>] entry_SYSCALL64_slow_path+0x25/0x25
Same can be triggered with PROVE_RCU (+ PROVE_RCU_REPEATEDLY) enabled
from tun_attach_filter() when user space calls ioctl(tun_fd, TUN{ATTACH,
DETACH}FILTER, ...) for adding/removing a BPF filter on tap devices.
Since the fix in f91ff5b9ff ("net: sk_{detach|attach}_filter() rcu
fixes") sk_attach_filter()/sk_detach_filter() now dereferences the
filter with rcu_dereference_protected(), checking whether socket lock
is held in control path.
Since its introduction in 9940516259 ("tun: socket filter support"),
tap filters are managed under RTNL lock from __tun_chr_ioctl(). Thus the
sock_owned_by_user(sk) doesn't apply in this specific case and therefore
triggers the false positive.
Extend the BPF API with __sk_attach_filter()/__sk_detach_filter() pair
that is used by tap filters and pass in lockdep_rtnl_is_held() for the
rcu_dereference_protected() checks instead.
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Size of the attribute IFLA_PHYS_PORT_NAME was missing.
Fixes: db24a9044e ("net: add support for phys_port_name")
CC: David Ahern <dsahern@gmail.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently rely on the PMTU discovery of xfrm.
However if a packet is locally sent, the PMTU mechanism
of xfrm tries to do local socket notification what
might not work for applications like ping that don't
check for this. So add pmtu handling to vti_xmit to
report MTU changes immediately.
Reported-by: Mark McKinstry <Mark.McKinstry@alliedtelesis.co.nz>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Make the 2 byte padding in struct bpf_tunnel_key between tunnel_ttl
and tunnel_label members explicit. No issue has been observed, and
gcc/llvm does padding for the old struct already, where tunnel_label
was not yet present, so the current code works, but since it's part
of uapi, make sure we don't introduce holes in structs.
Therefore, add tunnel_ext that we can use generically in future
(f.e. to flag OAM messages for backends, etc). Also add the offset
to the compat tests to be sure should some compilers not padd the
tail of the old version of bpf_tunnel_key.
Fixes: 4018ab1875 ("bpf: support flow label for bpf_skb_{set, get}_tunnel_key")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv6 counters updates use a different macro than IPv4.
Fixes: 36cbb2452c ("udp: Increment UDP_MIB_IGNOREDMULTI for arriving unmatched multicasts")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Rick Jones <rick.jones2@hp.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch should fix the issues seen with a recent fix to prevent
tunnel-in-tunnel frames from being generated with GRO. The fix itself is
correct for now as long as we do not add any devices that support
NETIF_F_GSO_GRE_CSUM. When such a device is added it could have the
potential to mess things up due to the fact that the outer transport header
points to the outer UDP header and not the GRE header as would be expected.
Fixes: fac8e0f579 ("tunnels: Don't apply GRO to multiple layers of encapsulation.")
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Somehow my patch for commit cea8768f33 ("sctp: allow
sctp_transmit_packet and others to use gfp") missed two important
chunks, which are now added.
Fixes: cea8768f33 ("sctp: allow sctp_transmit_packet and others to use gfp")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-By: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When NET_SWITCHDEV=n, switchdev_port_attr_set will return -EOPNOTSUPP,
we should ignore this error code and continue to set the ageing time.
Fixes: c62987bbd8 ("bridge: push bridge setting ageing_time down to switchdev")
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Acked-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace '64' with the per-net ipv6_devconf_all's hop_limit when
building the ipv6 header.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This makes nf queues use NFQA_VLAN and NFQA_L2HDR in verdict to modify the
original skb
Signed-off-by: Stephane Bryant <stephane.ml.bryant@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
- This creates 2 netlink attribute NFQA_VLAN and NFQA_L2HDR.
- These are filled up for the PF_BRIDGE family on the way to userspace.
- NFQA_VLAN is a nested attribute, with the NFQA_VLAN_PROTO and the
NFQA_VLAN_TCI carrying the corresponding vlan_proto and vlan_tci
fields from the skb using big endian ordering (and using the CFI
bit as the VLAN_TAG_PRESENT flag in vlan_tci as in the skb)
Signed-off-by: Stephane Bryant <stephane.ml.bryant@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This just adds and registers a nf_afinfo for the ethernet
bridge, which enables queuing to userspace for the AF_BRIDGE
family. No checksum computation is done.
Signed-off-by: Stephane Bryant <stephane.ml.bryant@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for you net tree,
they are:
1) There was a race condition between parallel save/swap and delete,
which resulted a kernel crash due to the increase ref for save, swap,
wrong ref decrease operations. Reported and fixed by Vishwanath Pai.
2) OVS should call into CT NAT for packets of new expected connections only
when the conntrack state is persisted with the 'commit' option to the
OVS CT action. From Jarno Rajahalme.
3) Resolve kconfig dependencies with new OVS NAT support. From Arnd Bergmann.
4) Early validation of entry->target_offset to make sure it doesn't take us
out from the blob, from Florian Westphal.
5) Again early validation of entry->next_offset to make sure it doesn't take
out from the blob, also from Florian.
6) Check that entry->target_offset is always of of sizeof(struct xt_entry)
for unconditional entries, when checking both from check_underflow()
and when checking for loops in mark_source_chains(), again from
Florian.
7) Fix inconsistent behaviour in nfnetlink_queue when
NFQA_CFG_F_FAIL_OPEN is set and netlink_unicast() fails due to buffer
overrun, we have to reinject the packet as the user expects.
8) Enforce nul-terminated table names from getsockopt GET_ENTRIES
requests.
9) Don't assume skb->sk is set from nft_bridge_reject and synproxy,
this fixes a recent update of the code to namespaceify
ip_default_ttl, patch from Liping Zhang.
This batch comes with four patches to validate x_tables blobs coming
from userspace. CONFIG_USERNS exposes the x_tables interface to
unpriviledged users and to be honest this interface never received the
attention for this move away from the CAP_NET_ADMIN domain. Florian is
working on another round with more patches with more sanity checks, so
expect a bit more Netfilter fixes in this development cycle than usual.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit fa50d974d1 ("ipv4: Namespaceify ip_default_ttl sysctl knob")
use sock_net(skb->sk) to get the net namespace, but we can't assume
that sk_buff->sk is always exist, so when it is NULL, oops will happen.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Reviewed-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Make sure the table names via getsockopt GET_ENTRIES is nul-terminated
in ebtables and all the x_tables variants and their respective compat
code. Uncovered by KASAN.
Reported-by: Baozeng Ding <sploving1@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When netlink unicast fails to deliver the message to userspace, we
should also check if the NFQA_CFG_F_FAIL_OPEN flag is set so we reinject
the packet back to the stack.
I think the user expects no packet drops when this flag is set due to
queueing to userspace errors, no matter if related to the internal queue
or when sending the netlink message to userspace.
The userspace application will still get the ENOBUFS error via recvmsg()
so the user still knows that, with the current configuration that is in
place, the userspace application is not consuming the messages at the
pace that the kernel needs.
Reported-by: "Yigal Reiss (yreiss)" <yreiss@cisco.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Tested-by: "Yigal Reiss (yreiss)" <yreiss@cisco.com>
Ben Hawkes says:
In the mark_source_chains function (net/ipv4/netfilter/ip_tables.c) it
is possible for a user-supplied ipt_entry structure to have a large
next_offset field. This field is not bounds checked prior to writing a
counter value at the supplied offset.
Problem is that mark_source_chains should not have been called --
the rule doesn't have a next entry, so its supposed to return
an absolute verdict of either ACCEPT or DROP.
However, the function conditional() doesn't work as the name implies.
It only checks that the rule is using wildcard address matching.
However, an unconditional rule must also not be using any matches
(no -m args).
The underflow validator only checked the addresses, therefore
passing the 'unconditional absolute verdict' test, while
mark_source_chains also tested for presence of matches, and thus
proceeeded to the next (not-existent) rule.
Unify this so that all the callers have same idea of 'unconditional rule'.
Reported-by: Ben Hawkes <hawkes@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Otherwise this function may read data beyond the ruleset blob.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We should check that e->target_offset is sane before
mark_source_chains gets called since it will fetch the target entry
for loop detection.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The openvswitch code has gained support for calling into the
nf-nat-ipv4/ipv6 modules, however those can be loadable modules
in a configuration in which openvswitch is built-in, leading
to link errors:
net/built-in.o: In function `__ovs_ct_lookup':
:(.text+0x2cc2c8): undefined reference to `nf_nat_icmp_reply_translation'
:(.text+0x2cc66c): undefined reference to `nf_nat_icmpv6_reply_translation'
The dependency on (!NF_NAT || NF_NAT) prevents similar issues,
but NF_NAT is set to 'y' if any of the symbols selecting
it are built-in, but the link error happens when any of them
are modular.
A second issue is that even if CONFIG_NF_NAT_IPV6 is built-in,
CONFIG_NF_NAT_IPV4 might be completely disabled. This is unlikely
to be useful in practice, but the driver currently only handles
IPv6 being optional.
This patch improves the Kconfig dependency so that openvswitch
cannot be built-in if either of the two other symbols are set
to 'm', and it replaces the incorrect #ifdef in ovs_ct_nat_execute()
with two "if (IS_ENABLED())" checks that should catch all corner
cases also make the code more readable.
The same #ifdef exists ovs_ct_nat_to_attr(), where it does not
cause a link error, but for consistency I'm changing it the same
way.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 05752523e5 ("openvswitch: Interface with NAT.")
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
OVS should call into CT NAT for packets of new expected connections only
when the conntrack state is persisted with the 'commit' option to the
OVS CT action. The test for this condition is doubly wrong, as the CT
status field is ANDed with the bit number (IPS_EXPECTED_BIT) rather
than the mask (IPS_EXPECTED), and due to the wrong assumption that the
expected bit would apply only for the first (i.e., 'new') packet of a
connection, while in fact the expected bit remains on for the lifetime of
an expected connection. The 'ctinfo' value IP_CT_RELATED derived from
the ct status can be used instead, as it is only ever applicable to
the 'new' packets of the expected connection.
Fixes: 05752523e5 ('openvswitch: Interface with NAT.')
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This fix adds a new reference counter (ref_netlink) for the struct ip_set.
The other reference counter (ref) can be swapped out by ip_set_swap and we
need a separate counter to keep track of references for netlink events
like dump. Using the same ref counter for dump causes a race condition
which can be demonstrated by the following script:
ipset create hash_ip1 hash:ip family inet hashsize 1024 maxelem 500000 \
counters
ipset create hash_ip2 hash:ip family inet hashsize 300000 maxelem 500000 \
counters
ipset create hash_ip3 hash:ip family inet hashsize 1024 maxelem 500000 \
counters
ipset save &
ipset swap hash_ip3 hash_ip2
ipset destroy hash_ip3 /* will crash the machine */
Swap will exchange the values of ref so destroy will see ref = 0 instead of
ref = 1. With this fix in place swap will not succeed because ipset save
still has ref_netlink on the set (ip_set_swap doesn't swap ref_netlink).
Both delete and swap will error out if ref_netlink != 0 on the set.
Note: The changes to *_head functions is because previously we would
increment ref whenever we called these functions, we don't do that
anymore.
Reviewed-by: Joshua Hunt <johunt@akamai.com>
Signed-off-by: Vishwanath Pai <vpai@akamai.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
For the input parameter count, it's better to use the size
of destination buffer size, as nla_memcpy would take into
account the length of the source netlink attribute when
a data is copied from an attribute.
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For a route with IPv6 encapsulation, the traffic class and hop limit
values are interchanged when returned to userspace by the kernel.
For example, see below.
># ip route add 192.168.0.1 dev eth0.2 encap ip6 dst 0x50 tc 0x50 hoplimit 100 table 1000
># ip route show table 1000
192.168.0.1 encap ip6 id 0 src :: dst fe83::1 hoplimit 80 tc 100 dev eth0.2 scope link
Signed-off-by: Quentin Armitage <quentin@armitage.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull Ceph updates from Sage Weil:
"There is quite a bit here, including some overdue refactoring and
cleanup on the mon_client and osd_client code from Ilya, scattered
writeback support for CephFS and a pile of bug fixes from Zheng, and a
few random cleanups and fixes from others"
[ I already decided not to pull this because of it having been rebased
recently, but ended up changing my mind after all. Next time I'll
really hold people to it. Oh well. - Linus ]
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (34 commits)
libceph: use KMEM_CACHE macro
ceph: use kmem_cache_zalloc
rbd: use KMEM_CACHE macro
ceph: use lookup request to revalidate dentry
ceph: kill ceph_get_dentry_parent_inode()
ceph: fix security xattr deadlock
ceph: don't request vxattrs from MDS
ceph: fix mounting same fs multiple times
ceph: remove unnecessary NULL check
ceph: avoid updating directory inode's i_size accidentally
ceph: fix race during filling readdir cache
libceph: use sizeof_footer() more
ceph: kill ceph_empty_snapc
ceph: fix a wrong comparison
ceph: replace CURRENT_TIME by current_fs_time()
ceph: scattered page writeback
libceph: add helper that duplicates last extent operation
libceph: enable large, variable-sized OSD requests
libceph: osdc->req_mempool should be backed by a slab pool
libceph: make r_request msg_size calculation clearer
...
Use KMEM_CACHE() instead of kmem_cache_create() to simplify the code.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Don't open-code sizeof_footer() in read_partial_message() and
ceph_msg_revoke(). Also, after switching to sizeof_footer(), it's now
possible to use con_out_kvec_add() in prepare_write_message_footer().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
This helper duplicates last extent operation in OSD request, then
adjusts the new extent operation's offset and length. The helper
is for scatterd page writeback, which adds nonconsecutive dirty
pages to single OSD request.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Turn r_ops into a flexible array member to enable large, consisting of
up to 16 ops, OSD requests. The use case is scattered writeback in
cephfs and, as far as the kernel client is concerned, 16 is just a made
up number.
r_ops had size 3 for copyup+hint+write, but copyup is really a special
case - it can only happen once. ceph_osd_request_cache is therefore
stuffed with num_ops=2 requests, anything bigger than that is allocated
with kmalloc(). req_mempool is backed by ceph_osd_request_cache, which
means either num_ops=1 or num_ops=2 for use_mempool=true - all existing
users (ceph_writepages_start(), ceph_osdc_writepages()) are fine with
that.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
ceph_osd_request_cache was introduced a long time ago. Also, osd_req
is about to get a flexible array member, which ceph_osd_request_cache
is going to be aware of.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Although msg_size is calculated correctly, the terms are grouped in
a misleading way - snaps appears to not have room for a u32 length.
Move calculation closer to its use and regroup terms.
No functional change.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This avoids defining large array of r_reply_op_{len,result} in
in struct ceph_osd_request.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This can happen if __close_session() in ceph_monc_stop() races with
a connection reset. We need to ignore such faults, otherwise it's
likely we would take !hunting, call __schedule_delayed() and end up
with delayed_work() executing on invalid memory, among other things.
The (two!) con->private tests are useless, as nothing ever clears
con->private. Nuke them.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Doing __schedule_delayed() in the hunting branch is pointless, as the
tick will have already been scheduled by then.
What we need to do instead is *reschedule* it in the !hunting branch,
after reopen_session() changes hunt_mult, which affects the delay.
This helps with spacing out connection attempts and avoiding things
like two back-to-back attempts followed by a longer period of waiting
around.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
hunting is now set in __open_session() and cleared in finish_hunting(),
instead of all around. The "session lost" message is printed not only
on connection resets, but also on keepalive timeouts.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Unless we are in the process of setting up a client (i.e. connecting to
the monitor cluster for the first time), apply a backoff: every time we
want to reopen a session, increase our timeout by a multiple (currently
2); when we complete the connection, reduce that multipler by 50%.
Mirrors ceph.git commit 794c86fd289bd62a35ed14368fa096c46736e9a2.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Split ping interval and ping timeout: ping interval is 10s; keepalive
timeout is 30s.
Make monc_ping_timeout a constant while at it - it's not actually
exported as a mount option (and the rest of tick-related settings won't
be either), so it's got no place in ceph_options.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Don't try to reconnect to the same monitor when we fail to establish
a session within a timeout or it's lost.
For that, pick_new_mon() needs to see the old value of cur_mon, so
don't clear it in __close_session() - all calls to __close_session()
but one are followed by __open_session() anyway. __open_session() is
only called when a new session needs to be established, so the "already
open?" branch, which is now in the way, is simply dropped.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
It is currently hard-coded in the mon_client that mdsmap and monmap
subs are continuous, while osdmap sub is always "onetime". To better
handle full clusters/pools in the osd_client, we need to be able to
issue continuous osdmap subs. Revamp subs code to allow us to specify
for each sub whether it should be continuous or not.
Although not strictly required for the above, switch to SUBSCRIBE2
protocol while at it, eliminating the ambiguity between a request for
"every map since X" and a request for "just the latest" when we don't
have a map yet (i.e. have epoch 0). SUBSCRIBE2 feature bit is now
required - it's been supported since pre-argonaut (2010).
Move "got mdsmap" call to the end of ceph_mdsc_handle_map() - calling
in before we validate the epoch and successfully install the new map
can mess up mon_client sub state.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Coupling hunting state with subscribe state is not a good idea. Clear
hunting when we complete the authentication handshake.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Our debugfs dir name is a concatenation of cluster fsid and client
unique ID ("global_id"). It used to be the case that we learned
global_id first, nowadays we always learn fsid first - the monmap is
sent before any auth replies are. ceph_debugfs_client_init() call in
ceph_monc_handle_map() is therefore never executed and can be removed.
Its counterpart in handle_auth_reply() doesn't really belong there
either: having to do monc->client and unlocking early to work around
lockdep is a testament to that. Move it into __ceph_open_session(),
where it can be called unconditionally.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
netpoll_setup() does a dev_hold() on np->dev, the netpoll device. If it
fails, it correctly does a dev_put() but leaves np->dev set. If we call
netpoll_cleanup() after the failure, np->dev is still set so we do another
dev_put(), which decrements the refcount an extra time.
It's questionable to call netpoll_cleanup() after netpoll_setup() fails,
but it can be difficult to find the problem, and we can easily avoid it in
this case. The extra decrements can lead to hangs like this:
unregister_netdevice: waiting for bond0 to become free. Usage count = -3
Set and clear np->dev at the points where we dev_hold() and dev_put() the
device.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A crash is observed when a decrypted packet is processed in receive
path. get_rps_cpus() tries to dereference the skb->dev fields but it
appears that the device is freed from the poison pattern.
[<ffffffc000af58ec>] get_rps_cpu+0x94/0x2f0
[<ffffffc000af5f94>] netif_rx_internal+0x140/0x1cc
[<ffffffc000af6094>] netif_rx+0x74/0x94
[<ffffffc000bc0b6c>] xfrm_input+0x754/0x7d0
[<ffffffc000bc0bf8>] xfrm_input_resume+0x10/0x1c
[<ffffffc000ba6eb8>] esp_input_done+0x20/0x30
[<ffffffc0000b64c8>] process_one_work+0x244/0x3fc
[<ffffffc0000b7324>] worker_thread+0x2f8/0x418
[<ffffffc0000bb40c>] kthread+0xe0/0xec
-013|get_rps_cpu(
| dev = 0xFFFFFFC08B688000,
| skb = 0xFFFFFFC0C76AAC00 -> (
| dev = 0xFFFFFFC08B688000 -> (
| name =
"......................................................
| name_hlist = (next = 0xAAAAAAAAAAAAAAAA, pprev =
0xAAAAAAAAAAA
Following are the sequence of events observed -
- Encrypted packet in receive path from netdevice is queued
- Encrypted packet queued for decryption (asynchronous)
- Netdevice brought down and freed
- Packet is decrypted and returned through callback in esp_input_done
- Packet is queued again for process in network stack using netif_rx
Since the device appears to have been freed, the dereference of
skb->dev in get_rps_cpus() leads to an unhandled page fault
exception.
Fix this by holding on to device reference when queueing packets
asynchronously and releasing the reference on call back return.
v2: Make the change generic to xfrm as mentioned by Steffen and
update the title to xfrm
Suggested-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Jerome Stanislaus <jeromes@codeaurora.org>
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
pnfs layout type from Christoph Hellwig. The new layout type is a
variant of the block layout which uses SCSI features to offer improved
fencing and device identification.
(Also: note this pull request also includes the client side of SCSI
layout, with Trond's permission.)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJW8+uhAAoJECebzXlCjuG+26YP/35DP4MPfszEJ5G0dYq5HMwl
dJUni8ajSHRswZ/2FqiBsRwmg3Djfc+uoXdOneD1f6ogkDe7S16yp+FRyh8/VwUs
Ym6LcxSjT28uqkxO0MblcnUl0G9nNSuOwqIsZ0HG7/UC7E6RmCF4o3r5fFUfOsA+
B3koB5UcHNAFythAk+GDwOQ46Fr96VkZ7Y+OhdNAwmeXZIdKXIufweueI/o2uipB
RoJFJ7lqrzAjFe+CqAUBr2l2k6lEKzdxbEH6HXQ5+cvVNwfVIgnrONpF78uF/p9T
NNDnZ+fn3YdRhd+W9RxUHZq7ZL5YOEA8kHsAlloeBH74GqCy7IcS+DrKt1ReM3px
bhgsXM3dqqJ9xiDGqmeE4VQwRF30SxgYZbO386E+cLHnCYV+vfY6RUaWPrk6On/r
FL9g3iyVvhyC4HO06Xm+uvvERw8R+fTZY9KZQKH2RL0Tr5DkWRRNJfasMO+PwGOv
Fdku01vyoA4Y6mbqUgQ9DmrbLO4gK3UyMiOTanQV9shrIDxI0MOuLK03zL25vZCM
s1A4YBpXmg4gx3XsOFM+tygv6EVujDu6scICeb+hj0vi0oG82Lx7T9e3MJEiYC+T
jbi8bu+x+0bX2obMprvDNVUzi/PgSUVpGCnRlbRTaXBa0lB6nV7uUiQ1HC9gGesm
ZWWiOv7du+7WlFP5c6r5
=mY8w
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.6' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"Various bugfixes, a RDMA update from Chuck Lever, and support for a
new pnfs layout type from Christoph Hellwig. The new layout type is a
variant of the block layout which uses SCSI features to offer improved
fencing and device identification.
(Also: note this pull request also includes the client side of SCSI
layout, with Trond's permission.)"
* tag 'nfsd-4.6' of git://linux-nfs.org/~bfields/linux:
sunrpc/cache: drop reference when sunrpc_cache_pipe_upcall() detects a race
nfsd: recover: fix memory leak
nfsd: fix deadlock secinfo+readdir compound
nfsd4: resfh unused in nfsd4_secinfo
svcrdma: Use new CQ API for RPC-over-RDMA server send CQs
svcrdma: Use new CQ API for RPC-over-RDMA server receive CQs
svcrdma: Remove close_out exit path
svcrdma: Hook up the logic to return ERR_CHUNK
svcrdma: Use correct XID in error replies
svcrdma: Make RDMA_ERROR messages work
rpcrdma: Add RPCRDMA_HDRLEN_ERR
svcrdma: svc_rdma_post_recv() should close connection on error
svcrdma: Close connection when a send error occurs
nfsd: Lower NFSv4.1 callback message size limit
svcrdma: Do not send Write chunk XDR pad with inline content
svcrdma: Do not write xdr_buf::tail in a Write chunk
svcrdma: Find client-provided write and reply chunks once per reply
nfsd: Update NFS server comments related to RDMA support
nfsd: Fix a memory leak when meeting unsupported state_protect_how4
nfsd4: fix bad bounds checking
Pull networking bugfixes from David Miller:
"Several bug fixes rolling in, some for changes introduced in this
merge window, and some for problems that have existed for some time:
1) Fix prepare_to_wait() handling in AF_VSOCK, from Claudio Imbrenda.
2) The new DST_CACHE should be a silent config option, from Dave
Jones.
3) inet_current_timestamp() unintentionally truncates timestamps to
16-bit, from Deepa Dinamani.
4) Missing reference to netns in ppp, from Guillaume Nault.
5) Free memory reference in hv_netvsc driver, from Haiyang Zhang.
6) Missing kernel doc documentation for function arguments in various
spots around the networking, from Luis de Bethencourt.
7) UDP stopped receiving broadcast packets properly, due to
overzealous multicast checks, fix from Paolo Abeni"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (59 commits)
net: ping: make ping_v6_sendmsg static
hv_netvsc: Fix the order of num_sc_offered decrement
net: Fix typos and whitespace.
hv_netvsc: Fix the array sizes to be max supported channels
hv_netvsc: Fix accessing freed memory in netvsc_change_mtu()
ppp: take reference on channels netns
net: Reset encap_level to avoid resetting features on inner IP headers
net: mediatek: fix checking for NULL instead of IS_ERR() in .probe
net: phy: at803x: Request 'reset' GPIO only for AT8030 PHY
at803x: fix reset handling
AF_VSOCK: Shrink the area influenced by prepare_to_wait
Revert "vsock: Fix blocking ops call in prepare_to_wait"
macb: fix PHY reset
ipv4: initialize flowi4_flags before calling fib_lookup()
fsl/fman: Workaround for Errata A-007273
ipv4: fix broadcast packets reception
net: hns: bug fix about the overflow of mss
net: hns: adds limitation for debug port mtu
net: hns: fix the bug about mtu setting
net: hns: fixes a bug of RSS
...
As ping_v6_sendmsg is used only in this file,
making it static
The body of "pingv6_prot" and "pingv6_protosw" were
moved at the middle of the file, to avoid having to
declare some static prototypes.
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch corrects an oversight in which we were allowing the encap_level
value to pass from the outer headers to the inner headers. As a result we
were incorrectly identifying UDP or GRE tunnels as also making use of ipip
or sit when the second header actually represented a tunnel encapsulated in
either a UDP or GRE tunnel which already had the features masked.
Fixes: 7644345622 ("net: Move GSO csum into SKB_GSO_CB")
Reported-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge third patch-bomb from Andrew Morton:
- more ocfs2 changes
- a few hotfixes
- Andy's compat cleanups
- misc fixes to fatfs, ptrace, coredump, cpumask, creds, eventfd,
panic, ipmi, kgdb, profile, kfifo, ubsan, etc.
- many rapidio updates: fixes, new drivers.
- kcov: kernel code coverage feature. Like gcov, but not
"prohibitively expensive".
- extable code consolidation for various archs
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (81 commits)
ia64/extable: use generic search and sort routines
x86/extable: use generic search and sort routines
s390/extable: use generic search and sort routines
alpha/extable: use generic search and sort routines
kernel/...: convert pr_warning to pr_warn
drivers: dma-coherent: use memset_io for DMA_MEMORY_IO mappings
drivers: dma-coherent: use MEMREMAP_WC for DMA_MEMORY_MAP
memremap: add MEMREMAP_WC flag
memremap: don't modify flags
kernel/signal.c: add compile-time check for __ARCH_SI_PREAMBLE_SIZE
mm/mprotect.c: don't imply PROT_EXEC on non-exec fs
ipc/sem: make semctl setting sempid consistent
ubsan: fix tree-wide -Wmaybe-uninitialized false positives
kfifo: fix sparse complaints
scripts/gdb: account for changes in module data structure
scripts/gdb: add cmdline reader command
scripts/gdb: add version command
kernel: add kcov code coverage
profile: hide unused functions when !CONFIG_PROC_FS
hpwdt: use nmi_panic() when kernel panics in NMI handler
...
- A few minor core fixups needed for the next patch series
- The IB SRIOV series. This has bounced around for several versions.
Of note is the fact that the first patch in this series effects
the net core. It was directed to netdev and DaveM for each iteration
of the series (three versions total). Dave did not object, but did
not respond either. I've taken this as permission to move forward
with the series.
- The new Intel X722 iWARP driver
- A huge set of updates to the Intel hfi1 driver. Of particular interest
here is that we have left the driver in staging since it still has an
API that people object to. Intel is working on a fix, but getting
these patches in now helps keep me sane as the upstream and Intel's
trees were over 300 patches apart.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJW8HR9AAoJELgmozMOVy/dDYMP+wSBALhIdV/pqVzdLCGfIUbK
H5agonm/3b/Oj74W30w2JYqXBFfZC2LGVJy6OwocJ3wK04v/KfZbA9G+QsOuh2hQ
Db+tFn1eoltvzrcx3k/a7x6zHGC4YyxyH9OX2B3QfRsNHeE7PG9KGp5dfEs2OH1r
WGp3jMLAsHf7o8uKpa0jyTEUEErATaTlG+YoaJ+BGHwurgCNy8ni+wAn+EAFiJ3w
iEJhcXB6KY69vkLsrLYuT9xxJn4udFJ3QEk8xdPkpLKsu+6Ue5i/eNQ19VfbpZgR
c6fTc8genfIv5S+fis+0P44u1oA7Kl2JT6IZYLi35gJ60ZmxTD+7GruWP3xX/wJ2
zuR3sTj5fjcFWenk087RSIU/EK87ONPD4g9QPdZpf3FtgleTVKk3YDlqwjqf8pgv
cO6gQ1BcOBnixJvhjNFiX1c2hvNhb3CkgObly1JBwhcCzZhLkV7BNFPbZuDHAeAx
VqzNEUse4hupkgiiuiGgudcJ4fsSxMW37kyfX9QC/qyk6YVuUDbrekcWI+MAKot7
5e5dHqFExpbn1Zgvc8yfvh88H2MUQAgaYwjanWF/qpppOPRd01nTisVQIOJn7s5C
arcWzvocpQe0GL2UsvDoWwAABXznL3bnnAoCyTWOES2RhOOcw0Ibw46Jl8FQ8gnl
2IRxQ+ltNEscb2cwi5wE
=t2Ko
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull more rdma updates from Doug Ledford:
"Round two of 4.6 merge window patches.
This is a monster pull request. I held off on the hfi1 driver updates
(the hfi1 driver is intimately tied to the qib driver and the new
rdmavt software library that was created to help both of them) in my
first pull request. The hfi1/qib/rdmavt update is probably 90% of
this pull request. The hfi1 driver is being left in staging so that
it can be fixed up in regards to the API that Al and yourself didn't
like. Intel has agreed to do the work, but in the meantime, this
clears out 300+ patches in the backlog queue and brings my tree and
their tree closer to sync.
This also includes about 10 patches to the core and a few to mlx5 to
create an infrastructure for configuring SRIOV ports on IB devices.
That series includes one patch to the net core that we sent to netdev@
and Dave Miller with each of the three revisions to the series. We
didn't get any response to the patch, so we took that as implicit
approval.
Finally, this series includes Intel's new iWARP driver for their x722
cards. It's not nearly the beast as the hfi1 driver. It also has a
linux-next merge issue, but that has been resolved and it now passes
just fine.
Summary:
- A few minor core fixups needed for the next patch series
- The IB SRIOV series. This has bounced around for several versions.
Of note is the fact that the first patch in this series effects the
net core. It was directed to netdev and DaveM for each iteration
of the series (three versions total). Dave did not object, but did
not respond either. I've taken this as permission to move forward
with the series.
- The new Intel X722 iWARP driver
- A huge set of updates to the Intel hfi1 driver. Of particular
interest here is that we have left the driver in staging since it
still has an API that people object to. Intel is working on a fix,
but getting these patches in now helps keep me sane as the upstream
and Intel's trees were over 300 patches apart"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (362 commits)
IB/ipoib: Allow mcast packets from other VFs
IB/mlx5: Implement callbacks for manipulating VFs
net/mlx5_core: Implement modify HCA vport command
net/mlx5_core: Add VF param when querying vport counter
IB/ipoib: Add ndo operations for configuring VFs
IB/core: Add interfaces to control VF attributes
IB/core: Support accessing SA in virtualized environment
IB/core: Add subnet prefix to port info
IB/mlx5: Fix decision on using MAD_IFC
net/core: Add support for configuring VF GUIDs
IB/{core, ulp} Support above 32 possible device capability flags
IB/core: Replace setting the zero values in ib_uverbs_ex_query_device
net/mlx5_core: Introduce offload arithmetic hardware capabilities
net/mlx5_core: Refactor device capability function
net/mlx5_core: Fix caching ATOMIC endian mode capability
ib_srpt: fix a WARN_ON() message
i40iw: Replace the obsolete crypto hash interface with shash
IB/hfi1: Add SDMA cache eviction algorithm
IB/hfi1: Switch to using the pin query function
IB/hfi1: Specify mm when releasing pages
...
The code wants to prevent compat code from receiving messages. Use
in_compat_syscall for this.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
SCTP unfortunately has a different ABI for SCTP_SOCKOPT_CONNECTX3 for
32-bit and 64-bit callers. Use in_compat_syscall to correctly
distinguish them on all architectures.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a thread is prepared for waiting by calling prepare_to_wait, sleeping
is not allowed until either the wait has taken place or finish_wait has
been called. The existing code in af_vsock imposed unnecessary no-sleep
assumptions to a broad list of backend functions.
This patch shrinks the influence of prepare_to_wait to the area where it
is strictly needed, therefore relaxing the no-sleep restriction there.
Signed-off-by: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 5988818008 ("vsock: Fix
blocking ops call in prepare_to_wait")
The commit reverted with this patch caused us to potentially miss wakeups.
Since the condition is not checked between the prepare_to_wait and the
schedule(), if a wakeup happens after the condition is checked but before
the sleep happens, we will miss it. ( A description of the problem can be
found here: http://www.makelinux.net/ldd3/chp-6-sect-2 ).
By reverting the patch, the behaviour is still incorrect (since we
shouldn't sleep between the prepare_to_wait and the schedule) but at least
it will not miss wakeups.
The next patch in the series actually fixes the behaviour.
Signed-off-by: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Highlights include:
Features:
- Add support for multiple NFSv4.1 callbacks in flight
- Initial patchset for RPC multipath support
- Adapt RPC/RDMA to use the new completion queue API
Bugfixes and cleanups:
- nfs4: nfs4_ff_layout_prepare_ds should return NULL if connection failed
- Cleanups to remove nfs_inode_dio_wait and nfs4_file_fsync
- Fix RPC/RDMA credit accounting
- Properly handle RDMA_ERROR replies
- xprtrdma: Do not wait if ib_post_send() fails
- xprtrdma: Segment head and tail XDR buffers on page boundaries
- xprtrdma cleanups for dprintk, physical_op_map and unused macros
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJW8Y7MAAoJEGcL54qWCgDyMsMP+we8JSgfVqI5X1lKpU9aPWkI
D912ybtV58Kv0elKwYvQMqm+mRvdMNz1hZNJa4sAEaPVBOGfFjyZLy3xlDlr0HTf
M+Juh0FNLTcUh1obxJamjsbpNxfg4b6f/Z29KWRzahv/MlpMJVS3hLjpAEzCcTYr
WfOOovV6mragtsBINegGl/6jk/x2D22JDnKcTU+8ltVZGJtZe+HoqTFhUOrLO5qm
wR3YO22fbOuiZxCPoST06kMNiksYnYXnOju8RwlKwFYq3bWke0jWstQtIC4vKs6K
4u5o74aTBL5zMkJPnJuIfN2if4LJPptSr1n7pItbv3MLmgY1mWjE6N2BATpijfhQ
p+Gt/GHTAvswuWrmwySZKLj/Q8EkBuw4ohPFwLQ9eFHl2USoV3G9KQw7H0odR4d1
IyQPCag+suN2lWBreFkPIV48dZyeCVk6JmJmy3SN+d0L1t3jd6gwSO2UBgG7S2Gd
qVbdxYRiU/zYP6E5wFouLhIc1beSfe4vnJqvnuWZrIId+haTE2+OLi7772WGIkSe
xoZVTg7AX4Wu79ZyWoH+e9FnDvEsRkRVv7HQfpsMq2gynBWj70/KemEoeZnjqWaB
UOWcH8/vNLrnwlXTh0VHG6I8t3s0EXgqQB4//tYRLI42oIj35W2pIMnjYt52DeVB
Mo5mbYghtR9bgeoRQ6V4
=kC3t
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.6-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Features:
- Add support for multiple NFSv4.1 callbacks in flight
- Initial patchset for RPC multipath support
- Adapt RPC/RDMA to use the new completion queue API
Bugfixes and cleanups:
- nfs4: nfs4_ff_layout_prepare_ds should return NULL if connection failed
- Cleanups to remove nfs_inode_dio_wait and nfs4_file_fsync
- Fix RPC/RDMA credit accounting
- Properly handle RDMA_ERROR replies
- xprtrdma: Do not wait if ib_post_send() fails
- xprtrdma: Segment head and tail XDR buffers on page boundaries
- xprtrdma cleanups for dprintk, physical_op_map and unused macros"
* tag 'nfs-for-4.6-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (35 commits)
nfs/blocklayout: make sure making a aligned read request
nfs4: nfs4_ff_layout_prepare_ds should return NULL if connection failed
nfs: remove nfs_inode_dio_wait
nfs: remove nfs4_file_fsync
xprtrdma: Use new CQ API for RPC-over-RDMA client send CQs
xprtrdma: Use an anonymous union in struct rpcrdma_mw
xprtrdma: Use new CQ API for RPC-over-RDMA client receive CQs
xprtrdma: Serialize credit accounting again
xprtrdma: Properly handle RDMA_ERROR replies
rpcrdma: Add RPCRDMA_HDRLEN_ERR
xprtrdma: Do not wait if ib_post_send() fails
xprtrdma: Segment head and tail XDR buffers on page boundaries
xprtrdma: Clean up dprintk format string containing a newline
xprtrdma: Clean up physical_op_map()
xprtrdma: Clean up unused RPCRDMA_INLINE_PAD_THRESH macro
NFS add callback_ops to nfs4_proc_bind_conn_to_session_callback
pnfs/NFSv4.1: Add multipath capabilities to pNFS flexfiles servers over NFSv3
SUNRPC: Allow addition of new transports to a struct rpc_clnt
NFSv4.1: nfs4_proc_bind_conn_to_session must iterate over all connections
SUNRPC: Make NFS swap work with multipath
...
Field fl4.flowi4_flags is not initialized in fib_compute_spec_dst()
before calling fib_lookup(), which means fib_table_lookup() is
using non-deterministic data at this line:
if (!(flp->flowi4_flags & FLOWI_FLAG_SKIP_NH_OIF)) {
Fix by initializing the entire fl4 structure, which will prevent
similar issues as fields are added in the future by ensuring that
all fields are initialized to zero unless explicitly initialized
to another value.
Fixes: 58189ca7b2 ("net: Fix vti use case with oif in dst lookups")
Suggested-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: Lance Richardson <lrichard@redhat.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, ingress ipv4 broadcast datagrams are dropped since,
in udp_v4_early_demux(), ip_check_mc_rcu() is invoked even on
bcast packets.
This patch addresses the issue, invoking ip_check_mc_rcu()
only for mcast packets.
Fixes: 6e54030932 ("ipv4/udp: Verify multicast group is ours in upd_v4_early_demux()")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
By returning -ENOIOCTLCMD, sock_do_ioctl() falls back to calling
dev_ioctl(), which provides support for NIC driver ioctls, which
includes ethtool support. This is similar to the way ioctls are handled
in udp.c or tcp.c.
This removes the requirement that ethtool for example be tied to the
support of a specific L3 protocol (ethtool uses an AF_INET socket
today).
Signed-off-by: David Decotigny <decot@googlers.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The millisecond timestamps returned by the function is
converted to network byte order by making a call to htons().
htons() only returns __be16 while __be32 is required here.
This was identified by the sparse warning from the buildbot:
net/ipv4/af_inet.c:1405:16: sparse: incorrect type in return
expression (different base types)
net/ipv4/af_inet.c:1405:16: expected restricted __be32
net/ipv4/af_inet.c:1405:16: got restricted __be16 [usertype] <noident>
Change the function to use htonl() to return the correct __be32 type
instead so that the millisecond value doesn't get truncated.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: James Morris <jmorris@namei.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Fixes: 822c868532 ("net: ipv4: Convert IP network timestamps to be y2038 safe")
Reported-by: Fengguang Wu <fengguang.wu@intel.com> [0-day test robot]
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 911362c70d ("net: add dst_cache support") added a new
kconfig option that gets selected by other networking options.
It seems the intent wasn't to offer this as a user-selectable
option given the lack of help text, so this patch converts it
to a silent option.
Signed-off-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add two new NLAs to support configuration of Infiniband node or port
GUIDs. New applications can choose to use this interface to configure
GUIDs with iproute2 with commands such as:
ip link set dev ib0 vf 0 node_guid 00:02:c9:03:00:21:6e:70
ip link set dev ib0 vf 0 port_guid 00:02:c9:03:00:21:6e:78
A new ndo, ndo_sef_vf_guid is introduced to notify the net device of the
request to change the GUID.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
It can be useful to lower max_gso_segs on NIC with very low
number of TX descriptors like bcmgenet.
However, this is defeated by bridge since it does not propagate
the lower value of max_gso_segs and max_gso_size.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Petri Gynther <pgynther@google.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
It can be useful to report dev->gso_max_segs and dev->gso_max_size
so that "ip -d link" can display them to help debugging.
For the moment, these attributes are read-only.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Petri Gynther <pgynther@google.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the function dev_get_phys_port_name was added it missed a description
for it's len argument. Adding it.
Fixes: db24a9044e ("net: add support for phys_port_name")
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull x86 protection key support from Ingo Molnar:
"This tree adds support for a new memory protection hardware feature
that is available in upcoming Intel CPUs: 'protection keys' (pkeys).
There's a background article at LWN.net:
https://lwn.net/Articles/643797/
The gist is that protection keys allow the encoding of
user-controllable permission masks in the pte. So instead of having a
fixed protection mask in the pte (which needs a system call to change
and works on a per page basis), the user can map a (handful of)
protection mask variants and can change the masks runtime relatively
cheaply, without having to change every single page in the affected
virtual memory range.
This allows the dynamic switching of the protection bits of large
amounts of virtual memory, via user-space instructions. It also
allows more precise control of MMU permission bits: for example the
executable bit is separate from the read bit (see more about that
below).
This tree adds the MM infrastructure and low level x86 glue needed for
that, plus it adds a high level API to make use of protection keys -
if a user-space application calls:
mmap(..., PROT_EXEC);
or
mprotect(ptr, sz, PROT_EXEC);
(note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice
this special case, and will set a special protection key on this
memory range. It also sets the appropriate bits in the Protection
Keys User Rights (PKRU) register so that the memory becomes unreadable
and unwritable.
So using protection keys the kernel is able to implement 'true'
PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies
PROT_READ as well. Unreadable executable mappings have security
advantages: they cannot be read via information leaks to figure out
ASLR details, nor can they be scanned for ROP gadgets - and they
cannot be used by exploits for data purposes either.
We know about no user-space code that relies on pure PROT_EXEC
mappings today, but binary loaders could start making use of this new
feature to map binaries and libraries in a more secure fashion.
There is other pending pkeys work that offers more high level system
call APIs to manage protection keys - but those are not part of this
pull request.
Right now there's a Kconfig that controls this feature
(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled
(like most x86 CPU feature enablement code that has no runtime
overhead), but it's not user-configurable at the moment. If there's
any serious problem with this then we can make it configurable and/or
flip the default"
* 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
x86/mm/pkeys: Fix mismerge of protection keys CPUID bits
mm/pkeys: Fix siginfo ABI breakage caused by new u64 field
x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA
mm/core, x86/mm/pkeys: Add execute-only protection keys support
x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags
x86/mm/pkeys: Allow kernel to modify user pkey rights register
x86/fpu: Allow setting of XSAVE state
x86/mm: Factor out LDT init from context init
mm/core, x86/mm/pkeys: Add arch_validate_pkey()
mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits()
x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU
x86/mm/pkeys: Add Kconfig prompt to existing config option
x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps
x86/mm/pkeys: Dump PKRU with other kernel registers
mm/core, x86/mm/pkeys: Differentiate instruction fetches
x86/mm/pkeys: Optimize fault handling in access_error()
mm/core: Do not enforce PKEY permissions on remote mm access
um, pkeys: Add UML arch_*_access_permitted() methods
mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys
x86/mm/gup: Simplify get_user_pages() PTE bit handling
...
Commit 22e0f8b932 ("net: sched: make bstats per cpu and estimator RCU safe")
added the argument cpu_bstats to functions gen_new_estimator and
gen_replace_estimator and now the descriptions of these are missing for the
documentation. Adding them.
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Function gnet_stats_copy_basic is missing the description of the cpu
argument in the documentation. Adding it.
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a packet is either locally encapsulated or processed through GRO
it is marked with the offloads that it requires. However, when it is
decapsulated these tunnel offload indications are not removed. This
means that if we receive an encapsulated TCP packet, aggregate it with
GRO, decapsulate, and retransmit the resulting frame on a NIC that does
not support encapsulation, we won't be able to take advantage of hardware
offloads even though it is just a simple TCP packet at this point.
This fixes the problem by stripping off encapsulation offload indications
when packets are decapsulated.
The performance impacts of this bug are significant. In a test where a
Geneve encapsulated TCP stream is sent to a hypervisor, GRO'ed, decapsulated,
and bridged to a VM performance is improved by 60% (5Gbps->8Gbps) as a
result of avoiding unnecessary segmentation at the VM tap interface.
Reported-by: Ramu Ramamurthy <sramamur@linux.vnet.ibm.com>
Fixes: 68c33163 ("v4 GRE: Add TCP segmentation offload for GRE")
Signed-off-by: Jesse Gross <jesse@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When drivers express support for TSO of encapsulated packets, they
only mean that they can do it for one layer of encapsulation.
Supporting additional levels would mean updating, at a minimum,
more IP length fields and they are unaware of this.
No encapsulation device expresses support for handling offloaded
encapsulated packets, so we won't generate these types of frames
in the transmit path. However, GRO doesn't have a check for
multiple levels of encapsulation and will attempt to build them.
UDP tunnel GRO actually does prevent this situation but it only
handles multiple UDP tunnels stacked on top of each other. This
generalizes that solution to prevent any kind of tunnel stacking
that would cause problems.
Fixes: bf5a755f ("net-gre-gro: Add GRE support to the GRO stack")
Signed-off-by: Jesse Gross <jesse@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipip encapsulated packets can be merged together by GRO but the result
does not have the proper GSO type set or even marked as being
encapsulated at all. Later retransmission of these packets will likely
fail if the device does not support ipip offloads. This is similar to
the issue resolved in IPv6 sit in feec0cb3
("ipv6: gro: support sit protocol").
Reported-by: Patrick Boutilier <boutilpj@ednet.ns.ca>
Fixes: 9667e9bb ("ipip: Add gro callbacks to ipip offload")
Tested-by: Patrick Boutilier <boutilpj@ednet.ns.ca>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jesse Gross <jesse@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
SCTP is a protocol that is aligned to a word (4 bytes). Thus using bare
MTU can sometimes return values that are not aligned, like for loopback,
which is 65536 but ipv4_mtu() limits that to 65535. This mis-alignment
will cause the last non-aligned bytes to never be used and can cause
issues with congestion control.
So it's better to just consider a lower MTU and keep congestion control
calcs saner as they are based on PMTU.
Same applies to icmp frag needed messages, which is also fixed by this
patch.
One other effect of this is the inability to send MTU-sized packet
without queueing or fragmentation and without hitting Nagle. As the
check performed at sctp_packet_can_append_data():
if (chunk->skb->len + q->out_qlen >= transport->pathmtu - packet->overhead)
/* Enough data queued to fill a packet */
return SCTP_XMIT_OK;
with the above example of MTU, if there are no other messages queued,
one cannot send a packet that just fits one packet (65532 bytes) and
without causing DATA chunk fragmentation or a delay.
v2:
- Added WORD_TRUNC macro
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, if a chunk is scheduled to be sent through a transport that
is currently unconfirmed, it will be leaked as it is dequeued from outq
and is not re-queued nor freed.
As I'm not aware of any situation that may lead to this situation, I'm
fixing this by freeing the chunk and also logging a trace so that we can
fix the other bug if it ever happens.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The SACK can be lost pretty much elsewhere, but if its allocation fail,
we know we are not sending it, so it is better to revert a_rwnd to its
previous value as this may give it a chance to issue a window update
later.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP protocol is still used these days, and TCP uses
clones in its transmit path. We can not optimize linux
stack assuming it is mostly used in routers, or that TCP
is dead.
Fixes: 795bb1c00d ("net: bulk free infrastructure for NAPI context, use napi_consume_skb")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking updates from David Miller:
"Highlights:
1) Support more Realtek wireless chips, from Jes Sorenson.
2) New BPF types for per-cpu hash and arrap maps, from Alexei
Starovoitov.
3) Make several TCP sysctls per-namespace, from Nikolay Borisov.
4) Allow the use of SO_REUSEPORT in order to do per-thread processing
of incoming TCP/UDP connections. The muxing can be done using a
BPF program which hashes the incoming packet. From Craig Gallek.
5) Add a multiplexer for TCP streams, to provide a messaged based
interface. BPF programs can be used to determine the message
boundaries. From Tom Herbert.
6) Add 802.1AE MACSEC support, from Sabrina Dubroca.
7) Avoid factorial complexity when taking down an inetdev interface
with lots of configured addresses. We were doing things like
traversing the entire address less for each address removed, and
flushing the entire netfilter conntrack table for every address as
well.
8) Add and use SKB bulk free infrastructure, from Jesper Brouer.
9) Allow offloading u32 classifiers to hardware, and implement for
ixgbe, from John Fastabend.
10) Allow configuring IRQ coalescing parameters on a per-queue basis,
from Kan Liang.
11) Extend ethtool so that larger link mode masks can be supported.
From David Decotigny.
12) Introduce devlink, which can be used to configure port link types
(ethernet vs Infiniband, etc.), port splitting, and switch device
level attributes as a whole. From Jiri Pirko.
13) Hardware offload support for flower classifiers, from Amir Vadai.
14) Add "Local Checksum Offload". Basically, for a tunneled packet
the checksum of the outer header is 'constant' (because with the
checksum field filled into the inner protocol header, the payload
of the outer frame checksums to 'zero'), and we can take advantage
of that in various ways. From Edward Cree"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1548 commits)
bonding: fix bond_get_stats()
net: bcmgenet: fix dma api length mismatch
net/mlx4_core: Fix backward compatibility on VFs
phy: mdio-thunder: Fix some Kconfig typos
lan78xx: add ndo_get_stats64
lan78xx: handle statistics counter rollover
RDS: TCP: Remove unused constant
RDS: TCP: Add sysctl tunables for sndbuf/rcvbuf on rds-tcp socket
net: smc911x: convert pxa dma to dmaengine
team: remove duplicate set of flag IFF_MULTICAST
bonding: remove duplicate set of flag IFF_MULTICAST
net: fix a comment typo
ethernet: micrel: fix some error codes
ip_tunnels, bpf: define IP_TUNNEL_OPTS_MAX and use it
bpf, dst: add and use dst_tclassid helper
bpf: make skb->tc_classid also readable
net: mvneta: bm: clarify dependencies
cls_bpf: reset class and reuse major in da
ldmvsw: Checkpatch sunvnet.c and sunvnet_common.c
ldmvsw: Add ldmvsw.c driver code
...
Merge second patch-bomb from Andrew Morton:
- a couple of hotfixes
- the rest of MM
- a new timer slack control in procfs
- a couple of procfs fixes
- a few misc things
- some printk tweaks
- lib/ updates, notably to radix-tree.
- add my and Nick Piggin's old userspace radix-tree test harness to
tools/testing/radix-tree/. Matthew said it was a godsend during the
radix-tree work he did.
- a few code-size improvements, switching to __always_inline where gcc
screwed up.
- partially implement character sets in sscanf
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (118 commits)
sscanf: implement basic character sets
lib/bug.c: use common WARN helper
param: convert some "on"/"off" users to strtobool
lib: add "on"/"off" support to kstrtobool
lib: update single-char callers of strtobool()
lib: move strtobool() to kstrtobool()
include/linux/unaligned: force inlining of byteswap operations
include/uapi/linux/byteorder, swab: force inlining of some byteswap operations
include/asm-generic/atomic-long.h: force inlining of some atomic_long operations
usb: common: convert to use match_string() helper
ide: hpt366: convert to use match_string() helper
ata: hpt366: convert to use match_string() helper
power: ab8500: convert to use match_string() helper
power: charger_manager: convert to use match_string() helper
drm/edid: convert to use match_string() helper
pinctrl: convert to use match_string() helper
device property: convert to use match_string() helper
lib/string: introduce match_string() helper
radix-tree tests: add test for radix_tree_iter_next
radix-tree tests: add regression3 test
...
RDS_TCP_DEFAULT_BUFSIZE has been unused since commit 1edd6a14d2
("RDS-TCP: Do not bloat sndbuf/rcvbuf in rds_tcp_tune").
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add per-net sysctl tunables to set the size of sndbuf and
rcvbuf on the kernel tcp socket.
The tunables are added at /proc/sys/net/rds/tcp/rds_tcp_sndbuf
and /proc/sys/net/rds/tcp/rds_tcp_rcvbuf.
These values must be set before accept() or connect(),
and there may be an arbitrary number of existing rds-tcp
sockets when the tunable is modified. To make sure that all
connections in the netns pick up the same value for the tunable,
we reset existing rds-tcp connections in the netns, so that
they can reconnect with the new parameters.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
eBPF defines this as BPF_TUNLEN_MAX and OVS just uses the hard-coded
value inside struct sw_flow_key. Thus, add and use IP_TUNNEL_OPTS_MAX
for this, which makes the code a bit more generic and allows to remove
BPF_TUNLEN_MAX from eBPF code.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can just add a small helper dst_tclassid() for retrieving the
dst->tclassid value. It makes the code a bit better in that we can
get rid of the ifdef from filter.c by moving this into the header.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, the tc_classid from eBPF skb context is write-only, but there's
no good reason for tc programs to limit it to write-only. For example,
it can be used to transfer its state via tail calls where the resulting
tc_classid gets filled gradually.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are two issues with the current code. First one is that we need
to set res->class to 0 in case we use non-default classid matching.
This is important for the case where cls_bpf was initially set up with
an optional binding to a default class with tcf_bind_filter(), where
the underlying qdisc implements bind_tcf() that fills res->class and
tests for it later on when doing the classification. Convention for
these cases is that after tc_classify() was called, such qdiscs (atm,
drr, qfq, cbq, hfsc, htb) first test class, and if 0, then they lookup
based on classid.
Second, there's a bug with da mode, where res->classid is only assigned
a 16 bit minor, but it needs to expand to the full 32 bit major/minor
combination instead, therefore we need to expand with the bound major.
This is fine as classes belonging to a classful qdisc must share the
same major.
Fixes: 045efa82ff ("cls_bpf: introduce integrated actions")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently output of MPLS packets on tunnel vports is not allowed by Open
vSwitch. This is because historically encapsulation was done in such a way
that the inner_protocol field of the skb needed to hold the inner protocol
for both MPLS and tunnel encapsulation in order for GSO segmentation to be
performed correctly.
Since b2acd1dc39 ("openvswitch: Use regular GRE net_device instead of
vport") Open vSwitch makes use of lwt to output to tunnel netdevs which
perform encapsulation. As no drivers expose support for MPLS offloads this
means that GSO packets are segmented in software by validate_xmit_skb(),
which is called from __dev_queue_xmit(), before tunnel encapsulation occurs.
This means that the inner protocol of MPLS is no longer needed by the time
encapsulation occurs and the contention on the inner_protocol field of the
skb no longer occurs.
Thus it is now safe to output MPLS to tunnel vports.
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Jesse Gross <jesse@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
- cxgb4 updates
- nes updates
- unification of iwarp portmapper code to core
- add drain_cq API
- various ib_core updates
- minor ipoib updates
- minor mlx4 updates
- more significant mlx5 updates (including a minor merge conflict with
net-next tree...merge is simple to resolve and Stephen's resolution was
confirmed by Mellanox)
- trivial net/9p rdma conversion
- ocrdma RoCEv2 update
- srpt updates
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJW6aTEAAoJELgmozMOVy/dlAEQAKgT0VwBi6Zd4PihP2UQgsfH
LUmbGhCzBpcao1eJ7piOOEYQGSb3slN3Cnup4qBJak+y2mhtErxNkLOIhGRrvcHk
XCym7N9uAhp4j++OnUBp6Cpr0hZNmBEBKm6nKqdEcdaxLaVa0ezdcxAOkVlHhZ77
NnhTHvPy8pu4kC8NZCvCIJK+fqW+5Xj+ojAcVKGPV+Y3zf9lfaDCXCSdD2m6+dFX
/KV3V/CNUSdYTWrPZSIDhqoYix2AGl5Fg17mfsgBWQB/T405fiwZkd0FEXkqXDkR
bOhS5PnuCN+ScwsxMDHCbzqtaOb06sKttg9IE3s0qdFpOwGtbyoU+lLUh1qbjKLP
vtEiySZq2Mhlr41ajuUuDSgNbqCTL7+52/HUf8qcjFFiSBlZRaTO8rVJ5tABKRiW
SkxkHbR6orx8okKtaWRskKRtYSNkA2uexdIQ/wzc4fJVqzqJUh6Elcxp3dPq/KSN
lkrYXNJ5X4ux72QfHRobBX1pBjT0P2+avoFri3763k9ZrsWwY9tXgDUB/OdX11IF
gAadgUNw2pHgY10jqCZBOw22F+foB2qx8ZkaNSGYE0h3uQrp+iiCnfeU9rWNCWVv
MelRGpfGa7VF3RTDojc7Dq7JpWRUChMx9BY+XrQPmV08Z+JGoVuRT20Q7twgillz
Yb3aGRKZNtqYehj9fM4n
=kTkT
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull rdma updates from Doug Ledford:
"Initial roundup of 4.6 merge window patches.
This is the first of two pull requests. It is the smaller request,
but touches for more different things (this is everything but what is
in or going into staging). The pull request for the code in
staging/rdma is on hold until after we decide what to do on the
write/writev API issue and may be partially deferred until 4.7 as a
result.
Summary:
- cxgb4 updates
- nes updates
- unification of iwarp portmapper code to core
- add drain_cq API
- various ib_core updates
- minor ipoib updates
- minor mlx4 updates
- more significant mlx5 updates (including a minor merge conflict
with net-next tree...merge is simple to resolve and Stephen's
resolution was confirmed by Mellanox)
- trivial net/9p rdma conversion
- ocrdma RoCEv2 update
- srpt updates"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (85 commits)
iwpm: crash fix for large connections test
iw_cxgb3: support for iWARP port mapping
iw_cxgb4: remove port mapper related code
iw_nes: remove port mapper related code
iwcm: common code for port mapper
net/9p: convert to new CQ API
IB/mlx5: Add support for don't trap rules
net/mlx5_core: Introduce forward to next priority action
net/mlx5_core: Create anchor of last flow table
iser: Accept arbitrary sg lists mapping if the device supports it
mlx5: Add arbitrary sg list support
IB/core: Add arbitrary sg_list support
IB/mlx5: Expose correct max_fast_reg_page_list_len
IB/mlx5: Make coding style more consistent
IB/mlx5: Convert UMR CQ to new CQ API
IB/ocrdma: Skip using unneeded intermediate variable
IB/ocrdma: Skip using unneeded intermediate variable
IB/ocrdma: Delete unnecessary variable initialisations in 11 functions
IB/core: Documentation fix in the MAD header file
IB/core: trivial prink cleanup.
...
Pull trivial tree updates from Jiri Kosina.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
drivers/rtc: broken link fix
drm/i915 Fix typos in i915_gem_fence.c
Docs: fix missing word in REPORTING-BUGS
lib+mm: fix few spelling mistakes
MAINTAINERS: add git URL for APM driver
treewide: Fix typo in printk
Now SYN_RECV request sockets are installed in ehash table, an ICMP
handler can find a request socket while another cpu handles an incoming
packet transforming this SYN_RECV request socket into an ESTABLISHED
socket.
We need to remove the now obsolete WARN_ON(req->sk), since req->sk
is set when a new child is created and added into listener accept queue.
If this race happens, the ICMP will do nothing special.
Fixes: 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Ben Lazarus <blazarus@google.com>
Reported-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
vlan drivers lack proper propagation of gso_max_segs from
lower device.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The success of CMA allocation largely depends on the success of
migration and key factor of it is page reference count. Until now, page
reference is manipulated by direct calling atomic functions so we cannot
follow up who and where manipulate it. Then, it is hard to find actual
reason of CMA allocation failure. CMA allocation should be guaranteed
to succeed so finding offending place is really important.
In this patch, call sites where page reference is manipulated are
converted to introduced wrapper function. This is preparation step to
add tracepoint to each page reference manipulation function. With this
facility, we can easily find reason of CMA allocation failure. There is
no functional change in this patch.
In addition, this patch also converts reference read sites. It will
help a second step that renames page._count to something else and
prevents later attempt to direct access to it (Suggested by Andrew).
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Here's the big tty/serial driver pull request for 4.6-rc1.
Lots of changes in here, Peter has been on a tear again, with lots of
refactoring and bugs fixes, many thanks to the great work he has been
doing. Lots of driver updates and fixes as well, full details in the
shortlog.
All have been in linux-next for a while with no reported issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iEYEABECAAYFAlbp8z8ACgkQMUfUDdst+ym1vwCgnOOCORaZyeQ4QrcxPAK5pHFn
VrMAoNHvDgNYtG+Hmzv25Lgp3HnysPin
=MLRG
-----END PGP SIGNATURE-----
Merge tag 'tty-4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial updates from Greg KH:
"Here's the big tty/serial driver pull request for 4.6-rc1.
Lots of changes in here, Peter has been on a tear again, with lots of
refactoring and bugs fixes, many thanks to the great work he has been
doing. Lots of driver updates and fixes as well, full details in the
shortlog.
All have been in linux-next for a while with no reported issues"
* tag 'tty-4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (220 commits)
serial: 8250: describe CONFIG_SERIAL_8250_RSA
serial: samsung: optimize UART rx fifo access routine
serial: pl011: add mark/space parity support
serial: sa1100: make sa1100_register_uart_fns a function
tty: serial: 8250: add MOXA Smartio MUE boards support
serial: 8250: convert drivers to use up_to_u8250p()
serial: 8250/mediatek: fix building with SERIAL_8250=m
serial: 8250/ingenic: fix building with SERIAL_8250=m
serial: 8250/uniphier: fix modular build
Revert "drivers/tty/serial: make 8250/8250_ingenic.c explicitly non-modular"
Revert "drivers/tty/serial: make 8250/8250_mtk.c explicitly non-modular"
serial: mvebu-uart: initial support for Armada-3700 serial port
serial: mctrl_gpio: Add missing module license
serial: ifx6x60: avoid uninitialized variable use
tty/serial: at91: fix bad offset for UART timeout register
tty/serial: at91: restore dynamic driver binding
serial: 8250: Add hardware dependency to RT288X option
TTY, devpts: document pty count limiting
tty: goldfish: support platform_device with id -1
drivers: tty: goldfish: Add device tree bindings
...
sunrpc_cache_pipe_upcall() can detect a race if CACHE_PENDING is no longer
set. In this case it aborts the queuing of the upcall.
However it has already taken a new counted reference on "h" and
doesn't "put" it, even though it frees the data structure holding the reference.
So let's delay the "cache_get" until we know we need it.
Fixes: f9e1aedc6c ("sunrpc/cache: remove races with queuing an upcall.")
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Pull crypto update from Herbert Xu:
"Here is the crypto update for 4.6:
API:
- Convert remaining crypto_hash users to shash or ahash, also convert
blkcipher/ablkcipher users to skcipher.
- Remove crypto_hash interface.
- Remove crypto_pcomp interface.
- Add crypto engine for async cipher drivers.
- Add akcipher documentation.
- Add skcipher documentation.
Algorithms:
- Rename crypto/crc32 to avoid name clash with lib/crc32.
- Fix bug in keywrap where we zero the wrong pointer.
Drivers:
- Support T5/M5, T7/M7 SPARC CPUs in n2 hwrng driver.
- Add PIC32 hwrng driver.
- Support BCM6368 in bcm63xx hwrng driver.
- Pack structs for 32-bit compat users in qat.
- Use crypto engine in omap-aes.
- Add support for sama5d2x SoCs in atmel-sha.
- Make atmel-sha available again.
- Make sahara hashing available again.
- Make ccp hashing available again.
- Make sha1-mb available again.
- Add support for multiple devices in ccp.
- Improve DMA performance in caam.
- Add hashing support to rockchip"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (116 commits)
crypto: qat - remove redundant arbiter configuration
crypto: ux500 - fix checks of error code returned by devm_ioremap_resource()
crypto: atmel - fix checks of error code returned by devm_ioremap_resource()
crypto: qat - Change the definition of icp_qat_uof_regtype
hwrng: exynos - use __maybe_unused to hide pm functions
crypto: ccp - Add abstraction for device-specific calls
crypto: ccp - CCP versioning support
crypto: ccp - Support for multiple CCPs
crypto: ccp - Remove check for x86 family and model
crypto: ccp - memset request context to zero during import
lib/mpi: use "static inline" instead of "extern inline"
lib/mpi: avoid assembler warning
hwrng: bcm63xx - fix non device tree compatibility
crypto: testmgr - allow rfc3686 aes-ctr variants in fips mode.
crypto: qat - The AE id should be less than the maximal AE number
lib/mpi: Endianness fix
crypto: rockchip - add hash support for crypto engine in rk3288
crypto: xts - fix compile errors
crypto: doc - add skcipher API documentation
crypto: doc - update AEAD AD handling
...
The inner headers are invalid after a xfrm transformation.
So reset the skb encapsulation field to ensure nobody tries
to access the inner headers.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
We can hit an OOM condition if we are under presure because
we can not free the entries in gc_list fast enough. So add
a counter for the not yet freed entries in the gc_list and
refuse new allocations if the value is too high.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
When the ETHTOOL_GLINKSETTINGS implementation finds that userland is
using the wrong number of words of link mode bitmaps (or is trying to
find out the right numbers) it sets the cmd field to 0 in the response
structure.
This is inconsistent with the implementation of every other ethtool
command, so let's remove that inconsistency before it gets into a
stable release.
Fixes: 3f1ac7a700 ("net: ethtool: add new ETHTOOL_xLINKSETTINGS API")
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
local_bh_disable() + spin_lock() is equivalent to spin_lock_bh(), same for
the unlock/enable case, so replace the calls by the appropriate wrappers.
Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These patches include several bugfixes and cleanups for the NFSoRDMA client.
This includes bugfixes for NFS v4.1, proper RDMA_ERROR handling, and fixes
from the recent workqueue swicchover. These patches also switch xprtrdma to
use the new CQ API
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJW5wl+AAoJENfLVL+wpUDrH7sP/i3lZ7n6Pr+Flrb/Z+ywjmEA
mvZ0u3O3eviFxDqr82J/WL7fgDUGbBd9urtUu5xZGscc0HNs3LQ8izm6Dy3gmrVF
MFh69jNGL5djmfymYMWRbdoKLuOw4V70EBtGnYqH7Bh7gwpiIl3EiVcBBJup/vKV
rgvx6NnSGYpYPYpBFC4Ql4qZx7m5j4cxsThRScbu7wMMjDknls+7ZDM1B5mGO00+
1vYObpYGXOXovGZyHAHspQityWp6jvUcEMnJzWbMUFDqxkOmmGK3t54MfvRXZiFE
vUkgg5nGxhMejEIfMywuf6czKGfXc4HZT2yF9eSZeA4IW+7QkeeXeCcIANBDY6Ga
oKXu4fgM6T7SnCpefwkXRhinwtEM04tGAlxo1X3UcKrMz7Q3di3/NtgaWijfL8gy
9Nd1lt2kqI375h7+OZccURl33lnQBSjO4F2pt/pFk6wYwGh0F8co9bIp6QIEVQ0f
N2l8KU9vgLofhrD9JzZeu1l3+TCdDU9YaJLSjbkZ71BTjNtkNdUcd9Tk+bMSxept
mq7mNKe1oBHAGgR8+7zYBUKEt85rdpNovPoU+Hz/QKbV3zikGSPmL3e9gpEvhmH4
MKD7Vs61Fi8volfv7wHmzZF8Tk68qc1MAZXSbgzVQxwF/uBjqEuSO+n4Kii/gfkG
MJjlzqmqlO01O1fn2XHu
=aHja
-----END PGP SIGNATURE-----
Merge tag 'nfs-rdma-4.6-1' of git://git.linux-nfs.org/projects/anna/nfs-rdma
NFS: NFSoRDMA Client Side Changes
These patches include several bugfixes and cleanups for the NFSoRDMA client.
This includes bugfixes for NFS v4.1, proper RDMA_ERROR handling, and fixes
from the recent workqueue swicchover. These patches also switch xprtrdma to
use the new CQ API
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* tag 'nfs-rdma-4.6-1' of git://git.linux-nfs.org/projects/anna/nfs-rdma: (787 commits)
xprtrdma: Use new CQ API for RPC-over-RDMA client send CQs
xprtrdma: Use an anonymous union in struct rpcrdma_mw
xprtrdma: Use new CQ API for RPC-over-RDMA client receive CQs
xprtrdma: Serialize credit accounting again
xprtrdma: Properly handle RDMA_ERROR replies
rpcrdma: Add RPCRDMA_HDRLEN_ERR
xprtrdma: Do not wait if ib_post_send() fails
xprtrdma: Segment head and tail XDR buffers on page boundaries
xprtrdma: Clean up dprintk format string containing a newline
xprtrdma: Clean up physical_op_map()
xprtrdma: Clean up unused RPCRDMA_INLINE_PAD_THRESH macro
$ make tags
GEN tags
ctags: Warning: drivers/acpi/processor_idle.c:64: null expansion of name pattern "\1"
ctags: Warning: drivers/xen/events/events_2l.c:41: null expansion of name pattern "\1"
ctags: Warning: kernel/locking/lockdep.c:151: null expansion of name pattern "\1"
ctags: Warning: kernel/rcu/rcutorture.c:133: null expansion of name pattern "\1"
ctags: Warning: kernel/rcu/rcutorture.c:135: null expansion of name pattern "\1"
ctags: Warning: kernel/workqueue.c:323: null expansion of name pattern "\1"
ctags: Warning: net/ipv4/syncookies.c:53: null expansion of name pattern "\1"
ctags: Warning: net/ipv6/syncookies.c:44: null expansion of name pattern "\1"
ctags: Warning: net/rds/page.c:45: null expansion of name pattern "\1"
Which are all the result of the DEFINE_PER_CPU pattern:
scripts/tags.sh:200: '/\<DEFINE_PER_CPU([^,]*, *\([[:alnum:]_]*\)/\1/v/'
scripts/tags.sh:201: '/\<DEFINE_PER_CPU_SHARED_ALIGNED([^,]*, *\([[:alnum:]_]*\)/\1/v/'
The below cures them. All except the workqueue one are within reasonable
distance of the 80 char limit. TJ do you have any preference on how to
fix the wq one, or shall we just not care its too long?
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS/OVS updates for net-next
The following patchset contains Netfilter/IPVS fixes and OVS NAT
support, more specifically this batch is composed of:
1) Fix a crash in ipset when performing a parallel flush/dump with
set:list type, from Jozsef Kadlecsik.
2) Make sure NFACCT_FILTER_* netlink attributes are in place before
accessing them, from Phil Turnbull.
3) Check return error code from ip_vs_fill_iph_skb_off() in IPVS SIP
helper, from Arnd Bergmann.
4) Add workaround to IPVS to reschedule existing connections to new
destination server by dropping the packet and wait for retransmission
of TCP syn packet, from Julian Anastasov.
5) Allow connection rescheduling in IPVS when in CLOSE state, also
from Julian.
6) Fix wrong offset of SIP Call-ID in IPVS helper, from Marco Angaroni.
7) Validate IPSET_ATTR_ETHER netlink attribute length, from Jozsef.
8) Check match/targetinfo netlink attribute size in nft_compat,
patch from Florian Westphal.
9) Check for integer overflow on 32-bit systems in x_tables, from
Florian Westphal.
Several patches from Jarno Rajahalme to prepare the introduction of
NAT support to OVS based on the Netfilter infrastructure:
10) Schedule IP_CT_NEW_REPLY definition for removal in
nf_conntrack_common.h.
11) Simplify checksumming recalculation in nf_nat.
12) Add comments to the openvswitch conntrack code, from Jarno.
13) Update the CT state key only after successful nf_conntrack_in()
invocation.
14) Find existing conntrack entry after upcall.
15) Handle NF_REPEAT case due to templates in nf_conntrack_in().
16) Call the conntrack helper functions once the conntrack has been
confirmed.
17) And finally, add the NAT interface to OVS.
The batch closes with:
18) Cleanup to use spin_unlock_wait() instead of
spin_lock()/spin_unlock(), from Nicholas Mc Guire.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The spin_lock()/spin_unlock() is synchronizing on the
nf_conntrack_locks_all_lock which is equivalent to
spin_unlock_wait() but the later should be more efficient.
Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
On loaded TCP servers, looking at millions of sockets can hold
cpu for many seconds, if the lookup condition is very narrow.
(eg : ss dst 1.2.3.4 )
Better add a cond_resched() to allow other processes to access
the cpu.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Extend OVS conntrack interface to cover NAT. New nested
OVS_CT_ATTR_NAT attribute may be used to include NAT with a CT action.
A bare OVS_CT_ATTR_NAT only mangles existing and expected connections.
If OVS_NAT_ATTR_SRC or OVS_NAT_ATTR_DST is included within the nested
attributes, new (non-committed/non-confirmed) connections are mangled
according to the rest of the nested attributes.
The corresponding OVS userspace patch series includes test cases (in
tests/system-traffic.at) that also serve as example uses.
This work extends on a branch by Thomas Graf at
https://github.com/tgraf/ovs/tree/nat.
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
There is no need to help connections that are not confirmed, so we can
delay helping new connections to the time when they are confirmed.
This change is needed for NAT support, and having this as a separate
patch will make the following NAT patch a bit easier to review.
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Repeat the nf_conntrack_in() call when it returns NF_REPEAT. This
avoids dropping a SYN packet re-opening an existing TCP connection.
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add a new function ovs_ct_find_existing() to find an existing
conntrack entry for which this packet was already applied to. This is
only to be called when there is evidence that the packet was already
tracked and committed, but we lost the ct reference due to an
userspace upcall.
ovs_ct_find_existing() is called from skb_nfct_cached(), which can now
hide the fact that the ct reference may have been lost due to an
upcall. This allows ovs_ct_commit() to be simplified.
This patch is needed by later "openvswitch: Interface with NAT" patch,
as we need to be able to pass the packet through NAT using the
original ct reference also after the reference is lost after an
upcall.
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Only a successful nf_conntrack_in() call can effect a connection state
change, so it suffices to update the key only after the
nf_conntrack_in() returns.
This change is needed for the later NAT patches.
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This makes the code easier to understand and the following patches
more focused.
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
NAT checksum recalculation code assumes existence of skb_dst, which
becomes a problem for a later patch in the series ("openvswitch:
Interface with NAT."). Simplify this by removing the check on
skb_dst, as the checksum will be dealt with later in the stack.
Suggested-by: Pravin Shelar <pshelar@nicira.com>
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Remove the definition of IP_CT_NEW_REPLY from the kernel as it does
not make sense. This allows the definition of IP_CT_NUMBER to be
simplified as well.
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Rework the netdev event handler, similar to what the Mellanox Spectrum
driver does, to easily welcome more events later (for example
NETDEV_PRECHANGEUPPER) and use netdev helpers (such as
netif_is_bridge_master).
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
netdev_upper_dev_unlink() which notifies NETDEV_CHANGEUPPER, returns
void, as well as del_nbp(). So there's no advantage to catch an eventual
error from the port_bridge_leave routine at the DSA level.
Make this routine void for the DSA layer and its existing drivers.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename DSA port_join_bridge and port_leave_bridge routines to
respectively port_bridge_join and port_bridge_leave in order to respect
an implicit Port::Bridge namespace.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Zefir Kurtisi reported kernel panic with an openwrt specific patch.
However, it turns out that mainline has a similar bug waiting to happen.
Once NF_HOOK() returns the skb is in undefined state and must not be
used. Moreover, the okfn must consume the skb to support async
processing (NF_QUEUE).
Current okfn in this spot doesn't consume it and caller assumes that
NF_HOOK return value tells us if skb was freed or not, but thats wrong.
It "works" because no in-tree user registers a NFPROTO_BRIDGE hook at
LOCAL_IN that returns STOLEN or NF_QUEUE verdicts.
Once we add NF_QUEUE support for nftables bridge this will break --
NF_QUEUE holds the skb for async processing, caller will erronoulsy
return RX_HANDLER_PASS and on reinject netfilter will access free'd skb.
Fix this by pushing skb up the stack in the okfn instead.
NB: It also seems dubious to use LOCAL_IN while bypassing PRE_ROUTING
completely in this case but this is how its been forever so it seems
preferable to not change this.
Cc: Felix Fietkau <nbd@openwrt.org>
Cc: Zefir Kurtisi <zefir.kurtisi@neratec.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Tested-by: Zefir Kurtisi <zefir.kurtisi@neratec.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth-next 2016-03-12
Here's the last bluetooth-next pull request for the 4.6 kernel.
- New USB ID for AR3012 in btusb
- New BCM2E55 ACPI ID
- Buffer overflow fix for the Add Advertising command
- Support for a new Bluetooth LE limited privacy mode
- Fix for firmware activation in btmrvl_sdio
- Cleanups to mac802154 & 6lowpan code
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The fixed phys delete function simply removed the fixed phy from the
internal linked list and freed the memory. It however did not
unregister the associated phy device. This meant it was still possible
to find the phy device on the mdio bus.
Make fixed_phy_del() an internal function and add a
fixed_phy_unregister() to unregisters the phy device and then uses
fixed_phy_del() to free resources.
Modify DSA to use this new API function, so we don't leak phys.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
All ports types can have a fixed PHY associated with it. Remove the
check which limits removal to only CPU and DSA ports.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The phy is disconnected from the slave in dsa_slave_destroy(). Don't
destroy fixed link phys until after this, since there can be fixed
linked phys connected to ports.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the phy is disconnected, the parent pointer to the netdev it was
attached to is set to NULL. The code then tries to suspend the phy,
but dsa_slave_fixed_link_update needs the parent pointer to determine
which switch the phy is connected to. So it dereferenced a NULL
pointer. Check for this condition.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Calling ib_poll_cq() to sort through WCs during a completion is a
common pattern amongst RDMA consumers. Since commit 14d3a3b249
("IB: add a proper completion queue abstraction"), WC sorting can
be handled by the IB core.
By converting to this new API, xprtrdma is made a better neighbor to
other RDMA consumers, as it allows the core to schedule the delivery
of completions more fairly amongst all active consumers.
Because each ib_cqe carries a pointer to a completion method, the
core can now post its own operations on a consumer's QP, and handle
the completions itself, without changes to the consumer.
Send completions were previously handled entirely in the completion
upcall handler (ie, deferring to a process context is unneeded).
Thus IB_POLL_SOFTIRQ is a direct replacement for the current
xprtrdma send code path.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Calling ib_poll_cq() to sort through WCs during a completion is a
common pattern amongst RDMA consumers. Since commit 14d3a3b249
("IB: add a proper completion queue abstraction"), WC sorting can
be handled by the IB core.
By converting to this new API, xprtrdma is made a better neighbor to
other RDMA consumers, as it allows the core to schedule the delivery
of completions more fairly amongst all active consumers.
Because each ib_cqe carries a pointer to a completion method, the
core can now post its own operations on a consumer's QP, and handle
the completions itself, without changes to the consumer.
xprtrdma's reply processing is already handled in a work queue, but
there is some initial order-dependent processing that is done in the
soft IRQ context before a work item is scheduled.
IB_POLL_SOFTIRQ is a direct replacement for the current xprtrdma
receive code path.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Commit fe97b47cd6 ("xprtrdma: Use workqueue to process RPC/RDMA
replies") replaced the reply tasklet with a workqueue that allows
RPC replies to be processed in parallel. Thus the credit values in
RPC-over-RDMA replies can be applied in a different order than in
which the server sent them.
To fix this, revert commit eba8ff660b ("xprtrdma: Move credit
update to RPC reply handler"). Reverting is done by hand to
accommodate code changes that have occurred since then.
Fixes: fe97b47cd6 ("xprtrdma: Use workqueue to process . . .")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
These are shorter than RPCRDMA_HDRLEN_MIN, and they need to
complete the waiting RPC.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If ib_post_send() in ro_unmap_sync() fails, the WRs have not been
posted, no completions will fire, and wait_for_completion() will
wait forever. Skip the wait in that case.
To ensure the MRs are invalid, disconnect.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
A single memory allocation is used for the pair of buffers wherein
the RPC client builds an RPC call message and decodes its matching
reply. These buffers are sized based on the maximum possible size
of the RPC call and reply messages for the operation in progress.
This means that as the call buffer increases in size, the start of
the reply buffer is pushed farther into the memory allocation.
RPC requests are growing in size. It used to be that both the call
and reply buffers fit inside a single page.
But these days, thanks to NFSv4 (and especially security labels in
NFSv4.2) the maximum call and reply sizes are large. NFSv4.0 OPEN,
for example, now requires a 6KB allocation for a pair of call and
reply buffers, and NFSv4 LOOKUP is not far behind.
As the maximum size of a call increases, the reply buffer is pushed
far enough into the buffer's memory allocation that a page boundary
can appear in the middle of it.
When the maximum possible reply size is larger than the client's
RDMA receive buffers (currently 1KB), the client has to register a
Reply chunk for the server to RDMA Write the reply into.
The logic in rpcrdma_convert_iovs() assumes that xdr_buf head and
tail buffers would always be contained on a single page. It supplies
just one segment for the head and one for the tail.
FMR, for example, registers up to a page boundary (only a portion of
the reply buffer in the OPEN case above). But without additional
segments, it doesn't register the rest of the buffer.
When the server tries to write the OPEN reply, the RDMA Write fails
with a remote access error since the client registered only part of
the Reply chunk.
rpcrdma_convert_iovs() must split the XDR buffer into multiple
segments, each of which are guaranteed not to contain a page
boundary. That way fmr_op_map is given the proper number of segments
to register the whole reply buffer.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
physical_op_unmap{_sync} don't use mr_nsegs, so don't bother to set
it in physical_op_map.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Per RFC4898, they count segments sent/received
containing a positive length data segment (that includes
retransmission segments carrying data). Unlike
tcpi_segs_out/in, tcpi_data_segs_out/in excludes segments
carrying no data (e.g. pure ack).
The patch also updates the segs_in in tcp_fastopen_add_skb()
so that segs_in >= data_segs_in property is kept.
Together with retransmission data, tcpi_data_segs_out
gives a better signal on the rxmit rate.
v6: Rebase on the latest net-next
v5: Eric pointed out that checking skb->len is still needed in
tcp_fastopen_add_skb() because skb can carry a FIN without data.
Hence, instead of open coding segs_in and data_segs_in, tcp_segs_in()
helper is used. Comment is added to the fastopen case to explain why
segs_in has to be reset and tcp_segs_in() has to be called before
__skb_pull().
v4: Add comment to the changes in tcp_fastopen_add_skb()
and also add remark on this case in the commit message.
v3: Add const modifier to the skb parameter in tcp_segs_in()
v2: Rework based on recent fix by Eric:
commit a9d99ce28e ("tcp: fix tcpi_segs_in after connection establishment")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Chris Rapier <rapier@psc.edu>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Marcelo Ricardo Leitner <mleitner@redhat.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
gcc points out code that is not indented the way it is
interpreted:
net/caif/cfpkt_skbuff.c: In function 'cfpkt_setlen':
net/caif/cfpkt_skbuff.c:289:4: error: statement is indented as if it were guarded by... [-Werror=misleading-indentation]
return cfpkt_getlen(pkt);
^~~~~~
net/caif/cfpkt_skbuff.c:286:3: note: ...this 'else' clause, but it is not
else
^~~~
It is clear from the context that not returning here would be
a bug, as we'd end up passing a negative length into a function
that takes a u16 length, so it is not missing curly braces
here, and I'm assuming that the indentation is the only part
that's wrong about it.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The syzkaller fuzzer hit the following use-after-free:
Call Trace:
[<ffffffff8175ea0e>] __asan_report_load8_noabort+0x3e/0x40 mm/kasan/report.c:295
[<ffffffff851cc31a>] __sys_recvmmsg+0x6fa/0x7f0 net/socket.c:2261
[< inline >] SYSC_recvmmsg net/socket.c:2281
[<ffffffff851cc57f>] SyS_recvmmsg+0x16f/0x180 net/socket.c:2270
[<ffffffff86332bb6>] entry_SYSCALL_64_fastpath+0x16/0x7a
arch/x86/entry/entry_64.S:185
And, as Dmitry rightly assessed, that is because we can drop the
reference and then touch it when the underlying recvmsg calls return
some packets and then hit an error, which will make recvmmsg to set
sock->sk->sk_err, oops, fix it.
Reported-and-Tested-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Fixes: a2e2725541 ("net: Introduce recvmmsg socket syscall")
http://lkml.kernel.org/r/20160122211644.GC2470@redhat.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Expand headroom further in order to be able to fit the larger IPv6
header. Prior to this patch this caused a skb under panic for certain
tipc packets when using IPv6 UDP bearer(s).
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This basic implementation allows to share code between driver using
hardware buffer management. As the code is hardware agnostic, there is
few helpers, most of the optimization brought by the an HW BM has to be
done at driver level.
Tested-by: Sebastian Careba <nitroshift@yahoo.com>
Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is possible for tunnels to end up generating IP or IPv6 datagrams that
are larger than 64K and expecting to be segmented. As such we need to deal
with length values greater than 64K. In order to accommodate this we need
to update the code to work with a 32b length value instead of a 16b one.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch updates csum_ipv6_magic so that it correctly recognizes that
protocol is a unsigned 8 bit value.
This will allow us to better understand what limitations may or may not be
present in how we handle the data. For example there are a number of
places that call htonl on the protocol value. This is likely not necessary
and can be replaced with a multiplication by ntohl(1) which will be
converted to a shift by the compiler.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When an inetdev is destroyed, every address assigned to the interface
is removed. And in this scenerio we do two pointless things which can
be very expensive if the number of assigned interfaces is large:
1) Address promotion. We are deleting all addresses, so there is no
point in doing this.
2) A full nf conntrack table purge for every address. We only need to
do this once, as is already caught by the existing
masq_dev_notifier so masq_inet_event() can skip this.
Reported-by: Solar Designer <solar@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tested-by: Cyrill Gorcunov <gorcunov@openvz.org>
This is a very small one this time, with only 5 patches.
There are a couple of big items that could not be merged/finished
on time.
We have:
- 2 LLCP fixes for a race and a potential OOM.
- 2 cleanups for the pn544 and microread drivers.
- 1 Maintainer addition for the s3fwrn5 driver.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJW4ygDAAoJEIqAPN1PVmxKQ5gQAIN5MaT6gcu2/p1kd7FzPKyC
KRb2EU907uVQObNLyZgU6liLYQ2prAvK7Dr9SuL10sl+zJkdU2TUuPfK+G4I50qQ
nY7dHN94S0d1F6vBZVKFslh9+mrJKZCqhyN7WrCdXT2i7W5bYO8lw419nADImAq9
fF4k7Ww+t9ZHGlzYun9WmNl8lGYz27w/sjjqnEbD8G+SZlkrqRNwccILP7rmOso2
orGvzxouGiWmp4XcXno2X3ydfFVTl5LdVeaX6oyKxEBpcikz0QFJUmTcqqy0GqGo
NeAGUip36MaJETXVgqKqeZERrST2Eminps3nTEwftd/ukOE+Sa8tX3spe2seN7mm
o9ZBk8a0VChlhfGmIqnYZfksRyacEmVaLFV2W5EQdumL+G+jelCmpPgQKo0ZxEEX
WTYULG5JiSIu34RjtTAP9tOPQ9XTV9hcArRYR355aki5EXGh/ntY1eVtV2nCSHjA
vNZ0hpJNwfcx/0UlVULEJqxwDYYtl/QzJFTQO4l0zFYgf5SK9DXbOGDcZfoKgFf8
7jzLn/hDzHNjCPocHDkXOq50H/pPHwdV65U4hPMnemcGVuVrNzHr5CyKBDgj7nXN
NYP3S6fphg8W/DIlQFaTRAuBCqanZE5DIqFUgptSSdAmBLKagZKkaypUjCscb7Ee
GinQKHm7YgZWG+YeAs4O
=W1gM
-----END PGP SIGNATURE-----
Merge tag 'nfc-next-4.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/nfc-next
Samuel Ortiz says:
====================
NFC 4.6 pull request
This is a very small one this time, with only 5 patches.
There are a couple of big items that could not be merged/finished
on time.
We have:
- 2 LLCP fixes for a race and a potential OOM.
- 2 cleanups for the pn544 and microread drivers.
- 1 Maintainer addition for the s3fwrn5 driver.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no need to use the static variable here, pr_info_once is more
concise.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some drivers reuse/share code paths that free SKBs between NAPI
and non-NAPI calls. Adjust napi_consume_skb to handle this
use-case.
Before, calls from netpoll (w/ IRQs disabled) was handled and
indicated with a budget zero indication. Use the same zero
indication to handle calls not originating from NAPI/softirq.
Simply handled by using dev_consume_skb_any().
This adds an extra branch+call for the netpoll case (checking
in_irq() + irqs_disabled()), but that is okay as this is a slowpath.
Suggested-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently sctp_sendmsg() triggers some calls that will allocate memory
with GFP_ATOMIC even when not necessary. In the case of
sctp_packet_transmit it will allocate a linear skb that will be used to
construct the packet and this may cause sends to fail due to ENOMEM more
often than anticipated specially with big MTUs.
This patch thus allows it to inherit gfp flags from upper calls so that
it can use GFP_KERNEL if it was triggered by a sctp_sendmsg call or
similar. All others, like retransmits or flushes started from BH, are
still allocated using GFP_ATOMIC.
In netperf tests this didn't result in any performance drawbacks when
memory is not too fragmented and made it trigger ENOMEM way less often.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we want to change a flow using netlink, we have to identify it to
be able to perform a lookup. Both the flow key and unique flow ID
(ufid) are valid identifiers, but we always have to specify the flow
key in the netlink message. When both attributes are there, the ufid
is used. The flow key is used to validate the actions provided by
the userland.
This commit allows to use the ufid without having to provide the flow
key, as it is already done in the netlink 'flow get' and 'flow del'
path. The flow key remains mandatory when an action is provided.
Signed-off-by: Samuel Gauthier <samuel.gauthier@6wind.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds macro NETCONFA_ALL to represent all type of netconf
attributes for IPv4 and IPv6.
Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
prior to this patch, at the beginning if we have two paths in one assoc,
they may have the same params other than the last_time_heard, it will try
the paths like this:
1st cycle
try trans1 fail.
then trans2 is selected.(cause it's last_time_heard is after trans1).
2nd cycle:
try trans2 fail
then trans2 is selected.(cause it's last_time_heard is after trans1).
3rd cycle:
try trans2 fail
then trans2 is selected.(cause it's last_time_heard is after trans1).
....
trans1 will never have change to be selected, which is not what we expect.
we should keeping round robin all the paths if they are just added at the
beginning.
So at first every tranport's last_time_heard should be initialized 0, so
that we ensure they have the same value at the beginning, only by this,
all the transports could get equal chance to be selected.
Then for sctp_trans_elect_best, it should return the trans_next one when
*trans == *trans_next, so that we can try next if it fails, but now it
always return trans. so we can fix it by exchanging these two params when
we calls sctp_trans_elect_tie().
Fixes: 4c47af4d5e ('net: sctp: rework multihoming retransmission path selection to rfc4960')
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace all "unsigned" types with "unsigned int" types.
Reported-by: David Miller <davem@davemloft.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch updates the GRO handlers for GRE, VXLAN, GENEVE, and FOU so that
we do not clear the flush bit until after we have called the next level GRO
handler. Previously this was being cleared before parsing through the list
of frames, however this resulted in several paths where either the bit
needed to be reset but wasn't as in the case of FOU, or cases where it was
being set as in GENEVE. By just deferring the clearing of the bit until
after the next level protocol has been parsed we can avoid any unnecessary
bit twiddling and avoid bugs.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ben Hawkes says:
integer overflow in xt_alloc_table_info, which on 32-bit systems can
lead to small structure allocation and a copy_from_user based heap
corruption.
Reported-by: Ben Hawkes <hawkes@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch extends bpf_tunnel_key with a tunnel_label member, that maps
to ip_tunnel_key's label so underlying backends like vxlan and geneve
can propagate the label to udp_tunnel6_xmit_skb(), where it's being set
in the IPv6 header. It allows for having 20 more bits to encode/decode
flow related meta information programmatically. Tested with vxlan and
geneve.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch extends udp_tunnel6_xmit_skb() to pass in the IPv6 flow label
from call sites. Currently, there's no such option and it's always set to
zero when writing ip6_flow_hdr(). Add a label member to ip_tunnel_key, so
that flow-based tunnels via collect metadata frontends can make use of it.
vxlan and geneve will be converted to add flow label support separately.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes a regression in the bridge ageing time caused by:
commit c62987bbd8 ("bridge: push bridge setting ageing_time down to switchdev")
There are users of Linux bridge which use the feature that if ageing time
is set to 0 it causes entries to never expire. See:
https://www.linuxfoundation.org/collaborate/workgroups/networking/bridge
For a pure software bridge, it is unnecessary for the code to have
arbitrary restrictions on what values are allowable.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cast pointer to unsigned long instead of u64, to fix compilation warning
on 32 bit arch, spotted by 0day build.
Fixes: 5b33f48 ("net/flower: Introduce hardware offload support")
Signed-off-by: Amir Vadai <amir@vadai.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Add Advertising command handler does the appropriate checks for
the AD and Scan Response data, however fails to take into account the
general length of the mgmt command itself, which could lead to
potential buffer overflows. This patch adds the necessary check that
the mgmt command length is consistent with the given ad and scan_rsp
lengths.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Cc: stable@vger.kernel.org
A recent change added MGMT_ADV_FLAG_DISCOV to the flags returned by
get_adv_instance_flags(), however failed to take into account limited
discoverable mode. This patch fixes the issue by setting the correct
discoverability flag in the AD data.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
We copy according to ->target|matchsize, so check that the netlink attribute
(which can include padding and might be larger) contains enough data.
Reported-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Simon Horman says:
====================
please consider these IPVS fixes for v4.5 or
if it is too late please consider them for v4.6.
* Arnd Bergman has corrected an error whereby the SIP persistence engine
may incorrectly access protocol fields
* Julian Anastasov has corrected a problem reported by Jiri Bohac with the
connection rescheduling mechanism added in 3.10 when new SYNs in
connection to dead real server can be redirected to another real server.
* Marco Angaroni resolved a problem in the SIP persistence engine
whereby the Call-ID could not be found if it was at the beginning of a
SIP message.
====================
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Trivial conversion to the new RDMA CQ API.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Dominique Martinet <dominique.martinet@cea.fr>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Will be used in a following patch to query if a key is being used, and
what it's value in the target object.
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Amir Vadai <amir@vadai.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is based on a patch made by John Fastabend.
It adds support for offloading cls_flower.
when NETIF_F_HW_TC is on:
flags = 0 => Rule will be processed twice - by hardware, and if
still relevant, by software.
flags = SKIP_HW => Rull will be processed by software only
If hardware fail/not capabale to apply the rule, operation will NOT
fail. Filter will be processed by SW only.
Acked-by: Jiri Pirko <jiri@mellanox.com>
Suggested-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Amir Vadai <amir@vadai.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
The initial commit badly merged into the dsa_resume method instead
of the dsa_remove_dst method.
As consequence, the dst->master_netdev->dsa_ptr is not set to NULL on
removal and re-bind of the dsa device fails with error -17.
Fixes: b0dc635d92 ("net: dsa: cleanup resources upon module removal ")
Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
Acked-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Increment the mgmt revision due to the recently added limited
privacy mode.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Introduce a limited privacy mode indicated by value 0x02 to the mgmt
Set Privacy command.
With value 0x02 the kernel will use privacy mode with a resolvable
private address. In case the controller is bondable and discoverable
the identity address will be used.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
When lookup up the advertising instance flags for the default
advertising instance (0) the discoverable flag should be filled in
based on the HCI_DISCOVERABLE flag.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Minor fix to not do the memset until the variable it clears is
actually used.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch fixes the comments for SAM/DAM value.
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch solves the sparse warning:
net/6lowpan/debugfs.c:164:30: warning: symbol 'lowpan_ctx_pfx_fops' was
not declared. Should it be static?
net/6lowpan/debugfs.c:241:30: warning: symbol 'lowpan_context_fops' was
not declared. Should it be static?
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Jozsef Kadlecsik says:
====================
Please apply the next patch against the nf tree:
- Deniz Eren reported that parallel flush/dump of list:set type of sets
can lead to kernel crash. The bug was due to non-RCU compatible
flushing, listing in the set type, fixed by me.
- Julia Lawall pointed out that IPSET_ATTR_ETHER netlink attribute
length was not checked explicitly. The patch adds the missing
checkings.
====================
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Replace link layer header validation check ll_header_truncate with
more generic dev_validate_header.
Validation based on hard_header_len incorrectly drops valid packets
in variable length protocols, such as AX25. dev_validate_header
calls header_ops.validate for such protocols to ensure correctness
below hard_header_len.
See also http://comments.gmane.org/gmane.linux.network/401064
Fixes 9c7077622d ("packet: make packet_snd fail on len smaller than l2 header")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As variable length protocol, AX25 fails link layer header validation
tests based on a minimum length. header_ops.validate allows protocols
to validate headers that are shorter than hard_header_len. Implement
this callback for AX25.
See also http://comments.gmane.org/gmane.linux.network/401064
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds receive timeout for message assembly on the attached TCP
sockets. The timeout is set when a new messages is started and the whole
message has not been received by TCP (not in the receive queue). If the
completely message is subsequently received the timer is cancelled, if the
timer expires the RX side is aborted.
The timeout value is taken from the socket timeout (SO_RCVTIMEO) that is
set on a TCP socket (i.e. set by get sockopt before attaching a TCP socket
to KCM.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Message assembly is performed on the TCP socket. This is logically
equivalent of an application that performs a peek on the socket to find
out how much memory is needed for a receive buffer. The receive socket
buffer also provides the maximum message size which is checked.
The receive algorithm is something like:
1) Receive the first skbuf for a message (or skbufs if multiple are
needed to determine message length).
2) Check the message length against the number of bytes in the TCP
receive queue (tcp_inq()).
- If all the bytes of the message are in the queue (incluing the
skbuf received), then proceed with message assembly (it should
complete with the tcp_read_sock)
- Else, mark the psock with the number of bytes needed to
complete the message.
3) In TCP data ready function, if the psock indicates that we are
waiting for the rest of the bytes of a messages, check the number
of queued bytes against that.
- If there are still not enough bytes for the message, just
return
- Else, clear the waiting bytes and proceed to receive the
skbufs. The message should now be received in one
tcp_read_sock
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement kcm_sendpage. Set in sendpage to kcm_sendpage in both
dgram and seqpacket ops.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement kcm_splice_read. This is supported only for seqpacket.
Add kcm_seqpacket_ops and set splice read to kcm_splice_read.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds various counters for KCM. These include counters for
messages and bytes received or sent, as well as counters for number of
attached/unattached TCP sockets and other error or edge events.
The statistics are exposed via a proc interface. /proc/net/kcm provides
statistics per KCM socket and per psock (attached TCP sockets).
/proc/net/kcm_stats provides aggregate statistics.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This module implements the Kernel Connection Multiplexor.
Kernel Connection Multiplexor (KCM) is a facility that provides a
message based interface over TCP for generic application protocols.
With KCM an application can efficiently send and receive application
protocol messages over TCP using datagram sockets.
For more information see the included Documentation/networking/kcm.txt
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create a common kernel function to get the number of bytes available
on a TCP socket. This is based on code in INQ getsockopt and we now call
the function for that getsockopt.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new msg flag called MSG_BATCH. This flag is used in sendmsg to
indicate that more messages will follow (i.e. a batch of messages is
being sent). This is similar to MSG_MORE except that the following
messages are not merged into one packet, they are sent individually.
sendmmsg is updated so that each contained message except for the
last one is marked as MSG_BATCH.
MSG_BATCH is a performance optimization in cases where a socket
implementation can benefit by transmitting packets in a batch.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows setting MSG_EOR in each individual msghdr passed
in sendmmsg. This allows a sendmmsg to send multiple messages when
using SOCK_SEQPACKET.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Export it for cases where we want to create sockets by hand.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
One of our customers observed issues with FIB6 garbage collectors
running in different network namespaces blocking each other, resulting
in soft lockups (fib6_run_gc() initiated from timer runs always in
forced mode).
Now that FIB6 walkers are separated per namespace, there is no more need
for instances of fib6_run_gc() in different namespaces blocking each
other. There is still a call to icmp6_dst_gc() which operates on shared
data but this function is protected by its own shared lock.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The IPv6 FIB data structures are separated per network namespace but
there is still only one global walkers list and one global walker list
lock. This means changes in one namespace unnecessarily interfere with
walkers in other namespaces.
Replace the global list with per-netns lists (and give each its own
lock).
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Global variable gc_args is only used in fib6_run_gc() and functions
called from it. As fib6_run_gc() makes sure there is at most one
instance of fib6_clean_all() running at any moment, we can replace
gc_args with a local variable which will be needed once multiple
instances (per netns) of garbage collector are allowed.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dmitry reported that sctp_add_bind_addr may read more bytes than
expected in case the parameter is a IPv4 addr supplied by the user
through calls such as sctp_bindx_add(), because it always copies
sizeof(union sctp_addr) while the buffer may be just a struct
sockaddr_in, which is smaller.
This patch then fixes it by limiting the memcpy to the min between the
union size and a (new parameter) provided addr size. Where possible this
parameter still is the size of that union, except for reading from
user-provided buffers, which then it accounts for protocol type.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Julia Lawall pointed out that IPSET_ATTR_ETHER netlink attribute length
was not checked explicitly, just for the maximum possible size. Malicious
netlink clients could send shorter attribute and thus resulting a kernel
read after the buffer.
The patch adds the explicit length checkings.
Reported-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
This fix is for dsmark similar to commit 3557619f0f
("net_sched: prio: use qdisc_dequeue_peeked")
and makes use of qdisc_dequeue_peeked() instead of direct dequeue() call.
First time, wrr peeks dsmark, which will then peek into sfq.
sfq dequeues an skb and it's stored in sch->gso_skb.
Next time, wrr tries to dequeue from dsmark, which will call sfq dequeue
directly. This results skipping the previously peeked skb.
So changed dsmark dequeue to call qdisc_dequeue_peeked() instead to use
peeked skb if exists.
Signed-off-by: Kyeong Yoo <kyeong.yoo@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter updates for your net-next tree,
they are:
1) Remove useless debug message when deleting IPVS service, from
Yannick Brosseau.
2) Get rid of compilation warning when CONFIG_PROC_FS is unset in
several spots of the IPVS code, from Arnd Bergmann.
3) Add prandom_u32 support to nft_meta, from Florian Westphal.
4) Remove unused variable in xt_osf, from Sudip Mukherjee.
5) Don't calculate IP checksum twice from netfilter ipv4 defrag hook
since fixing af_packet defragmentation issues, from Joe Stringer.
6) On-demand hook registration for iptables from netns. Instead of
registering the hooks for every available netns whenever we need
one of the support tables, we register this on the specific netns
that needs it, patchset from Florian Westphal.
7) Add missing port range selection to nf_tables masquerading support.
BTW, just for the record, there is a typo in the description of
5f6c253ebe ("netfilter: bridge: register hooks only when bridge
interface is added") that refers to the cluster match as deprecated, but
it is actually the CLUSTERIP target (which registers hooks
inconditionally) the one that is scheduled for removal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The assumptions from commit 0c1d70af92 ("net: use dst_cache for vxlan
device"), 468dfffcd7 ("geneve: add dst caching support") and 3c1cb4d260
("net/ipv4: add dst cache support for gre lwtunnels") on dst_cache usage
when ip_tunnel_info is used is unfortunately not always valid as assumed.
While it seems correct for ip_tunnel_info front-ends such as OVS, eBPF
however can fill in ip_tunnel_info for consumers like vxlan, geneve or gre
with different remote dsts, tos, etc, therefore they cannot be assumed as
packet independent.
Right now vxlan, geneve, gre would cache the dst for eBPF and every packet
would reuse the same entry that was first created on the initial route
lookup. eBPF doesn't store/cache the ip_tunnel_info, so each skb may have
a different one.
Fix it by adding a flag that checks the ip_tunnel_info. Also the !tos test
in vxlan needs to be handeled differently in this context as it is currently
inferred from ip_tunnel_info as well if present. ip_tunnel_dst_cache_usable()
helper is added for the three tunnel cases, which checks if we can use dst
cache.
Fixes: 0c1d70af92 ("net: use dst_cache for vxlan device")
Fixes: 468dfffcd7 ("geneve: add dst caching support")
Fixes: 3c1cb4d260 ("net/ipv4: add dst cache support for gre lwtunnels")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
After eBPF being able to programmatically access/manage tunnel key meta
data via commit d3aa45ce6b ("bpf: add helpers to access tunnel metadata")
and more recently also for IPv6 through c6c3345407 ("bpf: support ipv6
for bpf_skb_{set,get}_tunnel_key"), this work adds two complementary
helpers to generically access their auxiliary tunnel options.
Geneve and vxlan support this facility. For geneve, TLVs can be pushed,
and for the vxlan case its GBP extension. I.e. setting tunnel key for geneve
case only makes sense, if we can also read/write TLVs into it. In the GBP
case, it provides the flexibility to easily map the group policy ID in
combination with other helpers or maps.
I chose to model this as two separate helpers, bpf_skb_{set,get}_tunnel_opt(),
for a couple of reasons. bpf_skb_{set,get}_tunnel_key() is already rather
complex by itself, and there may be cases for tunnel key backends where
tunnel options are not always needed. If we would have integrated this
into bpf_skb_{set,get}_tunnel_key() nevertheless, we are very limited with
remaining helper arguments, so keeping compatibility on structs in case of
passing in a flat buffer gets more cumbersome. Separating both also allows
for more flexibility and future extensibility, f.e. options could be fed
directly from a map, etc.
Moreover, change geneve's xmit path to test only for info->options_len
instead of TUNNEL_GENEVE_OPT flag. This makes it more consistent with vxlan's
xmit path and allows for avoiding to specify a protocol flag in the API on
xmit, so it can be protocol agnostic. Having info->options_len is enough
information that is needed. Tested with vxlan and geneve.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Added by 9a628224a6 ("ip_tunnel: Add dont fragment flag."), allow to
feed df flag into tunneling facilities (currently supported on TX by
vxlan, geneve and gre) as a hint from eBPF's bpf_skb_set_tunnel_key()
helper.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
They are only used here, so there's no reason they should not be static.
Only the vlan push/pop protos are used in the test_bpf suite.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When overwriting parts of the packet with bpf_skb_store_bytes() that
were fed previously into skb->hash calculation, we should clear the
current hash with skb_clear_hash(), so that a next skb_get_hash() call
can determine the correct hash related to this skb.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 7d672345ed ("bpf: add generic bpf_csum_diff helper") added a
generic checksum diff helper that can feed bpf_l4_csum_replace() with
a target __wsum diff that is to be applied to the L4 checksum. This
facility is very flexible, can be cascaded, allows for adding, removing,
or diffing data, or for calculating the pseudo header checksum from
scratch, but it can also be reused for working with the IPv4 header
checksum.
Thus, analogous to bpf_l4_csum_replace(), add a case for header field
value of 0 to change the checksum at a given offset through a new helper
csum_replace_by_diff(). Also, in addition to that, this provides an
easy to use interface for feeding precalculated diffs f.e. coming from
a map. It nicely complements bpf_l3_csum_replace() that currently allows
only for csum updates of 2 and 4 byte diffs.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Several cases of overlapping changes, as well as one instance
(vxlan) of a bug fix in 'net' overlapping with code movement
in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Fix ordering of WEXT netlink messages so we don't see a newlink
after a dellink, from Johannes Berg.
2) Out of bounds access in minstrel_ht_set_best_prob_rage, from
Konstantin Khlebnikov.
3) Paging buffer memory leak in iwlwifi, from Matti Gottlieb.
4) Wrong units used to set initial TCP rto from cached metrics, also
from Konstantin Khlebnikov.
5) Fix stale IP options data in the SKB control block from leaking
through layers of encapsulation, from Bernie Harris.
6) Zero padding len miscalculated in bnxt_en, from Michael Chan.
7) Only CHECKSUM_PARTIAL packets should be passed down through GSO, fix
from Hannes Frederic Sowa.
8) Fix suspend/resume with JME networking devices, from Diego Violat
and Guo-Fu Tseng.
9) Checksums not validated properly in bridge multicast support due to
the placement of the SKB header pointers at the time of the check,
fix from Álvaro Fernández Rojas.
10) Fix hang/tiemout with r8169 if a stats fetch is done while the
device is runtime suspended. From Chun-Hao Lin.
11) The forwarding database netlink dump facilities don't track the
state of the dump properly, resulting in skipped/missed entries.
From Minoura Makoto.
12) Fix regression from a recent 3c59x bug fix, from Neil Horman.
13) Fix list corruption in bna driver, from Ivan Vecera.
14) Big endian machines crash on vlan add in bnx2x, fix from Michal
Schmidt.
15) Ethtool RSS configuration not propagated properly in mlx5 driver,
from Tariq Toukan.
16) Fix regression in PHY probing in stmmac driver, from Gabriel
Fernandez.
17) Fix SKB tailroom calculation in igmp/mld code, from Benjamin
Poirier.
18) A past change to skip empty routing headers in ipv6 extention header
parsing accidently caused fragment headers to not be matched any
longer. Fix from Florian Westphal.
19) eTSEC-106 erratum needs to be applied to more gianfar chips, from
Atsushi Nemoto.
20) Fix netdev reference after free via workqueues in usb networking
drivers, from Oliver Neukum and Bjørn Mork.
21) mdio->irq is now an array rather than a pointer to dynamic memory,
but several drivers were still trying to free it :-/ Fixes from
Colin Ian King.
22) act_ipt iptables action forgets to set the family field, thus LOG
netfilter targets don't work with it. Fix from Phil Sutter.
23) SKB leak in ibmveth when skb_linearize() fails, from Thomas Falcon.
24) pskb_may_pull() cannot be called with interrupts disabled, fix code
that tries to do this in vmxnet3 driver, from Neil Horman.
25) be2net driver leaks iomap'd memory on removal, fix from Douglas
Miller.
26) Forgotton RTNL mutex unlock in ppp_create_interface() error paths,
from Guillaume Nault.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (97 commits)
ppp: release rtnl mutex when interface creation fails
cdc_ncm: do not call usbnet_link_change from cdc_ncm_bind
tcp: fix tcpi_segs_in after connection establishment
net: hns: fix the bug about loopback
jme: Fix device PM wakeup API usage
jme: Do not enable NIC WoL functions on S0
udp6: fix UDP/IPv6 encap resubmit path
be2net: Don't leak iomapped memory on removal.
vmxnet3: avoid calling pskb_may_pull with interrupts disabled
net: ethernet: Add missing MFD_SYSCON dependency on HAS_IOMEM
ibmveth: check return of skb_linearize in ibmveth_start_xmit
cdc_ncm: toggle altsetting to force reset before setup
usbnet: cleanup after bind() in probe()
mlxsw: pci: Correctly determine if descriptor queue is full
mlxsw: spectrum: Always decrement bridge's ref count
tipc: fix nullptr crash during subscription cancel
net: eth: altera: do not free array priv->mdio->irq
net/ethoc: do not free array priv->mdio->irq
net: sched: fix act_ipt for LOG target
asix: do not free array priv->mdio->irq
...
If final packet (ACK) of 3WHS is lost, it appears we do not properly
account the following incoming segment into tcpi_segs_in
While we are at it, starts segs_in with one, to count the SYN packet.
We do not yet count number of SYN we received for a request sock, we
might add this someday.
packetdrill script showing proper behavior after fix :
// Tests tcpi_segs_in when 3rd packet (ACK) of 3WHS is lost
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
+0 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop>
+0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK>
+.020 < P. 1:1001(1000) ack 1 win 32792
+0 accept(3, ..., ...) = 4
+.000 %{ assert tcpi_segs_in == 2, 'tcpi_segs_in=%d' % tcpi_segs_in }%
Fixes: 2efd055c53 ("tcp: add tcpi_segs_in and tcpi_segs_out to tcp_info")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv4 interprets a negative return value from a protocol handler as a
request to redispatch to a new protocol. In contrast, IPv6 interprets a
negative value as an error, and interprets a positive value as a request
for redispatch.
UDP for IPv6 was unaware of this difference. Change __udp6_lib_rcv() to
return a positive value for redispatch. Note that the socket's
encap_rcv hook still needs to return a negative value to request
dispatch, and in the case of IPv6 packets, adjust IP6CB(skb)->nhoff to
identify the byte containing the next protocol.
Signed-off-by: Bill Sommerfeld <wsommerfeld@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make the c files less cluttered and enable netlink attributes to be
shared between files.
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, arp_rcv() always return zero on a packet delivery upcall.
To make its behavior more compliant with the way this API should be
used, this patch changes this to let it return NET_RX_SUCCESS when the
packet is proper handled, and NET_RX_DROP otherwise.
v1->v2:
If sanity check is failed, call kfree_skb() instead of consume_skb(), then
return the correct return value.
Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes the checkpatch.pl error to netlabel_domainhash.c:
ERROR: do not initialise statics to NULL
Signed-off-by: Wei Tang <tangwei@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes the checkpatch.pl error to netlabel_unlabeled.c:
ERROR: do not initialise statics to 0 or NULL
Signed-off-by: Wei Tang <tangwei@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Until now, we have kept a pre-allocated protocol message header
aggregated into struct tipc_link. Apart from adding unnecessary
footprint to the link instances, this requires extra code both to
initialize and re-initialize it.
We now remove this sub-optimization. This change also makes it
possible to clean up the function tipc_build_proto_msg() and remove
a couple of small functions that were accessing the mentioned header.
In particular, we can replace all occurrences of the local function
call link_own_addr(link) with the generic tipc_own_addr(net).
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 4d5cfcba2f ('tipc: fix connection abort during subscription
cancel'), removes the check for a valid subscription before calling
tipc_nametbl_subscribe().
This will lead to a nullptr exception when we process a
subscription cancel request. For a cancel request, a null
subscription is passed to tipc_nametbl_subscribe() resulting
in exception.
In this commit, we call tipc_nametbl_subscribe() only for
a valid subscription.
Fixes: 4d5cfcba2f ('tipc: fix connection abort during subscription cancel')
Reported-by: Anders Widell <anders.widell@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before calling the destroy() or target() callbacks, the family parameter
field has to be initialized. Otherwise at least the LOG target will
refuse to work and upon removal oops the kernel.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Phil Sutter <phil@nwl.cc>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make sure the user has provided a scope for multicast and link local
addresses used locally by a UDP bearer.
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The netlink policy for TIPC_NLA_UDP_LOCAL and TIPC_NLA_UDP_REMOTE
is of type binary with a defined length. This causes the policy
framework to threat the defined length as maximum length.
There is however no protection against a user sending a smaller
amount of data. Prior to this patch this wasn't handled which could
result in a partially incomplete sockaddr_storage struct containing
uninitialized data.
In this patch we use nla_memcpy() when copying the user data. This
ensures a potential gap at the end is cleared out properly.
This was found by Julia with Coccinelle tool.
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Reported-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make sure we have a link before checking if it has been reset or not.
Prior to this patch tipc_link_is_reset() could be called with a non
existing link, resulting in a null pointer dereference.
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Prior to this patch enabling a IPv4 UDP bearer caused a null pointer
dereference in iptunnel_xmit_stats(), when it tried to dereference the
net device from the skb. To resolve this we now point the skb device
to the net device resolved from the routing table.
Fixes: 039f50629b (ip_tunnel: Move stats update to iptunnel_xmit())
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The IPVS SIP persistence engine is not able to parse the SIP header
"Call-ID" when such header is inserted in the first positions of
the SIP message.
When IPVS is configured with "--pe sip" option, like for example:
ipvsadm -A -u 1.2.3.4:5060 -s rr --pe sip -p 120 -o
some particular messages (see below for details) do not create entries
in the connection template table, which can be listed with:
ipvsadm -Lcn --persistent-conn
Problematic SIP messages are SIP responses having "Call-ID" header
positioned just after message first line:
SIP/2.0 200 OK
[Call-ID header here]
[rest of the headers]
When "Call-ID" header is positioned down (after a few other headers)
it is correctly recognized.
This is due to the data offset used in get_callid function call inside
ip_vs_pe_sip.c file: since dptr already points to the start of the
SIP message, the value of dataoff should be initially 0.
Otherwise the header is searched starting from some bytes after the
first character of the SIP message.
Fixes: 758ff03387 ("IPVS: sip persistence engine")
Signed-off-by: Marco Angaroni <marcoangaroni@gmail.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
"RFC 5961, 4.2. Mitigation" describes a mechanism to request
client to confirm with RST the restart of TCP connection
before resending its SYN. As result, IPVS can see SYNs for
existing connection in CLOSE state. Add check to allow
rescheduling in this state.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Jiri Bohac is reporting for a problem where the attempt
to reschedule existing connection to another real server
needs proper redirect for the conntrack used by the IPVS
connection. For example, when IPVS connection is created
to NAT-ed real server we alter the reply direction of
conntrack. If we later decide to select different real
server we can not alter again the conntrack. And if we
expire the old connection, the new connection is left
without conntrack.
So, the only way to redirect both the IPVS connection and
the Netfilter's conntrack is to drop the SYN packet that
hits existing connection, to wait for the next jiffie
to expire the old connection and its conntrack and to rely
on client's retransmission to create new connection as
usually.
Jiri Bohac provided a fix that drops all SYNs on rescheduling,
I extended his patch to do such drops only for connections
that use conntrack. Here is the original report from Jiri Bohac:
Since commit dc7b3eb900 ("ipvs: Fix reuse connection if real server
is dead"), new connections to dead servers are redistributed
immediately to new servers. The old connection is expired using
ip_vs_conn_expire_now() which sets the connection timer to expire
immediately.
However, before the timer callback, ip_vs_conn_expire(), is run
to clean the connection's conntrack entry, the new redistributed
connection may already be established and its conntrack removed
instead.
Fix this by dropping the first packet of the new connection
instead, like we do when the destination server is not available.
The timer will have deleted the old conntrack entry long before
the first packet of the new connection is retransmitted.
Fixes: dc7b3eb900 ("ipvs: Fix reuse connection if real server is dead")
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
ip_vs_fill_iph_skb_off() may not find an IP header, and gcc has
determined that ip_vs_sip_fill_param() then incorrectly accesses
the protocol fields:
net/netfilter/ipvs/ip_vs_pe_sip.c: In function 'ip_vs_sip_fill_param':
net/netfilter/ipvs/ip_vs_pe_sip.c:76:5: error: 'iph.protocol' may be used uninitialized in this function [-Werror=maybe-uninitialized]
if (iph.protocol != IPPROTO_UDP)
^
net/netfilter/ipvs/ip_vs_pe_sip.c:81:10: error: 'iph.len' may be used uninitialized in this function [-Werror=maybe-uninitialized]
dataoff = iph.len + sizeof(struct udphdr);
^
This adds a check for the ip_vs_fill_iph_skb_off() return code
before looking at the ip header data returned from it.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: b0e010c527 ("ipvs: replace ip_vs_fill_ip4hdr with ip_vs_fill_iph_skb_off")
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Since offset is zero, it's not necessary to use set function. Reset
function is straightforward, and will remove the unnecessary add
operation in set function.
Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since offset is zero, it's not necessary to use set function. Reset
function is straightforward, and will remove the unnecessary add
operation in set function.
Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the ICMP message processing code, don't try to map ICMP codes to UNIX
error codes as the caller (IPv4/IPv6) already did that for us (ee_errno).
Signed-off-by: David Howells <dhowells@redhat.com>
The version number rxkad places in the response should be network byte
order.
Whilst we're at it, rearrange the code to be more readable.
Signed-off-by: David Howells <dhowells@redhat.com>
Use ACCESS_ONCE() when accessing the other-end pointer into a circular
buffer as it's possible the other-end pointer might change whilst we're
doing this, and if we access it twice, we might get some weird things
happening.
Signed-off-by: David Howells <dhowells@redhat.com>
Currently, received RxRPC packets outside the range 1-13 are rejected.
There are, however, holes in the range that should also be rejected - plus
at least one type we don't yet support - so reject these also.
Signed-off-by: David Howells <dhowells@redhat.com>
The upper bound of the defined range for rx_mtu is being set in the same
member as the lower bound (extra1) rather than the correct place (extra2).
I'm not entirely sure why this compiles.
Signed-off-by: David Howells <dhowells@redhat.com>
Currently, a copy of the Rx packet header is copied into the the sk_buff
private data so that we can advance the pointer into the buffer,
potentially discarding the original. At the moment, this copy is held in
network byte order, but this means we're doing a lot of unnecessary
translations.
The reasons it was done this way are that we need the values in network
byte order occasionally and we can use the copy, slightly modified, as part
of an iov array when sending an ack or an abort packet.
However, it seems more reasonable on review that it would be better kept in
host byte order and that we make up a new header when we want to send
another packet.
To this end, rename the original header struct to rxrpc_wire_header (with
BE fields) and institute a variant called rxrpc_host_header that has host
order fields. Change the struct in the sk_buff private data into an
rxrpc_host_header and translate the values when filling it in.
This further allows us to keep values kept in various structures in host
byte order rather than network byte order and allows removal of some fields
that are byteswapped duplicates.
Signed-off-by: David Howells <dhowells@redhat.com>
Convert call flag and event numbers into enums and move their definitions
outside of the struct.
Also move the call state enum outside of the struct and add an extra
element to count the number of states.
Signed-off-by: David Howells <dhowells@redhat.com>
Fix a case where RXRPC_CALL_RELEASE (an event) is being used to specify a
flag bit. RXRPC_CALL_RELEASED should be used instead.
Signed-off-by: David Howells <dhowells@redhat.com>
Some devices declare a high number of TX queues, then set a much
lower real_num_tx_queues
This cause setups using fq_codel, sfq or fq as the default qdisc to consume
more memory than really needed.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew and Ying Huang's test robot both reported usage count problems that
trace back to the 'keep address on ifdown' patch.
>From Andrew:
We execute CRIU test on linux-next. On the current linux-next kernel
they hangs on creating a network namespace.
The kernel log contains many massages like this:
[ 1036.122108] unregister_netdevice: waiting for lo to become free.
Usage count = 2
[ 1046.165156] unregister_netdevice: waiting for lo to become free.
Usage count = 2
[ 1056.210287] unregister_netdevice: waiting for lo to become free.
Usage count = 2
I tried to revert this patch and the bug disappeared.
Here is a set of commands to reproduce this bug:
[root@linux-next-test linux-next]# uname -a
Linux linux-next-test 4.5.0-rc6-next-20160301+ #3 SMP Wed Mar 2
17:32:18 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
[root@linux-next-test ~]# unshare -n
[root@linux-next-test ~]# ip link set up dev lo
[root@linux-next-test ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
group default qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
[root@linux-next-test ~]# logout
[root@linux-next-test ~]# unshare -n
-----
The problem is a change made to RTM_DELADDR case in __ipv6_ifa_notify that
was added in an early version of the offending patch and is no longer
needed.
Fixes: f1705ec197 ("net: ipv6: Make address flushing on ifdown optional")
Cc: Andrey Wagin <avagin@gmail.com>
Cc: Ying Huang <ying.huang@linux.intel.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Tested-by: Jeremiah Mahler <jmmahler@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The new NET_DEVLINK infrastructure can be a loadable module, but the drivers
using it might be built-in, which causes link errors like:
drivers/net/built-in.o: In function `mlx4_load_one':
:(.text+0x2fbfda): undefined reference to `devlink_port_register'
:(.text+0x2fc084): undefined reference to `devlink_port_unregister'
drivers/net/built-in.o: In function `mlxsw_sx_port_remove':
:(.text+0x33a03a): undefined reference to `devlink_port_type_clear'
:(.text+0x33a04e): undefined reference to `devlink_port_unregister'
There are multiple ways to avoid this:
a) add 'depends on NET_DEVLINK || !NET_DEVLINK' dependencies
for each user
b) use 'select NET_DEVLINK' from each driver that uses it
and hide the symbol in Kconfig.
c) make NET_DEVLINK a 'bool' option so we don't have to
list it as a dependency, and rely on the APIs to be
stubbed out when it is disabled
d) use IS_REACHABLE() rather than IS_ENABLED() to check for
NET_DEVLINK in include/net/devlink.h
This implements a variation of approach a) by adding an
intermediate symbol that drivers can depend on, and changes
the three drivers using it.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 09d4d087cd ("mlx4: Implement devlink interface")
Fixes: c4745500e9 ("mlxsw: Implement devlink interface")
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When ipv6_find_hdr is used to find a fragment header
(caller specifies target NEXTHDR_FRAGMENT) we erronously return
-ENOENT for all fragments with nonzero offset.
Before commit 9195bb8e38, when target was specified, we did not
enter the exthdr walk loop as nexthdr == target so this used to work.
Now we do (so we can skip empty route headers). When we then stumble upon
a frag with nonzero frag_off we must return -ENOENT ("header not found")
only if the caller did not specifically request NEXTHDR_FRAGMENT.
This allows nfables exthdr expression to match ipv6 fragments, e.g. via
nft add rule ip6 filter input frag frag-off gt 0
Fixes: 9195bb8e38 ("ipv6: improve ipv6_find_hdr() to skip empty routing headers")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
reverts commit 94153e36e7 ("tipc: use existing sk_write_queue for
outgoing packet chain")
In Commit 94153e36e7, we assume that we fill & empty the socket's
sk_write_queue within the same lock_sock() session.
This is not true if the link is congested. During congestion, the
socket lock is released while we wait for the congestion to cease.
This implementation causes a nullptr exception, if the user space
program has several threads accessing the same socket descriptor.
Consider two threads of the same program performing the following:
Thread1 Thread2
-------------------- ----------------------
Enter tipc_sendmsg() Enter tipc_sendmsg()
lock_sock() lock_sock()
Enter tipc_link_xmit(), ret=ELINKCONG spin on socket lock..
sk_wait_event() :
release_sock() grab socket lock
: Enter tipc_link_xmit(), ret=0
: release_sock()
Wakeup after congestion
lock_sock()
skb = skb_peek(pktchain);
!! TIPC_SKB_CB(skb)->wakeup_pending = tsk->link_cong;
In this case, the second thread transmits the buffers belonging to
both thread1 and thread2 successfully. When the first thread wakeup
after the congestion it assumes that the pktchain is intact and
operates on the skb's in it, which leads to the following exception:
[2102.439969] BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
[2102.440074] IP: [<ffffffffa005f330>] __tipc_link_xmit+0x2b0/0x4d0 [tipc]
[2102.440074] PGD 3fa3f067 PUD 3fa6b067 PMD 0
[2102.440074] Oops: 0000 [#1] SMP
[2102.440074] CPU: 2 PID: 244 Comm: sender Not tainted 3.12.28 #1
[2102.440074] RIP: 0010:[<ffffffffa005f330>] [<ffffffffa005f330>] __tipc_link_xmit+0x2b0/0x4d0 [tipc]
[...]
[2102.440074] Call Trace:
[2102.440074] [<ffffffff8163f0b9>] ? schedule+0x29/0x70
[2102.440074] [<ffffffffa006a756>] ? tipc_node_unlock+0x46/0x170 [tipc]
[2102.440074] [<ffffffffa005f761>] tipc_link_xmit+0x51/0xf0 [tipc]
[2102.440074] [<ffffffffa006d8ae>] tipc_send_stream+0x11e/0x4f0 [tipc]
[2102.440074] [<ffffffff8106b150>] ? __wake_up_sync+0x20/0x20
[2102.440074] [<ffffffffa006dc9c>] tipc_send_packet+0x1c/0x20 [tipc]
[2102.440074] [<ffffffff81502478>] sock_sendmsg+0xa8/0xd0
[2102.440074] [<ffffffff81507895>] ? release_sock+0x145/0x170
[2102.440074] [<ffffffff815030d8>] ___sys_sendmsg+0x3d8/0x3e0
[2102.440074] [<ffffffff816426ae>] ? _raw_spin_unlock+0xe/0x10
[2102.440074] [<ffffffff81115c2a>] ? handle_mm_fault+0x6ca/0x9d0
[2102.440074] [<ffffffff8107dd65>] ? set_next_entity+0x85/0xa0
[2102.440074] [<ffffffff816426de>] ? _raw_spin_unlock_irq+0xe/0x20
[2102.440074] [<ffffffff8107463c>] ? finish_task_switch+0x5c/0xc0
[2102.440074] [<ffffffff8163ea8c>] ? __schedule+0x34c/0x950
[2102.440074] [<ffffffff81504e12>] __sys_sendmsg+0x42/0x80
[2102.440074] [<ffffffff81504e62>] SyS_sendmsg+0x12/0x20
[2102.440074] [<ffffffff8164aed2>] system_call_fastpath+0x16/0x1b
In this commit, we maintain the skb list always in the stack.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth-next 2016-03-01
Here's our main set of Bluetooth & 802.15.4 patches for the 4.6 kernel.
- New Bluetooth HCI driver for Intel/AG6xx controllers
- New Broadcom ACPI IDs
- LED trigger support for indicating Bluetooth powered state
- Various fixes in mac802154, 6lowpan and related drivers
- New USB IDs for AR3012 Bluetooth controllers
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The current reserved_tailroom calculation fails to take hlen and tlen into
account.
skb:
[__hlen__|__data____________|__tlen___|__extra__]
^ ^
head skb_end_offset
In this representation, hlen + data + tlen is the size passed to alloc_skb.
"extra" is the extra space made available in __alloc_skb because of
rounding up by kmalloc. We can reorder the representation like so:
[__hlen__|__data____________|__extra__|__tlen___]
^ ^
head skb_end_offset
The maximum space available for ip headers and payload without
fragmentation is min(mtu, data + extra). Therefore,
reserved_tailroom
= data + extra + tlen - min(mtu, data + extra)
= skb_end_offset - hlen - min(mtu, skb_end_offset - hlen - tlen)
= skb_tailroom - min(mtu, skb_tailroom - tlen) ; after skb_reserve(hlen)
Compare the second line to the current expression:
reserved_tailroom = skb_end_offset - min(mtu, skb_end_offset)
and we can see that hlen and tlen are not taken into account.
The min() in the third line can be expanded into:
if mtu < skb_tailroom - tlen:
reserved_tailroom = skb_tailroom - mtu
else:
reserved_tailroom = tlen
Depending on hlen, tlen, mtu and the number of multicast address records,
the current code may output skbs that have less tailroom than
dev->needed_tailroom or it may output more skbs than needed because not all
space available is used.
Fixes: 4c672e4b ("ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs")
Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
8cc785f6f4 ("net: ipv4: make the ping
/proc code AF-independent") removed the code using it, but renamed this
variable instead of removing it.
Signed-off-by: Eric Engestrom <eric.engestrom@imgtec.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3b766cd832 ("net/core: Add reading VF
statistics through the PF netdevice") added that variable but it's never
been used.
Signed-off-by: Eric Engestrom <eric.engestrom@imgtec.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fastreg MR(FRMR) is another method with which one can
register memory to HCA. Some of the newer HCAs supports only fastreg
mr mode, so we need to add support for it to have RDS functional
on them.
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Avinash Repaka <avinash.repaka@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fastreg MR(FRMR) memory registration and invalidation makes use
of work request and completion queues for its operation. Patch
allocates extra queue space towards these operation(s).
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Discovere Fast Memmory Registration support using IB device
IB_DEVICE_MEM_MGT_EXTENSIONS. Certain HCA might support just FRMR
or FMR or both FMR and FRWR. In case both mr type are supported,
default FMR is used.
Default MR is still kept as FMR against what everyone else
is following. Default will be changed to FRMR once the
RDS performance with FRMR is comparable with FMR. The
work is in progress for the same.
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Drop the RDS connection on RDMA_CM_EVENT_TIMEWAIT_EXIT so that
it can reconnect and resume.
While testing fastreg, this error happened in couple of tests but
was getting un-noticed.
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Preperatory patch for FRMR support. From connection info,
we can retrieve cm_id which contains qp handled needed for
work request posting.
We also need to drop the RDS connection on QP error states
where connection handle becomes useful.
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
No functional change.
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Keep fmr related filed in its own struct. Fastreg MR structure
will be added to the union.
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
No functional changes. This is in preperation towards adding
fastreg memory resgitration support.
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This helps to combine asynchronous fastreg MR completion handler
with send completion handler.
No functional change.
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The SO_TIMESTAMP generates time stamp for each incoming RDS messages
User app can enable it by using SO_TIMESTAMP setsocketopt() at
SOL_SOCKET level. CMSG data of cmsg type SO_TIMESTAMP contains the
time stamp in struct timeval format.
Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RDS iWarp support code has become stale and non testable. As
indicated earlier, am dropping the support for it.
If new iWarp user(s) shows up in future, we can adapat the RDS IB
transprt for the special RDMA READ sink case. iWarp needs an MR
for the RDMA READ sink.
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This moves bridge hooks to a register-when-needed scheme.
We use a device notifier to register the 'call-iptables' netfilter hooks
only once a bridge gets added.
This means that if the initial namespace uses a bridge, newly created
network namespaces no longer get the PRE_ROUTING ipt_sabotage hook.
It will registered in that network namespace once a bridge is created
within that namespace.
A few modules still use global hooks:
- conntrack
- bridge PF_BRIDGE hooks
- IPVS
- CLUSTER match (deprecated)
- SYNPROXY
As long as these modules are not loaded/used, a new network namespace has
empty hook list and NF_HOOK() will boil down to single list_empty test even
if initial namespace does stateless packet filtering.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
delay hook registration until the table is being requested inside a
namespace.
Historically, a particular table (iptables mangle, ip6tables filter, etc)
was registered on module load.
When netns support was added to iptables only the ip/ip6tables ruleset was
made namespace aware, not the actual hook points.
This means f.e. that when ipt_filter table/module is loaded on a system,
then each namespace on that system has an (empty) iptables filter ruleset.
In other words, if a namespace sends a packet, such skb is 'caught' by
netfilter machinery and fed to hooking points for that table (i.e. INPUT,
FORWARD, etc).
Thanks to Eric Biederman, hooks are no longer global, but per namespace.
This means that we can avoid allocation of empty ruleset in a namespace and
defer hook registration until we need the functionality.
We register a tables hook entry points ONLY in the initial namespace.
When an iptables get/setockopt is issued inside a given namespace, we check
if the table is found in the per-namespace list.
If not, we attempt to find it in the initial namespace, and, if found,
create an empty default table in the requesting namespace and register the
needed hooks.
Hook points are destroyed only once namespace is deleted, there is no
'usage count' (it makes no sense since there is no 'remove table' operation
in xtables api).
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This change prepares for upcoming on-demand xtables hook registration.
We change the protoypes of the register/unregister functions.
A followup patch will then add nf_hook_register/unregister calls
to the iptables one.
Once a hook is registered packets will be picked up, so all assignments
of the form
net->ipv4.iptable_$table = new_table
have to be moved to ip(6)t_register_table, else we can see NULL
net->ipv4.iptable_$table later.
This patch doesn't change functionality; without this the actual change
simply gets too big.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Since commit 0848f6428b ("inet: frags: fix defragmented packet's IP
header for af_packet"), ip_send_check() would be called twice for
defragmentation that occurs from netfilter ipv4 defrag hooks. Remove the
extra call.
Signed-off-by: Joe Stringer <joe@ovn.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Simon Horman says:
====================
please consider these cleanups for IPVS for v4.6.
* Arnd Bergmann has resolved a bunch of unused variable warnings and;
* Yannick Brosseau has removed a noisy debug message
====================
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The driver calls cfg80211_get_station, which may be part of a
module, so we must not enable BATMAN_ADV_BATMAN_V if
BATMAN_ADV=y and CFG80211=m:
net/built-in.o: In function `batadv_v_elp_get_throughput':
(text+0x5c62c): undefined reference to `cfg80211_get_station'
This clarifies the dependency to cover all combinations.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: c833484e5f ("batman-adv: ELP - compute the metric based on the estimated throughput")
Acked-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
* check GCMP encryption vs. fragmentation properly; we'd found
this problem quite a while ago but waited for the 802.11 spec
to be updated
* fix RTS/CTS logic in minstrel_ht
* fix RX of certain public action frames in AP mode
* add mac80211_hwsim to MAC80211 in MAINTAINERS, this helps
the kbuild robot pick up the right tree for it
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJW1u7EAAoJEGt7eEactAAdZXsP/il3dCxYtqYtBOzmR8OkDztn
yJiapt9aZ/u1a1mfDKfsJ2/8OLsE4pY/fp3G8TttURKNB/26sHWUEYSHWV2Ecfmi
4QFvAiUeO9Ue50ggASno239hp51KIKGcVF5flqKrCjTaSSyUm8YR78S1NlH2N+lT
i+i/AidHEdLnTgvYGOUhWu3QyL7Bm5t5wVZPivZUs0Zoos+tTokEj40eYmetQX27
lQMGa/xbvLHYeBcuLSgdQbzUR1jzOi70b9+QkjqhrYrlsc+MZE8ghwOgeAeQx56B
0sg0fZZVI8DIaCD5S3Pj8EP0WnP990DtM+4nSvtfFUYJxEM7LthEV+bs4Gr1MR2q
dZESMisKoRWBf2SyUhoVeUO6bV5kY+eTz/wyQCTdFsp9u6jzOgiY7y2aliqbJFgD
0Ydbyqq1dAeRqFaTg/46OAXkUxHW7/8hQC78Cmintevd1nHivZC3eevG6xj22G1N
eUvqy+OIv0d2Hg8+CL6oBsexQrKLg4bQFPR6s/doPnXtgsh5JccFYWujClsrNL26
kRb1kYrIqoVWXwnPp1dLF02s6Ny0+AW4xHLQIAwz5HACsJp+SjnpRP3WF9FTPs3V
T71dXQ48MLrQ/EVidCpMwJYoUnZ+BWeUf5hHpkxILIP1f0elx3i50eshlMNA/kwb
YzufoqiQhICciwEBAgBM
=2rZC
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2016-03-02' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
Here are a few more fixes for the current cycle:
* check GCMP encryption vs. fragmentation properly; we'd found
this problem quite a while ago but waited for the 802.11 spec
to be updated
* fix RTS/CTS logic in minstrel_ht
* fix RX of certain public action frames in AP mode
* add mac80211_hwsim to MAC80211 in MAINTAINERS, this helps
the kbuild robot pick up the right tree for it
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
B.A.T.M.A.N. V. Its implementation started quite some years ago,
but due to the big changes being introduced it took a while to be
discussed, designed, worked, re-worked, tested and debugged (well,
we're never done with the latest). The entire operation has
basically been a team work involving all the core contributors
together with other people interested in the project.
The new protocol is divided into two main subcomponents, called
respectively ELP and OGMv2. The former is in charge of
dealing with the neighbour discovery and link quality estimation,
while the latter implements the algorithm that spreads the
metrics around the network and computes optimal paths.
The biggest change introduced with B.A.T.M.A.N. V is the new
metric: the protocol won't rely on packet loss anymore, but it
will use the estimated throughput extracted directly from the
wifi driver (when available) by querying cfg80211.
Batman-adv will also send some unicast probing packets when
an interface is not used for payload traffic to make sure that
such values are current.
The new protocol can be compiled-in or not like other
features we have and when selected will pull in CFG80211 as
dependency for the reason described above.
Thanks to the big work brought up in the past by Marek Lindner,
batman-adv can easily deal several protocol implementations,
therefore compiling in this new version does not exclude the
older.
This means that the user is offered the option to choose
the protocol when creating the mesh interface (default is the
old one to keep backward compatibility).
Along with the protocol there are some sysfs knobs that are
introduced to fine tune some of its behaviours, but users
are recommended to keep the default values unless they know
what they are doing.
The last patch is about advertising our own patchwork platform
(thanks to Sven Eckelmann for having set that up!) in the
MAINTAINERS file.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJW1AvOAAoJENpFlCjNi1MRKxsP/1BR8Wh3MrtNhX7MMe7PnZEQ
p00kArkjtnZCbBVFbG3JduqYaOPnGyzCLgBgF3DnOxx8gAGUEu6fu4AQNNr9jiUM
4lPnPbe5NuvjzQW27fJDZMuy4Y1FLaJ5DeoYjDH48/YDNS38nzlPPtNwg0pJA2TG
C22dM6CX9kzxiy5RETmjgIjK2fgIybudWtCv7BGZxNyiqMaq+bYD+ZJJ6/wbj/FV
uKbWB28puJZ3U3qTJ6ygbhAq8moeubp/nbZeh2RGkhpmnS7LKgdLWyPxJLAEza7Z
jbdhyAE2xDYaU2xxVdOH9GecSYeUlQ+mrBdEblo2cfKMgShwZBrNjQxCaJIHN3G9
ta44yUpjGFl6/HIQnzfNZD4bTRNnb/DeRTG5qWxdrGAbxZESsMPf1Ph/hZ+KvSZ5
7eSZPmVGRYl8hJaLc25bG8pCOLQRsx/54hXNqYRqmImiPQi4IRvKUQkEDXBpHrT+
EwChfIQvCJSzZQAljrhYP/ytg/CiZxNKgW1u8cJRlHRs3jbBxJUJ67gvf2yWdQdj
Luc76uLA1++8QLqhny6AtLeRrr7mYDFnJuNjkwbqWFSyBK6r/lKS1L3ereTfeVIM
5VWMuEDlbWuPscufDJ6WDgWwTqypybjvYbnDlQV7xHap7OegyS/ukQVz2W5lXvS1
gGERfC8WeW1JnFb5TSAY
=qd6p
-----END PGP SIGNATURE-----
Merge tag 'batman-adv-for-davem' of git://git.open-mesh.org/linux-merge
Antonio Quartulli says:
====================
batman-adv 20160229
this is our (hopefully) latest batch of patches intended for net-next.
With this patchset we finally introduce B.A.T.M.A.N. V: the latest
version of our routing protocol.
Technical documentation describing the protocol in more detail can
be found in our wiki[1][2][3][4].
For what concerns this pull request, you can find the high level
description right below.
[1] https://www.open-mesh.org/projects/batman-adv/wiki/BATMAN_V
[2] https://www.open-mesh.org/projects/batman-adv/wiki/OGMv2
[3] https://www.open-mesh.org/projects/batman-adv/wiki/ELP
[4] https://www.open-mesh.org/projects/batman-adv/wiki/BATMAN_V_Tests
...
With this patchset we finally introduce our new routing protocol:
B.A.T.M.A.N. V. Its implementation started quite some years ago,
but due to the big changes being introduced it took a while to be
discussed, designed, worked, re-worked, tested and debugged (well,
we're never done with the latest). The entire operation has
basically been a team work involving all the core contributors
together with other people interested in the project.
The new protocol is divided into two main subcomponents, called
respectively ELP and OGMv2. The former is in charge of
dealing with the neighbour discovery and link quality estimation,
while the latter implements the algorithm that spreads the
metrics around the network and computes optimal paths.
The biggest change introduced with B.A.T.M.A.N. V is the new
metric: the protocol won't rely on packet loss anymore, but it
will use the estimated throughput extracted directly from the
wifi driver (when available) by querying cfg80211.
Batman-adv will also send some unicast probing packets when
an interface is not used for payload traffic to make sure that
such values are current.
The new protocol can be compiled-in or not like other
features we have and when selected will pull in CFG80211 as
dependency for the reason described above.
Thanks to the big work brought up in the past by Marek Lindner,
batman-adv can easily deal several protocol implementations,
therefore compiling in this new version does not exclude the
older.
This means that the user is offered the option to choose
the protocol when creating the mesh interface (default is the
old one to keep backward compatibility).
Along with the protocol there are some sysfs knobs that are
introduced to fine tune some of its behaviours, but users
are recommended to keep the default values unless they know
what they are doing.
The last patch is about advertising our own patchwork platform
(thanks to Sven Eckelmann for having set that up!) in the
MAINTAINERS file.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Since offset is zero, it's not necessary to use set function.
Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
CC [M] net/sched/sch_mqprio.o
net/sched/sch_mqprio.c: In function ?mqprio_init?:
net/sched/sch_mqprio.c:145: error: unknown field ?tc? specified in initializer
net/sched/sch_mqprio.c:145: warning: missing braces around initializer
net/sched/sch_mqprio.c:145: warning: (near initialization for ?tc.<anonymous>?)
make[2]: *** [net/sched/sch_mqprio.o] Error 1
make[1]: *** [net/sched] Error 2
make: *** [net] Error 2
Several people reported this, surround the unnamed union
member initialization with braces to fix.
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit 52bd2d62ce ("net: better skb->sender_cpu and skb->napi_id cohabitation")
skb_sender_cpu_clear() becomes empty and can be removed.
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now in sctp_remaddr_seq_show(), we use variable *tsp to get the param *v.
but *tsp is also used to traversal transport_addr_list, which will cover
the previous value, and make sctp_transport_put work on the wrong transport.
So fix it by adding a new variable to get the param *v.
Fixes: fba4c330c5 ("sctp: hold transport before we access t->asoc in sctp proc")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As the member .cmp_addr of sctp_af_inet6, sctp_v6_cmp_addr should also check
the port of addresses, just like sctp_v4_cmp_addr, cause it's invoked by
sctp_cmp_addr_exact().
Now sctp_v6_cmp_addr just check the port when two addresses have different
family, and lack the port check for two ipv6 addresses. that will make
sctp_hash_cmp() cannot work well.
so fix it by adding ports comparison in sctp_v6_cmp_addr().
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit f1705ec197 allows IPv6 addresses to be retained on a link down.
The address can have a cached host route which can point to the wrong
FIB table if the L3 enslavement is changed (e.g., route can point to local
table instead of VRF table if device is added to an L3 domain).
On link up check the table of the cached host route against the FIB
table associated with the device and correct if needed.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SCTP probe log timestamps use struct timespec which is
not y2038 safe.
Use struct timespec64 which is 2038 safe instead.
Use monotonic time instead of real time as only time
differences are logged.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: linux-sctp@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP probe log timestamps use struct timespec which is
not y2038 safe. Even though timespec might be good enough here
as it is used to represent delta time, the plan is to get rid
of all uses of timespec in the kernel.
Replace with struct timespec64 which is y2038 safe.
Prints still use unsigned long format and type.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ICMP timestamp messages and IP source route options require
timestamps to be in milliseconds modulo 24 hours from
midnight UT format.
Add inet_current_timestamp() function to support this. The function
returns the required timestamp in network byte order.
Timestamp calculation is also changed to call ktime_get_real_ts64()
which uses struct timespec64. struct timespec64 is y2038 safe.
Previously it called getnstimeofday() which uses struct timespec.
struct timespec is not y2038 safe.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: James Morris <jmorris@namei.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Example usage:
Set the skb priority using skbedit then allow it to be encoded
sudo tc qdisc add dev $ETH root handle 1: prio
sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
u32 match ip protocol 1 0xff flowid 1:2 \
action skbedit prio 17 \
action ife encode \
allow prio \
dst 02:15:15:15:15:15
Note: You dont need the skbedit action if you are already encoding the
skb priority earlier. A zero skb priority will not be sent
Alternative hard code static priority of decimal 33 (unlike skbedit)
then mark of 0x12 every time the filter matches
sudo $TC filter add dev $ETH parent 1: protocol ip prio 10 \
u32 match ip protocol 1 0xff flowid 1:2 \
action ife encode \
type 0xDEAD \
use prio 33 \
use mark 0x12 \
dst 02:15:15:15:15:15
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Example usage:
Set the skb using skbedit then allow it to be encoded
sudo tc qdisc add dev $ETH root handle 1: prio
sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
u32 match ip protocol 1 0xff flowid 1:2 \
action skbedit mark 17 \
action ife encode \
allow mark \
dst 02:15:15:15:15:15
Note: You dont need the skbedit action if you are already encoding the
skb mark earlier. A zero skb mark, when seen, will not be encoded.
Alternative hard code static mark of 0x12 every time the filter matches
sudo $TC filter add dev $ETH parent 1: protocol ip prio 10 \
u32 match ip protocol 1 0xff flowid 1:2 \
action ife encode \
type 0xDEAD \
use mark 0x12 \
dst 02:15:15:15:15:15
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This action allows for a sending side to encapsulate arbitrary metadata
which is decapsulated by the receiving end.
The sender runs in encoding mode and the receiver in decode mode.
Both sender and receiver must specify the same ethertype.
At some point we hope to have a registered ethertype and we'll
then provide a default so the user doesnt have to specify it.
For now we enforce the user specify it.
Lets show example usage where we encode icmp from a sender towards
a receiver with an skbmark of 17; both sender and receiver use
ethertype of 0xdead to interop.
YYYY: Lets start with Receiver-side policy config:
xxx: add an ingress qdisc
sudo tc qdisc add dev $ETH ingress
xxx: any packets with ethertype 0xdead will be subjected to ife decoding
xxx: we then restart the classification so we can match on icmp at prio 3
sudo $TC filter add dev $ETH parent ffff: prio 2 protocol 0xdead \
u32 match u32 0 0 flowid 1:1 \
action ife decode reclassify
xxx: on restarting the classification from above if it was an icmp
xxx: packet, then match it here and continue to the next rule at prio 4
xxx: which will match based on skb mark of 17
sudo tc filter add dev $ETH parent ffff: prio 3 protocol ip \
u32 match ip protocol 1 0xff flowid 1:1 \
action continue
xxx: match on skbmark of 0x11 (decimal 17) and accept
sudo tc filter add dev $ETH parent ffff: prio 4 protocol ip \
handle 0x11 fw flowid 1:1 \
action ok
xxx: Lets show the decoding policy
sudo tc -s filter ls dev $ETH parent ffff: protocol 0xdead
xxx:
filter pref 2 u32
filter pref 2 u32 fh 800: ht divisor 1
filter pref 2 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:1 (rule hit 0 success 0)
match 00000000/00000000 at 0 (success 0 )
action order 1: ife decode action reclassify
index 1 ref 1 bind 1 installed 14 sec used 14 sec
type: 0x0
Metadata: allow mark allow hash allow prio allow qmap
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
xxx:
Observe that above lists all metadatum it can decode. Typically these
submodules will already be compiled into a monolithic kernel or
loaded as modules
YYYY: Lets show the sender side now ..
xxx: Add an egress qdisc on the sender netdev
sudo tc qdisc add dev $ETH root handle 1: prio
xxx:
xxx: Match all icmp packets to 192.168.122.237/24, then
xxx: tag the packet with skb mark of decimal 17, then
xxx: Encode it with:
xxx: ethertype 0xdead
xxx: add skb->mark to whitelist of metadatum to send
xxx: rewrite target dst MAC address to 02:15:15:15:15:15
xxx:
sudo $TC filter add dev $ETH parent 1: protocol ip prio 10 u32 \
match ip dst 192.168.122.237/24 \
match ip protocol 1 0xff \
flowid 1:2 \
action skbedit mark 17 \
action ife encode \
type 0xDEAD \
allow mark \
dst 02:15:15:15:15:15
xxx: Lets show the encoding policy
sudo tc -s filter ls dev $ETH parent 1: protocol ip
xxx:
filter pref 10 u32
filter pref 10 u32 fh 800: ht divisor 1
filter pref 10 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:2 (rule hit 0 success 0)
match c0a87aed/ffffffff at 16 (success 0 )
match 00010000/00ff0000 at 8 (success 0 )
action order 1: skbedit mark 17
index 6 ref 1 bind 1
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
action order 2: ife encode action pipe
index 3 ref 1 bind 1
dst MAC: 02:15:15:15:15:15 type: 0xDEAD
Metadata: allow mark
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
xxx:
test by sending ping from sender to destination
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow for more multicast router port information to be dumped such as
timer and type attributes. For that that purpose we need to extend the
MDBA_ROUTER_PORT attribute similar to how it was done for the mdb entries
recently. The new format is thus:
[MDBA_ROUTER_PORT] = { <- nested attribute
u32 ifindex <- router port ifindex for user-space compatibility
[MDBA_ROUTER_PATTR attributes]
}
This way it remains compatible with older users (they'll simply retrieve
the u32 in the beginning) and new users can parse the remaining
attributes. It would also allow to add future extensions to the router
port without breaking compatibility.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for a temporary router port which doesn't depend only on the
incoming query. It can be refreshed if set to the same value, which is
a no-op for the rest.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is needed for the upcoming temporary port router. There's no point
to go through the logic if the value is the same.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using raw values makes it difficult to extend and also understand the
code, give them names and do explicit per-option manipulation in
br_multicast_set_port_router.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a user explicitly requests VLAN filtering with something like:
# echo 1 > /sys/class/net/<bridge>/bridge/vlan_filtering
Switchdev propagates a SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING port
attribute.
Add support for it in the DSA layer with a new port_vlan_filtering
function to let drivers toggle 802.1Q filtering on user demand.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce devlink infrastructure for drivers to register and expose to
userspace via generic Netlink interface.
There are two basic objects defined:
devlink - one instance for every "parent device", for example switch ASIC
devlink port - one instance for every physical port of the device.
This initial portion implements basic get/dump of objects to userspace.
Also, port splitter and port type setting is implemented.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Calling ib_poll_cq() to sort through WCs during a completion is a
common pattern amongst RDMA consumers. Since commit 14d3a3b249
("IB: add a proper completion queue abstraction"), WC sorting can
be handled by the IB core.
By converting to this new API, svcrdma is made a better neighbor to
other RDMA consumers, as it allows the core to schedule the delivery
of completions more fairly amongst all active consumers.
This new API also aims each completion at a function that is
specific to the WR's opcode. Thus the ctxt->wr_op field and the
switch in process_context is replaced by a set of methods that
handle each completion type.
Because each ib_cqe carries a pointer to a completion method, the
core can now post operations on a consumer's QP, and handle the
completions itself.
The server's rdma_stat_sq_poll and rdma_stat_sq_prod metrics are no
longer updated.
As a clean up, the cq_event_handler, the dto_tasklet, and all
associated locking is removed, as they are no longer referenced or
used.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Calling ib_poll_cq() to sort through WCs during a completion is a
common pattern amongst RDMA consumers. Since commit 14d3a3b249
("IB: add a proper completion queue abstraction"), WC sorting can
be handled by the IB core.
By converting to this new API, svcrdma is made a better neighbor to
other RDMA consumers, as it allows the core to schedule the delivery
of completions more fairly amongst all active consumers.
Because each ib_cqe carries a pointer to a completion method, the
core can now post operations on a consumer's QP, and handle the
completions itself.
svcrdma receive completions no longer use the dto_tasklet. Each
polled Receive WC is now handled individually in soft IRQ context.
The server transport's rdma_stat_rq_poll and rdma_stat_rq_prod
metrics are no longer updated.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Clean up: close_out is reached only when ctxt == NULL and XPT_CLOSE
is already set.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
Tested-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
RFC 5666 Section 4.2 states:
> When the peer detects an RPC-over-RDMA header version that it does
> not support (currently this document defines only version 1), it
> replies with an error code of ERR_VERS, and provides the low and
> high inclusive version numbers it does, in fact, support.
And:
> When other decoding errors are detected in the header or chunks,
> either an RPC decode error MAY be returned or the RPC/RDMA error
> code ERR_CHUNK MUST be returned.
The Linux NFS server does throw ERR_VERS when a client sends it
a request whose rdma_version is not "one." But it does not return
ERR_CHUNK when a header decoding error occurs. It just drops the
request.
To improve protocol extensibility, it should reject invalid values
in the rdma_proc field instead of treating them all like RDMA_MSG.
Otherwise clients can't detect when the server doesn't support
new rdma_proc values.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
Tested-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
When constructing an error reply, svc_rdma_xdr_encode_error()
needs to view the client's request message so it can get the
failing request's XID.
svc_rdma_xdr_decode_req() is supposed to return a pointer to the
client's request header. But if it fails to decode the client's
message (and thus an error reply is needed) it does not return the
pointer. The server then sends a bogus XID in the error reply.
Instead, unconditionally generate the pointer to the client's header
in svc_rdma_recvfrom(), and pass that pointer to both functions.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
Tested-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Fix several issues with svc_rdma_send_error():
- Post a receive buffer to replace the one that was consumed by
the incoming request
- Posting a send should use DMA_TO_DEVICE, not DMA_FROM_DEVICE
- No need to put_page _and_ free pages in svc_rdma_put_context
- Make sure the sge is set up completely in case the error
path goes through svc_rdma_unmap_dma()
- Replace the use of ENOSYS, which has a reserved meaning
Related fixes in svc_rdma_recvfrom():
- Don't leak the ctxt associated with the incoming request
- Don't close the connection after sending an error reply
- Let svc_rdma_send_error() figure out the right header error code
As a last clean up, move svc_rdma_send_error() to svc_rdma_sendto.c
with other similar functions. There is some common logic in these
functions that could someday be combined to reduce code duplication.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
Tested-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Clean up: Most svc_rdma_post_recv() call sites close the transport
connection when a receive cannot be posted. Wrap that in a common
helper.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
Tested-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The maximum size of a backchannel message on RPC-over-RDMA depends
on the connection's inline threshold. Today that threshold is
typically 1024 bytes, making the maximum message size 996 bytes.
The Linux server's CREATE_SESSION operation checks that the size
of callback Calls can be as large as 1044 bytes, to accommodate
RPCSEC_GSS. Thus CREATE_SESSION fails if a client advertises the
true message size maximum of 996 bytes.
But the server's backchannel currently does not support RPCSEC_GSS.
The actual maximum size it needs is much smaller. It is safe to
reduce the limit to enable NFSv4.1 on RDMA backchannel operation.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The NFS server's XDR encoders adds an XDR pad for content in the
xdr_buf page list at the beginning of the xdr_buf's tail buffer.
On RDMA transports, Write chunks are sent separately and without an
XDR pad.
If a Write chunk is being sent, strip off the pad in the tail buffer
so that inline content following the Write chunk remains XDR-aligned
when it is sent to the client.
BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=294
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
When the Linux NFS server writes an odd-length data item into a
Write chunk, it finishes with XDR pad bytes. If the data item is
smaller than the Write chunk, the pad bytes are written at the end
of the data item, but still inside the chunk (ie, in the
application's buffer). Since this is direct data placement, that
exposes the pad bytes.
XDR pad bytes are inserted in order to preserve the XDR alignment
of the next XDR data item in an XDR stream. But Write chunks do not
appear in the payload XDR stream, and only one data item is allowed
in each chunk. Thus XDR padding is not needed in a Write chunk.
With NFSv4, the Linux NFS server places the results of any
operations that follow an NFSv4 READ or READLINK in the xdr_buf's
tail. Those results also should never be sent as a part of a Write
chunk. The current logic in send_write_chunks() appears to assume
that the xdr_buf's tail contains only pad bytes (ie, NFSv3).
The server should write only the contents of the xdr_buf's page list
in a Write chunk. If there's more than an XDR pad in the tail, that
needs to go inline or in the Reply chunk.
BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=294
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The client provides the location of Write chunks into which the
server writes bulk payload. The client provides these when the
Upper Layer Protocol wants direct data placement and the Binding
allows it. (For NFS, this is READ and READLINK operations).
The client also provides the location of a Reply chunk into which
the server writes the non-bulk part of an RPC reply. The client
provides this chunk whenever it believes the reply can be larger
than its receive buffers.
The server then uses the presence of these chunks to determine how
it will form its reply message.
svc_rdma_sendto() was looking for Write and Reply chunks multiple
times for every reply message. It would be more efficient to do it
just once.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
In the initial implementation the only way to stop a rule from being
inserted into the hardware table was via the device feature flag.
However this doesn't work well when working on an end host system
where packets are expect to hit both the hardware and software
datapaths.
For example we can imagine a rule that will match an IP address and
increment a field. If we install this rule in both hardware and
software we may increment the field twice. To date we have only
added support for the drop action so we have been able to ignore
these cases. But as we extend the action support we will hit this
example plus more such cases. Arguably these are not even corner
cases in many working systems these cases will be common.
To avoid forcing the driver to always abort (i.e. the above example)
this patch adds a flag to add a rule in software only. A careful
user can use this flag to build software and hardware datapaths
that work together. One example we have found particularly useful
is to use hardware resources to set the skb->mark on the skb when
the match may be expensive to run in software but a mark lookup
in a hash table is cheap. The idea here is hardware can do in one
lookup what the u32 classifier may need to traverse multiple lists
and hash tables to compute. The flag is only passed down on inserts.
On deletion to avoid stale references in hardware we always try
to remove a rule if it exists.
The flags field is part of the classifier specific options. Although
it is tempting to lift this into the generic structure doing this
proves difficult do to how the tc netlink attributes are implemented
along with how the dump/change routines are called. There is also
precedence for putting seemingly generic pieces in the specific
classifier options such as TCA_U32_POLICE, TCA_U32_ACT, etc. So
although not ideal I've left FLAGS in the u32 options as well as it
simplifies the code greatly and user space has already learned how
to manage these bits ala 'tc' tool.
Another thing if trying to update a rule we require the flags to
be unchanged. This is to force user space, software u32 and
the hardware u32 to keep in sync. Thanks to Simon Horman for
catching this case.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The offload decision was originally very basic and tied to if the dev
implemented the appropriate ndo op hook. The next step is to allow
the user to more flexibly define if any paticular rule should be
offloaded or not. In order to have this logic in one function lift
the current check into a helper routine tc_should_offload().
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements bookkeeping support to compute the maximum
headroom for all the devices in each datapath. When said value
changes, the underlying devs are notified via the
ndo_set_rx_headroom method.
This also increases the internal vports xmit performance.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On bridge needed_headroom changes, the enslaved devices are
notified via the ndo_set_rx_headroom method
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RTS/CTS needs to be enabled if the rate is a fallback rate *or* if it's
a dual-stream rate and the sta is in dynamic SMPS mode.
Cc: stable@vger.kernel.org
Fixes: a3ebb4e1b7 ("mac80211: minstrel_ht: handle peers in dynamic SMPS")
Reported-by: Matías Richart <mrichart@fing.edu.uy>
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Public Action frames use special rules for how the BSSID field (Address
3) is set. A wildcard BSSID is used in cases where the transmitter and
recipient are not members of the same BSS. As such, we need to accept
Public Action frames with wildcard BSSID.
Commit db8e173245 ("mac80211: ignore frames between TDLS peers when
operating as AP") added a rule that drops Action frames to TDLS-peers
based on an Action frame having different DA (Address 1) and BSSID
(Address 3) values. This is not correct since it misses the possibility
of BSSID being a wildcard BSSID in which case the Address 1 would not
necessarily match.
Fix this by allowing mac80211 to accept wildcard BSSID in an Action
frame when in AP mode.
Fixes: db8e173245 ("mac80211: ignore frames between TDLS peers when operating as AP")
Cc: stable@vger.kernel.org
Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Just like for CCMP we need to check that for GCMP the fragments
have PNs that increment by one; the spec was updated to fix this
security issue and now has the following text:
The receiver shall discard MSDUs and MMPDUs whose constituent
MPDU PN values are not incrementing in steps of 1.
Adapt the code for CCMP to work for GCMP as well, luckily the
relevant fields already alias each other so no code duplication
is needed (just check the aliasing with BUILD_BUG_ON.)
Cc: stable@vger.kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Similarly, we need to update backlog too when we update qlen.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We saw qlen!=0 but backlog==0 on our production machine:
qdisc htb 1: dev eth0 root refcnt 2 r2q 10 default 1 direct_packets_stat 0 ver 3.17
Sent 172680457356 bytes 222469449 pkt (dropped 0, overlimits 123575834 requeues 0)
backlog 0b 72p requeues 0
The problem is we only count qlen for HTB qdisc but not backlog.
We need to update backlog too when we update qlen, so that we
can at least know the average packet length.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the bottom qdisc decides to, for example, drop some packet,
it calls qdisc_tree_decrease_qlen() to update the queue length
for all its ancestors, we need to update the backlog too to
keep the stats on root qdisc accurate.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove nearly duplicated code and prepare for the following patch.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While building with W=1 we got the warning:
net/netfilter/xt_osf.c:265:9: warning: variable 'loop_cont' set but not used
The local variable loop_cont was only initialized and then assigned a
value but was never used or checked after that.
While removing the variable, the case of OSFOPT_TS was not removed so
that it will serve as a reminder to us that we can do something in that
particular case.
Signed-off-by: Sudip Mukherjee <sudip.mukherjee@codethink.co.uk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Can be used to randomly match packets e.g. for statistic traffic sampling.
See commit 3ad0040573
("bpf: split state from prandom_u32() and consolidate {c, e}BPF prngs")
for more info why this doesn't use prandom_u32 directly.
Unlike bpf nft_meta can be built as a module, so add an EXPORT_SYMBOL
for prandom_seed_full_state too.
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
nfacct_filter_alloc doesn't validate the NFACCT_FILTER_MASK and
NFACCT_FILTER_VALUE parameters which can trigger a NULL pointer
dereference. CAP_NET_ADMIN is required to trigger the bug.
Signed-off-by: Phil Turnbull <phil.turnbull@oracle.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Lists all neighbours detected by the Echo Locating Protocol
(ELP) and their throughput metric.
Initially Developed by Linus during a 6 months trainee study
period in Ascom (Switzerland) AG.
Signed-off-by: Linus Luessing <linus.luessing@web.de>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
In case of an unused wireless link, the mac80211 throughput estimation
won't get updated further. Consequently, the reported throughput metric
will become obsolete.
With this patch unicast sampling is introduced by periodically sending
unicast ELP packets to each neighbor on idle WiFi links. These sampling
packets will fill an entire frame, so that the measurement is as
reliable as possible
Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
In case of wireless interface retrieve the throughput by
querying cfg80211. To perform this call a separate work
must be scheduled because the function may sleep and this
is not allowed within an RCU protected context (RCU in this
case is used to iterate over all the neighbours).
Use ethtool to retrieve information about an Ethernet link
like HALF/FULL_DUPLEX and advertised bandwidth (e.g.
100/10Mbps).
The metric is updated each time a new ELP packet is sent,
this way it is possible to timely react to a metric
variation which can imply (for example) a neighbour
disconnection.
Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
To enable ELP to send probing packets over wireless links
only if needed, batman-adv must keep track of the last time
it sent a unicast packet towards every neighbour.
For this purpose a 2 main changes are introduced:
1) a new member of the elp_neigh_node structure stores the
last time a unicast packet was sent towards this neighbour;
2) a wrapper function for sending unicast packets is
implemented. This function will simply update the member
describe din point 1) and then forward the packet to the
real sending routine.
Point 2) implies that any code-path leading to a unicast
sending now has to use the new wrapper.
Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
This attribute is exported to user space to disable the link
throughput auto-detection by setting a fixed value.
The throughput override value is used when batman-adv is
computing the link throughput towards a neighbour.
If the value is set to 0 then batman-adv will try to detect
the throughput by itself.
Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Add the support for recognising new originators in the
network and rebroadcast their OGMs.
Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
This is the initial implementation of the new OGM protocol
(version 2). It has been designed to work on top of the
newly added ELP.
In the previous version the OGM protocol was used to both
measure link qualities and flood the network with the metric
information. In this version the protocol is in charge of
the latter task only, leaving the former to ELP.
This means being able to decouple the interval used by the
neighbor discovery from the OGM broadcasting, which revealed
to be costly in dense networks and needed to be relaxed so
leading to a less responsive routing protocol.
Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
This parameter can be set individually on each interface and
allows the configuration of the elp interval for the link
quality measurements during runtime. Usually it is desirable
to set it to a higher (= slower) value on interfaces which
have a more static characteristic (e.g. wired interfaces)
or very dense neighbourhoods to reduce overhead.
Developed by Linus during a 6 months trainee study period in
Ascom (Switzerland) AG.
Signed-off-by: Linus Luessing <linus.luessing@web.de>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
[antonio@open-mesh.com: respin on top of the latest master]
Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
Initially developed by Linus during a 6 months trainee study
period in Ascom (Switzerland) AG.
Signed-off-by: Linus Luessing <linus.luessing@web.de>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
The B.A.T.M.A.N. protocol originally only used a single
message type (called OGM) to determine the link qualities to
the direct neighbors and spreading these link quality
information through the whole mesh. This procedure is
summarized on the BATMAN concept page and explained in
details in the RFC draft published in 2008.
This approach was chosen for its simplicity during the
protocol design phase and the implementation. However, it
also bears some drawbacks:
* Wireless interfaces usually come with some packet loss,
therefore a higher broadcast rate is desirable to allow
a fast reaction on flaky connections.
Other interfaces of the same host might be connected to
Ethernet LANs / VPNs / etc which rarely exhibit packet
loss would benefit from a lower broadcast rate to reduce
overhead.
* It generally is more desirable to detect local link
quality changes at a faster rate than propagating all
these changes through the entire mesh (the far end of
the mesh does not need to care about local link quality
changes that much). Other optimizations strategies, like
reducing overhead, might be possible if OGMs weren't
used for all tasks in the mesh at the same time.
As a result detecting local link qualities shall be handled
by an independent message type, ELP, whereas the OGM message
type remains responsible for flooding the mesh with these
link quality information and determining the overall path
transmit qualities.
Developed by Linus during a 6 months trainee study period in
Ascom (Switzerland) AG.
Signed-off-by: Linus Luessing <linus.luessing@web.de>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
This allows us to easily add a sysfs parameter for an
unsigned int later, which is not for a batman mesh interface
(e.g. bat0), but for a common interface instead. It allows
reading and writing an atomic_t in hard_iface (instead of
bat_priv compared to the mesh variant).
Developed by Linus during a 6 months trainee study period in
Ascom (Switzerland) AG.
Signed-off-by: Linus Luessing <linus.luessing@web.de>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
[antonio@open-mesh.com: rename functions and move macros]
Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
When the send skbuff reaches the end, nlmsg_put and friends returns
-EMSGSIZE but it is silently thrown away in ndo_fdb_dump. It is called
within a for_each_netdev loop and the first fdb entry of a following
netdev could fit in the remaining skbuff. This breaks the mechanism
of cb->args[0] and idx to keep track of the entries that are already
dumped, which results missing entries in bridge fdb show command.
Signed-off-by: Minoura Makoto <minoura@valinux.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
On reviewing the code I realized that GRE and UDP tunnels could cause a
kernel panic if we used GSO to segment a large UDP frame that was sent
through the tunnel with an outer checksum and hardware offloads were not
available.
In order to correct this we need to update the feature flags that are
passed to the skb_segment function so that in the event of UDP
fragmentation being requested for the inner header the segmentation
function will correctly generate the checksum for the payload if we cannot
segment the outer header.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When selecting an address in context of a VRF, the vrf master should be
preferred for address selection. If it isn't, the user has a hard time
getting the system to select to their preference - the code will pick
the address off the first in-VRF interface it can find, which on a
router could well be a non-routable address.
Signed-off-by: David Lamparter <equinox@diac24.net>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
[dsa: Fixed comment style and removed extra blank link ]
Signed-off-by: David S. Miller <davem@davemloft.net>
David Lamparter noted a use case where the source address selection fails
to pick an address from a VRF interface - unnumbered interfaces.
Relevant commands from his script:
ip addr add 9.9.9.9/32 dev lo
ip link set lo up
ip link add name vrf0 type vrf table 101
ip rule add oif vrf0 table 101
ip rule add iif vrf0 table 101
ip link set vrf0 up
ip addr add 10.0.0.3/32 dev vrf0
ip link add name dummy2 type dummy
ip link set dummy2 master vrf0 up
--> note dummy2 has no address - unnumbered device
ip route add 10.2.2.2/32 dev dummy2 table 101
ip neigh add 10.2.2.2 dev dummy2 lladdr 02:00:00:00:00:02
tcpdump -ni dummy2 &
And using ping instead of his socat example:
$ ping -I vrf0 -c1 10.2.2.2
ping: Warning: source address might be selected on device other than vrf0.
PING 10.2.2.2 (10.2.2.2) from 9.9.9.9 vrf0: 56(84) bytes of data.
>From tcpdump:
12:57:29.449128 IP 9.9.9.9 > 10.2.2.2: ICMP echo request, id 2491, seq 1, length 64
Note the source address is from lo and is not a VRF local address. With
this patch:
$ ping -I vrf0 -c1 10.2.2.2
PING 10.2.2.2 (10.2.2.2) from 10.0.0.3 vrf0: 56(84) bytes of data.
>From tcpdump:
12:59:25.096426 IP 10.0.0.3 > 10.2.2.2: ICMP echo request, id 2113, seq 1, length 64
Now the source address comes from vrf0.
The ipv4 function for selecting source address takes a const argument.
Removing the const requires touching a lot of places, so instead
l3mdev_master_ifindex_rcu is changed to take a const argument and then
do the typecast to non-const as required by netdev_master_upper_dev_get_rcu.
This is similar to what l3mdev_fib_table_rcu does.
IPv6 for unnumbered interfaces appears to be selecting the addresses
properly.
Cc: David Lamparter <david@opensourcerouting.org>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull Ceph fixes from Sage Weil:
"There are two small messenger bug fixes and a log spam regression fix"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
libceph: don't spam dmesg with stray reply warnings
libceph: use the right footer size when skipping a message
libceph: don't bail early from try_read() when skipping a message
This patch fixes the return value in a case which should never occur.
Instead returning "-EINVAL" we return LOWPAN_IPHC_DAM_00 which is
invalid on context based addresses. Also change the WARN_ON_ONCE to
WARN_ONCE which was suggested by Dan Carpenter.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
in case of unusually long writes to some system interfaces used by
mountd and other nfs support utilities.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJWzlVJAAoJECebzXlCjuG+fVYQAMHQX2fx33juQF03r7R4A+cI
8QITgkAEDyQLcxYu/NdXqOXYc1Nv83uJRXljS2d2Y/D25spRB8ZaIZtpVFN/RlBc
8ss3nnI/iHFd6fqPECZfgHcORwalY+eYMICc2GdXFADYjyMQDD2/xIxF0GqW2RGC
xLKPPxnrcaIcgUGhFgbptpykKeYXHTK4vb/xwjkgI6m+IISFULAG4Z1jfsbIX+4N
NcNtTumgVQ1lFmMrl36K1JXj/0PY9vPPyDG38nSbD0s6zsS1mm6ALCt9N+mPJdX8
yJAq1FL70HtsBD9/oqm/qg9qpJ1dSB3r867HTXlECp6+mYwNi56u3I0jTJKbqwmJ
4eYlgQr1TPVUoa2vzbt/PE4yuKRsbo52TP1cX8VF+/bEtg0R5sz+uSTQZE9hqJE/
Cfk1eXYJxwVY8BnBY5tgmiGwApN13l/1fnhdZltWWsMEZiYvU46JIhVWv/8zMaA7
XOKP1r7JkWXcbPPvI7/KOxi5O4kzs9Mr4kT6xULrAHEbElPak6MIoODwJOLOMh1X
z82HHvfEd3mwRhrDDg2v0MrdikRx+mtaB3zfNFFbXcHAChRCqCQ0OHxOCE9EJhCg
GVsRpRjzafcSW+NwiQrn8OpjKJ7cSaa4gFXFaikQMFHYaYgavpJVWanH3fX6iKWV
RRPufAXYsyRNFmTi4AOb
=ezck
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.5-1' of git://linux-nfs.org/~bfields/linux
Pull nfsd bugfix from Bruce Fields:
"One fix for a bug that could cause a NULL write past the end of a
buffer in case of unusually long writes to some system interfaces used
by mountd and other nfs support utilities"
* tag 'nfsd-4.5-1' of git://linux-nfs.org/~bfields/linux:
sunrpc/cache: fix off-by-one in qword_get()
This patch defines a new ETHTOOL_GLINKSETTINGS/SLINKSETTINGS API,
handled by the new get_link_ksettings/set_link_ksettings callbacks.
This API provides support for most legacy ethtool_cmd fields, adds
support for larger link mode masks (up to 4064 bits, variable length),
and removes ethtool_cmd deprecated
fields (transceiver/maxrxpkt/maxtxpkt).
This API is deprecating the legacy ETHTOOL_GSET/SSET API and provides
the following backward compatibility properties:
- legacy ethtool with legacy drivers: no change, still using the
get_settings/set_settings callbacks.
- legacy ethtool with new get/set_link_ksettings drivers: the new
driver callbacks are used, data internally converted to legacy
ethtool_cmd. ETHTOOL_GSET will return only the 1st 32b of each link
mode mask. ETHTOOL_SSET will fail if user tries to set the
ethtool_cmd deprecated fields to
non-0 (transceiver/maxrxpkt/maxtxpkt). A kernel warning is logged if
driver sets higher bits.
- future ethtool with legacy drivers: no change, still using the
get_settings/set_settings callbacks, internally converted to new data
structure. Deprecated fields (transceiver/maxrxpkt/maxtxpkt) will be
ignored and seen as 0 from user space. Note that that "future"
ethtool tool will not allow changes to these deprecated fields.
- future ethtool with new drivers: direct call to the new callbacks.
By "future" ethtool, what is meant is:
- query: first try ETHTOOL_GLINKSETTINGS, and revert to ETHTOOL_GSET if
fails
- set: query first and remember which of ETHTOOL_GLINKSETTINGS or
ETHTOOL_GSET was successful
+ if ETHTOOL_GLINKSETTINGS was successful, then change config with
ETHTOOL_SLINKSETTINGS. A failure there is final (do not try
ETHTOOL_SSET).
+ otherwise ETHTOOL_GSET was successful, change config with
ETHTOOL_SSET. A failure there is final (do not try
ETHTOOL_SLINKSETTINGS).
The interaction user/kernel via the new API requires a small
ETHTOOL_GLINKSETTINGS handshake first to agree on the length of the link
mode bitmaps. If kernel doesn't agree with user, it returns the bitmap
length it is expecting from user as a negative length (and cmd field is
0). When kernel and user agree, kernel returns valid info in all
fields (ie. link mode length > 0 and cmd is ETHTOOL_GLINKSETTINGS).
Data structure crossing user/kernel boundary is 32/64-bit
agnostic. Converted internally to a legal kernel bitmap.
The internal __ethtool_get_settings kernel helper will gradually be
replaced by __ethtool_get_link_ksettings by the time the first
"link_settings" drivers start to appear. So this patch doesn't change
it, it will be removed before it needs to be changed.
Signed-off-by: David Decotigny <decot@googlers.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch add the SO_CNX_ADVICE socket option (setsockopt only). The
purpose is to allow an application to give feedback to the kernel about
the quality of the network path for a connected socket. The value
argument indicates the type of quality report. For this initial patch
the only supported advice is a value of 1 which indicates "bad path,
please reroute"-- the action taken by the kernel is to call
dst_negative_advice which will attempt to choose a different ECMP route,
reset the TX hash for flow label and UDP source port in encapsulation,
etc.
This facility should be useful for connected UDP sockets where only the
application can provide any feedback about path quality. It could also
be useful for TCP applications that have additional knowledge about the
path outside of the normal TCP control loop.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, all ipv6 addresses are flushed when the interface is configured
down, including global, static addresses:
$ ip -6 addr show dev eth1
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
inet6 2100:1::2/120 scope global
valid_lft forever preferred_lft forever
inet6 fe80::e0:f9ff:fe79:34bd/64 scope link
valid_lft forever preferred_lft forever
$ ip link set dev eth1 down
$ ip -6 addr show dev eth1
<< nothing; all addresses have been flushed>>
Add a new sysctl to make this behavior optional. The new setting defaults to
flush all addresses to maintain backwards compatibility. When the set global
addresses with no expire times are not flushed on an admin down. The sysctl
is per-interface or system-wide for all interfaces
$ sysctl -w net.ipv6.conf.eth1.keep_addr_on_down=1
or
$ sysctl -w net.ipv6.conf.all.keep_addr_on_down=1
Will keep addresses on eth1 on an admin down.
$ ip -6 addr show dev eth1
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
inet6 2100:1::2/120 scope global
valid_lft forever preferred_lft forever
inet6 fe80::e0:f9ff:fe79:34bd/64 scope link
valid_lft forever preferred_lft forever
$ ip link set dev eth1 down
$ ip -6 addr show dev eth1
3: eth1: <BROADCAST,MULTICAST> mtu 1500 state DOWN qlen 1000
inet6 2100:1::2/120 scope global tentative
valid_lft forever preferred_lft forever
inet6 fe80::e0:f9ff:fe79:34bd/64 scope link tentative
valid_lft forever preferred_lft forever
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
msg.dst_sk needs to be set up with a valid socket because some callbacks
later derive the netns from it.
Fixes: 263ea09084d172d ("Revert "genl: Add genlmsg_new_unicast() for unicast message allocation")
Reported-by: Jon Maloy <maloy@donjonn.com>
Bisected-by: Jon Maloy <maloy@donjonn.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the TIPC module is unloaded, we have identified a race condition
that allows a node reference counter to go to zero and the node instance
being freed before the node timer is finished with accessing it. This
leads to occasional crashes, especially in multi-namespace environments.
The scenario goes as follows:
CPU0:(node_stop) CPU1:(node_timeout) // ref == 2
1: if(!mod_timer())
2: if (del_timer())
3: tipc_node_put() // ref -> 1
4: tipc_node_put() // ref -> 0
5: kfree_rcu(node);
6: tipc_node_get(node)
7: // BOOM!
We now clean up this functionality as follows:
1) We remove the node pointer from the node lookup table before we
attempt deactivating the timer. This way, we reduce the risk that
tipc_node_find() may obtain a valid pointer to an instance marked
for deletion; a harmless but undesirable situation.
2) We use del_timer_sync() instead of del_timer() to safely deactivate
the node timer without any risk that it might be reactivated by the
timeout handler. There is no risk of deadlock here, since the two
functions never touch the same spinlocks.
3: We remove a pointless tipc_node_get() + tipc_node_put() from the
timeout handler.
Reported-by: Zhijiang Hu <huzhijiang@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Although we have never seen it happen, we have identified the
following problematic scenario when nodes are stopped and deleted:
CPU0: CPU1:
tipc_node_xxx() //ref == 1
tipc_node_put() //ref -> 0
tipc_node_find() // node still in table
tipc_node_delete()
list_del_rcu(n. list)
tipc_node_get() //ref -> 1, bad
kfree_rcu()
tipc_node_put() //ref to 0 again.
kfree_rcu() // BOOM!
We fix this by introducing use of the conditional kref_get_if_not_zero()
instead of kref_get() in the function tipc_node_find(). This eliminates
any risk of post-mortem access.
Reported-by: Zhijiang Hu <huzhijiang@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to update the skb->csum after pulling the skb, otherwise
an unnecessary checksum (re)computation can ocure for IGMP/MLD packets
in the bridge code. Additionally this fixes the following splats for
network devices / bridge ports with support for and enabled RX checksum
offloading:
[...]
[ 43.986968] eth0: hw csum failure
[ 43.990344] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.4.0 #2
[ 43.996193] Hardware name: BCM2709
[ 43.999647] [<800204e0>] (unwind_backtrace) from [<8001cf14>] (show_stack+0x10/0x14)
[ 44.007432] [<8001cf14>] (show_stack) from [<801ab614>] (dump_stack+0x80/0x90)
[ 44.014695] [<801ab614>] (dump_stack) from [<802e4548>] (__skb_checksum_complete+0x6c/0xac)
[ 44.023090] [<802e4548>] (__skb_checksum_complete) from [<803a055c>] (ipv6_mc_validate_checksum+0x104/0x178)
[ 44.032959] [<803a055c>] (ipv6_mc_validate_checksum) from [<802e111c>] (skb_checksum_trimmed+0x130/0x188)
[ 44.042565] [<802e111c>] (skb_checksum_trimmed) from [<803a06e8>] (ipv6_mc_check_mld+0x118/0x338)
[ 44.051501] [<803a06e8>] (ipv6_mc_check_mld) from [<803b2c98>] (br_multicast_rcv+0x5dc/0xd00)
[ 44.060077] [<803b2c98>] (br_multicast_rcv) from [<803aa510>] (br_handle_frame_finish+0xac/0x51c)
[...]
Fixes: 9afd85c9e4 ("net: Export IGMP/MLD message validation code")
Reported-by: Álvaro Fernández Rojas <noltari@gmail.com>
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: David S. Miller <davem@davemloft.net>
The VLAN GetNext operation is specific to some switches, and thus can be
complicated to implement for some drivers.
Remove the support for the vlan_getnext/port_pvid_get approach in favor
of the generic and simpler port_vlan_dump function.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to port_fdb_dump, add a port_vlan_dump function to DSA drivers
which gets passed the switchdev VLAN object and callback.
This function, if implemented, takes precedence over the soon legacy
vlan_getnext/port_pvid_get approach.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently tc actions are stored in a per-module hashtable,
therefore are visible to all network namespaces. This is
probably the last part of the tc subsystem which is not
aware of netns now. This patch makes them per-netns,
several tc action API's need to be adjusted for this.
The tc action API code is ugly due to historical reasons,
we need to refactor that code in the future.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We only release the memory of the hashtable itself, not its
entries inside. This is not a problem yet since we only call
it in module release path, and module is refcount'ed by
actions. This would be a problem after we move the per module
hinfo into per netns in the latter patch.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
llcp_sock_getname() checks llcp_sock->dev to make sure
llcp_sock is already connected or bound, however, we could
be in the middle of llcp_sock_bind() where llcp_sock->dev
is bound and llcp_sock->service_name_len is set,
but llcp_sock->service_name is not, in this case we would
lead to copy some bytes from a NULL pointer.
Just lock the sock since this is not a hot path anyway.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
These two functions are called in sendmsg path, and the
'len' is passed from user-space, so we should not allow
malicious users to OOM kernel on purpose.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Julian Calaby <julian.calaby@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
* stop critical protocol session on disconnect to avoid
it getting stuck
* wext: fix two RTNL message ordering issues
* fix an uninitialized value (found by KASAN)
* fix an out-of-bounds access (also found by KASAN)
* clear connection keys when freeing them in all cases
(IBSS, all other places already did so)
* fix expected throughput unit to get consistent values
* set default TX aggregation timeout to 0 in minstrel
to avoid (really just hide) issues and perform better
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJWzCthAAoJEGt7eEactAAdLjEP/38ujypJkVlwZzm9L8MbSVPk
qPMmRBshVX89xOXtwJLJHOGCRY6/pfe1LKThodOp+hj+NH7OLLuAGFDScjr7eEjG
bAQrXZ8dDIo54uri4EF4jBfkqUsePs+d4VpLgdJ2Gdh2UgIdDlUcggiEyOSicc27
ZP3gRQzad8tP0O+MuqiAjwJE4C3AUzzj+8TXJJXjrT3RvDpvZfA3MClfiQVhRM+4
3qtFhqFFGfIjTSHCoGr+0RYhLJwfjalIyoENR/kYizjHXU4YtZcu7ihdixKmG3Qb
SUIlNl60WSyoh2+PXx3wp0gfTc0BrOVIpB+TQPouRu5D+tLRiPwlieVrlfJyIUJ5
ccU1RWoTZn+3OVSjNF+9IGbpL43cjaJpZUMU/mQES7+z+uEwlzdcnwDScG7sMMC+
KIyKZ4v+2sBPKYYKAzGxHb2lfTMyp9sCs25U30yqLX3RSNtuxJbDwuKrwbKa1CMx
WEjImI+iSApjTch0THAxV8/sO/ZneuRZP76PXmSDm7Mzf9qob8fnpCjFsHTypDpx
eWy8AHPhIhb3ni6Cm3TfsGLjGE4KGwD/lEwmAles2Cj48N4b2UwRjMrJ5GRxuV00
9kh2MvFEmSYBU4dw6LQmemjxNNIa792+JiS4AKJXBqkPK1H1YFvq7i8Hl9a1tQ0P
SDgbH5x6AYvmTRcD97Ph
=01n9
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2016-02-23' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
Another small set of fixes:
* stop critical protocol session on disconnect to avoid
it getting stuck
* wext: fix two RTNL message ordering issues
* fix an uninitialized value (found by KASAN)
* fix an out-of-bounds access (also found by KASAN)
* clear connection keys when freeing them in all cases
(IBSS, all other places already did so)
* fix expected throughput unit to get consistent values
* set default TX aggregation timeout to 0 in minstrel
to avoid (really just hide) issues and perform better
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The fix in 35e2d1152b ("tunnels: Allow IPv6 UDP checksums to be correctly
controlled.") changed behavior for bpf_set_tunnel_key() when in use with
IPv6 and thus uncovered a bug that TUNNEL_CSUM needed to be set but wasn't.
As a result, the stack dropped ingress vxlan IPv6 packets, that have been
sent via eBPF through collect meta data mode due to checksum now being zero.
Since after LCO, we enable IPv4 checksum by default, so make that analogous
and only provide a flag BPF_F_ZERO_CSUM_TX for the user to turn it off in
IPv4 case.
Fixes: 35e2d1152b ("tunnels: Allow IPv6 UDP checksums to be correctly controlled.")
Fixes: c6c3345407 ("bpf: support ipv6 for bpf_skb_{set,get}_tunnel_key")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Flushing/listing entries was not RCU safe, so parallel flush/dump
could lead to kernel crash. Bug reported by Deniz Eren.
Fixes netfilter bugzilla id #1050.
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Commit d15f9d694b ("libceph: check data_len in ->alloc_msg()")
mistakenly bumped the log level on the "tid %llu unknown, skipping"
message. Turn it back into a dout() - stray replies are perfectly
normal when OSDs flap, crash, get killed for testing purposes, etc.
Cc: stable@vger.kernel.org # 4.3+
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
ceph_msg_footer is 21 bytes long, while ceph_msg_footer_old is only 13.
Don't skip too much when CEPH_FEATURE_MSG_AUTH isn't negotiated.
Cc: stable@vger.kernel.org # 3.19+
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
The contract between try_read() and try_write() is that when called
each processes as much data as possible. When instructed by osd_client
to skip a message, try_read() is violating this contract by returning
after receiving and discarding a single message instead of checking for
more. try_write() then gets a chance to write out more requests,
generating more replies/skips for try_read() to handle, forcing the
messenger into a starvation loop.
Cc: stable@vger.kernel.org # 3.10+
Reported-by: Varada Kari <Varada.Kari@sandisk.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Tested-by: Varada Kari <Varada.Kari@sandisk.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Otherwise we break the contract with GSO to only pass CHECKSUM_PARTIAL
skbs down. This can easily happen with UDP+IPv4 sockets with the first
MSG_MORE write smaller than the MTU, second write is a sendfile.
Returning -EOPNOTSUPP lets the callers fall back into normal sendmsg path,
were we calculate the checksum manually during copying.
Commit d749c9cbff ("ipv4: no CHECKSUM_PARTIAL on MSG_MORE corked
sockets") started to exposes this bug.
Fixes: d749c9cbff ("ipv4: no CHECKSUM_PARTIAL on MSG_MORE corked sockets")
Reported-by: Jiri Benc <jbenc@redhat.com>
Cc: Jiri Benc <jbenc@redhat.com>
Reported-by: Wakko Warner <wakko@animx.eu.org>
Cc: Wakko Warner <wakko@animx.eu.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We want to try and pull the L4 header in if it is available in the first
fragment. As such add the flag to indicate we want to pull the headers on
the first fragment in.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The IPv6 parsing was using a local pointer when it could use the same
pointer as the IPv4 portion of the code since the key_addrs can support
both IPv4 and IPv6 as it is just a pointer.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The flow dissector bits handling FCoE didn't bother to actually validate
that the space there was enough for the FCoE header. So we need to update
things so that if there is room we add the header and report a good result,
otherwise we do not add the header, and report the bad result.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It turns out that for IPv4 we were reporting the ip_proto of the fragment,
and for IPv6 we were not. This patch updates that behavior so that we
always report the IP protocol of the fragment. In addition it takes the
steps of updating the payload offset code so that we will determine the
start of the payload not including the L4 header for any fragment after the
first.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch corrects the logic for the IPv4 parsing so that it is consistent
with how we handle IPv6. Specifically if we do not have the flow key
indicating we want the addresses we still may need to take a look at the IP
fragmentation bits and to see if we should stop after we have recognized
the L3 header.
Fixes: 807e165dc4 ("flow_dissector: Add control/reporting of fragmentation")
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
One of the validation checks for the new array-based TCP SO_REUSEPORT
validation was unintentionally dropped in ea8add2b19. This adds it back.
Lack of this check allows the user to allocate multiple sock_reuseport
structures (leaking all but the first).
Fixes: ea8add2b19 ("tcp/dccp: better use of ephemeral ports in bind()")
Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bool variable 'fail' is always non-negative, it indicates an error if it
is true.
The problem has been detected using coccinelle script
scripts/coccinelle/tests/unsigned_lesser_than_zero.cocci
Signed-off-by: Andrzej Hajda <a.hajda@samsung.com>
Acked-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
In case of multicast address we need to set always the LOWPAN_IPHC_M bit
and if a destination context identifier was found for a multicast
address then we need to set the LOWPAN_IPHC_DAC as well.
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Factor all assignments to rfkill_global_states[].cur into a single
function rfkill_update_global_state().
Signed-off-by: João Paulo Rechi Vita <jprvita@endlessm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This prepares the driver for removal of platform data.
Signed-off-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Helper for finding the type based on name. Useful if the
type needs to be determined based on device property.
Signed-off-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
[modify rfkill_types array and BUILD_BUG_ON to not cause errors]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add IEEE80211_RADIOTAP_VHT entry to rtap_namespace_sizes array in order to
define alignment and size of VHT info in tx radiotap
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi83@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Today, the supplicant will add the RRM capabilities
Information Element in the association request only if
Quiet period is supported (NL80211_FEATURE_QUIET).
Quiet is one of many RRM features, and there are other RRM
features that are not related to Quiet (e.g. neighbor
report). Therefore, requiring Quiet to enable RRM is too
restrictive.
Some of the features, like neighbor report, can be
supported by user space without any help from the kernel.
Hence adding the RRM capabilities IE to association request
should be the sole user space's decision.
Removing the RRM dependency on Quiet in the driver solves
this problem, but using an old driver with a user space
tool that would not require Quiet feature would be
problematic: the user space would add NL80211_ATTR_USE_RRM
in the association request even if the kernel doesn't
advertize NL80211_FEATURE_QUIET and the association would
be denied by the kernel.
This solution adds a global RRM capability, that tells user
space that it can request RRM capabilities IE publishment
without any specific feature support in the kernel.
Signed-off-by: Beni Lev <beni.lev@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Drivers may need to track which vif is using VHT MU-MIMO.
Move the flag indicationg the ownership of MU_MIMO to
ieee80211_vif.
Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Provide an interface to the lower level driver to set the VHT
MU-MIMO data. This is needed for example when there is an update
of the group data during low power state, where the management
frame will not be passed to the host at all.
Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Since the PNs of all the tx keys are now tracked in the public
part of the key struct (with atomic counter), we no longer
need these functions.
dvm and vt665{5,6} are currently the only users of these functions,
so update them accordingly.
Signed-off-by: Eliad Peller <eliadx.peller@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Some drivers/devices might want to set the IVs by
themselves (and still let mac80211 generate MMIC).
Specifically, this is needed when the device does
offloading at certain times, and the driver has
to make sure that the IVs of new tx frames (from
the host) are synchronized with IVs that were
potentially used during the offloading.
Similarly to CCMP, move the TX IVs of TKIP keys to the
public part of the key struct, and export a function
to add the IV right into the crypto header.
The public tx_pn field is defined as atomic64, so define
TKIP_PN_TO_IV16/32 helper macros to convert it to iv16/32
when needed.
Since the iv32 used for the p1k cache is taken
directly from the frame, we can safely remove
iv16/32 from being protected by tkip.txlock.
Signed-off-by: Eliad Peller <eliadx.peller@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Fix wiphy supported_band access in tx radiotap parsing introduced
in commit 5ec3aed9ba4c ("mac80211: Parse legacy and HT rate in
injected frames"). In particular, info->band is always set to 0
(IEEE80211_BAND_2GHZ) since it has not assigned yet.
This cause a kernel crash on 5GHz only devices.
Move ieee80211_parse_tx_radiotap() after info->band assignment
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi83@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This massively reduces data copying and thus improves rx performance
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
reuse_skb is set to true if the code decides to use the last segment.
Fixes a memory leak
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The source and destintation addresses in the memcpy arguments
are flipped. Fix that.
Fixes: 23a1f8d44c0b("mac80211: process and save VHT MU-MIMO group frame")
Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
mpp_path_del() and mesh_path_del() are mostly the same function.
Move common code into a new static function.
Acked-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: Henning Rogge <henning.rogge@fkie.fraunhofer.de>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Remember the last time when a mpp table entry is used for
rx or tx and remove them after MESH_PATH_EXPIRE time.
Acked-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: Henning Rogge <henning.rogge@fkie.fraunhofer.de>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Make the mesh_path_del() function remove all mpp table entries
that are proxied by the removed mesh path.
Acked-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: Henning Rogge <henning.rogge@fkie.fraunhofer.de>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
PBSS (Personal Basic Service Set) is a new BSS type for DMG
networks. It is similar to infrastructure BSS, having an AP-like
entity called PCP (PBSS Control Point), but it has few differences.
PBSS support is mandatory for 11ad devices.
Add support for PBSS by introducing a new PBSS flag attribute.
The PBSS flag is used in the START_AP command to request starting
a PCP instead of an AP, and in the CONNECT command to request
connecting to a PCP instead of an AP.
Signed-off-by: Lior David <liord@codeaurora.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Use skb_copy_bits in preparation for allowing fragmented skbs
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Prepararation for zero-copy A-MSDU support with page fragment SKBs
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
On error we jumped to the error label and returned the error code but we
missed releasing sinfo.
Fixes: 5fe74014172d ("mac80211: avoid excessive stack usage in sta_info")
Reviewed-by: Julian Calaby <julian.calaby@gmail.com>
Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Number of deployed 80 MHz capable VHT stations that do not support 80+80
and 160 MHz bandwidths seem to misbehave when trying to connect to an AP
that advertises 80+80 or 160 MHz channel bandwidth in the VHT Operation
element. To avoid such issues with deployed devices, modify the design
based on recently accepted IEEE 802.11 standard changes (*).
This allows poorly implemented VHT 80 MHz stations to connect with the
AP in 80 MHz mode. 80+80 and 160 MHz capable stations need to support
the new workaround mechanism to allow full bandwidth to be used.
However, there are more or less no impacted station with 80+80/160
capability deployed.
The rebased version of this patch is based on the updated version from
Johannes Berg to take the HT/VHT chandef refactoring into account.
(*) Changes in
https://mentor.ieee.org/802.11/dcn/15/11-15-1530-04-000m-vht160-operation-signaling-through-non-zero-ccfs1.docx
were accepted during the IEEE 802.11 January 2016 meeting.
Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If any frames are dropped that are part of a BA session, the reorder
buffer will "indefinitely" (until the timeout) wait for them to come
in (or a BAR moving the window) and won't release frames after them.
This means it isn't possible to filter frames within a BA session in
firmware.
Introduce an API function that allows such filtering. Calling this
function will move the BA window forward to the new SSN, and allows
marking frames after the SSN as having been filtered, so any future
reordering activity will release frames while skipping the holes.
Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Clean up ieee80211_rx_reorder_ready() callers by passing the RX
TID struct and the index, instead of the frames list. This will
make it more extensible as well.
While at it, move the inline to rx.c as it's only used there.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This will allow drivers to make more educated
decisions whether to defer transmission or not.
Relying on wake_tx_queue() call count implicitly
was not possible because it could be called
without queued frame count actually changing on
software tx aggregation start/stop code paths.
It was also not possible to know how long
byte-wise queue was without dequeueing.
Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Drivers/devices without their own rate control algorithm can get the
information what rates they should use from either the radiotap header of
injected frames or from the rate control algorithm. But the parsing of the
legacy rate information from the radiotap header was removed in commit
e6a9854b05 ("mac80211/drivers: rewrite the rate control API").
The removal of this feature heavily reduced the usefulness of frame
injection when wanting to simulate specific transmission behavior. Having
rate parsing together with MCS rates and retry support allows a fine
grained selection of the tx behavior of injected frames for these kind of
tests.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Cc: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The mean_chain_len field in struct mesh_table is copied whenever a
new mesh table is allocated, but only ever has the value 2 and is
never otherwise updated, so just remove it and use the related
define instead.
Signed-off-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When CONFIG_OPTIMIZE_INLINING is set, the sta_info_insert_finish
function consumes more stack than normally, exceeding the
1024 byte limit on ARM:
net/mac80211/sta_info.c: In function 'sta_info_insert_finish':
net/mac80211/sta_info.c:561:1: error: the frame size of 1080 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
It turns out that there are two functions that put a 'struct station_info'
on the stack: __sta_info_destroy_part2 and sta_info_insert_finish, and
this structure alone requires up to 792 bytes.
Hoping that both are called rarely enough, this replaces the
on-stack structure with a dynamic allocation, which unfortunately
requires some suboptimal error handling for out-of-memory.
The __sta_info_destroy_part2 function is actually affected by the
stack usage twice because it calls cfg80211_del_sta_sinfo(), which
has another instance of struct station_info on its stack.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 98b6218388 ("mac80211/cfg80211: add station events")
Fixes: 6f7a8d26e2 ("mac80211: send statistics with delete station event")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Make sure there's at least a debug message whenever the
connection to the AP is terminated.
Also change one message from wiphy_debug() to the common
mlme_dbg().
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When joining IBSS a full scan should be initiated in order to search
for existing cell, unless the fixed_channel parameter was set.
A default channel to create the IBSS on if no cell was found is
provided as well.
However - a scan is initiated only on the default channel provided
regardless of whether ifibss->fixed_channel is set or not, with the
obvious result of the cell not joining existing IBSS cell that is
on another channel.
Fixes: 76bed0f43b ("mac80211: IBSS fix scan request")
Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The timestamp given by iwlwifi is at the beginning of the
frame over the air, at (or during) the SYNC field. Allow
such timestamps to be given to mac80211, at least (for now)
for frames with non-HT/VHT preambles.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The ieee80211_queue_stopped() expects hw queue
number but it was given raw WMM AC number instead.
This could cause frame drops and problems with
traffic in some cases - most notably if driver
doesn't map AC numbers to queue numbers 1:1 and
uses ieee80211_stop_queues() and
ieee80211_wake_queue() only without ever calling
ieee80211_wake_queues().
On ath10k it was possible to hit this problem in
the following case:
1. wlan0 uses queue 0
(ath10k maps queues per vif)
2. offchannel uses queue 15
3. queues 1-14 are unused
4. ieee80211_stop_queues()
5. ieee80211_wake_queue(q=0)
6. ieee80211_wake_queue(q=15)
(other queues are not woken up because both
driver and mac80211 know other queues are
unused)
7. ieee80211_rx_h_mesh_fwding()
8. ieee80211_select_queue_80211() returns 2
9. ieee80211_queue_stopped(q=2) returns true
10. frame is dropped (oops!)
Fixes: d3c1597b8d ("mac80211: fix forwarded mesh frame queue mapping")
Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The driver can access the queue simultanously
while mac80211 tears down the interface. Without
spinlock protection this could lead to corrupting
sk_buff_head and subsequently to an invalid
pointer dereference.
Fixes: ba8c3d6f16 ("mac80211: add an intermediate software queue implementation")
Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Remove unnecessory "if" statement and club it with previos "if" block.
Signed-off-by: Sunil Shahu <shshahu@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This was scheduled to be removed in 2012 by:
commit 69c86373c6
Author: florian@mickler.org <florian@mickler.org>
Date: Wed Feb 24 12:05:16 2010 +0100
Document the rfkill sysfs ABI
This moves sysfs ABI info from Documentation/rfkill.txt to the
ABI subfolder and reformats it.
This also schedules the deprecated sysfs parts to be removed in
2012 (claim file) and 2014 (state file).
Signed-off-by: Florian Mickler <florian@mickler.org>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Signed-off-by: João Paulo Rechi Vita <jprvita@endlessm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
__rfkill_set_hw_state() is only one used in rfkill_set_hw_state(), and
none of them are long or complicated, so merging the two makes the code
easier to read.
Signed-off-by: João Paulo Rechi Vita <jprvita@endlessm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
RFKILL_BLOCK_SW value have just been saved to prev, no need to check it
again in the if expression. This makes code a little bit easier to read.
Signed-off-by: João Paulo Rechi Vita <jprvita@endlessm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When suspended while polling is paused, polling will erroneously
resume at resume time. Fix this by tracking pause and suspend in
separate state variable and adding the necessary checks.
Clarify the documentation on this as well.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The station MLME and IBSS/mesh ones use entirely different
code for interpreting HT and VHT operation elements. Change
the code that interprets them a bit - it now modifies an
existing chandef - and use it also in the MLME code.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Print a warning whenever an expected callback function
lacks implementation.
Signed-off-by: Ola Olsson <ola.olsson@sonymobile.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In VHT, the specification allows to limit the number of
MSDUs in an A-MSDU in the Extended Capabilities IE. There
is also a limitation on the byte size in the VHT IE.
In HT, the only limitation is on the byte size.
Parse the capabilities from the peer and make them
available to the driver.
In HT, there is another limitation when a BA agreement
is active: the byte size can't be greater than 4095.
This is not enforced here.
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The minimum chandef bandwidth calculation was done only in case
a new station was inserted (or when an existing station was removed).
However, it is possible that stations are inserted before they are
associated, e.g., when FULL_AP_CLIENT_STATE is supported and user
space adds stations unassociated.
Fix this by calling ieee80211_recalc_min_chandef() whenever
a station transitions in/out the associated state, and only
consider station marked as associated.
Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Some drivers offload some frames internally (e.g.
AddBa). Reporting such frames to mac80211 would
only confuse MLME. However it would be useful to
be able to pass such frames to monitor interfaces
for sniffing purposes, e.g. when running AP +
monitor.
To do that allow drivers to tell mac80211 whether
a given frame should be:
- processed but not delivered to any monitor vif
- not processed but delievered to monitor vifs
only
Signed-off-by: Grzegorz Bajorski <grzegorz.bajorski@tieto.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Enable driver to manage the reordering logic itself.
This is needed for example for the iwlwifi driver that
will support hardware assisted reordering.
Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Stable bugfixes:
- Fix nfs_size_to_loff_t
- NFSv4: Fix a dentry leak on alias use
Other bugfixes:
- Don't schedule a layoutreturn if the layout segment can be freed immediately.
- Always set NFS_LAYOUT_RETURN_REQUESTED with lo->plh_return_iomode
- rpcrdma_bc_receive_call() should init rq_private_buf.len
- fix stateid handling for the NFS v4.2 operations
- pnfs/blocklayout: fix a memeory leak when using,vmalloc_to_page
- fix panic in gss_pipe_downcall() in fips mode
- Fix a race between layoutget and pnfs_destroy_layout
- Fix a race between layoutget and bulk recalls
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJWzKShAAoJEGcL54qWCgDyN0QQALiX8v2wvn07vE5ZeXB5uONq
+mfx8avhEoc3NVrpG6F4Kj+yJmHeAbkgIygnhZn4tcM/2YRxGDwlVLHb++yUTHO9
8zEi+tiKx9f5pK2PxRQ0PjavVxO/xOyO0/QNrUdnj8hSNR9ow+YOVjEYUulbuhIg
VAI3oSy5qIKgtDyW7w5PuPpTXLo74hPmyqHaa+ZIr2et//nJMSsw++vAmSg3oqXq
6QkLWPHt/8yvDRRn2hKkbD9gOrFCVfaZIGLM6Q0zRWAcGTzJi94ELzPdm8cVpD1o
eXKcufgLXPt3GOeAmxZ9kwQeebR6IFcvkYom5dsPhtMBuzXu1wpanU8PGgYIQ0VA
88b2YNl+TZpiVbRzxSEellZq5b+zapH/VVVnYptZiq9wUTACc7jK6W2heqe5PzaT
iepTGCAE21tV5JewcITMQHDZiOjRNdtbBzgixI7pNfMN8whU6e5NHYj6psZqT7cf
xEEZzL+RBJuCFKhXSPbBefccA4HCRkDEpT+2QgrMbS4KKfWOg36UNbJ2kgbvcRVi
HTqoRONR6zMzYBhyMlLaUuJ1co8nSHgEsL81Q3MwWSY6gucSW7jeJ2stR20KJIo1
7qgod9Ac/BAIozjzywi0LtmxouPyPU8cqaboMhSRVPDKfFlqZBNBkFLNWwgoYXMa
r1afZQwNeRRbZUR3RulE
=/WDS
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.5-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"Stable bugfixes:
- Fix nfs_size_to_loff_t
- NFSv4: Fix a dentry leak on alias use
Other bugfixes:
- Don't schedule a layoutreturn if the layout segment can be freed
immediately.
- Always set NFS_LAYOUT_RETURN_REQUESTED with lo->plh_return_iomode
- rpcrdma_bc_receive_call() should init rq_private_buf.len
- fix stateid handling for the NFS v4.2 operations
- pnfs/blocklayout: fix a memeory leak when using,vmalloc_to_page
- fix panic in gss_pipe_downcall() in fips mode
- Fix a race between layoutget and pnfs_destroy_layout
- Fix a race between layoutget and bulk recalls"
* tag 'nfs-for-4.5-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv4.x/pnfs: Fix a race between layoutget and bulk recalls
NFSv4.x/pnfs: Fix a race between layoutget and pnfs_destroy_layout
auth_gss: fix panic in gss_pipe_downcall() in fips mode
pnfs/blocklayout: fix a memeory leak when using,vmalloc_to_page
nfs4: fix stateid handling for the NFS v4.2 operations
NFSv4: Fix a dentry leak on alias use
xprtrdma: rpcrdma_bc_receive_call() should init rq_private_buf.len
pNFS: Always set NFS_LAYOUT_RETURN_REQUESTED with lo->plh_return_iomode
pNFS: Fix pnfs_mark_matching_lsegs_return()
nfs: fix nfs_size_to_loff_t
IPCB may contain data from previous layers (in the observed case the
qdisc layer). In the observed scenario, the data was misinterpreted as
ip header options, which later caused the ihl to be set to an invalid
value (<5). This resulted in an infinite loop in the mips implementation
of ip_fast_csum.
This patch clears IPCB(skb)->opt before dst_link_failure can be called for
various types of tunnels. This change only applies to encapsulated ipv4
packets.
The code introduced in 11c21a30 which clears all of IPCB has been removed
to be consistent with these changes, and instead the opt field is cleared
unconditionally in ip_tunnel_xmit. The change in ip_tunnel_xmit applies to
SIT, GRE, and IPIP tunnels.
The relevant vti, l2tp, and pptp functions already contain similar code for
clearing the IPCB.
Signed-off-by: Bernie Harris <bernie.harris@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently it's converted into msecs, thus HZ=1000 intact.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Fixes: 740b0f1841 ("tcp: switch rtt estimations to usec resolution")
Signed-off-by: David S. Miller <davem@davemloft.net>
DSA drivers may support multiple bridge groups with the same hardware
VLAN. The mv88e6xxx driver which cannot yet, already has its own check
for overlapping bridges. Thus remove the check from the DSA layer.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some DSA drivers may or may not support multiple software bridges on top
of an hardware switch.
It is more convenient for them to access the bridge's net_device for
finer configuration.
Removing the need to craft and access a bitmask also simplifies the
code.
This patch changes the signature of bridge related functions, update DSA
drivers, and removes dsa_slave_br_port_mask.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Tested-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixed the return value of netdev notifier. If the command is
a don't care a NOTIFY_DONE should be returned. If the command matched a
NOTIFY_OK should be returned.
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch introduce support for IPHC stateful address compression. It
will offer the context table via one debugfs entry.
This debugfs has and directory for each cid entry for the context table.
Inside each cid directory there exists the following files:
- "active": If the entry is added or deleted. The context table is
original a list implementation, this flag will indicate if the
context is part of list or not.
- "prefix": The ipv6 prefix.
- "prefix_length": The prefix length for the prefix.
- "compression": The compression flag according RFC6775.
This part should be moved into sysfs after some testing time.
Also the debugfs entry contains a "show" file which is a pretty-printout
for the current context table information.
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Destroying the workqueue before unregistering the net device caused a
kernel oops
Signed-off-by: Koen Zandberg <koen@bergzand.net>
Acked-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
When the HCI_AUTO_OFF flag is cleared, the power_off delayed work need
to be cancel or HCI will be powered off even if it's managed.
Signed-off-by: Wei-Ning Huang <wnhuang@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Recently a managed version of led_trigger_register was introduced.
Using devm_led_trigger_register allows to simplify the LED trigger code.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Add support for LED triggers to the Bluetooth subsystem and add kernel
config symbol BT_LEDS for it.
For now one trigger for indicating "HCI is powered up" is supported.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The qword_get() function NUL-terminates its output buffer. If the input
string is in hex format \xXXXX... and the same length as the output
buffer, there is an off-by-one:
int qword_get(char **bpp, char *dest, int bufsize)
{
...
while (len < bufsize) {
...
*dest++ = (h << 4) | l;
len++;
}
...
*dest = '\0';
return len;
}
This patch ensures the NUL terminator doesn't fall outside the output
buffer.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
When user-space has started a critical protocol session and a disconnect
event occurs, the rdev::crit_prot_nlportid remains set. This caused a
subsequent NL80211_CMD_CRIT_PROTO_START to fail (-EBUSY). Fix this by
clearing the rdev attribute and call .crit_proto_stop() callback upon
disconnect event.
Reviewed-by: Hante Meuleman <meuleman@broadcom.com>
Reviewed-by: Pieter-Paul Giesberts <pieterpg@broadcom.com>
Signed-off-by: Arend van Spriel <arend@broadcom.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The connection keys are zeroed out in all other cases except this
one. Let's fix the last one as well.
Signed-off-by: Ola Olsson <ola.olsson@sonymobile.com>
Reviewed-by: Julian Calaby <julian.calaby@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The value 5000 was put here with the addition of the timeout field to
ieee80211_start_tx_ba_session. It was originally added in mac80211 to
save resources for drivers like iwlwifi, which only supports a limited
number of concurrent aggregation sessions.
Since iwlwifi does not use minstrel_ht and other drivers don't need
this, 0 is a better default - especially since there have been
recent reports of aggregation setup related issues reproduced with
ath9k. This should improve stability without causing any adverse
effects.
Cc: stable@vger.kernel.org
Acked-by: Avery Pennarun <apenwarr@gmail.com>
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The batman-adv source code is the only place in the kernel which uses the
*_free_ref naming scheme for the *_put functions. Changing it to *_put
makes it more consistent and makes it easier to understand the connection
to the *_get functions.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The batman-adv source code is the only place in the kernel which uses the
*_free_ref naming scheme for the *_put functions. Changing it to *_put
makes it more consistent and makes it easier to understand the connection
to the *_get functions.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The batman-adv source code is the only place in the kernel which uses the
*_free_ref naming scheme for the *_put functions. Changing it to *_put
makes it more consistent and makes it easier to understand the connection
to the *_get functions.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The batman-adv source code is the only place in the kernel which uses the
*_free_ref naming scheme for the *_put functions. Changing it to *_put
makes it more consistent and makes it easier to understand the connection
to the *_get functions.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The batman-adv source code is the only place in the kernel which uses the
*_free_ref naming scheme for the *_put functions. Changing it to *_put
makes it more consistent and makes it easier to understand the connection
to the *_get functions.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The batman-adv source code is the only place in the kernel which uses the
*_free_ref naming scheme for the *_put functions. Changing it to *_put
makes it more consistent and makes it easier to understand the connection
to the *_get functions.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The batman-adv source code is the only place in the kernel which uses the
*_free_ref naming scheme for the *_put functions. Changing it to *_put
makes it more consistent and makes it easier to understand the connection
to the *_get functions.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The batman-adv source code is the only place in the kernel which uses the
*_free_ref naming scheme for the *_put functions. Changing it to *_put
makes it more consistent and makes it easier to understand the connection
to the *_get functions.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The batman-adv source code is the only place in the kernel which uses the
*_free_ref naming scheme for the *_put functions. Changing it to *_put
makes it more consistent and makes it easier to understand the connection
to the *_get functions.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The batman-adv source code is the only place in the kernel which uses the
*_free_ref naming scheme for the *_put functions. Changing it to *_put
makes it more consistent and makes it easier to understand the connection
to the *_get functions.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The batman-adv source code is the only place in the kernel which uses the
*_free_ref naming scheme for the *_put functions. Changing it to *_put
makes it more consistent and makes it easier to understand the connection
to the *_get functions.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The batman-adv source code is the only place in the kernel which uses the
*_free_ref naming scheme for the *_put functions. Changing it to *_put
makes it more consistent and makes it easier to understand the connection
to the *_get functions.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The batman-adv source code is the only place in the kernel which uses the
*_free_ref naming scheme for the *_put functions. Changing it to *_put
makes it more consistent and makes it easier to understand the connection
to the *_get functions.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The batman-adv source code is the only place in the kernel which uses the
*_free_ref naming scheme for the *_put functions. Changing it to *_put
makes it more consistent and makes it easier to understand the connection
to the *_get functions.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The batman-adv source code is the only place in the kernel which uses the
*_free_ref naming scheme for the *_put functions. Changing it to *_put
makes it more consistent and makes it easier to understand the connection
to the *_get functions.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The batman-adv source code is the only place in the kernel which uses the
*_free_ref naming scheme for the *_put functions. Changing it to *_put
makes it more consistent and makes it easier to understand the connection
to the *_get functions.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The batman-adv source code is the only place in the kernel which uses the
*_free_ref naming scheme for the *_put functions. Changing it to *_put
makes it more consistent and makes it easier to understand the connection
to the *_get functions.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The batman-adv source code is the only place in the kernel which uses the
*_free_ref naming scheme for the *_put functions. Changing it to *_put
makes it more consistent and makes it easier to understand the connection
to the *_get functions.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The batman-adv source code is the only place in the kernel which uses the
*_free_ref naming scheme for the *_put functions. Changing it to *_put
makes it more consistent and makes it easier to understand the connection
to the *_get functions.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
BATADV_BONDING_TQ_THRESHOLD is not used anymore since the implementation
of the bat_neigh_is_similar_or_better() API function.
Such function uses the more generic BATADV_TQ_SIMILARITY_THRESHOLD
constant.
Therefore, remove definition of the unused BATADV_BONDING_TQ_THRESHOLD
constant.
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Conflicts:
drivers/net/phy/bcm7xxx.c
drivers/net/phy/marvell.c
drivers/net/vxlan.c
All three conflicts were cases of simple overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
* multipath:
NFS add callback_ops to nfs4_proc_bind_conn_to_session_callback
pnfs/NFSv4.1: Add multipath capabilities to pNFS flexfiles servers over NFSv3
SUNRPC: Allow addition of new transports to a struct rpc_clnt
NFSv4.1: nfs4_proc_bind_conn_to_session must iterate over all connections
SUNRPC: Make NFS swap work with multipath
SUNRPC: Add a helper to apply a function to all the rpc_clnt's transports
SUNRPC: Allow caller to specify the transport to use
SUNRPC: Use the multipath iterator to assign a transport to each task
SUNRPC: Make rpc_clnt store the multipath iterators
SUNRPC: Add a structure to track multiple transports
SUNRPC: Make freeing of struct xprt rcu-safe
SUNRPC: Uninline xprt_get(); It isn't performance critical.
SUNRPC: Reorder rpc_task to put waitqueue related info in same cachelines
SUNRPC: Remove unused function rpc_task_reset_client
Johan Hedberg says:
====================
pull request: bluetooth 2016-02-20
Here's an important patch for 4.5 which fixes potential invalid pointer
access when processing completed Bluetooth HCI commands.
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
While debugging with bpf_jit_disasm I noticed emissions of 'mov %eax,%eax',
and found that this comes from BPF_RET | BPF_A translations from classic
BPF. Emitting this is unnecessary as BPF_REG_A is mapped into BPF_REG_0
already, therefore only emit a mov when immediates are used as return value.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When using this helper for updating UDP checksums, we need to extend
this in order to write CSUM_MANGLED_0 for csum computations that result
into 0 as sum. Reason we need this is because packets with a checksum
could otherwise become incorrectly marked as a packet without a checksum.
Likewise, if the user indicates BPF_F_MARK_MANGLED_0, then we should
not turn packets without a checksum into ones with a checksum.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we're dealing with clones and the area is not writeable, try
harder and get a copy via pskb_expand_head(). Replace also other
occurences in tc actions with the new skb_try_make_writable().
Reported-by: Ashhad Sheikh <ashhadsheikh394@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently limit bpf_skb_store_bytes() and bpf_skb_load_bytes()
helpers to only store or load a maximum buffer of 16 bytes. Thus,
loading, rewriting and storing headers require several bpf_skb_load_bytes()
and bpf_skb_store_bytes() calls.
Also here we can use a per-cpu scratch buffer instead in order to not
pressure stack space any further. I do suspect that this limit was mainly
set in place for this particular reason. So, ease program development
by removing this limitation and make the scratchpad generic, so it can
be reused.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
For L4 checksums, we currently have bpf_l4_csum_replace() helper. It's
currently limited to handle 2 and 4 byte changes in a header and feeds the
from/to into inet_proto_csum_replace{2,4}() helpers of the kernel. When
working with IPv6, for example, this makes it rather cumbersome to deal
with, similarly when editing larger parts of a header.
Instead, extend the API in a more generic way: For bpf_l4_csum_replace(),
add a case for header field mask of 0 to change the checksum at a given
offset through inet_proto_csum_replace_by_diff(), and provide a helper
bpf_csum_diff() that can generically calculate a from/to diff for arbitrary
amounts of data.
This can be used in multiple ways: for the bpf_l4_csum_replace() only
part, this even provides us with the option to insert precalculated diffs
from user space f.e. from a map, or from bpf_csum_diff() during runtime.
bpf_csum_diff() has a optional from/to stack buffer input, so we can
calculate a diff by using a scratchbuffer for scenarios where we're
inserting (from is NULL), removing (to is NULL) or diffing (from/to buffers
don't need to be of equal size) data. Also, bpf_csum_diff() allows to
feed a previous csum into csum_partial(), so the function can also be
cascaded.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Avoid users having to manually load the module by adding a module
alias allowing it to be autoloaded by the lwt infra.
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Avoid users having to manually load the module by adding a module
alias allowing it to be autoloaded by the lwt infra.
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The lwt implementations using net devices can autoload using the
existing mechanism using IFLA_INFO_KIND. However, there's no mechanism
that lwt modules not using net devices can use.
Therefore, add the ability to autoload modules registering lwt
operations for lwt implementations not using a net device so that
users don't have to manually load the modules.
Only users with the CAP_NET_ADMIN capability can cause modules to be
loaded, which is ensured by rtnetlink_rcv_msg rejecting non-RTM_GETxxx
messages for users without this capability, and by
lwtunnel_build_state not being called in response to RTM_GETxxx
messages.
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently vlan device inherits unicast filtering flag from underlying
device. If underlying device doesn't support unicast filter, this will
put vlan device into promiscuous mode when it's stacked.
Tun on IFF_UNICAST_FLT on the vlan device in any case so that it does
not go into promiscuous mode needlessly. If underlying device does not
support unicast filtering, that device will enter promiscuous mode.
Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dmitry Vyukov noted recently that the sctp_port_hashtable had an error in
its size computation, observing that the current method never guaranteed
that the hashsize (measured in number of entries) would be a power of two,
which the input hash function for that table requires. The root cause of
the problem is that two values need to be computed (one, the allocation
order of the storage requries, as passed to __get_free_pages, and two the
number of entries for the hash table). Both need to be ^2, but for
different reasons, and the existing code is simply computing one order
value, and using it as the basis for both, which is wrong (i.e. it assumes
that ((1<<order)*PAGE_SIZE)/sizeof(bucket) is still ^2 when its not).
To fix this, we change the logic slightly. We start by computing a goal
allocation order (which is limited by the maximum size hash table we want
to support. Then we attempt to allocate that size table, decreasing the
order until a successful allocation is made. Then, with the resultant
successful order we compute the number of buckets that hash table supports,
which we then round down to the nearest power of two, giving us the number
of entries the table actually supports.
I've tested this locally here, using non-debug and spinlock-debug kernels,
and the number of entries in the hashtable consistently work out to be
powers of two in all cases.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
CC: Dmitry Vyukov <dvyukov@google.com>
CC: Vladislav Yasevich <vyasevich@gmail.com>
CC: "David S. Miller" <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 44d2713774 ("Bluetooth: Compress the size of struct
hci_ctrl") we squashed down the size of the structure by using a union
with the assumption that all users would use the flag to determine
whether we had a req_complete or a req_complete_skb.
Unfortunately we had a case in hci_req_cmd_complete() where we weren't
looking at the flag. This can result in a situation where we might be
storing a hci_req_complete_skb_t in a hci_req_complete_t variable, or
vice versa.
During some testing I found at least one case where the function
hci_req_sync_complete() was called improperly because the kernel thought
that it didn't require an SKB. Looking through the stack in kgdb I
found that it was called by hci_event_packet() and that
hci_event_packet() had both of its locals "req_complete" and
"req_complete_skb" pointing to the same place: both to
hci_req_sync_complete().
Let's make sure we always check the flag.
For more details on debugging done, see <http://crbug.com/588288>.
Fixes: 44d2713774 ("Bluetooth: Compress the size of struct hci_ctrl")
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Acked-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The unix_stream_read_generic function tries to use a continue statement
to restart the receive loop after waiting for a message. This may not
work as intended as the caller might use a recvmsg call to peek at
control messages without specifying a message buffer. If this was the
case, the continue will cause the function to return without an error
and without the credential information if the function had to wait for a
message while it had returned with the credentials otherwise. Change to
using goto to restart the loop without checking the condition first in
this case so that credentials are returned either way.
Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The value passed by unix_diag_get_exact to unix_lookup_by_ino has type
__u32, but unix_lookup_by_ino's argument ino has type int, which is not
a problem yet.
However, when ino is compared with sock_i_ino return value of type
unsigned long, ino is sign extended to signed long, and this results
to incorrect comparison on 64-bit architectures for inode numbers
greater than INT_MAX.
This bug was found by strace test suite.
Fixes: 5d3cae8bc3 ("unix_diag: Dumping exact socket core")
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace individual implementations with the recently introduced
skb_postpush_rcsum() helper.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Tom Herbert <tom@herbertland.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements sub command ETHTOOL_SCOALESCE for ioctl
ETHTOOL_PERQUEUE. It introduces an interface set_per_queue_coalesce to
set coalesce of each masked queue to device driver. The wanted coalesce
information are stored in "data" for each masked queue, which can copy
from userspace.
If it fails to set coalesce to device driver, the value which already
set to specific queue will be tried to rollback.
Signed-off-by: Kan Liang <kan.liang@intel.com>
Reviewed-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements sub command ETHTOOL_GCOALESCE for ioctl
ETHTOOL_PERQUEUE. It introduces an interface get_per_queue_coalesce to
get coalesce of each masked queue from device driver. Then the interrupt
coalescing parameters will be copied back to user space one by one.
Signed-off-by: Kan Liang <kan.liang@intel.com>
Reviewed-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce a new ioctl ETHTOOL_PERQUEUE for per queue parameters setting.
The following patches will enable some SUB_COMMANDs for per queue
setting.
Signed-off-by: Kan Liang <kan.liang@intel.com>
Reviewed-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
In ipv4, when the machine receives a ICMP_FRAG_NEEDED message, the
connected UDP socket will get EMSGSIZE message on its next read from the
socket.
However, this is not the case for ipv6.
This fix modifies the udp err handler in Ipv6 for ICMP6_PKT_TOOBIG to
make it similar to ipv4 behavior. That is when the machine gets an
ICMP6_PKT_TOOBIG message, the connected UDP socket will get EMSGSIZE
message on its next read from the socket.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
the commit 35e2d1152b ("tunnels: Allow IPv6 UDP checksums to be
correctly controlled.") changed the default xmit checksum setting
for lwt vxlan/geneve ipv6 tunnels, so that now the checksum is not
set into external UDP header.
This commit changes the rx checksum setting for both lwt vxlan/geneve
devices created by openvswitch accordingly, so that lwt over ipv6
tunnel pairs are again able to communicate with default values.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Jesse Gross <jesse@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
access - it was triggered when removing an object from a list
after it was already free'd due to bad reference counting.
This misbehaviour existed for both the gw_node and the
orig_node_vlan object and has been fixed by Sven Eckelmann.
The last patch fixes our interface feasibility check and prevents
it from looping indefinitely when two net_device objects
reference each other via iflink index (i.e. veth pair), by
Andrew Lunn
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJWwy9MAAoJENpFlCjNi1MRu+cQAMF7zYIlWggCWUEqaEIId4qi
daGly3DNrXWH8FJta+NR399xXlTXNoyS8uZYt9YUZrXlwXDbDNBHJ8Jcmc5sslZr
/06os5KUvNLlasup1GLBbynKUCTBSjVyo3DBwVCsiCL/iHdN1tFZXg3c2ET2U7Tk
4bFZO4h1JkpwCYeCjCJQSbPrjnAXYdxV01iarDJZgAoc290f92Ob4mTuTrQr1Vzi
EH2fyrd3m/qmRb2YBj7jNdVeD2Q4XyMKY4vHHeBq9JR/kwINwUg38+gnvcwK+KeD
VJilETdC686ZJKBSVvECQZqeXr4HaCINFNuepZLsg128IQ2ZjqPYC1yTq/qc1VF+
YGCunXhbbNgxeE1+QgXssgRAiUBx43m2osQegpHkyCzK8GPbv1gRIbTYaHgmqa4A
Pn6WHdV2kcfGRFNOASST5+eWPe9ol8RvAbXT0yD2+6CNz6aijfbGITUTlB24X+f+
6CsS3uCBsH8FI3hqjgULTqcyF14QbnLZ9AnkxYwt8B0Ge62LEfdwuXAh96mIhfyf
cGi7DBnHAYapOPzz0jSBhW5myiPOioNLJfzokZrBq+RuRBYUIUvzGmz8QJQpy/GO
EdnMOE/uWeZeSCsAGTE00pjqfHliK/vu4Wh0puGFWTQGCBbKWP5p4eNqRh4p+2GC
Sg1f1EKDt2PRw9aZHNHm
=CvI3
-----END PGP SIGNATURE-----
Merge tag 'batman-adv-fix-for-davem' of git://git.open-mesh.org/linux-merge
Antonio Quartulli says:
====================
Two of the fixes included in this patchset prevent wrong memory
access - it was triggered when removing an object from a list
after it was already free'd due to bad reference counting.
This misbehaviour existed for both the gw_node and the
orig_node_vlan object and has been fixed by Sven Eckelmann.
The last patch fixes our interface feasibility check and prevents
it from looping indefinitely when two net_device objects
reference each other via iflink index (i.e. veth pair), by
Andrew Lunn
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
An error response from a RTM_GETNETCONF request can return the positive
error value EINVAL in the struct nlmsgerr that can mislead userspace.
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently mdb entries are exported directly as a structure inside
MDBA_MDB_ENTRY_INFO attribute, we can't really extend it without
breaking user-space. In order to export new mdb fields, I've converted
the MDBA_MDB_ENTRY_INFO into a nested attribute which starts like before
with struct br_mdb_entry (without header, as it's casted directly in
iproute2) and continues with MDBA_MDB_EATTR_ attributes. This way we
keep compatibility with older users and can export new data.
I've tested this with iproute2, both with and without support for the
added attribute and it works fine.
So basically we again have MDBA_MDB_ENTRY_INFO with struct br_mdb_entry
inside but it may contain also some additional MDBA_MDB_EATTR_ attributes
such as MDBA_MDB_EATTR_TIMER which can be parsed by user-space.
So the new structure is:
[MDBA_MDB] = {
[MDBA_MDB_ENTRY] = {
[MDBA_MDB_ENTRY_INFO]
[MDBA_MDB_ENTRY_INFO] { <- Nested attribute
struct br_mdb_entry <- nla_put_nohdr()
[MDBA_MDB_ENTRY attributes] <- normal netlink attributes
}
}
}
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Switch the port check and skip if it's null, this allows us to reduce one
indentation level.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes a bug where NFS v4.1 callbacks were returning
RPC_GARBAGE_ARGS to the server.
Signed-off-by: Anna Schumaker <Anna@OcarinaProject.net>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJWxO+oAAoJENfLVL+wpUDrTwMQAPb1meyxjiRAWbZJPF/GO0oE
R62K7y6Y7fJ8MtOXWO+cyhZ54sRaGflTm233KS9L9ICY0acLU0fcArO43QLDH5p9
hozOJso2C7Ti2AVcgpLRNAK//DyBe8zzSU4qNGdLGSccKJcl97oUuI+18QpeWS8p
4dbscyx6pYxwTBUrvihLSUWcONge+uKYHHOriZ/r/sno+4wHNY6e11xeseMqmHQx
RFZOvsZakIoXsBW1UEZZsSpY0GnOTTTWZlk0rOb5ZyaCU+/sZygxJSHOSri66v8h
vPp7zx2HUGGpeMNRXlRgWkiJ2B+pShJ7JH7kfq5ZyMoy1uEeMEw3cdVoG+/OSBaT
EjOWbuj2Npq30GJAdXYE9joQ+Pd2d3h8q2/qzucpQh89hoGcf5jwgYrdyUgzHOdA
Lnsvhv5Wm8QiKO8LxNtnNA607dsDI5FLus+ijNl0bVFT12MYS/2ge679armCh5tA
eTJCp0ARKzWsoVRCCkGiPjmGg9yCTP73uNm/EXr4HLKfzSIbrZLK1Aswe9NS1+/L
tLVJsxbjRKfm9Ovvs7fl0lB7S4fbMKW78PO+vUuNfaRbgtha7SexvTp4g3am0Plh
2JnOgUqeG6My5Q9z843iG7vBD5U7jMId8dD37IlkXNMeiav7xBo1MQIokUDfpuXo
pbzSYIpdnIiPkSvAU3L2
=7AXa
-----END PGP SIGNATURE-----
Merge tag 'nfs-rdma-4.5-1' of git://git.linux-nfs.org/projects/anna/nfs-rdma
NFS: NFSoRDMA Client Bugfix
This patch fixes a bug where NFS v4.1 callbacks were returning
RPC_GARBAGE_ARGS to the server.
Signed-off-by: Anna Schumaker <Anna@OcarinaProject.net>
The cfrfml_receive() function might return positive value EPROTO
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The atalk_sendmsg() function might return wrong value ENETUNREACH
instead of -ENETUNREACH.
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
My implementation around IFF_NO_QUEUE driver flag assumed that leaving
tx_queue_len untouched (specifically: not setting it to zero) by drivers
would make it possible to assign a regular qdisc to them without having
to worry about setting tx_queue_len to a useful value. This was only
partially true: I overlooked that some drivers don't call ether_setup()
and therefore not initialize tx_queue_len to the default value of 1000.
Consequently, removing the workarounds in place for that case in qdisc
implementations which cared about it (namely, pfifo, bfifo, gred, htb,
plug and sfb) leads to problems with these specific interface types and
qdiscs.
Luckily, there's already a sanitization point for drivers setting
tx_queue_len to zero, which can be reused to assign the fallback value
most qdisc implementations used, which is 1.
Fixes: 348e3435cb ("net: sched: drop all special handling of tx_queue_len == 0")
Tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
ether_setup sets IFF_TX_SKB_SHARING but this is not supported by gre
as it modifies the skb on xmit.
Also, clean up whitespace in ipgre_tap_setup when we're already touching it.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Part of skb_scrub_packet was open coded in iptunnel_pull_header. Let it call
skb_scrub_packet directly instead.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the shared br_log_state function and print the info directly in
br_set_state, where the net_bridge_port state is actually changed.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Acked-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
reverts commit 3ab1f683bf ("nfnetlink: add support for memory mapped
netlink")'
Like previous commits in the series, remove wrappers that are not needed
after mmapped netlink removal.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Following mmapped netlink removal this code can be simplified by
removing the alloc wrapper.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit bb9b18fb55 ("genl: Add genlmsg_new_unicast() for
unicast message allocation")'.
Nothing wrong with it; its no longer needed since this was only for
mmapped netlink support.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
revert commit 795449d8b8 ("openvswitch: Enable memory mapped Netlink i/o").
Following the mmaped netlink removal this code can be removed.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
mmapped netlink has a number of unresolved issues:
- TX zerocopy support had to be disabled more than a year ago via
commit 4682a03586 ("netlink: Always copy on mmap TX.")
because the content of the mmapped area can change after netlink
attribute validation but before message processing.
- RX support was implemented mainly to speed up nfqueue dumping packet
payload to userspace. However, since commit ae08ce0021
("netfilter: nfnetlink_queue: zero copy support") we avoid one copy
with the socket-based interface too (via the skb_zerocopy helper).
The other problem is that skbs attached to mmaped netlink socket
behave different from normal skbs:
- they don't have a shinfo area, so all functions that use skb_shinfo()
(e.g. skb_clone) cannot be used.
- reserving headroom prevents userspace from seeing the content as
it expects message to start at skb->head.
See for instance
commit aa3a022094 ("netlink: not trim skb for mmaped socket when dump").
- skbs handed e.g. to netlink_ack must have non-NULL skb->sk, else we
crash because it needs the sk to check if a tx ring is attached.
Also not obvious, leads to non-intuitive bug fixes such as 7c7bdf359
("netfilter: nfnetlink: use original skbuff when acking batches").
mmaped netlink also didn't play nicely with the skb_zerocopy helper
used by nfqueue and openvswitch. Daniel Borkmann fixed this via
commit 6bb0fef489 ("netlink, mmap: fix edge-case leakages in nf queue
zero-copy")' but at the cost of also needing to provide remaining
length to the allocation function.
nfqueue also has problems when used with mmaped rx netlink:
- mmaped netlink doesn't allow use of nfqueue batch verdict messages.
Problem is that in the mmap case, the allocation time also determines
the ordering in which the frame will be seen by userspace (A
allocating before B means that A is located in earlier ring slot,
but this also means that B might get a lower sequence number then A
since seqno is decided later. To fix this we would need to extend the
spinlocked region to also cover the allocation and message setup which
isn't desirable.
- nfqueue can now be configured to queue large (GSO) skbs to userspace.
Queing GSO packets is faster than having to force a software segmentation
in the kernel, so this is a desirable option. However, with a mmap based
ring one has to use 64kb per ring slot element, else mmap has to fall back
to the socket path (NL_MMAP_STATUS_COPY) for all large packets.
To use the mmap interface, userspace not only has to probe for mmap netlink
support, it also has to implement a recv/socket receive path in order to
handle messages that exceed the size of an rx ring element.
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Ken-ichirou MATSUZAWA <chamaken@gmail.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ilya reported following lockdep splat:
kernel: =========================
kernel: [ BUG: held lock freed! ]
kernel: 4.5.0-rc1-ceph-00026-g5e0a311 #1 Not tainted
kernel: -------------------------
kernel: swapper/5/0 is freeing memory
ffff880035c9d200-ffff880035c9dbff, with a lock still held there!
kernel: (&(&queue->rskq_lock)->rlock){+.-...}, at:
[<ffffffff816f6a88>] inet_csk_reqsk_queue_add+0x28/0xa0
kernel: 4 locks held by swapper/5/0:
kernel: #0: (rcu_read_lock){......}, at: [<ffffffff8169ef6b>]
netif_receive_skb_internal+0x4b/0x1f0
kernel: #1: (rcu_read_lock){......}, at: [<ffffffff816e977f>]
ip_local_deliver_finish+0x3f/0x380
kernel: #2: (slock-AF_INET){+.-...}, at: [<ffffffff81685ffb>]
sk_clone_lock+0x19b/0x440
kernel: #3: (&(&queue->rskq_lock)->rlock){+.-...}, at:
[<ffffffff816f6a88>] inet_csk_reqsk_queue_add+0x28/0xa0
To properly fix this issue, inet_csk_reqsk_queue_add() needs
to return to its callers if the child as been queued
into accept queue.
We also need to make sure listener is still there before
calling sk->sk_data_ready(), by holding a reference on it,
since the reference carried by the child can disappear as
soon as the child is put on accept queue.
Reported-by: Ilya Dryomov <idryomov@gmail.com>
Fixes: ebb516af60 ("tcp/dccp: fix race at listener dismantle phase")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the gc of ipv4 route was removed, the route cached would has
no chance to be removed, and even it has been timeout, it still could
be used, cause no code to check it's expires.
Fix this issue by checking and removing route cache when we get route.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
When VLANs are created / destroyed on a VLAN filtering bridge (MASTER
flag set), the configuration is passed down to the hardware. However,
when only the flags (e.g. PVID) are toggled, the configuration is done
in the software bridge alone.
While it is possible to pass these flags to hardware when invoked with
the SELF flag set, this creates inconsistency with regards to the way
the VLANs are initially configured.
Pass the flags down to the hardware even when the VLAN already exists
and only the flags are toggled.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
actions could change the etherproto in particular with ethernet
tunnelled data. Typically such actions, after peeling the outer header,
will ask for the packet to be reclassified. We then need to restart
the classification with the new proto header.
Example setup used to catch this:
sudo tc qdisc add dev $ETH ingress
sudo $TC filter add dev $ETH parent ffff: pref 1 protocol 802.1Q \
u32 match u32 0 0 flowid 1:1 \
action vlan pop reclassify
Fixes: 3b3ae88026 ("net: sched: consolidate tc_classify{,_compat}")
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ever since commit 04ed3e741d
("net: change netdev->features to u32") the format string
fmt_long_hex has not been used, so we may as well remove it.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since function vlan_proc_rem_dev() will only return 0, it's better to
return void instead of int.
Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With commit 0071f56e46 ("dsa: Register netdev before phy"), we are now trying
to free a network device that has been previously registered, and in case of
errors, this will make us hit the BUG_ON(dev->reg_state != NETREG_UNREGISTERED)
condition.
Fix this by adding a missing unregister_netdev() before free_netdev().
Fixes: 0071f56e46 ("dsa: Register netdev before phy")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The proc_create() and remove_proc_entry() functions do not reference
their arguments when CONFIG_PROC_FS is disabled, so we get a couple
of warnings about unused variables in IPVS:
ipvs/ip_vs_app.c:608:14: warning: unused variable 'net' [-Wunused-variable]
ipvs/ip_vs_ctl.c:3950:14: warning: unused variable 'net' [-Wunused-variable]
ipvs/ip_vs_ctl.c:3994:14: warning: unused variable 'net' [-Wunused-variable]
This removes the local variables and instead looks them up separately
for each use, which obviously avoids the warning.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 4c50a8ce2b63 ("netfilter: ipvs: avoid unused variable warning")
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
This have been there for a long time, but does not seem to add value
Signed-off-by: Yannick Brosseau <scientist@fb.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
There are no longer any in-tree drivers that use it.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
The timeout is a long, we return it truncated if it is huge. Basically
harmless as the only caller does a boolean check, but tidy it up anyway.
(64bit build tested this time. Thank you 0day)
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit 8b570dc9f7 ("sctp: only drop the reference on the datamsg
after sending a msg") used sctp_datamsg_put in sctp_sendmsg, instead of
sctp_datamsg_free, this function has no use in sctp.
So we will remove it.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sctp_seq_dump_remote_addrs is only called by sctp_assocs_seq_show()
and it has been protected by rcu_read_lock that is from
rhashtable_walk_start().
So we will remove this one.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__sctp_lookup_association() is only invoked by sctp_v4_err() and
sctp_rcv(), both which run on the rx BH, and it has been protected
by rcu_read_lock [see ip_local_deliver_finish() / ipv6_rcv()].
So we can move it to sctp_lookup_association, only let
sctp_lookup_association use rcu_read_lock.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A previous commit (33f72e6) added notification via netlink for tunnels
when created/modified/deleted. If the notification returned an error,
this error was returned from the tunnel function. If there were no
listeners, the error code ESRCH was returned, even though having no
listeners is not an error. Other calls to this and other similar
notification functions either ignore the error code, or filter ESRCH.
This patch checks for ESRCH and does not flag this as an error.
Reviewed-by: Hamish Martin <hamish.martin@alliedtelesis.co.nz>
Signed-off-by: Mark Tomlinson <mark.tomlinson@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 3ed80a6 (cgroup: drop module support) made including
module.h redundant in the net cgroup controllers,
netclassid_cgroup.c and netprio_cgroup.c. This patch
removes them.
Signed-off-by: Rami Rosen <rami.rosen@intel.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
On Mon, 15 Feb 2016, Trond Myklebust wrote:
> Hi Scott,
>
> On Mon, Feb 15, 2016 at 2:28 PM, Scott Mayhew <smayhew@redhat.com> wrote:
> > md5 is disabled in fips mode, and attempting to import a gss context
> > using md5 while in fips mode will result in crypto_alg_mod_lookup()
> > returning -ENOENT, which will make its way back up to
> > gss_pipe_downcall(), where the BUG() is triggered. Handling the -ENOENT
> > allows for a more graceful failure.
> >
> > Signed-off-by: Scott Mayhew <smayhew@redhat.com>
> > ---
> > net/sunrpc/auth_gss/auth_gss.c | 3 +++
> > 1 file changed, 3 insertions(+)
> >
> > diff --git a/net/sunrpc/auth_gss/auth_gss.c b/net/sunrpc/auth_gss/auth_gss.c
> > index 799e65b..c30fc3b 100644
> > --- a/net/sunrpc/auth_gss/auth_gss.c
> > +++ b/net/sunrpc/auth_gss/auth_gss.c
> > @@ -737,6 +737,9 @@ gss_pipe_downcall(struct file *filp, const char __user *src, size_t mlen)
> > case -ENOSYS:
> > gss_msg->msg.errno = -EAGAIN;
> > break;
> > + case -ENOENT:
> > + gss_msg->msg.errno = -EPROTONOSUPPORT;
> > + break;
> > default:
> > printk(KERN_CRIT "%s: bad return from "
> > "gss_fill_context: %zd\n", __func__, err);
> > --
> > 2.4.3
> >
>
> Well debugged, but I unfortunately do have to ask if this patch is
> sufficient? In addition to -ENOENT, and -ENOMEM, it looks to me as if
> crypto_alg_mod_lookup() can also fail with -EINTR, -ETIMEDOUT, and
> -EAGAIN. Don't we also want to handle those?
You're right, I was focusing on the panic that I could easily reproduce.
I'm still not sure how I could trigger those other conditions.
>
> In fact, peering into the rats nest that is
> gss_import_sec_context_kerberos(), it looks as if that is just a tiny
> subset of all the errors that we might run into. Perhaps the right
> thing to do here is to get rid of the BUG() (but keep the above
> printk) and just return a generic error?
That sounds fine to me -- updated patch attached.
-Scott
>From d54c6b64a107a90a38cab97577de05f9a4625052 Mon Sep 17 00:00:00 2001
From: Scott Mayhew <smayhew@redhat.com>
Date: Mon, 15 Feb 2016 15:12:19 -0500
Subject: [PATCH] auth_gss: remove the BUG() from gss_pipe_downcall()
Instead return a generic error via gss_msg->msg.errno. None of the
errors returned by gss_fill_context() should necessarily trigger a
kernel panic.
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Some NFSv4.1 OPEN requests were hanging waiting for the NFS server
to finish recalling delegations. Turns out that each NFSv4.1 CB
request on RDMA gets a GARBAGE_ARGS reply from the Linux client.
Commit 756b9b37cf added a line in bc_svc_process that
overwrites the incoming rq_rcv_buf's length with the value in
rq_private_buf.len. But rpcrdma_bc_receive_call() does not invoke
xprt_complete_bc_request(), thus rq_private_buf.len is not
initialized. svc_process_common() is invoked with a zero-length
RPC message, and fails.
Fixes: 756b9b37cf ('SUNRPC: Fix callback channel')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Its useful to turn off the qdisc offload feature at a per device
level. This gives us a big hammer to enable/disable offloading.
More fine grained control (i.e. per rule) may be supported later.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows netdev drivers to consume cls_u32 offloads via
the ndo_setup_tc ndo op.
This works aligns with how network drivers have been doing qdisc
offloads for mqprio.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch updates setup_tc so we can pass additional parameters into
the ndo op in a generic way. To do this we provide structured union
and type flag.
This lets each classifier and qdisc provide its own set of attributes
without having to add new ndo ops or grow the signature of the
callback.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ndo_setup_tc() op was added to support drivers offloading tx
qdiscs however only support for mqprio was ever added. So we
only ever added support for passing the number of traffic classes
to the driver.
This patch generalizes the ndo_setup_tc op so that a handle can
be provided to indicate if the offload is for ingress or egress
or potentially even child qdiscs.
CC: Murali Karicheri <m-karicheri2@ti.com>
CC: Shradha Shah <sshah@solarflare.com>
CC: Or Gerlitz <ogerlitz@mellanox.com>
CC: Ariel Elior <ariel.elior@qlogic.com>
CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
CC: Bruce Allan <bruce.w.allan@intel.com>
CC: Jesse Brandeburg <jesse.brandeburg@intel.com>
CC: Don Skidmore <donald.c.skidmore@intel.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that all the ip fragmentation related sysctls are namespaceified
there is no reason to hide them anymore from "root" users inside
containers.
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When igmp related sysctl were namespacified their initializatin was
erroneously put into the tcp socket namespace constructor. This
patch moves the relevant code into the igmp namespace constructor to
keep things consistent.
Also sprinkle some #ifdefs to silence warnings
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcpi_min_rtt reports the minimal rtt observed by TCP stack for the flow,
in usec unit. Might be ~0U if not yet known.
tcpi_notsent_bytes reports the amount of bytes in the write queue that
were not yet sent.
This is done in a single patch to not add a temporary 32bit padding hole
in tcp_info.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If tcp_v4_inbound_md5_hash() returns an error, we must release
the refcount on the request socket, not on the listener.
The bug was added for IPv4 only.
Fixes: 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case of UDP traffic with datagram length below MTU this
gives about 4% performance increase
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Suggested-and-Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case of UDP traffic with datagram length
below MTU this give about 2% performance increase
when tunneling over ipv4 and about 60% when tunneling
over ipv6
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Suggested-and-acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current ip_tunnel cache implementation is prone to a race
that will cause the wrong dst to be cached on cuncurrent dst cache
miss and ip tunnel update via netlink.
Replacing with the generic implementation fix the issue.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Suggested-and-acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This also fix a potential race into the existing tunnel code, which
could lead to the wrong dst to be permanenty cached:
CPU1: CPU2:
<xmit on ip6_tunnel>
<cache lookup fails>
dst = ip6_route_output(...)
<tunnel params are changed via nl>
dst_cache_reset() // no effect,
// the cache is empty
dst_cache_set() // the wrong dst
// is permanenty stored
// into the cache
With the new dst implementation the above race is not possible
since the first cache lookup after dst_cache_reset will fail due
to the timestamp check
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Suggested-and-acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch add a generic, lockless dst cache implementation.
The need for lock is avoided updating the dst cache fields
only in per cpu scope, and requiring that the cache manipulation
functions are invoked with the local bh disabled.
The refresh_ts and reset_ts fields are used to ensure the cache
consistency in case of cuncurrent cache update (dst_cache_set*) and
reset operation (dst_cache_reset).
Consider the following scenario:
CPU1: CPU2:
<cache lookup with emtpy cache: it fails>
<get dst via uncached route lookup>
<related configuration changes>
dst_cache_reset()
dst_cache_set()
The dst entry set passed to dst_cache_set() should not be used
for later dst cache lookup, because it's obtained using old
configuration values.
Since the refresh_ts is updated only on dst_cache lookup, the
cached value in the above scenario will be discarded on the next
lookup.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Suggested-and-acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are some cases where rtt_us derives from deltas of jiffies,
instead of using usec timestamps.
Since we want to track minimal rtt, better to assume a delta of 0 jiffie
might be in fact be very close to 1 jiffie.
It is kind of sad jiffies_to_usecs(1) calls a function instead of simply
using a constant.
Fixes: f672258391 ("tcp: track min RTT using windowed min-filter")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The phy has not been initialized, disconnecting it in the error
path results in a NULL pointer exception. Drop the phy_disconnect
from the error path.
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Neil Armstrong <narmstrong@baylibre.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor tipc_node_xmit() to fail fast and fail early. Fix several
potential memory leaks in unexpected error paths.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 5266698661 ("tipc: let broadcast packet reception
use new link receive function") we introduced a new per-node
broadcast reception link instance. This link is created at the
moment the node itself is created. Unfortunately, the allocation
is done after the node instance has already been added to the node
lookup hash table. This creates a potential race condition, where
arriving broadcast packets are able to find and access the node
before it has been fully initialized, and before the above mentioned
link has been created. The result is occasional crashes in the function
tipc_bcast_rcv(), which is trying to access the not-yet existing link.
We fix this by deferring the addition of the node instance until after
it has been fully initialized in the function tipc_node_create().
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A recent change to the mdb code confused the compiler to the point
where it did not realize that the port-group returned from
br_mdb_add_group() is always valid when the function returns a nonzero
return value, so we get a spurious warning:
net/bridge/br_mdb.c: In function 'br_mdb_add':
net/bridge/br_mdb.c:542:4: error: 'pg' may be used uninitialized in this function [-Werror=maybe-uninitialized]
__br_mdb_notify(dev, entry, RTM_NEWMDB, pg);
Slightly rearranging the code in br_mdb_add_group() makes the problem
go away, as gcc is clever enough to see that both functions check
for 'ret != 0'.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 9e8430f8d6 ("bridge: mdb: Passing the port-group pointer to br_mdb module")
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch corrects the unaligned accesses seen on GRE TEB tunnels when
generating hash keys. Specifically what this patch does is make it so that
we force the use of skb_copy_bits when the GRE inner headers will be
unaligned due to NET_IP_ALIGNED being a non-zero value.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a sanity check to ensure that all requested channel sizes are within
bounds, which should reduce errors in driver implementation.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ethernet drivers implementing both {GS}RXFH and {GS}CHANNELS ethtool ops
incorrectly allow SCHANNELS when it would conflict with the settings
from SRXFH. This occurs because it is not possible for drivers to
understand whether their Rx flow indirection table has been configured
or is in the default state. In addition, drivers currently behave in
various ways when increasing the number of Rx channels.
Some drivers will always destroy the Rx flow indirection table when this
occurs, whether it has been set by the user or not. Other drivers will
attempt to preserve the table even if the user has never modified it
from the default driver settings. Neither of these situation is
desirable because it leads to unexpected behavior or loss of user
configuration.
The correct behavior is to simply return -EINVAL when SCHANNELS would
conflict with the current Rx flow table settings. However, it should
only do so if the current settings were modified by the user. If we
required that the new settings never conflict with the current (default)
Rx flow settings, we would force users to first reduce their Rx flow
settings and then reduce the number of Rx channels.
This patch proposes a solution implemented in net/core/ethtool.c which
ensures that all drivers behave correctly. It checks whether the RXFH
table has been configured to non-default settings, and stores this
information in a private netdev flag. When the number of channels is
requested to change, it first ensures that the current Rx flow table is
not going to assign flows to now disabled channels.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contain a rather large batch for your net that
includes accumulated bugfixes, they are:
1) Run conntrack cleanup from workqueue process context to avoid hitting
soft lockup via watchdog for large tables. This is required by the
IPv6 masquerading extension. From Florian Westphal.
2) Use original skbuff from nfnetlink batch when calling netlink_ack()
on error since this needs to access the skb->sk pointer.
3) Incremental fix on top of recent Sasha Levin's lock fix for conntrack
resizing.
4) Fix several problems in nfnetlink batch message header sanitization
and error handling, from Phil Turnbull.
5) Select NF_DUP_IPV6 based on CONFIG_IPV6, from Arnd Bergmann.
6) Fix wrong signess in return values on nf_tables counter expression,
from Anton Protopopov.
Due to the NetDev 1.1 organization burden, I had no chance to pass up
this to you any sooner in this release cycle, sorry about that.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The unix_dgram_sendmsg routine use the following test
if (unlikely(unix_peer(other) != sk && unix_recvq_full(other))) {
to determine if sk and other are in an n:1 association (either
established via connect or by using sendto to send messages to an
unrelated socket identified by address). This isn't correct as the
specified address could have been bound to the sending socket itself or
because this socket could have been connected to itself by the time of
the unix_peer_get but disconnected before the unix_state_lock(other). In
both cases, the if-block would be entered despite other == sk which
might either block the sender unintentionally or lead to trying to unlock
the same spin lock twice for a non-blocking send. Add a other != sk
check to guard against this.
Fixes: 7d267278a9 ("unix: avoid use-after-free in ep_remove_wait_queue")
Reported-By: Philipp Hahn <pmhahn@pmhahn.de>
Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Tested-by: Philipp Hahn <pmhahn@pmhahn.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The present unix_stream_read_generic contains various code sequences of
the form
err = -EDISASTER;
if (<test>)
goto out;
This has the unfortunate side effect of possibly causing the error code
to bleed through to the final
out:
return copied ? : err;
and then to be wrongly returned if no data was copied because the caller
didn't supply a data buffer, as demonstrated by the program available at
http://pad.lv/1540731
Change it such that err is only set if an error condition was detected.
Fixes: 3822b5c2fc ("af_unix: Revert 'lock_interruptible' in stream receive code")
Reported-by: Joseph Salisbury <joseph.salisbury@canonical.com>
Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
batman-adv checks in different situation if a new device is already on top
of a different batman-adv device. This is done by getting the iflink of a
device and all its parent. It assumes that this iflink is always a parent
device in an acyclic graph. But this assumption is broken by devices like
veth which are actually a pair of two devices linked to each other. The
recursive check would therefore get veth0 when calling dev_get_iflink on
veth1. And it gets veth0 when calling dev_get_iflink with veth1.
Creating a veth pair and loading batman-adv freezes parts of the system
ip link add veth0 type veth peer name veth1
modprobe batman-adv
An RCU stall will be detected on the system which cannot be fixed.
INFO: rcu_sched self-detected stall on CPU
1: (5264 ticks this GP) idle=3e9/140000000000001/0
softirq=144683/144686 fqs=5249
(t=5250 jiffies g=46 c=45 q=43)
Task dump for CPU 1:
insmod R running task 0 247 245 0x00000008
ffffffff8151f140 ffffffff8107888e ffff88000fd141c0 ffffffff8151f140
0000000000000000 ffffffff81552df0 ffffffff8107b420 0000000000000001
ffff88000e3fa700 ffffffff81540b00 ffffffff8107d667 0000000000000001
Call Trace:
<IRQ> [<ffffffff8107888e>] ? rcu_dump_cpu_stacks+0x7e/0xd0
[<ffffffff8107b420>] ? rcu_check_callbacks+0x3f0/0x6b0
[<ffffffff8107d667>] ? hrtimer_run_queues+0x47/0x180
[<ffffffff8107cf9d>] ? update_process_times+0x2d/0x50
[<ffffffff810873fb>] ? tick_handle_periodic+0x1b/0x60
[<ffffffff810290ae>] ? smp_trace_apic_timer_interrupt+0x5e/0x90
[<ffffffff813bbae2>] ? apic_timer_interrupt+0x82/0x90
<EOI> [<ffffffff812c3fd7>] ? __dev_get_by_index+0x37/0x40
[<ffffffffa0031f3e>] ? batadv_hard_if_event+0xee/0x3a0 [batman_adv]
[<ffffffff812c5801>] ? register_netdevice_notifier+0x81/0x1a0
[...]
This can be avoided by checking if two devices are each others parent and
stopping the check in this situation.
Fixes: b7eddd0b39 ("batman-adv: prevent using any virtual device created on batman-adv as hard-interface")
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
[sven@narfation.org: rewritten description, extracted fix]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The batadv_orig_node_vlan reference counter in batadv_tt_global_size_mod
can only be reduced when the list entry was actually removed. Otherwise the
reference counter may reach zero when batadv_tt_global_size_mod is called
from two different contexts for the same orig_node_vlan but only one
context is actually removing the entry from the list.
The release function for this orig_node_vlan is not called inside the
vlan_list_lock spinlock protected region because the function
batadv_tt_global_size_mod still holds a orig_node_vlan reference for the
object pointer on the stack. Thus the actual release function (when
required) will be called only at the end of the function.
Fixes: 7ea7b4a142 ("batman-adv: make the TT CRC logic VLAN specific")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The batadv_gw_node reference counter in batadv_gw_node_update can only be
reduced when the list entry was actually removed. Otherwise the reference
counter may reach zero when batadv_gw_node_update is called from two
different contexts for the same gw_node but only one context is actually
removing the entry from the list.
The release function for this gw_node is not called inside the list_lock
spinlock protected region because the function batadv_gw_node_update still
holds a gw_node reference for the object pointer on the stack. Thus the
actual release function (when required) will be called only at the end of
the function.
Fixes: bd3524c14b ("batman-adv: remove obsolete deleted attribute for gateway node")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
We will soon modify the vanilla get_user_pages() so it can no
longer be used on mm/tasks other than 'current/current->mm',
which is by far the most common way it is called. For now,
we allow the old-style calls, but warn when they are used.
(implemented in previous patch)
This patch switches all callers of:
get_user_pages()
get_user_pages_unlocked()
get_user_pages_locked()
to stop passing tsk/mm so they will no longer see the warnings.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: jack@suse.cz
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210156.113E9407@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch fix spelling typos found in printk and Kconfig.
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Dmitry reported memory leaks of IP options allocated in
ip_cmsg_send() when/if this function returns an error.
Callers are responsible for the freeing.
Many thanks to Dmitry for the report and diagnostic.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
All users now pass false, so we can remove it, and remove the code that
was conditional upon it.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the dst device doesn't support it, it'll get fixed up later anyway
by validate_xmit_skb(). Also, this allows us to take advantage of LCO
to avoid summing the payload multiple times.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The arithmetic properties of the ones-complement checksum mean that a
correctly checksummed inner packet, including its checksum, has a ones
complement sum depending only on whatever value was used to initialise
the checksum field before checksumming (in the case of TCP and UDP,
this is the ones complement sum of the pseudo header, complemented).
Consequently, if we are going to offload the inner checksum with
CHECKSUM_PARTIAL, we can compute the outer checksum based only on the
packed data not covered by the inner checksum, and the initial value of
the inner checksum field.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement strategy used in __inet_hash_connect() in opposite way :
Try to find a candidate using odd ports, then fallback to even ports.
We no longer disable BH for whole traversal, but one bucket at a time.
We also use cond_resched() to yield cpu to other tasks if needed.
I removed one indentation level and tried to mirror the loop we have
in __inet_hash_connect() and variable names to ease code maintenance.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 07f4c90062 ("tcp/dccp: try to not exhaust ip_local_port_range
in connect()"), I added a very simple heuristic, so that we got better
chances to use even ports, and allow bind() users to have more available
slots.
It gave nice results, but with more than 200,000 TCP sessions on a typical
server, the ~30,000 ephemeral ports are still a rare resource.
I chose to go a step further, by looking at all even ports, and if none
was available, fallback to odd ports.
The companion patch does the same in bind(), but in opposite way.
I've seen exec times of up to 30ms on busy servers, so I no longer
disable BH for the whole traversal, but only for each hash bucket.
I also call cond_resched() to be gentle to other tasks.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Fix BPF handling of branch offset adjustmnets on backjumps, from
Daniel Borkmann.
2) Make sure selinux knows about SOCK_DESTROY netlink messages, from
Lorenzo Colitti.
3) Fix openvswitch tunnel mtu regression, from David Wragg.
4) Fix ICMP handling of TCP sockets in syn_recv state, from Eric
Dumazet.
5) Fix SCTP user hmacid byte ordering bug, from Xin Long.
6) Fix recursive locking in ipv6 addrconf, from Subash Abhinov
Kasiviswanathan.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
bpf: fix branch offset adjustment on backjumps after patching ctx expansion
vxlan, gre, geneve: Set a large MTU on ovs-created tunnel devices
geneve: Relax MTU constraints
vxlan: Relax MTU constraints
flow_dissector: Fix unaligned access in __skb_flow_dissector when used by eth_get_headlen
of: of_mdio: Add marvell, 88e1145 to whitelist of PHY compatibilities.
selinux: nlmsgtab: add SOCK_DESTROY to the netlink mapping tables
sctp: translate network order to host order when users get a hmacid
enic: increment devcmd2 result ring in case of timeout
tg3: Fix for tg3 transmit queue 0 timed out when too many gso_segs
net:Add sysctl_max_skb_frags
tcp: do not drop syn_recv on all icmp reports
ipv6: fix a lockdep splat
unix: correctly track in-flight fds in sending process user_struct
update be2net maintainers' email addresses
dwc_eth_qos: Reset hardware before PHY start
ipv6: addrconf: Fix recursive spin lock call
The network stack defers SKBs free, in-case free happens in IRQ or
when IRQs are disabled. This happens in __dev_kfree_skb_irq() that
writes SKBs that were free'ed during IRQ to the softirq completion
queue (softnet_data.completion_queue).
These SKBs are naturally delayed, and cleaned up during NET_TX_SOFTIRQ
in function net_tx_action(). Take advantage of this a use the skb
defer and flush API, as we are already in softirq context.
For modern drivers this rarely happens. Although most drivers do call
dev_kfree_skb_any(), which detects the situation and calls
__dev_kfree_skb_irq() when needed. This due to netpoll can call from
IRQ context.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Discovered that network stack were hitting the kmem_cache/SLUB
slowpath when freeing SKBs. Doing bulk free with kmem_cache_free_bulk
can speedup this slowpath.
NAPI context is a bit special, lets take advantage of that for bulk
free'ing SKBs.
In NAPI context we are running in softirq, which gives us certain
protection. A softirq can run on several CPUs at once. BUT the
important part is a softirq will never preempt another softirq running
on the same CPU. This gives us the opportunity to access per-cpu
variables in softirq context.
Extend napi_alloc_cache (before only contained page_frag_cache) to be
a struct with a small array based stack for holding SKBs. Introduce a
SKB defer and flush API for accessing this.
Introduce napi_consume_skb() as replacement for e.g. dev_consume_skb_any()
when running in NAPI context. A small trick to handle/detect if we
are called from netpoll is to see if budget is 0. In that case, we
need to invoke dev_consume_skb_irq().
Joint work with Alexander Duyck.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This was initially introduced in df2cf4a78e ("IGMP: Inhibit
reports for local multicast groups") by defining the sysctl in the
ipv4_net_table array, however it was never implemented to be
namespace aware. Fix this by changing the code accordingly.
Signed-off-by: David S. Miller <davem@davemloft.net>
Operations with the GENL_ADMIN_PERM flag fail permissions checks because
this flag means we call netlink_capable, which uses the init user ns.
Instead, let's introduce a new flag, GENL_UNS_ADMIN_PERM for operations
which should be allowed inside a user namespace.
The motivation for this is to be able to run openvswitch in unprivileged
containers. I've tested this and it seems to work, but I really have no
idea about the security consequences of this patch, so thoughts would be
much appreciated.
v2: use the GENL_UNS_ADMIN_PERM flag instead of a check in each function
v3: use separate ifs for UNS_ADMIN_PERM and ADMIN_PERM, instead of one
massive one
Reported-by: James Page <james.page@canonical.com>
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
CC: Eric Biederman <ebiederm@xmission.com>
CC: Pravin Shelar <pshelar@ovn.org>
CC: Justin Pettit <jpettit@nicira.com>
CC: "David S. Miller" <davem@davemloft.net>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch enables us to use inner checksum offloads if provided by
hardware with outer checksums computed by software.
It basically reduces encap_hdr_csum to an advisory flag for now, but based
on the fact that SCTP may be getting segmentation support before long I
thought we may want to keep it as it is possible we may need to support
CRC32c and 1's compliment checksum in the same packet at some point in the
future.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The segmentation code was having to do a bunch of work to pull the
skb->len and strip the udp header offset before the value could be used to
adjust the checksum. Instead of doing all this work we can just use the
value that goes into uh->len since that is the correct value with the
correct byte order that we need anyway. By using this value we can save
ourselves a bunch of pain as there is no need to do multiple byte swaps.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch goes though and cleans up the logic related to several of the
control flags used in UDP segmentation. Specifically the use of dont_encap
isn't really needed as we can just check the skb for CHECKSUM_PARTIAL and
if it isn't set then we don't need to update the internal headers. As such
we can just drop that value.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of parsing headers to determine the inner protocol we can just pull
the value from inner_proto.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch updates the gre checksum path to follow something much closer to
the UDP checksum path. By doing this we can avoid needing to do as much
header inspection and can just make use of the fields we were already
reading in the sk_buff structure.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The call skb_has_shared_frag is used in the GRE path and skb_checksum_help
to verify that no frags can be modified by an external entity. This check
really doesn't belong in the GRE path but in the skb_segment function
itself. This way any protocol that might be segmented will be performing
this check before attempting to offload a checksum to software.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes it so that we can offload the checksums for a packet up
to a certain point and then begin computing the checksums via software.
Setting this up is fairly straight forward as all we need to do is reset
the values stored in csum and csum_start for the GSO context block.
One complication for this is remote checksum offload. In order to allow
the inner checksums to be offloaded while computing the outer checksum
manually we needed to have some way of indicating that the offload wasn't
real. In order to do that I replaced CHECKSUM_PARTIAL with
CHECKSUM_UNNECESSARY in the case of us computing checksums for the outer
header while skipping computing checksums for the inner headers. We clean
up the ip_summed flag and set it to either CHECKSUM_PARTIAL or
CHECKSUM_NONE once we hand the packet off to the next lower level.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch addresses two main issues.
First in the case of remote checksum offload we were avoiding dealing with
scatter-gather issues. As a result it would be possible to assemble a
series of frames that used frags instead of being linearized as they should
have if remote checksum offload was enabled.
Second I have updated the code so that we now let GSO take care of doing
the checksum on the data itself and drop the special case that was added
for remote checksum offload.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch moves the checksum maintained by GSO out of skb->csum and into
the GSO context block in order to allow for us to work on outer checksums
while maintaining the inner checksum offsets in the case of the inner
checksum being offloaded, while the outer checksums will be computed.
While updating the code I also did a minor cleanu-up on gso_make_checksum.
The change is mostly to make it so that we store the values and compute the
checksum instead of computing the checksum and then storing the values we
needed to update.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The enc_features variable isn't necessary since features isn't used
anywhere after we create enc_features so instead just use a destructive AND
on features itself and save ourselves the variable declaration.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In certain 802.11 wireless deployments, there will be NA proxies
that use knowledge of the network to correctly answer requests.
To prevent unsolicitd advertisements on the shared medium from
being a problem, on such deployments wireless needs to drop them.
Enable this by providing an option called "drop_unsolicited_na".
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to solve a problem with 802.11, the so-called hole-196 attack,
add an option (sysctl) called "drop_unicast_in_l2_multicast" which, if
enabled, causes the stack to drop IPv6 unicast packets encapsulated in
link-layer multi- or broadcast frames. Such frames can (as an attack)
be created by any member of the same wireless network and transmitted
as valid encrypted frames since the symmetric key for broadcast frames
is shared between all stations.
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In certain 802.11 wireless deployments, there will be ARP proxies
that use knowledge of the network to correctly answer requests.
To prevent gratuitous ARP frames on the shared medium from being
a problem, on such deployments wireless needs to drop them.
Enable this by providing an option called "drop_gratuitous_arp".
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to solve a problem with 802.11, the so-called hole-196 attack,
add an option (sysctl) called "drop_unicast_in_l2_multicast" which, if
enabled, causes the stack to drop IPv4 unicast packets encapsulated in
link-layer multi- or broadcast frames. Such frames can (as an attack)
be created by any member of the same wireless network and transmitted
as valid encrypted frames since the symmetric key for broadcast frames
is shared between all stations.
Additionally, enabling this option provides compliance with a SHOULD
clause of RFC 1122.
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for filtering link dumps by master device and kind, similar
to the filtering implemented for neighbor dumps.
Each net_device that exists adds between 1196 bytes (eth) and 1556 bytes
(bridge) to the link dump. As the number of interfaces increases so does
the amount of data pushed to user space for a link list. If the user
only wants to see a list of specific devices (e.g., interfaces enslaved
to a specific bridge or a list of VRFs) most of that data is thrown away.
Passing the filters to the kernel to have only relevant data returned
makes the dump more efficient.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change extends the fast SO_REUSEPORT socket lookup implemented
for UDP to TCP. Listener sockets with SO_REUSEPORT and the same
receive address are additionally added to an array for faster
random access. This means that only a single socket from the group
must be found in the listener list before any socket in the group can
be used to receive a packet. Previously, every socket in the group
needed to be considered before handing off the incoming packet.
This feature also exposes the ability to use a BPF program when
selecting a socket from a reuseport group.
Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Both of the lines in this patch probably should have been included
in the initial implementation of this code for generic socket
support, but weren't technically necessary since only UDP sockets
were supported.
First, the sk_reuseport_cb points to a structure which assumes
each socket in the group has this pointer assigned at the same
time it's added to the array in the structure. The sk_clone_lock
function breaks this assumption. Since a child socket shouldn't
implicitly be in a reuseport group, the simple fix is to clear
the field in the clone.
Second, the SO_ATTACH_REUSEPORT_xBPF socket options require that
SO_REUSEPORT also be set first. For UDP sockets, this is easily
enforced at bind-time since that process both puts the socket in
the appropriate receive hlist and updates the reuseport structures.
Since these operations can happen at two different times for TCP
sockets (bind and listen) it must be explicitly checked to enforce
the use of SO_REUSEPORT with SO_ATTACH_REUSEPORT_xBPF in the
setsockopt call.
Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a preliminary step to allow fast socket lookup of SO_REUSEPORT
groups. Doing so with a BPF filter will require access to the
skb in question. This change plumbs the skb (and offset to payload
data) through the call stack to the listening socket lookup
implementations where it will be used in a following patch.
Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to support fast lookups for TCP sockets with SO_REUSEPORT,
the function that adds sockets to the listening hash set needs
to be able to check receive address equality. Since this equality
check is different for IPv4 and IPv6, we will need two different
socket hashing functions.
This patch adds inet6_hash identical to the existing inet_hash function
and updates the appropriate references. A following patch will
differentiate the two by passing different comparison functions to
__inet_hash.
Additionally, in order to use the IPv6 address equality function from
inet6_hashtables (which is compiled as a built-in object when IPv6 is
enabled) it also needs to be in a built-in object file as well. This
moves ipv6_rcv_saddr_equal into inet_hashtables to accomplish this.
Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to support fast reuseport lookups in TCP, the hash function
defined in struct proto must be capable of returning an error code.
This patch changes the function signature of all related hash functions
to return an integer and handles or propagates this return value at
all call sites.
Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
batman-adv uses a self-written reference implementation which is just based
on atomic_t. This is less obvious when reading the code than kref and
therefore increases the change that the reference counting will be missed.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
batman-adv uses a self-written reference implementation which is just based
on atomic_t. This is less obvious when reading the code than kref and
therefore increases the change that the reference counting will be missed.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
batman-adv uses a self-written reference implementation which is just based
on atomic_t. This is less obvious when reading the code than kref and
therefore increases the change that the reference counting will be missed.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
batman-adv uses a self-written reference implementation which is just based
on atomic_t. This is less obvious when reading the code than kref and
therefore increases the change that the reference counting will be missed.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
batman-adv uses a self-written reference implementation which is just based
on atomic_t. This is less obvious when reading the code than kref and
therefore increases the change that the reference counting will be missed.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
batman-adv uses a self-written reference implementation which is just based
on atomic_t. This is less obvious when reading the code than kref and
therefore increases the change that the reference counting will be missed.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
batman-adv uses a self-written reference implementation which is just based
on atomic_t. This is less obvious when reading the code than kref and
therefore increases the change that the reference counting will be missed.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
batman-adv uses a self-written reference implementation which is just based
on atomic_t. This is less obvious when reading the code than kref and
therefore increases the change that the reference counting will be missed.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
batman-adv uses a self-written reference implementation which is just based
on atomic_t. This is less obvious when reading the code than kref and
therefore increases the change that the reference counting will be missed.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
batman-adv uses a self-written reference implementation which is just based
on atomic_t. This is less obvious when reading the code than kref and
therefore increases the change that the reference counting will be missed.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
batman-adv uses a self-written reference implementation which is just based
on atomic_t. This is less obvious when reading the code than kref and
therefore increases the change that the reference counting will be missed.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
batman-adv uses a self-written reference implementation which is just based
on atomic_t. This is less obvious when reading the code than kref and
therefore increases the change that the reference counting will be missed.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
batman-adv uses a self-written reference implementation which is just based
on atomic_t. This is less obvious when reading the code than kref and
therefore increases the change that the reference counting will be missed.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
batman-adv uses a self-written reference implementation which is just based
on atomic_t. This is less obvious when reading the code than kref and
therefore increases the change that the reference counting will be missed.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
batman-adv uses a self-written reference implementation which is just based
on atomic_t. This is less obvious when reading the code than kref and
therefore increases the change that the reference counting will be missed.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
batman-adv uses a self-written reference implementation which is just based
on atomic_t. This is less obvious when reading the code than kref and
therefore increases the change that the reference counting will be missed.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
batman-adv uses a self-written reference implementation which is just based
on atomic_t. This is less obvious when reading the code than kref and
therefore increases the change that the reference counting will be missed.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
batman-adv uses a self-written reference implementation which is just based
on atomic_t. This is less obvious when reading the code than kref and
therefore increases the change that the reference counting will be missed.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The batadv_tvlv_container* functions state in their kernel-doc that they
require tvlv.container_list_lock. Add an assert to automatically detect
when this might have been ignored by the caller.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
To allow future use of the window protected function with different
maximum sequence numbers, add a parameter to set this value which
was previously hardcoded. Another parameter added for future use is a
flag to return whether the protection window has started.
While at it, also fix the kerneldoc.
Signed-off-by: Simon Wunderlich <simon@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The references to the network device should be dropped inside the release
function for batadv_hard_iface similar to what is done with the batman-adv
internal datastructures.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Prior to 4.3, openvswitch tunnel vports (vxlan, gre and geneve) could
transmit vxlan packets of any size, constrained only by the ability to
send out the resulting packets. 4.3 introduced netdevs corresponding
to tunnel vports. These netdevs have an MTU, which limits the size of
a packet that can be successfully encapsulated. The default MTU
values are low (1500 or less), which is awkwardly small in the context
of physical networks supporting jumbo frames, and leads to a
conspicuous change in behaviour for userspace.
Instead, set the MTU on openvswitch-created netdevs to be the relevant
maximum (i.e. the maximum IP packet size minus any relevant overhead),
effectively restoring the behaviour prior to 4.3.
Signed-off-by: David Wragg <david@weave.works>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes an issue with unaligned accesses when using
eth_get_headlen on a page that was DMA aligned instead of being IP aligned.
The fact is when trying to check the length we don't need to be looking at
the flow label so we can reorder the checks to first check if we are
supposed to gather the flow label and then make the call to actually get
it.
v2: Updated path so that either STOP_AT_FLOW_LABEL or KEY_FLOW_LABEL can
cause us to check for the flow label.
Reported-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Support socket option PACKET_VNET_HDR together with PACKET_TX_RING.
When enabled, a struct virtio_net_hdr is expected to precede the data
in the ring. The vnet option must be set before the ring is created.
The implementation reuses the existing skb_copy_bits code that is used
when dev->hard_header_len is non-zero. Move this ll_header check to
before the skb alloc and combine it with a test for vnet_hdr->hdr_len.
Allocate and copy the max of the two.
Verified with test program at
github.com/wdebruij/kerneltools/blob/master/tests/psock_txring_vnet.c
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
GSO packet headers must be stored in the linear skb segment.
Move tpacket header parsing before sock_alloc_send_skb. The GSO
follow-on patch will later increase the skb linear argument to
sock_alloc_send_skb if needed for large packets.
The header parsing code does not require an allocated skb, so is
safe to move. Later pass to tpacket_fill_skb the computed data
start and length.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Support socket option PACKET_VNET_HDR together with PACKET_RX_RING.
When enabled, a struct virtio_net_hdr will precede the data in the
packet ring slots.
Verified with test program at
github.com/wdebruij/kerneltools/blob/master/tests/psock_rxring_vnet.c
pkt: 1454269209.798420 len=5066
vnet: gso_type=tcpv4 gso_size=1448 hlen=66 ecn=off
csum: start=34 off=16
eth: proto=0x800
ip: src=<masked> dst=<masked> proto=6 len=5052
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
packet_snd and packet_rcv support virtio net headers for GSO.
Move this logic into helper functions to be able to reuse it in
tpacket_snd and tpacket_rcv.
This is a straighforward code move with one exception. Instead of
creating and passing a separate gso_type variable, reuse
vnet_hdr.gso_type after conversion from virtio to kernel gso type.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit ed5a377d87 ("sctp: translate host order to network order when
setting a hmacid") corrected the hmacid byte-order when setting a hmacid.
but the same issue also exists on getting a hmacid.
We fix it by changing hmacids to host order when users get them with
getsockopt.
Fixes: Commit ed5a377d87 ("sctp: translate host order to network order when setting a hmacid")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Passing the port-group to br_mdb in order to allow direct access to the
structure. br_mdb will later use the structure to reflect HW reflection
status via "state" variable.
Signed-off-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change net_bridge_port_group 'state' member to 'flags' and define new set
of flags internal to the kernel.
Signed-off-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Devices may have limits on the number of fragments in an skb they support.
Current codebase uses a constant as maximum for number of fragments one
skb can hold and use.
When enabling scatter/gather and running traffic with many small messages
the codebase uses the maximum number of fragments and may thereby violate
the max for certain devices.
The patch introduces a global variable as max number of fragments.
Signed-off-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Novopashenniy reported that ICMP redirects on SYN_RECV sockets
were leading to RST.
This is of course incorrect.
A specific list of ICMP messages should be able to drop a SYN_RECV.
For instance, a REDIRECT on SYN_RECV shall be ignored, as we do
not hold a dst per SYN_RECV pseudo request.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=111751
Fixes: 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
Reported-by: Petr Novopashenniy <pety@rusnet.ru>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Silence lockdep false positive about rcu_dereference() being
used in the wrong context.
First one should use rcu_dereference_protected() as we own the spinlock.
Second one should be a normal assignation, as no barrier is needed.
Fixes: 18367681a1 ("ipv6 flowlabel: Convert np->ipv6_fl_list to RCU.")
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit referenced in the Fixes tag incorrectly accounted the number
of in-flight fds over a unix domain socket to the original opener
of the file-descriptor. This allows another process to arbitrary
deplete the original file-openers resource limit for the maximum of
open files. Instead the sending processes and its struct cred should
be credited.
To do so, we add a reference counted struct user_struct pointer to the
scm_fp_list and use it to account for the number of inflight unix fds.
Fixes: 712f4aad40 ("unix: properly account for FDs passed over unix sockets")
Reported-by: David Herrmann <dh.herrmann@gmail.com>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The nft_counter_init() and nft_counter_clone() functions should return
negative error value -ENOMEM instead of positive ENOMEM.
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The NETFILTER_XT_TARGET_TEE option selects NF_DUP_IPV6 whenever
IP6_NF_IPTABLES is enabled, and it ensures that it cannot be
builtin itself if NF_CONNTRACK is a loadable module, as that
is a dependency for NF_DUP_IPV6.
However, NF_DUP_IPV6 can be enabled even if IP6_NF_IPTABLES is
turned off, and it only really depends on IPV6. With the current
check in tee_tg6, we call nf_dup_ipv6() whenever NF_DUP_IPV6
is enabled. This can however be a loadable module which is
unreachable from a built-in xt_TEE:
net/built-in.o: In function `tee_tg6':
:(.text+0x67728): undefined reference to `nf_dup_ipv6'
The bug was originally introduced in the split of the xt_TEE module
into separate modules for ipv4 and ipv6, and two patches tried
to fix it unsuccessfully afterwards.
This is a revert of the the first incorrect attempt to fix it,
going back to depending on IPV6 as the dependency, and we
adapt the 'select' condition accordingly.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: bbde9fc182 ("netfilter: factor out packet duplication for IPv4/IPv6")
Fixes: 116984a316 ("netfilter: xt_TEE: use IS_ENABLED(CONFIG_NF_DUP_IPV6)")
Fixes: 74ec4d55c4 ("netfilter: fix xt_TEE and xt_TPROXY dependencies")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If nlh->nlmsg_len is zero then an infinite loop is triggered because
'skb_pull(skb, msglen);' pulls zero bytes.
The calculation in nlmsg_len() underflows if 'nlh->nlmsg_len <
NLMSG_HDRLEN' which bypasses the length validation and will later
trigger an out-of-bound read.
If the length validation does fail then the malformed batch message is
copied back to userspace. However, we cannot do this because the
nlh->nlmsg_len can be invalid. This leads to an out-of-bounds read in
netlink_ack:
[ 41.455421] ==================================================================
[ 41.456431] BUG: KASAN: slab-out-of-bounds in memcpy+0x1d/0x40 at addr ffff880119e79340
[ 41.456431] Read of size 4294967280 by task a.out/987
[ 41.456431] =============================================================================
[ 41.456431] BUG kmalloc-512 (Not tainted): kasan: bad access detected
[ 41.456431] -----------------------------------------------------------------------------
...
[ 41.456431] Bytes b4 ffff880119e79310: 00 00 00 00 d5 03 00 00 b0 fb fe ff 00 00 00 00 ................
[ 41.456431] Object ffff880119e79320: 20 00 00 00 10 00 05 00 00 00 00 00 00 00 00 00 ...............
[ 41.456431] Object ffff880119e79330: 14 00 0a 00 01 03 fc 40 45 56 11 22 33 10 00 05 .......@EV."3...
[ 41.456431] Object ffff880119e79340: f0 ff ff ff 88 99 aa bb 00 14 00 0a 00 06 fe fb ................
^^ start of batch nlmsg with
nlmsg_len=4294967280
...
[ 41.456431] Memory state around the buggy address:
[ 41.456431] ffff880119e79400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 41.456431] ffff880119e79480: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 41.456431] >ffff880119e79500: 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc fc
[ 41.456431] ^
[ 41.456431] ffff880119e79580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 41.456431] ffff880119e79600: fc fc fc fc fc fc fc fc fc fc fb fb fb fb fb fb
[ 41.456431] ==================================================================
Fix this with better validation of nlh->nlmsg_len and by setting
NFNL_BATCH_FAILURE if any batch message fails length validation.
CAP_NET_ADMIN is required to trigger the bugs.
Fixes: 9ea2aa8b7d ("netfilter: nfnetlink: validate nfnetlink header from batch")
Signed-off-by: Phil Turnbull <phil.turnbull@oracle.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Unintentionally the default was changed to zero, fix
that.
Fixes: 12ed8244ed ("ipv4: Namespaceify tcp syncookies sysctl knob")
Signed-off-by: David S. Miller <davem@davemloft.net>
improvements for our kernel-doc. These changes also make sure
that our doc is now properly processed by the kernel-doc
parsing tool.
Other than that you have a patch updating all the copyright
lines to 2016 and another patch switching the URLs in our
readme, Kconfig and MAINTAINERS file from "http" to "https".
Both by Sven Eckelmann.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJWsV4cAAoJENpFlCjNi1MRXloP/2tpnEdIABtAzZWvdUeE7eLP
7yZGhociLYYWRbGc/uiKeHBr9UvZy+Abo0qJDrW/10BHuFb2SmIqZwBINEhIiLAm
+jpwzGKWvbeXPU/fQ9EfhG9CR7U2ceeI6c/0F8eAAofTzDZSEu2JTk43/wzXhh46
LuKKHhvqZ3T+S7LLrMm0uJmwPBGJwdFYEWgvG18WB/1VEJTsrYipM/tbinpEKfiw
QaGYXClQmOx/n0yo46/H5lR8VB4RfCF8Gtx6+5teMcYXtVO6j6lnacgoztt5P7Ku
6j/9+r/5xXmFMO/D+xkE/QprbYtomB+R4t8c2FaQZxae4FN5sIEggFphFj5XwQ4L
OpuQWFH0UhJLaxFjq54lO0tzMICy6tpQIy0EstgKoa91VOOlyniS0bb3up9/tP3O
nWpPbgVz+xB3KbAdUT0cPcN0TePGJxLC0v1Gli1KjwiVkU0kC+ADfXj54b+aNn5T
N5aTbWFiJyW/9p+hDEh+/NBqg56v40qFri7KcEBNWsV6qI2R9pmKyj68oyV/GVYS
wYi+6wIKjwW4x6Gk1AqeZsJ+nj5QVnzYOTf1XT1zGQjwseMJxUhHW2/ct2Fae7PM
9CHZVEd129BdE66NRlWzjjfefMjNlIFqMKxLGyqEKQp5ClWDdfsHtiC78ry1fHWR
n4ymPVPZXOTQ+plyObr2
=IfQN
-----END PGP SIGNATURE-----
Merge tag 'batman-adv-for-davem' of git://git.open-mesh.org/linux-merge
Antonio Quartulli says:
====================
This batch of patches includes a number of corrections and
improvements for our kernel-doc. These changes also make sure
that our doc is now properly processed by the kernel-doc
parsing tool.
Other than that you have a patch updating all the copyright
lines to 2016 and another patch switching the URLs in our
readme, Kconfig and MAINTAINERS file from "http" to "https".
Both by Sven Eckelmann.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor and consolidate cwnd and rate updates into a new function
tcp_cong_control().
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change enables congestion control to update cwnd based on
not only packet cumulatively acked but also packets delivered
out-of-order. This makes congestion control robust against packet
reordering because it may raise cwnd as long as packets are being
delivered once reordering has been detected (i.e., it only cares
the amount of packets delivered, not the ordering among them).
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A small refactoring that gets number of packets cumulatively acked
from tcp_clean_rtx_queue() directly.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch changes the accounting of how many packets are
newly acked or sacked when the sender receives an ACK.
The current approach basically computes
newly_acked_sacked = (prior_packets - prior_sacked) -
(tp->packets_out - tp->sacked_out)
where prior_packets and prior_sacked out are snapshot
at the beginning of the ACK processing.
The new approach tracks the delivery information via a new
TCP state variable "delivered" which monotically increases
as new packets are delivered in order or out-of-order.
The reason for this change is that the current approach is
brittle that produces negative or inaccurate estimate.
1) For non-SACK connections, an ACK that advances the SND.UNA
could reset the DUPACK counters (tp->sacked_out) in
tcp_process_loss() or tcp_fastretrans_alert(). This inflates
the inflight suddenly and causes under-estimate or even
negative estimate. Here is a real example:
before after (processing ACK)
packets_out 75 73
sacked_out 23 0
ca state Loss Open
The old approach computes (75-23) - (73 - 0) = -21 delivered
while the new approach computes 1 delivered since it
considers the 2nd-24th packets are delivered OOO.
2) MSS change would re-count packets_out and sacked_out so
the estimate is in-accurate and can even become negative.
E.g., the inflight is doubled when MSS is halved.
3) Spurious retransmission signaled by DSACK is not accounted
The new approach is simpler and more robust. For SACK connections,
tp->delivered increments as packets are being acked or sacked in
SACK and ACK processing.
For non-sack connections, it's done in tcp_remove_reno_sacks() and
tcp_add_reno_sack(). When an ACK advances the SND.UNA, tp->delivered
is incremented by the number of packets ACKed (less the current
number of DUPACKs received plus one packet hole). Upon receiving
a DUPACK, tp->delivered is incremented assuming one out-of-order
packet is delivered.
Upon receiving a DSACK, tp->delivered is incremtened assuming one
retransmission is delivered in tcp_sacktag_write_queue().
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the cwnd is reduced and increased in various different
places. The reduction happens in various places in the recovery
state processing (tcp_fastretrans_alert) while the increase
happens afterward.
A better sequence is to identify lost packets and update
the congestion control state (icsk_ca_state) first. Then base
on the new state, up/down the cwnd in one central place. It's
more clear to reason cwnd changes.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The retransmission and F-RTO transmission currently happen inside
recovery state processing (tcp_fastretrans_alert) but before
congestion control. This refactoring moves the logic after both
s.t. we can determine how much to send (cwnd) before deciding what to
send.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove a write-only stack variable from unix_attach_fds(). This is a
left-over from the security fix in:
commit 712f4aad40
Author: willy tarreau <w@1wt.eu>
Date: Sun Jan 10 07:54:56 2016 +0100
unix: properly account for FDs passed over unix sockets
Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we acknowledge a FIN, it is not enough to ack the sequence number
and queue the skb into receive queue. We also have to call tcp_fin()
to properly update socket state and send proper poll() notifications.
It seems we also had the problem if we received a SYN packet with the
FIN flag set, but it does not seem an urgent issue, as no known
implementation can do that.
Fixes: 61d2bcae99 ("tcp: fastopen: accept data/FIN present in SYNACK message")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Until now, tipc_rcv and tipc_send workqueues in server are allocated
with parameters WQ_UNBOUND & max_active = 1.
This parameters passed to this function makes it equivalent to
alloc_ordered_workqueue(). The later form is more explicit and
can inherit future ordered_workqueue changes.
In this commit we replace alloc_workqueue() with more readable
alloc_ordered_workqueue().
Acked-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Until now, we create timers even for the subscription requests
with timeout = TIPC_WAIT_FOREVER.
This can be improved by avoiding timer creation when the timeout
is set to TIPC_WAIT_FOREVER.
In this commit, we introduce a check to creates timers only
when timeout != TIPC_WAIT_FOREVER.
Acked-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Until now, during subscription creation the mod_time() &
tipc_subscrb_get() are called after releasing the subscriber
spin lock.
In a SMP system when performing a subscription creation, if the
subscription timeout occurs simultaneously (the timer is
scheduled to run on another CPU) then the timer thread
might decrement the subscribers refcount before the create
thread increments the refcount.
This can be simulated by creating subscription with timeout=0 and
sometimes the timeout occurs before the create request is complete.
This leads to the following message:
[30.702949] BUG: spinlock bad magic on CPU#1, kworker/u8:3/87
[30.703834] general protection fault: 0000 [#1] SMP
[30.704826] CPU: 1 PID: 87 Comm: kworker/u8:3 Not tainted 4.4.0-rc8+ #18
[30.704826] Workqueue: tipc_rcv tipc_recv_work [tipc]
[30.704826] task: ffff88003f878600 ti: ffff88003fae0000 task.ti: ffff88003fae0000
[30.704826] RIP: 0010:[<ffffffff8109196c>] [<ffffffff8109196c>] spin_dump+0x5c/0xe0
[...]
[30.704826] Call Trace:
[30.704826] [<ffffffff81091a16>] spin_bug+0x26/0x30
[30.704826] [<ffffffff81091b75>] do_raw_spin_lock+0xe5/0x120
[30.704826] [<ffffffff81684439>] _raw_spin_lock_bh+0x19/0x20
[30.704826] [<ffffffffa0096f10>] tipc_subscrb_rcv_cb+0x1d0/0x330 [tipc]
[30.704826] [<ffffffffa00a37b1>] tipc_receive_from_sock+0xc1/0x150 [tipc]
[30.704826] [<ffffffffa00a31df>] tipc_recv_work+0x3f/0x80 [tipc]
[30.704826] [<ffffffff8106a739>] process_one_work+0x149/0x3c0
[30.704826] [<ffffffff8106aa16>] worker_thread+0x66/0x460
[30.704826] [<ffffffff8106a9b0>] ? process_one_work+0x3c0/0x3c0
[30.704826] [<ffffffff8106a9b0>] ? process_one_work+0x3c0/0x3c0
[30.704826] [<ffffffff8107029d>] kthread+0xed/0x110
[30.704826] [<ffffffff810701b0>] ? kthread_create_on_node+0x190/0x190
[30.704826] [<ffffffff81684bdf>] ret_from_fork+0x3f/0x70
In this commit,
1. we remove the check for the return code for mod_timer()
2. we protect tipc_subscrb_get() using the subscriber spin lock.
We increment the subscriber's refcount as soon as we add the
subscription to subscriber's subscription list.
Acked-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Until now, while creating a subscription the subscriber lock
protects only the subscribers subscription list and not the
nametable. The call to tipc_nametbl_subscribe() is outside
the lock. However, at subscription timeout and cancel both
the subscribers subscription list and the nametable are
protected by the subscriber lock.
This asymmetric locking mechanism leads to the following problem:
In a SMP system, the timer can be fire on another core before
the create request is complete.
When the timer thread calls tipc_nametbl_unsubscribe() before create
thread calls tipc_nametbl_subscribe(), we get a nullptr exception.
This can be simulated by creating subscription with timeout=0 and
sometimes the timeout occurs before the create request is complete.
The following is the oops:
[57.569661] BUG: unable to handle kernel NULL pointer dereference at (null)
[57.577498] IP: [<ffffffffa02135aa>] tipc_nametbl_unsubscribe+0x8a/0x120 [tipc]
[57.584820] PGD 0
[57.586834] Oops: 0002 [#1] SMP
[57.685506] CPU: 14 PID: 10077 Comm: kworker/u40:1 Tainted: P OENX 3.12.48-52.27.1. 9688.1.PTF-default #1
[57.703637] Workqueue: tipc_rcv tipc_recv_work [tipc]
[57.708697] task: ffff88064c7f00c0 ti: ffff880629ef4000 task.ti: ffff880629ef4000
[57.716181] RIP: 0010:[<ffffffffa02135aa>] [<ffffffffa02135aa>] tipc_nametbl_unsubscribe+0x8a/ 0x120 [tipc]
[...]
[57.812327] Call Trace:
[57.814806] [<ffffffffa0211c77>] tipc_subscrp_delete+0x37/0x90 [tipc]
[57.821357] [<ffffffffa0211e2f>] tipc_subscrp_timeout+0x3f/0x70 [tipc]
[57.827982] [<ffffffff810618c1>] call_timer_fn+0x31/0x100
[57.833490] [<ffffffff81062709>] run_timer_softirq+0x1f9/0x2b0
[57.839414] [<ffffffff8105a795>] __do_softirq+0xe5/0x230
[57.844827] [<ffffffff81520d1c>] call_softirq+0x1c/0x30
[57.850150] [<ffffffff81004665>] do_softirq+0x55/0x90
[57.855285] [<ffffffff8105aa35>] irq_exit+0x95/0xa0
[57.860290] [<ffffffff815215b5>] smp_apic_timer_interrupt+0x45/0x60
[57.866644] [<ffffffff8152005d>] apic_timer_interrupt+0x6d/0x80
[57.872686] [<ffffffffa02121c5>] tipc_subscrb_rcv_cb+0x2a5/0x3f0 [tipc]
[57.879425] [<ffffffffa021c65f>] tipc_receive_from_sock+0x9f/0x100 [tipc]
[57.886324] [<ffffffffa021c826>] tipc_recv_work+0x26/0x60 [tipc]
[57.892463] [<ffffffff8106fb22>] process_one_work+0x172/0x420
[57.898309] [<ffffffff8107079a>] worker_thread+0x11a/0x3c0
[57.903871] [<ffffffff81077114>] kthread+0xb4/0xc0
[57.908751] [<ffffffff8151f318>] ret_from_fork+0x58/0x90
In this commit, we do the following at subscription creation:
1. set the subscription's subscriber pointer before performing
tipc_nametbl_subscribe(), as this value is required further in
the call chain ex: by tipc_subscrp_send_event().
2. move tipc_nametbl_subscribe() under the scope of subscriber lock
Acked-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Until now, the subscribers endianness for a subscription
create/cancel request is determined as:
swap = !(s->filter & (TIPC_SUB_PORTS | TIPC_SUB_SERVICE))
The checks are performed only for port/service subscriptions.
The swap calculation is incorrect if the filter in the subscription
cancellation request is set to TIPC_SUB_CANCEL (it's a malformed
cancel request, as the corresponding subscription create filter
is missing).
Thus, the check if the request is for cancellation fails and the
request is treated as a subscription create request. The
subscription creation fails as the request is illegal, which
terminates this connection.
In this commit we determine the endianness by including
TIPC_SUB_CANCEL, which will set swap correctly and the
request is processed as a cancellation request.
Acked-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In 'commit 7fe8097cef ("tipc: fix nullpointer bug when subscribing
to events")', we terminate the connection if the subscription
creation fails.
In the same commit, the subscription creation result was based on
the value of subscription pointer (set in the function) instead of
the return code.
Unfortunately, the same function also handles subscription
cancellation request. For a subscription cancellation request,
the subscription pointer cannot be set. Thus the connection is
terminated during cancellation request.
In this commit, we move the subcription cancel check outside
of tipc_subscrp_create(). Hence,
- tipc_subscrp_create() will create a subscripton
- tipc_subscrb_rcv_cb() will subscribe or cancel a subscription.
Fixes: 'commit 7fe8097cef ("tipc: fix nullpointer bug when subscribing to events")'
Acked-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In this commit, we split tipc_subscrp_create() into two:
1. tipc_subscrp_create() creates a subscription
2. A new function tipc_subscrp_subscribe() adds the
subscription to the subscriber subscription list,
activates the subscription timer and subscribes to
the nametable updates.
In future commits, the purpose of tipc_subscrb_rcv_cb() will
be to either subscribe or cancel a subscription.
There is no functional change in this commit.
Acked-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Until now, struct tipc_subscriber has duplicate fields for
type, upper and lower (as member of struct tipc_name_seq) at:
1. as member seq in struct tipc_subscription
2. as member seq in struct tipc_subscr, which is contained
in struct tipc_event
The former structure contains the type, upper and lower
values in network byte order and the later contains the
intact copy of the request.
The struct tipc_subscription contains a field swap to
determine if request needs network byte order conversion.
Thus by using swap, we can convert the request when
required instead of duplicating it.
In this commit,
1. we remove the references to these elements as members of
struct tipc_subscription and replace them with elements
from struct tipc_subscr.
2. provide new functions to convert the user request into
network byte order.
Acked-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Until now, struct tipc_subscription has duplicate timeout and filter
attributes present:
1. directly as members of struct tipc_subscription
2. in struct tipc_subscr, which is contained in struct tipc_event
In this commit, we remove the references to these elements as
members of struct tipc_subscription and replace them with elements
from struct tipc_subscr.
Acked-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Until now, during subscription creation we set sub->timeout by
converting the timeout request value in milliseconds to jiffies.
This is followed by setting the timeout value in the timer if
sub->timeout != TIPC_WAIT_FOREVER.
For a subscription create request with a timeout value of
TIPC_WAIT_FOREVER, msecs_to_jiffies(TIPC_WAIT_FOREVER)
returns MAX_JIFFY_OFFSET (0xfffffffe). This is not equal to
TIPC_WAIT_FOREVER (0xffffffff).
In this commit, we remove this check.
Acked-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
netdev_rss_key is written to once and thereafter is read by
drivers when they are initialising. The fact that it is mostly
read and not written to makes it a candidate for a __read_mostly
declaration.
Signed-off-by: Kim Jones <kim-marie.jones@intel.com>
Signed-off-by: Alan Carey <alan.carey@intel.com>
Acked-by: Rami Rosen <rami.rosen@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If we remove the SYN flag from the skbs that tcp_fastopen_add_skb()
places in socket receive queue, then we can remove the test that
tcp_recvmsg() has to perform in fast path.
All we have to do is to adjust SEQ in the slow path.
For the moment, we place an unlikely() and output a message
if we find an skb having SYN flag set.
Goal would be to get rid of the test completely.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC 7413 (TCP Fast Open) 4.2.2 states that the SYNACK message
MAY include data and/or FIN
This patch adds support for the client side :
If we receive a SYNACK with payload or FIN, queue the skb instead
of ignoring it.
Since we already support the same for SYN, we refactor the existing
code and reuse it. Note we need to clone the skb, so this operation
might fail under memory pressure.
Sara Dickinson pointed out FreeBSD server Fast Open implementation
was planned to generate such SYNACK in the future.
The server side might be implemented on linux later.
Reported-by: Sara Dickinson <sara@sinodun.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A rcu stall with the following backtrace was seen on a system with
forwarding, optimistic_dad and use_optimistic set. To reproduce,
set these flags and allow ipv6 autoconf.
This occurs because the device write_lock is acquired while already
holding the read_lock. Back trace below -
INFO: rcu_preempt self-detected stall on CPU { 1} (t=2100 jiffies
g=3992 c=3991 q=4471)
<6> Task dump for CPU 1:
<2> kworker/1:0 R running task 12168 15 2 0x00000002
<2> Workqueue: ipv6_addrconf addrconf_dad_work
<6> Call trace:
<2> [<ffffffc000084da8>] el1_irq+0x68/0xdc
<2> [<ffffffc000cc4e0c>] _raw_write_lock_bh+0x20/0x30
<2> [<ffffffc000bc5dd8>] __ipv6_dev_ac_inc+0x64/0x1b4
<2> [<ffffffc000bcbd2c>] addrconf_join_anycast+0x9c/0xc4
<2> [<ffffffc000bcf9f0>] __ipv6_ifa_notify+0x160/0x29c
<2> [<ffffffc000bcfb7c>] ipv6_ifa_notify+0x50/0x70
<2> [<ffffffc000bd035c>] addrconf_dad_work+0x314/0x334
<2> [<ffffffc0000b64c8>] process_one_work+0x244/0x3fc
<2> [<ffffffc0000b7324>] worker_thread+0x2f8/0x418
<2> [<ffffffc0000bb40c>] kthread+0xe0/0xec
v2: do addrconf_dad_kick inside read lock and then acquire write
lock for ipv6_ifa_notify as suggested by Eric
Fixes: 7fd2561e4e ("net: ipv6: Add a sysctl to make optimistic
addresses useful candidates")
Cc: Eric Dumazet <edumazet@google.com>
Cc: Erik Kline <ek@google.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds an rx_nohandler stat counter, along with a sysfs statistics
node, and copies the counter out via netlink as well.
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jiri@mellanox.com>
CC: Daniel Borkmann <daniel@iogearbox.net>
CC: Tom Herbert <tom@herbertland.com>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <gospo@cumulusnetworks.com>
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The netdev_stats_to_stats64 function copies the deprecated
net_device_stats format stats into rtnl_link_stats64 for legacy support
purposes, but with the BUILD_BUG_ON as it was, it wasn't possible to
extend rtnl_link_stats64 without also extending net_device_stats. Relax
the BUILD_BUG_ON to only require that rtnl_link_stats64 is larger, and
zero out all the stat counters that aren't present in net_device_stats.
CC: Eric Dumazet <edumazet@google.com>
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently link priority changes isn't handled for active links. In
this patch we resolve this by changing our priority if the peer passes
a valid priority in a state message.
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Changing certain link attributes (link tolerance and link priority)
from the TIPC management tool is supposed to automatically take
effect at both endpoints of the affected link.
Currently the media address is not instantiated for the link and is
used uninstantiated when crafting protocol messages designated for the
peer endpoint. This means that changing a link property currently
results in the property being changed on the local machine but the
protocol message designated for the peer gets lost. Resulting in
property discrepancy between the endpoints.
In this patch we resolve this by using the media address from the
link entry and using the bearer transmit function to send it. Hence,
we can now eliminate the redundant function tipc_link_prot_xmit() and
the redundant field tipc_link::media_addr.
Fixes: 2af5ae372a (tipc: clean up unused code and structures)
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Reported-by: Jason Hu <huzhijiang@gmail.com>
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull Ceph fixes from Sage Weil:
"We have a few wire protocol compatibility fixes, ports of a few recent
CRUSH mapping changes, and a couple error path fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
libceph: MOSDOpReply v7 encoding
libceph: advertise support for TUNABLES5
crush: decode and initialize chooseleaf_stable
crush: add chooseleaf_stable tunable
crush: ensure take bucket value is valid
crush: ensure bucket id is valid before indexing buckets array
ceph: fix snap context leak in error path
ceph: checking for IS_ERR instead of NULL
Add a function to allow creation and addition of a new transport
to an existing rpc_clnt
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Add a helper for tasks that require us to apply a function to all the
transports in an rpc_clnt.
An example of a usecase would be BIND_CONN_TO_SESSION, where we want
to send one RPC call down each transport.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This is needed in order to allow the NFSv4.1 backchannel and
BIND_CONN_TO_SESSION function to work.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This is a pre-patch for the RPC multipath code. It sets up the storage in
struct rpc_clnt for the multipath code.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
In order to support multipathing/trunking we will need the ability to
track multiple transports. This patch sets up a basic structure for
doing so.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Empty request_redirect_t (struct ceph_request_redirect in the kernel
client) is now encoded with a bool. NEW_OSDOPREPLY_ENCODING feature
bit overlaps with already supported CRUSH_TUNABLES5.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Add a tunable to fix the bug that chooseleaf may cause unnecessary pg
migrations when some device fails.
Reflects ceph.git commit fdb3f664448e80d984470f32f04e2e6f03ab52ec.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Ensure that the take argument is a valid bucket ID before indexing the
buckets array.
Reflects ceph.git commit 93ec538e8a667699876b72459b8ad78966d89c61.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
We were indexing the buckets array without verifying the index was
within the [0,max_buckets) range. This could happen because
a multistep rule does not have enough buckets and has CRUSH_ITEM_NONE
for an intermediate result, which would feed in CRUSH_ITEM_NONE and
make us crash.
Reflects ceph.git commit 976a24a326da8931e689ee22fce35feab5b67b76.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
open-mesh.org and its subdomains can only be accessed via HTTPS. HTTP-only
requests are currently redirected automatically to HTTPS but references in
the source code should be only https.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The change from cur_tp to the function
minstrel_get_tp_avg/minstrel_ht_get_tp_avg changed the unit used for the
current throughput. For example in minstrel_ht the correct
conversion between them would be:
mrs->cur_tp / 10 == minstrel_ht_get_tp_avg(..).
This factor 10 must also be included in the calculation of
minstrel_get_expected_throughput and minstrel_ht_get_expected_throughput to
return values with the unit [Kbps] instead of [10Kbps]. Otherwise routing
algorithms like B.A.T.M.A.N. V will make incorrect decision based on these
values. Its kernel based implementation expects expected_throughput always
to have the unit [Kbps] and not sometimes [10Kbps] and sometimes [Kbps].
The same requirement has iw or olsrdv2's nl80211 based statistics module
which retrieve the same data via NL80211_STA_INFO_TX_BITRATE.
Cc: stable@vger.kernel.org
Fixes: 6a27b2c40b ("mac80211: restructure per-rate throughput calculation into function")
Signed-off-by: Sven Eckelmann <sven@open-mesh.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Patch fixes this splat
BUG: KASAN: slab-out-of-bounds in minstrel_ht_update_stats.isra.7+0x6e1/0x9e0
[mac80211] at addr ffff8800cee640f4 Read of size 4 by task swapper/3/0
Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Link: http://lkml.kernel.org/r/CALYGNiNyJhSaVnE35qS6UCGaSb2Dx1_i5HcRavuOX14oTz2P+w@mail.gmail.com
Acked-by: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Simon Wunderlich <simon@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Pull networking fixes from David Miller:
"This looks like a lot but it's a mixture of regression fixes as well
as fixes for longer standing issues.
1) Fix on-channel cancellation in mac80211, from Johannes Berg.
2) Handle CHECKSUM_COMPLETE properly in xt_TCPMSS netfilter xtables
module, from Eric Dumazet.
3) Avoid infinite loop in UDP SO_REUSEPORT logic, also from Eric
Dumazet.
4) Avoid a NULL deref if we try to set SO_REUSEPORT after a socket is
bound, from Craig Gallek.
5) GRO key comparisons don't take lightweight tunnels into account,
from Jesse Gross.
6) Fix struct pid leak via SCM credentials in AF_UNIX, from Eric
Dumazet.
7) We need to set the rtnl_link_ops of ipv6 SIT tunnels before we
register them, otherwise the NEWLINK netlink message is missing
the proper attributes. From Thadeu Lima de Souza Cascardo.
8) Several Spectrum chip bug fixes for mlxsw switch driver, from Ido
Schimmel
9) Handle fragments properly in ipv4 easly socket demux, from Eric
Dumazet.
10) Don't ignore the ifindex key specifier on ipv6 output route
lookups, from Paolo Abeni"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (128 commits)
tcp: avoid cwnd undo after receiving ECN
irda: fix a potential use-after-free in ircomm_param_request
net: tg3: avoid uninitialized variable warning
net: nb8800: avoid uninitialized variable warning
net: vxge: avoid unused function warnings
net: bgmac: clarify CONFIG_BCMA dependency
net: hp100: remove unnecessary #ifdefs
net: davinci_cpdma: use dma_addr_t for DMA address
ipv6/udp: use sticky pktinfo egress ifindex on connect()
ipv6: enforce flowi6_oif usage in ip6_dst_lookup_tail()
netlink: not trim skb for mmaped socket when dump
vxlan: fix a out of bounds access in __vxlan_find_mac
net: dsa: mv88e6xxx: fix port VLAN maps
fib_trie: Fix shift by 32 in fib_table_lookup
net: moxart: use correct accessors for DMA memory
ipv4: ipconfig: avoid unused ic_proto_used symbol
bnxt_en: Fix crash in bnxt_free_tx_skbs() during tx timeout.
bnxt_en: Exclude rx_drop_pkts hw counter from the stack's rx_dropped counter.
bnxt_en: Ring free response from close path should use completion ring
net_sched: drr: check for NULL pointer in drr_dequeue
...
The spin_unlock call should have been left as-is, revert.
Fixes: b16c29191d ("netfilter: nf_conntrack: use safer way to lock all buckets")
Reported-by: kernel test robot <fengguang.wu@intel.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Since bd678e09dc ("netfilter: nfnetlink: fix splat due to incorrect
socket memory accounting in skbuff clones"), we don't manually attach
the sk to the skbuff clone anymore, so we have to use the original
skbuff from netlink_ack() which needs to access the sk pointer.
Fixes: bd678e09dc ("netfilter: nfnetlink: fix splat due to incorrect socket memory accounting in skbuff clones")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Ulrich reports soft lockup with following (shortened) callchain:
NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s!
__netif_receive_skb_core+0x6e4/0x774
process_backlog+0x94/0x160
net_rx_action+0x88/0x178
call_do_softirq+0x24/0x3c
do_softirq+0x54/0x6c
__local_bh_enable_ip+0x7c/0xbc
nf_ct_iterate_cleanup+0x11c/0x22c [nf_conntrack]
masq_inet_event+0x20/0x30 [nf_nat_masquerade_ipv6]
atomic_notifier_call_chain+0x1c/0x2c
ipv6_del_addr+0x1bc/0x220 [ipv6]
Problem is that nf_ct_iterate_cleanup can run for a very long time
since it can be interrupted by softirq processing.
Moreover, atomic_notifier_call_chain runs with rcu readlock held.
So lets call cond_resched() in nf_ct_iterate_cleanup and defer
the call to a work queue for the atomic_notifier_call_chain case.
We also need another cond_resched in get_next_corpse, since we
have to deal with iter() always returning false, in that case
get_next_corpse will walk entire conntrack table.
Reported-by: Ulrich Weber <uw@ocedo.com>
Tested-by: Ulrich Weber <uw@ocedo.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Johan Hedberg says:
====================
pull request: bluetooth 2016-01-30
Here's a set of important Bluetooth fixes for the 4.5 kernel:
- Two fixes to 6LoWPAN code (one fixing a potential crash)
- Fix LE pairing with devices using both public and random addresses
- Fix allocation of dynamic LE PSM values
- Fix missing COMPATIBLE_IOCTL for UART line discipline
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC 4015 section 3.4 says the TCP sender MUST refrain from
reversing the congestion control state when the ACK signals
congestion through the ECN-Echo flag. Currently we may not
always do that when prior_ssthresh is reset upon receiving
ACKs with ECE marks. This patch fixes that.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
self->ctrl_skb is protected by self->spinlock, we should not
access it out of the lock. Move the debugging printk inside.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Samuel Ortiz <samuel@sortiz.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, the egress interface index specified via IPV6_PKTINFO
is ignored by __ip6_datagram_connect(), so that RFC 3542 section 6.7
can be subverted when the user space application calls connect()
before sendmsg().
Fix it by initializing properly flowi6_oif in connect() before
performing the route lookup.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current implementation of ip6_dst_lookup_tail basically
ignore the egress ifindex match: if the saddr is set,
ip6_route_output() purposefully ignores flowi6_oif, due
to the commit d46a9d678e ("net: ipv6: Dont add RT6_LOOKUP_F_IFACE
flag if saddr set"), if the saddr is 'any' the first route lookup
in ip6_dst_lookup_tail fails, but upon failure a second lookup will
be performed with saddr set, thus ignoring the ifindex constraint.
This commit adds an output route lookup function variant, which
allows the caller to specify lookup flags, and modify
ip6_dst_lookup_tail() to enforce the ifindex match on the second
lookup via said helper.
ip6_route_output() becames now a static inline function build on
top of ip6_route_output_flags(); as a side effect, out-of-tree
modules need now a GPL license to access the output route lookup
functionality.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We should not trim skb for mmaped socket since its buf size is fixed
and userspace will read as frame which data equals head. mmaped
socket will not call recvmsg, means max_recvmsg_len is 0,
skb_reserve was not called before commit: db65a3aaf2.
Fixes: db65a3aaf2 (netlink: Trim skb to alloc size to avoid MSG_TRUNC)
Signed-off-by: Ken-ichirou MATSUZAWA <chamas@h4.dion.ne.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
The fib_table_lookup function had a shift by 32 that triggered a UBSAN
warning. This was due to the fact that I had placed the shift first and
then followed it with the check for the suffix length to ignore the
undefined behavior. If we reorder this so that we verify the suffix is
less than 32 before shifting the value we can avoid the issue.
Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When CONFIG_PROC_FS, CONFIG_IP_PNP_BOOTP, CONFIG_IP_PNP_DHCP and
CONFIG_IP_PNP_RARP are all disabled, we get a warning about the
ic_proto_used variable being unused:
net/ipv4/ipconfig.c:146:12: error: 'ic_proto_used' defined but not used [-Werror=unused-variable]
This avoids the warning, by making the definition conditional on
whether a dynamic IP configuration protocol is configured. If not,
we know that the value is always zero, so we can optimize away the
variable and all code that depends on it.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are cases where qdisc_dequeue_peeked can return NULL, and the result
is dereferenced later on in the function.
Similarly to the other qdisc dequeue functions, check whether the skb
pointer is NULL and if it is, goto out.
Signed-off-by: Bernie Harris <bernie.harris@alliedtelesis.co.nz>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In 'commit 7fe8097cef ("tipc: fix nullpointer bug when subscribing
to events")', we terminate the connection if the subscription
creation fails.
In the same commit, the subscription creation result was based on
the value of the subscription pointer (set in the function) instead
of the return code.
Unfortunately, the same function tipc_subscrp_create() handles
subscription cancel request. For a subscription cancellation request,
the subscription pointer cannot be set. Thus if a subscriber has
several subscriptions and cancels any of them, the connection is
terminated.
In this commit, we terminate the connection based on the return value
of tipc_subscrp_create().
Fixes: commit 7fe8097cef ("tipc: fix nullpointer bug when subscribing to events")
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We should not assume a valid protocol header is present,
as this is not the case for IPv4 fragments.
Lets avoid extra cache line misses and potential bugs
if we actually find a socket and incorrectly uses its dst.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use kzalloc instead of kmalloc for struct tid_ampdu_rx to
initialize the "removed" field (all others are initialized
manually). That fixes:
UBSAN: Undefined behaviour in net/mac80211/rx.c:932:29
load of value 2 is not a valid value for type '_Bool'
CPU: 3 PID: 1134 Comm: kworker/u16:7 Not tainted 4.5.0-rc1+ #265
Workqueue: phy0 rt2x00usb_work_rxdone
0000000000000004 ffff880254a7ba50 ffffffff8181d866 0000000000000007
ffff880254a7ba78 ffff880254a7ba68 ffffffff8188422d ffffffff8379b500
ffff880254a7bab8 ffffffff81884747 0000000000000202 0000000348620032
Call Trace:
[<ffffffff8181d866>] dump_stack+0x45/0x5f
[<ffffffff8188422d>] ubsan_epilogue+0xd/0x40
[<ffffffff81884747>] __ubsan_handle_load_invalid_value+0x67/0x70
[<ffffffff82227b4d>] ieee80211_sta_reorder_release.isra.16+0x5ed/0x730
[<ffffffff8222ca14>] ieee80211_prepare_and_rx_handle+0xd04/0x1c00
[<ffffffff8222db03>] __ieee80211_rx_handle_packet+0x1f3/0x750
[<ffffffff8222e4a7>] ieee80211_rx_napi+0x447/0x990
While at it, convert to use sizeof(*tid_agg_rx) instead.
Fixes: 788211d81b ("mac80211: fix RX A-MPDU session reorder timer deletion")
Cc: stable@vger.kernel.org
Signed-off-by: Chris Bainbridge <chris.bainbridge@gmail.com>
[reword commit message, use sizeof(*tid_agg_rx)]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Since cfg80211 frequently takes actions from its netdev notifier
call, wireless extensions messages could still be ordered badly
since the wext netdev notifier, since wext is built into the
kernel, runs before the cfg80211 netdev notifier. For example,
the following can happen:
5: wlan1: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN group default
link/ether 02:00:00:00:01:00 brd ff:ff:ff:ff:ff:ff
5: wlan1: <BROADCAST,MULTICAST,UP>
link/ether
when setting the interface down causes the wext message.
To also fix this, export the wireless_nlevent_flush() function
and also call it from the cfg80211 notifier.
Cc: stable@vger.kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Beniamino reported that he was getting an RTM_NEWLINK message for a
given interface, after the RTM_DELLINK for it. It turns out that the
message is a wireless extensions message, which was sent because the
interface had been connected and disconnection while it was deleted
caused a wext message.
For its netlink messages, wext uses RTM_NEWLINK, but the message is
without all the regular rtnetlink attributes, so "ip monitor link"
prints just rudimentary information:
5: wlan1: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN group default
link/ether 02:00:00:00:01:00 brd ff:ff:ff:ff:ff:ff
Deleted 5: wlan1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default
link/ether 02:00:00:00:01:00 brd ff:ff:ff:ff:ff:ff
5: wlan1: <BROADCAST,MULTICAST,UP>
link/ether
(from my hwsim reproduction)
This can cause userspace to get confused since it doesn't expect an
RTM_NEWLINK message after RTM_DELLINK.
The reason for this is that wext schedules a worker to send out the
messages, and the scheduling delay can cause the messages to get out
to userspace in different order.
To fix this, have wext register a netdevice notifier and flush out
any pending messages when netdevice state changes. This fixes any
ordering whenever the original message wasn't sent by a notifier
itself.
Cc: stable@vger.kernel.org
Reported-by: Beniamino Galvani <bgalvani@redhat.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The commit cad20c2780 was supposed to
fix handling of devices first using public addresses and then
switching to RPAs after pairing. Unfortunately it missed a couple of
key places in the code.
1. When evaluating which devices should be removed from the existing
white list we also need to consider whether we have an IRK for them or
not, i.e. a call to hci_find_irk_by_addr() is needed.
2. In smp_notify_keys() we should not be requiring the knowledge of
the RPA, but should simply keep the IRK around if the other conditions
require it.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Cc: stable@vger.kernel.org # 4.4+
At least the l2cap_add_psm() routine depends on the source address
type being properly set to know what auto-allocation ranges to use, so
the assignment to l2cap_chan needs to happen before this.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The LE dynamic PSM range is different from BR/EDR (0x0080 - 0x00ff)
and doesn't have requirements relating to parity, so separate checks
are needed.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Having proper defines makes the code a bit readable, it also avoids
duplicating hard-coded values since these are also needed when
auto-allocating PSM values (in a subsequent patch).
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
With some combinations of user provided flags in netlink command,
it is possible to call tcp_get_info() with a buffer that is not 8-bytes
aligned.
It does matter on some arches, so we need to use put_unaligned() to
store the u64 fields.
Current iproute2 package does not trigger this particular issue.
Fixes: 0df48c26d8 ("tcp: add tcpi_bytes_acked to tcp_info")
Fixes: 977cb0ecf8 ("tcp: add pacing_rate information into tcp_info")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When switchdev drivers process FDB notifications from the underlying
device they resolve the netdev to which the entry points to and notify
the bridge using the switchdev notifier.
However, since the RTNL mutex is not held there is nothing preventing
the netdev from disappearing in the middle, which will cause
br_switchdev_event() to dereference a non-existing netdev.
Make switchdev drivers hold the lock at the beginning of the
notification processing session and release it once it ends, after
notifying the bridge.
Also, remove switchdev_mutex and fdb_lock, as they are no longer needed
when RTNL mutex is held.
Fixes: 03bf0c2812 ("switchdev: introduce switchdev notifier")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* make regulatory messages much less verbose by default
* various remain-on-channel fixes
* scheduled scanning fixes with hardware restart
* a PS-Poll handling fix; was broken just recently
* bugfix to avoid buffering non-bufferable MMPDUs
* world regulatory domain data fix
* a fix for scanning causing other work to get stuck
* hwsim: revert an older problematic patch that caused some
userspace tools to have issues - not that big a deal as
it's a debug only driver though
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJWp05DAAoJEGt7eEactAAdDkkP/0aZQOVRV/TSfn33LiYUoWR7
lYeGptAUITc2y1sWnGspycDDT0aM9/Mdu/A0O0qLZ1Ra60OaIQCmfyV3SlTDna0n
k5Jrm7z9wc1IkQWGSLC1FPbpHPjGHkKL1Ux3lU9dJGz3cDreGhfKPcWuRP30fAns
c4XbkhYQVtdPQZl6Dz8xv8javwi2mJ9HizeHrTt1uDPh770ai4wLF6vcpLj2/Ear
YkPr61ueQS+DfltsBg3ygqasimtdPW+TG8q0oxaX3VA5j9q6zR8e/vAEAUj4X1Yr
gQ6e9N/lVHniCuq5csS+XHnh2qDPIHHbm5T6zDhA8f35CRfAXh/Nnbj/AI/wK3ef
R88D8FU5NKoliqfFDywtOnYWYvc+NwFQ9HT+aB/QW9O4PYWpiqetL2TX3ieiFi6X
LuQ6z7E1fr6DmpUdBnAbdyx4AMkoaalZPCjql5SkZquHBvBuqV9GXrhf4NeGCi4M
KQExOFbIXu6fpD5c/QBpcxJwSwfJEs72E9WCiZEtOfGzOEgf4fetVfRL2mpN9lmU
hBqFYnvLpwK76hJRk0RcSEL3PuDwC6hpfGWs4ngyKa+anY3/2HaiGcp8k+ZoVuN9
WDgudLKnwSb5WE/kcVQkCmYyQu3nkoSKL5fDWuDauRJ2+TNW4Wnhh2ak2q8Azrzp
n15dzktEk/yhkagsNj5C
=6M7a
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2016-01-26' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
Here's a first set of fixes for the 4.5-rc cycle:
* make regulatory messages much less verbose by default
* various remain-on-channel fixes
* scheduled scanning fixes with hardware restart
* a PS-Poll handling fix; was broken just recently
* bugfix to avoid buffering non-bufferable MMPDUs
* world regulatory domain data fix
* a fix for scanning causing other work to get stuck
* hwsim: revert an older problematic patch that caused some
userspace tools to have issues - not that big a deal as
it's a debug only driver though
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit fixes a corner case in tcp_mark_head_lost() which was
causing the WARN_ON(len > skb->len) in tcp_fragment() to fire.
tcp_mark_head_lost() was assuming that if a packet has
tcp_skb_pcount(skb) of N, then it's safe to fragment off a prefix of
M*mss bytes, for any M < N. But with the tricky way TCP pcounts are
maintained, this is not always true.
For example, suppose the sender sends 4 1-byte packets and have the
last 3 packet sacked. It will merge the last 3 packets in the write
queue into an skb with pcount = 3 and len = 3 bytes. If another
recovery happens after a sack reneging event, tcp_mark_head_lost()
may attempt to split the skb assuming it has more than 2*MSS bytes.
This sounds very counterintuitive, but as the commit description for
the related commit c0638c247f ("tcp: don't fragment SACKed skbs in
tcp_mark_head_lost()") notes, this is because tcp_shifted_skb()
coalesces adjacent regions of SACKed skbs, and when doing this it
preserves the sum of their packet counts in order to reflect the
real-world dynamics on the wire. The c0638c247f commit tried to
avoid problems by not fragmenting SACKed skbs, since SACKed skbs are
where the non-proportionality between pcount and skb->len/mss is known
to be possible. However, that commit did not handle the case where
during a reneging event one of these weird SACKed skbs becomes an
un-SACKed skb, which tcp_mark_head_lost() can then try to fragment.
The fix is to simply mark the entire skb lost when this happens.
This makes the recovery slightly more aggressive in such corner
cases before we detect reordering. But once we detect reordering
this code path is by-passed because FACK is disabled.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After we use refcnt to check if transport is alive, the dead can be
removed from sctp_transport.
The traversal of transport_addr_list in procfs dump is using
list_for_each_entry_rcu, no need to check if it has been freed.
sctp_generate_t3_rtx_event and sctp_generate_heartbeat_event is
protected by sock lock, it's not necessary to check dead, either.
also, the timers are cancelled when sctp_transport_free() is
called, that it doesn't wait for refcnt to reach 0 to cancel them.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previously, before rhashtable, /proc assoc listing was done by
read-locking the entire hash entry and dumping all assocs at once, so we
were sure that the assoc wasn't freed because it wouldn't be possible to
remove it from the hash meanwhile.
Now we use rhashtable to list transports, and dump entries one by one.
That is, now we have to check if the assoc is still a good one, as the
transport we got may be being freed.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now when __sctp_lookup_association is running in BH, it will try to
check if t->dead is set, but meanwhile other CPUs may be freeing this
transport and this assoc and if it happens that
__sctp_lookup_association checked t->dead a bit too early, it may think
that the association is still good while it was already freed.
So we fix this race by using atomic_add_unless in sctp_transport_hold.
After we get one transport from hashtable, we will hold it only when
this transport's refcnt is not 0, so that we can make sure t->asoc
cannot be freed before we hold the asoc again.
Note that sctp association is not freed using RCU so we can't use
atomic_add_unless() with it as it may just be too late for that either.
Fixes: 4f00878126 ("sctp: apply rhashtable api to send/recv path")
Reported-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The tty core no longer provides nor uses ASYNC_CLOSING; remove from
tty_port_close_start() and tty_port_close_end() as well as tty drivers
which open-code these state changes. Unfortunately, even though the
bit is masked from userspace, its inclusion in a uapi header precludes
removing the macro.
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux-s390@vger.kernel.org
Cc: Mikael Starvik <starvik@axis.com>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: linux-cris-kernel@axis.com
Cc: Samuel Ortiz <samuel@sortiz.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Expressions of the form "tty->termios.c_*flag & FLAG"
are more clearly expressed with the termios flags macros,
I_FLAG(), C_FLAG(), O_FLAG(), and L_FLAG().
Convert treewide.
Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Acked-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The N_NCI ldisc does not define a flush_buffer() ldisc method, so
the check when opening the ldisc is always false.
Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch removes the unnecessary ivsize variabls as they always
have the value of AES_BLOCK_SIZE.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This patch replaces uses of the long obsolete hash interface with
ahash.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: David S. Miller <davem@davemloft.net>
This patch replaces uses of the long obsolete hash interface with
shash.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: David S. Miller <davem@davemloft.net>
This patch removes the last reference to hash and ablkcipher from
IPsec and replaces them with ahash and skcipher respectively. For
skcipher there is currently no difference at all, while for ahash
the current code is actually buggy and would prevent asynchronous
algorithms from being discovered.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: David S. Miller <davem@davemloft.net>
This patch replaces uses of blkcipher with skcipher.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
This patch replaces uses of blkcipher with skcipher and the long
obsolete hash interface with shash.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Marcel Holtmann <marcel@holtmann.org>
This patch replaces uses of blkcipher with skcipher and the long
obsolete hash interface with ahash.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This patch replaces uses of blkcipher with skcipher and the long
obsolete hash interface with either shash (for non-SG users) and
ahash.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The code within wait_event_interruptible() is called with
!TASK_RUNNING, so mustn't call any functions that can sleep,
like mutex_lock().
Since we re-check the list_empty() in a loop after the wait,
it's safe to simply use list_empty() without locking.
This bug has existed forever, but was only discovered now
because all userspace implementations, including the default
'rfkill' tool, use poll() or select() to get a readable fd
before attempting to read.
Cc: stable@vger.kernel.org
Fixes: c64fb01627 ("rfkill: create useful userspace interface")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
During a sw scan ieee80211_iface_work ignores work items for all vifs.
However after the scan complete work is requeued only for STA, ADHOC
and MESH iftypes.
This occasionally results in event processing getting delayed/not
processed for iftype AP when it coexists with a STA. This can result
in data halt and eventually disconnection on the AP interface.
Cc: stable@vger.kernel.org
Signed-off-by: Sachin Kulkarni <Sachin.Kulkarni@imgtec.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When creating a SIT tunnel with ip tunnel, rtnl_link_ops is not set before
ipip6_tunnel_create is called. When register_netdevice is called, there is
no linkinfo attribute in the NEWLINK message because of that.
Setting rtnl_link_ops before calling register_netdevice fixes that.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ESP algorithms using CBC mode require echainiv. Hence INET*_ESP have
to select CRYPTO_ECHAINIV in order to work properly. This solves the
issues caused by a misconfiguration as described in [1].
The original approach, patching crypto/Kconfig was turned down by
Herbert Xu [2].
[1] https://lists.strongswan.org/pipermail/users/2015-December/009074.html
[2] http://marc.info/?l=linux-crypto-vger&m=145224655809562&w=2
Signed-off-by: Thomas Egerer <hakke_007@gmx.de>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch extends commit b93d647174 ("sctp: implement the sender side
for SACK-IMMEDIATELY extension") as it didn't white list
SCTP_SACK_IMMEDIATELY on sctp_msghdr_parse(), causing it to be
understood as an invalid flag and returning -EINVAL to the application.
Note that the actual handling of the flag is already there in
sctp_datamsg_from_user().
https://tools.ietf.org/html/rfc7053#section-7
Fixes: b93d647174 ("sctp: implement the sender side for SACK-IMMEDIATELY extension")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dmitry reported a struct pid leak detected by a syzkaller program.
Bug happens in unix_stream_recvmsg() when we break the loop when a
signal is pending, without properly releasing scm.
Fixes: b3ca9b02b0 ("net: fix multithreaded signal handling in unix recv routines")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
Comment: GPGTools - http://gpgtools.org
iQIcBAABAgAGBQJWpRmuAAoJEDZk62b0Tg6xnXYP/3nukU7v56M+jTI32XqxKRXb
BtyuOFGxwp0K3pDdtupQs9My1n11zwcN+A15pWQtHjDSY466pX8LJlGUD1aZ9x5A
YHofrRmjFauo61CKSxiIrWt4kO1i/fs5SAoihMsjIT4XBIVS+Snp6uIKY/1Lz60L
h+FJlQr1cGXSwkt+w0aqt5VfvD0zpnpIzzFuB2etyDnzZMzr8SsRjTxo6PoTQsQJ
FQwOFI/J0jBTeLE7WBCCf/25vFRVw/IlCkby4SFvIDpW2CdfYYLD2lljiHho7qwg
2ur5erfVBK8VR4Mo5psdObggq/VUxi2yQyuBRYbVj2dD0WbTfavsgo7qzR4glhFH
/KapL39V3nEVVjoKmMBV0OsnWUq+EokXtozqHX3Omc2MNldin2NgA6zTw5sIeoKo
PhrhiERwllSGh91cKxtCt9FYIRF6jCHUYzXoZCNcuhMnOJfonFJZ6fyoVuDUJDCM
pFWqpGJ/6KVLi8yjNS9fKw82EApvqoFQu+YtS2/IlMXFuJXPCv094kuRCMEFYRxE
NJ50eQz9/F4SuGcM7Wlg6ESaZB4bWrpQOpdsgnPvZzqy6zGbitCNpTvT4yc1Ui2W
pddzvEmRlEdzBrHCvDqQ4z3pYSA5KDZk1sQOB819JF+gzk5rjhlfXQDJ4yJR9wcu
Csu1aBh565evjw9MIO1G
=32IE
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.5-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs
Pull 9p updates from Eric Van Hensbergen:
"Sorry for the last minute pull request, there's was a change that
didn't get pulled into for-next until two weeks ago and I wanted to
give it some bake time.
Summary:
Rework and error handling fixes, primarily in the fscatch and fd
transports"
* tag 'for-linus-4.5-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
fs/9p: use fscache mutex rather than spinlock
9p: trans_fd, bail out if recv fcall if missing
9p: trans_fd, read rework to use p9_parse_header
net/9p: Add device name details on error
Pull Ceph updates from Sage Weil:
"The two main changes are aio support in CephFS, and a series that
fixes several issues in the authentication key timeout/renewal code.
On top of that are a variety of cleanups and minor bug fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
libceph: remove outdated comment
libceph: kill off ceph_x_ticket_handler::validity
libceph: invalidate AUTH in addition to a service ticket
libceph: fix authorizer invalidation, take 2
libceph: clear messenger auth_retry flag if we fault
libceph: fix ceph_msg_revoke()
libceph: use list_for_each_entry_safe
ceph: use i_size_{read,write} to get/set i_size
ceph: re-send AIO write request when getting -EOLDSNAP error
ceph: Asynchronous IO support
ceph: Avoid to propagate the invalid page point
ceph: fix double page_unlock() in page_mkwrite()
rbd: delete an unnecessary check before rbd_dev_destroy()
libceph: use list_next_entry instead of list_entry_next
ceph: ceph_frag_contains_value can be boolean
ceph: remove unused functions in ceph_frag.h
- Remove usage of ib_query_device and instead store attributes in
ib_device struct
- Move iopoll out of block and into lib, rename to irqpoll, and use
in several places in the rdma stack as our new completion queue
polling library mechanism. Update the other block drivers that
already used iopoll to use the new mechanism too.
- Replace the per-entry GID table locks with a single GID table lock
- IPoIB multicast cleanup
- Cleanups to the IB MR facility
- Add support for 64bit extended IB counters
- Fix for netlink oops while parsing RDMA nl messages
- RoCEv2 support for the core IB code
- mlx4 RoCEv2 support
- mlx5 RoCEv2 support
- Cross Channel support for mlx5
- Timestamp support for mlx5
- Atomic support for mlx5
- Raw QP support for mlx5
- MAINTAINERS update for mlx4/mlx5
- Misc ocrdma, qib, nes, usNIC, cxgb3, cxgb4, mlx4, mlx5 updates
- Add support for remote invalidate to the iSER driver (pushed through the
RDMA tree due to dependencies, acknowledged by nab)
- Update to NFSoRDMA (pushed through the RDMA tree due to dependencies,
acknowledged by Bruce)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJWoSygAAoJELgmozMOVy/dDjsP/2vbTda2MvQfkfkGEZBQdJSg
095RN0gQgCJdg78lAl8yuaK8r4VN/7uefpDtFdudH1I/Pei7X0wxN9R1UzFNG4KR
AD53lz92IVPs15328SbPR2kvNWISR9aBFQo3rlElq3Grqlp0EMn2Ou1vtu87rekF
aMllxr8Nl0uZhP+eWusOsYpJUUtwirLgRnrAyfqo2UxZh/TMIroT0TCx1KXjVcAg
dhDARiZAdu3OgSc6OsWqmH+DELEq6dFVA5F+DDBGAb8bFZqlJc7cuMHWInwNsNXT
so4bnEQ835alTbsdYtqs5DUNS8heJTAJP4Uz0ehkTh/uNCcvnKeUTw1c2P/lXI1k
7s33gMM+0FXj0swMBw0kKwAF2d9Hhus9UAN7NwjBuOyHcjGRd5q7SAnfWkvKx000
s9jVW19slb2I38gB58nhjOh8s+vXUArgxnV1+kTia1+bJSR5swvVoWRicRXdF0vh
TvLX/BjbSIU73g1TnnLNYoBTV3ybFKQ6bVdQW7fzSTDs54dsI1vvdHXi3bYZCpnL
HVwQTZRfEzkvb0AdKbcvf8p/TlaAHem3ODqtO1eHvO4if1QJBSn+SptTEeJVYYdK
n4B3l/dMoBH4JXJUmEHB9jwAvYOpv/YLAFIvdL7NFwbqGNsC3nfXFcmkVORB1W3B
KEMcM2we4bz+uyKMjEAD
=5oO7
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull rdma updates from Doug Ledford:
"Initial roundup of 4.5 merge window patches
- Remove usage of ib_query_device and instead store attributes in
ib_device struct
- Move iopoll out of block and into lib, rename to irqpoll, and use
in several places in the rdma stack as our new completion queue
polling library mechanism. Update the other block drivers that
already used iopoll to use the new mechanism too.
- Replace the per-entry GID table locks with a single GID table lock
- IPoIB multicast cleanup
- Cleanups to the IB MR facility
- Add support for 64bit extended IB counters
- Fix for netlink oops while parsing RDMA nl messages
- RoCEv2 support for the core IB code
- mlx4 RoCEv2 support
- mlx5 RoCEv2 support
- Cross Channel support for mlx5
- Timestamp support for mlx5
- Atomic support for mlx5
- Raw QP support for mlx5
- MAINTAINERS update for mlx4/mlx5
- Misc ocrdma, qib, nes, usNIC, cxgb3, cxgb4, mlx4, mlx5 updates
- Add support for remote invalidate to the iSER driver (pushed
through the RDMA tree due to dependencies, acknowledged by nab)
- Update to NFSoRDMA (pushed through the RDMA tree due to
dependencies, acknowledged by Bruce)"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (169 commits)
IB/mlx5: Unify CQ create flags check
IB/mlx5: Expose Raw Packet QP to user space consumers
{IB, net}/mlx5: Move the modify QP operation table to mlx5_ib
IB/mlx5: Support setting Ethernet priority for Raw Packet QPs
IB/mlx5: Add Raw Packet QP query functionality
IB/mlx5: Add create and destroy functionality for Raw Packet QP
IB/mlx5: Refactor mlx5_ib_qp to accommodate other QP types
IB/mlx5: Allocate a Transport Domain for each ucontext
net/mlx5_core: Warn on unsupported events of QP/RQ/SQ
net/mlx5_core: Add RQ and SQ event handling
net/mlx5_core: Export transport objects
IB/mlx5: Expose CQE version to user-space
IB/mlx5: Add CQE version 1 support to user QPs and SRQs
IB/mlx5: Fix data validation in mlx5_ib_alloc_ucontext
IB/sa: Fix netlink local service GFP crash
IB/srpt: Remove redundant wc array
IB/qib: Improve ipoib UD performance
IB/mlx4: Advertise RoCE v2 support
IB/mlx4: Create and use another QP1 for RoCEv2
IB/mlx4: Enable send of RoCE QP1 packets with IP/UDP headers
...
Pull final vfs updates from Al Viro:
- The ->i_mutex wrappers (with small prereq in lustre)
- a fix for too early freeing of symlink bodies on shmem (they need to
be RCU-delayed) (-stable fodder)
- followup to dedupe stuff merged this cycle
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
vfs: abort dedupe loop if fatal signals are pending
make sure that freeing shmem fast symlinks is RCU-delayed
wrappers for ->i_mutex access
lustre: remove unused declaration
This patch fixes incorrect handling of the 6lowpan packets that contain
uncompressed IPv6 header.
RFC4944 specifies a special dispatch for 6lowpan to carry uncompressed
IPv6 header. This dispatch (1 byte long) has to be removed during
reception and skb data pointer has to be moved. To correctly point in
the beginning of the IPv6 header the dispatch byte has to be pulled off
before packet can be processed by netif_rx_in().
Test scenario: IPv6 packets are not correctly interpreted by the network
layer when IPv6 header is not compressed (e.g. ICMPv6 Echo Reply is not
propagated correctly to the ICMPv6 layer because the extra byte will make
the header look corrupted).
Similar approach is done for IEEE 802.15.4.
Signed-off-by: Lukasz Duda <lukasz.duda@nordicsemi.no>
Signed-off-by: Glenn Ruben Bakke <glenn.ruben.bakke@nordicsemi.no>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Cc: stable@vger.kernel.org # 4.4+
The fixes provided in this patch assigns a valid net_device structure to
skb before dispatching it for further processing.
Scenario #1:
============
Bluetooth 6lowpan receives an uncompressed IPv6 header, and dispatches it
to netif. The following error occurs:
Null pointer dereference error #1 crash log:
[ 845.854013] BUG: unable to handle kernel NULL pointer dereference at
0000000000000048
[ 845.855785] IP: [<ffffffff816e3d36>] enqueue_to_backlog+0x56/0x240
...
[ 845.909459] Call Trace:
[ 845.911678] [<ffffffff816e3f64>] netif_rx_internal+0x44/0xf0
The first modification fixes the NULL pointer dereference error by
assigning dev to the local_skb in order to set a valid net_device before
processing the skb by netif_rx_ni().
Scenario #2:
============
Bluetooth 6lowpan receives an UDP compressed message which needs further
decompression by nhc_udp. The following error occurs:
Null pointer dereference error #2 crash log:
[ 63.295149] BUG: unable to handle kernel NULL pointer dereference at
0000000000000840
[ 63.295931] IP: [<ffffffffc0559540>] udp_uncompress+0x320/0x626
[nhc_udp]
The second modification fixes the NULL pointer dereference error by
assigning dev to the local_skb in the case of a udp compressed packet.
The 6lowpan udp_uncompress function expects that the net_device is set in
the skb when checking lltype.
Signed-off-by: Glenn Ruben Bakke <glenn.ruben.bakke@nordicsemi.no>
Signed-off-by: Lukasz Duda <lukasz.duda@nordicsemi.no>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Cc: stable@vger.kernel.org # 4.4+
There are many locations that do
if (memory_was_allocated_by_vmalloc)
vfree(ptr);
else
kfree(ptr);
but kvfree() can handle both kmalloc()ed memory and vmalloc()ed memory
using is_vmalloc_addr(). Unless callers have special reasons, we can
replace this branch with kvfree(). Please check and reply if you found
problems.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Jan Kara <jack@suse.com>
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Acked-by: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Acked-by: David Rientjes <rientjes@google.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: Boris Petkov <bp@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).
Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Neal reported crashes with this stack trace :
RIP: 0010:[<ffffffff8c57231b>] tcp_v4_send_ack+0x41/0x20f
...
CR2: 0000000000000018 CR3: 000000044005c000 CR4: 00000000001427e0
...
[<ffffffff8c57258e>] tcp_v4_reqsk_send_ack+0xa5/0xb4
[<ffffffff8c1a7caa>] tcp_check_req+0x2ea/0x3e0
[<ffffffff8c19e420>] tcp_rcv_state_process+0x850/0x2500
[<ffffffff8c1a6d21>] tcp_v4_do_rcv+0x141/0x330
[<ffffffff8c56cdb2>] sk_backlog_rcv+0x21/0x30
[<ffffffff8c098bbd>] tcp_recvmsg+0x75d/0xf90
[<ffffffff8c0a8700>] inet_recvmsg+0x80/0xa0
[<ffffffff8c17623e>] sock_aio_read+0xee/0x110
[<ffffffff8c066fcf>] do_sync_read+0x6f/0xa0
[<ffffffff8c0673a1>] SyS_read+0x1e1/0x290
[<ffffffff8c5ca262>] system_call_fastpath+0x16/0x1b
The problem here is the skb we provide to tcp_v4_send_ack() had to
be parked in the backlog of a new TCP fastopen child because this child
was owned by the user at the time an out of window packet arrived.
Before queuing a packet, TCP has to set skb->dev to NULL as the device
could disappear before packet is removed from the queue.
Fix this issue by using the net pointer provided by the socket (being a
timewait or a request socket).
IPv6 is immune to the bug : tcp_v6_send_response() already gets the net
pointer from the socket if provided.
Fixes: 168a8f5805 ("tcp: TCP Fast Open Server - main code path")
Reported-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jerry Chu <hkchu@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
MClientMount{,Ack} are long gone. The receipt of bare monmap doesn't
actually indicate a mount success as we are yet to authenticate at that
point in time.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
With it gone, no need to preserve ceph_timespec in process_one_ticket()
either.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
If we fault due to authentication, we invalidate the service ticket we
have and request a new one - the idea being that if a service rejected
our authorizer, it must have expired, despite mon_client's attempts at
periodic renewal. (The other possibility is that our ticket is too new
and the service hasn't gotten it yet, in which case invalidating isn't
necessary but doesn't hurt.)
Invalidating just the service ticket is not enough, though. If we
assume a failure on mon_client's part to renew a service ticket, we
have to assume the same for the AUTH ticket. If our AUTH ticket is
bad, we won't get any service tickets no matter how hard we try, so
invalidate AUTH ticket along with the service ticket.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Back in 2013, commit 4b8e8b5d78 ("libceph: fix authorizer
invalidation") tried to fix authorizer invalidation issues by clearing
validity field. However, nothing ever consults this field, so it
doesn't force us to request any new secrets in any way and therefore we
never get out of the exponential backoff mode:
[ 129.973812] libceph: osd2 192.168.122.1:6810 connect authorization failure
[ 130.706785] libceph: osd2 192.168.122.1:6810 connect authorization failure
[ 131.710088] libceph: osd2 192.168.122.1:6810 connect authorization failure
[ 133.708321] libceph: osd2 192.168.122.1:6810 connect authorization failure
[ 137.706598] libceph: osd2 192.168.122.1:6810 connect authorization failure
...
AFAICT this was the case at the time 4b8e8b5d78 was merged, too.
Using timespec solely as a bool isn't nice, so introduce a new have_key
flag, specifically for this purpose.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Commit 20e55c4cc7 ("libceph: clear messenger auth_retry flag when we
authenticate") got us only half way there. We clear the flag if the
second attempt succeeds, but it also needs to be cleared if that
attempt fails, to allow for the exponential backoff to kick in.
Otherwise, if ->should_authenticate() thinks our keys are valid, we
will busy loop, incrementing auth_retry to no avail:
process_connect ffff880079a63830 got BADAUTHORIZER attempt 1
process_connect ffff880079a63830 got BADAUTHORIZER attempt 2
process_connect ffff880079a63830 got BADAUTHORIZER attempt 3
process_connect ffff880079a63830 got BADAUTHORIZER attempt 4
process_connect ffff880079a63830 got BADAUTHORIZER attempt 5
...
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
There are a number of problems with revoking a "was sending" message:
(1) We never make any attempt to revoke data - only kvecs contibute to
con->out_skip. However, once the header (envelope) is written to the
socket, our peer learns data_len and sets itself to expect at least
data_len bytes to follow front or front+middle. If ceph_msg_revoke()
is called while the messenger is sending message's data portion,
anything we send after that call is counted by the OSD towards the now
revoked message's data portion. The effects vary, the most common one
is the eventual hang - higher layers get stuck waiting for the reply to
the message that was sent out after ceph_msg_revoke() returned and
treated by the OSD as a bunch of data bytes. This is what Matt ran
into.
(2) Flat out zeroing con->out_kvec_bytes worth of bytes to handle kvecs
is wrong. If ceph_msg_revoke() is called before the tag is sent out or
while the messenger is sending the header, we will get a connection
reset, either due to a bad tag (0 is not a valid tag) or a bad header
CRC, which kind of defeats the purpose of revoke. Currently the kernel
client refuses to work with header CRCs disabled, but that will likely
change in the future, making this even worse.
(3) con->out_skip is not reset on connection reset, leading to one or
more spurious connection resets if we happen to get a real one between
con->out_skip is set in ceph_msg_revoke() and before it's cleared in
write_partial_skip().
Fixing (1) and (3) is trivial. The idea behind fixing (2) is to never
zero the tag or the header, i.e. send out tag+header regardless of when
ceph_msg_revoke() is called. That way the header is always correct, no
unnecessary resets are induced and revoke stands ready for disabled
CRCs. Since ceph_msg_revoke() rips out con->out_msg, introduce a new
"message out temp" and copy the header into it before sending.
Cc: stable@vger.kernel.org # 4.0+
Reported-by: Matt Conner <matt.conner@keepertech.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Tested-by: Matt Conner <matt.conner@keepertech.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Use list_for_each_entry_safe() instead of list_for_each_safe() to
simplify the code.
Signed-off-by: Geliang Tang <geliangtang@163.com>
[idryomov@gmail.com: nuke call to list_splice_init() as well]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
list_next_entry has been defined in list.h, so I replace list_entry_next
with it.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Lorenzo reported that we could not properly find v4mapped sockets
in inet_diag_find_one_icsk(). This patch fixes the issue.
Reported-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
GRO is currently not aware of tunnel metadata generated by lightweight
tunnels and stored in the dst. This leads to two possible problems:
* Incorrectly merging two frames that have different metadata.
* Leaking of allocated metadata from merged frames.
This avoids those problems by comparing the tunnel information before
merging, similar to how we handle other metadata (such as vlan tags),
and releasing any state when we are done.
Reported-by: John <john.phillips5@hpe.com>
Fixes: 2e15ea39 ("ip_gre: Add support to collect tunnel metadata.")
Signed-off-by: Jesse Gross <jesse@kernel.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_memcontrol.c only contains legacy memory.tcp.kmem.* file definitions
and mem_cgroup->tcp_mem init/destroy stuff. This doesn't belong to
network subsys. Let's move it to memcontrol.c. This also allows us to
reuse generic code for handling legacy memcg files.
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: "David S. Miller" <davem@davemloft.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Let the user know that CONFIG_MEMCG_KMEM does not apply to the cgroup2
interface. This also makes legacy-only code sections stand out better.
[arnd@arndb.de: mm: memcontrol: only manage socket pressure for CONFIG_INET]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This series adds accounting of the historical "kmem" memory consumers to
the cgroup2 memory controller.
These consumers include the dentry cache, the inode cache, kernel stack
pages, and a few others that are pointed out in patch 7/8. The
footprint of these consumers is directly tied to userspace activity in
common workloads, and so they have to be part of the minimally viable
configuration in order to present a complete feature to our users.
The cgroup2 interface of the memory controller is far from complete, but
this series, along with the socket memory accounting series, provides
the final semantic changes for the existing memory knobs in the cgroup2
interface, which is scheduled for initial release in the next merge
window.
This patch (of 8):
Remove unused css argument frmo memcg_init_kmem()
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With upcoming CONFIG_UBSAN the following BUILD_BUG_ON in
net/mac80211/debugfs.c starts to trigger:
BUILD_BUG_ON(hw_flag_names[NUM_IEEE80211_HW_FLAGS] != (void *)0x1);
It seems, that compiler instrumentation causes some code
deoptimizations. Because of that GCC is not being able to resolve
condition in BUILD_BUG_ON() at compile time.
We could make size of hw_flag_names array unspecified and replace the
condition in BUILD_BUG_ON() with following:
ARRAY_SIZE(hw_flag_names) != NUM_IEEE80211_HW_FLAGS
That will have the same effect as before (adding new flag without
updating array will trigger build failure) except it doesn't fail with
CONFIG_UBSAN. As a bonus this patch slightly decreases size of
hw_flag_names array.
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When we need to lock all buckets in the connection hashtable we'd attempt to
lock 1024 spinlocks, which is way more preemption levels than supported by
the kernel. Furthermore, this behavior was hidden by checking if lockdep is
enabled, and if it was - use only 8 buckets(!).
Fix this by using a global lock and synchronize all buckets on it when we
need to lock them all. This is pretty heavyweight, but is only done when we
need to resize the hashtable, and that doesn't happen often enough (or at all).
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We now alwasy have a per-PD local_dma_lkey available. Make use of that
fact in svc_rdma and stop registering our own MR.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Acked-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
To support the server-side of an NFSv4.1 backchannel on RDMA
connections, add a transport class that enables backward
direction messages on an existing forward channel connection.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Bruce Fields <bfields@fieldses.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Extra resources for handling backchannel requests have to be
pre-allocated when a transport instance is created. Set up
additional fields in svcxprt_rdma to track these resources.
The max_requests fields are elements of the RPC-over-RDMA
protocol, so they should be u32. To ensure that unsigned
arithmetic is used everywhere, some other fields in the
svcxprt_rdma struct are updated.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Bruce Fields <bfields@fieldses.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Pre-requisite to use map_xdr in the backchannel code.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Bruce Fields <bfields@fieldses.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Clean up.
These functions can otherwise fail, so check for page allocation
failures too.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Bruce Fields <bfields@fieldses.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
svc_rdma_post_recv() allocates pages for receive buffers on-demand.
It uses GFP_KERNEL so the allocator tries hard, and may sleep. But
I'm about to add a call to svc_rdma_post_recv() from a function
that may not sleep.
Since all svc_rdma_post_recv() call sites can tolerate its failure,
allow it to fail if the page allocator returns nothing. Longer term,
receive buffers, being a finite resource per-connection, should be
pre-allocated and re-used.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Bruce Fields <bfields@fieldses.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
To ensure this allocation cannot fail and will not sleep,
pre-allocate the req_map structures per-connection.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Bruce Fields <bfields@fieldses.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When the maximum payload size of NFS READ and WRITE was increased
by commit cc9a903d91 ("svcrdma: Change maximum server payload back
to RPCSVC_MAXPAYLOAD"), the size of struct svc_rdma_op_ctxt
increased to over 6KB (on x86_64). That makes allocating one of
these from a kmem_cache more likely to fail in situations when
system memory is exhausted.
Since I'm about to add a caller where this allocation must always
work _and_ it cannot sleep, pre-allocate ctxts for each connection.
Another motivation for this change is that NFSv4.x servers are
required by specification not to drop NFS requests. Pre-allocating
memory resources reduces the likelihood of a drop.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Bruce Fields <bfields@fieldses.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Be sure the completed ctxt is put in every path.
The xprt enqueue can take a while, so put the completed ctxt back
in circulation _before_ enqueuing the xprt.
Remove/disable debugging.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Bruce Fields <bfields@fieldses.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
kzalloc is used here, so setting the atomic fields to zero is
unnecessary. sc_ord is set again in handle_connect_req. The other
fields are re-initialized in svc_rdma_accept().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Bruce Fields <bfields@fieldses.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Marc Dionne discovered a NULL pointer dereference when setting
SO_REUSEPORT on a socket after it is bound.
This patch removes the assumption that at least one socket in the
reuseport group is bound with the SO_REUSEPORT option before other
bind calls occur.
Fixes: e32ea7e747 ("soreuseport: fast reuseport UDP socket selection")
Reported-by: Marc Dionne <marc.c.dionne@gmail.com>
Signed-off-by: Craig Gallek <kraig@google.com>
Tested-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using a combination of connected and un-connected sockets, Dmitry
was able to trigger soft lockups with his fuzzer.
The problem is that sockets in the SO_REUSEPORT array might have
different scores.
Right after sk2=socket(), setsockopt(sk2,...,SO_REUSEPORT, on) and
bind(sk2, ...), but _before_ the connect(sk2) is done, sk2 is added into
the soreuseport array, with a score which is smaller than the score of
first socket sk1 found in hash table (I am speaking of the regular UDP
hash table), if sk1 had the connect() done, giving a +8 to its score.
hash bucket [X] -> sk1 -> sk2 -> NULL
sk1 score = 14 (because it did a connect())
sk2 score = 6
SO_REUSEPORT fast selection is an optimization. If it turns out the
score of the selected socket does not match score of first socket, just
fallback to old SO_REUSEPORT logic instead of trying to be too smart.
Normal SO_REUSEPORT users do not mix different kind of sockets, as this
mechanism is used for load balance traffic.
Fixes: e32ea7e747 ("soreuseport: fast reuseport UDP socket selection")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Craig Gallek <kraigatgoog@gmail.com>
Acked-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It was seen that defective configurations of openvswitch could overwrite
the STACK_END_MAGIC and cause a hard crash of the kernel because of too
many recursions within ovs.
This problem arises due to the high stack usage of openvswitch. The rest
of the kernel is fine with the current limit of 10 (RECURSION_LIMIT).
We use the already existing recursion counter in ovs_execute_actions to
implement an upper bound of 5 recursions.
Cc: Pravin Shelar <pshelar@ovn.org>
Cc: Simon Horman <simon.horman@netronome.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Unregister the chain type and return error, otherwise this leaks the
subscription to the netdevice notifier call chain.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
In case MSS option is added in TCP options, skb length increases by 4.
IPv6 needs to update skb->csum if skb has CHECKSUM_COMPLETE,
otherwise kernel complains loudly in netdev_rx_csum_fault() with a
stack dump.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Re-establish the previous behavior and avoid hashing temporary asocs by
checking t->asoc->temp in sctp_(un)hash_transport. Also, remove the
check of t->asoc->temp in __sctp_lookup_association, since they are
never hashed now.
Fixes: 4f00878126 ("sctp: apply rhashtable api to send/recv path")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Reported-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is not allowed to free the memory of an object which is part of a list
which is protected by rcu-read-side-critical sections without making sure
that no other context is accessing the object anymore. This usually happens
by removing the references to this object and then waiting until the rcu
grace period is over and no one (allowedly) accesses it anymore.
But the _now functions ignore this completely. They free the object
directly even when a different context still tries to access it. This has
to be avoided and thus these functions must be removed and all functions
have to use batadv_orig_node_free_ref.
Fixes: 72822225bd ("batman-adv: Fix rcu_barrier() miss due to double call_rcu() in TT code")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
It is not allowed to free the memory of an object which is part of a list
which is protected by rcu-read-side-critical sections without making sure
that no other context is accessing the object anymore. This usually happens
by removing the references to this object and then waiting until the rcu
grace period is over and no one (allowedly) accesses it anymore.
But the _now functions ignore this completely. They free the object
directly even when a different context still tries to access it. This has
to be avoided and thus these functions must be removed and all functions
have to use batadv_hardif_free_ref.
Fixes: 89652331c0 ("batman-adv: split tq information in neigh_node struct")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
It is not allowed to free the memory of an object which is part of a list
which is protected by rcu-read-side-critical sections without making sure
that no other context is accessing the object anymore. This usually happens
by removing the references to this object and then waiting until the rcu
grace period is over and no one (allowedly) accesses it anymore.
But the _now functions ignore this completely. They free the object
directly even when a different context still tries to access it. This has
to be avoided and thus these functions must be removed and all functions
have to use batadv_neigh_ifinfo_free_ref.
Fixes: 89652331c0 ("batman-adv: split tq information in neigh_node struct")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
It is not allowed to free the memory of an object which is part of a list
which is protected by rcu-read-side-critical sections without making sure
that no other context is accessing the object anymore. This usually happens
by removing the references to this object and then waiting until the rcu
grace period is over and no one (allowedly) accesses it anymore.
But the _now functions ignore this completely. They free the object
directly even when a different context still tries to access it. This has
to be avoided and thus these functions must be removed and all functions
have to use batadv_hardif_neigh_free_ref.
Fixes: cef63419f7 ("batman-adv: add list of unique single hop neighbors per hard-interface")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
It is not allowed to free the memory of an object which is part of a list
which is protected by rcu-read-side-critical sections without making sure
that no other context is accessing the object anymore. This usually happens
by removing the references to this object and then waiting until the rcu
grace period is over and no one (allowedly) accesses it anymore.
But the _now functions ignore this completely. They free the object
directly even when a different context still tries to access it. This has
to be avoided and thus these functions must be removed and all functions
have to use batadv_neigh_node_free_ref.
Fixes: 89652331c0 ("batman-adv: split tq information in neigh_node struct")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
It is not allowed to free the memory of an object which is part of a list
which is protected by rcu-read-side-critical sections without making sure
that no other context is accessing the object anymore. This usually happens
by removing the references to this object and then waiting until the rcu
grace period is over and no one (allowedly) accesses it anymore.
But the _now functions ignore this completely. They free the object
directly even when a different context still tries to access it. This has
to be avoided and thus these functions must be removed and all functions
have to use batadv_orig_ifinfo_free_ref.
Fixes: 7351a4822d ("batman-adv: split out router from orig_node")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The batadv_nc_node_free_ref function uses call_rcu to delay the free of the
batadv_nc_node object until no (already started) rcu_read_lock is enabled
anymore. This makes sure that no context is still trying to access the
object which should be removed. But batadv_nc_node also contains a
reference to orig_node which must be removed.
The reference drop of orig_node was done in the call_rcu function
batadv_nc_node_free_rcu but should actually be done in the
batadv_nc_node_release function to avoid nested call_rcus. This is
important because rcu_barrier (e.g. batadv_softif_free or batadv_exit) will
not detect the inner call_rcu as relevant for its execution. Otherwise this
barrier will most likely be inserted in the queue before the callback of
the first call_rcu was executed. The caller of rcu_barrier will therefore
continue to run before the inner call_rcu callback finished.
Fixes: d56b1705e2 ("batman-adv: network coding - detect coding nodes and remove these after timeout")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The batadv_claim_free_ref function uses call_rcu to delay the free of the
batadv_bla_claim object until no (already started) rcu_read_lock is enabled
anymore. This makes sure that no context is still trying to access the
object which should be removed. But batadv_bla_claim also contains a
reference to backbone_gw which must be removed.
The reference drop of backbone_gw was done in the call_rcu function
batadv_claim_free_rcu but should actually be done in the
batadv_claim_release function to avoid nested call_rcus. This is important
because rcu_barrier (e.g. batadv_softif_free or batadv_exit) will not
detect the inner call_rcu as relevant for its execution. Otherwise this
barrier will most likely be inserted in the queue before the callback of
the first call_rcu was executed. The caller of rcu_barrier will therefore
continue to run before the inner call_rcu callback finished.
Fixes: 23721387c4 ("batman-adv: add basic bridge loop avoidance code")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Acked-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Pull networking fixes from David Miller:
"A quick set of bug fixes after there initial networking merge:
1) Netlink multicast group storage allocator only was tested with
nr_groups equal to 1, make it work for other values too. From
Matti Vaittinen.
2) Check build_skb() return value in macb and hip04_eth drivers, from
Weidong Wang.
3) Don't leak x25_asy on x25_asy_open() failure.
4) More DMA map/unmap fixes in 3c59x from Neil Horman.
5) Don't clobber IP skb control block during GSO segmentation, from
Konstantin Khlebnikov.
6) ECN helpers for ipv6 don't fixup the checksum, from Eric Dumazet.
7) Fix SKB segment utilization estimation in xen-netback, from David
Vrabel.
8) Fix lockdep splat in bridge addrlist handling, from Nikolay
Aleksandrov"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (26 commits)
bgmac: Fix reversed test of build_skb() return value.
bridge: fix lockdep addr_list_lock false positive splat
net: smsc: Add support h8300
xen-netback: free queues after freeing the net device
xen-netback: delete NAPI instance when queue fails to initialize
xen-netback: use skb to determine number of required guest Rx requests
net: sctp: Move sequence start handling into sctp_transport_get_idx()
ipv6: update skb->csum when CE mark is propagated
net: phy: turn carrier off on phy attach
net: macb: clear interrupts when disabling them
sctp: support to lookup with ep+paddr in transport rhashtable
net: hns: fixes no syscon error when init mdio
dts: hisi: fixes no syscon fault when init mdio
net: preserve IP control block during GSO segmentation
fsl/fman: Delete one function call "put_device" in dtsec_config()
hip04_eth: fix missing error handle for build_skb failed
3c59x: fix another page map/single unmap imbalance
3c59x: balance page maps and unmaps
x25_asy: Free x25_asy on x25_asy_open() failure.
mlxsw: fix SWITCHDEV_OBJ_ID_PORT_MDB
...
kerberized NFSv4.1 mounts, and Scott Mayhew's work addressing ACK storms
that can affect some high-availability NFS setups.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJWmVZHAAoJECebzXlCjuG+Cn4P/3zwSuwIeLuv9b89vzFXU8Xv
AbBWHk7WkFXJQGTKdclYjwxqU+l15D5lYHCae1cuD5eXdviraxXf7EcnqrMhJUc0
oRiQx0rAwlEkKUAVrxGCFP7WKjlX3TsEBV6wPpTCP3BEMzTPDEeaDek7+hICFkLF
9a/miEXAopm3jxP7WNmXEkdKpFEHklDDwtv6Av7iIKCW6+7XCGp7Prqo4NQKAKp6
hjE+nvt2HiD06MZhUeyb14cn6547smzt1rbSfK4IB4yHMwLyaoqPrT7ekDh9LDrE
uGgo+Y2PBbEcTAE6tJ88EjZx7cMCFPn0te+eKPgnpPy9RqrNqSxj5N/b7JAecKgW
a/09BtvFOoYs8fO5ovqeRY5THrE3IRyMIwn4gt7fCYaaAbG3dwGKG1uklTAVXtb1
95DkhOb8He2VhOCCoJ6ybbTnRfjB6b/cv7ZuEGlQfvTE+BtU3Jj9I76ruWFhb3zd
HM1dRI20UfwL/0Y8yYhZ+/rje9SSk2jOmVgSCqY9hnCmEqOqOdUU0X/uumIWaBym
zfGx9GIM0jQuYVdLQRXtJJbUgJUUN3MilGyU5wx7YoXip5guqTalXqAdQpShzXeW
s1ATYh/mY5X9ig51KogkkVlm9bXDQAzJBAnDRpLtJZqy5Cgkrj9RSu0ExN1Rmlhw
LKQCddBQxUSWJ+XWycgK
=G7V3
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.5' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"Smaller bugfixes and cleanup, including a fix for a failures of
kerberized NFSv4.1 mounts, and Scott Mayhew's work addressing ACK
storms that can affect some high-availability NFS setups"
* tag 'nfsd-4.5' of git://linux-nfs.org/~bfields/linux:
nfsd: add new io class tracepoint
nfsd: give up on CB_LAYOUTRECALLs after two lease periods
nfsd: Fix nfsd leaks sunrpc module references
lockd: constify nlmsvc_binding structure
lockd: use to_delayed_work
nfsd: use to_delayed_work
Revert "svcrdma: Do not send XDR roundup bytes for a write chunk"
lockd: Register callbacks on the inetaddr_chain and inet6addr_chain
nfsd: Register callbacks on the inetaddr_chain and inet6addr_chain
sunrpc: Add a function to close temporary transports immediately
nfsd: don't base cl_cb_status on stale information
nfsd4: fix gss-proxy 4.1 mounts for some AD principals
nfsd: fix unlikely NULL deref in mach_creds_match
nfsd: minor consolidation of mach_cred handling code
nfsd: helper for dup of possibly NULL string
svcrpc: move some initialization to common code
nfsd: fix a warning message
nfsd: constify nfsd4_callback_ops structure
nfsd: recover: constify nfsd4_client_tracking_ops structures
svcrdma: Do not send XDR roundup bytes for a write chunk
net/sctp/proc.c: In function ‘sctp_transport_get_idx’:
net/sctp/proc.c:313: warning: ‘obj’ may be used uninitialized in this function
This is currently a false positive, as all callers check for a zero
offset first, and handle this case in the exact same way.
Move the check and handling into sctp_transport_get_idx() to kill the
compiler warning, and avoid future bugs.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a tunnel decapsulates the outer header, it has to comply
with RFC 6080 and eventually propagate CE mark into inner header.
It turns out IP6_ECN_set_ce() does not correctly update skb->csum
for CHECKSUM_COMPLETE packets, triggering infamous "hw csum failure"
messages and stack traces.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge first patch-bomb from Andrew Morton:
- A few hotfixes which missed 4.4 becasue I was asleep. cc'ed to
-stable
- A few misc fixes
- OCFS2 updates
- Part of MM. Including pretty large changes to page-flags handling
and to thp management which have been buffered up for 2-3 cycles now.
I have a lot of MM material this time.
[ It turns out the THP part wasn't quite ready, so that got dropped from
this series - Linus ]
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (117 commits)
zsmalloc: reorganize struct size_class to pack 4 bytes hole
mm/zbud.c: use list_last_entry() instead of list_tail_entry()
zram/zcomp: do not zero out zcomp private pages
zram: pass gfp from zcomp frontend to backend
zram: try vmalloc() after kmalloc()
zram/zcomp: use GFP_NOIO to allocate streams
mm: add tracepoint for scanning pages
drivers/base/memory.c: fix kernel warning during memory hotplug on ppc64
mm/page_isolation: use macro to judge the alignment
mm: fix noisy sparse warning in LIBCFS_ALLOC_PRE()
mm: rework virtual memory accounting
include/linux/memblock.h: fix ordering of 'flags' argument in comments
mm: move lru_to_page to mm_inline.h
Documentation/filesystems: describe the shared memory usage/accounting
memory-hotplug: don't BUG() in register_memory_resource()
hugetlb: make mm and fs code explicitly non-modular
mm/swapfile.c: use list_for_each_entry_safe in free_swap_count_continuations
mm: /proc/pid/clear_refs: no need to clear VM_SOFTDIRTY in clear_soft_dirty_pmd()
mm: make sure isolate_lru_page() is never called for tail page
vmstat: make vmstat_updater deferrable again and shut down on idle
...
Now, when we sendmsg, we translate the ep to laddr by selecting the
first element of the list, and then do a lookup for a transport.
But sctp_hash_cmp() will compare it against asoc addr_list, which may
be a subset of ep addr_list, meaning that this chosen laddr may not be
there, and thus making it impossible to find the transport.
So we fix it by using ep + paddr to lookup transports in hashtable. In
sctp_hash_cmp, if .ep is set, we will check if this ep == asoc->ep,
or we will do the laddr check.
Fixes: d6c0256a60 ("sctp: add the rhashtable apis for sctp global transport hashtable")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Reported-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Skb_gso_segment() uses skb control block during segmentation.
This patch adds 32-bytes room for previous control block which
will be copied into all resulting segments.
This patch fixes kernel crash during fragmenting forwarded packets.
Fragmentation requires valid IP CB in skb for clearing ip options.
Also patch removes custom save/restore in ovs code, now it's redundant.
Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Link: http://lkml.kernel.org/r/CALYGNiP-0MZ-FExV2HutTvE9U-QQtkKSoE--KN=JQE5STYsjAA@mail.gmail.com
Signed-off-by: David S. Miller <davem@davemloft.net>
Highlights include:
Stable fixes:
- Fix a regression in the SunRPC socket polling code
- Fix the attribute cache revalidation code
- Fix race in __update_open_stateid()
- Fix an lo->plh_block_lgets imbalance in layoutreturn
- Fix an Oopsable typo in ff_mirror_match_fh()
Features:
- pNFS layout recall performance improvements.
- pNFS/flexfiles: Support server-supplied layoutstats sampling period
Bugfixes + cleanups:
- NFSv4: Don't perform cached access checks before we've OPENed the file
- Fix starvation issues with background flushes
- Reclaim writes should be flushed as unstable writes if there are already
entries in the commit lists
- Various bugfixes from Chuck to fix NFS/RDMA send queue ordering problems
- Ensure that we propagate fatal layoutget errors back to the application
- Fixes for sundry flexfiles layoutstats bugs
- Fix files/flexfiles to not cache invalidated layouts in the DS commit buckets
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJWmAvPAAoJEGcL54qWCgDysScP/jnaRdQO+VTXTMtKcPiR7ujd
LBcx3lrI1jsLYjlrBguTh9ROGt0maX1TAu/rsLuo4j/0wQC6dsQw+vFjfkI4CzSn
4htK0f4hNjA29iOAjMaziAzsQW9eJ97Nn0HU4XD43OeK7PGh5e93Xk26Va4cO18P
PqSam+FJoXpUSEWOzNzDwjTZTt4Voo3yJDqDTa8dU0x8c1qBktslo2n0WCntBxMn
IbEDdBEIaUZmYCNhu2Sq1SLwYPatLg1Orfq3quMFJjzEeUbd0lVQno4C1fjjuACt
DNXUgZDH0uR3U3naMXrdkqQ02GHEY9G0CO4a6q0Evsbm15wQuY6GMioxR0+ll7rX
TeZGBUMq3cRFDR+/m1gTBZFjo7BUPE9LKXUazINVaoaJMYqpFunhI8V31ghx8/z8
0kiracIEPXaIGmQ5S151+IDETpw9nntipCzdnduVefB2EAfXPeDzF7uFQPm+mvgx
R4YuAFrlbcIZ/lZRYy5z6Fj3KLnytSOjzgXC5daxPQVt92QumQTQ6HC5jL25zVKb
KOeSWHrFel7M+miL96ERvcS2vi+IDzPH9YbE9YTWbLW9LMBOYQKsukf1aaV9CwC4
9OiNMYGQIGtmjbzIOlRcpVTAsXj+P6UVuwCfGTpQOm1Qa1fDbU+xSLkc62gg3WRa
3E/3RMr1iXD8u1Kiz8hb
=RBmi
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.5-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Stable fixes:
- Fix a regression in the SunRPC socket polling code
- Fix the attribute cache revalidation code
- Fix race in __update_open_stateid()
- Fix an lo->plh_block_lgets imbalance in layoutreturn
- Fix an Oopsable typo in ff_mirror_match_fh()
Features:
- pNFS layout recall performance improvements.
- pNFS/flexfiles: Support server-supplied layoutstats sampling period
Bugfixes + cleanups:
- NFSv4: Don't perform cached access checks before we've OPENed the
file
- Fix starvation issues with background flushes
- Reclaim writes should be flushed as unstable writes if there are
already entries in the commit lists
- Various bugfixes from Chuck to fix NFS/RDMA send queue ordering
problems
- Ensure that we propagate fatal layoutget errors back to the
application
- Fixes for sundry flexfiles layoutstats bugs
- Fix files/flexfiles to not cache invalidated layouts in the DS
commit buckets"
* tag 'nfs-for-4.5-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (68 commits)
NFS: Fix a compile warning about unused variable in nfs_generic_pg_pgios()
NFSv4: Fix a compile warning about no prototype for nfs4_ioctl()
NFS: Use wait_on_atomic_t() for unlock after readahead
SUNRPC: Fixup socket wait for memory
NFSv4.1/pNFS: Cleanup constify struct pnfs_layout_range arguments
NFSv4.1/pnfs: Cleanup copying of pnfs_layout_range structures
NFSv4.1/pNFS: Cleanup pnfs_mark_matching_lsegs_invalid()
NFSv4.1/pNFS: Fix a race in initiate_file_draining()
NFSv4.1/pNFS: pnfs_error_mark_layout_for_return() must always return layout
NFSv4.1/pNFS: pnfs_mark_matching_lsegs_return() should set the iomode
NFSv4.1/pNFS: Use nfs4_stateid_copy for copying stateids
NFSv4.1/pNFS: Don't pass stateids by value to pnfs_send_layoutreturn()
NFS: Relax requirements in nfs_flush_incompatible
NFSv4.1/pNFS: Don't queue up a new commit if the layout segment is invalid
NFS: Allow multiple commit requests in flight per file
NFS/pNFS: Fix up pNFS write reschedule layering violations and bugs
SUNRPC: Fix a missing break in rpc_anyaddr()
pNFS/flexfiles: Fix an Oopsable typo in ff_mirror_match_fh()
NFS: Fix attribute cache revalidation
NFS: Ensure we revalidate attributes before using execute_ok()
...
According to <linux/jump_label.h> the direct use of struct static_key is
deprecated. Update the socket and slab accounting code accordingly.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David S. Miller <davem@davemloft.net>
Reported-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The unified hierarchy memory controller is going to use this jump label
as well to control the networking callbacks. Move it to the memory
controller code and give it a more generic name.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There won't be any separate counters for socket memory consumed by
protocols other than TCP in the future. Remove the indirection and link
sockets directly to their owning memory cgroup.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There won't be a tcp control soft limit, so integrating the memcg code
into the global skmem limiting scheme complicates things unnecessarily.
Replace this with simple and clear charge and uncharge calls--hidden
behind a jump label--to account skb memory.
Note that this is not purely aesthetic: as a result of shoehorning the
per-memcg code into the same memory accounting functions that handle the
global level, the old code would compare the per-memcg consumption
against the smaller of the per-memcg limit and the global limit. This
allowed the total consumption of multiple sockets to exceed the global
limit, as long as the individual sockets stayed within bounds. After
this change, the code will always compare the per-memcg consumption to
the per-memcg limit, and the global consumption to the global limit, and
thus close this loophole.
Without a soft limit, the per-memcg memory pressure state in sockets is
generally questionable. However, we did it until now, so we continue to
enter it when the hard limit is hit, and packets are dropped, to let
other sockets in the cgroup know that they shouldn't grow their transmit
windows, either. However, keep it simple in the new callback model and
leave memory pressure lazily when the next packet is accepted (as
opposed to doing it synchroneously when packets are processed). When
packets are dropped, network performance will already be in the toilet,
so that should be a reasonable trade-off.
As described above, consumption is now checked on the per-memcg level
and the global level separately. Likewise, memory pressure states are
maintained on both the per-memcg level and the global level, and a
socket is considered under pressure when either level asserts as much.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
tcp_memcontrol replicates the global sysctl_mem limit array per cgroup,
but it only ever sets these entries to the value of the memory_allocated
page_counter limit. Use the latter directly.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The number of allocated sockets is used for calculations in the soft
limit phase, where packets are accepted but the socket is under memory
pressure.
Since there is no soft limit phase in tcp_memcontrol, and memory
pressure is only entered when packets are already dropped, this is
actually dead code. Remove it.
As this is the last user of parent_cg_proto(), remove that too.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move the jump-label from sock_update_memcg() and sock_release_memcg() to
the callsite, and so eliminate those function calls when socket
accounting is not enabled.
This also eliminates the need for dummy functions because the calls will
be optimized away if the Kconfig options are not enabled.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are two bits defined for cg_proto->flags - MEMCG_SOCK_ACTIVATED
and MEMCG_SOCK_ACTIVE - both are set in tcp_update_limit, but the former
is never cleared while the latter can be cleared by unsetting the limit.
This allows to disable tcp socket accounting for new sockets after it
was enabled by writing -1 to memory.kmem.tcp.limit_in_bytes while still
guaranteeing that memcg_socket_limit_enabled static key will be
decremented on memcg destruction.
This functionality looks dubious, because it is not clear what a use
case would be. By enabling tcp accounting a user accepts the price. If
they then find the performance degradation unacceptable, they can always
restart their workload with tcp accounting disabled. It does not seem
there is any need to flip it while the workload is running.
Besides, it contradicts to how kmem accounting API works: writing
whatever to memory.kmem.limit_in_bytes enables kmem accounting for the
cgroup in question, after which it cannot be disabled. Therefore one
might expect that writing -1 to memory.kmem.tcp.limit_in_bytes just
enables socket accounting w/o limiting it, which might be useful by
itself, but it isn't true.
Since this API peculiarity is not documented anywhere, I propose to drop
it. This will allow to simplify the code by dropping cg_proto->flags.
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mark those kmem allocations that are known to be easily triggered from
userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
memcg. For the list, see below:
- threadinfo
- task_struct
- task_delay_info
- pid
- cred
- mm_struct
- vm_area_struct and vm_region (nommu)
- anon_vma and anon_vma_chain
- signal_struct
- sighand_struct
- fs_struct
- files_struct
- fdtable and fdtable->full_fds_bits
- dentry and external_name
- inode for all filesystems. This is the most tedious part, because
most filesystems overwrite the alloc_inode method.
The list is far from complete, so feel free to add more objects.
Nevertheless, it should be close to "account everything" approach and
keep most workloads within bounds. Malevolent users will be able to
breach the limit, but this was possible even with the former "account
everything" approach (simply because it did not account everything in
fact).
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently mac80211 does not inform the driver of the session
block ack timeout when starting a rx aggregation session.
Drivers that manage the reorder buffer need to know this
parameter.
Seeing that there are now too many arguments for the
drv_ampdu_action() function, wrap them inside a structure.
Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Use to_delayed_work() instead of open-coding it.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently mac80211 does not inform the driver of the window
size when starting an RX aggregation session.
To enable managing the reorder buffer in the driver or hardware
the window size is needed.
Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add an option for driver to check for packet duplication
by itself.
This is needed for example by the iwlwifi driver which
parallelizes the RX path and does the duplication check
per queue.
Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The Group ID Management frame is an Action frame of
category VHT. It is transmitted by the AP to assign
or change the user position of a STA for one or more
group IDs.
Process and save the group membership data. Notify
underlying driver of changes.
Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Instead of having this Kconfig option, which just *floods* the
kernel log,
* remove the per-channel prints that are fairly useless anyway
* convert the conditional printing to pr_debug()
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Non-bufferable MMPDUs are sent out to STAs even while in PS mode
(for example probe responses). Applying filtered frame handling for
these doesn't seem to make much sense and will only create more
air utilization when the STA wakes up. Hence, apply filtered frame
handling only for bufferable MMPDUs.
Discovered while testing an old VOIP phone that started probing
for APs while in PS mode. The mac80211/ath9k AP where the STA is
associated would reply with a probe response but the phone sometimes
moved to a new channel already and couldn't ack the probe response
anymore. In that case mac80211 applied filtered frame handling
for the un-acked probe response.
Signed-off-by: Helmut Schaa <helmut.schaa@googlemail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
On hw restart, mac80211 might try to reconfigure already
stopped sched scan, if ieee80211_sched_scan_stopped_work()
wasn't scheduled yet.
This in turn will keep the device driver with scheduled scan
configured, while both mac80211 and cfg80211 will clear
their sched scan state once the work is scheduled.
Fix it by ignoring ieee80211_sched_scan_stopped() calls
while in hw restart, and flush the work before starting
the reconfiguration.
Signed-off-by: Eliad Peller <eliadx.peller@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
My commit below broken PS-Poll handling. In case the driver
has no frames buffered, driver_release_tids will be 0, but
calling find_highest_prio_tid() with 0 as a parameter is
not a good idea:
fls(0) - 1 = -1.
This bug caused mac80211 to think that frames were buffered
in the driver which in turn was confused because mac80211
was asking to release frames that were not reported to
exist.
On iwlwifi, this led to the WARNING below:
WARNING: CPU: 0 PID: 11230 at drivers/net/wireless/intel/iwlwifi/mvm/sta.c:1733 iwl_mvm_sta_modify_sleep_tx_count+0x2af/0x320 [iwlmvm]()
ffffffffc0627c60 ffff8800069b7648 ffffffff81888913 0000000000000000
0000000000000000 ffff8800069b7688 ffffffff81089d6a ffff8800069b7678
0000000000000001 ffff88003b35abf0 ffff88000698b128 ffff8800069b76d4
Call Trace:
[<ffffffff81888913>] dump_stack+0x4c/0x65
[<ffffffff81089d6a>] warn_slowpath_common+0x8a/0xc0
[<ffffffff81089e5a>] warn_slowpath_null+0x1a/0x20
[<ffffffffc05f36bf>] iwl_mvm_sta_modify_sleep_tx_count+0x2af/0x320 [iwlmvm]
[<ffffffffc05dae41>] iwl_mvm_mac_release_buffered_frames+0x31/0x40 [iwlmvm]
[<ffffffffc045d8b6>] ieee80211_sta_ps_deliver_response+0x6e6/0xd80 [mac80211]
[<ffffffffc0461296>] ieee80211_sta_ps_deliver_poll_response+0x26/0x30 [mac80211]
[<ffffffffc048f743>] ieee80211_rx_handlers+0xa83/0x2900 [mac80211]
[<ffffffffc04917ad>] ieee80211_prepare_and_rx_handle+0x1ed/0xa70 [mac80211]
[<ffffffffc045e3d5>] ? sta_info_get_bss+0x5/0x4a0 [mac80211]
[<ffffffffc04925b6>] ieee80211_rx_napi+0x586/0xcd0 [mac80211]
[<ffffffffc05eaa3e>] iwl_mvm_rx_rx_mpdu+0x59e/0xc60 [iwlmvm]
Fixes: 0ead2510f8 ("mac80211: allow the driver to send EOSP when needed")
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
On reconfig, in case of sched_scan_req->n_scan_plans > 1,
local->sched_scan_req was never cleared, although
cfg80211_sched_scan_stopped_rtnl() was called, resulting
in local->sched_scan_req holding a stale and preventing
further scheduled scan requests.
Clear it explicitly in this case.
Fixes: 42a7e82c6792 ("mac80211: Do not restart scheduled scan if multiple scan plans are set")
Signed-off-by: Eliad Peller <eliadx.peller@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Defer ROC requests during hw restart, as the driver
might not be fully configured in this stage (e.g.
channel contexts were not added yet)
Signed-off-by: Eliad Peller <eliadx.peller@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The rule definitions here aren't really valid, they would
be rejected if it came from userspace due to the bandwidth
specified being bigger than the rule's width.
This is fairly much inconsequential since the other rules
around them do enable the bandwidth, but express that better
using the NL80211_RRF_AUTO_BW flag.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Ilan's previous commit 1b894521e6 ("mac80211: handle HW
ROC expired properly") neglected to take into account that
hw_begun was now always set in the software implementation
as well as the offloaded case.
Fix hw_begun to only apply to the offloaded case to make
the check in Ilan's commit safe and correct.
Reported-by: Jouni Malinen <j@w1.fi>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The current (new) code recalculates the new work timeout
for software remain-on-channel whenever any item started.
In two of the callers of ieee80211_handle_roc_started(),
this is completely pointless since they're for hardware
and will skip the recalculation entirely; it's necessary
only in the case of having just added a new item to the
list, as in the last remaining case the recalculation had
just been done.
This last case, however, is also problematic - if one of
the items on the list actually expires during the recalc
the list iteration outside becomes corrupted and crashes.
Fix this by moving the recalculation to the only place
where it's required.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
- avoid freeing batadv_hardif_neigh_node when still in use in other contexts
- prevent lockdep splat in mcast_free during shutdown
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJWlovIAAoJENpFlCjNi1MRxccP/1+pDAPE9GpSUwOaDUYjWVYQ
4k1jRdJZAiMTrLyr1pW3Gesq3wTwtcIjjevaZGOa3QyhKRPmVj7yHqeYv62+WX9x
vEu6Mq6btBUqF+7+hCA/7n+Wyca+fsg9ge87hi6blmTSexlrcog80vTf2/N0CAQ4
w9jVUWcWpK8fZ6ia5IiK/RNw9gL4onh98V/7WqcvUyOmBmlyQJTpdqvvj5DZGFbL
u6B5hr0hzLirW2LJsk4fdYgKzyMYXBPtLZoiRiuk9GyjMDK+t0rUPdpuoPYhiVhH
ZFidpICvNpBDQyDjITGX6XvAUqYI4bjDfxT3vdznkNhcmcPjrcITSw8x8u2rB3Xr
TTVPtxi1OvGc0UCW+2Y47IJuHowZ7dKI0B2PiHdvLAlgM4dMYupZlruRMhR/XF0b
x7wSsbETxO3g+SH3qqJVMyebrrX//uC19vThbq+gVzMEa1hkZSSvDZbp/c2RDlCA
h5DtOjYPzCQeFo/8eorprUaO0zpbF0k/fTQu8f9csyozKVMdcCTO2X0vSRoXJG+m
9EP4WiFhqRXZWlgrTAXVxTgPlP1KEDHTmjynC2iwkg1Fxje1xBWX+zdWAt9Vgrby
HhYlb2wWg6tTt8zHjqwtOUqeCMDl8/7I+nk4GjJMzlNv3Iv2+3Eyj+DALOiQgt1S
xqvE8KkN7EEAzlzkqDPF
=6uBh
-----END PGP SIGNATURE-----
Merge tag 'batman-adv-fix-for-davem' of git://git.open-mesh.org/linux-merge
Antonio Quartulli says:
====================
net: batman-adv 20160114
Included bugfixes:
- avoid freeing batadv_hardif_neigh_node when still in use in other contexts
- prevent lockdep splat in mcast_free during shutdown
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The bug fix for adding n_groups to the computation forgot
to adjust ">=" to ">" to keep the condition correct.
Signed-off-by: David S. Miller <davem@davemloft.net>
Jozsef says:
The correct behaviour is that if we have
ipset create test1 hash:net,iface
ipset add test1 0.0.0.0/0,eth0
iptables -A INPUT -m set --match-set test1 src,src
then the rule should match for any traffic coming in through eth0.
This removes the -EINVAL runtime test to make matching work
in case packet arrived via the specified interface.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1297092
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
David points out that we to three le/be conversions instead
of just one. Doesn't matter on x86_64 w. gcc, but other
architectures might be less lucky.
Since it also simplifies code just follow his advice.
Fixes: c0f3275f5cb ("nftables: byteorder: provide le/be 64 bit conversion helper")
Suggested-by: David Laight <David.Laight@aculab.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The neigh_list with batadv_hardif_neigh_node objects is accessed with only
rcu_read_lock in batadv_hardif_neigh_get and batadv_iv_neigh_print. Thus it
is not allowed to kfree the object before the rcu grace period ends (which
may still protects context accessing this object). Therefore the object has
first to be removed from the neigh_list and then it has either wait with
synchronize_rcu or call_rcu till the grace period ends before it can be
freed.
Fixes: cef63419f7 ("batman-adv: add list of unique single hop neighbors per hard-interface")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Pull networking updates from Davic Miller:
1) Support busy polling generically, for all NAPI drivers. From Eric
Dumazet.
2) Add byte/packet counter support to nft_ct, from Floriani Westphal.
3) Add RSS/XPS support to mvneta driver, from Gregory Clement.
4) Implement IPV6_HDRINCL socket option for raw sockets, from Hannes
Frederic Sowa.
5) Add support for T6 adapter to cxgb4 driver, from Hariprasad Shenai.
6) Add support for VLAN device bridging to mlxsw switch driver, from
Ido Schimmel.
7) Add driver for Netronome NFP4000/NFP6000, from Jakub Kicinski.
8) Provide hwmon interface to mlxsw switch driver, from Jiri Pirko.
9) Reorganize wireless drivers into per-vendor directories just like we
do for ethernet drivers. From Kalle Valo.
10) Provide a way for administrators "destroy" connected sockets via the
SOCK_DESTROY socket netlink diag operation. From Lorenzo Colitti.
11) Add support to add/remove multicast routes via netlink, from Nikolay
Aleksandrov.
12) Make TCP keepalive settings per-namespace, from Nikolay Borisov.
13) Add forwarding and packet duplication facilities to nf_tables, from
Pablo Neira Ayuso.
14) Dead route support in MPLS, from Roopa Prabhu.
15) TSO support for thunderx chips, from Sunil Goutham.
16) Add driver for IBM's System i/p VNIC protocol, from Thomas Falcon.
17) Rationalize, consolidate, and more completely document the checksum
offloading facilities in the networking stack. From Tom Herbert.
18) Support aborting an ongoing scan in mac80211/cfg80211, from
Vidyullatha Kanchanapally.
19) Use per-bucket spinlock for bpf hash facility, from Tom Leiming.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1375 commits)
net: bnxt: always return values from _bnxt_get_max_rings
net: bpf: reject invalid shifts
phonet: properly unshare skbs in phonet_rcv()
dwc_eth_qos: Fix dma address for multi-fragment skbs
phy: remove an unneeded condition
mdio: remove an unneed condition
mdio_bus: NULL dereference on allocation error
net: Fix typo in netdev_intersect_features
net: freescale: mac-fec: Fix build error from phy_device API change
net: freescale: ucc_geth: Fix build error from phy_device API change
bonding: Prevent IPv6 link local address on enslaved devices
IB/mlx5: Add flow steering support
net/mlx5_core: Export flow steering API
net/mlx5_core: Make ipv4/ipv6 location more clear
net/mlx5_core: Enable flow steering support for the IB driver
net/mlx5_core: Initialize namespaces only when supported by device
net/mlx5_core: Set priority attributes
net/mlx5_core: Connect flow tables
net/mlx5_core: Introduce modify flow table command
net/mlx5_core: Managing root flow table
...
Pull misc vfs updates from Al Viro:
"All kinds of stuff. That probably should've been 5 or 6 separate
branches, but by the time I'd realized how large and mixed that bag
had become it had been too close to -final to play with rebasing.
Some fs/namei.c cleanups there, memdup_user_nul() introduction and
switching open-coded instances, burying long-dead code, whack-a-mole
of various kinds, several new helpers for ->llseek(), assorted
cleanups and fixes from various people, etc.
One piece probably deserves special mention - Neil's
lookup_one_len_unlocked(). Similar to lookup_one_len(), but gets
called without ->i_mutex and tries to avoid ever taking it. That, of
course, means that it's not useful for any directory modifications,
but things like getting inode attributes in nfds readdirplus are fine
with that. I really should've asked for moratorium on lookup-related
changes this cycle, but since I hadn't done that early enough... I
*am* asking for that for the coming cycle, though - I'm going to try
and get conversion of i_mutex to rwsem with ->lookup() done under lock
taken shared.
There will be a patch closer to the end of the window, along the lines
of the one Linus had posted last May - mechanical conversion of
->i_mutex accesses to inode_lock()/inode_unlock()/inode_trylock()/
inode_is_locked()/inode_lock_nested(). To quote Linus back then:
-----
| This is an automated patch using
|
| sed 's/mutex_lock(&\(.*\)->i_mutex)/inode_lock(\1)/'
| sed 's/mutex_unlock(&\(.*\)->i_mutex)/inode_unlock(\1)/'
| sed 's/mutex_lock_nested(&\(.*\)->i_mutex,[ ]*I_MUTEX_\([A-Z0-9_]*\))/inode_lock_nested(\1, I_MUTEX_\2)/'
| sed 's/mutex_is_locked(&\(.*\)->i_mutex)/inode_is_locked(\1)/'
| sed 's/mutex_trylock(&\(.*\)->i_mutex)/inode_trylock(\1)/'
|
| with a very few manual fixups
-----
I'm going to send that once the ->i_mutex-affecting stuff in -next
gets mostly merged (or when Linus says he's about to stop taking
merges)"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
nfsd: don't hold i_mutex over userspace upcalls
fs:affs:Replace time_t with time64_t
fs/9p: use fscache mutex rather than spinlock
proc: add a reschedule point in proc_readfd_common()
logfs: constify logfs_block_ops structures
fcntl: allow to set O_DIRECT flag on pipe
fs: __generic_file_splice_read retry lookup on AOP_TRUNCATED_PAGE
fs: xattr: Use kvfree()
[s390] page_to_phys() always returns a multiple of PAGE_SIZE
nbd: use ->compat_ioctl()
fs: use block_device name vsprintf helper
lib/vsprintf: add %*pg format specifier
fs: use gendisk->disk_name where possible
poll: plug an unused argument to do_poll
amdkfd: don't open-code memdup_user()
cdrom: don't open-code memdup_user()
rsxx: don't open-code memdup_user()
mtip32xx: don't open-code memdup_user()
[um] mconsole: don't open-code memdup_user_nul()
[um] hostaudio: don't open-code memdup_user()
...
On ARM64, a BUG() is triggered in the eBPF JIT if a filter with a
constant shift that can't be encoded in the immediate field of the
UBFM/SBFM instructions is passed to the JIT. Since these shifts
amounts, which are negative or >= regsize, are invalid, reject them in
the eBPF verifier and the classic BPF filter checker, for all
architectures.
Signed-off-by: Rabin Vincent <rabin@rab.in>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Multicast groups are stored in global buffer. Check for needed buffer size
incorrectly compares buffer size to first id for family. This means that
for families with more than one mcast id one may allocate too small buffer
and end up writing rest of the groups to some unallocated memory. Fix the
buffer size check to compare allocated space to last mcast id for the
family.
Tested on ARM using kernel 3.14
Signed-off-by: Matti Vaittinen <matti.vaittinen@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On ARM64, a BUG() is triggered in the eBPF JIT if a filter with a
constant shift that can't be encoded in the immediate field of the
UBFM/SBFM instructions is passed to the JIT. Since these shifts
amounts, which are negative or >= regsize, are invalid, reject them in
the eBPF verifier and the classic BPF filter checker, for all
architectures.
Signed-off-by: Rabin Vincent <rabin@rab.in>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ivaylo Dimitrov reported a regression caused by commit 7866a62104
("dev: add per net_device packet type chains").
skb->dev becomes NULL and we crash in __netif_receive_skb_core().
Before above commit, different kind of bugs or corruptions could happen
without major crash.
But the root cause is that phonet_rcv() can queue skb without checking
if skb is shared or not.
Many thanks to Ivaylo Dimitrov for his help, diagnosis and tests.
Reported-by: Ivaylo Dimitrov <ivo.g.dimitrov.75@gmail.com>
Tested-by: Ivaylo Dimitrov <ivo.g.dimitrov.75@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Remi Denis-Courmont <courmisch@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/bonding/bond_main.c
drivers/net/ethernet/mellanox/mlxsw/spectrum.h
drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
The bond_main.c and mellanox switch conflicts were cases of
overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit acf8dd0a9d ("udp: only allow UFO for packets from SOCK_DGRAM
sockets") disallows UFO for packets sent from raw sockets. We need to do
the same also for SOCK_DGRAM sockets with SO_NO_CHECK options, even if
for a bit different reason: while such socket would override the
CHECKSUM_PARTIAL set by ip_ufo_append_data(), gso_size is still set and
bad offloading flags warning is triggered in __skb_gso_segment().
In the IPv6 case, SO_NO_CHECK option is ignored but we need to disallow
UFO for packets sent by sockets with UDP_NO_CHECK6_TX option.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Tested-by: Shannon Nelson <shannon.nelson@intel.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix possible null pointer dereference that may occur when calling
skb_reserve() on a null skb.
Fixes: 879c7220e8 ("net: pktgen: Observe needed_headroom of the device")
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After IPv6 support has recently been added to metadata dst and related
encaps, add support for populating/reading it from an eBPF program.
Commit d3aa45ce6b ("bpf: add helpers to access tunnel metadata") started
with initial IPv4-only support back then (due to IPv6 metadata support
not being available yet).
To stay compatible with older programs, we need to test for the passed
structure size. Also TOS and TTL support from the ip_tunnel_info key has
been added. Tested with vxlan devs in collect meta data mode with IPv4,
IPv6 and in compat mode over different network namespaces.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Export flags used by eBPF helper functions through UAPI, so they can be
used by programs (instead of them redefining all flags each time or just
using the hard-coded values). It also gives a better overview what flags
are used where and we can further get rid of the extra macros defined in
filter.c. Moreover, reject invalid flags.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
only when user space passes the addresses should we consider their
presence
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
For tcp_yeah, use an ssthresh floor of 2, the same floor used by Reno
and CUBIC, per RFC 5681 (equation 4).
tcp_yeah_ssthresh() was sometimes returning a 0 or negative ssthresh
value if the intended reduction is as big or bigger than the current
cwnd. Congestion control modules should never return a zero or
negative ssthresh. A zero ssthresh generally results in a zero cwnd,
causing the connection to stall. A negative ssthresh value will be
interpreted as a u32 and will set a target cwnd for PRR near 4
billion.
Oleksandr Natalenko reported that a system using tcp_yeah with ECN
could see a warning about a prior_cwnd of 0 in
tcp_cwnd_reduction(). Testing verified that this was due to
tcp_yeah_ssthresh() misbehaving in this way.
Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dmitry Vyukov reported a use-after-free in the code expanded by the
macro debug_post_sfx, which is caused by the use of the asoc pointer
after it was freed within sctp_side_effect() scope.
This patch fixes it by allowing sctp_side_effect to clear that asoc
pointer when the TCB is freed.
As Vlad explained, we also have to cover the SCTP_DISPOSITION_ABORT case
because it will trigger DELETE_TCB too on that same loop.
Also, there were places issuing SCTP_CMD_INIT_FAILED and ASSOC_FAILED
but returning SCTP_DISPOSITION_CONSUME, which would fool the scheme
above. Fix it by returning SCTP_DISPOSITION_ABORT instead.
The macro is already prepared to handle such NULL pointer.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The sz_idx variable is defined in the rtnetlink_rcv_msg(), but
not used anywhere. Let's remove it.
Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is possible for a process to allocate and accumulate far more FDs than
the process' limit by sending them over a unix socket then closing them
to keep the process' fd count low.
This change addresses this problem by keeping track of the number of FDs
in flight per user and preventing non-privileged processes from having
more FDs in flight than their configured FD limit.
Reported-by: socketpair@gmail.com
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Mitigates: CVE-2013-4312 (Linux 2.0+)
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit be4ace6e6b ("openvswitch: Move dev pointer into vport itself")
The commit above added @dev and moved @rcu to the bottom of struct
vport, but the change was not reflected in the kernel doc. So let's
update the kernel doc as well.
Signed-off-by: Jean Sacren <sakiwit@gmail.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 6b001e682e ("openvswitch: Use Geneve device.")
The commit above introduced 'port_no' as the name for the member of
struct geneve_port. The correct name should be 'dst_port' as described
in the kernel doc. Let's fix that member name and all the pertinent
instances so that both doc and code would be consistent.
Signed-off-by: Jean Sacren <sakiwit@gmail.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 6b001e682e ("openvswitch: Use Geneve device.")
The commit above deleted the only call site of ovs_tunnel_route_lookup()
and now that function is not used any more. So let's delete the function
definition as well.
Signed-off-by: Jean Sacren <sakiwit@gmail.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
When first SYNACK is sent, we already hold rcu_read_lock(), but this
is not true if a SYNACK is retransmitted, as a timer (soft) interrupt
does not hold rcu_read_lock()
Fixes: 45f6fad84c ("ipv6: add complete rcu protection around np->opt")
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Applications often have to reduce number of datagrams
they receive or send per system call to avoid starvation problems.
Really the kernel should take care of this by using cond_resched(),
so that applications can experiment bigger batch sizes.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The userspace needs to know why is the address being removed so that it can
perhaps obtain a new address.
Without the DADFAILED flag it's impossible to distinguish removal of a
temporary and tentative address due to DAD failure from other reasons (device
removed, manual address removal).
Signed-off-by: Lubomir Rintel <lkundrak@v3.sk>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This work adds a generalization of the ingress qdisc as a qdisc holding
only classifiers. The clsact qdisc works on ingress, but also on egress.
In both cases, it's execution happens without taking the qdisc lock, and
the main difference for the egress part compared to prior version of [1]
is that this can be applied with _any_ underlying real egress qdisc (also
classless ones).
Besides solving the use-case of [1], that is, allowing for more programmability
on assigning skb->priority for the mqprio case that is supported by most
popular 10G+ NICs, it also opens up a lot more flexibility for other tc
applications. The main work on classification can already be done at clsact
egress time if the use-case allows and state stored for later retrieval
f.e. again in skb->priority with major/minors (which is checked by most
classful qdiscs before consulting tc_classify()) and/or in other skb fields
like skb->tc_index for some light-weight post-processing to get to the
eventual classid in case of a classful qdisc. Another use case is that
the clsact egress part allows to have a central egress counterpart to
the ingress classifiers, so that classifiers can easily share state (e.g.
in cls_bpf via eBPF maps) for ingress and egress.
Currently, default setups like mq + pfifo_fast would require for this to
use, for example, prio qdisc instead (to get a tc_classify() run) and to
duplicate the egress classifier for each queue. With clsact, it allows
for leaving the setup as is, it can additionally assign skb->priority to
put the skb in one of pfifo_fast's bands and it can share state with maps.
Moreover, we can access the skb's dst entry (f.e. to retrieve tclassid)
w/o the need to perform a skb_dst_force() to hold on to it any longer. In
lwt case, we can also use this facility to setup dst metadata via cls_bpf
(bpf_skb_set_tunnel_key()) without needing a real egress qdisc just for
that (case of IFF_NO_QUEUE devices, for example).
The realization can be done without any changes to the scheduler core
framework. All it takes is that we have two a-priori defined minors/child
classes, where we can mux between ingress and egress classifier list
(dev->ingress_cl_list and dev->egress_cl_list, latter stored close to
dev->_tx to avoid extra cacheline miss for moderate loads). The egress
part is a bit similar modelled to handle_ing() and patched to a noop in
case the functionality is not used. Both handlers are now called
sch_handle_ingress() and sch_handle_egress(), code sharing among the two
doesn't seem practical as there are various minor differences in both
paths, so that making them conditional in a single handler would rather
slow things down.
Full compatibility to ingress qdisc is provided as well. Since both
piggyback on TC_H_CLSACT, only one of them (ingress/clsact) can exist
per netdevice, and thus ingress qdisc specific behaviour can be retained
for user space. This means, either a user does 'tc qdisc add dev foo ingress'
and configures ingress qdisc as usual, or the 'tc qdisc add dev foo clsact'
alternative, where both, ingress and egress classifier can be configured
as in the below example. ingress qdisc supports attaching classifier to any
minor number whereas clsact has two fixed minors for muxing between the
lists, therefore to not break user space setups, they are better done as
two separate qdiscs.
I decided to extend the sch_ingress module with clsact functionality so
that commonly used code can be reused, the module is being aliased with
sch_clsact so that it can be auto-loaded properly. Alternative would have been
to add a flag when initializing ingress to alter its behaviour plus aliasing
to a different name (as it's more than just ingress). However, the first would
end up, based on the flag, choosing the new/old behaviour by calling different
function implementations to handle each anyway, the latter would require to
register ingress qdisc once again under different alias. So, this really begs
to provide a minimal, cleaner approach to have Qdisc_ops and Qdisc_class_ops
by its own that share callbacks used by both.
Example, adding qdisc:
# tc qdisc add dev foo clsact
# tc qdisc show dev foo
qdisc mq 0: root
qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc clsact ffff: parent ffff:fff1
Adding filters (deleting, etc works analogous by specifying ingress/egress):
# tc filter add dev foo ingress bpf da obj bar.o sec ingress
# tc filter add dev foo egress bpf da obj bar.o sec egress
# tc filter show dev foo ingress
filter protocol all pref 49152 bpf
filter protocol all pref 49152 bpf handle 0x1 bar.o:[ingress] direct-action
# tc filter show dev foo egress
filter protocol all pref 49152 bpf
filter protocol all pref 49152 bpf handle 0x1 bar.o:[egress] direct-action
A 'tc filter show dev foo' or 'tc filter show dev foo parent ffff:' will
show an empty list for clsact. Either using the parent names (ingress/egress)
or specifying the full major/minor will then show the related filter lists.
Prior work on a mqprio prequeue() facility [1] was done mainly by John Fastabend.
[1] http://patchwork.ozlabs.org/patch/512949/
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
proc_dostring() needs an initialized destination string, while the one
provided in proc_sctp_do_hmac_alg() contains stack garbage.
Thus, writing to cookie_hmac_alg would strlen() that garbage and end up
accessing invalid memory.
Fixes: 3c68198e7 ("sctp: Make hmac algorithm selection for cookie generation dynamic")
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a small helper skb_postpush_rcsum() and fix up redirect locations
that need CHECKSUM_COMPLETE fixups on ingress. dev_forward_skb() expects
a proper csum that covers also Ethernet header, f.e. since 2c26d34bbc
("net/core: Handle csum for CHECKSUM_COMPLETE VXLAN forwarding"), we
also do skb_postpull_rcsum() after pulling Ethernet header off via
eth_type_trans().
When using eBPF in a netns setup f.e. with vxlan in collect metadata mode,
I can trigger the following csum issue with an IPv6 setup:
[ 505.144065] dummy1: hw csum failure
[...]
[ 505.144108] Call Trace:
[ 505.144112] <IRQ> [<ffffffff81372f08>] dump_stack+0x44/0x5c
[ 505.144134] [<ffffffff81607cea>] netdev_rx_csum_fault+0x3a/0x40
[ 505.144142] [<ffffffff815fee3f>] __skb_checksum_complete+0xcf/0xe0
[ 505.144149] [<ffffffff816f0902>] nf_ip6_checksum+0xb2/0x120
[ 505.144161] [<ffffffffa08c0e0e>] icmpv6_error+0x17e/0x328 [nf_conntrack_ipv6]
[ 505.144170] [<ffffffffa0898eca>] ? ip6t_do_table+0x2fa/0x645 [ip6_tables]
[ 505.144177] [<ffffffffa08c0725>] ? ipv6_get_l4proto+0x65/0xd0 [nf_conntrack_ipv6]
[ 505.144189] [<ffffffffa06c9a12>] nf_conntrack_in+0xc2/0x5a0 [nf_conntrack]
[ 505.144196] [<ffffffffa08c039c>] ipv6_conntrack_in+0x1c/0x20 [nf_conntrack_ipv6]
[ 505.144204] [<ffffffff8164385d>] nf_iterate+0x5d/0x70
[ 505.144210] [<ffffffff816438d6>] nf_hook_slow+0x66/0xc0
[ 505.144218] [<ffffffff816bd302>] ipv6_rcv+0x3f2/0x4f0
[ 505.144225] [<ffffffff816bca40>] ? ip6_make_skb+0x1b0/0x1b0
[ 505.144232] [<ffffffff8160b77b>] __netif_receive_skb_core+0x36b/0x9a0
[ 505.144239] [<ffffffff8160bdc8>] ? __netif_receive_skb+0x18/0x60
[ 505.144245] [<ffffffff8160bdc8>] __netif_receive_skb+0x18/0x60
[ 505.144252] [<ffffffff8160ccff>] process_backlog+0x9f/0x140
[ 505.144259] [<ffffffff8160c4a5>] net_rx_action+0x145/0x320
[...]
What happens is that on ingress, we push Ethernet header back in, either
from cls_bpf or right before skb_do_redirect(), but without updating csum.
The "hw csum failure" can be fixed by using the new skb_postpush_rcsum()
helper for the dev_forward_skb() case to correct the csum diff again.
Thanks to Hannes Frederic Sowa for the csum_partial() idea!
Fixes: 3896d655f4 ("bpf: introduce bpf_clone_redirect() helper")
Fixes: 27b29f6305 ("bpf: add bpf_redirect() helper")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a skb_at_tc_ingress() as this will be needed elsewhere as well and
can hide the ugly ifdef.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the final part required to namespaceify the tcp
keep alive mechanism.
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is required to have full tcp keepalive mechanism namespace
support.
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Different net namespaces might have different requirements as to
the keepalive time of tcp sockets. This might be required in cases
where different firewall rules are in place which require tcp
timeout sockets to be increased/decreased independently of the host.
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
udp tunnel offloads tend to aggregate datagrams based on inner
headers. gro engine gets notified by tunnel implementations about
possible offloads. The match is solely based on the port number.
Imagine a tunnel bound to port 53, the offloading will look into all
DNS packets and tries to aggregate them based on the inner data found
within. This could lead to data corruption and malformed DNS packets.
While this patch minimizes the problem and helps an administrator to find
the issue by querying ip tunnel/fou, a better way would be to match on
the specific destination ip address so if a user space socket is bound
to the same address it will conflict.
Cc: Tom Herbert <tom@herbertland.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Offload MDB changes per port to hardware
Signed-off-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Define HW multicast entry: MAC and VID.
Using a MAC address simplifies support for both IPV4 and IPv6.
Signed-off-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
kernel-doc is not able to skip an #ifdef between the kernel documentation
block and the start of the struct. Moving the #ifdef before the kernel doc
block avoids this problem
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Let us split a check for a condition at the beginning of the
batadv_is_ap_isolated() function so that a direct return can be performed
in this function if the variable "vlan" contained a null pointer.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The batadv_softif_vlan_free_ref() function tests whether its argument is NULL
and then returns immediately. Thus the test around the call is not needed.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
* Let us return directly if a call of the batadv_orig_hash_find() function
returned a null pointer.
* Omit the initialisation for the variable "skb" at the beginning.
* Replace an assignment by a call of the kfree_skb() function
and delete the affected variable "ret" then.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The kfree_skb() function tests whether its argument is NULL and then
returns immediately. Thus the test around the calls is not needed.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The code to convert the throughput information from a string to the
batman-adv internal (100Kibit/s) representation is duplicated in
batadv_parse_gw_bandwidth. Move this functionality to its own function
batadv_parse_throughput to reduce the code complexity.
Signed-off-by: Sven Eckelmann <sven@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Currently, the post function is also called on errors or if there were
no changes, which is redundant for the functions currently using these
facilities.
Signed-off-by: Simon Wunderlich <simon@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
If networks take a long time to come up, e.g. due to lossy links, then
the bridge loop avoidance wait time to suppress broadcasts may not wait
long enough and detect a backbone before the mesh is brought up.
Increasing the wait period further to 60 seconds makes this scenario
less likely.
Signed-off-by: Simon Wunderlich <simon@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
When bridge loop avoidance is disabled through sysfs, the internal
datastructures are not disabled, but only BLA operations are disabled.
To be sure that they are removed, purge the data immediately. That is
especially useful if a firmwares network state is changed, and the BLA
wait periods should restart on the new network.
Signed-off-by: Simon Wunderlich <simon@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The function handles tlv containers and not tlv handlers. Thus the
lockdep_assert_held has to check for the container_list lock.
Fixes: 2c72d655b0 ("batman-adv: Annotate deleting functions with external lock via lockdep")
Signed-off-by: Sven Eckelmann <sven@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Add lock release/acquire annotations to ping_seq_start() and
ping_seq_stop() to satisfy sparse.
Signed-off-by: Lance Richardson <lrichard@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eliminate endianness mismatch warnings (reported by sparse) in this file by
using appropriate nla_put_*()/nla_get_*() calls.
Signed-off-by: Lance Richardson <lrichard@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next, they are:
1) Release nf_tables objects on netns destructions via
nft_release_afinfo().
2) Destroy basechain and rules on netdevice removal in the new netdev
family.
3) Get rid of defensive check against removal of inactive objects in
nf_tables.
4) Pass down netns pointer to our existing nfnetlink callbacks, as well
as commit() and abort() nfnetlink callbacks.
5) Allow to invert limit expression in nf_tables, so we can throttle
overlimit traffic.
6) Add packet duplication for the netdev family.
7) Add forward expression for the netdev family.
8) Define pr_fmt() in conntrack helpers.
9) Don't leave nfqueue configuration on inconsistent state in case of
errors, from Ken-ichirou MATSUZAWA, follow up patches are also from
him.
10) Skip queue option handling after unbind.
11) Return error on unknown both in nfqueue and nflog command.
12) Autoload ctnetlink when NFQA_CFG_F_CONNTRACK is set.
13) Add new NFTA_SET_USERDATA attribute to store user data in sets,
from Carlos Falgueras.
14) Add support for 64 bit byteordering changes nf_tables, from Florian
Westphal.
15) Add conntrack byte/packet counter matching support to nf_tables,
also from Florian.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth-next 2016-01-08
Here's one more bluetooth-next pull request for the 4.5 kernel:
- Support for CRC check and promiscuous mode for CC2520
- Fixes to btmrvl driver
- New ACPI IDs for hci_bcm driver
- Limited Discovery support for the Bluetooth mgmt interface
- Minor other cleanups here and there
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
If the accounting extension isn't present, we'll return a counter
value of 0.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Needed to convert the (64bit) conntrack counters to BE ordering.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
User data is stored at after 'nft_set_ops' private data into 'data[]'
flexible array. The field 'udata' points to user data and 'udlen' stores
its length.
Add new flag NFTA_SET_USERDATA.
Signed-off-by: Carlos Falgueras García <carlosfg@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch enables to load nf_conntrack_netlink module if
NFQA_CFG_F_CONNTRACK config flag is specified.
Signed-off-by: Ken-ichirou MATSUZAWA <chamas@h4.dion.ne.jp>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch stops processing after destroying a queue instance.
Signed-off-by: Ken-ichirou MATSUZAWA <chamas@h4.dion.ne.jp>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Check that dependencies are fulfilled before updating the queue
instance, otherwise we can leave things in intermediate state on errors
in nfqnl_recv_config().
Signed-off-by: Ken-ichirou MATSUZAWA <chamas@h4.dion.ne.jp>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* bugfixes:
SUNRPC: Fixup socket wait for memory
SUNRPC: Fix a missing break in rpc_anyaddr()
pNFS/flexfiles: Fix an Oopsable typo in ff_mirror_match_fh()
NFS: Fix attribute cache revalidation
NFS: Ensure we revalidate attributes before using execute_ok()
NFS: Flush reclaim writes using FLUSH_COND_STABLE
NFS: Background flush should not be low priority
NFSv4.1/pnfs: Fixup an lo->plh_block_lgets imbalance in layoutreturn
NFSv4: Don't perform cached access checks before we've OPENed the file
NFS: Allow the combination pNFS and labeled NFS
NFS42: handle layoutstats stateid error
nfs: Fix race in __update_open_stateid()
nfs: fix missing assignment in nfs4_sequence_done tracepoint
When the phy is connected, an info message is printed. If the netdev
it is attached to has not been registered yet, the name
'uninitialised' in the output. By registering the netdev first, then
connecting they phy, we can avoid this.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rather than have drivers directly manipulate the mii_bus structure,
provide and API for registering and unregistering devices on an MDIO
bus, and performing lookups.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Not all devices attached to an MDIO bus are phys. So add an
mdio_device structure to represent the generic parts of an mdio
device, and place this structure into the phy_device.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Many Ethernet drivers contain the same netdev_info() print statement
about the attached phy. Move it into the phy device code. Additionally
add a varargs function which can be used to append additional
information.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 6f18dc8939.
Just as one example, it appears this code could do the wrong thing in
the case of a two-byte NFS READ that crosses a page boundary.
Chuck says: "In that case, nfsd would pass down an xdr_buf that has one
byte in a page, one byte in another page, and a two-byte XDR pad. The
logic introduced by this optimization would be fooled, and neither the
second byte nor the XDR pad would be written to the client."
Cc: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
batadv_iv_ogm_orig_del_if removes a part of the bcast_own which previously
belonged to the now removed interface. This is done by copying all data
which comes before the removed interface and then appending all the data
which comes after the removed interface.
The address calculation for the position of the data which comes after the
removed interface assumed that the bat_iv.bcast_own is a pointer to a
single byte datatype. But it is a pointer to unsigned long and thus the
calculated position was wrong off factor sizeof(unsigned long).
Fixes: 83a8342678a0 ("more basic routing code added (forwarding packets /
bitarray added)")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Patch 3759824da8 ("tcp: PRR uses CRB mode by default and SS mode
conditionally") introduced a bug that cwnd may become 0 when both
inflight and sndcnt are 0 (cwnd = inflight + sndcnt). This may lead
to a div-by-zero if the connection starts another cwnd reduction
phase by setting tp->prior_cwnd to the current cwnd (0) in
tcp_init_cwnd_reduction().
To prevent this we skip PRR operation when nothing is acked or
sacked. Then cwnd must be positive in all cases as long as ssthresh
is positive:
1) The proportional reduction mode
inflight > ssthresh > 0
2) The reduction bound mode
a) inflight == ssthresh > 0
b) inflight < ssthresh
sndcnt > 0 since newly_acked_sacked > 0 and inflight < ssthresh
Therefore in all cases inflight and sndcnt can not both be 0.
We check invalid tp->prior_cwnd to avoid potential div0 bugs.
In reality this bug is triggered only with a sequence of less common
events. For example, the connection is terminating an ECN-triggered
cwnd reduction with an inflight 0, then it receives reordered/old
ACKs or DSACKs from prior transmission (which acks nothing). Or the
connection is in fast recovery stage that marks everything lost,
but fails to retransmit due to local issues, then receives data
packets from other end which acks nothing.
Fixes: 3759824da8 ("tcp: PRR uses CRB mode by default and SS mode conditionally")
Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A repeating pattern in drivers has become to use OF node information
and, if not found, platform specific host information to extract the
ethernet address for a given device.
Currently this is done with a call to of_get_mac_address() and then
some ifdef'd stuff for SPARC.
Consolidate this into a portable routine, and provide the
arch_get_platform_mac_address() weak function hook for all
architectures to implement if they want.
Signed-off-by: David S. Miller <davem@davemloft.net>
dst_release should not access dst->flags after decrementing
__refcnt to 0. The dst_entry may be in dst_busy_list and
dst_gc_task may dst_destroy it before dst_release gets a chance
to access dst->flags.
Fixes: d69bbf88c8 ("net: fix a race in dst_release()")
Fixes: 27b75c95f1 ("net: avoid RCU for NOCACHE dst")
Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Notifying hardware about newly bridged port vlan-aware changes.
Signed-off-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Notifying hardware about bridge vlan-aware changes.
Signed-off-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Disallow adding interfaces to a bridge when vlan filtering operation
failed. Send the failure code to the user.
Signed-off-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We're seeing hangs in the NFS client code, with loops of the form:
RPC: 30317 xmit incomplete (267368 left of 524448)
RPC: 30317 call_status (status -11)
RPC: 30317 call_transmit (status 0)
RPC: 30317 xprt_prepare_transmit
RPC: 30317 xprt_transmit(524448)
RPC: xs_tcp_send_request(267368) = -11
RPC: 30317 xmit incomplete (267368 left of 524448)
RPC: 30317 call_status (status -11)
RPC: 30317 call_transmit (status 0)
RPC: 30317 xprt_prepare_transmit
RPC: 30317 xprt_transmit(524448)
Turns out commit ceb5d58b21 ("net: fix sock_wake_async() rcu protection")
moved SOCKWQ_ASYNC_NOSPACE out of sock->flags and into sk->sk_wq->flags,
however it never tried to fix up the code in net/sunrpc.
The new idiom is to use the flags in the RCU protected struct socket_wq.
While we're at it, clear out the now redundant places where we set/clear
SOCKWQ_ASYNC_NOSPACE and SOCK_NOSPACE. In principle, sk_stream_wait_memory()
is supposed to set these for us, so we only need to clear them in the
particular case of our ->write_space() callback.
Fixes: ceb5d58b21 ("net: fix sock_wake_async() rcu protection")
Cc: Eric Dumazet <edumazet@google.com>
Cc: stable@vger.kernel.org # 4.4
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Instead, allow using string formatting with send_monitor_note()
and access init_utsname().
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
When a qdisc is using per cpu stats (currently just the ingress
qdisc) only the bstats are being freed. This also free's the qstats.
Fixes: b0ab6f9275 ("net: sched: enable per cpu qstats")
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes: 538950a1b7 ("soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF")
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Craig Gallek <kraig@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This socket-lookup path did not pass along the skb in question
in my original BPF-based socket selection patch. The skb in the
udpN_lib_lookup2 path can be used for BPF-based socket selection just
like it is in the 'traditional' udpN_lib_lookup path.
udpN_lib_lookup2 kicks in when there are greater than 10 sockets in
the same hlist slot. Coincidentally, I chose 10 sockets per
reuseport group in my functional test, so the lookup2 path was not
excersised. This adds an additional set of tests with 20 sockets.
Fixes: 538950a1b7 ("soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF")
Fixes: 3ca8e40299 ("soreuseport: BPF selection functional test")
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Craig Gallek <kraig@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The only user was removed in commit
029f7f3b87 ("netfilter: ipv6: nf_defrag: avoid/free clone operations").
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
[I stole this patch from Eric Biederman. He wrote:]
> There is no defined mechanism to pass network namespace information
> into /sbin/bridge-stp therefore don't even try to invoke it except
> for bridge devices in the initial network namespace.
>
> It is possible for unprivileged users to cause /sbin/bridge-stp to be
> invoked for any network device name which if /sbin/bridge-stp does not
> guard against unreasonable arguments or being invoked twice on the
> same network device could cause problems.
[Hannes: changed patch using netns_eq]
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
sctp_endpoint_lookup_assoc is called in the protection of sock lock
there is no need to call local_bh_disable in this function. so remove
them.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
transport hashtable will replace the association hashtable,
so association hashtable is not used in sctp any more, so
drop the codes about that.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Traversal the transport rhashtable, get the association only once through
the condition assoc->peer.primary_path != transport.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
apply lookup apis to two functions, for __sctp_endpoint_lookup_assoc
and __sctp_lookup_association, it's invoked in the protection of sock
lock, it will be safe, but sctp_lookup_association need to call
rcu_read_lock() and to detect the t->dead to protect it.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tranport hashtbale will replace the association hashtable to do the
lookup for transport, and then get association by t->assoc, rhashtable
apis will be used because of it's resizable, scalable and using rcu.
lport + rport + paddr will be the base hashkey to locate the chain,
with net to protect one netns from another, then plus the laddr to
compare to get the target.
this patch will provider the lookup functions:
- sctp_epaddr_lookup_transport
- sctp_addrs_lookup_transport
hash/unhash functions:
- sctp_hash_transport
- sctp_unhash_transport
init/destroy functions:
- sctp_transport_hashtable_init
- sctp_transport_hashtable_destroy
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements the mgmt Start Limited Discovery command. Most
of existing Start Discovery code is reused since the only difference
is the presence of a 'limited' flag as part of the discovery state.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
To make the EIR parsing helper more general purpose, make it return
the found data and its length rather than just saying whether the data
was present or not.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
On 2015/11/06, Dmitry Vyukov reported a deadlock involving the splice
system call and AF_UNIX sockets,
http://lists.openwall.net/netdev/2015/11/06/24
The situation was analyzed as
(a while ago) A: socketpair()
B: splice() from a pipe to /mnt/regular_file
does sb_start_write() on /mnt
C: try to freeze /mnt
wait for B to finish with /mnt
A: bind() try to bind our socket to /mnt/new_socket_name
lock our socket, see it not bound yet
decide that it needs to create something in /mnt
try to do sb_start_write() on /mnt, block (it's
waiting for C).
D: splice() from the same pipe to our socket
lock the pipe, see that socket is connected
try to lock the socket, block waiting for A
B: get around to actually feeding a chunk from
pipe to file, try to lock the pipe. Deadlock.
on 2015/11/10 by Al Viro,
http://lists.openwall.net/netdev/2015/11/10/4
The patch fixes this by removing the kern_path_create related code from
unix_mknod and executing it as part of unix_bind prior acquiring the
readlock of the socket in question. This means that A (as used above)
will sb_start_write on /mnt before it acquires the readlock, hence, it
won't indirectly block B which first did a sb_start_write and then
waited for a thread trying to acquire the readlock. Consequently, A
being blocked by C waiting for B won't cause a deadlock anymore
(effectively, both A and B acquire two locks in opposite order in the
situation described above).
Dmitry Vyukov(<dvyukov@google.com>) tested the original patch.
Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commands run in a vrf context are not failing as expected on a route lookup:
root@kenny:~# ip ro ls table vrf-red
unreachable default
root@kenny:~# ping -I vrf-red -c1 -w1 10.100.1.254
ping: Warning: source address might be selected on device other than vrf-red.
PING 10.100.1.254 (10.100.1.254) from 0.0.0.0 vrf-red: 56(84) bytes of data.
--- 10.100.1.254 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 999ms
Since the vrf table does not have a route for 10.100.1.254 the ping
should have failed. The saddr lookup causes a full VRF table lookup.
Propogating a lookup failure to the user allows the command to fail as
expected:
root@kenny:~# ping -I vrf-red -c1 -w1 10.100.1.254
connect: No route to host
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Expose socket options for setting a classic or extended BPF program
for use when selecting sockets in an SO_REUSEPORT group. These options
can be used on the first socket to belong to a group before bind or
on any socket in the group after bind.
This change includes refactoring of the existing sk_filter code to
allow reuse of the existing BPF filter validation checks.
Signed-off-by: Craig Gallek <kraig@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Include a struct sock_reuseport instance when a UDP socket binds to
a specific address for the first time with the reuseport flag set.
When selecting a socket for an incoming UDP packet, use the information
available in sock_reuseport if present.
This required adding an additional field to the UDP source address
equality function to differentiate between exact and wildcard matches.
The original use case allowed wildcard matches when checking for
existing port uses during bind. The new use case of adding a socket
to a reuseport group requires exact address matching.
Performance test (using a machine with 2 CPU sockets and a total of
48 cores): Create reuseport groups of varying size. Use one socket
from this group per user thread (pinning each thread to a different
core) calling recvmmsg in a tight loop. Record number of messages
received per second while saturating a 10G link.
10 sockets: 18% increase (~2.8M -> 3.3M pkts/s)
20 sockets: 14% increase (~2.9M -> 3.3M pkts/s)
40 sockets: 13% increase (~3.0M -> 3.4M pkts/s)
This work is based off a similar implementation written by
Ying Cai <ycai@google.com> for implementing policy-based reuseport
selection.
Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
struct sock_reuseport is an optional shared structure referenced by each
socket belonging to a reuseport group. When a socket is bound to an
address/port not yet in use and the reuseport flag has been set, the
structure will be allocated and attached to the newly bound socket.
When subsequent calls to bind are made for the same address/port, the
shared structure will be updated to include the new socket and the
newly bound socket will reference the group structure.
Usually, when an incoming packet was destined for a reuseport group,
all sockets in the same group needed to be considered before a
dispatching decision was made. With this structure, an appropriate
socket can be found after looking up just one socket in the group.
This shared structure will also allow for more complicated decisions to
be made when selecting a socket (eg a BPF filter).
This work is based off a similar implementation written by
Ying Cai <ycai@google.com> for implementing policy-based reuseport
selection.
Signed-off-by: Craig Gallek <kraig@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the first NFC pull request for 4.5 and it brings:
- A new driver for the STMicroelectronics ST95HF NFC chipset.
The ST95HF is an NFC digital transceiver with an embedded analog
front-end and as such relies on the Linux NFC digital
implementation. This is the 3rd user of the NFC digital stack.
- ACPI support for the ST st-nci and st21nfca drivers.
- A small improvement for the nfcsim driver, as we can now tune
the Rx delay through sysfs.
- A bunch of minor cleanups and small fixes from Christophe Ricard,
for a few drivers and the NFC core code.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJWgsv0AAoJEIqAPN1PVmxKvukP/3eJwA+chAUF89/fqwqFTaJN
fffLoOxx2OBIbTXD2VV36yw4bAo9tbKDAYBZiot3Ig7Kg0SeCJ5oPA9xCVnWPrEY
hxFAldvl+lWQs8nrUOgUZItiFBeUPdfW9YX/yKhUZVc+602nUG/e/+6x8B5MhIce
SAfgCyd0c16DApltP7sw1muyZMvsO6Ow6dyNzDUVYZuabvEhe3SLSj9KFJi7Thsp
h41Iv+bwPLhwF4RXGA6rei/gdEDSMRohprdj3uTDiTarGW+OpcAO0zWACS5m4eR0
zF19+HjGPUk/LpFRaU31xX0ZQQjOTmmfsOt4FBb3P7oJx47egycsadLYPexeG7nj
ruyS6ezlRX1I/tZsnyLNJK92mK5TXYLz2uJ8r2ii/BgPNE+AErB3zKCC+EjXzWhh
AvClGu5b88WJLxoq3I3l5evPwGhebGZ8N/1uiFsHOxvzKVLgxwOmNLRGN4XXxB2i
UbIHgBb6smsu/l+3q9R83kfoMaoWnr+OUIi2QPQVDt/K7t1LfsCuIhzcGSgo1VuW
fGlA1iu+CNDknofeCl4JDo2UXAETO4gdKWw87GXeUcbbraLUczZeO7FFLZqxbMYc
OCaPYshmVFeZRypYdRWDHw67ivj0/h+9iq4PP1XOROkRFH746dD/p4yamJwVi20B
samZ8VPwzgH3/ohQJyX3
=VFUH
-----END PGP SIGNATURE-----
Merge tag 'nfc-next-4.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/nfc-next
Samuel Ortiz says:
====================
NFC 4.5 pull request
This is the first NFC pull request for 4.5 and it brings:
- A new driver for the STMicroelectronics ST95HF NFC chipset.
The ST95HF is an NFC digital transceiver with an embedded analog
front-end and as such relies on the Linux NFC digital
implementation. This is the 3rd user of the NFC digital stack.
- ACPI support for the ST st-nci and st21nfca drivers.
- A small improvement for the nfcsim driver, as we can now tune
the Rx delay through sysfs.
- A bunch of minor cleanups and small fixes from Christophe Ricard,
for a few drivers and the NFC core code.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Backport of this upstream commit into stable kernels :
89c22d8c3b ("net: Fix skb csum races when peeking")
exposed a bug in udp stack vs MSG_PEEK support, when user provides
a buffer smaller than skb payload.
In this case,
skb_copy_and_csum_datagram_iovec(skb, sizeof(struct udphdr),
msg->msg_iov);
returns -EFAULT.
This bug does not happen in upstream kernels since Al Viro did a great
job to replace this into :
skb_copy_and_csum_datagram_msg(skb, sizeof(struct udphdr), msg);
This variant is safe vs short buffers.
For the time being, instead reverting Herbert Xu patch and add back
skb->ip_summed invalid changes, simply store the result of
udp_lib_checksum_complete() so that we avoid computing the checksum a
second time, and avoid the problematic
skb_copy_and_csum_datagram_iovec() call.
This patch can be applied on recent kernels as it avoids a double
checksumming, then backported to stable kernels as a bug fix.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since 79c441ae50 ("ppp: implement x-netns support"), the PPP layer
calls skb_scrub_packet() whenever the skb is received on the PPP
device. Manually resetting packet meta-data in the L2TP layer is thus
redundant.
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ieee802154_llsec_ops structure is never modified, so declare it as
const.
Done with the help of Coccinelle.
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Acked-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The IDs should all be for Broadcom BCM43241 module, and
hci_bcm is now the proper driver for them. This removes one
of two different ways of handling PM with the module.
Cc: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
You can use this to forward packets from ingress to the egress path of
the specified interface. This provides a fast path to bounce packets
from one interface to another specific destination interface.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
A _lot_ of ->write() instances were open-coding it; some are
converted to memdup_user_nul(), a lot more remain...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
You can use this to duplicate packets and inject them at the egress path
of the specified interface. This duplication allows you to inspect
traffic from the dummy or any other interface dedicated to this purpose.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch allows you to invert the ratelimit matching criteria, so you
can match packets over the ratelimit. This is required to support what
hashlimit does.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Johan Hedberg says:
====================
pull request: bluetooth-next 2015-12-31
Here's (probably) the last bluetooth-next pull request for the 4.5
kernel:
- Add support for BCM2E65 ACPI ID
- Minor fixes/cleanups in the bcm203x & bfusb drivers
- Minor debugfs related fix in 6lowpan code
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Ethernet PHYs can maintain statistics, for example errors while idle
and receive errors. Add an ethtool mechanism to retrieve these
statistics, using the same model as MAC statistics.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The missing break means that we always return EAFNOSUPPORT when
faced with a request for an IPv6 loopback address.
Reported-by: coverity (CID 401987)
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
In sctp_close, sctp_make_abort_user may return NULL because of memory
allocation failure. If this happens, it will bypass any state change
and never free the assoc. The assoc has no chance to be freed and it
will be kept in memory with the state it had even after the socket is
closed by sctp_close().
So if sctp_make_abort_user fails to allocate memory, we should abort
the asoc via sctp_primitive_ABORT as well. Just like the annotation in
sctp_sf_cookie_wait_prm_abort and sctp_sf_do_9_1_prm_abort said,
"Even if we can't send the ABORT due to low memory delete the TCB.
This is a departure from our typical NOMEM handling".
But then the chunk is NULL (low memory) and the SCTP_CMD_REPLY cmd would
dereference the chunk pointer, and system crash. So we should add
SCTP_CMD_REPLY cmd only when the chunk is not NULL, just like other
places where it adds SCTP_CMD_REPLY cmd.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit ceb5d58b21 ("net: fix sock_wake_async() rcu protection") from
the current 4.4 release cycle introduced a new flags member in
struct socket_wq and moved SOCKWQ_ASYNC_NOSPACE and SOCKWQ_ASYNC_WAITDATA
from struct socket's flags member into that new place.
Unfortunately, the new flags field is never initialized properly, at least
not for the struct socket_wq instance created in sock_alloc_inode().
One particular issue I encountered because of this is that my GNU Emacs
failed to draw anything on my desktop -- i.e. what I got is a transparent
window, including the title bar. Bisection lead to the commit mentioned
above and further investigation by means of strace told me that Emacs
is indeed speaking to my Xorg through an O_ASYNC AF_UNIX socket. This is
reproducible 100% of times and the fact that properly initializing the
struct socket_wq ->flags fixes the issue leads me to the conclusion that
somehow SOCKWQ_ASYNC_WAITDATA got set in the uninitialized ->flags,
preventing my Emacs from receiving any SIGIO's due to data becoming
available and it got stuck.
Make sock_alloc_inode() set the newly created struct socket_wq's ->flags
member to zero.
Fixes: ceb5d58b21 ("net: fix sock_wake_async() rcu protection")
Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 5b48bb8506c5 ("openvswitch: Fix helper reference leak") fixed a
reference leak on helper objects, but inadvertently introduced a leak on
the ct template.
Previously, ct_info.ct->general.use was initialized to 0 by
nf_ct_tmpl_alloc() and only incremented when ovs_ct_copy_action()
returned successful. If an error occurred while adding the helper or
adding the action to the actions buffer, the __ovs_ct_free_action()
cleanup would use nf_ct_put() to free the entry; However, this relies on
atomic_dec_and_test(ct_info.ct->general.use). This reference must be
incremented first, or nf_ct_put() will never free it.
Fix the issue by acquiring a reference to the template immediately after
allocation.
Fixes: cae3a26275 ("openvswitch: Allow attaching helpers to ct action")
Fixes: 5b48bb8506c5 ("openvswitch: Fix helper reference leak")
Signed-off-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
I've moved the check for "number_destination_params" forward
a few lines to avoid leaking "cmd".
Fixes: caa575a86e ('NFC: nci: fix possible crash in nci_core_conn_create')
Acked-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Add support for missing HCI event EVT_CONNECTIVITY and forward
it to userspace.
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
net/nfc/nci/hci.c: In function nci_hci_connect_gate :
net/nfc/nci/hci.c:679: warning: comparison is always false due to limited range of data type
In case of error, nci_hci_create_pipe() returns NCI_HCI_INVALID_PIPE,
and not a negative error code.
Correct the check to fix this.
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
The definition of DIGITAL_PROTO_NFCA_RF_TECH is modified to support
ISO14443 Type4A tags. Without this change it is not possible to start
polling for ISO14443 Type4A tags from the initiator side.
Signed-off-by: Shikha Singh <shikha.singh@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
These patches mostly fix send queue ordering issues inside the NFSoRDMA
client, but there are also two patches from Dan Carpenter fixing up smatch
warnings.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJWdG7tAAoJENfLVL+wpUDrqWEP/j+K2BVkanx0/Yi1WjFHPuTW
sKc02qeuhajVcS4QdLygD+4fcnVHsqfmStOg1dHLkQpG62mot8R59aLctISjndfo
Q8O0JlUPYJSUZ1A5e++8BqHi6IiN+hBw+4rxuyqseEvsgcWI1bzFl/Ajm823fjMB
HjcpLPokKmZvgIRpasBtA4O6Lg01rDb80v9a/DFPOa8cwSVmervgHa1ZEGU/KyZH
5taRYUiIIkankEJE4VqQQ86bWIGKbuGlsHH+7BcDxtqNISQhq8uZsC2EMOxtdjnf
IKHt6B6IVZoMfERSaRHOCkK30sEDUi0u3zYkgodFq7xNTW7MBwmCs/xEUqhLtBoe
KV2imTNOuwzl6v3Vnjm9yKa+0PF8ejg3VYNZOrZWSeBaGPSmyiDlQ05zdcH+F5i7
OQeo1ebIb2Wu6PJgZwA3QbNjWkddlwF7WrA58PbHgxEyYJeMHJ697g2cQiUdCBxl
mbnkHyZqcG4ko2hbt5kNM+1TWQoY+HdGs8BJAV6W+UBO6ZYNL+ciM23uFSb5XiJC
bWUPgUEu6vepW0JaCMt0clpFlWtLHFzclwkgOiWfiNKOtmHeENzE0L4JmQX6QNa6
xW/Vn6ctxVx6AzTNdKZO8VK+XUgkH/7Y/x3H/c+zAeq25M2CJs36zdgkDGc9v4h9
eVEBs2tHgI23aAk8brgY
=kwhB
-----END PGP SIGNATURE-----
Merge tag 'nfs-rdma-4.5' of git://git.linux-nfs.org/projects/anna/nfs-rdma
NFS: NFSoRDMA Client Side Changes
These patches mostly fix send queue ordering issues inside the NFSoRDMA
client, but there are also two patches from Dan Carpenter fixing up smatch
warnings.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* tag 'nfs-rdma-4.5' of git://git.linux-nfs.org/projects/anna/nfs-rdma:
xprtrdma: Revert commit e7104a2a96 ('xprtrdma: Cap req_cqinit').
xprtrdma: Invalidate in the RPC reply handler
xprtrdma: Add ro_unmap_sync method for all-physical registration
xprtrdma: Add ro_unmap_sync method for FMR
xprtrdma: Add ro_unmap_sync method for FRWR
xprtrdma: Introduce ro_unmap_sync method
xprtrdma: Move struct ib_send_wr off the stack
xprtrdma: Disable RPC/RDMA backchannel debugging messages
xprtrdma: xprt_rdma_free() must not release backchannel reqs
xprtrdma: Fix additional uses of spin_lock_irqsave(rb_lock)
xprtrdma: checking for NULL instead of IS_ERR()
xprtrdma: clean up some curly braces
The following sequence inside a batch, although not very useful, is
valid:
add table foo
...
delete table foo
This may be generated by some robot while applying some incremental
upgrade, so remove the defensive checks against this.
This patch keeps the check on the get/dump path by now, we have to
replace the inactive flag by introducing object generations.
Reported-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If the netdevice is destroyed, the resources that are attached should
be released too as they belong to the device that is now gone.
Suggested-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We have to release the existing objects on netns removal otherwise we
leak them. Chains are unregistered in first place to make sure no
packets are walking on our rules and sets anymore.
The object release happens by when we unregister the family via
nft_release_afinfo() which is called from nft_unregister_afinfo() from
the corresponding __net_exit path in every family.
Reported-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
xs_reclassify_socket4() and friends used to be called directly.
xs_reclassify_socket() is called instead nowadays.
The xs_reclassify_socketX() helper functions are empty when
CONFIG_DEBUG_LOCK_ALLOC is not defined. Drop them since they have no
callers.
Note that AF_LOCAL still calls xs_reclassify_socketu() directly but is
easily converted to generic xs_reclassify_socket().
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Accepted or peeled off sockets were missing a security label (e.g.
SELinux) which means that socket was in "unlabeled" state.
This patch clones the sock's label from the parent sock and resolves the
issue (similar to AF_BLUETOOTH protocol family).
Cc: Paul Moore <pmoore@redhat.com>
Cc: David Teigland <teigland@redhat.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit cacc062152 ("sctp: use GFP_USER for user-controlled kmalloc")
missed two other spots.
For connectx, as it's more likely to be used by kernel users of the API,
it detects if GFP_USER should be used or not.
Fixes: cacc062152 ("sctp: use GFP_USER for user-controlled kmalloc")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
By moving stats update into iptunnel_xmit(), we can simplify
iptunnel_xmit() usage. With this change there is no need to
call another function (iptunnel_xmit_stats()) to update stats
in tunnel xmit code path.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
kobj_to_dev has been defined in linux/device.h, so I replace to_dev
with it.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Marc Haber reported we don't honor interface indexes when we receive link
local router addresses in router advertisements. Luckily the non-strict
version of ipv6_chk_addr already does the correct job here, so we can
simply use it to lighten the checks and use those addresses by default
without any configuration change.
Link: <http://permalink.gmane.org/gmane.linux.network/391348>
Reported-by: Marc Haber <mh+netdev@zugschlus.de>
Cc: Marc Haber <mh+netdev@zugschlus.de>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a function svc_age_temp_xprts_now() to close temporary transports
whose xpt_local matches the address passed in server_addr immediately
instead of waiting for them to be closed by the timer function.
The function is intended to be used by notifier_blocks that will be
added to nfsd and lockd that will run when an ip address is deleted.
This will eliminate the ACK storms and client hangs that occur in
HA-NFS configurations where nfsd & lockd is left running on the cluster
nodes all the time and the NFS 'service' is migrated back and forth
within a short timeframe.
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Hannes points out that when we generate tcp reset for timewait sockets we
pretend we found no socket and pass NULL sk to tcp_vX_send_reset().
Make it cope with inet tw sockets and then provide tw sk.
This makes RSTs appear on correct interface when SO_BINDTODEVICE is used.
Packetdrill test case:
// want default route to be used, we rely on BINDTODEVICE
`ip route del 192.0.2.0/24 via 192.168.0.2 dev tun0`
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
// test case still works due to BINDTODEVICE
0.001 setsockopt(3, SOL_SOCKET, SO_BINDTODEVICE, "tun0", 4) = 0
0.100...0.200 connect(3, ..., ...) = 0
0.100 > S 0:0(0) <mss 1460,sackOK,nop,nop>
0.200 < S. 0:0(0) ack 1 win 32792 <mss 1460,sackOK,nop,nop>
0.200 > . 1:1(0) ack 1
0.210 close(3) = 0
0.210 > F. 1:1(0) ack 1 win 29200
0.300 < . 1:1(0) ack 2 win 46
// more data while in FIN_WAIT2, expect RST
1.300 < P. 1:1001(1000) ack 1 win 46
// fails without this change -- default route is used
1.301 > R 1:1(0) win 0
Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_md5_do_lookup requires a full socket, so once we extend
_send_reset() to also accept timewait socket we would have to change
if (!sk && hash_location)
to something like
if ((!sk || !sk_fullsock(sk)) && hash_location) {
...
} else {
(sk && sk_fullsock(sk)) tcp_md5_do_lookup()
}
Switch the two branches: check if we have a socket first, then
fall back to a listener lookup if we saw a md5 option (hash_location).
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When sysctl performs restrict writes, it allows to write from
a middle position of a sysctl file, which requires us to initialize
the table data before calling proc_dostring() for the write case.
Fixes: 3d1bec9932 ("ipv6: introduce secret_stable to ipv6_devconf")
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net): ipsec 2015-12-22
Just one patch to fix dst_entries_init with multiple namespaces.
From Dan Streetman.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When closing a listen socket, tcp_abort currently calls
tcp_done without clearing the request queue. If the socket has a
child socket that is established but not yet accepted, the child
socket is then left without a parent, causing a leak.
Fix this by setting the socket state to TCP_CLOSE and calling
inet_csk_listen_stop with the socket lock held, like tcp_close
does.
Tested using net_test. With this patch, calling SOCK_DESTROY on a
listen socket that has an established but not yet accepted child
socket results in the parent and the child being closed, such
that they no longer appear in sock_diag dumps.
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip6addrlbl_get() has never worked. If ip6addrlbl_hold() succeeded,
ip6addrlbl_get() will exit with '-ESRCH'. If ip6addrlbl_hold() failed,
ip6addrlbl_get() will use about to be free ip6addrlbl_entry pointer.
Fix this by inverting ip6addrlbl_hold() check.
Fixes: 2a8cc6c890 ("[IPV6] ADDRCONF: Support RFC3484 configurable address selection policy table.")
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Cong Wang <cwang@twopensource.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bridge's ageing time is offloaded to hardware when:
1) A port joins a bridge
2) The ageing time of the bridge is changed
In the first case the ageing time is offloaded as jiffies, but in the
second case it's offloaded as clock_t, which is what existing switchdev
drivers expect to receive.
Fixes: 6ac311ae8b ("Adding switchdev ageing notification on port bridged")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It looks like an attempt to use CPU notifier here which was never
completed. Nobody tried to wire it up completely since 2k9. So I unwind
this code and get rid of everything not required. Oh look! 19 lines were
removed while code still does the same thing.
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Tested-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains two netfilter fixes:
1) Oneliner from Florian to dump missing NFT_CT_L3PROTOCOL netlink
attribute, from Florian Westphal.
2) Another oneliner for nf_tables to use skb->protocol from the new
netdev family, we can't assume ethernet there.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead, use the cached copy of the attributes present on the device.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Instead, use the cached copy of the attributes present on the device.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patches moves the debugfs interface related register after
netdevice register. The function lowpan_dev_debugfs_init will use
"dev->name" which can be before register_netdevice a format string.
The function register_netdevice will evaluate the format string if
necessary and replace "dev->name" to the real interface name.
Reported-by: Lukasz Duda <lukasz.duda@nordicsemi.no>
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Acked-by: Lukasz Duda <lukasz.duda@nordicsemi.no>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Use list_for_each_entry*() instead of list_for_each*() to simplify
the code.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
In a set action tunnel attributes should be encoded in a
nested action.
I noticed this because ovs-dpctl was reporting an error
when dumping flows due to the incorrect encoding of tunnel attributes
in a set action.
Fixes: fc4099f172 ("openvswitch: Fix egress tunnel info.")
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IP-TTL case is already handled in ip_tunnel_ioctl() API.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adding support for SYN_RECV request sockets to tcp_abort()
is quite easy after our tcp listener rewrite.
Note that we also need to better handle listeners, or we might
leak not yet accepted children, because of a missing
inet_csk_listen_stop() call.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Tested-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Comment says "User BPF's register A is mapped to our BPF register 6",
which is actually wrong as the mapping is on register 0. This can
already be inferred from the code itself. So just remove it before
someone makes assumptions based on that. Only code tells truth. ;)
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Back in the days where eBPF (or back then "internal BPF" ;->) was not
exposed to user space, and only the classic BPF programs internally
translated into eBPF programs, we missed the fact that for classic BPF
A and X needed to be cleared. It was fixed back then via 83d5b7ef99
("net: filter: initialize A and X registers"), and thus classic BPF
specifics were added to the eBPF interpreter core to work around it.
This added some confusion for JIT developers later on that take the
eBPF interpreter code as an example for deriving their JIT. F.e. in
f75298f5c3 ("s390/bpf: clear correct BPF accumulator register"), at
least X could leak stack memory. Furthermore, since this is only needed
for classic BPF translations and not for eBPF (verifier takes care
that read access to regs cannot be done uninitialized), more complexity
is added to JITs as they need to determine whether they deal with
migrations or native eBPF where they can just omit clearing A/X in
their prologue and thus reduce image size a bit, see f.e. cde66c2d88
("s390/bpf: Only clear A and X for converted BPF programs"). In other
cases (x86, arm64), A and X is being cleared in the prologue also for
eBPF case, which is unnecessary.
Lets move this into the BPF migration in bpf_convert_filter() where it
actually belongs as long as the number of eBPF JITs are still few. It
can thus be done generically; allowing us to remove the quirk from
__bpf_prog_run() and to slightly reduce JIT image size in case of eBPF,
while reducing code duplication on this matter in current(/future) eBPF
JITs.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Tested-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Cc: Zi Shen Lim <zlim.lnx@gmail.com>
Cc: Yang Shi <yang.shi@linaro.org>
Acked-by: Yang Shi <yang.shi@linaro.org>
Acked-by: Zi Shen Lim <zlim.lnx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When hacking tc programs with eBPF, one of the issues that come up
from time to time is to load addresses from headers. In eBPF as in
classic BPF, we have BPF_LD | BPF_ABS | BPF_{B,H,W} instructions that
extract a byte, half-word or word out of the skb data though helpers
such as bpf_load_pointer() (interpreter case).
F.e. extracting a whole IPv6 address could possibly look like ...
union v6addr {
struct {
__u32 p1;
__u32 p2;
__u32 p3;
__u32 p4;
};
__u8 addr[16];
};
[...]
a.p1 = htonl(load_word(skb, off));
a.p2 = htonl(load_word(skb, off + 4));
a.p3 = htonl(load_word(skb, off + 8));
a.p4 = htonl(load_word(skb, off + 12));
[...]
/* access to a.addr[...] */
This work adds a complementary helper bpf_skb_load_bytes() (we also
have bpf_skb_store_bytes()) as an alternative where the same call
would look like from an eBPF program:
ret = bpf_skb_load_bytes(skb, off, addr, sizeof(addr));
Same verifier restrictions apply as in ffeedafbf0 ("bpf: introduce
current->pid, tgid, uid, gid, comm accessors") case, where stack memory
access needs to be statically verified and thus guaranteed to be
initialized in first use (otherwise verifier cannot tell whether a
subsequent access to it is valid or not as it's runtime dependent).
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains the first batch of Netfilter updates for
the upcoming 4.5 kernel. This batch contains userspace netfilter header
compilation fixes, support for packet mangling in nf_tables, the new
tracing infrastructure for nf_tables and cgroup2 support for iptables.
More specifically, they are:
1) Two patches to include dependencies in our netfilter userspace
headers to resolve compilation problems, from Mikko Rapeli.
2) Four comestic cleanup patches for the ebtables codebase, from Ian Morris.
3) Remove duplicate include in the netfilter reject infrastructure,
from Stephen Hemminger.
4) Two patches to simplify the netfilter defragmentation code for IPv6,
patch from Florian Westphal.
5) Fix root ownership of /proc/net netfilter for unpriviledged net
namespaces, from Philip Whineray.
6) Get rid of unused fields in struct nft_pktinfo, from Florian Westphal.
7) Add mangling support to our nf_tables payload expression, from
Patrick McHardy.
8) Introduce a new netlink-based tracing infrastructure for nf_tables,
from Florian Westphal.
9) Change setter functions in nfnetlink_log to be void, from
Rami Rosen.
10) Add netns support to the cttimeout infrastructure.
11) Add cgroup2 support to iptables, from Tejun Heo.
12) Introduce nfnl_dereference_protected() in nfnetlink, from Florian.
13) Add support for mangling pkttype in the nf_tables meta expression,
also from Florian.
BTW, I need that you pull net into net-next, I have another batch that
requires changes that I don't yet see in net.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The root of the problem was that sends (especially unsignalled
FASTREG and LOCAL_INV Work Requests) were not properly flow-
controlled, which allowed a send queue overrun.
Now that the RPC/RDMA reply handler waits for invalidation to
complete, the send queue is properly flow-controlled. Thus this
limit is no longer necessary.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
There is a window between the time the RPC reply handler wakes the
waiting RPC task and when xprt_release() invokes ops->buf_free.
During this time, memory regions containing the data payload may
still be accessed by a broken or malicious server, but the RPC
application has already been allowed access to the memory containing
the RPC request's data payloads.
The server should be fenced from client memory containing RPC data
payloads _before_ the RPC application is allowed to continue.
This change also more strongly enforces send queue accounting. There
is a maximum number of RPC calls allowed to be outstanding. When an
RPC/RDMA transport is set up, just enough send queue resources are
allocated to handle registration, Send, and invalidation WRs for
each those RPCs at the same time.
Before, additional RPC calls could be dispatched while invalidation
WRs were still consuming send WQEs. When invalidation WRs backed
up, dispatching additional RPCs resulted in a send queue overrun.
Now, the reply handler prevents RPC dispatch until invalidation is
complete. This prevents RPC call dispatch until there are enough
send queue resources to proceed.
Still to do: If an RPC exits early (say, ^C), the reply handler has
no opportunity to perform invalidation. Currently, xprt_rdma_free()
still frees remaining RDMA resources, which could deadlock.
Additional changes are needed to handle invalidation properly in this
case.
Reported-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
physical's ro_unmap is synchronous already. The new ro_unmap_sync
method just has to DMA unmap all MRs associated with the RPC
request.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
FMR's ro_unmap method is already synchronous because ib_unmap_fmr()
is a synchronous verb. However, some improvements can be made here.
1. Gather all the MRs for the RPC request onto a list, and invoke
ib_unmap_fmr() once with that list. This reduces the number of
doorbells when there is more than one MR to invalidate
2. Perform the DMA unmap _after_ the MRs are unmapped, not before.
This is critical after invalidating a Write chunk.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
FRWR's ro_unmap is asynchronous. The new ro_unmap_sync posts
LOCAL_INV Work Requests and waits for them to complete before
returning.
Note also, DMA unmapping is now done _after_ invalidation.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
In the current xprtrdma implementation, some memreg strategies
implement ro_unmap synchronously (the MR is knocked down before the
method returns) and some asynchonously (the MR will be knocked down
and returned to the pool in the background).
To guarantee the MR is truly invalid before the RPC consumer is
allowed to resume execution, we need an unmap method that is
always synchronous, invoked from the RPC/RDMA reply handler.
The new method unmaps all MRs for an RPC. The existing ro_unmap
method unmaps only one MR at a time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
For FRWR FASTREG and LOCAL_INV, move the ib_*_wr structure off
the stack. This allows frwr_op_map and frwr_op_unmap to chain
WRs together without limit to register or invalidate a set of MRs
with a single ib_post_send().
(This will be for chaining LOCAL_INV requests).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Preserve any rpcrdma_req that is attached to rpc_rqst's allocated
for the backchannel. Otherwise, after all the pre-allocated
backchannel req's are consumed, incoming backward calls start
writing on freed memory.
Somehow this hunk got lost.
Fixes: f531a5dbc4 ('xprtrdma: Pre-allocate backward rpc_rqst')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up.
rb_lock critical sections added in rpcrdma_ep_post_extra_recv()
should have first been converted to use normal spin_lock now that
the reply handler is a work queue.
The backchannel set up code should use the appropriate helper
instead of open-coding a rb_recv_bufs list add.
Problem introduced by glib patch re-ordering on my part.
Fixes: f531a5dbc4 ('xprtrdma: Pre-allocate backward rpc_rqst')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The rpcrdma_create_req() function returns error pointers or success. It
never returns NULL.
Fixes: f531a5dbc4 ('xprtrdma: Pre-allocate backward rpc_rqst and send/receive buffers')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
It doesn't matter either way, but the curly braces were clearly intended
here. It causes a Smatch warning.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Allow accepted sockets to derive their sk_bound_dev_if setting from the
l3mdev domain in which the packets originated. A sysctl setting is added
to control the behavior which is similar to sk_mark and
sysctl_tcp_fwmark_accept.
This effectively allow a process to have a "VRF-global" listen socket,
with child sockets bound to the VRF device in which the packet originated.
A similar behavior can be achieved using sk_mark, but a solution using marks
is incomplete as it does not handle duplicate addresses in different L3
domains/VRFs. Allowing sockets to inherit the sk_bound_dev_if from l3mdev
domain provides a complete solution.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new address generator mode, using the stable address generator
with an automatically generated secret. This is intended as a default
address generator mode for device types with no EUI64 implementation.
The new generator is used for ARPHRD_NONE interfaces initially, adding
default IPv6 autoconf support to e.g. tun interfaces.
If the addrgenmode is set to 'random', either by default or manually,
and no stable secret is available, then a random secret is used as
input for the stable-privacy address generator. The secret can be
read and modified like manually configured secrets, using the proc
interface. Modifying the secret will change the addrgen mode to
'stable-privacy' to indicate that it operates on a known secret.
Existing behaviour of the 'stable-privacy' mode is kept unchanged. If
a known secret is available when the device is created, then the mode
will default to 'stable-privacy' as before. The mode can be manually
set to 'random' but it will behave exactly like 'stable-privacy' in
this case. The secret will not change.
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: 吉藤英明 <hideaki.yoshifuji@miraclelinux.com>
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The recently added generic ILA translation facility fails to
build when CONFIG_NETFILTER is disabled:
net/ipv6/ila/ila_xlat.c:229:20: warning: 'struct nf_hook_state' declared inside parameter list
net/ipv6/ila/ila_xlat.c:235:27: error: array type has incomplete element type 'struct nf_hook_ops'
static struct nf_hook_ops ila_nf_hook_ops[] __read_mostly = {
This adds an explicit Kconfig dependency to avoid that case.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 7f00feaf10 ("ila: Add generic ILA translation facility")
Signed-off-by: David S. Miller <davem@davemloft.net>
one nft userspace test case fails with
'ct l3proto original ipv4' mismatches 'ct l3proto ipv4'
... because NFTA_CT_DIRECTION attr is missing.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This allows to redirect bridged packets to local machine:
ether type ip ether daddr set aa:53:08:12:34:56 meta pkttype set unicast
Without 'set unicast', ip stack discards PACKET_OTHERHOST skbs.
It is also useful to add support for a '-m cluster like' nft rule
(where switch floods packets to several nodes, and each cluster node
node processes a subset of packets for load distribution).
Mangling is restricted to HOST/OTHER/BROAD/MULTICAST, i.e. you cannot set
skb->pkt_type to PACKET_KERNEL or change PACKET_LOOPBACK to PACKET_HOST.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Conflicts:
drivers/net/geneve.c
Here we had an overlapping change, where in 'net' the extraneous stats
bump was being removed whilst in 'net-next' the final argument to
udp_tunnel6_xmit_skb() was being changed.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Fix uninitialized variable warnings in nfnetlink_queue, a lot of
people reported this... From Arnd Bergmann.
2) Don't init mutex twice in i40e driver, from Jesse Brandeburg.
3) Fix spurious EBUSY in rhashtable, from Herbert Xu.
4) Missing DMA unmaps in mvpp2 driver, from Marcin Wojtas.
5) Fix race with work structure access in pppoe driver causing
corruptions, from Guillaume Nault.
6) Fix OOPS due to sh_eth_rx() not checking whether netdev_alloc_skb()
actually succeeded or not, from Sergei Shtylyov.
7) Don't lose flags when settifn IFA_F_OPTIMISTIC in ipv6 code, from
Bjørn Mork.
8) VXLAN_HD_RCO defined incorrectly, fix from Jiri Benc.
9) Fix clock source used for cookies in SCTP, from Marcelo Ricardo
Leitner.
10) aurora driver needs HAS_DMA dependency, from Geert Uytterhoeven.
11) ndo_fill_metadata_dst op of vxlan has to handle ipv6 tunneling
properly as well, from Jiri Benc.
12) Handle request sockets properly in xfrm layer, from Eric Dumazet.
13) Double stats update in ipv6 geneve transmit path, fix from Pravin B
Shelar.
14) sk->sk_policy[] needs RCU protection, and as a result
xfrm_policy_destroy() needs to free policies using an RCU grace
period, from Eric Dumazet.
15) SCTP needs to clone ipv6 tx options in order to avoid use after
free, from Eric Dumazet.
16) Missing kbuild export if ila.h, from Stephen Hemminger.
17) Missing mdiobus_alloc() return value checking in mdio-mux.c, from
Tobias Klauser.
18) Validate protocol value range in ->create() methods, from Hannes
Frederic Sowa.
19) Fix early socket demux races that result in illegal dst reuse, from
Eric Dumazet.
20) Validate socket address length in pptp code, from WANG Cong.
21) skb_reorder_vlan_header() uses incorrect offset and can corrupt
packets, from Vlad Yasevich.
22) Fix memory leaks in nl80211 registry code, from Ola Olsson.
23) Timeout loop count handing fixes in mISDN, xgbe, qlge, sfc, and
qlcnic. From Dan Carpenter.
24) msg.msg_iocb needs to be cleared in recvfrom() otherwise, for
example, AF_ALG will interpret it as an async call. From Tadeusz
Struk.
25) inetpeer_set_addr_v4 forgets to initialize the 'vif' field, from
Eric Dumazet.
26) rhashtable enforces the minimum table size not early enough,
breaking how we calculate the per-cpu lock allocations. From
Herbert Xu.
27) Fix FCC port lockup in 82xx driver, from Martin Roth.
28) FOU sockets need to be freed using RCU, from Hannes Frederic Sowa.
29) Fix out-of-bounds access in __skb_complete_tx_timestamp() and
sock_setsockopt() wrt. timestamp handling. From WANG Cong.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (117 commits)
net: check both type and procotol for tcp sockets
drivers: net: xgene: fix Tx flow control
tcp: restore fastopen with no data in SYN packet
af_unix: Revert 'lock_interruptible' in stream receive code
fou: clean up socket with kfree_rcu
82xx: FCC: Fixing a bug causing to FCC port lock-up
gianfar: Don't enable RX Filer if not supported
net: fix warnings in 'make htmldocs' by moving macro definition out of field declaration
rhashtable: Fix walker list corruption
rhashtable: Enforce minimum size on initial hash table
inet: tcp: fix inetpeer_set_addr_v4()
ipv6: automatically enable stable privacy mode if stable_secret set
net: fix uninitialized variable issue
bluetooth: Validate socket address length in sco_sock_bind().
net_sched: make qdisc_tree_decrease_qlen() work for non mq
ser_gigaset: remove unnecessary kfree() calls from release method
ser_gigaset: fix deallocation of platform device structure
ser_gigaset: turn nonsense checks into WARN_ON
ser_gigaset: fix up NULL checks
qlcnic: fix a timeout loop
...
Dmitry reported the following out-of-bound access:
Call Trace:
[<ffffffff816cec2e>] __asan_report_load4_noabort+0x3e/0x40
mm/kasan/report.c:294
[<ffffffff84affb14>] sock_setsockopt+0x1284/0x13d0 net/core/sock.c:880
[< inline >] SYSC_setsockopt net/socket.c:1746
[<ffffffff84aed7ee>] SyS_setsockopt+0x1fe/0x240 net/socket.c:1729
[<ffffffff85c18c76>] entry_SYSCALL_64_fastpath+0x16/0x7a
arch/x86/entry/entry_64.S:185
This is because we mistake a raw socket as a tcp socket.
We should check both sk->sk_type and sk->sk_protocol to ensure
it is a tcp socket.
Willem points out __skb_complete_tx_timestamp() needs to fix as well.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yuchung tracked a regression caused by commit 57be5bdad7 ("ip: convert
tcp_sendmsg() to iov_iter primitives") for TCP Fast Open.
Some Fast Open users do not actually add any data in the SYN packet.
Fixes: 57be5bdad7 ("ip: convert tcp_sendmsg() to iov_iter primitives")
Reported-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With b3ca9b02b0, the AF_UNIX SOCK_STREAM
receive code was changed from using mutex_lock(&u->readlock) to
mutex_lock_interruptible(&u->readlock) to prevent signals from being
delayed for an indefinite time if a thread sleeping on the mutex
happened to be selected for handling the signal. But this was never a
problem with the stream receive code (as opposed to its datagram
counterpart) as that never went to sleep waiting for new messages with the
mutex held and thus, wouldn't cause secondary readers to block on the
mutex waiting for the sleeping primary reader. As the interruptible
locking makes the code more complicated in exchange for no benefit,
change it back to using mutex_lock.
Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Same as in Windows, we miss IPV6_HDRINCL for SOL_IPV6 and SOL_RAW.
The SOL_IP/IP_HDRINCL is not available for IPv6 sockets.
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the support for adding expire value to routes, requested by
Tom Gundersen <teg@jklm.no> for systemd-networkd, and NetworkManager
wants it too.
implement it by adding the new RTNETLINK attribute RTA_EXPIRES.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
fou->udp_offloads is managed by RCU. As it is actually included inside
the fou sockets, we cannot let the memory go out of scope before a grace
period. We either can synchronize_rcu or switch over to kfree_rcu to
manage the sockets. kfree_rcu seems appropriate as it is used by vxlan
and geneve.
Fixes: 23461551c0 ("fou: Support for foo-over-udp RX path")
Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before this change applications monitoring FDB notifications
were not able to determine whether a new FDB entry is permament
or not:
bridge fdb add f1:f2:f3:f4:f5:f8 dev sw0p1 temp self
bridge fdb add f1:f2:f3:f4:f5:f9 dev sw0p1 self
bridge monitor fdb
f1:f2:f3:f4:f5:f8 dev sw0p1 self permanent
f1:f2:f3:f4:f5:f9 dev sw0p1 self permanent
With this change ndm_state from the original netlink message
is passed to the new netlink message sent as notification.
bridge fdb add f1:f2:f3:f4:f5:f6 dev sw0p1 self
bridge fdb add f1:f2:f3:f4:f5:f7 dev sw0p1 temp self
bridge monitor fdb
f1:f2:f3:f4:f5:f6 dev sw0p1 self permanent
f1:f2:f3:f4:f5:f7 dev sw0p1 self static
Signed-off-by: Hubert Sokolowski <hubert.sokolowski@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- change my email in MAINTAINERS and Doc files
- create and export list of single hop neighs per interface
- protect CRC in the BLA code by means of its own lock
- minor fixes and code cleanups
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJWcRAcAAoJENpFlCjNi1MRxPQP/A8wAuHYUyMDu6ov5VRUR+24
TYKgCVI1evuYlkOdgHkTQfTlOILgV2NHvORzKwQRNToyvtsd4VIeXZfFJ0UXB0Wo
sMz9vrmmcndGkcYPWs+6IoVUbxMPZXjRGhme3Ig7nNTOQ4GvcSaQANbYWAMBfg61
h6pxez67el9jzso2acHEgvOyaldOLDcbbdwt4vY7bPRgD9VXpE3nazBymaZXLLxs
1ByXiCgJw32gXSWv8RQ/j13btjrbWCdmcEz9Ag1xO5i5u9BI1VJ8nmbLuNu63S0G
Ftgvk5QBeMAxs3xDiTmtQD3bmiS87Jy+1d5rFb581/arM5SnYq6GZIrb81tdVTU6
PMkMZ6/EUV2HSogq9PJ1ZDk/0oPYT5tqfLtJWcaAZplmWYt3ZUrQsPo4z5CtbBhU
6Ar29G5slkgLslcBqn6YB00LxwOmj7elyVdPtL+wMCojtut+Ds4O+FzPdvd/145F
hCtIe55b2ciBsfp1dDXP5P15HeEMjLiN2xWJKAPLhDCmGruJ6Pnly+ElzgV+zdKL
7Qe1mOXwticMy3pH+ST7CP47tp5uyT0ak27eo+oOn4LI/ppM2qW1P9bNUZ6RrVwf
dRRh0dzQruaLQauHK2Z6BWtA2Q36wY2anwmBONK34NH6VfRMB1GJMi3JF8gDgTds
SWwU5FGWTn/cnCejtsbh
=/Z+R
-----END PGP SIGNATURE-----
Merge tag 'batman-adv-for-davem' of git://git.open-mesh.org/linux-merge
Antonio Quartulli says:
====================
Included changes:
- change my email in MAINTAINERS and Doc files
- create and export list of single hop neighs per interface
- protect CRC in the BLA code by means of its own lock
- minor fixes and code cleanups
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
As we all know, the value of pf_retrans >= max_retrans_path can
disable pf state. The variables of pf_retrans and max_retrans_path
can be changed by the userspace application.
Sometimes the user expects to disable pf state while the 2
variables are changed to enable pf state. So it is necessary to
introduce a new variable to disable pf state.
According to the suggestions from Vlad Yasevich, extra1 and extra2
are removed. The initialization of pf_enable is added.
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: Zhu Yanjun <zyjzyj2000@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We have found some networks in which nodes were constantly requesting
other nodes BLA claim tables to synchronize, just to ask for that again
once completed. The reason was that the crc checksum of the asked nodes
were out of sync due to missing locking and multiple writes to the same
crc checksum when adding/removing entries. Therefore the asked nodes
constantly reported the wrong crc, which caused repeating requests.
To avoid multiple functions changing a backbone gateways crc entry at
the same time, lock it using a spinlock.
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Tested-by: Alfons Name <AlfonsName@web.de>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
The chain pointer was already created in batadv_frag_purge_orig to make the
checks more readable. Just use the chain pointer everywhere instead of
having the same dereference + array access in the most lines of this
function.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Acked-by: Martin Hundebøll <martin@hundeboll.net>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
This should slightly improve readability
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
If the local representation of the global TT table of one originator has
more VLAN entries than the respective TT update, there is some
inconsistency present. By detecting and reporting this inconsistency,
the global table gets updated and the excess VLAN will get removed in
the process.
Reported-by: Alessandro Bolletta <alessandro@mediaspot.net>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Acked-by: Antonio Quartulli <antonio@meshcoding.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
Bjørn reported that while we switch all interfaces to privacy stable mode
when setting the secret, we don't set this mode for new interfaces. This
does not make sense, so change this behaviour.
Fixes: 622c81d57b ("ipv6: generation of stable privacy addresses for link-local and autoconf")
Reported-by: Bjørn Mork <bjorn@mork.no>
Cc: Bjørn Mork <bjorn@mork.no>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
modules init functions being called from process context, we better
use GFP_KERNEL allocations to increase our chances to get these
high-order pages we want for SCTP hash tables.
This mostly matters if SCTP module is loaded once memory got fragmented.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This implements SOCK_DESTROY for TCP sockets. It causes all
blocking calls on the socket to fail fast with ECONNABORTED and
causes a protocol close of the socket. It informs the other end
of the connection by sending a RST, i.e., initiating a TCP ABORT
as per RFC 793. ECONNABORTED was chosen for consistency with
FreeBSD.
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This passes the SOCK_DESTROY operation to the underlying protocol
diag handler, or returns -EOPNOTSUPP if that handler does not
define a destroy operation.
Most of this patch is just renaming functions. This is not
strictly necessary, but it would be fairly counterintuitive to
have the code to destroy inet sockets be in a function whose name
starts with inet_diag_get.
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a SOCK_DESTROY operation, a destroy function
pointer to sock_diag_handler, and a diag_destroy function
pointer. It does not include any implementation code.
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, inet_diag_dump_one_icsk finds a socket and then dumps
its information to userspace. Split it into a part that finds the
socket and a part that dumps the information.
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements an ILA tanslation table. This table can be
configured with identifier to locator mappings, and can be be queried
to resolve a mapping. Queries can be parameterized based on interface,
direction (incoming or outoing), and matching locator. The table is
implemented using rhashtable and is configured via netlink (through
"ip ila .." in iproute).
The table may be used as alternative means to do do ILA tanslations
other than the lw tunnels
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The start callback allows the caller to set up a context for the
dump callbacks. Presumably, the context can then be destroyed in
the done callback.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create ila directory in preparation for supporting other hooks in the
kernel than LWT for doing ILA. This includes:
- Moving ila.c to ila/ila_lwt.c
- Splitting out some common functions into ila_common.c
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add skb_csum_offload_chk driver helper function to determine if a
device with limited checksum offload capabilities is able to offload the
checksum for a given packet.
This patch includes:
- The skb_csum_offload_chk function. Returns true if checksum is
offloadable, else false. Optionally, in the case that the checksum
is not offloable, the function can call skb_checksum_help to resolve
the checksum. skb_csum_offload_chk also returns whether the checksum
refers to an encapsulated checksum.
- Definition of skb_csum_offl_spec structure that caller uses to
indicate rules about what it can offload (e.g. IPv4/v6, TCP/UDP only,
whether encapsulated checksums can be offloaded, whether checksum with
IPv6 extension headers can be offloaded).
- Ancilary functions called skb_csum_offload_chk_help,
skb_csum_off_chk_help_cmn, skb_csum_off_chk_help_cmn_v4_only.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In tcp_send_sendpage and tcp_sendmsg we check the route capabilities to
determine if checksum offload can be performed. This check currently
does not take the IP protocol into account for devices that advertise
only one of NETIF_F_IPV6_CSUM or NETIF_F_IP_CSUM. This patch adds a
function to check capabilities for checksum offload with a socket
called sk_check_csum_caps. This function checks for specific IPv4 or
IPv6 offload support based on the family of the socket.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These netif flags are unnecessary convolutions. It is more
straightforward to just use NETIF_F_HW_CSUM, NETIF_F_IP_CSUM,
and NETIF_F_IPV6_CSUM directly.
This patch also:
- Cleans up can_checksum_protocol
- Simplifies netdev_intersect_features
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>