Check that dependencies are fulfilled before updating the logger
instance, otherwise we can leave things in intermediate state on errors
in nfulnl_recv_config().
[ Ken-ichirou reports that this is also fixing missing instance refcnt drop
on error introduced in his patch 914eebf2f4 ("netfilter: nfnetlink_log:
autoload nf_conntrack_netlink module NFQA_CFG_F_CONNTRACK config flag"). ]
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Tested-by: Ken-ichirou MATSUZAWA <chamaken@gmail.com>
This patch consolidates the check for valid logger instance once we have
passed the command handling:
The config message that we receive may contain the following info:
1) Command only: We always get a valid instance pointer if we just
created it. In case that the instance is being destroyed or the
command is unknown, we jump to exit path of nfulnl_recv_config().
This patch doesn't modify this handling.
2) Config only: In this case, the instance must always exist since the
user is asking for configuration updates. If the instance doesn't exist
this returns -ENODEV.
3) No command and no configs are specified: This case is rare. The
user is sending us a config message with neither commands nor
config options. In this case, we have to check if the instance exists
and bail out otherwise. Before this patch, it was possible to send a
config message with no command and no config updates for an
unexisting instance without triggering an error. So this is the only
case that changes.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Tested-by: Ken-ichirou MATSUZAWA <chamaken@gmail.com>
In commit e3eea1eb47 ("tipc: clean up handling of message priorities")
we introduced a field in the packet header for keeping track of the
priority of fragments, since this value is not present in the specified
protocol header. Since the value so far only is used at the transmitting
end of the link, we have not yet officially defined it as part of the
protocol.
Unfortunately, the field we use for keeping this value, bits 13-15 in
in word 5, has turned out to be a poor choice; it is already used by the
broadcast protocol for carrying the 'network id' field of the sending
node. Since packet fragments also need to be transported across the
broadcast protocol, the risk of conflict is obvious, and we see this
happen when we use network identities larger than 2^13-1. This has
escaped our testing because we have so far only been using small network
id values.
We now move this field to bits 0-2 in word 9, a field that is guaranteed
to be unused by all involved protocols.
Fixes: e3eea1eb47 ("tipc: clean up handling of message priorities")
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
At listen() time, there is a small window where listener is visible with
a zero backlog, triggering a spurious "Possible SYN flooding on port"
message.
Nothing prevents us from setting the correct backlog.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As we no longer hold listener lock in fast path, it is possible that a
child is created right after listener freed its bound port, if a close()
is done while incoming packets are processed.
__inet_inherit_port() must detect this and return an error,
so that caller can free the child earlier.
Fixes: e994b2f0fb ("tcp: do not lock listener to process SYN packets")
Fixes: 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It seems that kernel memory can leak into userspace by a
kmalloc, ethtool_get_strings, then copy_to_user sequence.
Avoid this by using kcalloc to zero fill the copied buffer.
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQEcBAABCgAGBQJWHSq8AAoJEP5prqPJtc/HCyMH/R4PeAJoSwlQQVTKxDsdC/x1
Jxue9dMpEQhINoU1EpSQxaYIwg8zzFpQqcKVzoX9bgw5YdfAVgv/wrSW98Hwg/h5
OX+QYBAvhK/1Gk0+b7fwPF323osdD/8hn4lbQorB3gEYmE4+3kKh6ivlxGNa1LfW
VDfX23MhRF+iXFM64pnl7LR6BnflPQlGEKlWQgevR+cZDfEk+lDTRHjdAu/Hjokc
Nwo1agptCOsS5mgE/hyLhqBc6UXSN8ytoi5acP+KtnfnLtmgw/YEt7/2QQgOOTkf
T2zwCxFRQcePwoip7OXFwzkPsZkj3gn4XZCbTSErbqnQ28sFDbTQUTC1mB87f70=
=0GcF
-----END PGP SIGNATURE-----
Merge tag 'linux-can-next-for-4.4-20151013' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next
Marc Kleine-Budde says:
====================
pull-request: can-next 2015-09-17
this is a pull request of 4 patches for net-next/master.
Two patches are by Gerhard Bertelsmann, fixing some problems in the
sun4i driver. The patch by Arnd Bergmann stops using timeval for the
CAN broadcast manager. The last patch by Alexandre Belloni removes the
otherwise unused struct at91_can_data from the driver.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
* fast-xmit was not doing powersave filter clearing correctly,
disable fast-xmit while any such operations are still pending
* a debugfs file was broken due to some infrastructure changes
-----BEGIN PGP SIGNATURE-----
iQIcBAABCAAGBQJWHMdyAAoJEDBSmw7B7bqr+MwQAIG16Oo01vLDRXtjS+XkxVzq
HEXy+PfL3xDEPOq+P5Rm7Bwg1hK6EqRNh6UBab6YvKP0vyrsEgqDe29ftf16R3yC
K9gcslJgm/B8OhwOUQJa9UAyiL28AY8ZTQpKS8b9z7qu7lsXRMFI/S/nVvosdrdT
DGGayyABFuWWbQ0YlLOOoq17/p/BELoaOhj811dlJszkwl7zZmmjsTF4rjB7tsgJ
d0+Gh+Xvx8d5Kl9cvKvgGLeh7Ms7jxnJi96xcNdxUXWylbGeo/05jpRtwnTrQlsj
wYWmkwXXykppbAFO+YQE+hBpEK1KQx8aQVPxNuxv0bPgggt2dkRDJRJFS9g7nSUn
kuJjNJYrVUDYRDszgzjRWi6HFln9PCZJv35BGYTVptt3qM7IcZ16vrNRlDxzTtN+
iX20Fv+IyVW3ZKC7PUIugYYpXvOibKKOpPpkiEz7DiSZXy9YKTdZuhNv3JwuTTca
0BnGIUX+M2zlBeaRUugX3pK88W1LajgKx/FnnFZ6pCivC2bQr3Uf7IsNzSIO9eEZ
+q9zdumyonKi2RJXerPJFN+yXB0afv2rQRqZQqoAt3MURMI73BawXL0SUOgNPrDr
5ivCFy/6deXDnQ3mRLaT+w9alMThBSLPGXKZZKq3RJNJmUYr8Oe+6LMvtFEqPlCt
s703Q3UWgZ6iyx77kd1o
=Ziyp
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2015-10-13' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
Like last time, we have two small fixes:
* fast-xmit was not doing powersave filter clearing correctly,
disable fast-xmit while any such operations are still pending
* a debugfs file was broken due to some infrastructure changes
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 6e498158a8 ("tipc: move link synch and failover to link aggregation level")
we introduced a new mechanism for performing link failover and
synchronization. We have now detected a bug in this mechanism.
During link synchronization we use the arrival of any packet on
the tunnel link to trig a check for whether it has reached the
synchronization point or not. This has turned out to be too
permissive, since it may cause an arriving non-last SYNCH packet to
end the synch state, just to see the next SYNCH packet initiate a
new synch state with a new, higher synch point. This is not fatal,
but should be avoided, because it may significantly extend the
synchronization period, while at the same time we are not allowed
to send NACKs if packets are lost. In the worst case, a low-traffic
user may see its traffic stall until a LINK_PROTOCOL state message
trigs the link to leave synchronization state.
At the same time, LINK_PROTOCOL packets which happen to have a (non-
valid) sequence number lower than the tunnel link's rcv_nxt value will
be consistently dropped, and will never be able to resolve the situation
described above.
We fix this by exempting LINK_PROTOCOL packets from the sequence number
check, as they should be. We also reduce (but don't completely
eliminate) the risk of entering multiple synchronization states by only
allowing the (logically) first SYNCH packet to initiate a synchronization
state. This works independently of actual packet arrival order.
Fixes: commit 6e498158a8 ("tipc: move link synch and failover to link aggregation level")
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Revert the commit e2ca690b65 ("ipv4/icmp: redirect messages
can use the ingress daddr as source"), which tried to introduce a more
suitable behaviour for ICMP redirect messages generated by VRRP routers.
However RFC 5798 section 8.1.1 states:
The IPv4 source address of an ICMP redirect should be the address
that the end-host used when making its next-hop routing decision.
while said commit used the generating packet destination
address, which do not match the above and in most cases leads to
no redirect packets to be generated.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Correct whitespace layout of a pointer casting.
No changes detected by objdiff.
Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Correct whitespace layout of if statements.
No changes detected by objdiff.
Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When a TCP/DCCP listener is closed, its pending SYN_RECV request sockets
become stale, meaning 3WHS can not complete.
But current behavior is wrong :
incoming packets finding such stale sockets are dropped.
We need instead to cleanup the request socket and perform another
lookup :
- Incoming ACK will give a RST answer,
- SYN rtx might find another listener if available.
- We expedite cleanup of request sockets and old listener socket.
Fixes: 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bug.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJWHUj5AAoJECebzXlCjuG+KIoP/RW5zigAEKqUiD7ycKR91BxD
9Nt0fqTTrbkGJhKM1/DN4YEjogAHeFW5OnGiLQRUNI/qdy+I1Gyr1kgwGmCCVDt9
d8AhnxcnXR5SmsQHk7eeUd/rnODetf0bW5YJ8PfFbnC6cmM013nR9ujEccUuCl9M
hHTp+690Doab00PtWtsjmZv5d+eT1bktY/R2PuQhyQM2CKWh1u4FeNTd1lWE551D
b1wSvhAGMYVEsQv8+HICDrIQ8loGfH2gpBILERLM2yJlhN1IPU3RmNSAcQpZSaql
veJYVmHdpMACCLp0Dd3hwWKDYvcQ2lCqKk+Cpd0vLpvZ8J5OjCLC+a2dh0PRIYuf
pwFCvbWz6dn27/9eXEKbyT2JIeBIl4qwrFjfiRKlNX0c4HGKXaE2gJrY7bxnDxe1
BatAbEFZ+rxHyPmycaj3JdyOxafmw94XzbT8q2g7tmUCj+pvAI+Pbv6PlwN6W2r7
aGBZzgd8Y9pT6ZbCB0e413d/t5ulxwkt6vVz9Jze4gfcUrWcqHaqt7AadMl7obUx
AYPLAVGeHybdKlLvqv42IF2QM8ZhizM0+EnxkjfWLrsa7WbstWX5KLPpm3K80dM7
98p1ToNQDFcNU8WBZw8AkBpFz4j32RVOkvzWFWbhCo+T3is4BmP16uEEjH90aCCY
skQKMrq8J1ox33gz5gT7
=Pkuy
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.3-2' of git://linux-nfs.org/~bfields/linux
Pull nfsd fixes from Bruce Fields:
"Two nfsd fixes, one for an RDMA crash, one for a pnfs/block protocol
bug"
* tag 'nfsd-4.3-2' of git://linux-nfs.org/~bfields/linux:
svcrdma: Fix NFS server crash triggered by 1MB NFS WRITE
nfsd/blocklayout: accept any minlength
The can subsystem communicates with user space using a bcm_msg_head
header, which contains two timestamps. This is problematic for
multiple reasons:
a) The structure layout is currently incompatible between 64-bit
user space and 32-bit user space, and cannot work in compat
mode (other than x32).
b) The timeval structure layout will change in 32-bit user
space when we fix the y2038 overflow problem by redefining
time_t to 64-bit, making new 32-bit user space incompatible
with the current kernel interface.
Cars last a long time and often use old kernels, so the actual
users of this code are the most likely ones to migrate to y2038
safe user space.
This tries to work around part of the problem by changing the
publicly visible user interface in the header, but not the binary
interface. Fortunately, the values passed around in the structure
are relative times and do not actually suffer from the y2038
overflow, so 32-bit is enough here.
We replace the use of 'struct timeval' with a newly defined
'struct bcm_timeval' that uses the exact same binary layout
as before and that still suffers from problem a) but not problem
b).
The downside of this approach is that any user space program
that currently assigns a timeval structure to these members
rather than writing the tv_sec/tv_usec portions individually
will suffer a compile-time error when built with an updated
kernel header. Fixing this error makes it work fine with old
and new headers though.
We could address problem a) by using '__u32' or 'int' members
rather than 'long', but that would have a more significant
downside in also breaking support for all existing 64-bit user
binaries that might be using this interface, which is likely
not acceptable.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Cc: linux-can@vger.kernel.org
Cc: linux-api@vger.kernel.org
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Correct whitespace layout of ternary operators in the netfilter-ipv6
code.
No changes detected by objdiff.
Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch cleanses whitespace around arithmetical operators.
No changes detected by objdiff.
Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Use tabs instead of spaces to indent code.
No changes detected by objdiff.
Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Use tabs instead of spaces to indent second line of parameters in
function definitions.
No changes detected by objdiff.
Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Whitespace cleansing: Labels should not be indented.
No changes detected by objdiff.
Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Ido Schimmel reported a problem with switchdev devices because of the
order change of del_nbp operations, more specifically the move of
nbp_vlan_flush() which deletes all vlans and frees vlgrp after the
rx_handler has been unregistered. So in order to fix this move
vlan_flush back where it was and make it destroy the rhtable after
NULLing vlgrp and waiting a grace period to make sure noone can see it.
Reported-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As Ido Schimmel pointed out the vlan_vid_del() code in nbp_vlan_flush is
unnecessary (and is actually a remnant of the old vlan code) so we can
remove it.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_fill_ifinfo is called by br_ifinfo_notify which can be called from
many contexts with different locks held, sometimes it relies upon
bridge's spinlock only which is a problem for the vlan code, so use
explicitly rcu for that to avoid problems.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bridge and port's vlgrp member is already used in RCU way, currently
we rely on the fact that it cannot disappear while the port exists but
that is error-prone and we might miss places with improper locking
(either RCU or RTNL must be held to walk the vlan_list). So make it
official and use RCU for vlgrp to catch offenders. Introduce proper vlgrp
accessors and use them consistently throughout the code.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As with IPv4 support for VRFs added to IPv6 stack by replacing hardcoded
table ids with possibly device specific ones and manipulating the oif in
the flowi6. The flow flags are used to skip oif compare in nexthop lookups
if the device is enslaved to a VRF via the L3 master device.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As originally written rt6_uncached_list_flush_dev makes no sense when
called with dev == NULL as it attempts to flush all uncached routes
regardless of network namespace when dev == NULL. Which is simply
incorrect behavior.
Furthermore at the point rt6_ifdown is called with dev == NULL no more
network devices exist in the network namespace so even if the code in
rt6_uncached_list_flush_dev were to attempt something sensible it
would be meaningless.
Therefore remove support in rt6_uncached_list_flush_dev for handling
network devices where dev == NULL, and only call rt6_uncached_list_flush_dev
when rt6_ifdown is called with a network device.
Fixes: 8d0b94afdc ("ipv6: Keep track of DST_NOCACHE routes in case of iface down/unregister")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Tested-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
VLANs 0 and 4095 are reserved and shouldn't be used, add checks to
switchdev similar to the bridge. Also make sure ids above 4095 cannot
be passed either.
Fixes: 47f8328bb1 ("switchdev: add new switchdev bridge setlink")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We shouldn't allow BRIDGE_VLAN_INFO_PVID flag in VLAN ranges.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Elad Raz <eladr@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A DSA driver may not provide the port_join_bridge and port_leave_bridge
functions, so don't warn in such case.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Consider the following "duelling syn" sequence between two peers A and B:
A B
SYN1 -->
<-- SYN2
SYN2ACK -->
Note that the SYN/ACK has already been sent out by TCP before
rds_tcp_accept_one() gets invoked as part of callbacks.
If the inet_addr(A) is numerically less than inet_addr(B),
the arbitration scheme in rds_tcp_accept_one() will prefer the
TCP connection triggered by SYN1, and will send a CLOSE for the
SYN2 (just after the SYN2ACK was sent).
Since B also follows the same arbitration scheme, it will send the SYN-ACK
for SYN1 that will set up a healthy ESTABLISHED connection on both sides.
B will also get a CLOSE for SYN2, which should result in the cleanup
of the TCP state machine for SYN2, but it should not trigger any
stale RDS-TCP callbacks (such as ->writespace, ->state_change etc),
that would disrupt the progress of the SYN2 based RDS-TCP connection.
Thus the arbitration scheme in rds_tcp_accept_one() should restore
rds_tcp callbacks for the winner before setting them up for the
new accept socket, and also make sure that conn->c_outgoing
is set to 0 so that we do not trigger any reconnect attempts on the
passive side of the tcp socket in the future, in conformance with
commit c82ac7e69e ("net/rds: RDS-TCP: only initiate reconnect attempt
on outgoing TCP socket.")
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The IP address passed to rds_bind() should be vetted by the
transport's ->laddr_check() for a previously bound transport.
This needs to be done to avoid cases where, for example,
the application has asked for an IB transport,
but the IP address passed to bind is only usable on
ethernet interfaces.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Usage of -prev seems buggy. While packet was out our hook cannot be
removed but we have no way to know if the previous one is still valid.
So better not use ->prev at all. Since NF_REPEAT just asks to invoke
same hook function again, just do so, and continue with nf_interate
if we get an ACCEPT verdict.
A side effect of this change is that if nf_reinject(NF_REPEAT) causes
another REPEAT we will now drop the skb instead of a kernel loop.
However, NF_REPEAT loops would be a bug so this should not happen anyway.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Commit 30686bf7f5 ("mac80211: convert HW flags to unsigned long
bitmap") accidentally removed the newline delimiter from the hwflags
debugfs file. Fix this by adding back the newline between the HW flags.
Cc: stable@vger.kernel.org [4.2]
Signed-off-by: Mohammed Shafi Shajakhan <mohammed@qti.qualcomm.com>
[fix commit log]
Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently it's possible for someone to send a vlan range to the kernel
with the pvid flag set which will result in the pvid bouncing from a
vlan to vlan and isn't correct, it also introduces problems for hardware
where it doesn't make sense having more than 1 pvid. iproute2 already
enforces this, so let's enforce it on kernel-side as well.
Reported-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes ip6_route_info_create return err pointer instead of
returning the rt pointer by reference as suggested by Dave
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function nf_ct_frag6_gather is called on both the input and the
output paths of the networking stack. In particular ipv6_defrag which
calls nf_ct_frag6_gather is called from both the the PRE_ROUTING chain
on input and the LOCAL_OUT chain on output.
The addition of a net parameter makes it explicit which network
namespace the packets are being reassembled in, and removes the need
for nf_ct_frag6_gather to guess.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function ip_defrag is called on both the input and the output
paths of the networking stack. In particular conntrack when it is
tracking outbound packets from the local machine calls ip_defrag.
So add a struct net parameter and stop making ip_defrag guess which
network namespace it needs to defragment packets in.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_call_ra_chain is called early in the forwarding chain from
ip_forward and ip_mr_input, which makes skb->dev the correct
expression to get the input network device and dev_net(skb->dev) a
correct expression for the network namespace the packet is being
processed in.
Compute the network namespace and store it in a variable to make the
code clearer.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Recent TCP listener patches exposed a prior af_packet bug :
match_fanout_group() blindly assumes it is always safe
to cast sk to a packet socket to compare fanout with af_packet_priv
But SYNACK packets can be sent while attached to request_sock, which
are smaller than a "struct sock".
We can read non existent memory and crash.
Fixes: c0de08d042 ("af_packet: don't emit packet on orig fanout group")
Fixes: ca6fb06518 ("tcp: attach SYNACK messages to request sockets instead of listener")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Eric Leblond <eric@regit.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows configuring how the source address of ICMP
redirect messages is selected; by default the old behaviour is
retained, while setting icmp_redirects_use_orig_daddr force the
usage of the destination address of the packet that caused the
redirect.
The new behaviour fits closely the RFC 5798 section 8.1.1, and fix the
following scenario:
Two machines are set up with VRRP to act as routers out of a subnet,
they have IPs x.x.x.1/24 and x.x.x.2/24, with VRRP holding on to
x.x.x.254/24.
If a host in said subnet needs to get an ICMP redirect from the VRRP
router, i.e. to reach a destination behind a different gateway, the
source IP in the ICMP redirect is chosen as the primary IP on the
interface that the packet arrived at, i.e. x.x.x.1 or x.x.x.2.
The host will then ignore said redirect, due to RFC 1122 section 3.2.2.2,
and will continue to use the wrong next-op.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some drivers need to implement both switchdev vlan ops and
vid_add/kill ndos. For that to work in bridge code, we need to try
switchdev op first when adding/deleting vlan id.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
One 32bit hole is following skc_refcnt, use it.
skc_incoming_cpu can also be an union for request_sock rcv_wnd.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SO_INCOMING_CPU as added in commit 2c8c56e15d was a getsockopt() command
to fetch incoming cpu handling a particular TCP flow after accept()
This commits adds setsockopt() support and extends SO_REUSEPORT selection
logic : If a TCP listener or UDP socket has this option set, a packet is
delivered to this socket only if CPU handling the packet matches the specified
one.
This allows to build very efficient TCP servers, using one listener per
RX queue, as the associated TCP listener should only accept flows handled
in softirq by the same cpu.
This provides optimal NUMA behavior and keep cpu caches hot.
Note that __inet_lookup_listener() still has to iterate over the list of
all listeners. Following patch puts sk_refcnt in a different cache line
to let this iteration hit only shared and read mostly cache lines.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Edward Hyunkoo Jee <edjee@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's useful to allow users to set fwmark for an individual packet,
without changing the socket state. The function this patch adds in
sock layer can be used by the protocols that need such a feature.
Signed-off-by: Edward Hyunkoo Jee <edjee@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to let unprivileged users load and execute eBPF programs
teach verifier to prevent pointer leaks.
Verifier will prevent
- any arithmetic on pointers
(except R10+Imm which is used to compute stack addresses)
- comparison of pointers
(except if (map_value_ptr == 0) ... )
- passing pointers to helper functions
- indirectly passing pointers in stack to helper functions
- returning pointer from bpf program
- storing pointers into ctx or maps
Spill/fill of pointers into stack is allowed, but mangling
of pointers stored in the stack or reading them byte by byte is not.
Within bpf programs the pointers do exist, since programs need to
be able to access maps, pass skb pointer to LD_ABS insns, etc
but programs cannot pass such pointer values to the outside
or obfuscate them.
Only allow BPF_PROG_TYPE_SOCKET_FILTER unprivileged programs,
so that socket filters (tcpdump), af_packet (quic acceleration)
and future kcm can use it.
tracing and tc cls/act program types still require root permissions,
since tracing actually needs to be able to see all kernel pointers
and tc is for root only.
For example, the following unprivileged socket filter program is allowed:
int bpf_prog1(struct __sk_buff *skb)
{
u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
u64 *value = bpf_map_lookup_elem(&my_map, &index);
if (value)
*value += skb->len;
return 0;
}
but the following program is not:
int bpf_prog1(struct __sk_buff *skb)
{
u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
u64 *value = bpf_map_lookup_elem(&my_map, &index);
if (value)
*value += (u64) skb;
return 0;
}
since it would leak the kernel address into the map.
Unprivileged socket filter bpf programs have access to the
following helper functions:
- map lookup/update/delete (but they cannot store kernel pointers into them)
- get_random (it's already exposed to unprivileged user space)
- get_smp_processor_id
- tail_call into another socket filter program
- ktime_get_ns
The feature is controlled by sysctl kernel.unprivileged_bpf_disabled.
This toggle defaults to off (0), but can be set true (1). Once true,
bpf programs and maps cannot be accessed from unprivileged process,
and the toggle cannot be set back to false.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch enables to load nf_conntrack_netlink module if
NFULNL_CFG_F_CONNTRACK config flag is specified.
Signed-off-by: Ken-ichirou MATSUZAWA <chamas@h4.dion.ne.jp>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Now that the NFS server advertises a maximum payload size of 1MB
for RPC/RDMA again, it crashes in svc_process_common() when NFS
client sends a 1MB NFS WRITE on an NFS/RDMA mount.
The server has set up a 259 element array of struct page pointers
in rq_pages[] for each incoming request. The last element of the
array is NULL.
When an incoming request has been completely received,
rdma_read_complete() attempts to set the starting page of the
incoming page vector:
rqstp->rq_arg.pages = &rqstp->rq_pages[head->hdr_count];
and the page to use for the reply:
rqstp->rq_respages = &rqstp->rq_arg.pages[page_no];
But the value of page_no has already accounted for head->hdr_count.
Thus rq_respages now points past the end of the incoming pages.
For NFS WRITE operations smaller than the maximum, this is harmless.
But when the NFS WRITE operation is as large as the server's max
payload size, rq_respages now points at the last entry in rq_pages,
which is NULL.
Fixes: cc9a903d91 ('svcrdma: Change maximum server payload . . .')
BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=270
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@dev.mellanox.co.il>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Shirley Ma <shirley.ma@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
With the ARM mini2440_defconfig, the bridge netfilter code gets
built with both CONFIG_NF_DEFRAG_IPV4 and CONFIG_NF_DEFRAG_IPV6
disabled, which leads to a harmless gcc warning:
net/bridge/br_netfilter_hooks.c: In function 'br_nf_dev_queue_xmit':
net/bridge/br_netfilter_hooks.c:792:2: warning: label 'drop' defined but not used [-Wunused-label]
This gets rid of the warning by cleaning up the code to avoid
the respective #ifdefs causing this problem, and replacing them
with if(IS_ENABLED()) checks. I have verified that the resulting
object code is unchanged, and an additional advantage is that
we now get compile coverage of the unused functions in more
configurations.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: dd302b59bd ("netfilter: bridge: don't leak skb in error paths")
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Simon Horman says:
====================
Fourth Round of IPVS Updates for v4.4
please consider these build warning cleanups from David Ahern and myself.
They resolve some minor side effects of Eric Biederman' heroic work to
cleanup IPVS which you recently pulled: its queued up for v4.4 so no need
to worry about earlier kernel versions.
====================
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The object and module refcounts are updated for each conntrack template,
however, if we delete the iptables rules and we flush the timeout
database, we may end up with invalid references to timeout object that
are just gone.
Resolve this problem by setting the timeout reference to NULL when the
custom timeout entry is removed from our base. This patch requires some
RCU trickery to ensure safe pointer handling.
This handling is similar to what we already do with conntrack helpers,
the idea is to avoid bumping the timeout object reference counter from
the packet path to avoid the cost of atomic ops.
Reported-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
On success, this shouldn't put back the timeout policy object, otherwise
we may have module refcount overflow and we allow deletion of timeout
that are still in use.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Use SWITCHDEV_F_SKIP_EOPNOTSUPP to skip over ports in bridge that don't
support setting ageing_time (or setting bridge attrs in general).
If push fails, don't update ageing_time in bridge and return err to user.
If push succeeds, update ageing_time in bridge and run gc_timer now to
recalabrate when to run gc_timer next, based on new ageing_time.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This allows us to recurse over all the ports, skipping over unsupporting
ports. Without the change, the recursion would stop at first unsupported
port.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The alive parameter of tcp_orphan_retries, indicates
whether the connection is assumed alive or not.
In the function and all places calling it is used as a boolean value.
Therefore this changes the type of alive to bool in the function
definition and all calling locations.
Since tcp_orphan_tries is a tcp_timer.c local function no change in
any other file or header is necessary.
Signed-off-by: Richard Sailer <richard@weltraumpflege.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch enables adding of fdb entries pointing to the bridge device.
This can be used to propagate mac address of vlan interfaces
configured on top of the vlan filtering bridge.
Before:
$bridge fdb add 44:38:39:00:27:9f dev bridge
RTNETLINK answers: Invalid argument
After:
$bridge fdb add 44:38:39:00:27:9f dev bridge
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before recent TCP listener patches, we were updating listener
sk->sk_rxhash before the cloning of master socket.
children sk_rxhash was therefore correct after the normal 3WHS.
But with lockless listener, we no longer dirty/change listener sk_rxhash
as it would be racy.
We need to correctly update the child sk_rxhash, otherwise first data
packet wont hit correct cpu if RFS is used.
Fixes: 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Willem de Bruijn <willemb@google.com>
Cc: Tom Herbert <tom@herbertland.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a clone of commit 2ab957492d ("ip_forward: Drop frames with
attached skb->sk") for ipv6.
This commit has exactly the same reasons as the above mentioned commit,
namely to prevent panics during netfilter reload or a misconfigured stack.
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
GRE point-to-point interfaces should also support ipv6 multicast. Setting
up default multicast routes on interface creation was forgotten. Add it.
Bugzilla: <https://bugzilla.kernel.org/show_bug.cgi?id=103231>
Cc: Julien Muchembled <jm@jmuchemb.eu>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Nicolas Dumazet <ndumazet@google.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
For consistency with the FDB add operation, propagate the
switchdev_obj_port_fdb structure in the DSA drivers.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that the prepare phase is pushed down to the DSA drivers, propagate
it to the port_fdb_add function.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Push the prepare phase for FDB operations down to the DSA drivers, with
a new port_fdb_prepare function. Currently only mv88e6xxx is affected.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth-next 2015-10-08
Here's another set of Bluetooth & 802.15.4 patches for the 4.4 kernel.
802.15.4:
- Many improvements & fixes to the mrf24j40 driver
- Fixes and cleanups to nl802154, mac802154 & ieee802154 code
Bluetooth:
- New chipset support in btmrvl driver
- Fixes & cleanups to btbcm, btmrvl, bpa10x & btintel drivers
- Support for vendor specific diagnostic data through common API
- Cleanups to the 6lowpan code
- New events & message types for monitor channel
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
selinux needs few changes to accommodate fact that SYNACK messages
can be attached to a request socket, lacking sk_security pointer
(Only syncookies are still attached to a TCP_LISTEN socket)
Adds a new sk_listener() helper, and use it in selinux and sch_fq
Fixes: ca6fb06518 ("tcp: attach SYNACK messages to request sockets instead of listener")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported by: kernel test robot <ying.huang@linux.intel.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Eric Paris <eparis@parisplace.org>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to commit c0afd9ce4d ("fq_codel: fix return value of fq_codel_drop()")
->drop() is supposed to return the number of bytes it dropped,
but hhf_drop () returns the id of the bucket where it drops
a packet from.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Terry Lam <vtlam@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
eBPF socket filter programs may see junk in 'u32 cb[5]' area,
since it could have been used by protocol layers earlier.
For socket filter programs used in af_packet we need to clean
20 bytes of skb->cb area if it could be used by the program.
For programs attached to TCP/UDP sockets we need to save/restore
these 20 bytes, since it's used by protocol layers.
Remove SK_RUN_FILTER macro, since it's no longer used.
Long term we may move this bpf cb area to per-cpu scratch, but that
requires addition of new 'per-cpu load/store' instructions,
so not suitable as a short term fix.
Fixes: d691f9e8d4 ("bpf: allow programs to write to certain skb fields")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Kconfig currently controlling compilation of this code is:
net/sched/Kconfig:menuconfig NET_SCHED
net/sched/Kconfig: bool "QoS and/or fair queueing"
...meaning that it currently is not being built as a module by anyone.
Lets remove the modular code that is essentially orphaned, so that
when reading the driver there is no doubt it is builtin-only.
Since module_init translates to device_initcall in the non-modular
case, the init ordering remains unchanged with this commit. We can
change to one of the other priority initcalls (subsys?) at any later
date, if desired.
We also delete the MODULE_LICENSE tag since all that information
is already contained at the top of the file in the comments.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Kconfig currently controlling compilation of this code is:
net/dcb/Kconfig:config DCB
net/dcb/Kconfig: bool "Data Center Bridging support"
...meaning that it currently is not being built as a module by anyone.
Lets remove the modular code that is essentially orphaned, so that
when reading the driver there is no doubt it is builtin-only.
Since module_init translates to device_initcall in the non-modular
case, the init ordering remains unchanged with this commit. We can
change to one of the other priority initcalls (subsys?) at any later
date, if desired.
We also delete the MODULE_LICENSE tag etc. since all that information
is (or is now) already contained at the top of the file in the comments.
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Or Gerlitz <ogerlitz@mellanox.com>
Cc: Anish Bhatt <anish@chelsio.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: Shani Michaeli <shanim@mellanox.com>
Cc: netdev@vger.kernel.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Makefile currently controlling compilation of this code lists
it under "obj-y" ...meaning that it currently is not being built as
a module by anyone.
Lets remove the modular code that is essentially orphaned, so that
when reading the driver there is no doubt it is builtin-only.
Since module_init translates to device_initcall in the non-modular
case, the init ordering remains unchanged with this commit. We can
change to one of the other priority initcalls (subsys?) at any later
date, if desired.
We can't remove module.h since the file uses other module related
stuff even though it is not modular itself.
We move the information from the MODULE_LICENSE tag to the top of the
file, since that information is not captured anywhere else. The
MODULE_ALIAS_NET_PF_PROTO becomes a no-op in the non modular case, so
it is removed.
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Craig Gallek <kraig@google.com>
Cc: netdev@vger.kernel.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes lockdep_rtnl_is_held return bool due to this
particular function only using either one or zero as its return
value.
In another patch lockdep_is_held is also made return bool.
No functional change.
Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes dccp_bad_service_code return bool due to these
particular functions only using either one or zero as their return
value.
dccp_list_has_service is also been made return bool in this patchset.
No functional change.
Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes lockdep_nfnl_is_held return bool to improve
readability due to this particular function only using either
one or zero as its return value.
No functional change.
Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes lockdep_genl_is_held return bool to improve
readability due to this particular function only using either
one or zero as its return value.
No functional change.
Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the controller is unconfigured (for example it does not have a
valid Bluetooth address), then the basic debugfs entries for dut_mode
and vendor_diag are not creates. Ensure they are created in __hci_init
and also __hci_unconf_init functions. One of them is called during setup
stage of a new controller.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
While recently arguing on a seccomp discussion that raw prandom_u32()
access shouldn't be exposed to unpriviledged user space, I forgot the
fact that SKF_AD_RANDOM extension actually already does it for some time
in cBPF via commit 4cd3675ebf ("filter: added BPF random opcode").
Since prandom_u32() is being used in a lot of critical networking code,
lets be more conservative and split their states. Furthermore, consolidate
eBPF and cBPF prandom handlers to use the new internal PRNG. For eBPF,
bpf_get_prandom_u32() was only accessible for priviledged users, but
should that change one day, we also don't want to leak raw sequences
through things like eBPF maps.
One thought was also to have own per bpf_prog states, but due to ABI
reasons this is not easily possible, i.e. the program code currently
cannot access bpf_prog itself, and copying the rnd_state to/from the
stack scratch space whenever a program uses the prng seems not really
worth the trouble and seems too hacky. If needed, taus113 could in such
cases be implemented within eBPF using a map entry to keep the state
space, or get_random_bytes() could become a second helper in cases where
performance would not be critical.
Both sides can trigger a one-time late init via prandom_init_once() on
the shared state. Performance-wise, there should even be a tiny gain
as bpf_user_rnd_u32() saves one function call. The PRNG needs to live
inside the BPF core since kernels could have a NET-less config as well.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Cc: Chema Gonzalez <chema@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There's no good reason why users outside of networking should not
be using this facility, f.e. for initializing their seeds.
Therefore, make it accessible from there as get_random_once().
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch moves values for all lowpan interface to the shared
implementation of 6lowpan. This patch also quietly fixes the forgotten
IFF_NO_QUEUE flag for the bluetooth 6LoWPAN interface. An identically
commit is 4afbc0d ("net: 6lowpan: convert to using IFF_NO_QUEUE") which
wasn't changed for bluetooth 6lowpan.
All 6lowpan interfaces should be virtual with IFF_NO_QUEUE, using EUI64
address length, the mtu size is 1280 (IPV6_MIN_MTU) and the netdev type
is ARPHRD_6LOWPAN.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Commit deaa0a6a93 ("net: Lookup actual route when oif is VRF device")
exposed a bug in __ip_route_output_key_hash for VRF devices: on FIB lookup
failure if the oif is specified the current logic drops to make_route on
the assumption that the route tables are wrong. For VRF/L3 master devices
this leads to wrong dst entries and route lookups. For example:
$ ip route ls table vrf-red
unreachable default
broadcast 10.2.1.0 dev eth1 proto kernel scope link src 10.2.1.2
10.2.1.0/24 dev eth1 proto kernel scope link src 10.2.1.2
local 10.2.1.2 dev eth1 proto kernel scope host src 10.2.1.2
broadcast 10.2.1.255 dev eth1 proto kernel scope link src 10.2.1.2
$ ip route get oif vrf-red 1.1.1.1
1.1.1.1 dev vrf-red src 10.0.0.2
cache
With this patch:
$ ip route get oif vrf-red 1.1.1.1
RTNETLINK answers: No route to host
which is the correct response based on the default route
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to commit c29390c6df ("xps: must clear sender_cpu before
forwarding"), we also need to clear the skb->sender_cpu when moving
from RX to TX via skb_do_redirect() due to the shared location of
napi_id (used on RX) and sender_cpu (used on TX).
Fixes: 27b29f6305 ("bpf: add bpf_redirect() helper")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to commit c29390c6df ("xps: must clear sender_cpu before forwarding")
the skb->sender_cpu needs to be cleared before xmit.
Fixes: 3896d655f4 ("bpf: introduce bpf_clone_redirect() helper")
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to commit c29390c6df ("xps: must clear sender_cpu before forwarding")
the skb->sender_cpu needs to be cleared when moving from Rx
Tx, otherwise kernel could crash.
Fixes: 2bd82484bb ("xps: fix xps for stacked devices")
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Santosh Shilimkar says:
====================
RDS: connection scalability and performance improvements
[v4]
Re-sending the same patches from v3 again since my repost of
patch 05/14 from v3 was whitespace damaged.
[v3]
Updated patch "[PATCH v2 05/14] RDS: defer the over_batch work to
send worker" as per David Miller's comment [4] to avoid the magic
value usage. Patch now makes use of already available but unused
send_batch_count module parameter. Rest of the patches are same as
earlier version v2 [3]
[v2]:
Dropped "[PATCH 05/15] RDS: increase size of hash-table to 8K" from
earlier version [1]. I plan to address the hash table scalability using
re-sizable hash tables as suggested by David Laight and David Miller [2]
This series addresses RDS connection bottlenecks on massive workloads and
improve the RDMA performance almost by 3X. RDS TCP also gets a small gain
of about 12%.
RDS is being used in massive systems with high scalability where several
hundred thousand end points and tens of thousands of local processes
are operating in tens of thousand sockets. Being RC(reliable connection),
socket bind and release happens very often and any inefficiencies in
bind hash look ups hurts the overall system performance. RDS bin hash-table
uses global spin-lock which is the biggest bottleneck. To make matter worst,
it uses rcu inside global lock for hash buckets.
This is being addressed by simply using per bucket rw lock which makes the
locking simple and very efficient. The hash table size is still an issue and
I plan to address it by using re-sizable hash tables as suggested on the list.
For RDS RDMA improvement, the completion handling is revamped so that we
can do batch completions. Both send and receive completion handlers are
split logically to achieve the same. RDS 8K messages being one of the
key usecase, mr pool is adapted to have the 8K mrs along with default 1M
mrs. And while doing this, few fixes and couple of bottlenecks seen with
rds_sendmsg() are addressed.
Series applies against 4.3-rc1 as well net-next. Its tested on Oracle
hardware with IB fabric for both bcopy as well as RDMA mode. RDS TCP is
tested with iXGB NIC. Like last time, iWARP transport is untested with
these changes. The patchset is also available at below git repo:
git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux.git net/rds/4.3-v3
As a side note, the IB HCA driver I used for testing misses at least 3
important patches in upstream to see the full blown IB performance and
am hoping to get that in mainline with help of them.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The network namespace is already passed into dst_output pass it into
dst->output lwt->output and friends.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Compute net and store it in a variable in the functions
ip_build_and_send_pkt and ip_queue_xmit so that it does not need to be
recomputed next time it is needed.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Store net in a variable in ip_tunnel_xmit so it does not need
to be recomputed when it is used again.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stop hidding the sk parameter with an inline helper function and make
all of the callers pass it, so that it is clear what the function is
doing.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Only __ip6_local_out_sk has callers so rename __ip6_local_out_sk
__ip6_local_out and remove the previous __ip6_local_out.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is confusing and silly hiding a parameter so modify all of
the callers to pass in the appropriate socket or skb->sk if
no socket is known.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For consistency with the other similar methods in the kernel pass a
struct sock into the dst_ops .local_out method.
Simplifying the socket passing case is needed a prequel to passing a
struct net reference into .local_out.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace dst_output_okfn with dst_output
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After a packet has been encapsulated by a tunnel we should use the
tunnel sockets local multicast loopback flag to control if the
encapsulated packet should be locally loopback back.
Pass sk into ip_local_out_sk so that in the rare case we are dealing
with a tunneled packet whose tunnel destination address is a multicast
address the kernel properly decides to loopback this packet.
In practice I don't think this matters as ip_queue_xmit is used by
tcp, l2tp and sctp none of which I am aware of uses ip level
multicasting as they are all point to point communications protocols.
Let's fix this before someone uses ip_queue_xmit for a tunnel protocol
that does use multicast.
Fixes: aad88724c9 ("ipv4: add a sock pointer to dst->output() path.")
Fixes: b0270e9101 ("ipv4: add a sock pointer to ip_queue_xmit()")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the rare case where sk != skb->sk ip_local_out_sk arranges
to call dst->output differently if the skb is queued or not.
This is a bug.
Fix this bug by passing the sk parameter of ip_local_out_sk through
from ip_local_out_sk to __ip_local_out_sk (skipping __ip_local_out).
Fixes: 7026b1ddb6 ("netfilter: Pass socket pointer down through okfn().")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>