Commit Graph

47071 Commits

Author SHA1 Message Date
Lawrence Brakmo
91b5b21c7c bpf: Add support for changing congestion control
Added support for changing congestion control for SOCK_OPS bpf
programs through the setsockopt bpf helper function. It also adds
a new SOCK_OPS op, BPF_SOCK_OPS_NEEDS_ECN, that is needed for
congestion controls, like dctcp, that need to enable ECN in the
SYN packets.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 16:15:14 -07:00
Lawrence Brakmo
9872a4bde3 bpf: Add TCP connection BPF callbacks
Added callbacks to BPF SOCK_OPS type program before an active
connection is intialized and after a passive or active connection is
established.

The following patch demostrates how they can be used to set send and
receive buffer sizes.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 16:15:14 -07:00
Lawrence Brakmo
8c4b4c7e9f bpf: Add setsockopt helper function to bpf
Added support for calling a subset of socket setsockopts from
BPF_PROG_TYPE_SOCK_OPS programs. The code was duplicated rather
than making the changes to call the socket setsockopt function because
the changes required would have been larger.

The ops supported are:
  SO_RCVBUF
  SO_SNDBUF
  SO_MAX_PACING_RATE
  SO_PRIORITY
  SO_RCVLOWAT
  SO_MARK

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 16:15:13 -07:00
Lawrence Brakmo
13d3b1ebe2 bpf: Support for setting initial receive window
This patch adds suppport for setting the initial advertized window from
within a BPF_SOCK_OPS program. This can be used to support larger
initial cwnd values in environments where it is known to be safe.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 16:15:13 -07:00
Lawrence Brakmo
8550f328f4 bpf: Support for per connection SYN/SYN-ACK RTOs
This patch adds support for setting a per connection SYN and
SYN_ACK RTOs from within a BPF_SOCK_OPS program. For example,
to set small RTOs when it is known both hosts are within a
datacenter.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 16:15:13 -07:00
Lawrence Brakmo
40304b2a15 bpf: BPF support for sock_ops
Created a new BPF program type, BPF_PROG_TYPE_SOCK_OPS, and a corresponding
struct that allows BPF programs of this type to access some of the
socket's fields (such as IP addresses, ports, etc.). It uses the
existing bpf cgroups infrastructure so the programs can be attached per
cgroup with full inheritance support. The program will be called at
appropriate times to set relevant connections parameters such as buffer
sizes, SYN and SYN-ACK RTOs, etc., based on connection information such
as IP addresses, port numbers, etc.

Alghough there are already 3 mechanisms to set parameters (sysctls,
route metrics and setsockopts), this new mechanism provides some
distinct advantages. Unlike sysctls, it can set parameters per
connection. In contrast to route metrics, it can also use port numbers
and information provided by a user level program. In addition, it could
set parameters probabilistically for evaluation purposes (i.e. do
something different on 10% of the flows and compare results with the
other 90% of the flows). Also, in cases where IPv6 addresses contain
geographic information, the rules to make changes based on the distance
(or RTT) between the hosts are much easier than route metric rules and
can be global. Finally, unlike setsockopt, it oes not require
application changes and it can be updated easily at any time.

Although the bpf cgroup framework already contains a sock related
program type (BPF_PROG_TYPE_CGROUP_SOCK), I created the new type
(BPF_PROG_TYPE_SOCK_OPS) beccause the existing type expects to be called
only once during the connections's lifetime. In contrast, the new
program type will be called multiple times from different places in the
network stack code.  For example, before sending SYN and SYN-ACKs to set
an appropriate timeout, when the connection is established to set
congestion control, etc. As a result it has "op" field to specify the
type of operation requested.

The purpose of this new program type is to simplify setting connection
parameters, such as buffer sizes, TCP's SYN RTO, etc. For example, it is
easy to use facebook's internal IPv6 addresses to determine if both hosts
of a connection are in the same datacenter. Therefore, it is easy to
write a BPF program to choose a small SYN RTO value when both hosts are
in the same datacenter.

This patch only contains the framework to support the new BPF program
type, following patches add the functionality to set various connection
parameters.

This patch defines a new BPF program type: BPF_PROG_TYPE_SOCKET_OPS
and a new bpf syscall command to load a new program of this type:
BPF_PROG_LOAD_SOCKET_OPS.

Two new corresponding structs (one for the kernel one for the user/BPF
program):

/* kernel version */
struct bpf_sock_ops_kern {
        struct sock *sk;
        __u32  op;
        union {
                __u32 reply;
                __u32 replylong[4];
        };
};

/* user version
 * Some fields are in network byte order reflecting the sock struct
 * Use the bpf_ntohl helper macro in samples/bpf/bpf_endian.h to
 * convert them to host byte order.
 */
struct bpf_sock_ops {
        __u32 op;
        union {
                __u32 reply;
                __u32 replylong[4];
        };
        __u32 family;
        __u32 remote_ip4;     /* In network byte order */
        __u32 local_ip4;      /* In network byte order */
        __u32 remote_ip6[4];  /* In network byte order */
        __u32 local_ip6[4];   /* In network byte order */
        __u32 remote_port;    /* In network byte order */
        __u32 local_port;     /* In host byte horder */
};

Currently there are two types of ops. The first type expects the BPF
program to return a value which is then used by the caller (or a
negative value to indicate the operation is not supported). The second
type expects state changes to be done by the BPF program, for example
through a setsockopt BPF helper function, and they ignore the return
value.

The reply fields of the bpf_sockt_ops struct are there in case a bpf
program needs to return a value larger than an integer.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 16:15:13 -07:00
David S. Miller
57a53a0b67 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2017-07-01

Here are some more Bluetooth patches for the 4.13 kernel:

 - Added support for Broadcom BCM43430 controllers
 - Added sockaddr length checks before accessing sa_family
 - Fixed possible "might sleep" errors in bnep, cmtp and hidp modules
 - A few other minor fixes

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 15:57:29 -07:00
Neil Horman
2cb5c8e378 sctp: Add peeloff-flags socket option
Based on a request raised on the sctp devel list, there is a need to
augment the sctp_peeloff operation while specifying the O_CLOEXEC and
O_NONBLOCK flags (simmilar to the socket syscall).  Since modifying the
SCTP_SOCKOPT_PEELOFF socket option would break user space ABI for existing
programs, this patch creates a new socket option
SCTP_SOCKOPT_PEELOFF_FLAGS, which accepts a third flags parameter to
allow atomic assignment of the socket descriptor flags.

Tested successfully by myself and the requestor

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Vlad Yasevich <vyasevich@gmail.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: Andreas Steinmetz <ast@domdv.de>
CC: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 15:26:11 -07:00
David S. Miller
283131d20e NFC 4.13 pull request
This is the NFC pull requesy for 4.13. We have:
 
 - A conversion to unified device and GPIO APIs for the
   fdp, pn544, and st{21,-nci} drivers.
 - A fix for NFC device IDs allocation.
 - A fix for the nfcmrvl driver firmware download mechanism.
 - A trf7970a DT and GPIO cleanup and clock setting fix.
 - A few fixes for potential overflows in the digital and LLCP code.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZVYXGAAoJEIqAPN1PVmxKuA0P/RhaNOWIFLSpVSPOO2/bhLj0
 oxlTW9y+Asy9MPfu5t4vllM/GqtpCYYBpfl35SDGViHlfSIPgrSDqQA2fNRT+hk1
 AwQR8HcCcptfzJ5vkXhJl/9XOUmAalHk1e7KqvNioXP1dAZbCw4/2YK8skzBnoeR
 sRYYSmsJNF1iZR6FSbtlPHWTZqD/SLRSvebDzWiYnltXMqBZeiwXmuEfRPrL/uxO
 C2iAG9Ol9bkxpxpm4jgVTFrW6equymYX0iW4rYKmmd4Ej+29iS8NWAsgK/hCUtNO
 jGrc9+1O/7MM2PoxGt70vojYMQWZpUmPWF2dvRx32+XWTcj96Nk3qOHEM62cQHMs
 OhvgUh9BurXzjtlp+ZipwJhpylczxWZ3TEFxyDTV91cuk+hdQ8xtOWseGgx7XKBy
 3kaYwlYKtiwACYBZ3b/4nzr+vNQ3GzL8VZoeGWzEyimofeWtLUWGuxtxFL4OaSeG
 X3NOZM+g4ILLKH9IKG4DkqWjAn5RD70Jf01z6GfHsfjyz5Awax9Hpp+B8dih/pmb
 4C4g5hX4X7L840Zc+l4WXMRLpH6AgL64orJmLBDxGeYAiPRlWSBHcMTrWSJuN/I4
 pJ++rMpbL2dMXaX8LBgEchtPmzQzVCsKxTXB29NyIbdZE6ebNAnZ9Gur/PwiBQqD
 wnb8HwFOyXgjKBFiA57P
 =Ve8v
 -----END PGP SIGNATURE-----

Merge tag 'nfc-next-4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/nfc-next

Samuel Ortiz says:

====================
NFC 4.13 pull request

This is the NFC pull requesy for 4.13. We have:

- A conversion to unified device and GPIO APIs for the
  fdp, pn544, and st{21,-nci} drivers.
- A fix for NFC device IDs allocation.
- A fix for the nfcmrvl driver firmware download mechanism.
- A trf7970a DT and GPIO cleanup and clock setting fix.
- A few fixes for potential overflows in the digital and LLCP code.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 14:30:39 -07:00
Tonghao Zhang
9cc9a5cb17 datapath: Avoid using stack larger than 1024.
When compiling OvS-master on 4.4.0-81 kernel,
there is a warning:

    CC [M]  /root/ovs/datapath/linux/datapath.o
    /root/ovs/datapath/linux/datapath.c: In function
    'ovs_flow_cmd_set':
    /root/ovs/datapath/linux/datapath.c:1221:1: warning:
    the frame size of 1040 bytes is larger than 1024 bytes
    [-Wframe-larger-than=]

This patch factors out match-init and action-copy to avoid
"Wframe-larger-than=1024" warning. Because mask is only
used to get actions, we new a function to save some
stack space.

Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 09:09:59 -07:00
Xin Long
01a992bea5 sctp: remove the typedef sctp_init_chunk_t
This patch is to remove the typedef sctp_init_chunk_t, and replace
with struct sctp_init_chunk in the places where it's using this
typedef.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 09:08:42 -07:00
Xin Long
4ae70c0845 sctp: remove the typedef sctp_inithdr_t
This patch is to remove the typedef sctp_inithdr_t, and replace
with struct sctp_inithdr in the places where it's using this
typedef.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 09:08:42 -07:00
Xin Long
9f8d314715 sctp: remove the typedef sctp_data_chunk_t
This patch is to remove the typedef sctp_data_chunk_t, and replace
with struct sctp_data_chunk in the places where it's using this
typedef.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 09:08:42 -07:00
Xin Long
3583df1a3d sctp: remove the typedef sctp_datahdr_t
This patch is to remove the typedef sctp_datahdr_t, and replace with
struct sctp_datahdr in the places where it's using this typedef.

It is also to use izeof(variable) instead of sizeof(type).

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 09:08:41 -07:00
Xin Long
34b4e29b38 sctp: remove the typedef sctp_param_t
This patch is to remove the typedef sctp_param_t, and replace with
struct sctp_paramhdr in the places where it's using this typedef.

It is also to remove the useless declaration sctp_addip_addr_config
and fix the lack of params for some other functions' declaration.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 09:08:41 -07:00
Xin Long
3c91870492 sctp: remove the typedef sctp_paramhdr_t
This patch is to remove the typedef sctp_paramhdr_t, and replace
with struct sctp_paramhdr in the places where it's using this
typedef.

It is also to fix some indents and  use sizeof(variable) instead
of sizeof(type).

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 09:08:41 -07:00
Xin Long
6d85e68f4c sctp: remove the typedef sctp_cid_t
This patch is to remove the typedef sctp_cid_t, and replace
with struct sctp_cid in the places where it's using this
typedef.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 09:08:41 -07:00
Xin Long
922dbc5be2 sctp: remove the typedef sctp_chunkhdr_t
This patch is to remove the typedef sctp_chunkhdr_t, and replace
with struct sctp_chunkhdr in the places where it's using this
typedef.

It is also to fix some indents and use sizeof(variable) instead
of sizeof(type)., especially in sctp_new.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 09:08:41 -07:00
Xin Long
ae146d9b76 sctp: remove the typedef sctp_sctphdr_t
This patch is to remove the typedef sctp_sctphdr_t, and replace
with struct sctphdr in the places where it's using this typedef.

It is also to fix some indents and use sizeof(variable) instead
of sizeof(type).

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 09:08:41 -07:00
Xin Long
7f304b9efa sctp: remove an unnecessary check from sctp_endpoint_destroy
ep->base.sk gets it's value since sctp_endpoint_new, nowhere
will change it. So there's no need to check if it's null, as
it can never be null.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 08:46:57 -07:00
Reshetova, Elena
fb5c2c17a5 net: convert packet_fanout.sk_ref from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 07:39:09 -07:00
Reshetova, Elena
b4217b8289 net: convert netlbl_lsm_cache.refcount from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 07:39:09 -07:00
Reshetova, Elena
c122e14df2 net: convert net.passive from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 07:39:09 -07:00
Reshetova, Elena
edcb691871 net: convert inet_frag_queue.refcnt from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 07:39:09 -07:00
Reshetova, Elena
717d1e993a net: convert fib_rule.refcnt from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 07:39:09 -07:00
Reshetova, Elena
8c9814b970 net: convert unix_address.refcnt from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 07:39:08 -07:00
Reshetova, Elena
433cea4d9b net: convert netpoll_info.refcnt from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 07:39:08 -07:00
Reshetova, Elena
7658b36f1b net: convert in_device.refcnt from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 07:39:08 -07:00
Reshetova, Elena
8851ab5267 net: convert ip_mc_list.refcnt from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 07:39:08 -07:00
Reshetova, Elena
41c6d650f6 net: convert sock.sk_refcnt from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

This patch uses refcount_inc_not_zero() instead of
atomic_inc_not_zero_hint() due to absense of a _hint()
version of refcount API. If the hint() version must
be used, we might need to revisit API.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 07:39:08 -07:00
Reshetova, Elena
14afee4b60 net: convert sock.sk_wmem_alloc from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 07:39:08 -07:00
Reshetova, Elena
2638595afc net: convert sk_buff_fclones.fclone_ref from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 07:39:08 -07:00
Reshetova, Elena
633547973f net: convert sk_buff.users from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 07:39:07 -07:00
Reshetova, Elena
53869cebce net: convert nf_bridge_info.use from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 07:39:07 -07:00
Reshetova, Elena
6343944bc1 net: convert neigh_params.refcnt from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 07:39:07 -07:00
Reshetova, Elena
9f23743017 net: convert neighbour.refcnt from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 07:39:07 -07:00
Reshetova, Elena
1cc9a98b59 net: convert inet_peer.refcnt from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
This conversion requires overall +1 on the whole
refcounting scheme.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 07:39:07 -07:00
David S. Miller
b079115937 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
A set of overlapping changes in macvlan and the rocker
driver, nothing serious.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-30 12:43:08 -04:00
David S. Miller
52a623bd61 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for your net-next
tree. This batch contains connection tracking updates for the cleanup
iteration path, patches from Florian Westphal:

X) Skip unconfirmed conntracks in nf_ct_iterate_cleanup_net(), just set
   dying bit to let the CPU release them.

X) Add nf_ct_iterate_destroy() to be used on module removal, to kill
   conntrack from all namespace.

X) Restart iteration on hashtable resizing, since both may occur at
   the same time.

X) Use the new nf_ct_iterate_destroy() to remove conntrack with NAT
   mapping on module removal.

X) Use nf_ct_iterate_destroy() to remove conntrack entries helper
   module removal, from Liping Zhang.

X) Use nf_ct_iterate_cleanup_net() to remove the timeout extension
   if user requests this, also from Liping.

X) Add net_ns_barrier() and use it from FTP helper, so make sure
   no concurrent namespace removal happens at the same time while
   the helper module is being removed.

X) Use NFPROTO_MAX in layer 3 conntrack protocol array, to reduce
   module size. Same thing in nf_tables.

Updates for the nf_tables infrastructure:

X) Prepare usage of the extended ACK reporting infrastructure for
   nf_tables.

X) Remove unnecessary forward declaration in nf_tables hash set.

X) Skip set size estimation if number of element is not specified.

X) Changes to accomodate a (faster) unresizable hash set implementation,
   for anonymous sets and dynamic size fixed sets with no timeouts.

X) Faster lookup function for unresizable hash table for 2 and 4
   bytes key.

And, finally, a bunch of asorted small updates and cleanups:

X) Do not hold reference to netdev from ipt_CLUSTER, instead subscribe
   to device events and look up for index from the packet path, this
   is fixing an issue that is present since the very beginning, patch
   from Xin Long.

X) Use nf_register_net_hook() in ipt_CLUSTER, from Florian Westphal.

X) Use ebt_invalid_target() whenever possible in the ebtables tree,
   from Gao Feng.

X) Calm down compilation warning in nf_dup infrastructure, patch from
   stephen hemminger.

X) Statify functions in nftables rt expression, also from stephen.

X) Update Makefile to use canonical method to specify nf_tables-objs.
   From Jike Song.

X) Use nf_conntrack_helpers_register() in amanda and H323.

X) Space cleanup for ctnetlink, from linzhang.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-30 06:27:09 -07:00
Andrei Otcheretianski
36a554cec1 nl80211: Don't verify owner_nlportid on NAN commands
If NAN interface is created with NL80211_ATTR_SOCKET_OWNER, the socket
that is used to create the interface is used for all NAN operations and
reporting NAN events.
However, it turns out that sending commands and receiving events on
the same socket is not possible in a completely race-free way:
If the socket buffer is overflowed by the events, the command response
will not be sent. In that case the caller will block forever on recv.
Using non-blocking socket for commands is more complicated and still
the command response or ack may not be received.
So, keep unicasting NAN events to the interface creator, but allow
using a different socket for commands.

Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
2017-06-30 09:44:17 +03:00
Al Viro
30e7e3ecf3 ethtool: don't open-code memdup_user()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-30 02:04:10 -04:00
Al Viro
a2c841d942 do_ip_setsockopt(): don't open-code memdup_user()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-30 02:04:09 -04:00
Al Viro
43727da90e do_ipv6_setsockopt(): don't open-code memdup_user()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-30 02:04:08 -04:00
Al Viro
aa47cc1c3c irda: don't open-code memdup_user()
and no, GFP_ATOMIC does not make any sense there...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-30 02:04:08 -04:00
Al Viro
7f2d17c67a xfrm_user_policy(): don't open-code memdup_user()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-30 02:04:07 -04:00
Michal Kubeček
e44699d2c2 net: handle NAPI_GRO_FREE_STOLEN_HEAD case also in napi_frags_finish()
Recently I started seeing warnings about pages with refcount -1. The
problem was traced to packets being reused after their head was merged into
a GRO packet by skb_gro_receive(). While bisecting the issue pointed to
commit c21b48cc1b ("net: adjust skb->truesize in ___pskb_trim()") and
I have never seen it on a kernel with it reverted, I believe the real
problem appeared earlier when the option to merge head frag in GRO was
implemented.

Handling NAPI_GRO_FREE_STOLEN_HEAD state was only added to GRO_MERGED_FREE
branch of napi_skb_finish() so that if the driver uses napi_gro_frags()
and head is merged (which in my case happens after the skb_condense()
call added by the commit mentioned above), the skb is reused including the
head that has been merged. As a result, we release the page reference
twice and eventually end up with negative page refcount.

To fix the problem, handle NAPI_GRO_FREE_STOLEN_HEAD in napi_frags_finish()
the same way it's done in napi_skb_finish().

Fixes: d7e8883cfc ("net: make GRO aware of skb->head_frag")
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-29 15:54:13 -04:00
Arvind Yadav
cddbb79f7a net: bridge: constify attribute_group structures.
attribute_groups are not supposed to change at runtime. All functions
working with attribute_groups provided by <linux/sysfs.h> work with const
attribute_group. So mark the non-const structs as const.

File size before:
   text	   data	    bss	    dec	    hex	filename
   2645	    896	      0	   3541	    dd5	net/bridge/br_sysfs_br.o

File size After adding 'const':
   text	   data	    bss	    dec	    hex	filename
   2701	    832	      0	   3533	    dcd	net/bridge/br_sysfs_br.o

Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-29 15:48:52 -04:00
Arvind Yadav
38ef00cc33 net: constify attribute_group structures.
attribute_groups are not supposed to change at runtime. All functions
working with attribute_groups provided by <linux/device.h> work with const
attribute_group. So mark the non-const structs as const.

File size before:
   text	   data	    bss	    dec	    hex	filename
   9968	   3168	     16	  13152	   3360	net/core/net-sysfs.o

File size After adding 'const':
   text	   data	    bss	    dec	    hex	filename
  10160	   2976	     16	  13152	   3360	net/core/net-sysfs.o

Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-29 15:48:51 -04:00
Donald Sharp
4f75ba6982 net: ipmr: Add ipmr_rtm_getroute
Add to RTNL_FAMILY_IPMR, RTM_GETROUTE the ability
to retrieve one S,G mroute from a specified table.

*,G will return mroute information for just that
particular mroute if it exists.  This is because
it is entirely possible to have more S's then
can fit in one skb to return to the requesting
process.

Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-29 15:37:48 -04:00
Gao Feng
c1a4872ebf net: sched: Fix one possible panic when no destroy callback
When qdisc fail to init, qdisc_create would invoke the destroy callback
to cleanup. But there is no check if the callback exists really. So it
would cause the panic if there is no real destroy callback like the qdisc
codel, fq, and so on.

Take codel as an example following:
When a malicious user constructs one invalid netlink msg, it would cause
codel_init->codel_change->nla_parse_nested failed.
Then kernel would invoke the destroy callback directly but qdisc codel
doesn't define one. It causes one panic as a result.

Now add one the check for destroy to avoid the possible panic.

Fixes: 87b60cfacf ("net_sched: fix error recovery at qdisc creation")
Signed-off-by: Gao Feng <gfree.wind@vip.163.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-29 12:55:12 -04:00
Liping Zhang
91af6ba7ff netfilter: ebt_nflog: fix unexpected truncated packet
"struct nf_loginfo li;" is a local variable, so we should set the flags
to 0 explicitly, else, packets maybe truncated unexpectedly when copied
to the userspace.

Fixes: 7643507fe8 ("netfilter: xt_NFLOG: nflog-range does not truncate packets")
Cc: Vishwanath Pai <vpai@akamai.com>
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-06-29 18:47:02 +02:00
Liping Zhang
deaa0a976b netfilter: nf_ct_dccp/sctp: fix memory leak after netns cleanup
After running the following commands for a while, kmemleak reported that
"1879 new suspected memory leaks" happened:
  # while : ; do
  ip netns add test
  ip netns delete test
  done

  unreferenced object 0xffff88006342fa38 (size 1024):
  comm "ip", pid 15477, jiffies 4295982857 (age 957.836s)
  hex dump (first 32 bytes):
    b8 b0 4d a0 ff ff ff ff c0 34 c3 59 00 88 ff ff  ..M......4.Y....
    04 00 00 00 a4 01 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff8190510a>] kmemleak_alloc+0x4a/0xa0
    [<ffffffff81284130>] __kmalloc_track_caller+0x150/0x300
    [<ffffffff812302d0>] kmemdup+0x20/0x50
    [<ffffffffa04d598a>] dccp_init_net+0x8a/0x160 [nf_conntrack]
    [<ffffffffa04cf9f5>] nf_ct_l4proto_pernet_register_one+0x25/0x90
  ...
  unreferenced object 0xffff88006342da58 (size 1024):
  comm "ip", pid 15477, jiffies 4295982857 (age 957.836s)
  hex dump (first 32 bytes):
    10 b3 4d a0 ff ff ff ff 04 35 c3 59 00 88 ff ff  ..M......5.Y....
    04 00 00 00 a4 01 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff8190510a>] kmemleak_alloc+0x4a/0xa0
    [<ffffffff81284130>] __kmalloc_track_caller+0x150/0x300
    [<ffffffff812302d0>] kmemdup+0x20/0x50
    [<ffffffffa04d6a9d>] sctp_init_net+0x5d/0x130 [nf_conntrack]
    [<ffffffffa04cf9f5>] nf_ct_l4proto_pernet_register_one+0x25/0x90
  ...

This is because we forgot to implement the get_net_proto for sctp and
dccp, so we won't invoke the nf_ct_unregister_sysctl to free the
ctl_table when do netns cleanup. Also note, we will fail to register
the sysctl for dccp/sctp either due to the lack of get_net_proto.

Fixes: c51d39010a ("netfilter: conntrack: built-in support for DCCP")
Fixes: a85406afeb ("netfilter: conntrack: built-in support for SCTP")
Cc: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Acked-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-06-29 18:47:01 +02:00
Mateusz Jurczyk
d2ecfa765d Bluetooth: Add sockaddr length checks before accessing sa_family in bind and connect handlers
Verify that the caller-provided sockaddr structure is large enough to
contain the sa_family field, before accessing it in bind() and connect()
handlers of the Bluetooth sockets. Since neither syscall enforces a minimum
size of the corresponding memory region, very short sockaddrs (zero or one
byte long) result in operating on uninitialized memory while referencing
sa_family.

Signed-off-by: Mateusz Jurczyk <mjurczyk@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2017-06-29 14:37:57 +02:00
Tejun Heo
29e2dd0d56 bluetooth: remove WQ_MEM_RECLAIM from hci workqueues
Bluetooth hci uses ordered HIGHPRI, MEM_RECLAIM workqueues.  It's
likely that the flags came from mechanical conversion from
create_singlethread_workqueue().  Bluetooth shouldn't be depended upon
for memory reclaim and the spurious MEM_RECLAIM flag can trigger the
following warning.  Remove WQ_MEM_RECLAIM and convert to
alloc_ordered_workqueue() while at it.

  workqueue: WQ_MEM_RECLAIM hci0:hci_power_off is flushing !WQ_MEM_RECLAIM events:btusb_work
  ------------[ cut here ]------------
  WARNING: CPU: 2 PID: 14231 at /home/brodo/local/kernel/git/linux/kernel/workqueue.c:2423 check_flush_dependency+0xb3/0x100
  Modules linked in:
  CPU: 2 PID: 14231 Comm: kworker/u9:4 Not tainted 4.12.0-rc6+ #3
  Hardware name: Dell Inc. XPS 13 9343/0TM99H, BIOS A11 12/08/2016
  Workqueue: hci0 hci_power_off
  task: ffff9432dad58000 task.stack: ffff986d43790000
  RIP: 0010:check_flush_dependency+0xb3/0x100
  RSP: 0018:ffff986d43793c90 EFLAGS: 00010086
  RAX: 000000000000005a RBX: ffff943316810820 RCX: 0000000000000000
  RDX: 0000000000000000 RSI: 0000000000000096 RDI: 0000000000000001
  RBP: ffff986d43793cb0 R08: 0000000000000775 R09: ffffffff85bdd5c0
  R10: 0000000000000040 R11: 0000000000000000 R12: ffffffff84d596e0
  R13: ffff9432dad58000 R14: ffff94321c640320 R15: ffff9432dad58000
  FS:  0000000000000000(0000) GS:ffff94331f500000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007b8bca242000 CR3: 000000014f60a000 CR4: 00000000003406e0
  Call Trace:
   flush_work+0x8a/0x1c0
   ? flush_work+0x184/0x1c0
   ? skb_free_head+0x21/0x30
   __cancel_work_timer+0x124/0x1b0
   ? hci_dev_do_close+0x2a4/0x4d0
   cancel_work_sync+0x10/0x20
   btusb_close+0x23/0x100
   hci_dev_do_close+0x2ca/0x4d0
   hci_power_off+0x1e/0x50
   process_one_work+0x184/0x3e0
   worker_thread+0x4a/0x3a0
   ? preempt_count_sub+0x9b/0x100
   ? preempt_count_sub+0x9b/0x100
   kthread+0x125/0x140
   ? process_one_work+0x3e0/0x3e0
   ? __kthread_create_on_node+0x1a0/0x1a0
   ? do_syscall_64+0x58/0xd0
   ret_from_fork+0x27/0x40
  Code: 00 75 bf 49 8b 56 18 48 8d 8b b0 00 00 00 48 81 c6 b0 00 00 00 4d 89 e0 48 c7 c7 20 23 6b 85 c6 05 83 cd 31 01 01 e8 bf c4 0c 00 <0f> ff eb 93 80 3d 74 cd 31 01 00 75 a5 65 48 8b 04 25 00 c5 00
  ---[ end trace b88fd2f77754bfec ]---

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2017-06-29 14:36:38 +02:00
Paolo Abeni
67a51780ae ipv6: udp: leverage scratch area helpers
The commit b65ac44674 ("udp: try to avoid 2 cache miss on dequeue")
leveraged the scratched area helpers for UDP v4 but I forgot to
update accordingly the IPv6 code path.

This change extends the scratch area usage to the IPv6 code, synching
the two implementations and giving some performance benefit.
IPv6 is again almost on the same level of IPv4, performance-wide.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-27 15:43:57 -04:00
Paolo Abeni
b26bbdae46 udp: move scratch area helpers into the include file
So that they can be later used by the IPv6 code, too.
Also lift the comments a bit.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-27 15:43:56 -04:00
Dave Watson
d97af30f61 tcp: fix null ptr deref in getsockopt(..., TCP_ULP, ...)
If icsk_ulp_ops is unset, it dereferences a null ptr.
Add a null ptr check.

BUG: KASAN: null-ptr-deref in copy_to_user include/linux/uaccess.h:168 [inline]
BUG: KASAN: null-ptr-deref in do_tcp_getsockopt.isra.33+0x24f/0x1e30 net/ipv4/tcp.c:3057
Read of size 4 at addr 0000000000000020 by task syz-executor1/15452

Signed-off-by: Dave Watson <davejwatson@fb.com>
Reported-by: "Levin, Alexander (Sasha Levin)" <alexander.levin@verizon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-27 15:39:11 -04:00
Eric Dumazet
6f64ec7451 net: prevent sign extension in dev_get_stats()
Similar to the fix provided by Dominik Heidler in commit
9b3dc0a17d ("l2tp: cast l2tp traffic counter to unsigned")
we need to take care of 32bit kernels in dev_get_stats().

When using atomic_long_read(), we add a 'long' to u64 and
might misinterpret high order bit, unless we cast to unsigned.

Fixes: caf586e5f2 ("net: add a core netdev->rx_dropped counter")
Fixes: 015f0688f5 ("net: net: add a core netdev->tx_dropped counter")
Fixes: 6e7333d315 ("net: add rx_nohandler stat counter")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-27 14:45:12 -04:00
Jeffy Chen
5da8e47d84 Bluetooth: hidp: fix possible might sleep error in hidp_session_thread
It looks like hidp_session_thread has same pattern as the issue reported in
old rfcomm:

	while (1) {
		set_current_state(TASK_INTERRUPTIBLE);
		if (condition)
			break;
		// may call might_sleep here
		schedule();
	}
	__set_current_state(TASK_RUNNING);

Which fixed at:
	dfb2fae Bluetooth: Fix nested sleeps

So let's fix it at the same way, also follow the suggestion of:
https://lwn.net/Articles/628628/

Signed-off-by: Jeffy Chen <jeffy.chen@rock-chips.com>
Tested-by: AL Yu-Chen Cho <acho@suse.com>
Tested-by: Rohit Vaswani <rvaswani@nvidia.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2017-06-27 19:32:11 +02:00
Jeffy Chen
f06d977309 Bluetooth: cmtp: fix possible might sleep error in cmtp_session
It looks like cmtp_session has same pattern as the issue reported in
old rfcomm:

	while (1) {
		set_current_state(TASK_INTERRUPTIBLE);
		if (condition)
			break;
		// may call might_sleep here
		schedule();
	}
	__set_current_state(TASK_RUNNING);

Which fixed at:
	dfb2fae Bluetooth: Fix nested sleeps

So let's fix it at the same way, also follow the suggestion of:
https://lwn.net/Articles/628628/

Signed-off-by: Jeffy Chen <jeffy.chen@rock-chips.com>
Reviewed-by: Brian Norris <briannorris@chromium.org>
Reviewed-by: AL Yu-Chen Cho <acho@suse.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2017-06-27 19:32:11 +02:00
Jeffy Chen
25717382c1 Bluetooth: bnep: fix possible might sleep error in bnep_session
It looks like bnep_session has same pattern as the issue reported in
old rfcomm:

	while (1) {
		set_current_state(TASK_INTERRUPTIBLE);
		if (condition)
			break;
		// may call might_sleep here
		schedule();
	}
	__set_current_state(TASK_RUNNING);

Which fixed at:
	dfb2fae Bluetooth: Fix nested sleeps

So let's fix it at the same way, also follow the suggestion of:
https://lwn.net/Articles/628628/

Signed-off-by: Jeffy Chen <jeffy.chen@rock-chips.com>
Reviewed-by: Brian Norris <briannorris@chromium.org>
Reviewed-by: AL Yu-Chen Cho <acho@suse.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2017-06-27 19:32:11 +02:00
Matthias Schiffer
d116ffc770 net: add netlink_ext_ack argument to rtnl_link_ops.slave_validate
Add support for extended error reporting.

Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-26 23:13:22 -04:00
Matthias Schiffer
17dd0ec470 net: add netlink_ext_ack argument to rtnl_link_ops.slave_changelink
Add support for extended error reporting.

Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-26 23:13:22 -04:00
Matthias Schiffer
a8b8a889e3 net: add netlink_ext_ack argument to rtnl_link_ops.validate
Add support for extended error reporting.

Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-26 23:13:22 -04:00
Matthias Schiffer
ad744b223c net: add netlink_ext_ack argument to rtnl_link_ops.changelink
Add support for extended error reporting.

Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-26 23:13:22 -04:00
Matthias Schiffer
7a3f4a1851 net: add netlink_ext_ack argument to rtnl_link_ops.newlink
Add support for extended error reporting.

Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-26 23:13:21 -04:00
Marcelo Ricardo Leitner
a02d036c02 sctp: adjust ssthresh when transport is idle
RFC 4960 Errata 3.27 identifies that ssthresh should be adjusted to cwnd
because otherwise it could cause the transport to lock into congestion
avoidance phase specially if ssthresh was previously reduced by some
packet drop, leading to poor performance.

The Errata says to adjust ssthresh to cwnd only once, though the same
goal is achieved by updating it every time we update cwnd too. The
caveat is that we could take longer to get back up to speed but that
should be compensated by the fact that we don't adjust on RTO basis (as
RFC says) but based on Heartbeats, which are usually way longer.

See-also: https://tools.ietf.org/html/draft-ietf-tsvwg-rfc4960-errata-01#section-3.27
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-25 14:43:53 -04:00
Marcelo Ricardo Leitner
4ccbd0b0b9 sctp: adjust cwnd increase in Congestion Avoidance phase
RFC4960 Errata 3.26 identified that at the same time RFC4960 states that
cwnd should never grow more than 1*MTU per RTT, Section 7.2.2 was
underspecified and as described could allow increasing cwnd more than
that.

This patch updates it so partial_bytes_acked is maxed to cwnd if
flight_size doesn't reach cwnd, protecting it from such case.

See-also: https://tools.ietf.org/html/draft-ietf-tsvwg-rfc4960-errata-01#section-3.26
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-25 14:43:53 -04:00
Marcelo Ricardo Leitner
e56f777af8 sctp: allow increasing cwnd regardless of ctsn moving or not
As per RFC4960 Errata 3.22, this condition is not needed anymore as it
could cause the partial_bytes_acked to not consider the TSNs acked in
the Gap Ack Blocks although they were received by the peer successfully.

This patch thus drops the check for new Cumulative TSN Ack Point,
leaving just the flight_size < cwnd one.

See-also: https://tools.ietf.org/html/draft-ietf-tsvwg-rfc4960-errata-01#section-3.22
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-25 14:43:53 -04:00
Marcelo Ricardo Leitner
d0b53f4097 sctp: update order of adjustments of partial_bytes_acked and cwnd
RFC4960 Errata 3.12 says RFC4960 is unclear about the order of
adjustments applied to partial_bytes_acked and cwnd in the congestion
avoidance phase, and that the actual order should be:
partial_bytes_acked is reset to (partial_bytes_acked - cwnd). Next, cwnd
is increased by MTU.

We were first increasing cwnd, and then subtracting the new value pba,
which leads to a different result as pba is smaller than what it should
and could cause cwnd to not grow as much.

See-also: https://tools.ietf.org/html/draft-ietf-tsvwg-rfc4960-errata-01#section-3.12
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-25 14:43:53 -04:00
WANG Cong
d747a7a51b tcp: reset sk_rx_dst in tcp_disconnect()
We have to reset the sk->sk_rx_dst when we disconnect a TCP
connection, because otherwise when we re-connect it this
dst reference is simply overridden in tcp_finish_connect().

This fixes a dst leak which leads to a loopback dev refcnt
leak. It is a long-standing bug, Kevin reported a very similar
(if not same) bug before. Thanks to Andrei for providing such
a reliable reproducer which greatly narrows down the problem.

Fixes: 41063e9dd1 ("ipv4: Early TCP socket demux.")
Reported-by: Andrei Vagin <avagin@gmail.com>
Reported-by: Kevin Xu <kaiwen.xu@hulu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-25 12:23:07 -04:00
Wei Wang
85cb73ff9b net: ipv6: reset daddr and dport in sk if connect() fails
In __ip6_datagram_connect(), reset sk->sk_v6_daddr and inet->dport if
error occurs.
In udp_v6_early_demux(), check for sk_state to make sure it is in
TCP_ESTABLISHED state.
Together, it makes sure unconnected UDP socket won't be considered as a
valid candidate for early demux.

v3: add TCP_ESTABLISHED state check in udp_v6_early_demux()
v2: fix compilation error

Fixes: 5425077d73 ("net: ipv6: Add early demux handler for UDP unicast")
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-25 11:46:56 -04:00
Mateusz Jurczyk
e3c42b61ff af_iucv: Move sockaddr length checks to before accessing sa_family in bind and connect handlers
Verify that the caller-provided sockaddr structure is large enough to
contain the sa_family field, before accessing it in bind() and connect()
handlers of the AF_IUCV socket. Since neither syscall enforces a minimum
size of the corresponding memory region, very short sockaddrs (zero or
one byte long) result in operating on uninitialized memory while
referencing .sa_family.

Fixes: 52a82e23b9 ("af_iucv: Validate socket address length in iucv_sock_bind()")
Signed-off-by: Mateusz Jurczyk <mjurczyk@google.com>
[jwi: removed unneeded null-check for addr]
Signed-off-by: Julian Wiedmann <jwi@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-25 11:42:58 -04:00
Hans Wippel
2e56c26b39 net/iucv: improve endianness handling
Use proper endianness conversion for an skb protocol assignment. Given
that IUCV is only available on big endian systems (s390), this simply
avoids an endianness warning reported by sparse.

Signed-off-by: Hans Wippel <hwippel@linux.vnet.ibm.com>
Reviewed-by: Julian Wiedmann <jwi@linux.vnet.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: Julian Wiedmann <jwi@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-25 11:42:58 -04:00
Jakub Kicinski
3fcece12bc net: store port/representator id in metadata_dst
Switches and modern SR-IOV enabled NICs may multiplex traffic from Port
representators and control messages over single set of hardware queues.
Control messages and muxed traffic may need ordered delivery.

Those requirements make it hard to comfortably use TC infrastructure today
unless we have a way of attaching metadata to skbs at the upper device.
Because single set of queues is used for many netdevs stopping TC/sched
queues of all of them reliably is impossible and lower device has to
retreat to returning NETDEV_TX_BUSY and usually has to take extra locks on
the fastpath.

This patch attempts to enable port/representative devs to attach metadata
to skbs which carry port id.  This way representatives can be queueless and
all queuing can be performed at the lower netdev in the usual way.

Traffic arriving on the port/representative interfaces will be have
metadata attached and will subsequently be queued to the lower device for
transmission.  The lower device should recognize the metadata and translate
it to HW specific format which is most likely either a special header
inserted before the network headers or descriptor/metadata fields.

Metadata is associated with the lower device by storing the netdev pointer
along with port id so that if TC decides to redirect or mirror the new
netdev will not try to interpret it.

This is mostly for SR-IOV devices since switches don't have lower netdevs
today.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-25 11:42:01 -04:00
Ingo Molnar
1bc3cd4dfa Merge branch 'linus' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-24 08:57:20 +02:00
Dan Carpenter
ac55cd6193 tls: return -EFAULT if copy_to_user() fails
The copy_to_user() function returns the number of bytes remaining but we
want to return -EFAULT here.

Fixes: 3c4d755915 ("tls: kernel TLS support")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-23 14:19:27 -04:00
David S. Miller
93bbbfbb4a Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2017-06-23

1) Use memdup_user to spmlify xfrm_user_policy.
   From Geliang Tang.

2) Make xfrm_dev_register static to silence a sparse warning.
   From Wei Yongjun.

3) Use crypto_memneq to check the ICV in the AH protocol.
   From Sabrina Dubroca.

4) Remove some unused variables in esp6.
   From Stephen Hemminger.

5) Extend XFRM MIGRATE to allow to change the UDP encapsulation port.
   From Antony Antony.

6) Include the UDP encapsulation port to km_migrate announcements.
   From Antony Antony.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-23 14:17:31 -04:00
David S. Miller
43b786c676 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
pull request (net): ipsec 2017-06-23

1) Fix xfrm garbage collecting when unregistering a netdevice.
   From Hangbin Liu.

2) Fix NULL pointer derefernce when exiting a network namespace.
   From Hangbin Liu.

3) Fix some error codes in pfkey to prevent a NULL pointer derefernce.
   From Dan Carpenter.

4) Fix NULL pointer derefernce on allocation failure in pfkey.
   From Dan Carpenter.

5) Adjust IPv6 payload_len to include extension headers. Otherwise
   we corrupt the packets when doing ESP GRO on transport mode.
   From Yossi Kuperman.

6) Set nhoff to the proper offset of the IPv6 nexthdr when doing ESP GRO.
   From Yossi Kuperman.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-23 14:11:26 -04:00
Jakub Kicinski
926f38e974 tcp: fix out-of-bounds access in ULP sysctl
KASAN reports out-of-bound access in proc_dostring() coming from
proc_tcp_available_ulp() because in case TCP ULP list is empty
the buffer allocated for the response will not have anything
printed into it.  Set the first byte to zero to avoid strlen()
going out-of-bounds.

Fixes: 734942cc4e ("tcp: ULP infrastructure")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-23 14:10:05 -04:00
WANG Cong
0ccc22f425 sit: use __GFP_NOWARN for user controlled allocation
The memory allocation size is controlled by user-space,
if it is too large just fail silently and return NULL,
not to mention there is a fallback allocation later.

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-23 14:08:40 -04:00
Yonghong Song
239946314e bpf: possibly avoid extra masking for narrower load in verifier
Commit 31fd85816d ("bpf: permits narrower load from bpf program
context fields") permits narrower load for certain ctx fields.
The commit however will already generate a masking even if
the prog-specific ctx conversion produces the result with
narrower size.

For example, for __sk_buff->protocol, the ctx conversion
loads the data into register with 2-byte load.
A narrower 2-byte load should not generate masking.
For __sk_buff->vlan_present, the conversion function
set the result as either 0 or 1, essentially a byte.
The narrower 2-byte or 1-byte load should not generate masking.

To avoid unnecessary masking, prog-specific *_is_valid_access
now passes converted_op_size back to verifier, which indicates
the valid data width after perceived future conversion.
Based on this information, verifier is able to avoid
unnecessary marking.

Since we want more information back from prog-specific
*_is_valid_access checking, all of them are packed into
one data structure for more clarity.

Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-23 14:04:11 -04:00
Jakub Kicinski
ce158e580a xdp: add reporting of offload mode
Extend the XDP_ATTACHED_* values to include offloaded mode.
Let drivers report whether program is installed in the driver
or the HW by changing the prog_attached field from bool to
u8 (type of the netlink attribute).

Exploit the fact that the value of XDP_ATTACHED_DRV is 1,
therefore since all drivers currently assign the mode with
double negation:
       mode = !!xdp_prog;
no drivers have to be modified.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-23 13:42:20 -04:00
Jakub Kicinski
ee5d032f7d xdp: add HW offload mode flag for installing programs
Add an installation-time flag for requesting that the program
be installed only if it can be offloaded to HW.

Internally new command for ndo_xdp is added, this way we avoid
putting checks into drivers since they all return -EINVAL on
an unknown command.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-23 13:42:19 -04:00
Jakub Kicinski
32d602771b xdp: pass XDP flags into install handlers
Pass XDP flags to the xdp ndo.  This will allow drivers to look
at the mode flags and make decisions about offload.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-23 13:42:18 -04:00
Michal Kubeček
a5cb659bbc net: account for current skb length when deciding about UFO
Our customer encountered stuck NFS writes for blocks starting at specific
offsets w.r.t. page boundary caused by networking stack sending packets via
UFO enabled device with wrong checksum. The problem can be reproduced by
composing a long UDP datagram from multiple parts using MSG_MORE flag:

  sendto(sd, buff, 1000, MSG_MORE, ...);
  sendto(sd, buff, 1000, MSG_MORE, ...);
  sendto(sd, buff, 3000, 0, ...);

Assume this packet is to be routed via a device with MTU 1500 and
NETIF_F_UFO enabled. When second sendto() gets into __ip_append_data(),
this condition is tested (among others) to decide whether to call
ip_ufo_append_data():

  ((length + fragheaderlen) > mtu) || (skb && skb_is_gso(skb))

At the moment, we already have skb with 1028 bytes of data which is not
marked for GSO so that the test is false (fragheaderlen is usually 20).
Thus we append second 1000 bytes to this skb without invoking UFO. Third
sendto(), however, has sufficient length to trigger the UFO path so that we
end up with non-UFO skb followed by a UFO one. Later on, udp_send_skb()
uses udp_csum() to calculate the checksum but that assumes all fragments
have correct checksum in skb->csum which is not true for UFO fragments.

When checking against MTU, we need to add skb->len to length of new segment
if we already have a partially filled skb and fragheaderlen only if there
isn't one.

In the IPv6 case, skb can only be null if this is the first segment so that
we have to use headersize (length of the first IPv6 header) rather than
fragheaderlen (length of IPv6 header of further fragments) for skb == NULL.

Fixes: e89e9cf539 ("[IPv4/IPv6]: UFO Scatter-gather approach")
Fixes: e4c5e13aa4 ("ipv6: Should use consistent conditional judgement for
	ip6 fragment between __ip6_append_data and ip6_finish_output")
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-23 13:29:38 -04:00
Paolo Abeni
9bd780f5e0 udp: fix poll()
Michael reported an UDP breakage caused by the commit b65ac44674
("udp: try to avoid 2 cache miss on dequeue").
The function __first_packet_length() can update the checksum bits
of the pending skb, making the scratched area out-of-sync, and
setting skb->csum, if the skb was previously in need of checksum
validation.

On later recvmsg() for such skb, checksum validation will be
invoked again - due to the wrong udp_skb_csum_unnecessary()
value - and will fail, causing the valid skb to be dropped.

This change addresses the issue refreshing the scratch area in
__first_packet_length() after the possible checksum update.

Fixes: b65ac44674 ("udp: try to avoid 2 cache miss on dequeue")
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-23 11:18:43 -04:00
Mateusz Jurczyk
f6a5885fc4 NFC: Add sockaddr length checks before accessing sa_family in bind handlers
Verify that the caller-provided sockaddr structure is large enough to
contain the sa_family field, before accessing it in bind() handlers of the
AF_NFC socket. Since the syscall doesn't enforce a minimum size of the
corresponding memory region, very short sockaddrs (zero or one byte long)
result in operating on uninitialized memory while referencing .sa_family.

Signed-off-by: Mateusz Jurczyk <mjurczyk@google.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2017-06-23 00:38:31 +02:00
Gustavo A. R. Silva
03036184e9 nfc: nci: remove unnecessary null check
Remove unnecessary NULL check for pointer conn_info.
conn_info is set in list_for_each_entry() using container_of(),
which is never NULL.

Addresses-Coverity-ID: 1362349
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2017-06-23 00:34:18 +02:00
Mateusz Jurczyk
a0323b979f nfc: Ensure presence of required attributes in the activate_target handler
Check that the NFC_ATTR_TARGET_INDEX and NFC_ATTR_PROTOCOLS attributes (in
addition to NFC_ATTR_DEVICE_INDEX) are provided by the netlink client
prior to accessing them. This prevents potential unhandled NULL pointer
dereference exceptions which can be triggered by malicious user-mode
programs, if they omit one or both of these attributes.

Signed-off-by: Mateusz Jurczyk <mjurczyk@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2017-06-23 00:28:39 +02:00
Mateusz Jurczyk
608c4adfca nfc: Fix the sockaddr length sanitization in llcp_sock_connect
Fix the sockaddr length verification in the connect() handler of NFC/LLCP
sockets, to compare against the size of the actual structure expected on
input (sockaddr_nfc_llcp) instead of its shorter version (sockaddr_nfc).

Both structures are defined in include/uapi/linux/nfc.h. The fields
specific to the _llcp extended struct are as follows:

   276		__u8 dsap; /* Destination SAP, if known */
   277		__u8 ssap; /* Source SAP to be bound to */
   278		char service_name[NFC_LLCP_MAX_SERVICE_NAME]; /* Service name URI */;
   279		size_t service_name_len;

If the caller doesn't provide a sufficiently long sockaddr buffer, these
fields remain uninitialized (and they currently originate from the stack
frame of the top-level sys_connect handler). They are then copied by
llcp_sock_connect() into internal storage (nfc_llcp_sock structure), and
could be subsequently read back through the user-mode getsockname()
function (handled by llcp_sock_getname()). This would result in the
disclosure of up to ~70 uninitialized bytes from the kernel stack to
user-mode clients capable of creating AFC_NFC sockets.

Signed-off-by: Mateusz Jurczyk <mjurczyk@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2017-06-23 00:26:19 +02:00
Mark Greer
1b609e4384 NFC: digital: NFC-DEP Target WT(nfcdep,max) is now 14
Version 1.1 of the NFC Forum's NFC Digital Protocol Technical
Specification dated 2014-07-14 specifies that the NFC-DEP Protocol's
Target WT(nfcdep,max) value is 14.  In version 1.0 it was 8 so change
the value in the Linux NFC-DEP Protocol code accordingly.

Signed-off-by: Mark Greer <mgreer@animalcreek.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2017-06-23 00:19:59 +02:00
Mark Greer
7f9f171336 NFC: digital: NFC-A SEL_RES must be one byte
Section 4.8.2 (SEL_RES Response) of NFC Forum's NFC Digital Protocol
Technical Specification dated 2010-11-17 clearly states that the size
of a SEL_RES Response is one byte.  Enforce this restriction in the
code.

Signed-off-by: Mark Greer <mgreer@animalcreek.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2017-06-23 00:19:59 +02:00
Markus Elfring
dcfca27faf NFC: digital: Delete an error message for memory allocation failure
Omit an extra message for a memory allocation failure in this function.

This issue was detected by using the Coccinelle software.

Link: http://events.linuxfoundation.org/sites/events/files/slides/LCJ16-Refactor_Strings-WSang_0.pdf
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2017-06-23 00:14:12 +02:00
Markus Elfring
ae72c9910b NFC: digital: Improve a size determination in four functions
Replace the specification of four data structures by pointer dereferences
as the parameter for the operator "sizeof" to make the corresponding size
determination a bit safer according to the Linux coding style convention.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2017-06-23 00:07:04 +02:00
Paolo Abeni
4b943faedf udp/v6: prefetch rmem_alloc in udp6_queue_rcv_skb()
very similar to commit dd99e425be ("udp: prefetch
rmem_alloc in udp_queue_rcv_skb()"), this allows saving a cache
miss when the BH is bottle-neck for UDP over ipv6 packet
processing, e.g. for small packets when a single RX NIC ingress
queue is in use.

Performances under flood when multiple NIC RX queues used are
unaffected, but when a single NIC rx queue is in use, this
gives ~8% performance improvement.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-22 13:44:04 -04:00
WANG Cong
60abc0be96 ipv6: avoid unregistering inet6_dev for loopback
The per netns loopback_dev->ip6_ptr is unregistered and set to
NULL when its mtu is set to smaller than IPV6_MIN_MTU, this
leads to that we could set rt->rt6i_idev NULL after a
rt6_uncached_list_flush_dev() and then crash after another
call.

In this case we should just bring its inet6_dev down, rather
than unregistering it, at least prior to commit 176c39af29
("netns: fix addrconf_ifdown kernel panic") we always
override the case for loopback.

Thanks a lot to Andrey for finding a reliable reproducer.

Fixes: 176c39af29 ("netns: fix addrconf_ifdown kernel panic")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Daniel Lezcano <dlezcano@fr.ibm.com>
Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: David Ahern <dsahern@gmail.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-22 13:21:44 -04:00
Sowmini Varadhan
c14b036681 rds: tcp: set linger to 1 when unloading a rds-tcp
If we are unloading the rds_tcp module, we can set linger to 1
and drop pending packets to accelerate reconnect. The peer will
end up resetting the connection based on new generation numbers
of the new incarnation, so hanging on to unsent TCP packets via
linger is mostly pointless in this case.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Tested-by: Jenny Xu <jenny.x.xu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-22 11:34:04 -04:00
Sowmini Varadhan
69b92b5b74 rds: tcp: send handshake ping-probe from passive endpoint
The RDS handshake ping probe added by commit 5916e2c155
("RDS: TCP: Enable multipath RDS for TCP") is sent from rds_sendmsg()
before the first data packet is sent to a peer. If the conversation
is not bidirectional  (i.e., one side is always passive and never
invokes rds_sendmsg()) and the passive side restarts its rds_tcp
module, a new HS ping probe needs to be sent, so that the number
of paths can be re-established.

This patch achieves that by sending a HS ping probe from
rds_tcp_accept_one() when c_npaths is 0 (i.e., we have not done
a handshake probe with this peer yet).

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Tested-by: Jenny Xu <jenny.x.xu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-22 11:34:04 -04:00
Chenbo Feng
8fac365f63 tcp: Add a tcp_filter hook before handle ack packet
Currently in both ipv4 and ipv6 code path, the ack packet received when
sk at TCP_NEW_SYN_RECV state is not filtered by socket filter or cgroup
filter since it is handled from tcp_child_process and never reaches the
tcp_filter inside tcp_v4_rcv or tcp_v6_rcv. Adding a tcp_filter hooks
here can make sure all the ingress tcp packet can be correctly filtered.

Signed-off-by: Chenbo Feng <fengc@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-22 11:13:56 -04:00
WANG Cong
76da070450 ipv6: only call ip6_route_dev_notify() once for NETDEV_UNREGISTER
In commit 242d3a49a2 ("ipv6: reorder ip6_route_dev_notifier after ipv6_dev_notf")
I assumed NETDEV_REGISTER and NETDEV_UNREGISTER are paired,
unfortunately, as reported by jeffy, netdev_wait_allrefs()
could rebroadcast NETDEV_UNREGISTER event until all refs are
gone.

We have to add an additional check to avoid this corner case.
For netdev_wait_allrefs() dev->reg_state is NETREG_UNREGISTERED,
for dev_change_net_namespace(), dev->reg_state is
NETREG_REGISTERED. So check for dev->reg_state != NETREG_UNREGISTERED.

Fixes: 242d3a49a2 ("ipv6: reorder ip6_route_dev_notifier after ipv6_dev_notf")
Reported-by: jeffy <jeffy.chen@rock-chips.com>
Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-22 11:06:06 -04:00
Yossi Kuperman
ca3a1b8566 esp6_offload: Fix IP6CB(skb)->nhoff for ESP GRO
IP6CB(skb)->nhoff is the offset of the nexthdr field in an IPv6
header, unless there are extension headers present, in which case
nhoff points to the nexthdr field of the last extension header.

In non-GRO code path, nhoff is set by ipv6_rcv before any XFRM code
is executed. Conversely, in GRO code path (when esp6_offload is loaded),
nhoff is not set. The following functions fail to read the correct value
and eventually the packet is dropped:

    xfrm6_transport_finish
    xfrm6_tunnel_input
    xfrm6_rcv_tnl

Set nhoff to the proper offset of nexthdr in esp6_gro_receive.

Fixes: 7785bba299 ("esp: Add a software GRO codepath")
Signed-off-by: Yossi Kuperman <yossiku@mellanox.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-06-22 10:49:14 +02:00
Yossi Kuperman
7c88e21aef xfrm6: Fix IPv6 payload_len in xfrm6_transport_finish
IPv6 payload length indicates the size of the payload, including any
extension headers.

In xfrm6_transport_finish, ipv6_hdr(skb)->payload_len is set to the
payload size only, regardless of the presence of any extension headers.
After ESP GRO transport mode decapsulation, ipv6_rcv trims the packet
according to the wrong payload_len, thus corrupting the packet.

Set payload_len to account for extension headers as well.

Fixes: 7785bba299 ("esp: Add a software GRO codepath")
Signed-off-by: Yossi Kuperman <yossiku@mellanox.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-06-22 10:49:14 +02:00
David S. Miller
3d09198243 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Two entries being added at the same time to the IFLA
policy table, whilst parallel bug fixes to decnet
routing dst handling overlapping with the dst gc removal
in net-next.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-21 17:35:22 -04:00
Daniel Borkmann
e86283071f bpf: expose prog id for cls_bpf and act_bpf
In order to be able to retrieve the attached programs from cls_bpf
and act_bpf, we need to expose the prog ids via netlink so that
an application can later on get an fd based on the id through the
BPF_PROG_GET_FD_BY_ID command, and dump related prog info via
BPF_OBJ_GET_INFO_BY_FD command for bpf(2).

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-21 15:14:23 -04:00
David Herrmann
28b5ba2aa0 net: introduce SO_PEERGROUPS getsockopt
This adds the new getsockopt(2) option SO_PEERGROUPS on SOL_SOCKET to
retrieve the auxiliary groups of the remote peer. It is designed to
naturally extend SO_PEERCRED. That is, the underlying data is from the
same credentials. Regarding its syntax, it is based on SO_PEERSEC. That
is, if the provided buffer is too small, ERANGE is returned and @optlen
is updated. Otherwise, the information is copied, @optlen is set to the
actual size, and 0 is returned.

While SO_PEERCRED (and thus `struct ucred') already returns the primary
group, it lacks the auxiliary group vector. However, nearly all access
controls (including kernel side VFS and SYSVIPC, but also user-space
polkit, DBus, ...) consider the entire set of groups, rather than just
the primary group. But this is currently not possible with pure
SO_PEERCRED. Instead, user-space has to work around this and query the
system database for the auxiliary groups of a UID retrieved via
SO_PEERCRED.

Unfortunately, there is no race-free way to query the auxiliary groups
of the PID/UID retrieved via SO_PEERCRED. Hence, the current user-space
solution is to use getgrouplist(3p), which itself falls back to NSS and
whatever is configured in nsswitch.conf(3). This effectively checks
which groups we *would* assign to the user if it logged in *now*. On
normal systems it is as easy as reading /etc/group, but with NSS it can
resort to quering network databases (eg., LDAP), using IPC or network
communication.

Long story short: Whenever we want to use auxiliary groups for access
checks on IPC, we need further IPC to talk to the user/group databases,
rather than just relying on SO_PEERCRED and the incoming socket. This
is unfortunate, and might even result in dead-locks if the database
query uses the same IPC as the original request.

So far, those recursions / dead-locks have been avoided by using
primitive IPC for all crucial NSS modules. However, we want to avoid
re-inventing the wheel for each NSS module that might be involved in
user/group queries. Hence, we would preferably make DBus (and other IPC
that supports access-management based on groups) work without resorting
to the user/group database. This new SO_PEERGROUPS ioctl would allow us
to make dbus-daemon work without ever calling into NSS.

Cc: Michal Sekletar <msekleta@redhat.com>
Cc: Simon McVittie <simon.mcvittie@collabora.co.uk>
Reviewed-by: Tom Gundersen <teg@jklm.no>
Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-21 11:38:41 -04:00
Paolo Abeni
dd99e425be udp: prefetch rmem_alloc in udp_queue_rcv_skb()
On UDP packets processing, if the BH is the bottle-neck, it
always sees a cache miss while updating rmem_alloc; try to
avoid it prefetching the value as soon as we have the socket
available.

Performances under flood with multiple NIC rx queues used are
unaffected, but when a single NIC rx queue is in use, this
gives ~10% performance improvement.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-21 11:38:11 -04:00
Julien Gomes
dd12d15c9a ip6mr: add netlink notifications on mrt6msg cache reports
Add Netlink notifications on cache reports in ip6mr, in addition to the
existing mrt6msg sent to mroute6_sk.
Send RTM_NEWCACHEREPORT notifications to RTNLGRP_IPV6_MROUTE_R.

MSGTYPE, MIF_ID, SRC_ADDR and DST_ADDR Netlink attributes contain the
same data as their equivalent fields in the mrt6msg header.
PKT attribute is the packet sent to mroute6_sk, without the added
mrt6msg header.

Suggested-by: Ryan Halbrook <halbrook@arista.com>
Signed-off-by: Julien Gomes <julien@arista.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-21 11:22:53 -04:00
Julien Gomes
5a645dd86c ipmr: add netlink notifications on igmpmsg cache reports
Add Netlink notifications on cache reports in ipmr, in addition to the
existing igmpmsg sent to mroute_sk.
Send RTM_NEWCACHEREPORT notifications to RTNLGRP_IPV4_MROUTE_R.

MSGTYPE, VIF_ID, SRC_ADDR and DST_ADDR Netlink attributes contain the
same data as their equivalent fields in the igmpmsg header.
PKT attribute is the packet sent to mroute_sk, without the added igmpmsg
header.

Suggested-by: Ryan Halbrook <halbrook@arista.com>
Signed-off-by: Julien Gomes <julien@arista.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-21 11:22:52 -04:00
Julien Gomes
5f729eaabe rtnetlink: add restricted rtnl groups for ipv4 and ipv6 mroute
Add RTNLGRP_{IPV4,IPV6}_MROUTE_R as two new restricted groups for the
NETLINK_ROUTE family.
Binding to these groups specifically requires CAP_NET_ADMIN to allow
multicast of sensitive messages (e.g. mroute cache reports).

Suggested-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Julien Gomes <julien@arista.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-21 11:22:52 -04:00
Arnd Bergmann
083a0326ae tcp: md5: hide unused variable
Changing from a memcpy to per-member comparison left the
size variable unused:

net/ipv4/tcp_ipv4.c: In function 'tcp_md5_do_lookup':
net/ipv4/tcp_ipv4.c:910:15: error: unused variable 'size' [-Werror=unused-variable]

This does not show up when CONFIG_IPV6 is enabled, but the
variable can be removed either way, along with the now unused
assignment.

Fixes: 6797318e62 ("tcp: md5: add an address prefix for key lookup")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-21 11:19:35 -04:00
WANG Cong
b4846fc3c8 igmp: add a missing spin_lock_init()
Andrey reported a lockdep warning on non-initialized
spinlock:

 INFO: trying to register non-static key.
 the code is fine but needs lockdep annotation.
 turning off the locking correctness validator.
 CPU: 1 PID: 4099 Comm: a.out Not tainted 4.12.0-rc6+ #9
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
 Call Trace:
  __dump_stack lib/dump_stack.c:16
  dump_stack+0x292/0x395 lib/dump_stack.c:52
  register_lock_class+0x717/0x1aa0 kernel/locking/lockdep.c:755
  ? 0xffffffffa0000000
  __lock_acquire+0x269/0x3690 kernel/locking/lockdep.c:3255
  lock_acquire+0x22d/0x560 kernel/locking/lockdep.c:3855
  __raw_spin_lock_bh ./include/linux/spinlock_api_smp.h:135
  _raw_spin_lock_bh+0x36/0x50 kernel/locking/spinlock.c:175
  spin_lock_bh ./include/linux/spinlock.h:304
  ip_mc_clear_src+0x27/0x1e0 net/ipv4/igmp.c:2076
  igmpv3_clear_delrec+0xee/0x4f0 net/ipv4/igmp.c:1194
  ip_mc_destroy_dev+0x4e/0x190 net/ipv4/igmp.c:1736

We miss a spin_lock_init() in igmpv3_add_delrec(), probably
because previously we never use it on this code path. Since
we already unlink it from the global mc_tomb list, it is
probably safe not to acquire this spinlock here. It does not
harm to have it although, to avoid conditional locking.

Fixes: c38b7d327a ("igmp: acquire pmc lock for ip_mc_clear_src()")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-20 15:51:57 -04:00
Serhey Popovych
db833d40ad rtnetlink: add IFLA_GROUP to ifla_policy
Network interface groups support added while ago, however
there is no IFLA_GROUP attribute description in policy
and netlink message size calculations until now.

Add IFLA_GROUP attribute to the policy.

Fixes: cbda10fa97 ("net_device: add support for network device groups")
Signed-off-by: Serhey Popovych <serhe.popovych@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-20 15:36:16 -04:00
Serhey Popovych
07f615574f ipv6: Do not leak throw route references
While commit 73ba57bfae ("ipv6: fix backtracking for throw routes")
does good job on error propagation to the fib_rules_lookup()
in fib rules core framework that also corrects throw routes
handling, it does not solve route reference leakage problem
happened when we return -EAGAIN to the fib_rules_lookup()
and leave routing table entry referenced in arg->result.

If rule with matched throw route isn't last matched in the
list we overwrite arg->result losing reference on throw
route stored previously forever.

We also partially revert commit ab997ad408 ("ipv6: fix the
incorrect return value of throw route") since we never return
routing table entry with dst.error == -EAGAIN when
CONFIG_IPV6_MULTIPLE_TABLES is on. Also there is no point
to check for RTF_REJECT flag since it is always set throw
route.

Fixes: 73ba57bfae ("ipv6: fix backtracking for throw routes")
Signed-off-by: Serhey Popovych <serhe.popovych@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-20 15:34:02 -04:00
Xin Long
5ee8aa6897 sctp: handle errors when updating asoc
It's a bad thing not to handle errors when updating asoc. The memory
allocation failure in any of the functions called in sctp_assoc_update()
would cause sctp to work unexpectedly.

This patch is to fix it by aborting the asoc and reporting the error when
any of these functions fails.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-20 15:32:55 -04:00
Xin Long
8cd5c25f2d sctp: uncork the old asoc before changing to the new one
local_cork is used to decide if it should uncork asoc outq after processing
some cmds, and it is set when replying or sending msgs. local_cork should
always have the same value with current asoc q->cork in some way.

The thing is when changing to a new asoc by cmd SET_ASOC, local_cork may
not be consistent with the current asoc any more. The cmd seqs can be:

  SCTP_CMD_UPDATE_ASSOC (asoc)
  SCTP_CMD_REPLY (asoc)
  SCTP_CMD_SET_ASOC (new_asoc)
  SCTP_CMD_DELETE_TCB (new_asoc)
  SCTP_CMD_SET_ASOC (asoc)
  SCTP_CMD_REPLY (asoc)

The 1st REPLY makes OLD asoc q->cork and local_cork both are 1, and the cmd
DELETE_TCB clears NEW asoc q->cork and local_cork. After asoc goes back to
OLD asoc, q->cork is still 1 while local_cork is 0. The 2nd REPLY will not
set local_cork because q->cork is already set and it can't be uncorked and
sent out because of this.

To keep local_cork consistent with the current asoc q->cork, this patch is
to uncork the old asoc if local_cork is set before changing to the new one.

Note that the above cmd seqs will be used in the next patch when updating
asoc and handling errors in it.

Suggested-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-20 15:32:55 -04:00
Xin Long
a0f9a4c2ff dccp: call inet_add_protocol after register_pernet_subsys in dccp_v6_init
Patch "call inet_add_protocol after register_pernet_subsys in dccp_v4_init"
fixed a null pointer dereference issue for dccp_ipv4 module.

The same fix is needed for dccp_ipv6 module.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-20 15:31:43 -04:00
Xin Long
d5494acb88 dccp: call inet_add_protocol after register_pernet_subsys in dccp_v4_init
Now dccp_ipv4 works as a kernel module. During loading this module, if
one dccp packet is being recieved after inet_add_protocol but before
register_pernet_subsys in which v4_ctl_sk is initialized, a null pointer
dereference may be triggered because of init_net.dccp.v4_ctl_sk is 0x0.

Jianlin found this issue when the following call trace occurred:

[  171.950177] BUG: unable to handle kernel NULL pointer dereference at 0000000000000110
[  171.951007] IP: [<ffffffffc0558364>] dccp_v4_ctl_send_reset+0xc4/0x220 [dccp_ipv4]
[...]
[  171.984629] Call Trace:
[  171.984859]  <IRQ>
[  171.985061]
[  171.985213]  [<ffffffffc0559a53>] dccp_v4_rcv+0x383/0x3f9 [dccp_ipv4]
[  171.985711]  [<ffffffff815ca054>] ip_local_deliver_finish+0xb4/0x1f0
[  171.986309]  [<ffffffff815ca339>] ip_local_deliver+0x59/0xd0
[  171.986852]  [<ffffffff810cd7a4>] ? update_curr+0x104/0x190
[  171.986956]  [<ffffffff815c9cda>] ip_rcv_finish+0x8a/0x350
[  171.986956]  [<ffffffff815ca666>] ip_rcv+0x2b6/0x410
[  171.986956]  [<ffffffff810c83b4>] ? task_cputime+0x44/0x80
[  171.986956]  [<ffffffff81586f22>] __netif_receive_skb_core+0x572/0x7c0
[  171.986956]  [<ffffffff810d2c51>] ? trigger_load_balance+0x61/0x1e0
[  171.986956]  [<ffffffff81587188>] __netif_receive_skb+0x18/0x60
[  171.986956]  [<ffffffff8158841e>] process_backlog+0xae/0x180
[  171.986956]  [<ffffffff8158799d>] net_rx_action+0x16d/0x380
[  171.986956]  [<ffffffff81090b7f>] __do_softirq+0xef/0x280
[  171.986956]  [<ffffffff816b6a1c>] call_softirq+0x1c/0x30

This patch is to move inet_add_protocol after register_pernet_subsys in
dccp_v4_init, so that v4_ctl_sk is initialized before any incoming dccp
packets are processed.

Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-20 15:31:43 -04:00
Matthias Schiffer
dc5321d796 vxlan: get rid of redundant vxlan_dev.flags
There is no good reason to keep the flags twice in vxlan_dev and
vxlan_config.

Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-20 13:37:02 -04:00
yuan linyu
b952f4dff2 net: manual clean code which call skb_put_[data:zero]
Signed-off-by: yuan linyu <Linyu.Yuan@alcatel-sbell.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-20 13:30:15 -04:00
yuan linyu
de77b966ce net: introduce __skb_put_[zero, data, u8]
follow Johannes Berg, semantic patch file as below,
@@
identifier p, p2;
expression len;
expression skb;
type t, t2;
@@
(
-p = __skb_put(skb, len);
+p = __skb_put_zero(skb, len);
|
-p = (t)__skb_put(skb, len);
+p = __skb_put_zero(skb, len);
)
... when != p
(
p2 = (t2)p;
-memset(p2, 0, len);
|
-memset(p, 0, len);
)

@@
identifier p;
expression len;
expression skb;
type t;
@@
(
-t p = __skb_put(skb, len);
+t p = __skb_put_zero(skb, len);
)
... when != p
(
-memset(p, 0, len);
)

@@
type t, t2;
identifier p, p2;
expression skb;
@@
t *p;
...
(
-p = __skb_put(skb, sizeof(t));
+p = __skb_put_zero(skb, sizeof(t));
|
-p = (t *)__skb_put(skb, sizeof(t));
+p = __skb_put_zero(skb, sizeof(t));
)
... when != p
(
p2 = (t2)p;
-memset(p2, 0, sizeof(*p));
|
-memset(p, 0, sizeof(*p));
)

@@
expression skb, len;
@@
-memset(__skb_put(skb, len), 0, len);
+__skb_put_zero(skb, len);

@@
expression skb, len, data;
@@
-memcpy(__skb_put(skb, len), data, len);
+__skb_put_data(skb, data, len);

@@
expression SKB, C, S;
typedef u8;
identifier fn = {__skb_put};
fresh identifier fn2 = fn ## "_u8";
@@
- *(u8 *)fn(SKB, S) = C;
+ fn2(SKB, C);

Signed-off-by: yuan linyu <Linyu.Yuan@alcatel-sbell.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-20 13:30:14 -04:00
Sebastian Siewior
fe420d87bb net/core: remove explicit do_softirq() from busy_poll_stop()
Since commit 217f697436 ("net: busy-poll: allow preemption in
sk_busy_loop()") there is an explicit do_softirq() invocation after
local_bh_enable() has been invoked.
I don't understand why we need this because local_bh_enable() will
invoke do_softirq() once the softirq counter reached zero and we have
softirq-related work pending.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-20 13:09:33 -04:00
Serhey Popovych
bdaf32c3ce fib_rules: Resolve goto rules target on delete
We should avoid marking goto rules unresolved when their
target is actually reachable after rule deletion.

Consolder following sample scenario:

  # ip -4 ru sh
  0:      from all lookup local
  32000:  from all goto 32100
  32100:  from all lookup main
  32100:  from all lookup default
  32766:  from all lookup main
  32767:  from all lookup default

  # ip -4 ru del pref 32100 table main
  # ip -4 ru sh
  0:      from all lookup local
  32000:  from all goto 32100 [unresolved]
  32100:  from all lookup default
  32766:  from all lookup main
  32767:  from all lookup default

After removal of first rule with preference 32100 we
mark all goto rules as unreachable, even when rule with
same preference as removed one still present.

Check if next rule with same preference is available
and make all rules with goto action pointing to it.

Signed-off-by: Serhey Popovych <serhe.popovych@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-20 12:39:18 -04:00
Ingo Molnar
ac6424b981 sched/wait: Rename wait_queue_t => wait_queue_entry_t
Rename:

	wait_queue_t		=>	wait_queue_entry_t

'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
which had to carry the name.

Start sorting this out by renaming it to 'wait_queue_entry_t'.

This also allows the real structure name 'struct __wait_queue' to
lose its double underscore and become 'struct wait_queue_entry',
which is the more canonical nomenclature for such data types.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 12:18:27 +02:00
Xin Long
86fdb3448c sctp: ensure ep is not destroyed before doing the dump
Now before dumping a sock in sctp_diag, it only holds the sock while
the ep may be already destroyed. It can cause a use-after-free panic
when accessing ep->asocs.

This patch is to set sctp_sk(sk)->ep NULL in sctp_endpoint_destroy,
and check if this ep is already destroyed before dumping this ep.

Suggested-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdrver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-19 15:13:43 -04:00
Gao Feng
9745e362ad net: 8021q: Fix one possible panic caused by BUG_ON in free_netdev
The register_vlan_device would invoke free_netdev directly, when
register_vlan_dev failed. It would trigger the BUG_ON in free_netdev
if the dev was already registered. In this case, the netdev would be
freed in netdev_run_todo later.

So add one condition check now. Only when dev is not registered, then
free it directly.

The following is the part coredump when netdev_upper_dev_link failed
in register_vlan_dev. I removed the lines which are too long.

[  411.237457] ------------[ cut here ]------------
[  411.237458] kernel BUG at net/core/dev.c:7998!
[  411.237484] invalid opcode: 0000 [#1] SMP
[  411.237705]  [last unloaded: 8021q]
[  411.237718] CPU: 1 PID: 12845 Comm: vconfig Tainted: G            E   4.12.0-rc5+ #6
[  411.237737] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[  411.237764] task: ffff9cbeb6685580 task.stack: ffffa7d2807d8000
[  411.237782] RIP: 0010:free_netdev+0x116/0x120
[  411.237794] RSP: 0018:ffffa7d2807dbdb0 EFLAGS: 00010297
[  411.237808] RAX: 0000000000000002 RBX: ffff9cbeb6ba8fd8 RCX: 0000000000001878
[  411.237826] RDX: 0000000000000001 RSI: 0000000000000282 RDI: 0000000000000000
[  411.237844] RBP: ffffa7d2807dbdc8 R08: 0002986100029841 R09: 0002982100029801
[  411.237861] R10: 0004000100029980 R11: 0004000100029980 R12: ffff9cbeb6ba9000
[  411.238761] R13: ffff9cbeb6ba9060 R14: ffff9cbe60f1a000 R15: ffff9cbeb6ba9000
[  411.239518] FS:  00007fb690d81700(0000) GS:ffff9cbebb640000(0000) knlGS:0000000000000000
[  411.239949] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  411.240454] CR2: 00007f7115624000 CR3: 0000000077cdf000 CR4: 00000000003406e0
[  411.240936] Call Trace:
[  411.241462]  vlan_ioctl_handler+0x3f1/0x400 [8021q]
[  411.241910]  sock_ioctl+0x18b/0x2c0
[  411.242394]  do_vfs_ioctl+0xa1/0x5d0
[  411.242853]  ? sock_alloc_file+0xa6/0x130
[  411.243465]  SyS_ioctl+0x79/0x90
[  411.243900]  entry_SYSCALL_64_fastpath+0x1e/0xa9
[  411.244425] RIP: 0033:0x7fb69089a357
[  411.244863] RSP: 002b:00007ffcd04e0fc8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[  411.245445] RAX: ffffffffffffffda RBX: 00007ffcd04e2884 RCX: 00007fb69089a357
[  411.245903] RDX: 00007ffcd04e0fd0 RSI: 0000000000008983 RDI: 0000000000000003
[  411.246527] RBP: 00007ffcd04e0fd0 R08: 0000000000000000 R09: 1999999999999999
[  411.246976] R10: 000000000000053f R11: 0000000000000202 R12: 0000000000000004
[  411.247414] R13: 00007ffcd04e1128 R14: 00007ffcd04e2888 R15: 0000000000000001
[  411.249129] RIP: free_netdev+0x116/0x120 RSP: ffffa7d2807dbdb0

Signed-off-by: Gao Feng <gfree.wind@vip.163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-19 14:10:20 -04:00
Ivan Delalande
8917a777be tcp: md5: add TCP_MD5SIG_EXT socket option to set a key address prefix
Replace first padding in the tcp_md5sig structure with a new flag field
and address prefix length so it can be specified when configuring a new
key for TCP MD5 signature. The tcpm_flags field will only be used if the
socket option is TCP_MD5SIG_EXT to avoid breaking existing programs, and
tcpm_prefixlen only when the TCP_MD5SIG_FLAG_PREFIX flag is set.

Signed-off-by: Bob Gilligan <gilligan@arista.com>
Signed-off-by: Eric Mowat <mowat@arista.com>
Signed-off-by: Ivan Delalande <colona@arista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-19 13:51:34 -04:00
Ivan Delalande
6797318e62 tcp: md5: add an address prefix for key lookup
This allows the keys used for TCP MD5 signature to be used for whole
range of addresses, specified with a prefix length, instead of only one
address as it currently is.

Signed-off-by: Bob Gilligan <gilligan@arista.com>
Signed-off-by: Eric Mowat <mowat@arista.com>
Signed-off-by: Ivan Delalande <colona@arista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-19 13:50:55 -04:00
Pablo Neira Ayuso
04ba724b65 netfilter: nfnetlink: extended ACK reporting
Pass down struct netlink_ext_ack as parameter to all of our nfnetlink
subsystem callbacks, so we can work on follow up patches to provide
finer grain error reporting using the new infrastructure that
2d4bc93368 ("netlink: extended ACK reporting") provides.

No functional change, just pass down this new object to callbacks.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-06-19 19:38:24 +02:00
Florian Westphal
d8297d4f3e netfilter: nf_tables: reduce chain type table size
text  data  bss     dec    hex filename
old: 151590  2240 1152  154982  25d66 net/netfilter/nf_tables_api.o
new: 151666  2240  416  154322  25ad2 net/netfilter/nf_tables_api.o

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-06-19 19:20:59 +02:00
Florian Westphal
b7b5fda468 netfilter: conntrack: use NFPROTO_MAX to size array
We don't support anything larger than NFPROTO_MAX, so we can shrink this a bit:

     text data  dec  hex filename
old: 8259 1096 9355 248b net/netfilter/nf_conntrack_proto.o
new: 8259  624 8883 22b3 net/netfilter/nf_conntrack_proto.o

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-06-19 19:20:49 +02:00
Liping Zhang
d53e3fc390 netfilter: use nf_conntrack_helpers_register when possible
amanda_helper, nf_conntrack_helper_ras and nf_conntrack_helper_q931 are
all arrays, so we can use nf_conntrack_helpers_register to register
the ct helper, this will help us to eliminate some "goto errX"
statements.

Also introduce h323_helper_init/exit helper function to register the ct
helpers, this is prepared for the followup patch, which will add net
namespace support for ct helper.

Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-06-19 19:13:21 +02:00
Jike Song
2becbbc547 netfilter, kbuild: use canonical method to specify objs.
Should use ":=" instead of "+=".

Signed-off-by: Jike Song <jike.song@intel.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-06-19 19:09:20 +02:00
Gao Feng
e15b9c50c4 netfilter: ebt: Use new helper ebt_invalid_target to check target
Use the new helper function ebt_invalid_target instead of the old
macro INVALID_TARGET and other duplicated codes to enhance the readability.

Signed-off-by: Gao Feng <gfree.wind@vip.163.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-06-19 19:09:19 +02:00
Florian Westphal
7866cc57b5 netns: add and use net_ns_barrier
Quoting Joe Stringer:
  If a user loads nf_conntrack_ftp, sends FTP traffic through a network
  namespace, destroys that namespace then unloads the FTP helper module,
  then the kernel will crash.

Events that lead to the crash:
1. conntrack is created with ftp helper in netns x
2. This netns is destroyed
3. netns destruction is scheduled
4. netns destruction wq starts, removes netns from global list
5. ftp helper is unloaded, which resets all helpers of the conntracks
via for_each_net()

but because netns is already gone from list the for_each_net() loop
doesn't include it, therefore all of these conntracks are unaffected.

6. helper module unload finishes
7. netns wq invokes destructor for rmmod'ed helper

CC: "Eric W. Biederman" <ebiederm@xmission.com>
Reported-by: Joe Stringer <joe@ovn.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-06-19 19:09:19 +02:00
Florian Westphal
2c41f33c1b netfilter: move table iteration out of netns exit paths
We only need to iterate & remove in case of module removal;
for netns destruction all conntracks will be removed anyway.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-06-19 19:09:19 +02:00
Xin Long
202f59afd4 netfilter: ipt_CLUSTERIP: do not hold dev
It's a terrible thing to hold dev in iptables target. When the dev is
being removed, unregister_netdevice has to wait for the dev to become
free. dmesg will keep logging the err:

  kernel:unregister_netdevice: waiting for veth0_in to become free. \
  Usage count = 1

until iptables rules with this target are removed manually.

The worse thing is when deleting a netns, a virtual nic will be deleted
instead of reset to init_net in default_device_ops exit/exit_batch. As
it is earlier than to flush the iptables rules in iptable_filter_net_ops
exit, unregister_netdevice will block to wait for the nic to become free.

As unregister_netdevice is actually waiting for iptables rules flushing
while iptables rules have to be flushed after unregister_netdevice. This
'dead lock' will cause unregister_netdevice to block there forever. As
the netns is not available to operate at that moment, iptables rules can
not even be flushed manually either.

The reproducer can be:

  # ip netns add test
  # ip link add veth0_in type veth peer name veth0_out
  # ip link set veth0_in netns test
  # ip netns exec test ip link set lo up
  # ip netns exec test ip link set veth0_in up
  # ip netns exec test iptables -I INPUT -d 1.2.3.4 -i veth0_in -j \
    CLUSTERIP --new --clustermac 89:d4:47:eb:9a:fa --total-nodes 3 \
    --local-node 1 --hashmode sourceip-sourceport
  # ip netns del test

This issue can be triggered by all virtual nics with ipt_CLUSTERIP.

This patch is to fix it by not holding dev in ipt_CLUSTERIP, but saving
the dev->ifindex instead of the dev.

As Pablo Neira Ayuso's suggestion, it will refresh c->ifindex and dev's
mc by registering a netdevice notifier, just as what xt_TEE does. So it
removes the old codes updating dev's mc, and also no need to initialize
c->ifindex with dev->ifindex.

But as one config can be shared by more than one targets, and the netdev
notifier is per config, not per target. It couldn't get e->ip.iniface
in the notifier handler. So e->ip.iniface has to be saved into config.

Note that for backwards compatibility, this patch doesn't remove the
codes checking if the dev exists before creating a config.

v1->v2:
  - As Pablo Neira Ayuso's suggestion, register a netdevice notifier to
    manage c->ifindex and dev's mc.

Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-06-19 19:06:20 +02:00
David S. Miller
4b153ca989 Here's just the fix for that ancient bug:
* remove wext calling ndo_do_ioctl, since nobody needs
    that now and it makes the type change easier
  * use struct iwreq instead of struct ifreq almost everywhere
    in wireless extensions code
  * copy only struct iwreq from userspace in dev_ioctl for the
    wireless extensions, since it's smaller than struct ifreq
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEExu3sM/nZ1eRSfR9Ha3t4Rpy0AB0FAllDhUsACgkQa3t4Rpy0
 AB2ttw//SepU66meFuzYy+bFbR38Q2sguKADmSN9jjng3oPyKhHKfEJwRusZZ3zg
 eEIk/NNB2iPTMLSaa4kR1Wclcae0jq5KgO8HwJBLvS7peCXWKx03vnP4Dy7yJ/6U
 VvOk+3JoudnQFhdDnIg+RVsGbwLx0hlq2l727U1Sp6kFyChK2etLikPzVKkEgVnG
 2R/l1BhDqXdQ6Lh7nXWa6O9pwaqkpnPOuJvipJzmUQRB/4GBNjBxSK6J+ac98sm6
 +KCCONBvBBMBago0xySTVURzMTrhW2UH1cE6ITQYjlShB/zsyilYkECvFzOSAYZL
 u9ob1yCAmZwDqhtvEUSi7CEfLtcO43I0XDF4oL00xfmYD9alm9dJPAlvZ1ihsrw7
 ojBDjyykUstWRSeP8zETTdYDIMSPVsed1Y6NzQiy+el/6U3//+o2FcOShqUh89lx
 OIlQwX5i9LBRC/POQ6L8R4VPelNZ/czKMNlq1Z+ubNM9i3PT/8gGf6WapbMPpNUk
 AqAsB13tR17QmLjNpdVxHtoNvD9aceYaFkN+GXRNSb3pJNoJouedx6d5maFYJAju
 GRdZXBV14Z7bamKB3x9EAjpD3DHplJw4m8BvwnBr9zWkGyAvoNsHIC5h8ynzjWSp
 J7KpXPB9IKX6ne+1gCNrrPod2AmK4sWIaAT/SaWMCoHjV4m74k4=
 =O240
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-davem-2017-06-16' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
Here's just the fix for that ancient bug:
 * remove wext calling ndo_do_ioctl, since nobody needs
   that now and it makes the type change easier
 * use struct iwreq instead of struct ifreq almost everywhere
   in wireless extensions code
 * copy only struct iwreq from userspace in dev_ioctl for the
   wireless extensions, since it's smaller than struct ifreq
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-19 00:03:51 -04:00
Haishuang Yan
46f8cd9d2f ip6_tunnel: Correct tos value in collect_md mode
Same as ip_gre, geneve and vxlan, use key->tos as traffic class value.

CC: Peter Dawson <petedaws@gmail.com>
Fixes: 0e9a709560 ("ip6_tunnel, ip6_gre: fix setting of DSCP on
encapsulated packets”)
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Acked-by: Peter Dawson <peter.a.dawson@boeing.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-18 23:56:57 -04:00
Johan Hovold
20777bc57c NFC: fix broken device allocation
Commit 7eda8b8e96 ("NFC: Use IDR library to assing NFC devices IDs")
moved device-id allocation and struct-device initialisation from
nfc_allocate_device() to nfc_register_device().

This broke just about every nfc-device-registration error path, which
continue to call nfc_free_device() that tries to put the device
reference of the now uninitialised (but zeroed) struct device:

kobject: '(null)' (ce316420): is not initialized, yet kobject_put() is being called.

The late struct-device initialisation also meant that various work
queues whose names are derived from the nfc device name were also
misnamed:

  421 root         0 SW<  [(null)_nci_cmd_]
  422 root         0 SW<  [(null)_nci_rx_w]
  423 root         0 SW<  [(null)_nci_tx_w]

Move the id-allocation and struct-device initialisation back to
nfc_allocate_device() and fix up the single call site which did not use
nfc_free_device() in its error path.

Fixes: 7eda8b8e96 ("NFC: Use IDR library to assing NFC devices IDs")
Cc: stable <stable@vger.kernel.org>     # 3.8
Cc: Samuel Ortiz <sameo@linux.intel.com>
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2017-06-18 23:57:58 +02:00
Florian Fainelli
06d4d450db net: dsa: Fix legacy probing
After commit 6d3c8c0dd8 ("net: dsa: Remove master_netdev and
use dst->cpu_dp->netdev") and a29342e739 ("net: dsa: Associate
slave network device with CPU port") we would be seeing NULL pointer
dereferences when accessing dst->cpu_dp->netdev too early. In the legacy
code, we actually know early in advance the master network device, so
pass it down to the relevant functions.

Fixes: 6d3c8c0dd8 ("net: dsa: Remove master_netdev and use dst->cpu_dp->netdev")
Fixes: a29342e739 ("net: dsa: Associate slave network device with CPU port")
Reported-by: Jason Cobham <jcobham@questertangent.com>
Tested-by: Jason Cobham <jcobham@questertangent.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-17 22:59:45 -04:00
Dave Watson
d807ec656f tls: update Kconfig
Missing crypto deps for some platforms.
Default to n for new module.

config: m68k-amcore_defconfig (attached as .config)
compiler: m68k-linux-gcc (GCC) 4.9.0

make.cross ARCH=m68k
All errors (new ones prefixed by >>):

   net/built-in.o: In function `tls_set_sw_offload':
>> (.text+0x732f8): undefined reference to `crypto_alloc_aead'
   net/built-in.o: In function `tls_set_sw_offload':
>> (.text+0x7333c): undefined reference to `crypto_aead_setkey'
   net/built-in.o: In function `tls_set_sw_offload':
>> (.text+0x73354): undefined reference to `crypto_aead_setauthsize'

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-17 22:56:46 -04:00
Wei Wang
a4c2fd7f78 net: remove DST_NOCACHE flag
DST_NOCACHE flag check has been removed from dst_release() and
dst_hold_safe() in a previous patch because all the dst are now ref
counted properly and can be released based on refcnt only.
Looking at the rest of the DST_NOCACHE use, all of them can now be
removed or replaced with other checks.
So this patch gets rid of all the DST_NOCACHE usage and remove this flag
completely.

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-17 22:54:01 -04:00
Wei Wang
b2a9c0ed75 net: remove DST_NOGC flag
Now that all the components have been changed to release dst based on
refcnt only and not depend on dst gc anymore, we can remove the
temporary flag DST_NOGC.

Note that we also need to remove the DST_NOCACHE check in dst_release()
and dst_hold_safe() because now all the dst are released based on refcnt
and behaves as DST_NOCACHE.

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-17 22:54:01 -04:00
Wei Wang
5b7c9a8ff8 net: remove dst gc related code
This patch removes all dst gc related code and all the dst free
functions

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-17 22:54:01 -04:00
Wei Wang
560fd93bca decnet: take dst->__refcnt when struct dn_route is created
struct dn_route is inserted into dn_rt_hash_table but no dst->__refcnt
is taken.
This patch makes sure the dn_rt_hash_table's reference to the dst is ref
counted.

As the dst is always ref counted properly, we can safely mark
DST_NOGC flag so dst_release() will release dst based on refcnt only.
And dst gc is no longer needed and all dst_free() or its related
function calls should be replaced with dst_release() or
dst_release_immediate(). And dst_dev_put() is called when removing dst
from the hash table to release the reference on dst->dev before we lose
pointer to it.

Also, correct the logic in dn_dst_check_expire() and dn_dst_gc() to
check dst->__refcnt to be > 1 to indicate it is referenced by other
users.

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-17 22:54:01 -04:00
Wei Wang
52df157f17 xfrm: take refcnt of dst when creating struct xfrm_dst bundle
During the creation of xfrm_dst bundle, always take ref count when
allocating the dst. This way, xfrm_bundle_create() will form a linked
list of dst with dst->child pointing to a ref counted dst child. And
the returned dst pointer is also ref counted. This makes the link from
the flow cache to this dst now ref counted properly.
As the dst is always ref counted properly, we can safely mark
DST_NOGC flag so dst_release() will release dst based on refcnt only.
And dst gc is no longer needed and all dst_free() and its related
function calls should be replaced with dst_release() or
dst_release_immediate().

The special handling logic for dst->child in dst_destroy() can be
replaced with a simple dst_release_immediate() call on the child to
release the whole list linked by dst->child pointer.
Previously used DST_NOHASH flag is not needed anymore as well. The
reason that DST_NOHASH is used in the existing code is mainly to prevent
the dst inserted in the fib tree to be wrongly destroyed during the
deletion of the xfrm_dst bundle. So in the existing code, DST_NOHASH
flag is marked in all the dst children except the one which is in the
fib tree.
However, with this patch series to remove dst gc logic and release dst
only based on ref count, it is safe to release all the children from a
xfrm_dst bundle as long as the dst children are all ref counted
properly which is already the case in the existing code.
So, this patch removes the use of DST_NOHASH flag.

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-17 22:54:00 -04:00
Wei Wang
db916649b5 ipv6: get rid of icmp6 dst garbage collector
icmp6 dst route is currently ref counted during creation and will be
freed by user during its call of dst_release(). So no need of a garbage
collector for it.
Remove all icmp6 dst garbage collector related code.

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-17 22:54:00 -04:00
Wei Wang
587fea7411 ipv6: mark DST_NOGC and remove the operation of dst_free()
With the previous preparation patches, we are ready to get rid of the
dst gc operation in ipv6 code and release dst based on refcnt only.
So this patch adds DST_NOGC flag for all IPv6 dst and remove the calls
to dst_free() and its related functions.
At this point, all dst created in ipv6 code do not use the dst gc
anymore and will be destroyed at the point when refcnt drops to 0.

Also, as icmp6 dst route is refcounted during creation and will be freed
by user during its call of dst_release(), there is no need to add this
dst to the icmp6 gc list as well.
Instead, we need to add it into uncached list so that when a
NETDEV_DOWN/NETDEV_UNREGISRER event comes, we can properly go through
these icmp6 dst as well and release the net device properly.

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-17 22:54:00 -04:00
Wei Wang
ad65a2f056 ipv6: call dst_hold_safe() properly
Similar as ipv4, ipv6 path also needs to call dst_hold_safe() when
necessary to avoid double free issue on the dst.

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-17 22:54:00 -04:00
Wei Wang
9514528d92 ipv6: call dst_dev_put() properly
As the intend of this patch series is to completely remove dst gc,
we need to call dst_dev_put() to release the reference to dst->dev
when removing routes from fib because we won't keep the gc list anymore
and will lose the dst pointer right after removing the routes.
Without the gc list, there is no way to find all the dst's that have
dst->dev pointing to the going-down dev.
Hence, we are doing dst_dev_put() immediately before we lose the last
reference of the dst from the routing code. The next dst_check() will
trigger a route re-lookup to find another route (if there is any).

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-17 22:54:00 -04:00
Wei Wang
1cfb71eeb1 ipv6: take dst->__refcnt for insertion into fib6 tree
In IPv6 routing code, struct rt6_info is created for each static route
and RTF_CACHE route and inserted into fib6 tree. In both cases, dst
ref count is not taken.
As explained in the previous patch, this leads to the need of the dst
garbage collector.

This patch holds ref count of dst before inserting the route into fib6
tree and properly releases the dst when deleting it from the fib6 tree
as a preparation in order to fully get rid of dst gc later.

Also, correct fib6_age() logic to check dst->__refcnt to be 1 to indicate
no user is referencing the dst.

And remove dst_hold() in vrf_rt6_create() as ip6_dst_alloc() already puts
dst->__refcnt to 1.

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-17 22:54:00 -04:00
Wei Wang
b838d5e1c5 ipv4: mark DST_NOGC and remove the operation of dst_free()
With the previous preparation patches, we are ready to get rid of the
dst gc operation in ipv4 code and release dst based on refcnt only.
So this patch adds DST_NOGC flag for all IPv4 dst and remove the calls
to dst_free().
At this point, all dst created in ipv4 code do not use the dst gc
anymore and will be destroyed at the point when refcnt drops to 0.

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-17 22:54:00 -04:00
Wei Wang
9df16efadd ipv4: call dst_hold_safe() properly
This patch checks all the calls to
dst_hold()/skb_dst_force()/dst_clone()/dst_use() to see if
dst_hold_safe() is needed to avoid double free issue if dst
gc is removed and dst_release() directly destroys dst when
dst->__refcnt drops to 0.

In tx path, TCP hold sk->sk_rx_dst ref count and also hold sock_lock().
UDP and other similar protocols always hold refcount for
skb->_skb_refdst. So both paths seem to be safe.

In rx path, as it is lockless and skb_dst_set_noref() is likely to be
used, dst_hold_safe() should always be used when trying to hold dst.

In the routing code, if dst is held during an rcu protected session, it
is necessary to call dst_hold_safe() as the current dst might be in its
rcu grace period.

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-17 22:54:00 -04:00
Wei Wang
95c47f9cf5 ipv4: call dst_dev_put() properly
As the intend of this patch series is to completely remove dst gc,
we need to call dst_dev_put() to release the reference to dst->dev
when removing routes from fib because we won't keep the gc list anymore
and will lose the dst pointer right after removing the routes.
Without the gc list, there is no way to find all the dst's that have
dst->dev pointing to the going-down dev.
Hence, we are doing dst_dev_put() immediately before we lose the last
reference of the dst from the routing code. The next dst_check() will
trigger a route re-lookup to find another route (if there is any).

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-17 22:53:59 -04:00
Wei Wang
0830106c53 ipv4: take dst->__refcnt when caching dst in fib
In IPv4 routing code, fib_nh and fib_nh_exception can hold pointers
to struct rtable but they never increment dst->__refcnt.
This leads to the need of the dst garbage collector because when user
is done with this dst and calls dst_release(), it can only decrement
dst->__refcnt and can not free the dst even it sees dst->__refcnt
drops from 1 to 0 (unless DST_NOCACHE flag is set) because the routing
code might still hold reference to it.
And when the routing code tries to delete a route, it has to put the
dst to the gc_list if dst->__refcnt is not yet 0 and have a gc thread
running periodically to check on dst->__refcnt and finally to free dst
when refcnt becomes 0.

This patch increments dst->__refcnt when
fib_nh/fib_nh_exception holds reference to this dst and properly release
the dst when fib_nh/fib_nh_exception has been updated with a new dst.

This patch is a preparation in order to fully get rid of dst gc later.

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-17 22:53:59 -04:00
Wei Wang
4a6ce2b6f2 net: introduce a new function dst_dev_put()
This function should be called when removing routes from fib tree after
the dst gc is no longer in use.
We first mark DST_OBSOLETE_DEAD on this dst to make sure next
dst_ops->check() fails and returns NULL.
Secondly, as we no longer keep the gc_list, we need to properly
release dst->dev right at the moment when the dst is removed from
the fib/fib6 tree.
It does the following:
1. change dst->input and output pointers to dst_discard/dst_dscard_out to
   discard all packets
2. replace dst->dev with loopback interface

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-17 22:53:59 -04:00
Wei Wang
5f56f409b5 net: introduce DST_NOGC in dst_release() to destroy dst based on refcnt
The current mechanism of freeing dst is a bit complicated. dst has its
ref count and when user grabs the reference to the dst, the ref count is
properly taken in most cases except in IPv4/IPv6/decnet/xfrm routing
code due to some historic reasons.

If the reference to dst is always taken properly, we should be able to
simplify the logic in dst_release() to destroy dst when dst->__refcnt
drops from 1 to 0. And this should be the only condition to determine
if we can call dst_destroy().
And as dst is always ref counted, there is no need for a dst garbage
list to hold the dst entries that already get removed by the routing
code but are still held by other users. And the task to periodically
check the list to free dst if ref count become 0 is also not needed
anymore.

This patch introduces a temporary flag DST_NOGC(no garbage collector).
If it is set in the dst, dst_release() will call dst_destroy() when
dst->__refcnt drops to 0. dst_hold_safe() will also check for this flag
and do atomic_inc_not_zero() similar as DST_NOCACHE to avoid double free
issue.
This temporary flag is mainly used so that we can make the transition
component by component without breaking other parts.
This flag will be removed after all components are properly transitioned.

This patch also introduces a new function dst_release_immediate() which
destroys dst without waiting on the rcu when refcnt drops to 0. It will
be used in later patches.

Follow-up patches will correct all the places to properly take ref count
on dst and mark DST_NOGC. dst_release() or dst_release_immediate() will
be used to release the dst instead of dst_free() and its related
functions.
And final clean-up patch will remove the DST_NOGC flag.

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-17 22:53:59 -04:00
Wei Wang
1dbe32525e net: use loopback dev when generating blackhole route
Existing ipv4/6_blackhole_route() code generates a blackhole route
with dst->dev pointing to the passed in dst->dev.
It is not necessary to hold reference to the passed in dst->dev
because the packets going through this route are dropped anyway.
A loopback interface is good enough so that we don't need to worry about
releasing this dst->dev when this dev is going down.

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-17 22:53:59 -04:00
Wei Wang
d24406c85d udp: call dst_hold_safe() in udp_sk_rx_set_dst()
In udp_v4/6_early_demux() code, we try to hold dst->__refcnt for
dst with DST_NOCACHE flag. This is because later in udp_sk_rx_dst_set()
function, we will try to cache this dst in sk for connected case.
However, a better way to achieve this is to not try to hold dst in
early_demux(), but in udp_sk_rx_dst_set(), call dst_hold_safe(). This
approach is also more consistant with how tcp is handling it. And it
will make later changes simpler.

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-17 22:53:59 -04:00
Wei Wang
1758fd4688 ipv6: remove unnecessary dst_hold() in ip6_fragment()
In ipv6 tx path, rcu_read_lock() is taken so that dst won't get freed
during the execution of ip6_fragment(). Hence, no need to hold dst in
it.

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-17 22:53:59 -04:00
Vivien Didelot
a1a6b7ea7f net: dsa: add cross-chip multicast support
Similarly to how cross-chip VLAN works, define a bitmap of multicast
group members for a switch, now including its DSA ports, so that
multicast traffic can be sent to all switches of the fabric.

A switch may drop the frames if no user port is a member.

This brings support for multicast in a multi-chip environment.
As of now, all switches of the fabric must support the multicast
operations in order to program a single fabric port.

Reported-by: Jason Cobham <jcobham@questertangent.com>
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Tested-by: Jason Cobham <jcobham@questertangent.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-16 15:21:14 -04:00
Wei Wang
247488c0a4 decnet: always not take dst->__refcnt when inserting dst into hash table
In the existing dn_route.c code, dn_route_output_slow() takes
dst->__refcnt before calling dn_insert_route() while dn_route_input_slow()
does not take dst->__refcnt before calling dn_insert_route().
This makes the whole routing code very buggy.
In dn_dst_check_expire(), dnrt_free() is called when rt expires. This
makes the routes inserted by dn_route_output_slow() not able to be
freed as the refcnt is not released.
In dn_dst_gc(), dnrt_drop() is called to release rt which could
potentially cause the dst->__refcnt to be dropped to -1.
In dn_run_flush(), dst_free() is called to release all the dst. Again,
it makes the dst inserted by dn_route_output_slow() not able to be
released and also, it does not wait on the rcu and could potentially
cause crash in the path where other users still refer to this dst.

This patch makes sure both input and output path do not take
dst->__refcnt before calling dn_insert_route() and also makes sure
dnrt_free()/dst_free() is called when removing dst from the hash table.
The only difference between those 2 calls is that dnrt_free() waits on
the rcu while dst_free() does not.

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-16 15:00:00 -04:00
Wei Wang
76371d2e3a decnet: always not take dst->__refcnt when inserting dst into hash table
In the existing dn_route.c code, dn_route_output_slow() takes
dst->__refcnt before calling dn_insert_route() while dn_route_input_slow()
does not take dst->__refcnt before calling dn_insert_route().
This makes the whole routing code very buggy.
In dn_dst_check_expire(), dnrt_free() is called when rt expires. This
makes the routes inserted by dn_route_output_slow() not able to be
freed as the refcnt is not released.
In dn_dst_gc(), dnrt_drop() is called to release rt which could
potentially cause the dst->__refcnt to be dropped to -1.
In dn_run_flush(), dst_free() is called to release all the dst. Again,
it makes the dst inserted by dn_route_output_slow() not able to be
released and also, it does not wait on the rcu and could potentially
cause crash in the path where other users still refer to this dst.

This patch makes sure both input and output path do not take
dst->__refcnt before calling dn_insert_route() and also makes sure
dnrt_free()/dst_free() is called when removing dst from the hash table.
The only difference between those 2 calls is that dnrt_free() waits on
the rcu while dst_free() does not.

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-16 14:59:36 -04:00
Sowmini Varadhan
10beea7d74 rds: tcp: Set linger when rejecting an incoming conn in rds_tcp_accept_one
Each time we get an incoming SYN to the RDS_TCP_PORT, the TCP
layer accepts the connection and then the rds_tcp_accept_one()
callback is invoked to process the incoming connection.

rds_tcp_accept_one() may reject the incoming syn for a number of
reasons, e.g., commit 1a0e100fb2 ("RDS: TCP: Force every connection
to be initiated by numerically smaller IP address"), or because
we are getting spammed by a malicious node that is triggering
a flood of connection attempts to RDS_TCP_PORT. If the incoming
syn is rejected, no data would have been sent on the TCP socket,
and we do not need to be in TIME_WAIT state, so we set linger on
the TCP socket before closing, thereby closing the socket efficiently
with a RST.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Tested-by: Imanti Mendez <imanti.mendez@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-16 12:45:15 -04:00
Sowmini Varadhan
00354de577 rds: tcp: various endian-ness fixes
Found when testing between sparc and x86 machines on different
subnets, so the address comparison patterns hit the corner cases and
brought out some bugs fixed by this patch.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Tested-by: Imanti Mendez <imanti.mendez@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-16 12:45:15 -04:00
Sowmini Varadhan
41500c3e2a rds: tcp: remove cp_outgoing
After commit 1a0e100fb2 ("RDS: TCP: Force every connection to be
initiated by numerically smaller IP address") we no longer need
the logic associated with cp_outgoing, so clean up usage of this
field.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Tested-by: Imanti Mendez <imanti.mendez@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-16 12:45:14 -04:00
Haishuang Yan
f1925ca50d ip6_tunnel: fix potential issue in __ip6_tnl_rcv
When __ip6_tnl_rcv fails, the tun_dst won't be freed, so call
dst_release to free it in error code path.

Fixes: 8d79266bc4 ("ip6_tunnel: add collect_md mode to IPv6 tunnels")
CC: Alexei Starovoitov <ast@fb.com>
Tested-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-16 12:01:29 -04:00
Haishuang Yan
469f87e158 ip_tunnel: fix potential issue in ip_tunnel_rcv
When ip_tunnel_rcv fails, the tun_dst won't be freed, so call
dst_release to free it in error code path.

Fixes: 2e15ea390e ("ip_gre: Add support to collect tunnel metadata.")
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Tested-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-16 12:01:29 -04:00
Martin KaFai Lau
58038695e6 net: Add IFLA_XDP_PROG_ID
Expose prog_id through IFLA_XDP_PROG_ID.  This patch
makes modification to generic_xdp.  The later patches will
modify other xdp-supported drivers.

prog_id is added to struct net_dev_xdp.

iproute2 patch will be followed. Here is how the 'ip link'
will look like:
> ip link show eth0
3: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp(prog_id:1) qdisc fq_codel state UP mode DEFAULT group default qlen 1000

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-16 11:58:36 -04:00
Johannes Berg
634fef6107 networking: add and use skb_put_u8()
Joe and Bjørn suggested that it'd be nicer to not have the
cast in the fairly common case of doing
	*(u8 *)skb_put(skb, 1) = c;

Add skb_put_u8() for this case, and use it across the code,
using the following spatch:

    @@
    expression SKB, C, S;
    typedef u8;
    identifier fn = {skb_put};
    fresh identifier fn2 = fn ## "_u8";
    @@
    - *(u8 *)fn(SKB, S) = C;
    + fn2(SKB, C);

Note that due to the "S", the spatch isn't perfect, it should
have checked that S is 1, but there's also places that use a
sizeof expression like sizeof(var) or sizeof(u8) etc. Turns
out that nobody ever did something like
	*(u8 *)skb_put(skb, 2) = c;

which would be wrong anyway since the second byte wouldn't be
initialized.

Suggested-by: Joe Perches <joe@perches.com>
Suggested-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-16 11:48:40 -04:00
Johannes Berg
d58ff35122 networking: make skb_push & __skb_push return void pointers
It seems like a historic accident that these return unsigned char *,
and in many places that means casts are required, more often than not.

Make these functions return void * and remove all the casts across
the tree, adding a (u8 *) cast only where the unsigned char pointer
was used directly, all done with the following spatch:

    @@
    expression SKB, LEN;
    typedef u8;
    identifier fn = { skb_push, __skb_push, skb_push_rcsum };
    @@
    - *(fn(SKB, LEN))
    + *(u8 *)fn(SKB, LEN)

    @@
    expression E, SKB, LEN;
    identifier fn = { skb_push, __skb_push, skb_push_rcsum };
    type T;
    @@
    - E = ((T *)(fn(SKB, LEN)))
    + E = fn(SKB, LEN)

    @@
    expression SKB, LEN;
    identifier fn = { skb_push, __skb_push, skb_push_rcsum };
    @@
    - fn(SKB, LEN)[0]
    + *(u8 *)fn(SKB, LEN)

Note that the last part there converts from push(...)[0] to the
more idiomatic *(u8 *)push(...).

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-16 11:48:40 -04:00
Johannes Berg
af72868b90 networking: make skb_pull & friends return void pointers
It seems like a historic accident that these return unsigned char *,
and in many places that means casts are required, more often than not.

Make these functions return void * and remove all the casts across
the tree, adding a (u8 *) cast only where the unsigned char pointer
was used directly, all done with the following spatch:

    @@
    expression SKB, LEN;
    typedef u8;
    identifier fn = {
            skb_pull,
            __skb_pull,
            skb_pull_inline,
            __pskb_pull_tail,
            __pskb_pull,
            pskb_pull
    };
    @@
    - *(fn(SKB, LEN))
    + *(u8 *)fn(SKB, LEN)

    @@
    expression E, SKB, LEN;
    identifier fn = {
            skb_pull,
            __skb_pull,
            skb_pull_inline,
            __pskb_pull_tail,
            __pskb_pull,
            pskb_pull
    };
    type T;
    @@
    - E = ((T *)(fn(SKB, LEN)))
    + E = fn(SKB, LEN)

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-16 11:48:39 -04:00
Johannes Berg
4df864c1d9 networking: make skb_put & friends return void pointers
It seems like a historic accident that these return unsigned char *,
and in many places that means casts are required, more often than not.

Make these functions (skb_put, __skb_put and pskb_put) return void *
and remove all the casts across the tree, adding a (u8 *) cast only
where the unsigned char pointer was used directly, all done with the
following spatch:

    @@
    expression SKB, LEN;
    typedef u8;
    identifier fn = { skb_put, __skb_put };
    @@
    - *(fn(SKB, LEN))
    + *(u8 *)fn(SKB, LEN)

    @@
    expression E, SKB, LEN;
    identifier fn = { skb_put, __skb_put };
    type T;
    @@
    - E = ((T *)(fn(SKB, LEN)))
    + E = fn(SKB, LEN)

which actually doesn't cover pskb_put since there are only three
users overall.

A handful of stragglers were converted manually, notably a macro in
drivers/isdn/i4l/isdn_bsdcomp.c and, oddly enough, one of the many
instances in net/bluetooth/hci_sock.c. In the former file, I also
had to fix one whitespace problem spatch introduced.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-16 11:48:39 -04:00
Johannes Berg
59ae1d127a networking: introduce and use skb_put_data()
A common pattern with skb_put() is to just want to memcpy()
some data into the new space, introduce skb_put_data() for
this.

An spatch similar to the one for skb_put_zero() converts many
of the places using it:

    @@
    identifier p, p2;
    expression len, skb, data;
    type t, t2;
    @@
    (
    -p = skb_put(skb, len);
    +p = skb_put_data(skb, data, len);
    |
    -p = (t)skb_put(skb, len);
    +p = skb_put_data(skb, data, len);
    )
    (
    p2 = (t2)p;
    -memcpy(p2, data, len);
    |
    -memcpy(p, data, len);
    )

    @@
    type t, t2;
    identifier p, p2;
    expression skb, data;
    @@
    t *p;
    ...
    (
    -p = skb_put(skb, sizeof(t));
    +p = skb_put_data(skb, data, sizeof(t));
    |
    -p = (t *)skb_put(skb, sizeof(t));
    +p = skb_put_data(skb, data, sizeof(t));
    )
    (
    p2 = (t2)p;
    -memcpy(p2, data, sizeof(*p));
    |
    -memcpy(p, data, sizeof(*p));
    )

    @@
    expression skb, len, data;
    @@
    -memcpy(skb_put(skb, len), data, len);
    +skb_put_data(skb, data, len);

(again, manually post-processed to retain some comments)

Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-16 11:48:37 -04:00
Johannes Berg
b080db5853 networking: convert many more places to skb_put_zero()
There were many places that my previous spatch didn't find,
as pointed out by yuan linyu in various patches.

The following spatch found many more and also removes the
now unnecessary casts:

    @@
    identifier p, p2;
    expression len;
    expression skb;
    type t, t2;
    @@
    (
    -p = skb_put(skb, len);
    +p = skb_put_zero(skb, len);
    |
    -p = (t)skb_put(skb, len);
    +p = skb_put_zero(skb, len);
    )
    ... when != p
    (
    p2 = (t2)p;
    -memset(p2, 0, len);
    |
    -memset(p, 0, len);
    )

    @@
    type t, t2;
    identifier p, p2;
    expression skb;
    @@
    t *p;
    ...
    (
    -p = skb_put(skb, sizeof(t));
    +p = skb_put_zero(skb, sizeof(t));
    |
    -p = (t *)skb_put(skb, sizeof(t));
    +p = skb_put_zero(skb, sizeof(t));
    )
    ... when != p
    (
    p2 = (t2)p;
    -memset(p2, 0, sizeof(*p));
    |
    -memset(p, 0, sizeof(*p));
    )

    @@
    expression skb, len;
    @@
    -memset(skb_put(skb, len), 0, len);
    +skb_put_zero(skb, len);

Apply it to the tree (with one manual fixup to keep the
comment in vxlan.c, which spatch removed.)

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-16 11:48:35 -04:00
David S. Miller
54144b4825 tls: Depend upon INET not plain NET.
We refer to TCP et al. symbols so have to use INET as
the dependency.

   ERROR: "tcp_prot" [net/tls/tls.ko] undefined!
>> ERROR: "tcp_rate_check_app_limited" [net/tls/tls.ko] undefined!
   ERROR: "tcp_register_ulp" [net/tls/tls.ko] undefined!
   ERROR: "tcp_unregister_ulp" [net/tls/tls.ko] undefined!
   ERROR: "do_tcp_sendpages" [net/tls/tls.ko] undefined!

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-16 11:28:49 -04:00
Vivien Didelot
e4b7778769 net: dsa: assign default CPU port to all ports
The current code only assigns the default cpu_dp to all user ports of
the switch to which the CPU port belongs. The user ports of the other
switches of the fabric thus don't have a default CPU port.

This patch fixes this by assigning the cpu_dp of all user ports of all
switches of the fabric when the tree is fully parsed.

Fixes: a29342e739 ("net: dsa: Associate slave network device with CPU port")
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15 17:23:35 -04:00
Xin Long
988c732211 sctp: return next obj by passing pos + 1 into sctp_transport_get_idx
In sctp_for_each_transport, pos is used to save how many objs it has
dumped. Now it gets the last obj by sctp_transport_get_idx, then gets
the next obj by sctp_transport_get_next.

The issue is that in the meanwhile if some objs in transport hashtable
are removed and the objs nums are less than pos, sctp_transport_get_idx
would return NULL and hti.walker.tbl is NULL as well. At this moment
it should stop hti, instead of continue getting the next obj. Or it
would cause a NULL pointer dereference in sctp_transport_get_next.

This patch is to pass pos + 1 into sctp_transport_get_idx to get the
next obj directly, even if pos > objs nums, it would return NULL and
stop hti.

Fixes: 626d16f50f ("sctp: export some apis or variables for sctp_diag and reuse some for proc")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15 14:40:30 -04:00
David Howells
5f2f97656a rxrpc: Fix several cases where a padded len isn't checked in ticket decode
This fixes CVE-2017-7482.

When a kerberos 5 ticket is being decoded so that it can be loaded into an
rxrpc-type key, there are several places in which the length of a
variable-length field is checked to make sure that it's not going to
overrun the available data - but the data is padded to the nearest
four-byte boundary and the code doesn't check for this extra.  This could
lead to the size-remaining variable wrapping and the data pointer going
over the end of the buffer.

Fix this by making the various variable-length data checks use the padded
length.

Reported-by: 石磊 <shilei-c@360.cn>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.c.dionne@auristor.com>
Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15 14:23:44 -04:00
Jiri Benc
86087e170c net: sched: act_tunnel_key: make UDP checksum configurable
Allow requesting of zero UDP checksum for encapsulated packets. The name and
meaning of the attribute is "NO_CSUM" in order to have the same meaning of
the attribute missing and being 0.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15 14:21:03 -04:00
Jiri Benc
63fe4c39d2 net: sched: act_tunnel_key: request UDP checksum by default
There's currently no way to request (outer) UDP checksum with
act_tunnel_key. This is problem especially for IPv6. Right now, tunnel_key
action with IPv6 does not work without going through hassles: both sides
have to have udp6zerocsumrx configured on the tunnel interface. This is
obviously not a good solution universally.

It makes more sense to compute the UDP checksum by default even for IPv4.
Just set the default to request the checksum when using act_tunnel_key.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15 14:21:03 -04:00
Dave Watson
3c4d755915 tls: kernel TLS support
Software implementation of transport layer security, implemented using ULP
infrastructure.  tcp proto_ops are replaced with tls equivalents of sendmsg and
sendpage.

Only symmetric crypto is done in the kernel, keys are passed by setsockopt
after the handshake is complete.  All control messages are supported via CMSG
data - the actual symmetric encryption is the same, just the message type needs
to be passed separately.

For user API, please see Documentation patch.

Pieces that can be shared between hw and sw implementation
are in tls_main.c

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15 12:12:40 -04:00
Dave Watson
e3b5616a34 tcp: export do_tcp_sendpages and tcp_rate_check_app_limited functions
Export do_tcp_sendpages and tcp_rate_check_app_limited, since tls will need to
sendpages while the socket is already locked.

tcp_sendpage is exported, but requires the socket lock to not be held already.

Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15 12:12:40 -04:00
Dave Watson
734942cc4e tcp: ULP infrastructure
Add the infrustructure for attaching Upper Layer Protocols (ULPs) over TCP
sockets. Based on a similar infrastructure in tcp_cong.  The idea is that any
ULP can add its own logic by changing the TCP proto_ops structure to its own
methods.

Example usage:

setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls"));

modules will call:
tcp_register_ulp(&tcp_tls_ulp_ops);

to register/unregister their ulp, with an init function and name.

A list of registered ulps will be returned by tcp_get_available_ulp, which is
hooked up to /proc.  Example:

$ cat /proc/sys/net/ipv4/tcp_available_ulp
tls

There is currently no functionality to remove or chain ULPs, but
it should be possible to add these in the future if needed.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15 12:12:40 -04:00
David S. Miller
0ddead90b2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The conflicts were two cases of overlapping changes in
batman-adv and the qed driver.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15 11:59:32 -04:00
Xin Long
f8a894b218 ipv6: fix calling in6_ifa_hold incorrectly for dad work
Now when starting the dad work in addrconf_mod_dad_work, if the dad work
is idle and queued, it needs to hold ifa.

The problem is there's one gap in [1], during which if the pending dad work
is removed elsewhere. It will miss to hold ifa, but the dad word is still
idea and queue.

        if (!delayed_work_pending(&ifp->dad_work))
                in6_ifa_hold(ifp);
                    <--------------[1]
        mod_delayed_work(addrconf_wq, &ifp->dad_work, delay);

An use-after-free issue can be caused by this.

Chen Wei found this issue when WARN_ON(!hlist_unhashed(&ifp->addr_lst)) in
net6_ifa_finish_destroy was hit because of it.

As Hannes' suggestion, this patch is to fix it by holding ifa first in
addrconf_mod_dad_work, then calling mod_delayed_work and putting ifa if
the dad_work is already in queue.

Note that this patch did not choose to fix it with:

  if (!mod_delayed_work(delay))
          in6_ifa_hold(ifp);

As with it, when delay == 0, dad_work would be scheduled immediately, all
addrconf_mod_dad_work(0) callings had to be moved under ifp->lock.

Reported-by: Wei Chen <weichen@redhat.com>
Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15 11:26:16 -04:00
David Howells
f7aec129a3 rxrpc: Cache the congestion window setting
Cache the congestion window setting that was determined during a call's
transmission phase when it finishes so that it can be used by the next call
to the same peer, thereby shortcutting the slow-start algorithm.

The value is stored in the rxrpc_peer struct and is accessed without
locking.  Each call takes the value that happens to be there when it starts
and just overwrites the value when it finishes.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-14 15:42:45 -04:00
Jesper Dangaard Brouer
849a44de91 net: don't global ICMP rate limit packets originating from loopback
Florian Weimer seems to have a glibc test-case which requires that
loopback interfaces does not get ICMP ratelimited.  This was broken by
commit c0303efeab ("net: reduce cycles spend on ICMP replies that
gets rate limited").

An ICMP response will usually be routed back-out the same incoming
interface.  Thus, take advantage of this and skip global ICMP
ratelimit when the incoming device is loopback.  In the unlikely event
that the outgoing it not loopback, due to strange routing policy
rules, ICMP rate limiting still works via peer ratelimiting via
icmpv4_xrlim_allow().  Thus, we should still comply with RFC1812
(section 4.3.2.8 "Rate Limiting").

This seems to fix the reproducer given by Florian.  While still
avoiding to perform expensive and unneeded outgoing route lookup for
rate limited packets (in the non-loopback case).

Fixes: c0303efeab ("net: reduce cycles spend on ICMP replies that gets rate limited")
Reported-by: Florian Weimer <fweimer@redhat.com>
Reported-by: "H.J. Lu" <hjl.tools@gmail.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-14 15:33:58 -04:00
Dan Carpenter
c4f65b09b4 net/act_pedit: fix an error code
I'm reviewing static checker warnings where we do ERR_PTR(0), which is
the same as NULL.  I'm pretty sure we intended to return ERR_PTR(-EINVAL)
here.  Sometimes these bugs lead to a NULL dereference but I don't
immediately see that problem here.

Fixes: 71d0ed7079 ("net/act_pedit: Support using offset relative to the conventional network headers")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Amir Vadai <amir@vadai.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-14 15:24:18 -04:00
Paolo Abeni
7608894e43 net: use skb_unref() in napi_consume_skb()
The commit 83ada39bb79d ("net: factor out a helper to decrement the
skb refcount") provided and used a helper for decrementing skb usage,
but I missed at least a spot for it.

This change remove some more duplicated code reusing skb_unref() in
napi_consume_skb(), too. The helper uses an additional, unneeded
unlikely(!skb) test - napi_consume_skb() already check it a few lines
above - but the compiler is smart enough to optimize the duplicated
test out.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-14 15:23:51 -04:00
David S. Miller
4cbf87c789 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2017-06-14

Here's another batch of Bluetooth patches for the 4.13 kernel:

 - Fix for Broadcom controllers not supporting Event Mask Page 2
 - New QCA ROME USB ID for btusb
 - Fix for Security Manager Protocol to use constant-time memcmp
 - Improved support for TI WiLink chips

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-14 15:22:17 -04:00
Yonghong Song
31fd85816d bpf: permits narrower load from bpf program context fields
Currently, verifier will reject a program if it contains an
narrower load from the bpf context structure. For example,
        __u8 h = __sk_buff->hash, or
        __u16 p = __sk_buff->protocol
        __u32 sample_period = bpf_perf_event_data->sample_period
which are narrower loads of 4-byte or 8-byte field.

This patch solves the issue by:
  . Introduce a new parameter ctx_field_size to carry the
    field size of narrower load from prog type
    specific *__is_valid_access validator back to verifier.
  . The non-zero ctx_field_size for a memory access indicates
    (1). underlying prog type specific convert_ctx_accesses
         supporting non-whole-field access
    (2). the current insn is a narrower or whole field access.
  . In verifier, for such loads where load memory size is
    less than ctx_field_size, verifier transforms it
    to a full field load followed by proper masking.
  . Currently, __sk_buff and bpf_perf_event_data->sample_period
    are supporting narrowing loads.
  . Narrower stores are still not allowed as typical ctx stores
    are just normal stores.

Because of this change, some tests in verifier will fail and
these tests are removed. As a bonus, rename some out of bound
__sk_buff->cb access to proper field name and remove two
redundant "skb cb oob" tests.

Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-14 14:56:25 -04:00
WANG Cong
74030603df net_sched: move tcf_lock down after gen_replace_estimator()
Laura reported a sleep-in-atomic kernel warning inside
tcf_act_police_init() which calls gen_replace_estimator() with
spinlock protection.

It is not necessary in this case, we already have RTNL lock here
so it is enough to protect concurrent writers. For the reader,
i.e. tcf_act_police(), it needs to make decision based on this
rate estimator, in the worst case we drop more/less packets than
necessary while changing the rate in parallel, it is still acceptable.

Reported-by: Laura Abbott <labbott@redhat.com>
Reported-by: Nick Huber <nicholashuber@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-14 14:39:19 -04:00
Jiri Slaby
393cc3f511 fs/fcntl: f_setown, allow returning error
Allow f_setown to return an error value. We will fail in the next patch
with EINVAL for bad input to f_setown, so tile the path for the later
patch.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-06-14 08:46:36 -04:00
Johannes Berg
68dd02d19c dev_ioctl: copy only the smaller struct iwreq for wext
Unfortunately, struct iwreq isn't a proper subset of struct ifreq,
but is still handled by the same code path. Robert reported that
then applications may (randomly) fault if the struct iwreq they
pass happens to land within 8 bytes of the end of a mapping (the
struct is only 32 bytes, vs. struct ifreq's 40 bytes).

To fix this, pull out the code handling wireless extension ioctls
and copy only the smaller structure in this case.

This bug goes back a long time, I tracked that it was introduced
into mainline in 2.1.15, over 20 years ago!

This fixes https://bugzilla.kernel.org/show_bug.cgi?id=195869

Reported-by: Robert O'Callahan <robert@ocallahan.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-06-14 13:52:44 +02:00
Johannes Berg
4f39a1f587 wireless: wext: use struct iwreq earlier in the call chain
To make it clear that we never use struct ifreq, cast from it
directly in the wext entrypoint and use struct iwreq from there
on. The next patch will remove the cast again and pass the
correct struct from the beginning.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-06-14 13:37:42 +02:00
Dan Carpenter
e747f64336 xfrm: NULL dereference on allocation failure
The default error code in pfkey_msg2xfrm_state() is -ENOBUFS.  We
added a new call to security_xfrm_state_alloc() which sets "err" to zero
so there several places where we can return ERR_PTR(0) if kmalloc()
fails.  The caller is expecting error pointers so it leads to a NULL
dereference.

Fixes: df71837d50 ("[LSM-IPSec]: Security association restriction.")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-06-14 12:40:49 +02:00
Dan Carpenter
1e3d0c2c70 xfrm: Oops on error in pfkey_msg2xfrm_state()
There are some missing error codes here so we accidentally return NULL
instead of an error pointer.  It results in a NULL pointer dereference.

Fixes: df71837d50 ("[LSM-IPSec]: Security association restriction.")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-06-14 12:40:49 +02:00
Johannes Berg
8bfb367660 wireless: wext: remove ndo_do_ioctl fallback
There are no longer any drivers (in the tree proper, I didn't
check all the staging drivers) that take WEXT ioctls through
this API, the only remaining ones that even have ndo_do_ioctl
are using it only for private ioctls.

Therefore, we can remove this call.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-06-14 09:17:48 +02:00