Commit Graph

51437 Commits

Author SHA1 Message Date
Felix Fietkau
ba03137f4c netfilter: nf_flow_table: in flow_offload_lookup, skip entries being deleted
Preparation for sending flows back to the slow path

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24 10:28:57 +02:00
Felix Fietkau
59c466dd68 netfilter: nf_flow_table: add a new flow state for tearing down offloading
On cleanup, this will be treated differently from FLOW_OFFLOAD_DYING:

If FLOW_OFFLOAD_DYING is set, the connection is going away, so both the
offload state and the connection tracking entry will be deleted.

If FLOW_OFFLOAD_TEARDOWN is set, the connection remains alive, but
the offload state is torn down. This is useful for cases that require
more complex state tracking / timeout handling on TCP, or if the
connection has been idle for too long.

Support for sending flows back to the slow path will be implemented in
a following patch

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24 10:28:54 +02:00
Felix Fietkau
6bdc3c68d9 netfilter: nf_flow_table: make flow_offload_dead inline
It is too trivial to keep as a separate exported function

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24 10:28:52 +02:00
Felix Fietkau
84453a9025 netfilter: nf_flow_table: track flow tables in nf_flow_table directly
Avoids having nf_flow_table depend on nftables (useful for future
iptables backport work)

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24 10:28:50 +02:00
Felix Fietkau
17857d9299 netfilter: nf_flow_table: fix priv pointer for netdev hook
The offload ip hook expects a pointer to the flowtable, not to the
rhashtable. Since the rhashtable is the first member, this is safe for
the moment, but breaks as soon as the structure layout changes

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24 10:28:47 +02:00
Felix Fietkau
a268de77fa netfilter: nf_flow_table: move init code to nf_flow_table_core.c
Reduces duplication of .gc and .params in flowtable type definitions and
makes the API clearer

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24 10:28:45 +02:00
Felix Fietkau
1e80380b14 netfilter: nf_flow_table: relax mixed ipv4/ipv6 flowtable dependencies
Since the offload hook code was moved, this table no longer depends on
the IPv4 and IPv6 flowtable modules

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24 10:28:39 +02:00
Felix Fietkau
a908fdec3d netfilter: nf_flow_table: move ipv6 offload hook code to nf_flow_table
Useful as preparation for adding iptables support for offload.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24 10:28:18 +02:00
Felix Fietkau
3aeb51d7e7 netfilter: nf_flow_table: move ip header check out of nf_flow_exceeds_mtu
Allows the function to be shared with the IPv6 hook code

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24 10:28:15 +02:00
Felix Fietkau
7d20868717 netfilter: nf_flow_table: move ipv4 offload hook code to nf_flow_table
Allows some minor code sharing with the ipv6 hook code and is also
useful as preparation for adding iptables support for offload

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24 10:27:16 +02:00
Roopa Prabhu
9c20b9372f net: fib_rules: fix l3mdev netlink attr processing
Fixes: b16fb418b1 ("net: fib_rules: add extack support")
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 23:20:12 -04:00
Guillaume Nault
eb1c28c058 l2tp: check sockaddr length in pppol2tp_connect()
Check sockaddr_len before dereferencing sp->sa_protocol, to ensure that
it actually points to valid data.

Fixes: fd558d186d ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
Reported-by: syzbot+a70ac890b23b1bf29f5c@syzkaller.appspotmail.com
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 21:10:43 -04:00
David S. Miller
77621f024d Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter/IPVS fixes for net

The following patchset contains Netfilter/IPVS fixes for your net tree,
they are:

1) Fix SIP conntrack with phones sending session descriptions for different
   media types but same port numbers, from Florian Westphal.

2) Fix incorrect rtnl_lock mutex logic from IPVS sync thread, from Julian
   Anastasov.

3) Skip compat array allocation in ebtables if there is no entries, also
   from Florian.

4) Do not lose left/right bits when shifting marks from xt_connmark, from
   Jack Ma.

5) Silence false positive memleak in conntrack extensions, from Cong Wang.

6) Fix CONFIG_NF_REJECT_IPV6=m link problems, from Arnd Bergmann.

7) Cannot kfree rule that is already in list in nf_tables, switch order
   so this error handling is not required, from Florian Westphal.

8) Release set name in error path, from Florian.

9) include kmemleak.h in nf_conntrack_extend.c, from Stepheh Rothwell.

10) NAT chain and extensions depend on NF_TABLES.

11) Out of bound access when renaming chains, from Taehee Yoo.

12) Incorrect casting in xt_connmark leads to wrong bitshifting.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 16:22:24 -04:00
David Ahern
8a14e46f14 net/ipv6: Fix missing rcu dereferences on from
kbuild test robot reported 2 uses of rt->from not properly accessed
using rcu_dereference:
1. add rcu_dereference_protected to rt6_remove_exception_rt and make
   sure it is always called with rcu lock held.

2. change rt6_do_redirect to take a reference on 'from' when accessed
   the first time so it can be used the sceond time outside of the lock

Fixes: a68886a691 ("net/ipv6: Make from in rt6_info rcu protected")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 16:12:55 -04:00
David Ahern
c3c14da028 net/ipv6: add rcu locking to ip6_negative_advice
syzbot reported a suspicious rcu_dereference_check:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
  lockdep_rcu_suspicious+0x14a/0x153 kernel/locking/lockdep.c:4592
  rt6_check_expired+0x38b/0x3e0 net/ipv6/route.c:410
  ip6_negative_advice+0x67/0xc0 net/ipv6/route.c:2204
  dst_negative_advice include/net/sock.h:1786 [inline]
  sock_setsockopt+0x138f/0x1fe0 net/core/sock.c:1051
  __sys_setsockopt+0x2df/0x390 net/socket.c:1899
  SYSC_setsockopt net/socket.c:1914 [inline]
  SyS_setsockopt+0x34/0x50 net/socket.c:1911

Add rcu locking around call to rt6_check_expired in
ip6_negative_advice.

Fixes: a68886a691 ("net/ipv6: Make from in rt6_info rcu protected")
Reported-by: syzbot+2422c9e35796659d2273@syzkaller.appspotmail.com
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 16:12:54 -04:00
Alexander Aring
f18fa5de5b net: ieee802154: 6lowpan: fix frag reassembly
This patch initialize stack variables which are used in
frag_lowpan_compare_key to zero. In my case there are padding bytes in the
structures ieee802154_addr as well in frag_lowpan_compare_key. Otherwise
the key variable contains random bytes. The result is that a compare of
two keys by memcmp works incorrect.

Fixes: 648700f76b ("inet: frags: use rhashtables for reassembly units")
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Reported-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Stefan Schmidt <stefan@osg.samsung.com>
2018-04-23 20:56:24 +02:00
Eric Dumazet
aa8f877849 ipv6: add RTA_TABLE and RTA_PREFSRC to rtm_ipv6_policy
KMSAN reported use of uninit-value that I tracked to lack
of proper size check on RTA_TABLE attribute.

I also believe RTA_PREFSRC lacks a similar check.

Fixes: 86872cb579 ("[IPv6] route: FIB6 configuration using struct fib6_config")
Fixes: c3968a857a ("ipv6: RTA_PREFSRC support for ipv6 route source address selection")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 12:01:21 -04:00
Yafang Shao
c6849a3ac1 net: init sk_cookie for inet socket
With sk_cookie we can identify a socket, that is very helpful for
traceing and statistic, i.e. tcp tracepiont and ebpf.
So we'd better init it by default for inet socket.
When using it, we just need call atomic64_read(&sk->sk_cookie).

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 11:56:44 -04:00
Roopa Prabhu
b16fb418b1 net: fib_rules: add extack support
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 10:21:24 -04:00
Roopa Prabhu
f9d4b0c1e9 fib_rules: move common handling of newrule delrule msgs into fib_nl2rule
This reduces code duplication in the fib rule add and del paths.
Get rid of validate_rulemsg. This became obvious when adding duplicate
extack support in fib newrule/delrule error paths.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 10:21:24 -04:00
Yafang Shao
6163849d28 net: introduce a new tracepoint for tcp_rcv_space_adjust
tcp_rcv_space_adjust is called every time data is copied to user space,
introducing a tcp tracepoint for which could show us when the packet is
copied to user.

When a tcp packet arrives, tcp_rcv_established() will be called and with
the existed tracepoint tcp_probe we could get the time when this packet
arrives.
Then this packet will be copied to user, and tcp_rcv_space_adjust will
be called and with this new introduced tracepoint we could get the time
when this packet is copied to user.
With these two tracepoints, we could figure out whether the user program
processes this packet immediately or there's latency.

Hence in the printk message, sk_cookie is printed as a key to relate
tcp_rcv_space_adjust with tcp_probe.

Maybe we could export sockfd in this new tracepoint as well, then we
could relate this new tracepoint with epoll/read/recv* tracepoints, and
finally that could show us the whole lifespan of this packet. But we
could also implement that with pid as these functions are executed in
process context.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 09:58:18 -04:00
Jann Horn
7e5a206ab6 tcp: don't read out-of-bounds opsize
The old code reads the "opsize" variable from out-of-bounds memory (first
byte behind the segment) if a broken TCP segment ends directly after an
opcode that is neither EOL nor NOP.

The result of the read isn't used for anything, so the worst thing that
could theoretically happen is a pagefault; and since the physmap is usually
mostly contiguous, even that seems pretty unlikely.

The following C reproducer triggers the uninitialized read - however, you
can't actually see anything happen unless you put something like a
pr_warn() in tcp_parse_md5sig_option() to print the opsize.

====================================
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <stdlib.h>
#include <errno.h>
#include <stdarg.h>
#include <net/if.h>
#include <linux/if.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <linux/in.h>
#include <linux/if_tun.h>
#include <err.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <assert.h>

void systemf(const char *command, ...) {
  char *full_command;
  va_list ap;
  va_start(ap, command);
  if (vasprintf(&full_command, command, ap) == -1)
    err(1, "vasprintf");
  va_end(ap);
  printf("systemf: <<<%s>>>\n", full_command);
  system(full_command);
}

char *devname;

int tun_alloc(char *name) {
  int fd = open("/dev/net/tun", O_RDWR);
  if (fd == -1)
    err(1, "open tun dev");
  static struct ifreq req = { .ifr_flags = IFF_TUN|IFF_NO_PI };
  strcpy(req.ifr_name, name);
  if (ioctl(fd, TUNSETIFF, &req))
    err(1, "TUNSETIFF");
  devname = req.ifr_name;
  printf("device name: %s\n", devname);
  return fd;
}

#define IPADDR(a,b,c,d) (((a)<<0)+((b)<<8)+((c)<<16)+((d)<<24))

void sum_accumulate(unsigned int *sum, void *data, int len) {
  assert((len&2)==0);
  for (int i=0; i<len/2; i++) {
    *sum += ntohs(((unsigned short *)data)[i]);
  }
}

unsigned short sum_final(unsigned int sum) {
  sum = (sum >> 16) + (sum & 0xffff);
  sum = (sum >> 16) + (sum & 0xffff);
  return htons(~sum);
}

void fix_ip_sum(struct iphdr *ip) {
  unsigned int sum = 0;
  sum_accumulate(&sum, ip, sizeof(*ip));
  ip->check = sum_final(sum);
}

void fix_tcp_sum(struct iphdr *ip, struct tcphdr *tcp) {
  unsigned int sum = 0;
  struct {
    unsigned int saddr;
    unsigned int daddr;
    unsigned char pad;
    unsigned char proto_num;
    unsigned short tcp_len;
  } fakehdr = {
    .saddr = ip->saddr,
    .daddr = ip->daddr,
    .proto_num = ip->protocol,
    .tcp_len = htons(ntohs(ip->tot_len) - ip->ihl*4)
  };
  sum_accumulate(&sum, &fakehdr, sizeof(fakehdr));
  sum_accumulate(&sum, tcp, tcp->doff*4);
  tcp->check = sum_final(sum);
}

int main(void) {
  int tun_fd = tun_alloc("inject_dev%d");
  systemf("ip link set %s up", devname);
  systemf("ip addr add 192.168.42.1/24 dev %s", devname);

  struct {
    struct iphdr ip;
    struct tcphdr tcp;
    unsigned char tcp_opts[20];
  } __attribute__((packed)) syn_packet = {
    .ip = {
      .ihl = sizeof(struct iphdr)/4,
      .version = 4,
      .tot_len = htons(sizeof(syn_packet)),
      .ttl = 30,
      .protocol = IPPROTO_TCP,
      /* FIXUP check */
      .saddr = IPADDR(192,168,42,2),
      .daddr = IPADDR(192,168,42,1)
    },
    .tcp = {
      .source = htons(1),
      .dest = htons(1337),
      .seq = 0x12345678,
      .doff = (sizeof(syn_packet.tcp)+sizeof(syn_packet.tcp_opts))/4,
      .syn = 1,
      .window = htons(64),
      .check = 0 /*FIXUP*/
    },
    .tcp_opts = {
      /* INVALID: trailing MD5SIG opcode after NOPs */
      1, 1, 1, 1, 1,
      1, 1, 1, 1, 1,
      1, 1, 1, 1, 1,
      1, 1, 1, 1, 19
    }
  };
  fix_ip_sum(&syn_packet.ip);
  fix_tcp_sum(&syn_packet.ip, &syn_packet.tcp);
  while (1) {
    int write_res = write(tun_fd, &syn_packet, sizeof(syn_packet));
    if (write_res != sizeof(syn_packet))
      err(1, "packet write failed");
  }
}
====================================

Fixes: cfb6eeb4c8 ("[TCP]: MD5 Signature Option (RFC2385) support.")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 09:51:06 -04:00
Alexander Aring
d57493d6d1 net: sched: ife: check on metadata length
This patch checks if sk buffer is available to dererence ife header. If
not then NULL will returned to signal an malformed ife packet. This
avoids to crashing the kernel from outside.

Signed-off-by: Alexander Aring <aring@mojatatu.com>
Reviewed-by: Yotam Gigi <yotam.gi@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-22 21:12:00 -04:00
Alexander Aring
cc74eddd0f net: sched: ife: handle malformed tlv length
There is currently no handling to check on a invalid tlv length. This
patch adds such handling to avoid killing the kernel with a malformed
ife packet.

Signed-off-by: Alexander Aring <aring@mojatatu.com>
Reviewed-by: Yotam Gigi <yotam.gi@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-22 21:12:00 -04:00
Alexander Aring
f6cd14537f net: sched: ife: signal not finding metaid
We need to record stats for received metadata that we dont know how
to process. Have find_decode_metaid() return -ENOENT to capture this.

Signed-off-by: Alexander Aring <aring@mojatatu.com>
Reviewed-by: Yotam Gigi <yotam.gi@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-22 21:12:00 -04:00
Doron Roberts-Kedes
7c5aba211d strparser: Do not call mod_delayed_work with a timeout of LONG_MAX
struct sock's sk_rcvtimeo is initialized to
LONG_MAX/MAX_SCHEDULE_TIMEOUT in sock_init_data. Calling
mod_delayed_work with a timeout of LONG_MAX causes spurious execution of
the work function. timer->expires is set equal to jiffies + LONG_MAX.
When timer_base->clk falls behind the current value of jiffies,
the delta between timer_base->clk and jiffies + LONG_MAX causes the
expiration to be in the past. Returning early from strp_start_timer if
timeo == LONG_MAX solves this problem.

Found while testing net/tls_sw recv path.

Fixes: 43a0c6751a ("strparser: Stream parser for messages")
Reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Doron Roberts-Kedes <doronrk@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-22 21:09:16 -04:00
Ahmed Abdelsalam
a957fa190a ipv6: sr: fix NULL pointer dereference in seg6_do_srh_encap()- v4 pkts
In case of seg6 in encap mode, seg6_do_srh_encap() calls set_tun_src()
in order to set the src addr of outer IPv6 header.

The net_device is required for set_tun_src(). However calling ip6_dst_idev()
on dst_entry in case of IPv4 traffic results on the following bug.

Using just dst->dev should fix this BUG.

[  196.242461] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[  196.242975] PGD 800000010f076067 P4D 800000010f076067 PUD 10f060067 PMD 0
[  196.243329] Oops: 0000 [#1] SMP PTI
[  196.243468] Modules linked in: nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd input_leds glue_helper led_class pcspkr serio_raw mac_hid video autofs4 hid_generic usbhid hid e1000 i2c_piix4 ahci pata_acpi libahci
[  196.244362] CPU: 2 PID: 1089 Comm: ping Not tainted 4.16.0+ #1
[  196.244606] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[  196.244968] RIP: 0010:seg6_do_srh_encap+0x1ac/0x300
[  196.245236] RSP: 0018:ffffb2ce00b23a60 EFLAGS: 00010202
[  196.245464] RAX: 0000000000000000 RBX: ffff8c7f53eea300 RCX: 0000000000000000
[  196.245742] RDX: 0000f10000000000 RSI: ffff8c7f52085a6c RDI: ffff8c7f41166850
[  196.246018] RBP: ffffb2ce00b23aa8 R08: 00000000000261e0 R09: ffff8c7f41166800
[  196.246294] R10: ffffdce5040ac780 R11: ffff8c7f41166828 R12: ffff8c7f41166808
[  196.246570] R13: ffff8c7f52085a44 R14: ffffffffb73211c0 R15: ffff8c7e69e44200
[  196.246846] FS:  00007fc448789700(0000) GS:ffff8c7f59d00000(0000) knlGS:0000000000000000
[  196.247286] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  196.247526] CR2: 0000000000000000 CR3: 000000010f05a000 CR4: 00000000000406e0
[  196.247804] Call Trace:
[  196.247972]  seg6_do_srh+0x15b/0x1c0
[  196.248156]  seg6_output+0x3c/0x220
[  196.248341]  ? prandom_u32+0x14/0x20
[  196.248526]  ? ip_idents_reserve+0x6c/0x80
[  196.248723]  ? __ip_select_ident+0x90/0x100
[  196.248923]  ? ip_append_data.part.50+0x6c/0xd0
[  196.249133]  lwtunnel_output+0x44/0x70
[  196.249328]  ip_send_skb+0x15/0x40
[  196.249515]  raw_sendmsg+0x8c3/0xac0
[  196.249701]  ? _copy_from_user+0x2e/0x60
[  196.249897]  ? rw_copy_check_uvector+0x53/0x110
[  196.250106]  ? _copy_from_user+0x2e/0x60
[  196.250299]  ? copy_msghdr_from_user+0xce/0x140
[  196.250508]  sock_sendmsg+0x36/0x40
[  196.250690]  ___sys_sendmsg+0x292/0x2a0
[  196.250881]  ? _cond_resched+0x15/0x30
[  196.251074]  ? copy_termios+0x1e/0x70
[  196.251261]  ? _copy_to_user+0x22/0x30
[  196.251575]  ? tty_mode_ioctl+0x1c3/0x4e0
[  196.251782]  ? _cond_resched+0x15/0x30
[  196.251972]  ? mutex_lock+0xe/0x30
[  196.252152]  ? vvar_fault+0xd2/0x110
[  196.252337]  ? __do_fault+0x1f/0xc0
[  196.252521]  ? __handle_mm_fault+0xc1f/0x12d0
[  196.252727]  ? __sys_sendmsg+0x63/0xa0
[  196.252919]  __sys_sendmsg+0x63/0xa0
[  196.253107]  do_syscall_64+0x72/0x200
[  196.253305]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  196.253530] RIP: 0033:0x7fc4480b0690
[  196.253715] RSP: 002b:00007ffde9f252f8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[  196.254053] RAX: ffffffffffffffda RBX: 0000000000000040 RCX: 00007fc4480b0690
[  196.254331] RDX: 0000000000000000 RSI: 000000000060a360 RDI: 0000000000000003
[  196.254608] RBP: 00007ffde9f253f0 R08: 00000000002d1e81 R09: 0000000000000002
[  196.254884] R10: 00007ffde9f250c0 R11: 0000000000000246 R12: 0000000000b22070
[  196.255205] R13: 20c49ba5e353f7cf R14: 431bde82d7b634db R15: 00007ffde9f278fe
[  196.255484] Code: a5 0f b6 45 c0 41 88 41 28 41 0f b6 41 2c 48 c1 e0 04 49 8b 54 01 38 49 8b 44 01 30 49 89 51 20 49 89 41 18 48 8b 83 b0 00 00 00 <48> 8b 30 49 8b 86 08 0b 00 00 48 8b 40 20 48 8b 50 08 48 0b 10
[  196.256190] RIP: seg6_do_srh_encap+0x1ac/0x300 RSP: ffffb2ce00b23a60
[  196.256445] CR2: 0000000000000000
[  196.256676] ---[ end trace 71af7d093603885c ]---

Fixes: 8936ef7604 ("ipv6: sr: fix NULL pointer dereference when setting encap source address")
Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com>
Acked-by: David Lebrun <dlebrun@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-22 21:04:17 -04:00
Cong Wang
3a04ce7130 llc: fix NULL pointer deref for SOCK_ZAPPED
For SOCK_ZAPPED socket, we don't need to care about llc->sap,
so we should just skip these refcount functions in this case.

Fixes: f7e4367268 ("llc: hold llc_sap before release_sock()")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-22 14:56:22 -04:00
Cong Wang
b905ef9ab9 llc: delete timers synchronously in llc_sk_free()
The connection timers of an llc sock could be still flying
after we delete them in llc_sk_free(), and even possibly
after we free the sock. We could just wait synchronously
here in case of troubles.

Note, I leave other call paths as they are, since they may
not have to wait, at least we can change them to synchronously
when needed.

Also, move the code to net/llc/llc_conn.c, which is apparently
a better place.

Reported-by: <syzbot+f922284c18ea23a8e457@syzkaller.appspotmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-22 14:55:03 -04:00
Guillaume Nault
5411b6187a l2tp: fix {pppol2tp, l2tp_dfs}_seq_stop() in case of seq_file overflow
Commit 0e0c3fee3a ("l2tp: hold reference on tunnels printed in pppol2tp proc file")
assumed that if pppol2tp_seq_stop() was called with non-NULL private
data (the 'v' pointer), then pppol2tp_seq_start() would not be called
again. It turns out that this isn't guaranteed, and overflowing the
seq_file's buffer in pppol2tp_seq_show() is a way to get into this
situation.

Therefore, pppol2tp_seq_stop() needs to reset pd->tunnel, so that
pppol2tp_seq_start() won't drop a reference again if it gets called.
We also have to clear pd->session, because the rest of the code expects
a non-NULL tunnel when pd->session is set.

The l2tp_debugfs module has the same issue. Fix it in the same way.

Fixes: 0e0c3fee3a ("l2tp: hold reference on tunnels printed in pppol2tp proc file")
Fixes: f726214d9b ("l2tp: hold reference on tunnels printed in l2tp/tunnels debugfs file")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-22 14:46:35 -04:00
Sven Eckelmann
dc06338df9 batman-adv: Remove unused dentry without DEBUGFS
The debug_dir variable in the main structures is only accessed when
CONFIG_BATMAN_ADV_DEBUGFS is enabled. It can be dropped to potentially save
some bytes of memory.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2018-04-22 10:11:07 +02:00
Sven Eckelmann
6486379d87 batman-adv: Avoid bool in structures
Using the bool type for structure member is considered inappropriate [1]
for the kernel. Its size is not well defined (but usually 1 byte but maybe
also 4 byte).

[1] https://lkml.org/lkml/2017/11/21/384

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2018-04-22 10:11:06 +02:00
Linus Lüssing
f26e4e98b1 batman-adv: Avoid old nodes disabling multicast optimizations completely
Instead of disabling multicast optimizations mesh-wide once a node with
no multicast optimizations capabilities joins the mesh, do the
following:

Just insert such nodes into the WANT_ALL_IPV4/IPV6 lists. This is
sufficient to avoid multicast packet loss to such unsupportive nodes.

Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2018-04-22 09:29:14 +02:00
Sven Eckelmann
1ba93211e2 batman-adv: Disable CONFIG_BATMAN_ADV_DEBUGFS by default
All tools which were known to the batman-adv development team are
supporting the batman-adv netlink interface since a while. Also debugfs is
not supported for batman-adv interfaces in any non-default netns. Thus
disabling CONFIG_BATMAN_ADV_DEBUGFS by default should not cause problems on
most systems. It is still possible to enable it in case it is still
required in a specific setup.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2018-04-22 09:29:14 +02:00
Simon Wunderlich
f546a99d62 batman-adv: Start new development cycle
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2018-04-22 09:29:14 +02:00
David S. Miller
e0ada51db9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts were simple overlapping changes in microchip
driver.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-21 16:32:48 -04:00
David Ahern
8ae869714b net/ipv6: Remove unncessary check on f6i in fib6_check
Dan reported an imbalance in fib6_check on use of f6i and checking
whether it is null. Since fib6_check is only called if f6i is non-null,
remove the unnecessary check.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-21 16:06:14 -04:00
David Ahern
a68886a691 net/ipv6: Make from in rt6_info rcu protected
When a dst entry is created from a fib entry, the 'from' in rt6_info
is set to the fib entry. The 'from' reference is used most notably for
cookie checking - making sure stale dst entries are updated if the
fib entry is changed.

When a fib entry is deleted, the pcpu routes on it are walked releasing
the fib6_info reference. This is needed for the fib6_info cleanup to
happen and to make sure all device references are released in a timely
manner.

There is a race window when a FIB entry is deleted and the 'from' on the
pcpu route is dropped and the pcpu route hits a cookie check. Handle
this race using rcu on from.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-21 16:06:14 -04:00
David Ahern
5bcaa41b96 net/ipv6: Move release of fib6_info from pcpu routes to helper
Code move only; no functional change intended.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-21 16:06:13 -04:00
David Ahern
a87b7dc9f7 net/ipv6: Move rcu locking to callers of fib6_get_cookie_safe
A later patch protects 'from' in rt6_info and this simplifies the
locking needed by it.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-21 16:06:13 -04:00
David Ahern
4d85cd0c2a net/ipv6: Move rcu_read_lock to callers of ip6_rt_cache_alloc
A later patch protects 'from' in rt6_info and this simplifies the
locking needed by it.

With the move, the fib6_info_hold for the uncached_rt is no longer
needed since the rcu_lock is still held.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-21 16:06:13 -04:00
David Ahern
a269f1a764 net/ipv6: Rename rt6_get_cookie_safe
rt6_get_cookie_safe takes a fib6_info and checks the sernum of
the node. Update the name to reflect its purpose.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-21 16:06:13 -04:00
David Ahern
6a3e030f08 net/ipv6: Clean up rt expires helpers
rt6_clean_expires and rt6_set_expires are no longer used. Removed them.
rt6_update_expires has 1 caller in route.c, so move it from the header.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-21 16:06:13 -04:00
David S. Miller
1b80f86ed6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2018-04-21

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Initial work on BPF Type Format (BTF) is added, which is a meta
   data format which describes the data types of BPF programs / maps.
   BTF has its roots from CTF (Compact C-Type format) with a number
   of changes to it. First use case is to provide a generic pretty
   print capability for BPF maps inspection, later work will also
   add BTF to bpftool. pahole support to convert dwarf to BTF will
   be upstreamed as well (https://github.com/iamkafai/pahole/tree/btf),
   from Martin.

2) Add a new xdp_bpf_adjust_tail() BPF helper for XDP that allows
   for changing the data_end pointer. Only shrinking is currently
   supported which helps for crafting ICMP control messages. Minor
   changes in drivers have been added where needed so they recalc
   the packet's length also when data_end was adjusted, from Nikita.

3) Improve bpftool to make it easier to feed hex bytes via cmdline
   for map operations, from Quentin.

4) Add support for various missing BPF prog types and attach types
   that have been added to kernel recently but neither to bpftool
   nor libbpf yet. Doc and bash completion updates have been added
   as well for bpftool, from Andrey.

5) Proper fix for avoiding to leak info stored in frame data on page
   reuse for the two bpf_xdp_adjust_{head,meta} helpers by disallowing
   to move the pointers into struct xdp_frame area, from Jesper.

6) Follow-up compile fix from BTF in order to include stdbool.h in
   libbpf, from Björn.

7) Few fixes in BPF sample code, that is, a typo on the netdevice
   in a comment and fixup proper dump of XDP action code in the
   tracepoint exception, from Wang and Jesper.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-21 15:56:15 -04:00
Felix Fietkau
1a999d899b netfilter: nf_flow_table: rename nf_flow_table.c to nf_flow_table_core.c
Preparation for adding more code to the same module

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-21 19:41:55 +02:00
Felix Fietkau
4f3780c004 netfilter: nf_flow_table: cache mtu in struct flow_offload_tuple
Reduces the number of cache lines touched in the offload forwarding
path. This is safe because PMTU limits are bypassed for the forwarding
path (see commit f87c10a8aa for more details).

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-21 19:20:40 +02:00
Felix Fietkau
07cb9623ee ipv6: make ip6_dst_mtu_forward inline
Just like ip_dst_mtu_maybe_forward(), to avoid a dependency with ipv6.ko.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-21 19:20:04 +02:00
Linus Torvalds
a72db42cee Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Unbalanced refcounting in TIPC, from Jon Maloy.

 2) Only allow TCP_MD5SIG to be set on sockets in close or listen state.
    Once the connection is established it makes no sense to change this.
    From Eric Dumazet.

 3) Missing attribute validation in neigh_dump_table(), also from Eric
    Dumazet.

 4) Fix address comparisons in SCTP, from Xin Long.

 5) Neigh proxy table clearing can deadlock, from Wolfgang Bumiller.

 6) Fix tunnel refcounting in l2tp, from Guillaume Nault.

 7) Fix double list insert in team driver, from Paolo Abeni.

 8) af_vsock.ko module was accidently made unremovable, from Stefan
    Hajnoczi.

 9) Fix reference to freed llc_sap object in llc stack, from Cong Wang.

10) Don't assume netdevice struct is DMA'able memory in virtio_net
    driver, from Michael S. Tsirkin.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (62 commits)
  net/smc: fix shutdown in state SMC_LISTEN
  bnxt_en: Fix memory fault in bnxt_ethtool_init()
  virtio_net: sparse annotation fix
  virtio_net: fix adding vids on big-endian
  virtio_net: split out ctrl buffer
  net: hns: Avoid action name truncation
  docs: ip-sysctl.txt: fix name of some ipv6 variables
  vmxnet3: fix incorrect dereference when rxvlan is disabled
  llc: hold llc_sap before release_sock()
  MAINTAINERS: Direct networking documentation changes to netdev
  atm: iphase: fix spelling mistake: "Tansmit" -> "Transmit"
  net: qmi_wwan: add Wistron Neweb D19Q1
  net: caif: fix spelling mistake "UKNOWN" -> "UNKNOWN"
  net: stmmac: Disable ACS Feature for GMAC >= 4
  net: mvpp2: Fix DMA address mask size
  net: change the comment of dev_mc_init
  net: qualcomm: rmnet: Fix warning seen with fill_info
  tun: fix vlan packet truncation
  tipc: fix infinite loop when dumping link monitor summary
  tipc: fix use-after-free in tipc_nametbl_stop
  ...
2018-04-20 09:34:39 -07:00
Eric Dumazet
263243d6c2 net/ipv6: Fix ip6_convert_metrics() bug
If ip6_convert_metrics() fails to allocate memory, it should not
overwrite rt->fib6_metrics or we risk a crash later as syzbot found.

BUG: KASAN: null-ptr-deref in atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
BUG: KASAN: null-ptr-deref in refcount_sub_and_test+0x92/0x330 lib/refcount.c:179
Read of size 4 at addr 0000000000000044 by task syzkaller832429/4487

CPU: 1 PID: 4487 Comm: syzkaller832429 Not tainted 4.16.0+ #6
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1b9/0x294 lib/dump_stack.c:113
 kasan_report_error mm/kasan/report.c:352 [inline]
 kasan_report.cold.7+0x6d/0x2fe mm/kasan/report.c:412
 check_memory_region_inline mm/kasan/kasan.c:260 [inline]
 check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
 kasan_check_read+0x11/0x20 mm/kasan/kasan.c:272
 atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
 refcount_sub_and_test+0x92/0x330 lib/refcount.c:179
 refcount_dec_and_test+0x1a/0x20 lib/refcount.c:212
 fib6_info_destroy+0x2d0/0x3c0 net/ipv6/ip6_fib.c:206
 fib6_info_release include/net/ip6_fib.h:304 [inline]
 ip6_route_info_create+0x677/0x3240 net/ipv6/route.c:3020
 ip6_route_add+0x23/0xb0 net/ipv6/route.c:3030
 inet6_rtm_newroute+0x142/0x160 net/ipv6/route.c:4406
 rtnetlink_rcv_msg+0x466/0xc10 net/core/rtnetlink.c:4648
 netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2448
 rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:4666
 netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
 netlink_unicast+0x58b/0x740 net/netlink/af_netlink.c:1336
 netlink_sendmsg+0x9f0/0xfa0 net/netlink/af_netlink.c:1901
 sock_sendmsg_nosec net/socket.c:629 [inline]
 sock_sendmsg+0xd5/0x120 net/socket.c:639
 ___sys_sendmsg+0x805/0x940 net/socket.c:2117
 __sys_sendmsg+0x115/0x270 net/socket.c:2155
 SYSC_sendmsg net/socket.c:2164 [inline]
 SyS_sendmsg+0x29/0x30 net/socket.c:2162
 do_syscall_64+0x29e/0x9d0 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x42/0xb7

Fixes: d4ead6b34b ("net/ipv6: move metrics from dst to rt6_info")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: David Ahern <dsa@cumulusnetworks.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-20 11:36:15 -04:00
GhantaKrishnamurthy MohanKrishna
682cd3cf94 tipc: confgiure and apply UDP bearer MTU on running links
Currently, we have option to configure MTU of UDP media. The configured
MTU takes effect on the links going up after that moment. I.e, a user
has to reset bearer to have new value applied across its links. This is
confusing and disturbing on a running cluster.

We now introduce the functionality to change the default UDP bearer MTU
in struct tipc_bearer. Additionally, the links are updated dynamically,
without any need for a reset, when bearer value is changed. We leverage
the existing per-link functionality and the design being symetrical to
the confguration of link tolerance.

Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-20 11:04:05 -04:00
GhantaKrishnamurthy MohanKrishna
901271e040 tipc: implement configuration of UDP media MTU
In previous commit, we changed the default emulated MTU for UDP bearers
to 14k.

This commit adds the functionality to set/change the default value
by configuring new MTU for UDP media. UDP bearer(s) have to be disabled
and enabled back for the new MTU to take effect.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-20 11:04:05 -04:00
GhantaKrishnamurthy MohanKrishna
a4dfa72d0a tipc: set default MTU for UDP media
Currently, all bearers are configured with MTU value same as the
underlying L2 device. However, in case of bearers with media type
UDP, higher throughput is possible with a fixed and higher emulated
MTU value than adapting to the underlying L2 MTU.

In this commit, we introduce a parameter mtu in struct tipc_media
and a default value is set for UDP. A default value of 14k
was determined by experimentation and found to have a higher throughput
than 16k. MTU for UDP bearers are assigned the above set value of
media MTU.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-20 11:04:05 -04:00
Srinivas Dasari
2f0605a697 nl80211: Free connkeys on external authentication failure
The failure scenario while processing
NL80211_ATTR_EXTERNAL_AUTH_SUPPORT does not free
the connkeys. This commit addresses the same.

Signed-off-by: Srinivas Dasari <dasaris@codeaurora.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-04-20 09:58:03 +02:00
Ursula Braun
1255fcb2a6 net/smc: fix shutdown in state SMC_LISTEN
Calling shutdown with SHUT_RD and SHUT_RDWR for a listening SMC socket
crashes, because
   commit 127f497058 ("net/smc: release clcsock from tcp_listen_worker")
releases the internal clcsock in smc_close_active() and sets smc->clcsock
to NULL.
For SHUT_RD the smc_close_active() call is removed.
For SHUT_RDWR the kernel_sock_shutdown() call is omitted, since the
clcsock is already released.

Fixes: 127f497058 ("net/smc: release clcsock from tcp_listen_worker")
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Reported-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 16:38:39 -04:00
David Ahern
27b10608a2 net/ipv6: Fix gfp_flags arg to addrconf_prefix_route
Eric noticed that __ipv6_ifa_notify is called under rcu_read_lock, so
the gfp argument to addrconf_prefix_route can not be GFP_KERNEL.

While scrubbing other calls I noticed addrconf_addr_gen has one
place with GFP_ATOMIC that can be GFP_KERNEL.

Fixes: acb54e3cba ("net/ipv6: Add gfp_flags to route add functions")
Reported-by: syzbot+2add39b05179b31f912f@syzkaller.appspotmail.com
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 15:40:13 -04:00
David Ahern
dcd1f57295 net/ipv6: Remove fib6_idev
fib6_idev can be obtained from __in6_dev_get on the nexthop device
rather than caching it in the fib6_info. Remove it.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 15:40:13 -04:00
David Ahern
eea68cd371 net/ipv6: Remove unnecessary checks on fib6_idev
Prior to 4832c30d54 ("net: ipv6: put host and anycast routes on device
with address") host routes and anycast routes were installed with the
device set to loopback (or VRF device once that feature was added). In the
older code dst.dev was set to loopback (needed for packet tx) and rt6i_idev
was used to denote the actual interface.

Commit 4832c30d54 changed the code to have dst.dev pointing to the real
device with the switch to lo or vrf device done on dst clones. As a
consequence of this change a couple of device checks during route lookups
are no longer needed. Remove them.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 15:40:13 -04:00
David Ahern
9ee8cbb2fd net/ipv6: Remove aca_idev
aca_idev has only 1 user - inet6_fill_ifacaddr - and it only
wants the device index which can be extracted from the fib6_info
nexthop.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 15:40:13 -04:00
David Ahern
360a9887c8 net/ipv6: Rename addrconf_dst_alloc
addrconf_dst_alloc now returns a fib6_info. Update the name
and its users to reflect the change.

Rename only; no functional change intended.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 15:40:13 -04:00
David Ahern
93c2fb253d net/ipv6: Rename fib6_info struct elements
Change the prefix for fib6_info struct elements from rt6i_ to fib6_.
rt6i_pcpu and rt6i_exception_bucket are left as is given that they
point to rt6_info entries.

Rename only; not functional change intended.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 15:40:12 -04:00
Cong Wang
f7e4367268 llc: hold llc_sap before release_sock()
syzbot reported we still access llc->sap in llc_backlog_rcv()
after it is freed in llc_sap_remove_socket():

Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1b9/0x294 lib/dump_stack.c:113
 print_address_description+0x6c/0x20b mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
 __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:430
 llc_conn_ac_send_sabme_cmd_p_set_x+0x3a8/0x460 net/llc/llc_c_ac.c:785
 llc_exec_conn_trans_actions net/llc/llc_conn.c:475 [inline]
 llc_conn_service net/llc/llc_conn.c:400 [inline]
 llc_conn_state_process+0x4e1/0x13a0 net/llc/llc_conn.c:75
 llc_backlog_rcv+0x195/0x1e0 net/llc/llc_conn.c:891
 sk_backlog_rcv include/net/sock.h:909 [inline]
 __release_sock+0x12f/0x3a0 net/core/sock.c:2335
 release_sock+0xa4/0x2b0 net/core/sock.c:2850
 llc_ui_release+0xc8/0x220 net/llc/af_llc.c:204

llc->sap is refcount'ed and llc_sap_remove_socket() is paired
with llc_sap_add_socket(). This can be amended by holding its refcount
before llc_sap_remove_socket() and releasing it after release_sock().

Reported-by: <syzbot+6e181fc95081c2cf9051@syzkaller.appspotmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 13:54:53 -04:00
Eric Dumazet
88078d98d1 net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends
After working on IP defragmentation lately, I found that some large
packets defeat CHECKSUM_COMPLETE optimization because of NIC adding
zero paddings on the last (small) fragment.

While removing the padding with pskb_trim_rcsum(), we set skb->ip_summed
to CHECKSUM_NONE, forcing a full csum validation, even if all prior
fragments had CHECKSUM_COMPLETE set.

We can instead compute the checksum of the part we are trimming,
usually smaller than the part we keep.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 13:44:11 -04:00
Colin Ian King
5e84b38b07 net: caif: fix spelling mistake "UKNOWN" -> "UNKNOWN"
Trivial fix to spelling mistake

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 13:37:10 -04:00
Felix Fietkau
047b300e6e netfilter: nf_flow_table: clean up flow_offload_alloc
Reduce code duplication and make it much easier to read

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-19 19:22:06 +02:00
Yuchung Cheng
feb5f2ec64 tcp: export packets delivery info
Export data delivered and delivered with CE marks to
1) SNMP TCPDelivered and TCPDeliveredCE
2) getsockopt(TCP_INFO)
3) Timestamping API SOF_TIMESTAMPING_OPT_STATS

Note that for SCM_TSTAMP_ACK, the delivery info in
SOF_TIMESTAMPING_OPT_STATS is reported before the info
was fully updated on the ACK.

These stats help application monitor TCP delivery and ECN status
on per host, per connection, even per message level.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 13:05:16 -04:00
Yuchung Cheng
e21db6f69a tcp: track total bytes delivered with ECN CE marks
Introduce a new delivered_ce stat in tcp socket to estimate
number of packets being marked with CE bits. The estimation is
done via ACKs with ECE bit. Depending on the actual receiver
behavior, the estimation could have biases.

Since the TCP sender can't really see the CE bit in the data path,
so the sender is technically counting packets marked delivered with
the "ECE / ECN-Echo" flag set.

With RFC3168 ECN, because the ECE bit is sticky, this count can
drastically overestimate the nummber of CE-marked data packets

With DCTCP-style ECN this should be reasonably precise unless there
is loss in the ACK path, in which case it's not precise.

With AccECN proposal this can be made still more precise, even in
the case some degree of ACK loss.

However this is sender's best estimate of CE information.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 13:05:16 -04:00
Yuchung Cheng
a77fa0104a tcp: new helper to calculate newly delivered
Add new helper tcp_newly_delivered() to prepare the ECN accounting change.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 13:05:16 -04:00
Yuchung Cheng
bef5767f37 tcp: better delivery accounting for SYN-ACK and SYN-data
the tcp_sock:delivered has inconsistent accounting for SYN and FIN.
1. it counts pure FIN
2. it counts pure SYN
3. it counts SYN-data twice
4. it does not count SYN-ACK

For congestion control perspective it does not matter much as C.C. only
cares about the difference not the aboslute value. But the next patch
would export this field to user-space so it's better to report the absolute
value w/o these caveats.

This patch counts SYN, SYN-ACK, or SYN-data delivery once always in
the "delivered" field.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 13:05:16 -04:00
sunlianwen
bb9aaaa184 net: change the comment of dev_mc_init
The comment of dev_mc_init() is wrong. which use dev_mc_flush
instead of dev_mc_init.

Signed-off-by: Lianwen Sun <sunlw.fnst@cn.fujitsu.com
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 12:58:20 -04:00
Jesper Dangaard Brouer
97e19cce05 bpf: reserve xdp_frame size in xdp headroom
Commit 6dfb970d3d ("xdp: avoid leaking info stored in frame data on
page reuse") tried to allow user/bpf_prog to (re)use area used by
xdp_frame (stored in frame headroom), by memset clearing area when
bpf_xdp_adjust_head give bpf_prog access to headroom area.

The mentioned commit had two bugs. (1) Didn't take bpf_xdp_adjust_meta
into account. (2) a combination of bpf_xdp_adjust_head calls, where
xdp->data is moved into xdp_frame section, can cause clearing
xdp_frame area again for area previously granted to bpf_prog.

After discussions with Daniel, we choose to implement a simpler
solution to the problem, which is to reserve the headroom used by
xdp_frame info.

This also avoids the situation where bpf_prog is allowed to adjust/add
headers, and then XDP_REDIRECT later drops the packet due to lack of
headroom for the xdp_frame.  This would likely confuse the end-user.

Fixes: 6dfb970d3d ("xdp: avoid leaking info stored in frame data on page reuse")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-04-19 17:52:12 +02:00
weiyongjun (A)
83826469e3 cfg80211: fix possible memory leak in regdb_query_country()
'wmm_ptrs' is malloced in regdb_query_country() and should be freed
before leaving from the error handling cases, otherwise it will cause
memory leak.

Fixes: 230ebaa189 ("cfg80211: read wmm rules from regulatory database")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
[johannes: add Fixes tag]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-04-19 17:02:53 +02:00
Toke Høiland-Jørgensen
8db0c43369 regulatory: Rename confusing 'country IE' in log output
The 'country IE' messages in the log can be confusing and make people think
that the country code has been set to Ireland. Fix this by changing the
log messages to use 'country element' instead (as they are no longer called
'information element' in the spec anyway).

Reported-by: Bernhard Gabler <Bernhard_Gabler@web.de>
Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-04-19 16:57:20 +02:00
Pablo Neira Ayuso
5a786232eb netfilter: xt_connmark: do not cast xt_connmark_tginfo1 to xt_connmark_tginfo2
These structures have different layout, fill xt_connmark_tginfo2 with
old fields in xt_connmark_tginfo1. Based on patch from Jack Ma.

Fixes: 472a73e007 ("netfilter: xt_conntrack: Support bit-shifting for CONNMARK & MARK targets.")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-19 16:19:28 +02:00
Johannes Berg
a7cfebcb75 cfg80211: limit wiphy names to 128 bytes
There's currently no limit on wiphy names, other than netlink
message size and memory limitations, but that causes issues when,
for example, the wiphy name is used in a uevent, e.g. in rfkill
where we use the same name for the rfkill instance, and then the
buffer there is "only" 2k for the environment variables.

This was reported by syzkaller, which used a 4k name.

Limit the name to something reasonable, I randomly picked 128.

Reported-by: syzbot+230d9e642a85d3fec29c@syzkaller.appspotmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-04-19 15:46:34 +02:00
Ilan Peer
911a26484c mac80211: Fix condition validating WMM IE
Commit c470bdc1aa ("mac80211: don't WARN on bad WMM parameters from
buggy APs") handled cases where an AP reports a zeroed WMM
IE. However, the condition that checks the validity accessed the wrong
index in the ieee80211_tx_queue_params array, thus wrongly deducing
that the parameters are invalid. Fix it.

Fixes: c470bdc1aa ("mac80211: don't WARN on bad WMM parameters from buggy APs")
Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-04-19 15:46:22 +02:00
Taehee Yoo
ce20cdf498 netfilter: xt_NFLOG: use nf_log_packet instead of nfulnl_log_packet.
The nfulnl_log_packet() is added to make sure that the NFLOG target
works as only user-space logger. but now, nf_log_packet() can find proper
log function using NF_LOG_TYPE_ULOG and NF_LOG_TYPE_LOG.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-19 13:02:44 +02:00
Taehee Yoo
d71efb599a netfilter: nf_tables: fix out-of-bounds in nft_chain_commit_update
When chain name is changed, nft_chain_commit_update is called.
In the nft_chain_commit_update, trans->ctx.chain->name has old chain name
and nft_trans_chain_name(trans) has new chain name.
If new chain name is longer than old chain name, KASAN warns
slab-out-of-bounds.

[  175.015012] BUG: KASAN: slab-out-of-bounds in strcpy+0x9e/0xb0
[  175.022735] Write of size 1 at addr ffff880114e022da by task iptables-compat/1458

[  175.031353] CPU: 0 PID: 1458 Comm: iptables-compat Not tainted 4.16.0-rc7+ #146
[  175.031353] Hardware name: To be filled by O.E.M. To be filled by O.E.M./Aptio CRB, BIOS 5.6.5 07/08/2015
[  175.031353] Call Trace:
[  175.031353]  dump_stack+0x68/0xa0
[  175.031353]  print_address_description+0xd0/0x260
[  175.031353]  ? strcpy+0x9e/0xb0
[  175.031353]  kasan_report+0x234/0x350
[  175.031353]  __asan_report_store1_noabort+0x1c/0x20
[  175.031353]  strcpy+0x9e/0xb0
[  175.031353]  nf_tables_commit+0x1ccc/0x2990
[  175.031353]  nfnetlink_rcv+0x141e/0x16c0
[  175.031353]  ? nfnetlink_net_init+0x150/0x150
[  175.031353]  ? lock_acquire+0x370/0x370
[  175.031353]  ? lock_acquire+0x370/0x370
[  175.031353]  netlink_unicast+0x444/0x640
[  175.031353]  ? netlink_attachskb+0x700/0x700
[  175.031353]  ? _copy_from_iter_full+0x180/0x740
[  175.031353]  ? kasan_check_write+0x14/0x20
[  175.031353]  ? _copy_from_user+0x9b/0xd0
[  175.031353]  netlink_sendmsg+0x845/0xc70
[ ... ]

Steps to reproduce:
   iptables-compat -N 1
   iptables-compat -E 1 aaaaaaaaa

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-19 12:34:13 +02:00
Pablo Neira Ayuso
39f2ff0816 netfilter: nf_tables: NAT chain and extensions require NF_TABLES
Move these options inside the scope of the 'if' NF_TABLES and
NF_TABLES_IPV6 dependencies. This patch fixes:

   net/ipv6/netfilter/nft_chain_nat_ipv6.o: In function `nft_nat_do_chain':
>> net/ipv6/netfilter/nft_chain_nat_ipv6.c:37: undefined reference to `nft_do_chain'
   net/ipv6/netfilter/nft_chain_nat_ipv6.o: In function `nft_chain_nat_ipv6_exit':
>> net/ipv6/netfilter/nft_chain_nat_ipv6.c:94: undefined reference to `nft_unregister_chain_type'
   net/ipv6/netfilter/nft_chain_nat_ipv6.o: In function `nft_chain_nat_ipv6_init':
>> net/ipv6/netfilter/nft_chain_nat_ipv6.c:87: undefined reference to `nft_register_chain_type'

that happens with:

CONFIG_NF_TABLES=m
CONFIG_NFT_CHAIN_NAT_IPV6=y

Fixes: 02c7b25e5f ("netfilter: nf_tables: build-in filter chain type")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-19 12:31:34 +02:00
Eric Dumazet
415787d779 ipv6: frags: fix a lockdep false positive
lockdep does not know that the locks used by IPv4 defrag
and IPv6 reassembly units are of different classes.

It complains because of following chains :

1) sch_direct_xmit()        (lock txq->_xmit_lock)
    dev_hard_start_xmit()
     xmit_one()
      dev_queue_xmit_nit()
       packet_rcv_fanout()
        ip_check_defrag()
         ip_defrag()
          spin_lock()     (lock frag queue spinlock)

2) ip6_input_finish()
    ipv6_frag_rcv()       (lock frag queue spinlock)
     ip6_frag_queue()
      icmpv6_param_prob() (lock txq->_xmit_lock at some point)

We could add lockdep annotations, but we also can make sure IPv6
calls icmpv6_param_prob() only after the release of the frag queue spinlock,
since this naturally makes frag queue spinlock a leaf in lock hierarchy.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-18 23:19:39 -04:00
Stephen Hemminger
0fe554a46a hv_netvsc: propogate Hyper-V friendly name into interface alias
This patch implement the 'Device Naming' feature of the Hyper-V
network device API. In Hyper-V on the host through the GUI or PowerShell
it is possible to enable the device naming feature which causes
the host to make available to the guest the name of the device.
This shows up in the RNDIS protocol as the friendly name.

The name has no particular meaning and is limited to 256 characters.
The value can only be set via PowerShell on the host, but could
be scripted for mass deployments. The default value is the
string 'Network Adapter' and since that is the same for all devices
and useless, the driver ignores it.

In Windows, the value goes into a registry key for use in SNMP
ifAlias. For Linux, this patch puts the value in the network
device alias property; where it is visible in ip tools and SNMP.

The host provided ifAlias is just a suggestion, and can be
overridden by later ip commands.

Also requires exporting dev_set_alias in netdev core.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-18 21:19:46 -04:00
Nikita V. Shirokov
587b80cce9 bpf: making bpf_prog_test run aware of possible data_end ptr change
after introduction of bpf_xdp_adjust_tail helper packet length
could be changed not only if xdp->data pointer has been changed
but xdp->data_end as well. making bpf_prog_test_run aware of this
possibility

Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-04-18 23:34:16 +02:00
Nikita V. Shirokov
198d83bb3b bpf: make generic xdp compatible w/ bpf_xdp_adjust_tail
w/ bpf_xdp_adjust_tail helper xdp's data_end pointer could be changed as
well (only "decrease" of pointer's location is going to be supported).
changing of this pointer will change packet's size.
for generic XDP we need to reflect this packet's length change by
adjusting skb's tail pointer

Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-04-18 23:34:16 +02:00
Nikita V. Shirokov
b32cc5b9a3 bpf: adding bpf_xdp_adjust_tail helper
Adding new bpf helper which would allow us to manipulate
xdp's data_end pointer, and allow us to reduce packet's size
indended use case: to generate ICMP messages from XDP context,
where such message would contain truncated original packet.

Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-04-18 23:34:16 +02:00
Tung Nguyen
36a50a989e tipc: fix infinite loop when dumping link monitor summary
When configuring the number of used bearers to MAX_BEARER and issuing
command "tipc link monitor summary", the command enters infinite loop
in user space.

This issue happens because function tipc_nl_node_dump_monitor() returns
the wrong 'prev_bearer' value when all potential monitors have been
scanned.

The correct behavior is to always try to scan all monitors until either
the netlink message is full, in which case we return the bearer identity
of the affected monitor, or we continue through the whole bearer array
until we can return MAX_BEARERS. This solution also caters for the case
where there may be gaps in the bearer array.

Signed-off-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-18 13:48:43 -04:00
Jon Maloy
be47e41d77 tipc: fix use-after-free in tipc_nametbl_stop
When we delete a service item in tipc_nametbl_stop() we loop over
all service ranges in the service's RB tree, and for each service
range we loop over its pertaining publications while calling
tipc_service_remove_publ() for each of them.

However, tipc_service_remove_publ() has the side effect that it also
removes the comprising service range item when there are no publications
left. This leads to a "use-after-free" access when the inner loop
continues to the next iteration, since the range item holding the list
we are looping no longer exists.

We fix this by moving the delete of the service range item outside
the said function. Instead, we now let the two functions calling it
test if the list is empty and perform the removal when that is the
case.

Reported-by: syzbot+d64b64afc55660106556@syzkaller.appspotmail.com
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-18 13:48:43 -04:00
Jacek Kalwas
cd027a5433 udp: enable UDP checksum offload for ESP
In case NIC has support for ESP TX CSUM offload skb->ip_summed is not
set to CHECKSUM_PARTIAL which results in checksum calculated by SW.

Fix enables ESP TX CSUM for UDP by extending condition with check for
NETIF_F_HW_ESP_TX_CSUM.

Signed-off-by: Jacek Kalwas <jacek.kalwas@intel.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-04-18 06:57:17 +02:00
David Ahern
77634cc67d net/ipv6: Remove unused code and variables for rt6_info
Drop unneeded elements from rt6_info struct and rearrange layout to
something more relevant for the data path.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 23:41:18 -04:00
David Ahern
8d1c802b28 net/ipv6: Flip FIB entries to fib6_info
Convert all code paths referencing a FIB entry from
rt6_info to fib6_info.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 23:41:18 -04:00
David Ahern
93531c6743 net/ipv6: separate handling of FIB entries from dst based routes
Last step before flipping the data type for FIB entries:
- use fib6_info_alloc to create FIB entries in ip6_route_info_create
  and addrconf_dst_alloc
- use fib6_info_release in place of dst_release, ip6_rt_put and
  rt6_release
- remove the dst_hold before calling __ip6_ins_rt or ip6_del_rt
- when purging routes, drop per-cpu routes
- replace inc and dec of rt6i_ref with fib6_info_hold and fib6_info_release
- use rt->from since it points to the FIB entry
- drop references to exception bucket, fib6_metrics and per-cpu from
  dst entries (those are relevant for fib entries only)

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 23:41:17 -04:00
David Ahern
a64efe142f net/ipv6: introduce fib6_info struct and helpers
Add fib6_info struct and alloc, destroy, hold and release helpers.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 23:41:17 -04:00
David Ahern
23fb93a4d3 net/ipv6: Cleanup exception and cache route handling
IPv6 FIB will only contain FIB entries with exception routes added to
the FIB entry. Once this transformation is complete, FIB lookups will
return a fib6_info with the lookup functions still returning a dst
based rt6_info. The current code uses rt6_info for both paths and
overloads the rt6_info variable usually called 'rt'.

This patch introduces a new 'f6i' variable name for the result of the FIB
lookup and keeps 'rt' as the dst based return variable. 'f6i' becomes a
fib6_info in a later patch which is why it is introduced as f6i now;
avoids the additional churn in the later patch.

In addition, remove RTF_CACHE and dst checks from fib6 add and delete
since they can not happen now and will never happen after the data
type flip.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 23:41:17 -04:00
David Ahern
acb54e3cba net/ipv6: Add gfp_flags to route add functions
Most FIB entries can be added using memory allocated with GFP_KERNEL.
Add gfp_flags to ip6_route_add and addrconf_dst_alloc. Code paths that
can be reached from the packet path (e.g., ndisc and autoconfig) or
atomic notifiers use GFP_ATOMIC; paths from user context (adding
addresses and routes) use GFP_KERNEL.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 23:41:17 -04:00
David Ahern
f8a1b43b70 net/ipv6: Create a neigh_lookup for FIB entries
The router discovery code has a FIB entry and wants to validate the
gateway has a neighbor entry. Refactor the existing dst_neigh_lookup
for IPv6 and create a new function that takes the gateway and device
and returns a neighbor entry. Use the new function in
ndisc_router_discovery to validate the gateway.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 23:41:17 -04:00
David Ahern
3b6761d18b net/ipv6: Move dst flags to booleans in fib entries
Continuing to wean FIB paths off of dst_entry, use a bool to hold
requests for certain dst settings. Add a helper to convert the
flags to DST flags when a FIB entry is converted to a dst_entry.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 23:41:17 -04:00
David Ahern
dec9b0e295 net/ipv6: Add rt6_info create function for ip6_pol_route_lookup
ip6_pol_route_lookup is the lookup function for ip6_route_lookup and
rt6_lookup. At the moment it returns either a reference to a FIB entry
or a cached exception. To move FIB entries to a separate struct, this
lookup function needs to convert FIB entries to an rt6_info that is
returned to the caller.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 23:41:17 -04:00
David Ahern
421842edea net/ipv6: Add fib6_null_entry
ip6_null_entry will stay a dst based return for lookups that fail to
match an entry.

Add a new fib6_null_entry which constitutes the root node and leafs
for fibs. Replace existing references to ip6_null_entry with the
new fib6_null_entry when dealing with FIBs.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 23:41:17 -04:00
David Ahern
14895687d3 net/ipv6: move expires into rt6_info
Add expires to rt6_info for FIB entries, and add fib6 helpers to
manage it. Data path use of dst.expires remains.

The transition is fairly straightforward: when working with fib entries,
rt->dst.expires is just rt->expires, rt6_clean_expires is replaced with
fib6_clean_expires, rt6_set_expires becomes fib6_set_expires, and
rt6_check_expired becomes fib6_check_expired, where the fib6 versions
are added by this patch.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 23:41:17 -04:00
David Ahern
d4ead6b34b net/ipv6: move metrics from dst to rt6_info
Similar to IPv4, add fib metrics to the fib struct, which at the moment
is rt6_info. Will be moved to fib6_info in a later patch. Copy metrics
into dst by reference using refcount.

To make the transition:
- add dst_metrics to rt6_info. Default to dst_default_metrics if no
  metrics are passed during route add. No need for a separate pmtu
  entry; it can reference the MTU slot in fib6_metrics

- ip6_convert_metrics allocates memory in the FIB entry and uses
  ip_metrics_convert to copy from netlink attribute to metrics entry

- the convert metrics call is done in ip6_route_info_create simplifying
  the route add path
  + fib6_commit_metrics and fib6_copy_metrics and the temporary
    mx6_config are no longer needed

- add fib6_metric_set helper to change the value of a metric in the
  fib entry since dst_metric_set can no longer be used

- cow_metrics for IPv6 can drop to dst_cow_metrics_generic

- rt6_dst_from_metrics_check is no longer needed

- rt6_fill_node needs the FIB entry and dst as separate arguments to
  keep compatibility with existing output. Current dst address is
  renamed to dest.
  (to be consistent with IPv4 rt6_fill_node really should be split
  into 2 functions similar to fib_dump_info and rt_fill_info)

- rt6_fill_node no longer needs the temporary metrics variable

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 23:41:16 -04:00
David Ahern
6edb3c96a5 net/ipv6: Defer initialization of dst to data path
Defer setting dst input, output and error until fib entry is copied.

The reject path from ip6_route_info_create is moved to a new function
ip6_rt_init_dst_reject with a helper doing the conversion from fib6_type
to dst error.

The remainder of the new ip6_rt_init_dst is an amalgamtion of dst code
from addrconf_dst_alloc and the non-reject path of ip6_route_info_create.
The dst output function is always ip6_output and the input function is
either ip6_input (local routes), ip6_mc_input (multicast routes) or
ip6_forward (anything else).

A couple of places using dst.error are updated to look at rt6i_flags.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 23:41:16 -04:00
David Ahern
5e670d844b net/ipv6: Move nexthop data to fib6_nh
Introduce fib6_nh structure and move nexthop related data from
rt6_info and rt6_info.dst to fib6_nh. References to dev, gateway or
lwtstate from a FIB lookup perspective are converted to use fib6_nh;
datapath references to dst version are left as is.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 23:41:16 -04:00
David Ahern
e8478e80e5 net/ipv6: Save route type in rt6_info
The RTN_ type for IPv6 FIB entries is currently embedded in rt6i_flags
and dst.error. Since dst is going to be removed, it can no longer be
relied on for FIB dumps so save the route type as fib6_type.

fc_type is set in current users based on the algorithm in rt6_fill_node:
  - rt6i_flags contains RTF_LOCAL: fc_type = RTN_LOCAL
  - rt6i_flags contains RTF_ANYCAST: fc_type = RTN_ANYCAST
  - else fc_type = RTN_UNICAST

Similarly, fib6_type is set in the rt6_info templates based on the
RTF_REJECT section of rt6_fill_node converting dst.error to RTN type.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 23:41:16 -04:00
David Ahern
ae90d867f9 net/ipv6: Move support functions up in route.c
Code move only.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 23:41:16 -04:00
David Ahern
afb1d4b593 net/ipv6: Pass net namespace to route functions
Pass network namespace reference into route add, delete and get
functions.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 23:41:16 -04:00
David Ahern
7aef6859ee net/ipv6: Pass net to fib6_update_sernum
Pass net namespace to fib6_update_sernum. It can not be marked const
as fib6_new_sernum will change ipv6.fib6_sernum.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 23:41:16 -04:00
David Ahern
3940746d86 net: Handle null dst in rtnl_put_cacheinfo
Need to keep expires time for IPv6 routes in a dump of FIB entries.
Update rtnl_put_cacheinfo to allow dst to be NULL in which case
rta_cacheinfo will only contain non-dst data.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 23:41:15 -04:00
David Ahern
a919525ad8 net: Move fib_convert_metrics to metrics file
Move logic of fib_convert_metrics into ip_metrics_convert. This allows
the code that converts netlink attributes into metrics struct to be
re-used in a later patch by IPv6.

This is mostly a code move with the following changes to variable names:
  - fi->fib_net becomes net
  - fc_mx and fc_mx_len are passed as inputs pulled from fib_config
  - metrics array is passed as an input from fi->fib_metrics->metrics

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 23:41:15 -04:00
Eric Biggers
9c438d7a3a KEYS: DNS: limit the length of option strings
Adding a dns_resolver key whose payload contains a very long option name
resulted in that string being printed in full.  This hit the WARN_ONCE()
in set_precision() during the printk(), because printk() only supports a
precision of up to 32767 bytes:

    precision 1000000 too large
    WARNING: CPU: 0 PID: 752 at lib/vsprintf.c:2189 vsnprintf+0x4bc/0x5b0

Fix it by limiting option strings (combined name + value) to a much more
reasonable 128 bytes.  The exact limit is arbitrary, but currently the
only recognized option is formatted as "dnserror=%lu" which fits well
within this limit.

Also ratelimit the printks.

Reproducer:

    perl -e 'print "#", "A" x 1000000, "\x00"' | keyctl padd dns_resolver desc @s

This bug was found using syzkaller.

Reported-by: Mark Rutland <mark.rutland@arm.com>
Fixes: 4a2d789267 ("DNS: If the DNS server returns an error, allow that to be cached [ver #2]")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 15:17:41 -04:00
Lorenzo Bianconi
a2d481b326 ipv6: send netlink notifications for manually configured addresses
Send a netlink notification when userspace adds a manually configured
address if DAD is enabled and optimistic flag isn't set.
Moreover send RTM_DELADDR notifications for tentative addresses.

Some userspace applications (e.g. NetworkManager) are interested in
addr netlink events albeit the address is still in tentative state,
however events are not sent if DAD process is not completed.
If the address is added and immediately removed userspace listeners
are not notified. This behaviour can be easily reproduced by using
veth interfaces:

$ ip -b - <<EOF
> link add dev vm1 type veth peer name vm2
> link set dev vm1 up
> link set dev vm2 up
> addr add 2001:db8:a🅱️1:2:3:4/64 dev vm1
> addr del 2001:db8:a🅱️1:2:3:4/64 dev vm1
EOF

This patch reverts the behaviour introduced by the commit f784ad3d79
("ipv6: do not send RTM_DELADDR for tentative addresses")

Suggested-by: Thomas Haller <thaller@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 14:03:56 -04:00
Toshiaki Makita
7ce2367254 vlan: Fix reading memory beyond skb->tail in skb_vlan_tagged_multi
Syzkaller spotted an old bug which leads to reading skb beyond tail by 4
bytes on vlan tagged packets.
This is caused because skb_vlan_tagged_multi() did not check
skb_headlen.

BUG: KMSAN: uninit-value in eth_type_vlan include/linux/if_vlan.h:283 [inline]
BUG: KMSAN: uninit-value in skb_vlan_tagged_multi include/linux/if_vlan.h:656 [inline]
BUG: KMSAN: uninit-value in vlan_features_check include/linux/if_vlan.h:672 [inline]
BUG: KMSAN: uninit-value in dflt_features_check net/core/dev.c:2949 [inline]
BUG: KMSAN: uninit-value in netif_skb_features+0xd1b/0xdc0 net/core/dev.c:3009
CPU: 1 PID: 3582 Comm: syzkaller435149 Not tainted 4.16.0+ #82
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:17 [inline]
  dump_stack+0x185/0x1d0 lib/dump_stack.c:53
  kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
  __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
  eth_type_vlan include/linux/if_vlan.h:283 [inline]
  skb_vlan_tagged_multi include/linux/if_vlan.h:656 [inline]
  vlan_features_check include/linux/if_vlan.h:672 [inline]
  dflt_features_check net/core/dev.c:2949 [inline]
  netif_skb_features+0xd1b/0xdc0 net/core/dev.c:3009
  validate_xmit_skb+0x89/0x1320 net/core/dev.c:3084
  __dev_queue_xmit+0x1cb2/0x2b60 net/core/dev.c:3549
  dev_queue_xmit+0x4b/0x60 net/core/dev.c:3590
  packet_snd net/packet/af_packet.c:2944 [inline]
  packet_sendmsg+0x7c57/0x8a10 net/packet/af_packet.c:2969
  sock_sendmsg_nosec net/socket.c:630 [inline]
  sock_sendmsg net/socket.c:640 [inline]
  sock_write_iter+0x3b9/0x470 net/socket.c:909
  do_iter_readv_writev+0x7bb/0x970 include/linux/fs.h:1776
  do_iter_write+0x30d/0xd40 fs/read_write.c:932
  vfs_writev fs/read_write.c:977 [inline]
  do_writev+0x3c9/0x830 fs/read_write.c:1012
  SYSC_writev+0x9b/0xb0 fs/read_write.c:1085
  SyS_writev+0x56/0x80 fs/read_write.c:1082
  do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x43ffa9
RSP: 002b:00007fff2cff3948 EFLAGS: 00000217 ORIG_RAX: 0000000000000014
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 000000000043ffa9
RDX: 0000000000000001 RSI: 0000000020000080 RDI: 0000000000000003
RBP: 00000000006cb018 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000217 R12: 00000000004018d0
R13: 0000000000401960 R14: 0000000000000000 R15: 0000000000000000

Uninit was created at:
  kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
  kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188
  kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314
  kmsan_slab_alloc+0x11/0x20 mm/kmsan/kmsan.c:321
  slab_post_alloc_hook mm/slab.h:445 [inline]
  slab_alloc_node mm/slub.c:2737 [inline]
  __kmalloc_node_track_caller+0xaed/0x11c0 mm/slub.c:4369
  __kmalloc_reserve net/core/skbuff.c:138 [inline]
  __alloc_skb+0x2cf/0x9f0 net/core/skbuff.c:206
  alloc_skb include/linux/skbuff.h:984 [inline]
  alloc_skb_with_frags+0x1d4/0xb20 net/core/skbuff.c:5234
  sock_alloc_send_pskb+0xb56/0x1190 net/core/sock.c:2085
  packet_alloc_skb net/packet/af_packet.c:2803 [inline]
  packet_snd net/packet/af_packet.c:2894 [inline]
  packet_sendmsg+0x6444/0x8a10 net/packet/af_packet.c:2969
  sock_sendmsg_nosec net/socket.c:630 [inline]
  sock_sendmsg net/socket.c:640 [inline]
  sock_write_iter+0x3b9/0x470 net/socket.c:909
  do_iter_readv_writev+0x7bb/0x970 include/linux/fs.h:1776
  do_iter_write+0x30d/0xd40 fs/read_write.c:932
  vfs_writev fs/read_write.c:977 [inline]
  do_writev+0x3c9/0x830 fs/read_write.c:1012
  SYSC_writev+0x9b/0xb0 fs/read_write.c:1085
  SyS_writev+0x56/0x80 fs/read_write.c:1082
  do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Fixes: 58e998c6d2 ("offloading: Force software GSO for multiple vlan tags.")
Reported-and-tested-by: syzbot+0bbe42c764feafa82c5a@syzkaller.appspotmail.com
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 13:59:28 -04:00
Samuel Mendoza-Jonas
062b3e1b6d net/ncsi: Refactor MAC, VLAN filters
The NCSI driver defines a generic ncsi_channel_filter struct that can be
used to store arbitrarily formatted filters, and several generic methods
of accessing data stored in such a filter.
However in both the driver and as defined in the NCSI specification
there are only two actual filters: VLAN ID filters and MAC address
filters. The splitting of the MAC filter into unicast, multicast, and
mixed is also technically not necessary as these are stored in the same
location in hardware.

To save complexity, particularly in the set up and accessing of these
generic filters, remove them in favour of two specific structs. These
can be acted on directly and do not need several generic helper
functions to use.

This also fixes a memory error found by KASAN on ARM32 (which is not
upstream yet), where response handlers accessing a filter's data field
could write past allocated memory.

[  114.926512] ==================================================================
[  114.933861] BUG: KASAN: slab-out-of-bounds in ncsi_configure_channel+0x4b8/0xc58
[  114.941304] Read of size 2 at addr 94888558 by task kworker/0:2/546
[  114.947593]
[  114.949146] CPU: 0 PID: 546 Comm: kworker/0:2 Not tainted 4.16.0-rc6-00119-ge156398bfcad #13
...
[  115.170233] The buggy address belongs to the object at 94888540
[  115.170233]  which belongs to the cache kmalloc-32 of size 32
[  115.181917] The buggy address is located 24 bytes inside of
[  115.181917]  32-byte region [94888540, 94888560)
[  115.192115] The buggy address belongs to the page:
[  115.196943] page:9eeac100 count:1 mapcount:0 mapping:94888000 index:0x94888fc1
[  115.204200] flags: 0x100(slab)
[  115.207330] raw: 00000100 94888000 94888fc1 0000003f 00000001 9eea2014 9eecaa74 96c003e0
[  115.215444] page dumped because: kasan: bad access detected
[  115.221036]
[  115.222544] Memory state around the buggy address:
[  115.227384]  94888400: fb fb fb fb fc fc fc fc 04 fc fc fc fc fc fc fc
[  115.233959]  94888480: 00 00 00 fc fc fc fc fc 00 04 fc fc fc fc fc fc
[  115.240529] >94888500: 00 00 04 fc fc fc fc fc 00 00 04 fc fc fc fc fc
[  115.247077]                                             ^
[  115.252523]  94888580: 00 04 fc fc fc fc fc fc 06 fc fc fc fc fc fc fc
[  115.259093]  94888600: 00 00 06 fc fc fc fc fc 00 00 04 fc fc fc fc fc
[  115.265639] ==================================================================

Reported-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 13:50:58 -04:00
Eric Biggers
c210f7b411 KEYS: DNS: limit the length of option strings
Adding a dns_resolver key whose payload contains a very long option name
resulted in that string being printed in full.  This hit the WARN_ONCE()
in set_precision() during the printk(), because printk() only supports a
precision of up to 32767 bytes:

    precision 1000000 too large
    WARNING: CPU: 0 PID: 752 at lib/vsprintf.c:2189 vsnprintf+0x4bc/0x5b0

Fix it by limiting option strings (combined name + value) to a much more
reasonable 128 bytes.  The exact limit is arbitrary, but currently the
only recognized option is formatted as "dnserror=%lu" which fits well
within this limit.

Also ratelimit the printks.

Reproducer:

    perl -e 'print "#", "A" x 1000000, "\x00"' | keyctl padd dns_resolver desc @s

This bug was found using syzkaller.

Reported-by: Mark Rutland <mark.rutland@arm.com>
Fixes: 4a2d789267 ("DNS: If the DNS server returns an error, allow that to be cached [ver #2]")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 13:42:58 -04:00
Stephen Suryaputra
bdb7cc643f ipv6: Count interface receive statistics on the ingress netdev
The statistics such as InHdrErrors should be counted on the ingress
netdev rather than on the dev from the dst, which is the egress.

Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 13:39:51 -04:00
David Ahern
032234d823 net/ipv6: Make __inet6_bind static
BPF core gets access to __inet6_bind via ipv6_bpf_stub_impl, so it is
not invoked directly outside of af_inet6.c. Make it static and move
inet6_bind after to avoid forward declaration.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 13:19:22 -04:00
Jesper Dangaard Brouer
6dfb970d3d xdp: avoid leaking info stored in frame data on page reuse
The bpf infrastructure and verifier goes to great length to avoid
bpf progs leaking kernel (pointer) info.

For queueing an xdp_buff via XDP_REDIRECT, xdp_frame info stores
kernel info (incl pointers) in top part of frame data (xdp->data_hard_start).
Checks are in place to assure enough headroom is available for this.

This info is not cleared, and if the frame is reused, then a
malicious user could use bpf_xdp_adjust_head helper to move
xdp->data into this area.  Thus, making this area readable.

This is not super critical as XDP progs requires root or
CAP_SYS_ADMIN, which are privileged enough for such info.  An
effort (is underway) towards moving networking bpf hooks to the
lesser privileged mode CAP_NET_ADMIN, where leaking such info
should be avoided.  Thus, this patch to clear the info when
needed.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 10:50:30 -04:00
Jesper Dangaard Brouer
44fa2dbd47 xdp: transition into using xdp_frame for ndo_xdp_xmit
Changing API ndo_xdp_xmit to take a struct xdp_frame instead of struct
xdp_buff.  This brings xdp_return_frame and ndp_xdp_xmit in sync.

This builds towards changing the API further to become a bulk API,
because xdp_buff is not a queue-able object while xdp_frame is.

V4: Adjust for commit 59655a5b6c ("tuntap: XDP_TX can use native XDP")
V7: Adjust for commit d9314c474d ("i40e: add support for XDP_REDIRECT")

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 10:50:30 -04:00
Jesper Dangaard Brouer
039930945a xdp: transition into using xdp_frame for return API
Changing API xdp_return_frame() to take struct xdp_frame as argument,
seems like a natural choice. But there are some subtle performance
details here that needs extra care, which is a deliberate choice.

When de-referencing xdp_frame on a remote CPU during DMA-TX
completion, result in the cache-line is change to "Shared"
state. Later when the page is reused for RX, then this xdp_frame
cache-line is written, which change the state to "Modified".

This situation already happens (naturally) for, virtio_net, tun and
cpumap as the xdp_frame pointer is the queued object.  In tun and
cpumap, the ptr_ring is used for efficiently transferring cache-lines
(with pointers) between CPUs. Thus, the only option is to
de-referencing xdp_frame.

It is only the ixgbe driver that had an optimization, in which it can
avoid doing the de-reference of xdp_frame.  The driver already have
TX-ring queue, which (in case of remote DMA-TX completion) have to be
transferred between CPUs anyhow.  In this data area, we stored a
struct xdp_mem_info and a data pointer, which allowed us to avoid
de-referencing xdp_frame.

To compensate for this, a prefetchw is used for telling the cache
coherency protocol about our access pattern.  My benchmarks show that
this prefetchw is enough to compensate the ixgbe driver.

V7: Adjust for commit d9314c474d ("i40e: add support for XDP_REDIRECT")
V8: Adjust for commit bd658dda42 ("net/mlx5e: Separate dma base address
and offset in dma_sync call")

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 10:50:29 -04:00
Jesper Dangaard Brouer
57d0a1c1ac xdp: allow page_pool as an allocator type in xdp_return_frame
New allocator type MEM_TYPE_PAGE_POOL for page_pool usage.

The registered allocator page_pool pointer is not available directly
from xdp_rxq_info, but it could be (if needed).  For now, the driver
should keep separate track of the page_pool pointer, which it should
use for RX-ring page allocation.

As suggested by Saeed, to maintain a symmetric API it is the drivers
responsibility to allocate/create and free/destroy the page_pool.
Thus, after the driver have called xdp_rxq_info_unreg(), it is drivers
responsibility to free the page_pool, but with a RCU free call.  This
is done easily via the page_pool helper page_pool_destroy() (which
avoids touching any driver code during the RCU callback, which could
happen after the driver have been unloaded).

V8: address issues found by kbuild test robot
 - Address sparse should be static warnings
 - Allow xdp.o to be compiled without page_pool.o

V9: Remove inline from .c file, compiler knows best

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 10:50:29 -04:00
Jesper Dangaard Brouer
ff7d6b27f8 page_pool: refurbish version of page_pool code
Need a fast page recycle mechanism for ndo_xdp_xmit API for returning
pages on DMA-TX completion time, which have good cross CPU
performance, given DMA-TX completion time can happen on a remote CPU.

Refurbish my page_pool code, that was presented[1] at MM-summit 2016.
Adapted page_pool code to not depend the page allocator and
integration into struct page.  The DMA mapping feature is kept,
even-though it will not be activated/used in this patchset.

[1] http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf

V2: Adjustments requested by Tariq
 - Changed page_pool_create return codes, don't return NULL, only
   ERR_PTR, as this simplifies err handling in drivers.

V4: many small improvements and cleanups
- Add DOC comment section, that can be used by kernel-doc
- Improve fallback mode, to work better with refcnt based recycling
  e.g. remove a WARN as pointed out by Tariq
  e.g. quicker fallback if ptr_ring is empty.

V5: Fixed SPDX license as pointed out by Alexei

V6: Adjustments requested by Eric Dumazet
 - Adjust ____cacheline_aligned_in_smp usage/placement
 - Move rcu_head in struct page_pool
 - Free pages quicker on destroy, minimize resources delayed an RCU period
 - Remove code for forward/backward compat ABI interface

V8: Issues found by kbuild test robot
 - Address sparse should be static warnings
 - Only compile+link when a driver use/select page_pool,
   mlx5 selects CONFIG_PAGE_POOL, although its first used in two patches

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 10:50:29 -04:00
Jesper Dangaard Brouer
8d5d885275 xdp: rhashtable with allocator ID to pointer mapping
Use the IDA infrastructure for getting a cyclic increasing ID number,
that is used for keeping track of each registered allocator per
RX-queue xdp_rxq_info.  Instead of using the IDR infrastructure, which
uses a radix tree, use a dynamic rhashtable, for creating ID to
pointer lookup table, because this is faster.

The problem that is being solved here is that, the xdp_rxq_info
pointer (stored in xdp_buff) cannot be used directly, as the
guaranteed lifetime is too short.  The info is needed on a
(potentially) remote CPU during DMA-TX completion time . In an
xdp_frame the xdp_mem_info is stored, when it got converted from an
xdp_buff, which is sufficient for the simple page refcnt based recycle
schemes.

For more advanced allocators there is a need to store a pointer to the
registered allocator.  Thus, there is a need to guard the lifetime or
validity of the allocator pointer, which is done through this
rhashtable ID map to pointer. The removal and validity of of the
allocator and helper struct xdp_mem_allocator is guarded by RCU.  The
allocator will be created by the driver, and registered with
xdp_rxq_info_reg_mem_model().

It is up-to debate who is responsible for freeing the allocator
pointer or invoking the allocator destructor function.  In any case,
this must happen via RCU freeing.

Use the IDA infrastructure for getting a cyclic increasing ID number,
that is used for keeping track of each registered allocator per
RX-queue xdp_rxq_info.

V4: Per req of Jason Wang
- Use xdp_rxq_info_reg_mem_model() in all drivers implementing
  XDP_REDIRECT, even-though it's not strictly necessary when
  allocator==NULL for type MEM_TYPE_PAGE_SHARED (given it's zero).

V6: Per req of Alex Duyck
- Introduce rhashtable_lookup() call in later patch

V8: Address sparse should be static warnings (from kbuild test robot)

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 10:50:29 -04:00
Jesper Dangaard Brouer
5ab073ffd3 xdp: introduce xdp_return_frame API and use in cpumap
Introduce an xdp_return_frame API, and convert over cpumap as
the first user, given it have queued XDP frame structure to leverage.

V3: Cleanup and remove C99 style comments, pointed out by Alex Duyck.
V6: Remove comment that id will be added later (Req by Alex Duyck)
V8: Rename enum mem_type to xdp_mem_type (found by kbuild test robot)

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 10:50:27 -04:00
Nicolas Dechesne
77ac725e0c net: qrtr: add MODULE_ALIAS_NETPROTO macro
To ensure that qrtr can be loaded automatically, when needed, if it is compiled
as module.

Signed-off-by: Nicolas Dechesne <nicolas.dechesne@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 09:58:00 -04:00
Stefan Hajnoczi
05e489b159 VSOCK: make af_vsock.ko removable again
Commit c1eef220c1 ("vsock: always call
vsock_init_tables()") introduced a module_init() function without a
corresponding module_exit() function.

Modules with an init function can only be removed if they also have an
exit function.  Therefore the vsock module was considered "permanent"
and could not be removed.

This patch adds an empty module_exit() function so that "rmmod vsock"
works.  No explicit cleanup is required because:

1. Transports call vsock_core_exit() upon exit and cannot be removed
   while sockets are still alive.
2. vsock_diag.ko does not perform any action that requires cleanup by
   vsock.ko.

Fixes: c1eef220c1 ("vsock: always call vsock_init_tables()")
Reported-by: Xiumei Mu <xmu@redhat.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 09:44:30 -04:00
Stephen Rothwell
765cca91b8 netfilter: conntrack: include kmemleak.h for kmemleak_not_leak()
After merging the netfilter tree, today's linux-next build (powerpc
ppc64_defconfig) failed like this:

net/netfilter/nf_conntrack_extend.c: In function 'nf_ct_ext_add':
net/netfilter/nf_conntrack_extend.c:74:2: error: implicit declaration of function 'kmemleak_not_leak' [-Werror=implicit-function-declaration]
  kmemleak_not_leak(old);
  ^~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors

Fixes: 114aa35d06 ("netfilter: conntrack: silent a memory leak warning")
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-17 10:59:43 +02:00
Eric Dumazet
93ab6cc691 tcp: implement mmap() for zero copy receive
Some networks can make sure TCP payload can exactly fit 4KB pages,
with well chosen MSS/MTU and architectures.

Implement mmap() system call so that applications can avoid
copying data without complex splice() games.

Note that a successful mmap( X bytes) on TCP socket is consuming
bytes, as if recvmsg() has been done. (tp->copied += X)

Only PROT_READ mappings are accepted, as skb page frags
are fundamentally shared and read only.

If tcp_mmap() finds data that is not a full page, or a patch of
urgent data, -EINVAL is returned, no bytes are consumed.

Application must fallback to recvmsg() to read the problematic sequence.

mmap() wont block,  regardless of socket being in blocking or
non-blocking mode. If not enough bytes are in receive queue,
mmap() would return -EAGAIN, or -EIO if socket is in a state
where no other bytes can be added into receive queue.

An application might use SO_RCVLOWAT, poll() and/or ioctl( FIONREAD)
to efficiently use mmap()

On the sender side, MSG_EOR might help to clearly separate unaligned
headers and 4K-aligned chunks if necessary.

Tested:

mlx4 (cx-3) 40Gbit NIC, with tcp_mmap program provided in following patch.
MTU set to 4168  (4096 TCP payload, 40 bytes IPv6 header, 32 bytes TCP header)

Without mmap() (tcp_mmap -s)

received 32768 MB (0 % mmap'ed) in 8.13342 s, 33.7961 Gbit,
  cpu usage user:0.034 sys:3.778, 116.333 usec per MB, 63062 c-switches
received 32768 MB (0 % mmap'ed) in 8.14501 s, 33.748 Gbit,
  cpu usage user:0.029 sys:3.997, 122.864 usec per MB, 61903 c-switches
received 32768 MB (0 % mmap'ed) in 8.11723 s, 33.8635 Gbit,
  cpu usage user:0.048 sys:3.964, 122.437 usec per MB, 62983 c-switches
received 32768 MB (0 % mmap'ed) in 8.39189 s, 32.7552 Gbit,
  cpu usage user:0.038 sys:4.181, 128.754 usec per MB, 55834 c-switches

With mmap() on receiver (tcp_mmap -s -z)

received 32768 MB (100 % mmap'ed) in 8.03083 s, 34.2278 Gbit,
  cpu usage user:0.024 sys:1.466, 45.4712 usec per MB, 65479 c-switches
received 32768 MB (100 % mmap'ed) in 7.98805 s, 34.4111 Gbit,
  cpu usage user:0.026 sys:1.401, 43.5486 usec per MB, 65447 c-switches
received 32768 MB (100 % mmap'ed) in 7.98377 s, 34.4296 Gbit,
  cpu usage user:0.028 sys:1.452, 45.166 usec per MB, 65496 c-switches
received 32768 MB (99.9969 % mmap'ed) in 8.01838 s, 34.281 Gbit,
  cpu usage user:0.02 sys:1.446, 44.7388 usec per MB, 65505 c-switches

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-16 18:26:37 -04:00
Eric Dumazet
03f45c883c tcp: avoid extra wakeups for SO_RCVLOWAT users
SO_RCVLOWAT is properly handled in tcp_poll(), so that POLLIN is only
generated when enough bytes are available in receive queue, after
David change (commit c7004482e8 "tcp: Respect SO_RCVLOWAT in tcp_poll().")

But TCP still calls sk->sk_data_ready() for each chunk added in receive
queue, meaning thread is awaken, and goes back to sleep shortly after.

Tested:

tcp_mmap test program, receiving 32768 MB of data with SO_RCVLOWAT set to 512KB

-> Should get ~2 wakeups (c-switches) per MB, regardless of how many
(tiny or big) packets were received.

High speed (mostly full size GRO packets)

received 32768 MB (100 % mmap'ed) in 8.03112 s, 34.2266 Gbit,
  cpu usage user:0.037 sys:1.404, 43.9758 usec per MB, 65497 c-switches

received 32768 MB (99.9954 % mmap'ed) in 7.98453 s, 34.4263 Gbit,
  cpu usage user:0.03 sys:1.422, 44.3115 usec per MB, 65485 c-switches

Low speed (sender is ratelimited and sends 1-MSS at a time, so GRO is not helping)

received 22474.5 MB (100 % mmap'ed) in 6015.35 s, 0.0313414 Gbit,
  cpu usage user:0.05 sys:1.586, 72.7952 usec per MB, 44950 c-switches

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-16 18:26:37 -04:00
Eric Dumazet
796f82eafc tcp: fix delayed acks behavior for SO_RCVLOWAT
We should not delay acks if there are not enough bytes
in receive queue to satisfy SO_RCVLOWAT.

Since [E]POLLIN event is not going to be generated, there is little
hope for a delayed ack to be useful.

In fact, delaying ACK prevents sender from completing
the transfer.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-16 18:26:37 -04:00
Eric Dumazet
d1361840f8 tcp: fix SO_RCVLOWAT and RCVBUF autotuning
Applications might use SO_RCVLOWAT on TCP socket hoping to receive
one [E]POLLIN event only when a given amount of bytes are ready in socket
receive queue.

Problem is that receive autotuning is not aware of this constraint,
meaning sk_rcvbuf might be too small to allow all bytes to be stored.

Add a new (struct proto_ops)->set_rcvlowat method so that a protocol
can override the default setsockopt(SO_RCVLOWAT) behavior.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-16 18:26:37 -04:00
Lorenzo Bianconi
f85f94b871 ipv6: remove unnecessary check in addrconf_prefix_rcv_add_addr()
Remove unnecessary check on update_lft variable in
addrconf_prefix_rcv_add_addr routine since it is always set to 0.
Moreover remove update_lft re-initialization to 0

Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-16 18:16:16 -04:00
Eric Dumazet
c6404122cb tipc: fix possible crash in __tipc_nl_net_set()
syzbot reported a crash in __tipc_nl_net_set() caused by NULL dereference.

We need to check that both TIPC_NLA_NET_NODEID and TIPC_NLA_NET_NODEID_W1
are present.

We also need to make sure userland provided u64 attributes.

Fixes: d50ccc2d39 ("tipc: add 128-bit node identifier")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jon Maloy <jon.maloy@ericsson.com>
Cc: Ying Xue <ying.xue@windriver.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-16 18:08:18 -04:00
Eric Dumazet
ec518f21cb tipc: add policy for TIPC_NLA_NET_ADDR
Before syzbot/KMSAN bites, add the missing policy for TIPC_NLA_NET_ADDR

Fixes: 27c2141672 ("tipc: add net set to new netlink api")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jon Maloy <jon.maloy@ericsson.com>
Cc: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-16 18:08:18 -04:00
Gao Feng
9783ccd0f2 net: Fix one possible memleak in ip_setup_cork
It would allocate memory in this function when the cork->opt is NULL. But
the memory isn't freed if failed in the latter rt check, and return error
directly. It causes the memleak if its caller is ip_make_skb which also
doesn't free the cork->opt when meet a error.

Now move the rt check ahead to avoid the memleak.

Signed-off-by: Gao Feng <gfree.wind@vip.163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-16 12:57:06 -04:00
Florian Westphal
2f6adf4815 netfilter: nf_tables: free set name in error path
set->name must be free'd here in case ops->init fails.

Fixes: 387454901b ("netfilter: nf_tables: Allow set names of up to 255 chars")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-16 17:47:27 +02:00
Florian Westphal
569ccae68b netfilter: nf_tables: can't fail after linking rule into active rule list
rules in nftables a free'd using kfree, but protected by rcu, i.e. we
must wait for a grace period to elapse.

Normal removal patch does this, but nf_tables_newrule() doesn't obey
this rule during error handling.

It calls nft_trans_rule_add() *after* linking rule, and, if that
fails to allocate memory, it unlinks the rule and then kfree() it --
this is unsafe.

Switch order -- first add rule to transaction list, THEN link it
to public list.

Note: nft_trans_rule_add() uses GFP_KERNEL; it will not fail so this
is not a problem in practice (spotted only during code review).

Fixes: 0628b123c9 ("netfilter: nfnetlink: add batch support and use it from nf_tables")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-16 17:47:26 +02:00
Arnd Bergmann
a661574370 netfilter: fix CONFIG_NF_REJECT_IPV6=m link error
We get a new link error with CONFIG_NFT_REJECT_INET=y and CONFIG_NF_REJECT_IPV6=m
after larger parts of the nftables modules are linked together:

net/netfilter/nft_reject_inet.o: In function `nft_reject_inet_eval':
nft_reject_inet.c:(.text+0x17c): undefined reference to `nf_send_unreach6'
nft_reject_inet.c:(.text+0x190): undefined reference to `nf_send_reset6'

The problem is that with NF_TABLES_INET set, we implicitly try to use
the ipv6 version as well for NFT_REJECT, but when CONFIG_IPV6 is set to
a loadable module, it's impossible to reach that.

The best workaround I found is to express the above as a Kconfig
dependency, forcing NFT_REJECT itself to be 'm' in that particular
configuration.

Fixes: 02c7b25e5f ("netfilter: nf_tables: build-in filter chain type")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-16 17:47:25 +02:00
Cong Wang
114aa35d06 netfilter: conntrack: silent a memory leak warning
The following memory leak is false postive:

unreferenced object 0xffff8f37f156fb38 (size 128):
  comm "softirq", pid 0, jiffies 4294899665 (age 11.292s)
  hex dump (first 32 bytes):
    6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
    00 00 00 00 30 00 20 00 48 6b 6b 6b 6b 6b 6b 6b  ....0. .Hkkkkkkk
  backtrace:
    [<000000004fda266a>] __kmalloc_track_caller+0x10d/0x141
    [<000000007b0a7e3c>] __krealloc+0x45/0x62
    [<00000000d08e0bfb>] nf_ct_ext_add+0xdc/0x133
    [<0000000099b47fd8>] init_conntrack+0x1b1/0x392
    [<0000000086dc36ec>] nf_conntrack_in+0x1ee/0x34b
    [<00000000940592de>] nf_hook_slow+0x36/0x95
    [<00000000d1bd4da7>] nf_hook.constprop.43+0x1c3/0x1dd
    [<00000000c3673266>] __ip_local_out+0xae/0xb4
    [<000000003e4192a6>] ip_local_out+0x17/0x33
    [<00000000b64356de>] igmp_ifc_timer_expire+0x23e/0x26f
    [<000000006a8f3032>] call_timer_fn+0x14c/0x2a5
    [<00000000650c1725>] __run_timers.part.34+0x150/0x182
    [<0000000090e6946e>] run_timer_softirq+0x2a/0x4c
    [<000000004d1e7293>] __do_softirq+0x1d1/0x3c2
    [<000000004643557d>] irq_exit+0x53/0xa2
    [<0000000029ddee8f>] smp_apic_timer_interrupt+0x22a/0x235

because __krealloc() is not supposed to release the old
memory and it is released later via kfree_rcu(). Since this is
the only external user of __krealloc(), just mark it as not leak
here.

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-16 17:47:16 +02:00
Eric Dumazet
5171b37d95 net: af_packet: fix race in PACKET_{R|T}X_RING
In order to remove the race caught by syzbot [1], we need
to lock the socket before using po->tp_version as this could
change under us otherwise.

This means lock_sock() and release_sock() must be done by
packet_set_ring() callers.

[1] :
BUG: KMSAN: uninit-value in packet_set_ring+0x1254/0x3870 net/packet/af_packet.c:4249
CPU: 0 PID: 20195 Comm: syzkaller707632 Not tainted 4.16.0+ #83
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x185/0x1d0 lib/dump_stack.c:53
 kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
 __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
 packet_set_ring+0x1254/0x3870 net/packet/af_packet.c:4249
 packet_setsockopt+0x12c6/0x5a90 net/packet/af_packet.c:3662
 SYSC_setsockopt+0x4b8/0x570 net/socket.c:1849
 SyS_setsockopt+0x76/0xa0 net/socket.c:1828
 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x449099
RSP: 002b:00007f42b5307ce8 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
RAX: ffffffffffffffda RBX: 000000000070003c RCX: 0000000000449099
RDX: 0000000000000005 RSI: 0000000000000107 RDI: 0000000000000003
RBP: 0000000000700038 R08: 000000000000001c R09: 0000000000000000
R10: 00000000200000c0 R11: 0000000000000246 R12: 0000000000000000
R13: 000000000080eecf R14: 00007f42b53089c0 R15: 0000000000000001

Local variable description: ----req_u@packet_setsockopt
Variable was created at:
 packet_setsockopt+0x13f/0x5a90 net/packet/af_packet.c:3612
 SYSC_setsockopt+0x4b8/0x570 net/socket.c:1849

Fixes: f6fb8f100b ("af-packet: TPACKET_V3 flexible buffer implementation.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-16 11:38:43 -04:00
Soheil Hassas Yeganeh
bffd168c3f tcp: clear tp->packets_out when purging write queue
Clear tp->packets_out when purging the write queue, otherwise
tcp_rearm_rto() mistakenly assumes TCP write queue is not empty.
This results in NULL pointer dereference.

Also, remove the redundant `tp->packets_out = 0` from
tcp_disconnect(), since tcp_disconnect() calls
tcp_write_queue_purge().

Fixes: a27fd7a8ed (tcp: purge write queue upon RST)
Reported-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Reported-by: Sami Farin <hvtaifwkbgefbaei@gmail.com>
Tested-by: Sami Farin <hvtaifwkbgefbaei@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-16 11:23:42 -04:00
Steffen Klassert
b48c05ab5d xfrm: Fix warning in xfrm6_tunnel_net_exit.
We need to make sure that all states are really deleted
before we check that the state lists are empty. Otherwise
we trigger a warning.

Fixes: baeb0dbbb5 ("xfrm6_tunnel: exit_net cleanup check added")
Reported-and-tested-by:syzbot+777bf170a89e7b326405@syzkaller.appspotmail.com
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-04-16 07:50:09 +02:00
Al Viro
4a3877c4ce rpc_pipefs: fix double-dput()
if we ever hit rpc_gssd_dummy_depopulate() dentry passed to
it has refcount equal to 1.  __rpc_rmpipe() drops it and
dput() done after that hits an already freed dentry.

Cc: stable@kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-04-15 23:49:27 -04:00
Linus Torvalds
ca71b3ba4c Kbuild updates for v4.17 (2nd)
- pass HOSTLDFLAGS when compiling single .c host programs
 
 - build genksyms lexer and parser files instead of using shipped
   versions
 
 - rename *-asn1.[ch] to *.asn1.[ch] for suffix consistency
 
 - let the top .gitignore globally ignore artifacts generated by
   flex, bison, and asn1_compiler
 
 - let the top Makefile globally clean artifacts generated by
   flex, bison, and asn1_compiler
 
 - use safer .SECONDARY marker instead of .PRECIOUS to prevent
   intermediate files from being removed
 
 - support -fmacro-prefix-map option to make __FILE__ a relative path
 
 - fix # escaping to prepare for the future GNU Make release
 
 - clean up deb-pkg by using debian tools instead of handrolled
   source/changes generation
 
 - improve rpm-pkg portability by supporting kernel-install as a
   fallback of new-kernel-pkg
 
 - extend Kconfig listnewconfig target to provide more information
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJa0krLAAoJED2LAQed4NsGyCAP/3Vsb8A4sea7sE3LV6/aFUJp
 WcAm6PXcip1MXy7GI5yxFciwen3Z3ghQUer7fJKDcHR5c4mRSfKaqWp+TLHd6uux
 7I4pV0FNx2PapcPu5T7wNZHN96p3xZC0Z66sq9BCZ/+gNyYmZLIDcBUSIOEk0nzJ
 IsvD46zy6R6KtEnycShKVscg4JyPXJIw1UBqsPDEFHg5l16ARkghND7e5zTW62Fi
 2MqQxNXAksIKpxxoxPH/fIcNp1kFKVxYBH2CW4LQtOjC3GmrozdeV5PUc7yTezPc
 dpqOuEcIAbMH91bkvhhF+ZBi34YrxRoT4S8B3G9iCXRz+2LRZZaitqO4dAH8Kjbn
 0KjkqzNc5TosJXQ8RPTcQlRBi+JmE1bHxICvTx3XNJcqJMqIH0vs3ez/LJKOwhB4
 DbAROoxQNfVcOdouHcx2EuCSdHn24BEyzaGFhi04LACpbRLxr8IJS7hSGXRloBYp
 K3ydRvG/dCZjFRTS+xWWSi3Nzjih2mCctQlH3D4nf4M3vtCX+/k5B9IMEYFfHlvL
 KoNlK4/1vP/dAJZj0iOqd2ksCA1G6iLoHrFp3E5pdtmb4sVe2Ez3gMt+pxz3htR9
 XvjuHOzkWE9eiihs1NsFgQuyP/o3UmNKpDDW0irQ06IFEPXkA/y1mVmeTU3qtrII
 ZDiwGozIkMMEy/MLkcjE
 =tD6R
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-v4.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull more Kbuild updates from Masahiro Yamada:

 - pass HOSTLDFLAGS when compiling single .c host programs

 - build genksyms lexer and parser files instead of using shipped
   versions

 - rename *-asn1.[ch] to *.asn1.[ch] for suffix consistency

 - let the top .gitignore globally ignore artifacts generated by flex,
   bison, and asn1_compiler

 - let the top Makefile globally clean artifacts generated by flex,
   bison, and asn1_compiler

 - use safer .SECONDARY marker instead of .PRECIOUS to prevent
   intermediate files from being removed

 - support -fmacro-prefix-map option to make __FILE__ a relative path

 - fix # escaping to prepare for the future GNU Make release

 - clean up deb-pkg by using debian tools instead of handrolled
   source/changes generation

 - improve rpm-pkg portability by supporting kernel-install as a
   fallback of new-kernel-pkg

 - extend Kconfig listnewconfig target to provide more information

* tag 'kbuild-v4.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  kconfig: extend output of 'listnewconfig'
  kbuild: rpm-pkg: use kernel-install as a fallback for new-kernel-pkg
  Kbuild: fix # escaping in .cmd files for future Make
  kbuild: deb-pkg: split generating packaging and build
  kbuild: use -fmacro-prefix-map to make __FILE__ a relative path
  kbuild: mark $(targets) as .SECONDARY and remove .PRECIOUS markers
  kbuild: rename *-asn1.[ch] to *.asn1.[ch]
  kbuild: clean up *-asn1.[ch] patterns from top-level Makefile
  .gitignore: move *-asn1.[ch] patterns to the top-level .gitignore
  kbuild: add %.dtb.S and %.dtb to 'targets' automatically
  kbuild: add %.lex.c and %.tab.[ch] to 'targets' automatically
  genksyms: generate lexer and parser during build instead of shipping
  kbuild: clean up *.lex.c and *.tab.[ch] patterns from top-level Makefile
  .gitignore: move *.lex.c *.tab.[ch] patterns to the top-level .gitignore
  kbuild: use HOSTLDFLAGS for single .c executables
2018-04-15 17:21:30 -07:00
Guillaume Nault
f726214d9b l2tp: hold reference on tunnels printed in l2tp/tunnels debugfs file
Use l2tp_tunnel_get_nth() instead of l2tp_tunnel_find_nth(), to be safe
against concurrent tunnel deletion.

Use the same mechanism as in l2tp_ppp.c for dropping the reference
taken by l2tp_tunnel_get_nth(). That is, drop the reference just
before looking up the next tunnel. In case of error, drop the last
accessed tunnel in l2tp_dfs_seq_stop().

That was the last use of l2tp_tunnel_find_nth().

Fixes: 0ad6614048 ("l2tp: Add debugfs files for dumping l2tp debug info")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-13 12:17:26 -04:00
Guillaume Nault
0e0c3fee3a l2tp: hold reference on tunnels printed in pppol2tp proc file
Use l2tp_tunnel_get_nth() instead of l2tp_tunnel_find_nth(), to be safe
against concurrent tunnel deletion.

Unlike sessions, we can't drop the reference held on tunnels in
pppol2tp_seq_show(). Tunnels are reused across several calls to
pppol2tp_seq_start() when iterating over sessions. These iterations
need the tunnel for accessing the next session. Therefore the only safe
moment for dropping the reference is just before searching for the next
tunnel.

Normally, the last invocation of pppol2tp_next_tunnel() doesn't find
any new tunnel, so it drops the last tunnel without taking any new
reference. However, in case of error, pppol2tp_seq_stop() is called
directly, so we have to drop the reference there.

Fixes: fd558d186d ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-13 12:17:26 -04:00
Guillaume Nault
5846c131c3 l2tp: hold reference on tunnels in netlink dumps
l2tp_tunnel_find_nth() is unsafe: no reference is held on the returned
tunnel, therefore it can be freed whenever the caller uses it.
This patch defines l2tp_tunnel_get_nth() which works similarly, but
also takes a reference on the returned tunnel. The caller then has to
drop it after it stops using the tunnel.

Convert netlink dumps to make them safe against concurrent tunnel
deletion.

Fixes: 309795f4be ("l2tp: Add netlink control API for L2TP")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-13 12:17:26 -04:00
Wolfgang Bumiller
53b76cdf7e net: fix deadlock while clearing neighbor proxy table
When coming from ndisc_netdev_event() in net/ipv6/ndisc.c,
neigh_ifdown() is called with &nd_tbl, locking this while
clearing the proxy neighbor entries when eg. deleting an
interface. Calling the table's pndisc_destructor() with the
lock still held, however, can cause a deadlock: When a
multicast listener is available an IGMP packet of type
ICMPV6_MGM_REDUCTION may be sent out. When reaching
ip6_finish_output2(), if no neighbor entry for the target
address is found, __neigh_create() is called with &nd_tbl,
which it'll want to lock.

Move the elements into their own list, then unlock the table
and perform the destruction.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199289
Fixes: 6fd6ce2056 ("ipv6: Do not depend on rt->n in ip6_finish_output2().")
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-12 22:01:22 -04:00
Xin Long
1071ec9d45 sctp: do not check port in sctp_inet6_cmp_addr
pf->cmp_addr() is called before binding a v6 address to the sock. It
should not check ports, like in sctp_inet_cmp_addr.

But sctp_inet6_cmp_addr checks the addr by invoking af(6)->cmp_addr,
sctp_v6_cmp_addr where it also compares the ports.

This would cause that setsockopt(SCTP_SOCKOPT_BINDX_ADD) could bind
multiple duplicated IPv6 addresses after Commit 40b4f0fd74 ("sctp:
lack the check for ports in sctp_v6_cmp_addr").

This patch is to remove af->cmp_addr called in sctp_inet6_cmp_addr,
but do the proper check for both v6 addrs and v4mapped addrs.

v1->v2:
  - define __sctp_v6_cmp_addr to do the common address comparison
    used for both pf and af v6 cmp_addr.

Fixes: 40b4f0fd74 ("sctp: lack the check for ports in sctp_v6_cmp_addr")
Reported-by: Jianwen Ji <jiji@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-12 22:01:05 -04:00
Jon Maloy
335b929b28 tipc: fix missing initializer in tipc_sendmsg()
The stack variable 'dnode' in __tipc_sendmsg() may theoretically
end up tipc_node_get_mtu() as an unitilalized variable.

We fix this by intializing the variable at declaration. We also add
a default else clause to the two conditional ones already there, so
that we never end up in the named function if the given address
type is illegal.

Reported-by: syzbot+b0975ce9355b347c1546@syzkaller.appspotmail.com
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-12 21:55:38 -04:00
Doron Roberts-Kedes
9d0c75bf6e strparser: Fix incorrect strp->need_bytes value.
strp_data_ready resets strp->need_bytes to 0 if strp_peek_len indicates
that the remainder of the message has been received. However,
do_strp_work does not reset strp->need_bytes to 0. If do_strp_work
completes a partial message, the value of strp->need_bytes will continue
to reflect the needed bytes of the previous message, causing
future invocations of strp_data_ready to return early if
strp->need_bytes is less than strp_peek_len. Resetting strp->need_bytes
to 0 in __strp_recv on handing a full message to the upper layer solves
this problem.

__strp_recv also calculates strp->need_bytes using stm->accum_len before
stm->accum_len has been incremented by cand_len. This can cause
strp->need_bytes to be equal to the full length of the message instead
of the full length minus the accumulated length. This, in turn, causes
strp_data_ready to return early, even when there is sufficient data to
complete the partial message. Incrementing stm->accum_len before using
it to calculate strp->need_bytes solves this problem.

Found while testing net/tls_sw recv path.

Fixes: 43a0c6751a ("strparser: Stream parser for messages")
Signed-off-by: Doron Roberts-Kedes <doronrk@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-12 21:54:59 -04:00
Eric Dumazet
7dd07c143a net: validate attribute sizes in neigh_dump_table()
Since neigh_dump_table() calls nlmsg_parse() without giving policy
constraints, attributes can have arbirary size that we must validate

Reported by syzbot/KMSAN :

BUG: KMSAN: uninit-value in neigh_master_filtered net/core/neighbour.c:2292 [inline]
BUG: KMSAN: uninit-value in neigh_dump_table net/core/neighbour.c:2348 [inline]
BUG: KMSAN: uninit-value in neigh_dump_info+0x1af0/0x2250 net/core/neighbour.c:2438
CPU: 1 PID: 3575 Comm: syzkaller268891 Not tainted 4.16.0+ #83
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x185/0x1d0 lib/dump_stack.c:53
 kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
 __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
 neigh_master_filtered net/core/neighbour.c:2292 [inline]
 neigh_dump_table net/core/neighbour.c:2348 [inline]
 neigh_dump_info+0x1af0/0x2250 net/core/neighbour.c:2438
 netlink_dump+0x9ad/0x1540 net/netlink/af_netlink.c:2225
 __netlink_dump_start+0x1167/0x12a0 net/netlink/af_netlink.c:2322
 netlink_dump_start include/linux/netlink.h:214 [inline]
 rtnetlink_rcv_msg+0x1435/0x1560 net/core/rtnetlink.c:4598
 netlink_rcv_skb+0x355/0x5f0 net/netlink/af_netlink.c:2447
 rtnetlink_rcv+0x50/0x60 net/core/rtnetlink.c:4653
 netlink_unicast_kernel net/netlink/af_netlink.c:1311 [inline]
 netlink_unicast+0x1672/0x1750 net/netlink/af_netlink.c:1337
 netlink_sendmsg+0x1048/0x1310 net/netlink/af_netlink.c:1900
 sock_sendmsg_nosec net/socket.c:630 [inline]
 sock_sendmsg net/socket.c:640 [inline]
 ___sys_sendmsg+0xec0/0x1310 net/socket.c:2046
 __sys_sendmsg net/socket.c:2080 [inline]
 SYSC_sendmsg+0x2a3/0x3d0 net/socket.c:2091
 SyS_sendmsg+0x54/0x80 net/socket.c:2087
 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x43fed9
RSP: 002b:00007ffddbee2798 EFLAGS: 00000213 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 000000000043fed9
RDX: 0000000000000000 RSI: 0000000020005000 RDI: 0000000000000003
RBP: 00000000006ca018 R08: 00000000004002c8 R09: 00000000004002c8
R10: 00000000004002c8 R11: 0000000000000213 R12: 0000000000401800
R13: 0000000000401890 R14: 0000000000000000 R15: 0000000000000000

Uninit was created at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
 kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188
 kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314
 kmsan_slab_alloc+0x11/0x20 mm/kmsan/kmsan.c:321
 slab_post_alloc_hook mm/slab.h:445 [inline]
 slab_alloc_node mm/slub.c:2737 [inline]
 __kmalloc_node_track_caller+0xaed/0x11c0 mm/slub.c:4369
 __kmalloc_reserve net/core/skbuff.c:138 [inline]
 __alloc_skb+0x2cf/0x9f0 net/core/skbuff.c:206
 alloc_skb include/linux/skbuff.h:984 [inline]
 netlink_alloc_large_skb net/netlink/af_netlink.c:1183 [inline]
 netlink_sendmsg+0x9a6/0x1310 net/netlink/af_netlink.c:1875
 sock_sendmsg_nosec net/socket.c:630 [inline]
 sock_sendmsg net/socket.c:640 [inline]
 ___sys_sendmsg+0xec0/0x1310 net/socket.c:2046
 __sys_sendmsg net/socket.c:2080 [inline]
 SYSC_sendmsg+0x2a3/0x3d0 net/socket.c:2091
 SyS_sendmsg+0x54/0x80 net/socket.c:2087
 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Fixes: 21fdd092ac ("net: Add support for filtering neigh dump by master device")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: David Ahern <dsa@cumulusnetworks.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-12 21:46:11 -04:00
Eric Dumazet
7212303268 tcp: md5: reject TCP_MD5SIG or TCP_MD5SIG_EXT on established sockets
syzbot/KMSAN reported an uninit-value in tcp_parse_options() [1]

I believe this was caused by a TCP_MD5SIG being set on live
flow.

This is highly unexpected, since TCP option space is limited.

For instance, presence of TCP MD5 option automatically disables
TCP TimeStamp option at SYN/SYNACK time, which we can not do
once flow has been established.

Really, adding/deleting an MD5 key only makes sense on sockets
in CLOSE or LISTEN state.

[1]
BUG: KMSAN: uninit-value in tcp_parse_options+0xd74/0x1a30 net/ipv4/tcp_input.c:3720
CPU: 1 PID: 6177 Comm: syzkaller192004 Not tainted 4.16.0+ #83
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x185/0x1d0 lib/dump_stack.c:53
 kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
 __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
 tcp_parse_options+0xd74/0x1a30 net/ipv4/tcp_input.c:3720
 tcp_fast_parse_options net/ipv4/tcp_input.c:3858 [inline]
 tcp_validate_incoming+0x4f1/0x2790 net/ipv4/tcp_input.c:5184
 tcp_rcv_established+0xf60/0x2bb0 net/ipv4/tcp_input.c:5453
 tcp_v4_do_rcv+0x6cd/0xd90 net/ipv4/tcp_ipv4.c:1469
 sk_backlog_rcv include/net/sock.h:908 [inline]
 __release_sock+0x2d6/0x680 net/core/sock.c:2271
 release_sock+0x97/0x2a0 net/core/sock.c:2786
 tcp_sendmsg+0xd6/0x100 net/ipv4/tcp.c:1464
 inet_sendmsg+0x48d/0x740 net/ipv4/af_inet.c:764
 sock_sendmsg_nosec net/socket.c:630 [inline]
 sock_sendmsg net/socket.c:640 [inline]
 SYSC_sendto+0x6c3/0x7e0 net/socket.c:1747
 SyS_sendto+0x8a/0xb0 net/socket.c:1715
 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x448fe9
RSP: 002b:00007fd472c64d38 EFLAGS: 00000216 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 00000000006e5a30 RCX: 0000000000448fe9
RDX: 000000000000029f RSI: 0000000020a88f88 RDI: 0000000000000004
RBP: 00000000006e5a34 R08: 0000000020e68000 R09: 0000000000000010
R10: 00000000200007fd R11: 0000000000000216 R12: 0000000000000000
R13: 00007fff074899ef R14: 00007fd472c659c0 R15: 0000000000000009

Uninit was created at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
 kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188
 kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314
 kmsan_slab_alloc+0x11/0x20 mm/kmsan/kmsan.c:321
 slab_post_alloc_hook mm/slab.h:445 [inline]
 slab_alloc_node mm/slub.c:2737 [inline]
 __kmalloc_node_track_caller+0xaed/0x11c0 mm/slub.c:4369
 __kmalloc_reserve net/core/skbuff.c:138 [inline]
 __alloc_skb+0x2cf/0x9f0 net/core/skbuff.c:206
 alloc_skb include/linux/skbuff.h:984 [inline]
 tcp_send_ack+0x18c/0x910 net/ipv4/tcp_output.c:3624
 __tcp_ack_snd_check net/ipv4/tcp_input.c:5040 [inline]
 tcp_ack_snd_check net/ipv4/tcp_input.c:5053 [inline]
 tcp_rcv_established+0x2103/0x2bb0 net/ipv4/tcp_input.c:5469
 tcp_v4_do_rcv+0x6cd/0xd90 net/ipv4/tcp_ipv4.c:1469
 sk_backlog_rcv include/net/sock.h:908 [inline]
 __release_sock+0x2d6/0x680 net/core/sock.c:2271
 release_sock+0x97/0x2a0 net/core/sock.c:2786
 tcp_sendmsg+0xd6/0x100 net/ipv4/tcp.c:1464
 inet_sendmsg+0x48d/0x740 net/ipv4/af_inet.c:764
 sock_sendmsg_nosec net/socket.c:630 [inline]
 sock_sendmsg net/socket.c:640 [inline]
 SYSC_sendto+0x6c3/0x7e0 net/socket.c:1747
 SyS_sendto+0x8a/0xb0 net/socket.c:1715
 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Fixes: cfb6eeb4c8 ("[TCP]: MD5 Signature Option (RFC2385) support.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-12 21:46:11 -04:00
Jon Maloy
c3317f4db8 tipc: fix unbalanced reference counter
When a topology subscription is created, we may encounter (or KASAN
may provoke) a failure to create a corresponding service instance in
the binding table. Instead of letting the tipc_nametbl_subscribe()
report the failure back to the caller, the function just makes a warning
printout and returns, without incrementing the subscription reference
counter as expected by the caller.

This makes the caller believe that the subscription was successful, so
it will at a later moment try to unsubscribe the item. This involves
a sub_put() call. Since the reference counter never was incremented
in the first place, we get a premature delete of the subscription item,
followed by a "use-after-free" warning.

We fix this by adding a return value to tipc_nametbl_subscribe() and
make the caller aware of the failure to subscribe.

This bug seems to always have been around, but this fix only applies
back to the commit shown below. Given the low risk of this happening
we believe this to be sufficient.

Fixes: commit 218527fe27 ("tipc: replace name table service range
array with rb tree")
Reported-by: syzbot+aa245f26d42b8305d157@syzkaller.appspotmail.com

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-12 21:46:10 -04:00
Kees Cook
b16520f749 net/tls: Remove VLA usage
In the quest to remove VLAs from the kernel[1], this replaces the VLA
size with the only possible size used in the code, and adds a mechanism
to double-check future IV sizes.

[1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com

Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-12 21:46:10 -04:00
Linus Torvalds
a1bf4c7da6 NFS client updates for Linux 4.17
Stable bugfixes:
 - xprtrdma: Fix corner cases when handling device removal # v4.12+
 - xprtrdma: Fix latency regression on NUMA NFS/RDMA clients # v4.15+
 
 Features:
 - New sunrpc tracepoint for RPC pings
 - Finer grained NFSv4 attribute checking
 - Don't unnecessarily return NFS v4 delegations
 
 Other bugfixes and cleanups:
 - Several other small NFSoRDMA cleanups
 - Improvements to the sunrpc RTT measurements
 - A few sunrpc tracepoint cleanups
 - Various fixes for NFS v4 lock notifications
 - Various sunrpc and NFS v4 XDR encoding cleanups
 - Switch to the ida_simple API
 - Fix NFSv4.1 exclusive create
 - Forget acl cache after setattr operation
 - Don't advance the nfs_entry readdir cookie if xdr decoding fails
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlrNG1IACgkQ18tUv7Cl
 QOvotw//fQoUgQ/AOJGlZo/4ws2mGJN3dfwwKM8xYOnHaxppOYubZRHwvswK8d22
 +XR/Q6IVbUxI3mJluv1L0d9CJT06s3c9CO90McIJbk4CWihGP19bNIY4JiPlzrbv
 4FDiyOvMBej2UXbHX5EzKj0srxyBoEVf3iUAIa6DaHi3c6EIUo6fP3d2eRNJStqd
 WMyZs+nqr2W9biyClxntT7l/Sk+o+4I7M3Oo9pjjS+PiePYdaMrL5T1kPeHaJshF
 GMGXkbvVdqpDRiXX84R9+2/nuSiA15eEnaR94UNvs84oLR3qob3ZhxhudqFdSPrX
 RS6E7m34gY/EaQm/wbB26PZm+3jHd4Pqm5SKLbyFfoCmG6oMwBvXNRJZas1DFaHM
 CMOECvfAr6kixVLkAN0MNQ2Ku/FuJ52OLP1dRLmxsblocnhEPujc6RSz6Ju/v3a0
 adbpmJMA2IoSGgXMu3g1VGnjHfMj7ZmjtpigXVvlcUqQGCL7t4ngh23cpeTQeJ76
 bMwSHUQu18NbmtJjBTE+PIm7mdCrpQD7ZuOPWpK62zxLYUnnv7nm75m84DrDru7d
 XAmrCmdUJNrVWQs6BAtCXgO4PZ6xNGLosb0xTQXTAQYftc+DRJ9SW/VGc0Mp1L9m
 0G0iz++b8cy4Pih5UCDJcCkpjCIvHLcn72zn1kbufWqG3xr2koc=
 =IlWo
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.17-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
 "Stable bugfixes:
   - xprtrdma: Fix corner cases when handling device removal # v4.12+
   - xprtrdma: Fix latency regression on NUMA NFS/RDMA clients # v4.15+

  Features:
   - New sunrpc tracepoint for RPC pings
   - Finer grained NFSv4 attribute checking
   - Don't unnecessarily return NFS v4 delegations

  Other bugfixes and cleanups:
   - Several other small NFSoRDMA cleanups
   - Improvements to the sunrpc RTT measurements
   - A few sunrpc tracepoint cleanups
   - Various fixes for NFS v4 lock notifications
   - Various sunrpc and NFS v4 XDR encoding cleanups
   - Switch to the ida_simple API
   - Fix NFSv4.1 exclusive create
   - Forget acl cache after setattr operation
   - Don't advance the nfs_entry readdir cookie if xdr decoding fails"

* tag 'nfs-for-4.17-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (47 commits)
  NFS: advance nfs_entry cookie only after decoding completes successfully
  NFSv3/acl: forget acl cache after setattr
  NFSv4.1: Fix exclusive create
  NFSv4: Declare the size up to date after it was set.
  nfs: Use ida_simple API
  NFSv4: Fix the nfs_inode_set_delegation() arguments
  NFSv4: Clean up CB_GETATTR encoding
  NFSv4: Don't ask for attributes when ACCESS is protected by a delegation
  NFSv4: Add a helper to encode/decode struct timespec
  NFSv4: Clean up encode_attrs
  NFSv4; Clean up XDR encoding of type bitmap4
  NFSv4: Allow GFP_NOIO sleeps in decode_attr_owner/decode_attr_group
  SUNRPC: Add a helper for encoding opaque data inline
  SUNRPC: Add helpers for decoding opaque and string types
  NFSv4: Ignore change attribute invalidations if we hold a delegation
  NFS: More fine grained attribute tracking
  NFS: Don't force unnecessary cache invalidation in nfs_update_inode()
  NFS: Don't redirty the attribute cache in nfs_wcc_update_inode()
  NFS: Don't force a revalidation of all attributes if change is missing
  NFS: Convert NFS_INO_INVALID flags to unsigned long
  ...
2018-04-12 12:55:50 -07:00
Linus Torvalds
5d1365940a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) In ip_gre tunnel, handle the conflict between TUNNEL_{SEQ,CSUM} and
    GSO/LLTX properly. From Sabrina Dubroca.

 2) Stop properly on error in lan78xx_read_otp(), from Phil Elwell.

 3) Don't uncompress in slip before rstate is initialized, from Tejaswi
    Tanikella.

 4) When using 1.x firmware on aquantia, issue a deinit before we
    hardware reset the chip, otherwise we break dirty wake WOL. From
    Igor Russkikh.

 5) Correct log check in vhost_vq_access_ok(), from Stefan Hajnoczi.

 6) Fix ethtool -x crashes in bnxt_en, from Michael Chan.

 7) Fix races in l2tp tunnel creation and duplicate tunnel detection,
    from Guillaume Nault.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (22 commits)
  l2tp: fix race in duplicate tunnel detection
  l2tp: fix races in tunnel creation
  tun: send netlink notification when the device is modified
  tun: set the flags before registering the netdevice
  lan78xx: Don't reset the interface on open
  bnxt_en: Fix NULL pointer dereference at bnxt_free_irq().
  bnxt_en: Need to include RDMA rings in bnxt_check_rings().
  bnxt_en: Support max-mtu with VF-reps
  bnxt_en: Ignore src port field in decap filter nodes
  bnxt_en: do not allow wildcard matches for L2 flows
  bnxt_en: Fix ethtool -x crash when device is down.
  vhost: return bool from *_access_ok() functions
  vhost: fix vhost_vq_access_ok() log check
  vhost: Fix vhost_copy_to_user()
  net: aquantia: oops when shutdown on already stopped device
  net: aquantia: Regression on reset with 1.x firmware
  cdc_ether: flag the Cinterion AHS8 modem by gemalto as WWAN
  slip: Check if rstate is initialized before uncompressing
  lan78xx: Avoid spurious kevent 4 "error"
  lan78xx: Correctly indicate invalid OTP
  ...
2018-04-12 11:09:05 -07:00
Guillaume Nault
f6cd651b05 l2tp: fix race in duplicate tunnel detection
We can't use l2tp_tunnel_find() to prevent l2tp_nl_cmd_tunnel_create()
from creating a duplicate tunnel. A tunnel can be concurrently
registered after l2tp_tunnel_find() returns. Therefore, searching for
duplicates must be done at registration time.

Finally, remove l2tp_tunnel_find() entirely as it isn't use anywhere
anymore.

Fixes: 309795f4be ("l2tp: Add netlink control API for L2TP")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-11 17:41:27 -04:00
Guillaume Nault
6b9f34239b l2tp: fix races in tunnel creation
l2tp_tunnel_create() inserts the new tunnel into the namespace's tunnel
list and sets the socket's ->sk_user_data field, before returning it to
the caller. Therefore, there are two ways the tunnel can be accessed
and freed, before the caller even had the opportunity to take a
reference. In practice, syzbot could crash the module by closing the
socket right after a new tunnel was returned to pppol2tp_create().

This patch moves tunnel registration out of l2tp_tunnel_create(), so
that the caller can safely hold a reference before publishing the
tunnel. This second step is done with the new l2tp_tunnel_register()
function, which is now responsible for associating the tunnel to its
socket and for inserting it into the namespace's list.

While moving the code to l2tp_tunnel_register(), a few modifications
have been done. First, the socket validation tests are done in a helper
function, for clarity. Also, modifying the socket is now done after
having inserted the tunnel to the namespace's tunnels list. This will
allow insertion to fail, without having to revert theses modifications
in the error path (a followup patch will check for duplicate tunnels
before insertion). Either the socket is a kernel socket which we
control, or it is a user-space socket for which we have a reference on
the file descriptor. In any case, the socket isn't going to be closed
from under us.

Reported-by: syzbot+fbeeb5c3b538e8545644@syzkaller.appspotmail.com
Fixes: fd558d186d ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-11 17:41:27 -04:00
Ka-Cheong Poon
a43cced9a3 rds: MP-RDS may use an invalid c_path
rds_sendmsg() calls rds_send_mprds_hash() to find a c_path to use to
send a message.  Suppose the RDS connection is not yet up.  In
rds_send_mprds_hash(), it does

	if (conn->c_npaths == 0)
		wait_event_interruptible(conn->c_hs_waitq,
					 (conn->c_npaths != 0));

If it is interrupted before the connection is set up,
rds_send_mprds_hash() will return a non-zero hash value.  Hence
rds_sendmsg() will use a non-zero c_path to send the message.  But if
the RDS connection ends up to be non-MP capable, the message will be
lost as only the zero c_path can be used.

Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-11 10:24:01 -04:00
Jack Ma
cf43ae63c0 netfilter: xt_connmark: Add bit mapping for bit-shift operation.
With the addition of bit-shift operations, we are able to shift
ct/skbmark based on user requirements. However, this change might also
cause the most left/right hand- side mark to be accidentially lost
during shift operations.

This patch adds the ability to 'grep' certain bits based on ctmask or
nfmask out of the original mark. Then, apply shift operations to achieve
a new mapping between ctmark and skb->mark.

For example: If someone would like save the fourth F bits of ctmark
0xFFF(F)000F into the seventh hexadecimal (0) skb->mark 0xABC000(0)E.

	new_targetmark = (ctmark & ctmask) >> 12;
	(new) skb->mark = (skb->mark &~nfmask) ^
        	           new_targetmark;

This will preserve the other bits that are not related to this
operation.

Fixes: 472a73e007 ("netfilter: xt_conntrack: Support bit-shifting for CONNMARK & MARK targets.")
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jack Ma <jack.ma@alliedtelesis.co.nz>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-11 10:36:02 +02:00
Trond Myklebust
0e779aa703 SUNRPC: Add helpers for decoding opaque and string types
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Chuck Lever
2552428863 xprtrdma: Fix corner cases when handling device removal
Michal Kalderon has found some corner cases around device unload
with active NFS mounts that I didn't have the imagination to test
when xprtrdma device removal was added last year.

- The ULP device removal handler is responsible for deallocating
  the PD. That wasn't clear to me initially, and my own testing
  suggested it was not necessary, but that is incorrect.

- The transport destruction path can no longer assume that there
  is a valid ID.

- When destroying a transport, ensure that ib_free_cq() is not
  invoked on a CQ that was already released.

Reported-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Fixes: bebd031866 ("xprtrdma: Support unplugging an HCA from ...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:07:10 -04:00
Chuck Lever
a25a4cb3af sunrpc: Add static trace point to report result of RPC ping
This information can help track down local misconfiguration issues
as well as network partitions and unresponsive servers.

There are several ways to send a ping, and with transport multi-
plexing, the exact rpc_xprt that is used is sometimes not known by
the upper layer. The rpc_xprt pointer passed to the trace point
call also has to be RCU-safe.

I found a spot inside the client FSM where an rpc_xprt pointer is
always available and safe to use.

Suggested-by: Bill Baker <Bill.Baker@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Chuck Lever
40bf7eb304 sunrpc: Add static trace point to report RPC latency stats
Introduce a low-overhead mechanism to report information about
latencies of individual RPCs. The goal is to enable user space to
filter the trace record for latency outliers, or build histograms,
etc.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Chuck Lever
e671edb942 sunrpc: Simplify synopsis of some trace points
Clean up: struct rpc_task carries a pointer to a struct rpc_clnt,
and in fact task->tk_client is always what is passed into trace
points that are already passing @task.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Chuck Lever
ff699ea826 SUNRPC: Make num_reqs a non-atomic integer
If recording xprt->stat.max_slots is moved into xprt_alloc_slot,
then xprt->num_reqs is never manipulated outside
xprt->reserve_lock. There's no longer a need for xprt->num_reqs to
be atomic.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Chuck Lever
78215759e2 SUNRPC: Make RTT measurement more precise (Send)
Some RPC transports have more overhead in their send_request
callouts than others. For example, for RPC-over-RDMA:

- Marshaling an RPC often has to DMA map the RPC arguments

- Registration methods perform memory registration as part of
  marshaling

To capture just server and network latencies more precisely: when
sending a Call, capture the rq_xtime timestamp _after_ the transport
header has been marshaled.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Chuck Lever
0b87a46b43 SUNRPC: Make RTT measurement more precise (Receive)
Some RPC transports have more overhead in their reply handlers
than others. For example, for RPC-over-RDMA:

- RPC completion has to wait for memory invalidation, which is
  not a part of the server/network round trip

- Recently a context switch was introduced into the reply handler,
  which further artificially inflates the measure of RPC RTT

To capture just server and network latencies more precisely: when
receiving a reply, compute the RTT as soon as the XID is recognized
rather than at RPC completion time.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Chuck Lever
ecd465ee88 SUNRPC: Move xprt_update_rtt callsite
Since commit 33849792cb ("xprtrdma: Detect unreachable NFS/RDMA
servers more reliably"), the xprtrdma transport now has a ->timer
callout. But xprtrdma does not need to compute RTT data, only UDP
needs that. Move the xprt_update_rtt call into the UDP transport
implementation.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Chuck Lever
2dd4a012d9 xprtrdma: Move creation of rl_rdmabuf to rpcrdma_create_req
Refactor: Both rpcrdma_create_req call sites have to allocate the
buffer where the transport header is built, so just move that
allocation into rpcrdma_create_req.

This buffer is a fixed size. There's no needed information available
in call_allocate that is not also available when the transport is
created.

The original purpose for allocating these buffers on demand was to
reduce the possibility that an allocation failure during transport
creation will hork the mount operation during low memory scenarios.
Some relief for this rare possibility is coming up in the next few
patches.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Chuck Lever
f287762308 xprtrdma: Chain Send to FastReg WRs
With FRWR, the client transport can perform memory registration and
post a Send with just a single ib_post_send.

This reduces contention between the send_request path and the Send
Completion handlers, and reduces the overhead of registering a chunk
that has multiple segments.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Chuck Lever
fb14ae8853 xprtrdma: "Support" call-only RPCs
RPC-over-RDMA version 1 credit accounting relies on there being a
response message for every RPC Call. This means that RPC procedures
that have no reply will disrupt credit accounting, just in the same
way as a retransmit would (since it is sent because no reply has
arrived). Deal with the "no reply" case the same way.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Chuck Lever
ae741a8551 xprtrdma: Reduce number of MRs created by rpcrdma_mrs_create
Create fewer MRs on average. Many workloads don't need as many as
32 MRs, and the transport can now quickly restock the MR free list.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Chuck Lever
9e679d5e76 xprtrdma: ->send_request returns -EAGAIN when there are no free MRs
Currently, when the MR free list is exhausted during marshaling, the
RPC/RDMA transport places the RPC task on the delayq, which forces a
wait for HZ >> 2 before the marshal and send is retried.

With this change, the transport now places such an RPC task on the
pending queue, and wakes it just as soon as more MRs have been
created. Creating more MRs typically takes less than a millisecond,
and this waking mechanism is less deadlock-prone.

Moreover, the waiting RPC task is holding the transport's write
lock, which blocks the transport from sending RPCs. Therefore faster
recovery from MR exhaustion is desirable.

This is the same mechanism that the TCP transport utilizes when
handling write buffer space exhaustion.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Chuck Lever
8a14793e7a xprtrdma: Remove xprt-specific connect cookie
Clean up: The generic rq_connect_cookie is sufficient to detect RPC
Call retransmission.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Chuck Lever
b7e85fff52 xprtrdma: Remove arbitrary limit on initiator depth
Clean up: We need to check only that the value does not exceed the
range of the u8 field it's going into.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Chuck Lever
6720a89933 xprtrdma: Fix latency regression on NUMA NFS/RDMA clients
With v4.15, on one of my NFS/RDMA clients I measured a nearly
doubling in the latency of small read and write system calls. There
was no change in server round trip time. The extra latency appears
in the whole RPC execution path.

"git bisect" settled on commit ccede75985 ("xprtrdma: Spread reply
processing over more CPUs") .

After some experimentation, I found that leaving the WQ bound and
allowing the scheduler to pick the dispatch CPU seems to eliminate
the long latencies, and it does not introduce any new regressions.

The fix is implemented by reverting only the part of
commit ccede75985 ("xprtrdma: Spread reply processing over more
CPUs") that dispatches RPC replies specifically on the CPU where the
matching RPC call was made.

Interestingly, saving the CPU number and later queuing reply
processing there was effective _only_ for a NFS READ and WRITE
request. On my NUMA client, in-kernel RPC reply processing for
asynchronous RPCs was dispatched on the same CPU where the RPC call
was made, as expected. However synchronous RPCs seem to get their
reply dispatched on some other CPU than where the call was placed,
every time.

Fixes: ccede75985 ("xprtrdma: Spread reply processing over ... ")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org # v4.15+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Linus Torvalds
b284d4d5a6 The big ticket items are:
- support for rbd "fancy" striping (myself).  The striping feature bit
   is now fully implemented, allowing mapping v2 images with non-default
   striping patterns.  This completes support for --image-format 2.
 
 - CephFS quota support (Luis Henriques and Zheng Yan).  This set is
   based on the new SnapRealm code in the upcoming v13.y.z ("Mimic")
   release.  Quota handling will be rejected on older filesystems.
 
 - memory usage improvements in CephFS (Chengguang Xu).  Directory
   specific bits have been split out of ceph_file_info and some effort
   went into improving cap reservation code to avoid OOM crashes.
 
 Also included a bunch of assorted fixes all over the place from
 Chengguang and others.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJazOI/AAoJEEp/3jgCEfOLOu0IAKGFkcCo0UdQDGHHJZHn2rAm
 CSWMMwyYGAhoWI6Gva0jx1A2omZLFSeq/MC8dWLL/MNAKt8i/qo8bTsTrwCHMR2Q
 D0FsvMWIhkWRS1/FcD1uVDhn0a/DFm5Kfy8kzz3v695TDCt+BYWrCqyHTB/wSdRR
 VpO3KdpHQ9h3ojNBRgIniOCNPeQP+QzLXy+P0h0oKbP2Y03mwJlsWG4L6zakkkwT
 e2I+RVdlOMUDJ7rZxiXESBr6BuLI4oOkPe8roQGmZPy1Xe17xa9M5iWVNuM6RUhO
 Z9bS2aLMhbDyeCPqvzgAnsUtFT0PAQjB5NYw2yqisbHs/wrU5kMOOpcLqz/Ls/s=
 =v1I9
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.17-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "The big ticket items are:

   - support for rbd "fancy" striping (myself).

     The striping feature bit is now fully implemented, allowing mapping
     v2 images with non-default striping patterns. This completes
     support for --image-format 2.

   - CephFS quota support (Luis Henriques and Zheng Yan).

     This set is based on the new SnapRealm code in the upcoming v13.y.z
     ("Mimic") release. Quota handling will be rejected on older
     filesystems.

   - memory usage improvements in CephFS (Chengguang Xu).

     Directory specific bits have been split out of ceph_file_info and
     some effort went into improving cap reservation code to avoid OOM
     crashes.

  Also included a bunch of assorted fixes all over the place from
  Chengguang and others"

* tag 'ceph-for-4.17-rc1' of git://github.com/ceph/ceph-client: (67 commits)
  ceph: quota: report root dir quota usage in statfs
  ceph: quota: add counter for snaprealms with quota
  ceph: quota: cache inode pointer in ceph_snap_realm
  ceph: fix root quota realm check
  ceph: don't check quota for snap inode
  ceph: quota: update MDS when max_bytes is approaching
  ceph: quota: support for ceph.quota.max_bytes
  ceph: quota: don't allow cross-quota renames
  ceph: quota: support for ceph.quota.max_files
  ceph: quota: add initial infrastructure to support cephfs quotas
  rbd: remove VLA usage
  rbd: fix spelling mistake: "reregisteration" -> "reregistration"
  ceph: rename function drop_leases() to a more descriptive name
  ceph: fix invalid point dereference for error case in mdsc destroy
  ceph: return proper bool type to caller instead of pointer
  ceph: optimize memory usage
  ceph: optimize mds session register
  libceph, ceph: add __init attribution to init funcitons
  ceph: filter out used flags when printing unused open flags
  ceph: don't wait on writeback when there is no more dirty pages
  ...
2018-04-10 12:25:30 -07:00
Sabrina Dubroca
1cc5954f44 ip_gre: clear feature flags when incompatible o_flags are set
Commit dd9d598c66 ("ip_gre: add the support for i/o_flags update via
netlink") added the ability to change o_flags, but missed that the
GSO/LLTX features are disabled by default, and only enabled some gre
features are unused. Thus we also need to disable the GSO/LLTX features
on the device when the TUNNEL_SEQ or TUNNEL_CSUM flags are set.

These two examples should result in the same features being set:

    ip link add gre_none type gre local 192.168.0.10 remote 192.168.0.20 ttl 255 key 0

    ip link set gre_none type gre seq
    ip link add gre_seq type gre local 192.168.0.10 remote 192.168.0.20 ttl 255 key 1 seq

Fixes: dd9d598c66 ("ip_gre: add the support for i/o_flags update via netlink")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-10 11:03:32 -04:00
Linus Torvalds
c18bb396d3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) The sockmap code has to free socket memory on close if there is
    corked data, from John Fastabend.

 2) Tunnel names coming from userspace need to be length validated. From
    Eric Dumazet.

 3) arp_filter() has to take VRFs properly into account, from Miguel
    Fadon Perlines.

 4) Fix oops in error path of tcf_bpf_init(), from Davide Caratti.

 5) Missing idr_remove() in u32_delete_key(), from Cong Wang.

 6) More syzbot stuff. Several use of uninitialized value fixes all
    over, from Eric Dumazet.

 7) Do not leak kernel memory to userspace in sctp, also from Eric
    Dumazet.

 8) Discard frames from unused ports in DSA, from Andrew Lunn.

 9) Fix DMA mapping and reset/failover problems in ibmvnic, from Thomas
    Falcon.

10) Do not access dp83640 PHY registers prematurely after reset, from
    Esben Haabendal.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (46 commits)
  vhost-net: set packet weight of tx polling to 2 * vq size
  net: thunderx: rework mac addresses list to u64 array
  inetpeer: fix uninit-value in inet_getpeer
  dp83640: Ensure against premature access to PHY registers after reset
  devlink: convert occ_get op to separate registration
  ARM: dts: ls1021a: Specify TBIPA register address
  net/fsl_pq_mdio: Allow explicit speficition of TBIPA address
  ibmvnic: Do not reset CRQ for Mobility driver resets
  ibmvnic: Fix failover case for non-redundant configuration
  ibmvnic: Fix reset scheduler error handling
  ibmvnic: Zero used TX descriptor counter on reset
  ibmvnic: Fix DMA mapping mistakes
  tipc: use the right skb in tipc_sk_fill_sock_diag()
  sctp: sctp_sockaddr_af must check minimal addr length for AF_INET6
  net: dsa: Discard frames from unused ports
  sctp: do not leak kernel memory to user space
  soreuseport: initialise timewait reuseport field
  ipv4: fix uninit-value in ip_route_output_key_hash_rcu()
  dccp: initialize ireq->ir_mark
  net: fix uninit-value in __hw_addr_add_ex()
  ...
2018-04-09 17:04:10 -07:00
Florian Westphal
3f1e53abff netfilter: ebtables: don't attempt to allocate 0-sized compat array
Dmitry reports 32bit ebtables on 64bit kernel got broken by
a recent change that returns -EINVAL when ruleset has no entries.

ebtables however only counts user-defined chains, so for the
initial table nentries will be 0.

Don't try to allocate the compat array in this case, as no user
defined rules exist no rule will need 64bit translation.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Fixes: 7d7d7e0211 ("netfilter: compat: reject huge allocation requests")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-09 17:05:48 +02:00
Julian Anastasov
5c64576a77 ipvs: fix rtnl_lock lockups caused by start_sync_thread
syzkaller reports for wrong rtnl_lock usage in sync code [1] and [2]

We have 2 problems in start_sync_thread if error path is
taken, eg. on memory allocation error or failure to configure
sockets for mcast group or addr/port binding:

1. recursive locking: holding rtnl_lock while calling sock_release
which in turn calls again rtnl_lock in ip_mc_drop_socket to leave
the mcast group, as noticed by Florian Westphal. Additionally,
sock_release can not be called while holding sync_mutex (ABBA
deadlock).

2. task hung: holding rtnl_lock while calling kthread_stop to
stop the running kthreads. As the kthreads do the same to leave
the mcast group (sock_release -> ip_mc_drop_socket -> rtnl_lock)
they hang.

Fix the problems by calling rtnl_unlock early in the error path,
now sock_release is called after unlocking both mutexes.

Problem 3 (task hung reported by syzkaller [2]) is variant of
problem 2: use _trylock to prevent one user to call rtnl_lock and
then while waiting for sync_mutex to block kthreads that execute
sock_release when they are stopped by stop_sync_thread.

[1]
IPVS: stopping backup sync thread 4500 ...
WARNING: possible recursive locking detected
4.16.0-rc7+ #3 Not tainted
--------------------------------------------
syzkaller688027/4497 is trying to acquire lock:
  (rtnl_mutex){+.+.}, at: [<00000000bb14d7fb>] rtnl_lock+0x17/0x20
net/core/rtnetlink.c:74

but task is already holding lock:
IPVS: stopping backup sync thread 4495 ...
  (rtnl_mutex){+.+.}, at: [<00000000bb14d7fb>] rtnl_lock+0x17/0x20
net/core/rtnetlink.c:74

other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(rtnl_mutex);
   lock(rtnl_mutex);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

2 locks held by syzkaller688027/4497:
  #0:  (rtnl_mutex){+.+.}, at: [<00000000bb14d7fb>] rtnl_lock+0x17/0x20
net/core/rtnetlink.c:74
  #1:  (ipvs->sync_mutex){+.+.}, at: [<00000000703f78e3>]
do_ip_vs_set_ctl+0x10f8/0x1cc0 net/netfilter/ipvs/ip_vs_ctl.c:2388

stack backtrace:
CPU: 1 PID: 4497 Comm: syzkaller688027 Not tainted 4.16.0-rc7+ #3
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:17 [inline]
  dump_stack+0x194/0x24d lib/dump_stack.c:53
  print_deadlock_bug kernel/locking/lockdep.c:1761 [inline]
  check_deadlock kernel/locking/lockdep.c:1805 [inline]
  validate_chain kernel/locking/lockdep.c:2401 [inline]
  __lock_acquire+0xe8f/0x3e00 kernel/locking/lockdep.c:3431
  lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3920
  __mutex_lock_common kernel/locking/mutex.c:756 [inline]
  __mutex_lock+0x16f/0x1a80 kernel/locking/mutex.c:893
  mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908
  rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74
  ip_mc_drop_socket+0x88/0x230 net/ipv4/igmp.c:2643
  inet_release+0x4e/0x1c0 net/ipv4/af_inet.c:413
  sock_release+0x8d/0x1e0 net/socket.c:595
  start_sync_thread+0x2213/0x2b70 net/netfilter/ipvs/ip_vs_sync.c:1924
  do_ip_vs_set_ctl+0x1139/0x1cc0 net/netfilter/ipvs/ip_vs_ctl.c:2389
  nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
  nf_setsockopt+0x67/0xc0 net/netfilter/nf_sockopt.c:115
  ip_setsockopt+0x97/0xa0 net/ipv4/ip_sockglue.c:1261
  udp_setsockopt+0x45/0x80 net/ipv4/udp.c:2406
  sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2975
  SYSC_setsockopt net/socket.c:1849 [inline]
  SyS_setsockopt+0x189/0x360 net/socket.c:1828
  do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x446a69
RSP: 002b:00007fa1c3a64da8 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000446a69
RDX: 000000000000048b RSI: 0000000000000000 RDI: 0000000000000003
RBP: 00000000006e29fc R08: 0000000000000018 R09: 0000000000000000
R10: 00000000200000c0 R11: 0000000000000246 R12: 00000000006e29f8
R13: 00676e697279656b R14: 00007fa1c3a659c0 R15: 00000000006e2b60

[2]
IPVS: sync thread started: state = BACKUP, mcast_ifn = syz_tun, syncid = 4,
id = 0
IPVS: stopping backup sync thread 25415 ...
INFO: task syz-executor7:25421 blocked for more than 120 seconds.
       Not tainted 4.16.0-rc6+ #284
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
syz-executor7   D23688 25421   4408 0x00000004
Call Trace:
  context_switch kernel/sched/core.c:2862 [inline]
  __schedule+0x8fb/0x1ec0 kernel/sched/core.c:3440
  schedule+0xf5/0x430 kernel/sched/core.c:3499
  schedule_timeout+0x1a3/0x230 kernel/time/timer.c:1777
  do_wait_for_common kernel/sched/completion.c:86 [inline]
  __wait_for_common kernel/sched/completion.c:107 [inline]
  wait_for_common kernel/sched/completion.c:118 [inline]
  wait_for_completion+0x415/0x770 kernel/sched/completion.c:139
  kthread_stop+0x14a/0x7a0 kernel/kthread.c:530
  stop_sync_thread+0x3d9/0x740 net/netfilter/ipvs/ip_vs_sync.c:1996
  do_ip_vs_set_ctl+0x2b1/0x1cc0 net/netfilter/ipvs/ip_vs_ctl.c:2394
  nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
  nf_setsockopt+0x67/0xc0 net/netfilter/nf_sockopt.c:115
  ip_setsockopt+0x97/0xa0 net/ipv4/ip_sockglue.c:1253
  sctp_setsockopt+0x2ca/0x63e0 net/sctp/socket.c:4154
  sock_common_setsockopt+0x95/0xd0 net/core/sock.c:3039
  SYSC_setsockopt net/socket.c:1850 [inline]
  SyS_setsockopt+0x189/0x360 net/socket.c:1829
  do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x454889
RSP: 002b:00007fc927626c68 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
RAX: ffffffffffffffda RBX: 00007fc9276276d4 RCX: 0000000000454889
RDX: 000000000000048c RSI: 0000000000000000 RDI: 0000000000000017
RBP: 000000000072bf58 R08: 0000000000000018 R09: 0000000000000000
R10: 0000000020000000 R11: 0000000000000246 R12: 00000000ffffffff
R13: 000000000000051c R14: 00000000006f9b40 R15: 0000000000000001

Showing all locks held in the system:
2 locks held by khungtaskd/868:
  #0:  (rcu_read_lock){....}, at: [<00000000a1a8f002>]
check_hung_uninterruptible_tasks kernel/hung_task.c:175 [inline]
  #0:  (rcu_read_lock){....}, at: [<00000000a1a8f002>] watchdog+0x1c5/0xd60
kernel/hung_task.c:249
  #1:  (tasklist_lock){.+.+}, at: [<0000000037c2f8f9>]
debug_show_all_locks+0xd3/0x3d0 kernel/locking/lockdep.c:4470
1 lock held by rsyslogd/4247:
  #0:  (&f->f_pos_lock){+.+.}, at: [<000000000d8d6983>]
__fdget_pos+0x12b/0x190 fs/file.c:765
2 locks held by getty/4338:
  #0:  (&tty->ldisc_sem){++++}, at: [<00000000bee98654>]
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
  #1:  (&ldata->atomic_read_lock){+.+.}, at: [<00000000c1d180aa>]
n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131
2 locks held by getty/4339:
  #0:  (&tty->ldisc_sem){++++}, at: [<00000000bee98654>]
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
  #1:  (&ldata->atomic_read_lock){+.+.}, at: [<00000000c1d180aa>]
n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131
2 locks held by getty/4340:
  #0:  (&tty->ldisc_sem){++++}, at: [<00000000bee98654>]
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
  #1:  (&ldata->atomic_read_lock){+.+.}, at: [<00000000c1d180aa>]
n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131
2 locks held by getty/4341:
  #0:  (&tty->ldisc_sem){++++}, at: [<00000000bee98654>]
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
  #1:  (&ldata->atomic_read_lock){+.+.}, at: [<00000000c1d180aa>]
n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131
2 locks held by getty/4342:
  #0:  (&tty->ldisc_sem){++++}, at: [<00000000bee98654>]
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
  #1:  (&ldata->atomic_read_lock){+.+.}, at: [<00000000c1d180aa>]
n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131
2 locks held by getty/4343:
  #0:  (&tty->ldisc_sem){++++}, at: [<00000000bee98654>]
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
  #1:  (&ldata->atomic_read_lock){+.+.}, at: [<00000000c1d180aa>]
n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131
2 locks held by getty/4344:
  #0:  (&tty->ldisc_sem){++++}, at: [<00000000bee98654>]
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
  #1:  (&ldata->atomic_read_lock){+.+.}, at: [<00000000c1d180aa>]
n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131
3 locks held by kworker/0:5/6494:
  #0:  ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
[<00000000a062b18e>] work_static include/linux/workqueue.h:198 [inline]
  #0:  ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
[<00000000a062b18e>] set_work_data kernel/workqueue.c:619 [inline]
  #0:  ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
[<00000000a062b18e>] set_work_pool_and_clear_pending kernel/workqueue.c:646
[inline]
  #0:  ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
[<00000000a062b18e>] process_one_work+0xb12/0x1bb0 kernel/workqueue.c:2084
  #1:  ((addr_chk_work).work){+.+.}, at: [<00000000278427d5>]
process_one_work+0xb89/0x1bb0 kernel/workqueue.c:2088
  #2:  (rtnl_mutex){+.+.}, at: [<00000000066e35ac>] rtnl_lock+0x17/0x20
net/core/rtnetlink.c:74
1 lock held by syz-executor7/25421:
  #0:  (ipvs->sync_mutex){+.+.}, at: [<00000000d414a689>]
do_ip_vs_set_ctl+0x277/0x1cc0 net/netfilter/ipvs/ip_vs_ctl.c:2393
2 locks held by syz-executor7/25427:
  #0:  (rtnl_mutex){+.+.}, at: [<00000000066e35ac>] rtnl_lock+0x17/0x20
net/core/rtnetlink.c:74
  #1:  (ipvs->sync_mutex){+.+.}, at: [<00000000e6d48489>]
do_ip_vs_set_ctl+0x10f8/0x1cc0 net/netfilter/ipvs/ip_vs_ctl.c:2388
1 lock held by syz-executor7/25435:
  #0:  (rtnl_mutex){+.+.}, at: [<00000000066e35ac>] rtnl_lock+0x17/0x20
net/core/rtnetlink.c:74
1 lock held by ipvs-b:2:0/25415:
  #0:  (rtnl_mutex){+.+.}, at: [<00000000066e35ac>] rtnl_lock+0x17/0x20
net/core/rtnetlink.c:74

Reported-and-tested-by: syzbot+a46d6abf9d56b1365a72@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+5fe074c01b2032ce9618@syzkaller.appspotmail.com
Fixes: e0b26cc997 ("ipvs: call rtnl_lock early")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-09 17:05:27 +02:00
Florian Westphal
876c27314c netfilter: nf_conntrack_sip: allow duplicate SDP expectations
Callum Sinclair reported SIP IP Phone errors that he tracked down to
such phones sending session descriptions for different media types but
with same port numbers.

The expect core will only 'refresh' existing expectation if it is
from same master AND same expectation class (media type).
As expectation class is different, we get an error.

The SIP connection tracking code will then

1). drop the SDP packet
2). if an rtp expectation was already installed successfully,
    error on rtcp expectation will cancel the rtp one.

Make the expect core report back to caller when the conflict is due
to different expectation class and have SIP tracker ignore soft-error.

Reported-by: Callum Sinclair <Callum.Sinclair@alliedtelesis.co.nz>
Tested-by: Callum Sinclair <Callum.Sinclair@alliedtelesis.co.nz>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-09 17:05:27 +02:00
Eric Dumazet
b6a37e5e25 inetpeer: fix uninit-value in inet_getpeer
syzbot/KMSAN reported that p->dtime was read while it was
not yet initialized in :

	delta = (__u32)jiffies - p->dtime;
	if (delta < ttl || !refcount_dec_if_one(&p->refcnt))
		gc_stack[i] = NULL;

This is a false positive, because the inetpeer wont be erased
from rb-tree if the refcount_dec_if_one(&p->refcnt) does not
succeed. And this wont happen before first inet_putpeer() call
for this inetpeer has been done, and ->dtime field is written
exactly before the refcount_dec_and_test(&p->refcnt).

The KMSAN report was :

BUG: KMSAN: uninit-value in inet_peer_gc net/ipv4/inetpeer.c:163 [inline]
BUG: KMSAN: uninit-value in inet_getpeer+0x1567/0x1e70 net/ipv4/inetpeer.c:228
CPU: 0 PID: 9494 Comm: syz-executor5 Not tainted 4.16.0+ #82
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x185/0x1d0 lib/dump_stack.c:53
 kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
 __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
 inet_peer_gc net/ipv4/inetpeer.c:163 [inline]
 inet_getpeer+0x1567/0x1e70 net/ipv4/inetpeer.c:228
 inet_getpeer_v4 include/net/inetpeer.h:110 [inline]
 icmpv4_xrlim_allow net/ipv4/icmp.c:330 [inline]
 icmp_send+0x2b44/0x3050 net/ipv4/icmp.c:725
 ip_options_compile+0x237c/0x29f0 net/ipv4/ip_options.c:472
 ip_rcv_options net/ipv4/ip_input.c:284 [inline]
 ip_rcv_finish+0xda8/0x16d0 net/ipv4/ip_input.c:365
 NF_HOOK include/linux/netfilter.h:288 [inline]
 ip_rcv+0x119d/0x16f0 net/ipv4/ip_input.c:493
 __netif_receive_skb_core+0x47cf/0x4a80 net/core/dev.c:4562
 __netif_receive_skb net/core/dev.c:4627 [inline]
 netif_receive_skb_internal+0x49d/0x630 net/core/dev.c:4701
 netif_receive_skb+0x230/0x240 net/core/dev.c:4725
 tun_rx_batched drivers/net/tun.c:1555 [inline]
 tun_get_user+0x6d88/0x7580 drivers/net/tun.c:1962
 tun_chr_write_iter+0x1d4/0x330 drivers/net/tun.c:1990
 do_iter_readv_writev+0x7bb/0x970 include/linux/fs.h:1776
 do_iter_write+0x30d/0xd40 fs/read_write.c:932
 vfs_writev fs/read_write.c:977 [inline]
 do_writev+0x3c9/0x830 fs/read_write.c:1012
 SYSC_writev+0x9b/0xb0 fs/read_write.c:1085
 SyS_writev+0x56/0x80 fs/read_write.c:1082
 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x455111
RSP: 002b:00007fae0365cba0 EFLAGS: 00000293 ORIG_RAX: 0000000000000014
RAX: ffffffffffffffda RBX: 000000000000002e RCX: 0000000000455111
RDX: 0000000000000001 RSI: 00007fae0365cbf0 RDI: 00000000000000fc
RBP: 0000000020000040 R08: 00000000000000fc R09: 0000000000000000
R10: 000000000000002e R11: 0000000000000293 R12: 00000000ffffffff
R13: 0000000000000658 R14: 00000000006fc8e0 R15: 0000000000000000

Uninit was created at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
 kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188
 kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314
 kmem_cache_alloc+0xaab/0xb90 mm/slub.c:2756
 inet_getpeer+0xed8/0x1e70 net/ipv4/inetpeer.c:210
 inet_getpeer_v4 include/net/inetpeer.h:110 [inline]
 ip4_frag_init+0x4d1/0x740 net/ipv4/ip_fragment.c:153
 inet_frag_alloc net/ipv4/inet_fragment.c:369 [inline]
 inet_frag_create net/ipv4/inet_fragment.c:385 [inline]
 inet_frag_find+0x7da/0x1610 net/ipv4/inet_fragment.c:418
 ip_find net/ipv4/ip_fragment.c:275 [inline]
 ip_defrag+0x448/0x67a0 net/ipv4/ip_fragment.c:676
 ip_check_defrag+0x775/0xda0 net/ipv4/ip_fragment.c:724
 packet_rcv_fanout+0x2a8/0x8d0 net/packet/af_packet.c:1447
 deliver_skb net/core/dev.c:1897 [inline]
 deliver_ptype_list_skb net/core/dev.c:1912 [inline]
 __netif_receive_skb_core+0x314a/0x4a80 net/core/dev.c:4545
 __netif_receive_skb net/core/dev.c:4627 [inline]
 netif_receive_skb_internal+0x49d/0x630 net/core/dev.c:4701
 netif_receive_skb+0x230/0x240 net/core/dev.c:4725
 tun_rx_batched drivers/net/tun.c:1555 [inline]
 tun_get_user+0x6d88/0x7580 drivers/net/tun.c:1962
 tun_chr_write_iter+0x1d4/0x330 drivers/net/tun.c:1990
 do_iter_readv_writev+0x7bb/0x970 include/linux/fs.h:1776
 do_iter_write+0x30d/0xd40 fs/read_write.c:932
 vfs_writev fs/read_write.c:977 [inline]
 do_writev+0x3c9/0x830 fs/read_write.c:1012
 SYSC_writev+0x9b/0xb0 fs/read_write.c:1085
 SyS_writev+0x56/0x80 fs/read_write.c:1082
 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-09 10:57:35 -04:00
Vincent Bernat
9a17740e0e ipvs: fix multiplicative hashing in sh/dh/lblc/lblcr algorithms
The sh/dh/lblc/lblcr algorithms are using Knuth's multiplicative
hashing incorrectly. Replace its use by the hash_32() macro, which
correctly implements this algorithm. It doesn't use the same constant,
but it shouldn't matter.

Signed-off-by: Vincent Bernat <vincent@bernat.im>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2018-04-09 10:15:27 +03:00
Inju Song
30edf801d7 netfilter: ipvs: Add configurations of Maglev hashing
To build the maglev hashing scheduler, add some configuration
to Kconfig and Makefile.

 - The compile configurations of MH are added to the Kconfig.

 - The MH build rule is added to the Makefile.

Signed-off-by: Inju Song <inju.song@navercorp.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2018-04-09 10:11:18 +03:00
Inju Song
039f32e8cd netfilter: ipvs: Add Maglev hashing scheduler
Implements the Google's Maglev hashing algorithm as a IPVS scheduler.

Basically it provides consistent hashing but offers some special
features about disruption and load balancing.

 1) minimal disruption: when the set of destinations changes,
    a connection will likely be sent to the same destination
    as it was before.

 2) load balancing: each destination will receive an almost
    equal number of connections.

Seel also for detail: [3.4 Consistent Hasing] in
https://www.usenix.org/system/files/conference/nsdi16/nsdi16-paper-eisenbud.pdf

Signed-off-by: Inju Song <inju.song@navercorp.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2018-04-09 10:10:57 +03:00
Inju Song
a2c09ac0fb netfilter: ipvs: Keep latest weight of destination
The hashing table in scheduler such as source hash or maglev hash
should ignore the changed weight to 0 and allow changing the weight
from/to non-0 values. So, struct ip_vs_dest needs to keep weight
with latest non-0 weight.

Signed-off-by: Inju Song <inju.song@navercorp.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2018-04-09 10:10:55 +03:00
Arvind Yadav
535101ec88 netfilter: ipvs: Fix space before '[' error.
Fix checkpatch.pl error:
ERROR: space prohibited before open square bracket '['.

Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
2018-04-09 10:06:59 +03:00
Kevin Easton
4b66af2d63 af_key: Always verify length of provided sadb_key
Key extensions (struct sadb_key) include a user-specified number of key
bits.  The kernel uses that number to determine how much key data to copy
out of the message in pfkey_msg2xfrm_state().

The length of the sadb_key message must be verified to be long enough,
even in the case of SADB_X_AALG_NULL.  Furthermore, the sadb_key_len value
must be long enough to include both the key data and the struct sadb_key
itself.

Introduce a helper function verify_key_len(), and call it from
parse_exthdrs() where other exthdr types are similarly checked for
correctness.

Signed-off-by: Kevin Easton <kevin@guarana.org>
Reported-by: syzbot+5022a34ca5a3d49b84223653fab632dfb7b4cf37@syzkaller.appspotmail.com
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-04-09 07:06:38 +02:00
David S. Miller
4c7c12e0c9 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Johan Hedberg says:

====================
pull request: bluetooth 2018-04-08

Here's one important Bluetooth fix for the 4.17-rc series that's needed
to pass several Bluetooth qualification test cases.

Let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-08 17:19:15 -04:00
Jiri Pirko
fc56be47da devlink: convert occ_get op to separate registration
This resolves race during initialization where the resources with
ops are registered before driver and the structures used by occ_get
op is initialized. So keep occ_get callbacks registered only when
all structs are initialized.

The example flows, as it is in mlxsw:
1) driver load/asic probe:
   mlxsw_core
      -> mlxsw_sp_resources_register
        -> mlxsw_sp_kvdl_resources_register
          -> devlink_resource_register IDX
   mlxsw_spectrum
      -> mlxsw_sp_kvdl_init
        -> mlxsw_sp_kvdl_parts_init
          -> mlxsw_sp_kvdl_part_init
            -> devlink_resource_size_get IDX (to get the current setup
                                              size from devlink)
        -> devlink_resource_occ_get_register IDX (register current
                                                  occupancy getter)
2) reload triggered by devlink command:
  -> mlxsw_devlink_core_bus_device_reload
    -> mlxsw_sp_fini
      -> mlxsw_sp_kvdl_fini
	-> devlink_resource_occ_get_unregister IDX
    (struct mlxsw_sp *mlxsw_sp is freed at this point, call to occ get
     which is using mlxsw_sp would cause use-after free)
    -> mlxsw_sp_init
      -> mlxsw_sp_kvdl_init
        -> mlxsw_sp_kvdl_parts_init
          -> mlxsw_sp_kvdl_part_init
            -> devlink_resource_size_get IDX (to get the current setup
                                              size from devlink)
        -> devlink_resource_occ_get_register IDX (register current
                                                  occupancy getter)

Fixes: d9f9b9a4d0 ("devlink: Add support for resource abstraction")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-08 12:45:57 -04:00
Cong Wang
e41f054847 tipc: use the right skb in tipc_sk_fill_sock_diag()
Commit 4b2e6877b8 ("tipc: Fix namespace violation in tipc_sk_fill_sock_diag")
tried to fix the crash but failed, the crash is still 100% reproducible
with it.

In tipc_sk_fill_sock_diag(), skb is the diag dump we are filling, it is not
correct to retrieve its NETLINK_CB(), instead, like other protocol diag,
we should use NETLINK_CB(cb->skb).sk here.

Reported-by: <syzbot+326e587eff1074657718@syzkaller.appspotmail.com>
Fixes: 4b2e6877b8 ("tipc: Fix namespace violation in tipc_sk_fill_sock_diag")
Fixes: c30b70deb5 (tipc: implement socket diagnostics for AF_TIPC)
Cc: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com>
Cc: Jon Maloy <jon.maloy@ericsson.com>
Cc: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-08 12:34:29 -04:00
Eric Dumazet
81e9837029 sctp: sctp_sockaddr_af must check minimal addr length for AF_INET6
Check must happen before call to ipv6_addr_v4mapped()

syzbot report was :

BUG: KMSAN: uninit-value in sctp_sockaddr_af net/sctp/socket.c:359 [inline]
BUG: KMSAN: uninit-value in sctp_do_bind+0x60f/0xdc0 net/sctp/socket.c:384
CPU: 0 PID: 3576 Comm: syzkaller968804 Not tainted 4.16.0+ #82
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x185/0x1d0 lib/dump_stack.c:53
 kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
 __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
 sctp_sockaddr_af net/sctp/socket.c:359 [inline]
 sctp_do_bind+0x60f/0xdc0 net/sctp/socket.c:384
 sctp_bind+0x149/0x190 net/sctp/socket.c:332
 inet6_bind+0x1fd/0x1820 net/ipv6/af_inet6.c:293
 SYSC_bind+0x3f2/0x4b0 net/socket.c:1474
 SyS_bind+0x54/0x80 net/socket.c:1460
 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x43fd49
RSP: 002b:00007ffe99df3d28 EFLAGS: 00000213 ORIG_RAX: 0000000000000031
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 000000000043fd49
RDX: 0000000000000010 RSI: 0000000020000000 RDI: 0000000000000003
RBP: 00000000006ca018 R08: 00000000004002c8 R09: 00000000004002c8
R10: 00000000004002c8 R11: 0000000000000213 R12: 0000000000401670
R13: 0000000000401700 R14: 0000000000000000 R15: 0000000000000000

Local variable description: ----address@SYSC_bind
Variable was created at:
 SYSC_bind+0x6f/0x4b0 net/socket.c:1461
 SyS_bind+0x54/0x80 net/socket.c:1460

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-08 12:29:41 -04:00
Andrew Lunn
fc5f33768c net: dsa: Discard frames from unused ports
The Marvell switches under some conditions will pass a frame to the
host with the port being the CPU port. Such frames are invalid, and
should be dropped. Not dropping them can result in a crash when
incrementing the receive statistics for an invalid port.

Reported-by: Chris Healy <cphealy@gmail.com>
Fixes: 91da11f870 ("net: Distributed Switch Architecture protocol support")
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-08 10:34:49 -04:00
Eric Dumazet
6780db244d sctp: do not leak kernel memory to user space
syzbot produced a nice report [1]

Issue here is that a recvmmsg() managed to leak 8 bytes of kernel memory
to user space, because sin_zero (padding field) was not properly cleared.

[1]
BUG: KMSAN: uninit-value in copy_to_user include/linux/uaccess.h:184 [inline]
BUG: KMSAN: uninit-value in move_addr_to_user+0x32e/0x530 net/socket.c:227
CPU: 1 PID: 3586 Comm: syzkaller481044 Not tainted 4.16.0+ #82
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x185/0x1d0 lib/dump_stack.c:53
 kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
 kmsan_internal_check_memory+0x164/0x1d0 mm/kmsan/kmsan.c:1176
 kmsan_copy_to_user+0x69/0x160 mm/kmsan/kmsan.c:1199
 copy_to_user include/linux/uaccess.h:184 [inline]
 move_addr_to_user+0x32e/0x530 net/socket.c:227
 ___sys_recvmsg+0x4e2/0x810 net/socket.c:2211
 __sys_recvmmsg+0x54e/0xdb0 net/socket.c:2313
 SYSC_recvmmsg+0x29b/0x3e0 net/socket.c:2394
 SyS_recvmmsg+0x76/0xa0 net/socket.c:2378
 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x4401c9
RSP: 002b:00007ffc56f73098 EFLAGS: 00000217 ORIG_RAX: 000000000000012b
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004401c9
RDX: 0000000000000001 RSI: 0000000020003ac0 RDI: 0000000000000003
RBP: 00000000006ca018 R08: 0000000020003bc0 R09: 0000000000000010
R10: 0000000000000000 R11: 0000000000000217 R12: 0000000000401af0
R13: 0000000000401b80 R14: 0000000000000000 R15: 0000000000000000

Local variable description: ----addr@___sys_recvmsg
Variable was created at:
 ___sys_recvmsg+0xd5/0x810 net/socket.c:2172
 __sys_recvmmsg+0x54e/0xdb0 net/socket.c:2313

Bytes 8-15 of 16 are uninitialized

==================================================================
Kernel panic - not syncing: panic_on_warn set ...

CPU: 1 PID: 3586 Comm: syzkaller481044 Tainted: G    B            4.16.0+ #82
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x185/0x1d0 lib/dump_stack.c:53
 panic+0x39d/0x940 kernel/panic.c:183
 kmsan_report+0x238/0x240 mm/kmsan/kmsan.c:1083
 kmsan_internal_check_memory+0x164/0x1d0 mm/kmsan/kmsan.c:1176
 kmsan_copy_to_user+0x69/0x160 mm/kmsan/kmsan.c:1199
 copy_to_user include/linux/uaccess.h:184 [inline]
 move_addr_to_user+0x32e/0x530 net/socket.c:227
 ___sys_recvmsg+0x4e2/0x810 net/socket.c:2211
 __sys_recvmmsg+0x54e/0xdb0 net/socket.c:2313
 SYSC_recvmmsg+0x29b/0x3e0 net/socket.c:2394
 SyS_recvmmsg+0x76/0xa0 net/socket.c:2378
 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc:	Vlad Yasevich <vyasevich@gmail.com>
Cc:	Neil Horman <nhorman@tuxdriver.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-08 10:31:51 -04:00
Eric Dumazet
3099a52918 soreuseport: initialise timewait reuseport field
syzbot reported an uninit-value in inet_csk_bind_conflict() [1]

It turns out we never propagated sk->sk_reuseport into timewait socket.

[1]
BUG: KMSAN: uninit-value in inet_csk_bind_conflict+0x5f9/0x990 net/ipv4/inet_connection_sock.c:151
CPU: 1 PID: 3589 Comm: syzkaller008242 Not tainted 4.16.0+ #82
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x185/0x1d0 lib/dump_stack.c:53
 kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
 __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
 inet_csk_bind_conflict+0x5f9/0x990 net/ipv4/inet_connection_sock.c:151
 inet_csk_get_port+0x1d28/0x1e40 net/ipv4/inet_connection_sock.c:320
 inet6_bind+0x121c/0x1820 net/ipv6/af_inet6.c:399
 SYSC_bind+0x3f2/0x4b0 net/socket.c:1474
 SyS_bind+0x54/0x80 net/socket.c:1460
 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x4416e9
RSP: 002b:00007ffce6d15c88 EFLAGS: 00000217 ORIG_RAX: 0000000000000031
RAX: ffffffffffffffda RBX: 0100000000000000 RCX: 00000000004416e9
RDX: 000000000000001c RSI: 0000000020402000 RDI: 0000000000000004
RBP: 0000000000000000 R08: 00000000e6d15e08 R09: 00000000e6d15e08
R10: 0000000000000004 R11: 0000000000000217 R12: 0000000000009478
R13: 00000000006cd448 R14: 0000000000000000 R15: 0000000000000000

Uninit was stored to memory at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
 kmsan_save_stack mm/kmsan/kmsan.c:293 [inline]
 kmsan_internal_chain_origin+0x12b/0x210 mm/kmsan/kmsan.c:684
 __msan_chain_origin+0x69/0xc0 mm/kmsan/kmsan_instr.c:521
 tcp_time_wait+0xf17/0xf50 net/ipv4/tcp_minisocks.c:283
 tcp_rcv_state_process+0xebe/0x6490 net/ipv4/tcp_input.c:6003
 tcp_v6_do_rcv+0x11dd/0x1d90 net/ipv6/tcp_ipv6.c:1331
 sk_backlog_rcv include/net/sock.h:908 [inline]
 __release_sock+0x2d6/0x680 net/core/sock.c:2271
 release_sock+0x97/0x2a0 net/core/sock.c:2786
 tcp_close+0x277/0x18f0 net/ipv4/tcp.c:2269
 inet_release+0x240/0x2a0 net/ipv4/af_inet.c:427
 inet6_release+0xaf/0x100 net/ipv6/af_inet6.c:435
 sock_release net/socket.c:595 [inline]
 sock_close+0xe0/0x300 net/socket.c:1149
 __fput+0x49e/0xa10 fs/file_table.c:209
 ____fput+0x37/0x40 fs/file_table.c:243
 task_work_run+0x243/0x2c0 kernel/task_work.c:113
 exit_task_work include/linux/task_work.h:22 [inline]
 do_exit+0x10e1/0x38d0 kernel/exit.c:867
 do_group_exit+0x1a0/0x360 kernel/exit.c:970
 SYSC_exit_group+0x21/0x30 kernel/exit.c:981
 SyS_exit_group+0x25/0x30 kernel/exit.c:979
 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Uninit was stored to memory at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
 kmsan_save_stack mm/kmsan/kmsan.c:293 [inline]
 kmsan_internal_chain_origin+0x12b/0x210 mm/kmsan/kmsan.c:684
 __msan_chain_origin+0x69/0xc0 mm/kmsan/kmsan_instr.c:521
 inet_twsk_alloc+0xaef/0xc00 net/ipv4/inet_timewait_sock.c:182
 tcp_time_wait+0xd9/0xf50 net/ipv4/tcp_minisocks.c:258
 tcp_rcv_state_process+0xebe/0x6490 net/ipv4/tcp_input.c:6003
 tcp_v6_do_rcv+0x11dd/0x1d90 net/ipv6/tcp_ipv6.c:1331
 sk_backlog_rcv include/net/sock.h:908 [inline]
 __release_sock+0x2d6/0x680 net/core/sock.c:2271
 release_sock+0x97/0x2a0 net/core/sock.c:2786
 tcp_close+0x277/0x18f0 net/ipv4/tcp.c:2269
 inet_release+0x240/0x2a0 net/ipv4/af_inet.c:427
 inet6_release+0xaf/0x100 net/ipv6/af_inet6.c:435
 sock_release net/socket.c:595 [inline]
 sock_close+0xe0/0x300 net/socket.c:1149
 __fput+0x49e/0xa10 fs/file_table.c:209
 ____fput+0x37/0x40 fs/file_table.c:243
 task_work_run+0x243/0x2c0 kernel/task_work.c:113
 exit_task_work include/linux/task_work.h:22 [inline]
 do_exit+0x10e1/0x38d0 kernel/exit.c:867
 do_group_exit+0x1a0/0x360 kernel/exit.c:970
 SYSC_exit_group+0x21/0x30 kernel/exit.c:981
 SyS_exit_group+0x25/0x30 kernel/exit.c:979
 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Uninit was created at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
 kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188
 kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314
 kmem_cache_alloc+0xaab/0xb90 mm/slub.c:2756
 inet_twsk_alloc+0x13b/0xc00 net/ipv4/inet_timewait_sock.c:163
 tcp_time_wait+0xd9/0xf50 net/ipv4/tcp_minisocks.c:258
 tcp_rcv_state_process+0xebe/0x6490 net/ipv4/tcp_input.c:6003
 tcp_v6_do_rcv+0x11dd/0x1d90 net/ipv6/tcp_ipv6.c:1331
 sk_backlog_rcv include/net/sock.h:908 [inline]
 __release_sock+0x2d6/0x680 net/core/sock.c:2271
 release_sock+0x97/0x2a0 net/core/sock.c:2786
 tcp_close+0x277/0x18f0 net/ipv4/tcp.c:2269
 inet_release+0x240/0x2a0 net/ipv4/af_inet.c:427
 inet6_release+0xaf/0x100 net/ipv6/af_inet6.c:435
 sock_release net/socket.c:595 [inline]
 sock_close+0xe0/0x300 net/socket.c:1149
 __fput+0x49e/0xa10 fs/file_table.c:209
 ____fput+0x37/0x40 fs/file_table.c:243
 task_work_run+0x243/0x2c0 kernel/task_work.c:113
 exit_task_work include/linux/task_work.h:22 [inline]
 do_exit+0x10e1/0x38d0 kernel/exit.c:867
 do_group_exit+0x1a0/0x360 kernel/exit.c:970
 SYSC_exit_group+0x21/0x30 kernel/exit.c:981
 SyS_exit_group+0x25/0x30 kernel/exit.c:979
 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Fixes: da5e36308d ("soreuseport: TCP/IPv4 implementation")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-07 22:32:32 -04:00
Eric Dumazet
d0ea2b1250 ipv4: fix uninit-value in ip_route_output_key_hash_rcu()
syzbot complained that res.type could be used while not initialized.

Using RTN_UNSPEC as initial value seems better than using garbage.

BUG: KMSAN: uninit-value in __mkroute_output net/ipv4/route.c:2200 [inline]
BUG: KMSAN: uninit-value in ip_route_output_key_hash_rcu+0x31f0/0x3940 net/ipv4/route.c:2493
CPU: 1 PID: 12207 Comm: syz-executor0 Not tainted 4.16.0+ #81
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x185/0x1d0 lib/dump_stack.c:53
 kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
 __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
 __mkroute_output net/ipv4/route.c:2200 [inline]
 ip_route_output_key_hash_rcu+0x31f0/0x3940 net/ipv4/route.c:2493
 ip_route_output_key_hash net/ipv4/route.c:2322 [inline]
 __ip_route_output_key include/net/route.h:126 [inline]
 ip_route_output_flow+0x1eb/0x3c0 net/ipv4/route.c:2577
 raw_sendmsg+0x1861/0x3ed0 net/ipv4/raw.c:653
 inet_sendmsg+0x48d/0x740 net/ipv4/af_inet.c:764
 sock_sendmsg_nosec net/socket.c:630 [inline]
 sock_sendmsg net/socket.c:640 [inline]
 SYSC_sendto+0x6c3/0x7e0 net/socket.c:1747
 SyS_sendto+0x8a/0xb0 net/socket.c:1715
 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x455259
RSP: 002b:00007fdc0625dc68 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 00007fdc0625e6d4 RCX: 0000000000455259
RDX: 0000000000000000 RSI: 0000000020000040 RDI: 0000000000000013
RBP: 000000000072bea0 R08: 0000000020000080 R09: 0000000000000010
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
R13: 00000000000004f7 R14: 00000000006fa7c8 R15: 0000000000000000

Local variable description: ----res.i.i@ip_route_output_flow
Variable was created at:
 ip_route_output_flow+0x75/0x3c0 net/ipv4/route.c:2576
 raw_sendmsg+0x1861/0x3ed0 net/ipv4/raw.c:653

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-07 22:32:31 -04:00
Eric Dumazet
b855ff8274 dccp: initialize ireq->ir_mark
syzbot reported an uninit-value read of skb->mark in iptable_mangle_hook()

Thanks to the nice report, I tracked the problem to dccp not caring
of ireq->ir_mark for passive sessions.

BUG: KMSAN: uninit-value in ipt_mangle_out net/ipv4/netfilter/iptable_mangle.c:66 [inline]
BUG: KMSAN: uninit-value in iptable_mangle_hook+0x5e5/0x720 net/ipv4/netfilter/iptable_mangle.c:84
CPU: 0 PID: 5300 Comm: syz-executor3 Not tainted 4.16.0+ #81
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x185/0x1d0 lib/dump_stack.c:53
 kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
 __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
 ipt_mangle_out net/ipv4/netfilter/iptable_mangle.c:66 [inline]
 iptable_mangle_hook+0x5e5/0x720 net/ipv4/netfilter/iptable_mangle.c:84
 nf_hook_entry_hookfn include/linux/netfilter.h:120 [inline]
 nf_hook_slow+0x158/0x3d0 net/netfilter/core.c:483
 nf_hook include/linux/netfilter.h:243 [inline]
 __ip_local_out net/ipv4/ip_output.c:113 [inline]
 ip_local_out net/ipv4/ip_output.c:122 [inline]
 ip_queue_xmit+0x1d21/0x21c0 net/ipv4/ip_output.c:504
 dccp_transmit_skb+0x15eb/0x1900 net/dccp/output.c:142
 dccp_xmit_packet+0x814/0x9e0 net/dccp/output.c:281
 dccp_write_xmit+0x20f/0x480 net/dccp/output.c:363
 dccp_sendmsg+0x12ca/0x12d0 net/dccp/proto.c:818
 inet_sendmsg+0x48d/0x740 net/ipv4/af_inet.c:764
 sock_sendmsg_nosec net/socket.c:630 [inline]
 sock_sendmsg net/socket.c:640 [inline]
 ___sys_sendmsg+0xec0/0x1310 net/socket.c:2046
 __sys_sendmsg net/socket.c:2080 [inline]
 SYSC_sendmsg+0x2a3/0x3d0 net/socket.c:2091
 SyS_sendmsg+0x54/0x80 net/socket.c:2087
 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x455259
RSP: 002b:00007f1a4473dc68 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007f1a4473e6d4 RCX: 0000000000455259
RDX: 0000000000000000 RSI: 0000000020b76fc8 RDI: 0000000000000015
RBP: 000000000072bea0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
R13: 00000000000004f0 R14: 00000000006fa720 R15: 0000000000000000

Uninit was stored to memory at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
 kmsan_save_stack mm/kmsan/kmsan.c:293 [inline]
 kmsan_internal_chain_origin+0x12b/0x210 mm/kmsan/kmsan.c:684
 __msan_chain_origin+0x69/0xc0 mm/kmsan/kmsan_instr.c:521
 ip_queue_xmit+0x1e35/0x21c0 net/ipv4/ip_output.c:502
 dccp_transmit_skb+0x15eb/0x1900 net/dccp/output.c:142
 dccp_xmit_packet+0x814/0x9e0 net/dccp/output.c:281
 dccp_write_xmit+0x20f/0x480 net/dccp/output.c:363
 dccp_sendmsg+0x12ca/0x12d0 net/dccp/proto.c:818
 inet_sendmsg+0x48d/0x740 net/ipv4/af_inet.c:764
 sock_sendmsg_nosec net/socket.c:630 [inline]
 sock_sendmsg net/socket.c:640 [inline]
 ___sys_sendmsg+0xec0/0x1310 net/socket.c:2046
 __sys_sendmsg net/socket.c:2080 [inline]
 SYSC_sendmsg+0x2a3/0x3d0 net/socket.c:2091
 SyS_sendmsg+0x54/0x80 net/socket.c:2087
 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Uninit was stored to memory at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
 kmsan_save_stack mm/kmsan/kmsan.c:293 [inline]
 kmsan_internal_chain_origin+0x12b/0x210 mm/kmsan/kmsan.c:684
 __msan_chain_origin+0x69/0xc0 mm/kmsan/kmsan_instr.c:521
 inet_csk_clone_lock+0x503/0x580 net/ipv4/inet_connection_sock.c:797
 dccp_create_openreq_child+0x7f/0x890 net/dccp/minisocks.c:92
 dccp_v4_request_recv_sock+0x22c/0xe90 net/dccp/ipv4.c:408
 dccp_v6_request_recv_sock+0x290/0x2000 net/dccp/ipv6.c:414
 dccp_check_req+0x7b9/0x8f0 net/dccp/minisocks.c:197
 dccp_v4_rcv+0x12e4/0x2630 net/dccp/ipv4.c:840
 ip_local_deliver_finish+0x6ed/0xd40 net/ipv4/ip_input.c:216
 NF_HOOK include/linux/netfilter.h:288 [inline]
 ip_local_deliver+0x43c/0x4e0 net/ipv4/ip_input.c:257
 dst_input include/net/dst.h:449 [inline]
 ip_rcv_finish+0x1253/0x16d0 net/ipv4/ip_input.c:397
 NF_HOOK include/linux/netfilter.h:288 [inline]
 ip_rcv+0x119d/0x16f0 net/ipv4/ip_input.c:493
 __netif_receive_skb_core+0x47cf/0x4a80 net/core/dev.c:4562
 __netif_receive_skb net/core/dev.c:4627 [inline]
 process_backlog+0x62d/0xe20 net/core/dev.c:5307
 napi_poll net/core/dev.c:5705 [inline]
 net_rx_action+0x7c1/0x1a70 net/core/dev.c:5771
 __do_softirq+0x56d/0x93d kernel/softirq.c:285
Uninit was created at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
 kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188
 kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314
 kmem_cache_alloc+0xaab/0xb90 mm/slub.c:2756
 reqsk_alloc include/net/request_sock.h:88 [inline]
 inet_reqsk_alloc+0xc4/0x7f0 net/ipv4/tcp_input.c:6145
 dccp_v4_conn_request+0x5cc/0x1770 net/dccp/ipv4.c:600
 dccp_v6_conn_request+0x299/0x1880 net/dccp/ipv6.c:317
 dccp_rcv_state_process+0x2ea/0x2410 net/dccp/input.c:612
 dccp_v4_do_rcv+0x229/0x340 net/dccp/ipv4.c:682
 dccp_v6_do_rcv+0x16d/0x1220 net/dccp/ipv6.c:578
 sk_backlog_rcv include/net/sock.h:908 [inline]
 __sk_receive_skb+0x60e/0xf20 net/core/sock.c:513
 dccp_v4_rcv+0x24d4/0x2630 net/dccp/ipv4.c:874
 ip_local_deliver_finish+0x6ed/0xd40 net/ipv4/ip_input.c:216
 NF_HOOK include/linux/netfilter.h:288 [inline]
 ip_local_deliver+0x43c/0x4e0 net/ipv4/ip_input.c:257
 dst_input include/net/dst.h:449 [inline]
 ip_rcv_finish+0x1253/0x16d0 net/ipv4/ip_input.c:397
 NF_HOOK include/linux/netfilter.h:288 [inline]
 ip_rcv+0x119d/0x16f0 net/ipv4/ip_input.c:493
 __netif_receive_skb_core+0x47cf/0x4a80 net/core/dev.c:4562
 __netif_receive_skb net/core/dev.c:4627 [inline]
 process_backlog+0x62d/0xe20 net/core/dev.c:5307
 napi_poll net/core/dev.c:5705 [inline]
 net_rx_action+0x7c1/0x1a70 net/core/dev.c:5771
 __do_softirq+0x56d/0x93d kernel/softirq.c:285

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-07 22:32:31 -04:00
Eric Dumazet
77d36398d9 net: fix uninit-value in __hw_addr_add_ex()
syzbot complained :

BUG: KMSAN: uninit-value in memcmp+0x119/0x180 lib/string.c:861
CPU: 0 PID: 3 Comm: kworker/0:0 Not tainted 4.16.0+ #82
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: ipv6_addrconf addrconf_dad_work
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x185/0x1d0 lib/dump_stack.c:53
 kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
 __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
 memcmp+0x119/0x180 lib/string.c:861
 __hw_addr_add_ex net/core/dev_addr_lists.c:60 [inline]
 __dev_mc_add+0x1c2/0x8e0 net/core/dev_addr_lists.c:670
 dev_mc_add+0x6d/0x80 net/core/dev_addr_lists.c:687
 igmp6_group_added+0x2db/0xa00 net/ipv6/mcast.c:662
 ipv6_dev_mc_inc+0xe9e/0x1130 net/ipv6/mcast.c:914
 addrconf_join_solict net/ipv6/addrconf.c:2078 [inline]
 addrconf_dad_begin net/ipv6/addrconf.c:3828 [inline]
 addrconf_dad_work+0x427/0x2150 net/ipv6/addrconf.c:3954
 process_one_work+0x12c6/0x1f60 kernel/workqueue.c:2113
 worker_thread+0x113c/0x24f0 kernel/workqueue.c:2247
 kthread+0x539/0x720 kernel/kthread.c:239

Fixes: f001fde5ea ("net: introduce a list of device addresses dev_addr_list (v6)")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-07 22:32:31 -04:00
Eric Dumazet
b13dda9f9a net: initialize skb->peeked when cloning
syzbot reported __skb_try_recv_from_queue() was using skb->peeked
while it was potentially unitialized.

We need to clear it in __skb_clone()

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-07 22:32:31 -04:00
Eric Dumazet
6091f09c2f netlink: fix uninit-value in netlink_sendmsg
syzbot reported :

BUG: KMSAN: uninit-value in ffs arch/x86/include/asm/bitops.h:432 [inline]
BUG: KMSAN: uninit-value in netlink_sendmsg+0xb26/0x1310 net/netlink/af_netlink.c:1851

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-07 22:32:31 -04:00
Cong Wang
f12c643209 net_sched: fix a missing idr_remove() in u32_delete_key()
When we delete a u32 key via u32_delete_key(), we forget to
call idr_remove() to remove its handle from IDR.

Fixes: e7614370d6 ("net_sched: use idr to allocate u32 filter handles")
Reported-by: Marcin Kabiesz <admin@hostcenter.eu>
Tested-by: Marcin Kabiesz <admin@hostcenter.eu>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-07 12:36:45 -04:00
Masahiro Yamada
4fa8bc949d kbuild: rename *-asn1.[ch] to *.asn1.[ch]
Our convention is to distinguish file types by suffixes with a period
as a separator.

*-asn1.[ch] is a different pattern from other generated sources such
as *.lex.c, *.tab.[ch], *.dtb.S, etc.  More confusing, files with
'-asn1.[ch]' are generated files, but '_asn1.[ch]' are checked-in
files:
  net/netfilter/nf_conntrack_h323_asn1.c
  include/linux/netfilter/nf_conntrack_h323_asn1.h
  include/linux/sunrpc/gss_asn1.h

Rename generated files to *.asn1.[ch] for consistency.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2018-04-07 19:04:02 +09:00
Masahiro Yamada
3ca3273eaa kbuild: clean up *-asn1.[ch] patterns from top-level Makefile
Clean up these patterns from the top Makefile to omit 'clean-files'
in each Makefile.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2018-04-07 19:04:02 +09:00
Linus Torvalds
9eda2d2dca selinux/stable-4.17 PR 20180403
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEEcQCq365ubpQNLgrWVeRaWujKfIoFAlrD6XoUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQVeRaWujKfIpy9RAAjwhkNBNJhw1UlGggVvst8lzJBdMp
 XxL7cg+1TcZkB12yrghILg+gY4j5PzY4GJo1gvllWIHsT8Ud6cQTI/AzeYR2OfZ3
 mHv3gtyzmHsPGBdqhmgC7R10tpyXFXwDc3VLMtuuDiUl/seFEaJWOMYP7zj+tRil
 XoOCyoV9bb1wb7vNAzQikK8yhz3fu72Y5QOODLfaYeYojMKs8Q8pMZgi68oVQUXk
 SmS2mj0k2P3UqeOSk+8phJQhilm32m0tE0YnLvzAhblJLqeS2DUNnWORP1j4oQ/Q
 aOOu4ZQ9PA1N7VAIGceuf2HZHhnrFzWdvggp2bxegcRSIfUZ84FuZbrj60RUz2ja
 V6GmKYACnyd28TAWdnzjKEd4dc36LSPxnaj8hcrvyO2V34ozVEsvIEIJREoXRUJS
 heJ9HT+VIvmguzRCIPPeC1ZYopIt8M1kTRrszigU80TuZjIP0VJHLGQn/rgRQzuO
 cV5gmJ6TSGn1l54H13koBzgUCo0cAub8Nl+288qek+jLWoHnKwzLB+1HCWuyeCHt
 2q6wdFfenYH0lXdIzCeC7NNHRKCrPNwkZ/32d4ZQf4cu5tAn8bOk8dSHchoAfZG8
 p7N6jPPoxmi2F/GRKrTiUNZvQpyvgX3hjtJS6ljOTSYgRhjeNYeCP8U+BlOpLVQy
 U4KzB9wOAngTEpo=
 =p2Sh
 -----END PGP SIGNATURE-----

Merge tag 'selinux-pr-20180403' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux

Pull SELinux updates from Paul Moore:
 "A bigger than usual pull request for SELinux, 13 patches (lucky!)
  along with a scary looking diffstat.

  Although if you look a bit closer, excluding the usual minor
  tweaks/fixes, there are really only two significant changes in this
  pull request: the addition of proper SELinux access controls for SCTP
  and the encapsulation of a lot of internal SELinux state.

  The SCTP changes are the result of a multi-month effort (maybe even a
  year or longer?) between the SELinux folks and the SCTP folks to add
  proper SELinux controls. A special thanks go to Richard for seeing
  this through and keeping the effort moving forward.

  The state encapsulation work is a bit of janitorial work that came out
  of some early work on SELinux namespacing. The question of namespacing
  is still an open one, but I believe there is some real value in the
  encapsulation work so we've split that out and are now sending that up
  to you"

* tag 'selinux-pr-20180403' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
  selinux: wrap AVC state
  selinux: wrap selinuxfs state
  selinux: fix handling of uninitialized selinux state in get_bools/classes
  selinux: Update SELinux SCTP documentation
  selinux: Fix ltp test connect-syscall failure
  selinux: rename the {is,set}_enforcing() functions
  selinux: wrap global selinux state
  selinux: fix typo in selinux_netlbl_sctp_sk_clone declaration
  selinux: Add SCTP support
  sctp: Add LSM hooks
  sctp: Add ip option support
  security: Add support for SCTP security hooks
  netlabel: If PF_INET6, check sk_buff ip header version
2018-04-06 15:39:26 -07:00
Linus Torvalds
3b54765cca Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - a few misc things

 - ocfs2 updates

 - the v9fs maintainers have been missing for a long time. I've taken
   over v9fs patch slinging.

 - most of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (116 commits)
  mm,oom_reaper: check for MMF_OOM_SKIP before complaining
  mm/ksm: fix interaction with THP
  mm/memblock.c: cast constant ULLONG_MAX to phys_addr_t
  headers: untangle kmemleak.h from mm.h
  include/linux/mmdebug.h: make VM_WARN* non-rvals
  mm/page_isolation.c: make start_isolate_page_range() fail if already isolated
  mm: change return type to vm_fault_t
  mm, oom: remove 3% bonus for CAP_SYS_ADMIN processes
  mm, page_alloc: wakeup kcompactd even if kswapd cannot free more memory
  kernel/fork.c: detect early free of a live mm
  mm: make counting of list_lru_one::nr_items lockless
  mm/swap_state.c: make bool enable_vma_readahead and swap_vma_readahead() static
  block_invalidatepage(): only release page if the full page was invalidated
  mm: kernel-doc: add missing parameter descriptions
  mm/swap.c: remove @cold parameter description for release_pages()
  mm/nommu: remove description of alloc_vm_area
  zram: drop max_zpage_size and use zs_huge_class_size()
  zsmalloc: introduce zs_huge_class_size()
  mm: fix races between swapoff and flush dcache
  fs/direct-io.c: minor cleanups in do_blockdev_direct_IO
  ...
2018-04-06 14:19:26 -07:00
Randy Dunlap
514c603249 headers: untangle kmemleak.h from mm.h
Currently <linux/slab.h> #includes <linux/kmemleak.h> for no obvious
reason.  It looks like it's only a convenience, so remove kmemleak.h
from slab.h and add <linux/kmemleak.h> to any users of kmemleak_* that
don't already #include it.  Also remove <linux/kmemleak.h> from source
files that do not use it.

This is tested on i386 allmodconfig and x86_64 allmodconfig.  It would
be good to run it through the 0day bot for other $ARCHes.  I have
neither the horsepower nor the storage space for the other $ARCHes.

Update: This patch has been extensively build-tested by both the 0day
bot & kisskb/ozlabs build farms.  Both of them reported 2 build failures
for which patches are included here (in v2).

[ slab.h is the second most used header file after module.h; kernel.h is
  right there with slab.h. There could be some minor error in the
  counting due to some #includes having comments after them and I didn't
  combine all of those. ]

[akpm@linux-foundation.org: security/keys/big_key.c needs vmalloc.h, per sfr]
Link: http://lkml.kernel.org/r/e4309f98-3749-93e1-4bb7-d9501a39d015@infradead.org
Link: http://kisskb.ellerman.id.au/kisskb/head/13396/
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Reported-by: Michael Ellerman <mpe@ellerman.id.au>	[2 build failures]
Reported-by: Fengguang Wu <fengguang.wu@intel.com>	[2 build failures]
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Wei Yongjun <weiyongjun1@huawei.com>
Cc: Luis R. Rodriguez <mcgrof@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mimi Zohar <zohar@linux.vnet.ibm.com>
Cc: John Johansen <john.johansen@canonical.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:27 -07:00
Chengguang Xu
9421c3e641 net/9p/client.c: fix potential refcnt problem of trans module
When specifying trans_mod multiple times in a mount, it will cause an
inaccurate refcount of the trans module.  Also, in the error case of
option parsing, we should put the trans module if we have already got
it.

Link: http://lkml.kernel.org/r/1522154942-57339-1-git-send-email-cgxu519@gmx.com
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:23 -07:00
Greg Kurz
a85222435b net/9p: avoid -ERESTARTSYS leak to userspace
If it was interrupted by a signal, the 9p client may need to send some
more requests to the server for cleanup before returning to userspace.

To avoid such a last minute request to be interrupted right away, the
client memorizes if a signal is pending, clears TIF_SIGPENDING, handles
the request and calls recalc_sigpending() before returning.

Unfortunately, if the transmission of this cleanup request fails for any
reason, the transport returns an error and the client propagates it
right away, without calling recalc_sigpending().

This ends up with -ERESTARTSYS from the initially interrupted request
crawling up to syscall exit, with TIF_SIGPENDING cleared by the cleanup
request.  The specific signal handling code, which is responsible for
converting -ERESTARTSYS to -EINTR is not called, and userspace receives
the confusing errno value:

  open: Unknown error 512 (512)

This is really hard to hit in real life.  I discovered the issue while
working on hot-unplug of a virtio-9p-pci device with an instrumented
QEMU allowing to control request completion.

Both p9_client_zc_rpc() and p9_client_rpc() functions have this buggy
error path actually.  Their code flow is a bit obscure and the best
thing to do would probably be a full rewrite: to really ensure this
situation of clearing TIF_SIGPENDING and returning -ERESTARTSYS can
never happen.

But given the general lack of interest for the 9p code, I won't risk
breaking more things.  So this patch simply fixes the buggy paths in
both functions with a trivial label+goto.

Thanks to Laurent Dufour for his help and suggestions on how to find the
root cause and how to fix it.

Link: http://lkml.kernel.org/r/152062809886.10599.7361006774123053312.stgit@bahia.lan
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Cc: David Miller <davem@davemloft.net>
Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:22 -07:00
Davide Caratti
3239534a79 net/sched: fix NULL dereference in the error path of tcf_bpf_init()
when tcf_bpf_init_from_ops() fails (e.g. because of program having invalid
number of instructions), tcf_bpf_cfg_cleanup() calls bpf_prog_put(NULL) or
bpf_prog_destroy(NULL). Unless CONFIG_BPF_SYSCALL is unset, this causes
the following error:

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
 PGD 800000007345a067 P4D 800000007345a067 PUD 340e1067 PMD 0
 Oops: 0000 [#1] SMP PTI
 Modules linked in: act_bpf(E) ip6table_filter ip6_tables iptable_filter binfmt_misc ext4 mbcache jbd2 crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec_generic pcbc snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm aesni_intel crypto_simd glue_helper cryptd joydev snd_timer snd virtio_balloon pcspkr soundcore i2c_piix4 nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c ata_generic pata_acpi qxl drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm virtio_blk drm virtio_net virtio_console i2c_core crc32c_intel serio_raw virtio_pci ata_piix libata virtio_ring floppy virtio dm_mirror dm_region_hash dm_log dm_mod [last unloaded: act_bpf]
 CPU: 3 PID: 5654 Comm: tc Tainted: G            E    4.16.0.bpf_test+ #408
 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
 RIP: 0010:__bpf_prog_put+0xc/0xc0
 RSP: 0018:ffff9594003ef728 EFLAGS: 00010202
 RAX: 0000000000000000 RBX: ffff9594003ef758 RCX: 0000000000000024
 RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
 RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000044
 R10: 0000000000000220 R11: ffff8a7ab9f17131 R12: 0000000000000000
 R13: ffff8a7ab7c3c8e0 R14: 0000000000000001 R15: ffff8a7ab88f1054
 FS:  00007fcb2f17c740(0000) GS:ffff8a7abfd80000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000020 CR3: 000000007c888006 CR4: 00000000001606e0
 Call Trace:
  tcf_bpf_cfg_cleanup+0x2f/0x40 [act_bpf]
  tcf_bpf_cleanup+0x4c/0x70 [act_bpf]
  __tcf_idr_release+0x79/0x140
  tcf_bpf_init+0x125/0x330 [act_bpf]
  tcf_action_init_1+0x2cc/0x430
  ? get_page_from_freelist+0x3f0/0x11b0
  tcf_action_init+0xd3/0x1b0
  tc_ctl_action+0x18b/0x240
  rtnetlink_rcv_msg+0x29c/0x310
  ? _cond_resched+0x15/0x30
  ? __kmalloc_node_track_caller+0x1b9/0x270
  ? rtnl_calcit.isra.29+0x100/0x100
  netlink_rcv_skb+0xd2/0x110
  netlink_unicast+0x17c/0x230
  netlink_sendmsg+0x2cd/0x3c0
  sock_sendmsg+0x30/0x40
  ___sys_sendmsg+0x27a/0x290
  ? mem_cgroup_commit_charge+0x80/0x130
  ? page_add_new_anon_rmap+0x73/0xc0
  ? do_anonymous_page+0x2a2/0x560
  ? __handle_mm_fault+0xc75/0xe20
  __sys_sendmsg+0x58/0xa0
  do_syscall_64+0x6e/0x1a0
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
 RIP: 0033:0x7fcb2e58eba0
 RSP: 002b:00007ffc93c496c8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
 RAX: ffffffffffffffda RBX: 00007ffc93c497f0 RCX: 00007fcb2e58eba0
 RDX: 0000000000000000 RSI: 00007ffc93c49740 RDI: 0000000000000003
 RBP: 000000005ac6a646 R08: 0000000000000002 R09: 0000000000000000
 R10: 00007ffc93c49120 R11: 0000000000000246 R12: 0000000000000000
 R13: 00007ffc93c49804 R14: 0000000000000001 R15: 000000000066afa0
 Code: 5f 00 48 8b 43 20 48 c7 c7 70 2f 7c b8 c7 40 10 00 00 00 00 5b e9 a5 8b 61 00 0f 1f 44 00 00 0f 1f 44 00 00 41 54 55 48 89 fd 53 <48> 8b 47 20 f0 ff 08 74 05 5b 5d 41 5c c3 41 89 f4 0f 1f 44 00
 RIP: __bpf_prog_put+0xc/0xc0 RSP: ffff9594003ef728
 CR2: 0000000000000020

Fix it in tcf_bpf_cfg_cleanup(), ensuring that bpf_prog_{put,destroy}(f)
is called only when f is not NULL.

Fixes: bbc09e7842 ("net/sched: fix idr leak on the error path of tcf_bpf_init()")
Reported-by: Lucas Bates <lucasb@mojatatu.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-05 22:26:06 -04:00
Jeff Barnhill
71a1c91523 net/ipv6: Increment OUTxxx counters after netfilter hook
At the end of ip6_forward(), IPSTATS_MIB_OUTFORWDATAGRAMS and
IPSTATS_MIB_OUTOCTETS are incremented immediately before the NF_HOOK call
for NFPROTO_IPV6 / NF_INET_FORWARD.  As a result, these counters get
incremented regardless of whether or not the netfilter hook allows the
packet to continue being processed.  This change increments the counters
in ip6_forward_finish() so that it will not happen if the netfilter hook
chooses to terminate the packet, which is similar to how IPv4 works.

Signed-off-by: Jeff Barnhill <0xeffeff@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-05 22:23:43 -04:00
Linus Torvalds
5e4d659713 Chuck Lever did a bunch of work on nfsd tracepoints, on RDMA, and on
server xdr decoding (with an eye towards eliminating a data copy in the
 RDMA case).
 
 I did some refactoring of the delegation code in preparation for
 eliminating some delegation self-conflicts and implementing write
 delegations.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJaxi5LAAoJECebzXlCjuG+deAQAL9NHsv6bIydkE6wX305c/bR
 gm73yryF1kfOuHmiLq15mrljiKyCEbRSPqzzM8k2TkywHhyvOLEjxHbhZnwDyQec
 DaZUzWLKNkK64UFXEvTKyNwfGObwsGQ+QLkV7N9mF3Ps9M9/u2vMHKQypvA9hJ7z
 DGN7MO7Ud7N0Viu03vp4m+p7gypoWGFj6Sh1QAkR/7TE/supcS+qqOWU4vLpYFhu
 /l2gJym59FWqHajwqs0Qu9LpHfsEx5HySZbj7GczbGRMka3y/AnjgnngcriP+63B
 ZcPpqSdD4Yeq1OJklU5Wicy+u54rFkA9VE1EArrC9RAEwav6iMhhhaESpRH2JFJE
 SO7cgUSCb2+65XgeSBfDygn+09PcN+eRF3sxXkQpHKsozQOH+qdyr7F4/ePunwi8
 Ah7pIkczRUrj7gMmlNOg97wpHbffO4YnpRESA934qf7MMHRQwDsEkl512kFAyadZ
 g1DI3iByfUpBQvRWJSLasyjyWUqRZDMmyO3yi3i/08sMI3XE1IOWpNkJAooNYC1X
 1FCDn1VXlTdcmC8yw6Da1L05PCVf25tSjpYQZ6r25KtrVi9iBOGV741vmyVMvDEw
 OwUuVatRd+AY+YJ6iUraNH4SlnHWZ/qKDFJECWbrG/4uUHyo8FIXGAgL8RgtbS8+
 fQeRjaDyhHD0XH7TJ10x
 =fJI1
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.17' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "Chuck Lever did a bunch of work on nfsd tracepoints, on RDMA, and on
  server xdr decoding (with an eye towards eliminating a data copy in
  the RDMA case).

  I did some refactoring of the delegation code in preparation for
  eliminating some delegation self-conflicts and implementing write
  delegations"

* tag 'nfsd-4.17' of git://linux-nfs.org/~bfields/linux: (40 commits)
  nfsd: fix incorrect umasks
  sunrpc: remove incorrect HMAC request initialization
  NFSD: Clean up legacy NFS SYMLINK argument XDR decoders
  NFSD: Clean up legacy NFS WRITE argument XDR decoders
  nfsd: Trace NFSv4 COMPOUND execution
  nfsd: Add I/O trace points in the NFSv4 read proc
  nfsd: Add I/O trace points in the NFSv4 write path
  nfsd: Add "nfsd_" to trace point names
  nfsd: Record request byte count, not count of vectors
  nfsd: Fix NFSD trace points
  svc: Report xprt dequeue latency
  sunrpc: Report per-RPC execution stats
  sunrpc: Re-purpose trace_svc_process
  sunrpc: Save remote presentation address in svc_xprt for trace events
  sunrpc: Simplify trace_svc_recv
  sunrpc: Simplify do_enqueue tracing
  sunrpc: Move trace_svc_xprt_dequeue()
  sunrpc: Update show_svc_xprt_flags() to include recently added flags
  svc: Simplify ->xpo_secure_port
  sunrpc: Remove unneeded pointer dereference
  ...
2018-04-05 19:15:29 -07:00
Miguel Fadon Perlines
58b35f2768 arp: fix arp_filter on l3slave devices
arp_filter performs an ip_route_output search for arp source address and
checks if output device is the same where the arp request was received,
if it is not, the arp request is not answered.

This route lookup is always done on main route table so l3slave devices
never find the proper route and arp is not answered.

Passing l3mdev_master_ifindex_rcu(dev) return value as oif fixes the
lookup for l3slave devices while maintaining same behavior for non
l3slave devices as this function returns 0 in that case.

Fixes: 613d09b30f ("net: Use VRF device index for lookups on TX")
Signed-off-by: Miguel Fadon Perlines <mfadon@teldat.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-05 22:05:03 -04:00
Linus Torvalds
3526dd0c78 for-4.17/block-20180402
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCAAGBQJawr05AAoJEPfTWPspceCmT2UP/1uuaqwzyl4VjFNb/k7KS7UM
 +Cs/1HBlGomgMA8orDTGqtWqLRdR3z4RSh0+MvXTzQ78HpFVYz7CbDc9itHm+G9M
 X0ypD4kF/JGCFb5cxk+x6qv28uO2nv4DP3+0hHqJWLH4UVJBWDY6bs4BPShsf9QB
 I6XjioNMhoqylXgdOITLODJZz+TcChlJMDAqwhpJwh9TH1wjobleAZ6AdmCPfgi5
 h0UCKMUKzcVJlNZwQUrzrs2cxcx9Uhunnbz7HK0ZV4n/FKFtDpGynFpQQ71pZxKe
 Be0ZOBPCQvC3ykOM/egCIvC/e5y7FgrjORD6jxyu1PTwAugI5E1VYSMxHkXvgPAx
 zOo9A7RT4GPO2tDQv+DbzNFpqeSAclTgSmr+/y1wmheBs8DiSt7MPVBiNM4zdCNv
 NLk9z7IEjFhdmluSB/LbTb1aokypMb/q7QTLouPHdwGn80k7yrhFyLHgdjpNTQ2K
 UHfHZvGxkOX6SmFhBNOtIFUkuSceenh64a0RkRle7filx+ImpbCVm2/GYi9zZNCu
 EtctgzLbLmz40zMiyDaZS2bxBgGzfn6yf4xd9LsaAJPMhvZnmXogT0D9ctWXB0WU
 mMaS7sOkLnNjnGkzF1fHkeiZ/oigrstJbe+CA7BtOdwxpWn6MZBgKEoFQ6iA2b3X
 5J1axMgVH5LAsIEcEQVq
 =RVhK
 -----END PGP SIGNATURE-----

Merge tag 'for-4.17/block-20180402' of git://git.kernel.dk/linux-block

Pull block layer updates from Jens Axboe:
 "It's a pretty quiet round this time, which is nice. This contains:

   - series from Bart, cleaning up the way we set/test/clear atomic
     queue flags.

   - series from Bart, fixing races between gendisk and queue
     registration and removal.

   - set of bcache fixes and improvements from various folks, by way of
     Michael Lyle.

   - set of lightnvm updates from Matias, most of it being the 1.2 to
     2.0 transition.

   - removal of unused DIO flags from Nikolay.

   - blk-mq/sbitmap memory ordering fixes from Omar.

   - divide-by-zero fix for BFQ from Paolo.

   - minor documentation patches from Randy.

   - timeout fix from Tejun.

   - Alpha "can't write a char atomically" fix from Mikulas.

   - set of NVMe fixes by way of Keith.

   - bsg and bsg-lib improvements from Christoph.

   - a few sed-opal fixes from Jonas.

   - cdrom check-disk-change deadlock fix from Maurizio.

   - various little fixes, comment fixes, etc from various folks"

* tag 'for-4.17/block-20180402' of git://git.kernel.dk/linux-block: (139 commits)
  blk-mq: Directly schedule q->timeout_work when aborting a request
  blktrace: fix comment in blktrace_api.h
  lightnvm: remove function name in strings
  lightnvm: pblk: remove some unnecessary NULL checks
  lightnvm: pblk: don't recover unwritten lines
  lightnvm: pblk: implement 2.0 support
  lightnvm: pblk: implement get log report chunk
  lightnvm: pblk: rename ppaf* to addrf*
  lightnvm: pblk: check for supported version
  lightnvm: implement get log report chunk helpers
  lightnvm: make address conversions depend on generic device
  lightnvm: add support for 2.0 address format
  lightnvm: normalize geometry nomenclature
  lightnvm: complete geo structure with maxoc*
  lightnvm: add shorten OCSSD version in geo
  lightnvm: add minor version to generic geometry
  lightnvm: simplify geometry structure
  lightnvm: pblk: refactor init/exit sequences
  lightnvm: Avoid validation of default op value
  lightnvm: centralize permission check for lightnvm ioctl
  ...
2018-04-05 14:27:02 -07:00
Eric Dumazet
537b361fbc vti6: better validate user provided tunnel names
Use valid_name() to make sure user does not provide illegal
device name.

Fixes: ed1efb2aef ("ipv6: Add support for IPsec virtual tunnel interfaces")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-05 15:16:15 -04:00
Eric Dumazet
db7a65e3ab ip6_tunnel: better validate user provided tunnel names
Use valid_name() to make sure user does not provide illegal
device name.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-05 15:16:15 -04:00
Eric Dumazet
5f42df013b ip6_gre: better validate user provided tunnel names
Use dev_valid_name() to make sure user does not provide illegal
device name.

syzbot caught the following bug :

BUG: KASAN: stack-out-of-bounds in strlcpy include/linux/string.h:300 [inline]
BUG: KASAN: stack-out-of-bounds in ip6gre_tunnel_locate+0x334/0x860 net/ipv6/ip6_gre.c:339
Write of size 20 at addr ffff8801afb9f7b8 by task syzkaller851048/4466

CPU: 1 PID: 4466 Comm: syzkaller851048 Not tainted 4.16.0+ #1
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x1b9/0x29f lib/dump_stack.c:53
 print_address_description+0x6c/0x20b mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.7+0xac/0x2f5 mm/kasan/report.c:412
 check_memory_region_inline mm/kasan/kasan.c:260 [inline]
 check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
 memcpy+0x37/0x50 mm/kasan/kasan.c:303
 strlcpy include/linux/string.h:300 [inline]
 ip6gre_tunnel_locate+0x334/0x860 net/ipv6/ip6_gre.c:339
 ip6gre_tunnel_ioctl+0x69d/0x12e0 net/ipv6/ip6_gre.c:1195
 dev_ifsioc+0x43e/0xb90 net/core/dev_ioctl.c:334
 dev_ioctl+0x69a/0xcc0 net/core/dev_ioctl.c:525
 sock_ioctl+0x47e/0x680 net/socket.c:1015
 vfs_ioctl fs/ioctl.c:46 [inline]
 file_ioctl fs/ioctl.c:500 [inline]
 do_vfs_ioctl+0x1cf/0x1650 fs/ioctl.c:684
 ksys_ioctl+0xa9/0xd0 fs/ioctl.c:701
 SYSC_ioctl fs/ioctl.c:708 [inline]
 SyS_ioctl+0x24/0x30 fs/ioctl.c:706
 do_syscall_64+0x29e/0x9d0 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x42/0xb7

Fixes: c12b395a46 ("gre: Support GRE over IPv6")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-05 15:16:15 -04:00
Eric Dumazet
b95211e066 ipv6: sit: better validate user provided tunnel names
Use dev_valid_name() to make sure user does not provide illegal
device name.

syzbot caught the following bug :

BUG: KASAN: stack-out-of-bounds in strlcpy include/linux/string.h:300 [inline]
BUG: KASAN: stack-out-of-bounds in ipip6_tunnel_locate+0x63b/0xaa0 net/ipv6/sit.c:254
Write of size 33 at addr ffff8801b64076d8 by task syzkaller932654/4453

CPU: 0 PID: 4453 Comm: syzkaller932654 Not tainted 4.16.0+ #1
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x1b9/0x29f lib/dump_stack.c:53
 print_address_description+0x6c/0x20b mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.7+0xac/0x2f5 mm/kasan/report.c:412
 check_memory_region_inline mm/kasan/kasan.c:260 [inline]
 check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
 memcpy+0x37/0x50 mm/kasan/kasan.c:303
 strlcpy include/linux/string.h:300 [inline]
 ipip6_tunnel_locate+0x63b/0xaa0 net/ipv6/sit.c:254
 ipip6_tunnel_ioctl+0xe71/0x241b net/ipv6/sit.c:1221
 dev_ifsioc+0x43e/0xb90 net/core/dev_ioctl.c:334
 dev_ioctl+0x69a/0xcc0 net/core/dev_ioctl.c:525
 sock_ioctl+0x47e/0x680 net/socket.c:1015
 vfs_ioctl fs/ioctl.c:46 [inline]
 file_ioctl fs/ioctl.c:500 [inline]
 do_vfs_ioctl+0x1cf/0x1650 fs/ioctl.c:684
 ksys_ioctl+0xa9/0xd0 fs/ioctl.c:701
 SYSC_ioctl fs/ioctl.c:708 [inline]
 SyS_ioctl+0x24/0x30 fs/ioctl.c:706
 do_syscall_64+0x29e/0x9d0 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x42/0xb7

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-05 15:16:15 -04:00
Eric Dumazet
9cb726a212 ip_tunnel: better validate user provided tunnel names
Use dev_valid_name() to make sure user does not provide illegal
device name.

syzbot caught the following bug :

BUG: KASAN: stack-out-of-bounds in strlcpy include/linux/string.h:300 [inline]
BUG: KASAN: stack-out-of-bounds in __ip_tunnel_create+0xca/0x6b0 net/ipv4/ip_tunnel.c:257
Write of size 20 at addr ffff8801ac79f810 by task syzkaller268107/4482

CPU: 0 PID: 4482 Comm: syzkaller268107 Not tainted 4.16.0+ #1
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x1b9/0x29f lib/dump_stack.c:53
 print_address_description+0x6c/0x20b mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.7+0xac/0x2f5 mm/kasan/report.c:412
 check_memory_region_inline mm/kasan/kasan.c:260 [inline]
 check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
 memcpy+0x37/0x50 mm/kasan/kasan.c:303
 strlcpy include/linux/string.h:300 [inline]
 __ip_tunnel_create+0xca/0x6b0 net/ipv4/ip_tunnel.c:257
 ip_tunnel_create net/ipv4/ip_tunnel.c:352 [inline]
 ip_tunnel_ioctl+0x818/0xd40 net/ipv4/ip_tunnel.c:861
 ipip_tunnel_ioctl+0x1c5/0x420 net/ipv4/ipip.c:350
 dev_ifsioc+0x43e/0xb90 net/core/dev_ioctl.c:334
 dev_ioctl+0x69a/0xcc0 net/core/dev_ioctl.c:525
 sock_ioctl+0x47e/0x680 net/socket.c:1015
 vfs_ioctl fs/ioctl.c:46 [inline]
 file_ioctl fs/ioctl.c:500 [inline]
 do_vfs_ioctl+0x1cf/0x1650 fs/ioctl.c:684
 ksys_ioctl+0xa9/0xd0 fs/ioctl.c:701
 SYSC_ioctl fs/ioctl.c:708 [inline]
 SyS_ioctl+0x24/0x30 fs/ioctl.c:706
 do_syscall_64+0x29e/0x9d0 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x42/0xb7

Fixes: c544193214 ("GRE: Refactor GRE tunneling code.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-05 15:16:15 -04:00
Eric Dumazet
a9d48205d0 net: fool proof dev_valid_name()
We want to use dev_valid_name() to validate tunnel names,
so better use strnlen(name, IFNAMSIZ) than strlen(name) to make
sure to not upset KASAN.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-05 15:16:15 -04:00
Linus Torvalds
672a9c1069 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
  kfifo: fix inaccurate comment
  tools/thermal: tmon: fix for segfault
  net: Spelling s/stucture/structure/
  edd: don't spam log if no EDD information is present
  Documentation: Fix early-microcode.txt references after file rename
  tracing: Block comments should align the * on each line
  treewide: Fix typos in printk
  GenWQE: Fix a typo in two comments
  treewide: Align function definition open/close braces
2018-04-05 11:56:35 -07:00
Eric Dumazet
3d23401283 inet: frags: fix ip6frag_low_thresh boundary
Giving an integer to proc_doulongvec_minmax() is dangerous on 64bit arches,
since linker might place next to it a non zero value preventing a change
to ip6frag_low_thresh.

ip6frag_low_thresh is not used anymore in the kernel, but we do not
want to prematuraly break user scripts wanting to change it.

Since specifying a minimal value of 0 for proc_doulongvec_minmax()
is moot, let's remove these zero values in all defrag units.

Fixes: 6e00f7dd5e ("ipv6: frags: fix /proc/sys/net/ipv6/ip6frag_low_thresh")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-04 12:04:59 -04:00
GhantaKrishnamurthy MohanKrishna
4b2e6877b8 tipc: Fix namespace violation in tipc_sk_fill_sock_diag
To fetch UID info for socket diagnostics, we determine the
namespace of user context using tipc socket instance. This
may cause namespace violation, as the kernel will remap based
on UID.

We fix this by fetching namespace info using the calling userspace
netlink socket.

Fixes: c30b70deb5 (tipc: implement socket diagnostics for AF_TIPC)
Reported-by: syzbot+326e587eff1074657718@syzkaller.appspotmail.com
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-04 11:54:35 -04:00
Paolo Abeni
9e8445a56c net: avoid unneeded atomic operation in ip*_append_data()
After commit 694aba690d ("ipv4: factorize sk_wmem_alloc updates
done by __ip_append_data()") and commit 1f4c6eb240 ("ipv6:
factorize sk_wmem_alloc updates done by __ip6_append_data()"),
when transmitting sub MTU datagram, an addtional, unneeded atomic
operation is performed in ip*_append_data() to update wmem_alloc:
in the above condition the delta is 0.

The above cause small but measurable performance regression in UDP
xmit tput test with packet size below MTU.

This change avoids such overhead updating wmem_alloc only if
wmem_alloc_delta is non zero.

The error path is left intentionally unmodified: it's a slow path
and simplicity is preferred to performances.

Fixes: 694aba690d ("ipv4: factorize sk_wmem_alloc updates done by __ip_append_data()")
Fixes: 1f4c6eb240 ("ipv6: factorize sk_wmem_alloc updates done by __ip6_append_data()")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-04 11:53:08 -04:00
Jon Maloy
b714295abc tipc: Fix missing list initializations in struct tipc_subscription
When an item of struct tipc_subscription is created, we fail to
initialize the two lists aggregated into the struct. This has so far
never been a problem, since the items are just added to a root
object by list_add(), which does not require the addee list to be
pre-initialized. However, syzbot is provoking situations where this
addition fails, whereupon the attempted removal if the item from
the list causes a crash.

This problem seems to always have been around, despite that the code
for creating this object was rewritten in commit 242e82cc95 ("tipc:
collapse subscription creation functions"), which is still in net-next.

We fix this for that commit by initializing the two lists properly.

Fixes: 242e82cc95 ("tipc: collapse subscription creation functions")
Reported-by: syzbot+0bb443b74ce09197e970@syzkaller.appspotmail.com
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-04 11:33:52 -04:00
Alexey Kodanev
4f858c56bd ipv6: udp: set dst cache for a connected sk if current not valid
A new RTF_CACHE route can be created between ip6_sk_dst_lookup_flow()
and ip6_dst_store() calls in udpv6_sendmsg(), when datagram sending
results to ICMPV6_PKT_TOOBIG error:

    udp_v6_send_skb(), for example with vti6 tunnel:
        vti6_xmit(), get ICMPV6_PKT_TOOBIG error
            skb_dst_update_pmtu(), can create a RTF_CACHE clone
            icmpv6_send()
    ...
    udpv6_err()
        ip6_sk_update_pmtu()
           ip6_update_pmtu(), can create a RTF_CACHE clone
           ...
           ip6_datagram_dst_update()
                ip6_dst_store()

And after commit 33c162a980 ("ipv6: datagram: Update dst cache of
a connected datagram sk during pmtu update"), the UDPv6 error handler
can update socket's dst cache, but it can happen before the update in
the end of udpv6_sendmsg(), preventing getting the new dst cache on
the next udpv6_sendmsg() calls.

In order to fix it, save dst in a connected socket only if the current
socket's dst cache is invalid.

The previous patch prepared ip6_sk_dst_lookup_flow() to do that with
the new argument, and this patch enables it in udpv6_sendmsg().

Fixes: 33c162a980 ("ipv6: datagram: Update dst cache of a connected datagram sk during pmtu update")
Fixes: 45e4fd2668 ("ipv6: Only create RTF_CACHE routes after encountering pmtu exception")
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-04 11:31:57 -04:00
Alexey Kodanev
9f542f616c ipv6: udp: convert 'connected' to bool type in udpv6_sendmsg()
This should make it consistent with ip6_sk_dst_lookup_flow()
that is accepting the new 'connected' parameter of type bool.

Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-04 11:31:57 -04:00
Alexey Kodanev
96818159c3 ipv6: allow to cache dst for a connected sk in ip6_sk_dst_lookup_flow()
Add 'connected' parameter to ip6_sk_dst_lookup_flow() and update
the cache only if ip6_sk_dst_check() returns NULL and a socket
is connected.

The function is used as before, the new behavior for UDP sockets
in udpv6_sendmsg() will be enabled in the next patch.

Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-04 11:31:57 -04:00
Alexey Kodanev
7d6850f7c6 ipv6: add a wrapper for ip6_dst_store() with flowi6 checks
Move commonly used pattern of ip6_dst_store() usage to a separate
function - ip6_sk_dst_store_flow(), which will check the addresses
for equality using the flow information, before saving them.

There is no functional changes in this patch. In addition, it will
be used in the next patch, in ip6_sk_dst_lookup_flow().

Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-04 11:31:57 -04:00
Cong Wang
3848ec5dc8 af_unix: remove redundant lockdep class
After commit 581319c586 ("net/socket: use per af lockdep classes for sk queues")
sock queue locks now have per-af lockdep classes, including unix socket.
It is no longer necessary to workaround it.

I noticed this while looking at a syzbot deadlock report, this patch
itself doesn't fix it (this is why I don't add Reported-by).

Fixes: 581319c586 ("net/socket: use per af lockdep classes for sk queues")
Cc: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-04 11:13:40 -04:00
David Howells
b41d7cfef5 rxrpc: Fix undefined packet handling
By analogy with other Rx implementations, RxRPC packet types 9, 10 and 11
should just be discarded rather than being aborted like other undefined
packet types.

Reported-by: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-04 11:04:08 -04:00
Linus Torvalds
5bb053bef8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) Support offloading wireless authentication to userspace via
    NL80211_CMD_EXTERNAL_AUTH, from Srinivas Dasari.

 2) A lot of work on network namespace setup/teardown from Kirill Tkhai.
    Setup and cleanup of namespaces now all run asynchronously and thus
    performance is significantly increased.

 3) Add rx/tx timestamping support to mv88e6xxx driver, from Brandon
    Streiff.

 4) Support zerocopy on RDS sockets, from Sowmini Varadhan.

 5) Use denser instruction encoding in x86 eBPF JIT, from Daniel
    Borkmann.

 6) Support hw offload of vlan filtering in mvpp2 dreiver, from Maxime
    Chevallier.

 7) Support grafting of child qdiscs in mlxsw driver, from Nogah
    Frankel.

 8) Add packet forwarding tests to selftests, from Ido Schimmel.

 9) Deal with sub-optimal GSO packets better in BBR congestion control,
    from Eric Dumazet.

10) Support 5-tuple hashing in ipv6 multipath routing, from David Ahern.

11) Add path MTU tests to selftests, from Stefano Brivio.

12) Various bits of IPSEC offloading support for mlx5, from Aviad
    Yehezkel, Yossi Kuperman, and Saeed Mahameed.

13) Support RSS spreading on ntuple filters in SFC driver, from Edward
    Cree.

14) Lots of sockmap work from John Fastabend. Applications can use eBPF
    to filter sendmsg and sendpage operations.

15) In-kernel receive TLS support, from Dave Watson.

16) Add XDP support to ixgbevf, this is significant because it should
    allow optimized XDP usage in various cloud environments. From Tony
    Nguyen.

17) Add new Intel E800 series "ice" ethernet driver, from Anirudh
    Venkataramanan et al.

18) IP fragmentation match offload support in nfp driver, from Pieter
    Jansen van Vuuren.

19) Support XDP redirect in i40e driver, from Björn Töpel.

20) Add BPF_RAW_TRACEPOINT program type for accessing the arguments of
    tracepoints in their raw form, from Alexei Starovoitov.

21) Lots of striding RQ improvements to mlx5 driver with many
    performance improvements, from Tariq Toukan.

22) Use rhashtable for inet frag reassembly, from Eric Dumazet.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1678 commits)
  net: mvneta: improve suspend/resume
  net: mvneta: split rxq/txq init and txq deinit into SW and HW parts
  ipv6: frags: fix /proc/sys/net/ipv6/ip6frag_low_thresh
  net: bgmac: Fix endian access in bgmac_dma_tx_ring_free()
  net: bgmac: Correctly annotate register space
  route: check sysctl_fib_multipath_use_neigh earlier than hash
  fix typo in command value in drivers/net/phy/mdio-bitbang.
  sky2: Increase D3 delay to sky2 stops working after suspend
  net/mlx5e: Set EQE based as default TX interrupt moderation mode
  ibmvnic: Disable irqs before exiting reset from closed state
  net: sched: do not emit messages while holding spinlock
  vlan: also check phy_driver ts_info for vlan's real device
  Bluetooth: Mark expected switch fall-throughs
  Bluetooth: Set HCI_QUIRK_SIMULTANEOUS_DISCOVERY for BTUSB_QCA_ROME
  Bluetooth: btrsi: remove unused including <linux/version.h>
  Bluetooth: hci_bcm: Remove DMI quirk for the MINIX Z83-4
  sh_eth: kill useless check in __sh_eth_get_regs()
  sh_eth: add sh_eth_cpu_data::no_xdfar flag
  ipv6: factorize sk_wmem_alloc updates done by __ip6_append_data()
  ipv4: factorize sk_wmem_alloc updates done by __ip_append_data()
  ...
2018-04-03 14:04:18 -07:00
Eric Biggers
f3aefb6a70 sunrpc: remove incorrect HMAC request initialization
make_checksum_hmac_md5() is allocating an HMAC transform and doing
crypto API calls in the following order:

    crypto_ahash_init()
    crypto_ahash_setkey()
    crypto_ahash_digest()

This is wrong because it makes no sense to init() the request before a
key has been set, given that the initial state depends on the key.  And
digest() is short for init() + update() + final(), so in this case
there's no need to explicitly call init() at all.

Before commit 9fa68f6200 ("crypto: hash - prevent using keyed hashes
without setting key") the extra init() had no real effect, at least for
the software HMAC implementation.  (There are also hardware drivers that
implement HMAC-MD5, and it's not immediately obvious how gracefully they
handle init() before setkey().)  But now the crypto API detects this
incorrect initialization and returns -ENOKEY.  This is breaking NFS
mounts in some cases.

Fix it by removing the incorrect call to crypto_ahash_init().

Reported-by: Michael Young <m.a.young@durham.ac.uk>
Fixes: 9fa68f6200 ("crypto: hash - prevent using keyed hashes without setting key")
Fixes: fffdaef2eb ("gss_krb5: Add support for rc4-hmac encryption")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-04-03 15:08:17 -04:00
Chuck Lever
38a7031559 NFSD: Clean up legacy NFS SYMLINK argument XDR decoders
Move common code in NFSD's legacy SYMLINK decoders into a helper.
The immediate benefits include:

 - one fewer data copies on transports that support DDP
 - consistent error checking across all versions
 - reduction of code duplication
 - support for both legal forms of SYMLINK requests on RDMA
   transports for all versions of NFS (in particular, NFSv2, for
   completeness)

In the long term, this helper is an appropriate spot to perform a
per-transport call-out to fill the pathname argument using, say,
RDMA Reads.

Filling the pathname in the proc function also means that eventually
the incoming filehandle can be interpreted so that filesystem-
specific memory can be allocated as a sink for the pathname
argument, rather than using anonymous pages.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-04-03 15:08:16 -04:00
Chuck Lever
8154ef2776 NFSD: Clean up legacy NFS WRITE argument XDR decoders
Move common code in NFSD's legacy NFS WRITE decoders into a helper.
The immediate benefit is reduction of code duplication and some nice
micro-optimizations (see below).

In the long term, this helper can perform a per-transport call-out
to fill the rq_vec (say, using RDMA Reads).

The legacy WRITE decoders and procs are changed to work like NFSv4,
which constructs the rq_vec just before it is about to call
vfs_writev.

Why? Calling a transport call-out from the proc instead of the XDR
decoder means that the incoming FH can be resolved to a particular
filesystem and file. This would allow pages from the backing file to
be presented to the transport to be filled, rather than presenting
anonymous pages and copying or flipping them into the file's page
cache later.

I also prefer using the pages in rq_arg.pages, instead of pulling
the data pages directly out of the rqstp::rq_pages array. This is
currently the way the NFSv3 write decoder works, but the other two
do not seem to take this approach. Fixing this removes the only
reference to rq_pages found in NFSD, eliminating an NFSD assumption
about how transports use the pages in rq_pages.

Lastly, avoid setting up the first element of rq_vec as a zero-
length buffer. This happens with an RDMA transport when a normal
Read chunk is present because the data payload is in rq_arg's
page list (none of it is in the head buffer).

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-04-03 15:08:16 -04:00
Chuck Lever
55f5088c22 svc: Report xprt dequeue latency
Record the time between when a rqstp is enqueued on a transport
and when it is dequeued. This includes how long the rqstp waits on
the queue and how long it takes the kernel scheduler to wake a
nfsd thread to service it.

The svc_xprt_dequeue trace point is altered to include the number
of microseconds between xprt_enqueue and xprt_dequeue.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-04-03 15:08:13 -04:00
Chuck Lever
aaba72cd4e sunrpc: Report per-RPC execution stats
Introduce a mechanism to report the server-side execution latency of
each RPC. The goal is to enable user space to filter the trace
record for latency outliers, build histograms, etc.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-04-03 15:08:12 -04:00
Chuck Lever
0b9547bf6b sunrpc: Re-purpose trace_svc_process
Currently, trace_svc_process has two call sites:

1. Just after a call to svc_send. svc_send already invokes
   trace_svc_send with the same arguments just before returning

2. Just before a call to svc_drop. svc_drop already invokes
   trace_svc_drop with the same arguments just after it is called

Therefore trace_svc_process does not provide any additional
information not already provided by these other trace points.

However, it would be useful to record the incoming RPC procedure.
So reuse trace_svc_process for this purpose.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-04-03 15:08:12 -04:00
Chuck Lever
ece200ddd5 sunrpc: Save remote presentation address in svc_xprt for trace events
TP_printk defines a format string that is passed to user space for
converting raw trace event records to something human-readable.

My user space's printf (Oracle Linux 7), however, does not have a
%pI format specifier. The result is that what is supposed to be an
IP address in the output of "trace-cmd report" is just a string that
says the field couldn't be displayed.

To fix this, adopt the same approach as the client: maintain a pre-
formated presentation address for occasions when %pI is not
available.

The location of the trace_svc_send trace point is adjusted so that
rqst->rq_xprt is not NULL when the trace event is recorded.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-04-03 15:08:11 -04:00
Chuck Lever
41f306d0c2 sunrpc: Simplify trace_svc_recv
There doesn't seem to be a lot of value in calling trace_svc_recv
in the failing case.

1. There are two very common cases: one is the transport is not
ready, and the other is shutdown. Neither is terribly interesting.

2. The trace record for the failing case contains nothing but
the status code.

Therefore the trace point call site in the error exit is removed.
Since the trace point is now recording a length instead of a
status, rename the status field and remove the case that records a
zero XID.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-04-03 15:08:11 -04:00
Chuck Lever
7dbb53baed sunrpc: Simplify do_enqueue tracing
There are three cases where svc_xprt_do_enqueue() returns without
waking an nfsd thread:

1. There is no work to do

2. The transport is already busy

3. There are no available nfsd threads

Only 3. is truly interesting. Move the trace point so it records
that there was work to do and either an nfsd thread was awoken, or
a free one could not found.

As an additional clean up, remove a redundant comment and a couple
of dprintk call sites.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-04-03 15:08:11 -04:00
Chuck Lever
caa3e106dc sunrpc: Move trace_svc_xprt_dequeue()
Reduce the amount of noise generated by trace_svc_xprt_dequeue by
moving it to the end of svc_get_next_xprt. This generates exactly
one trace event when a ready xprt is found, rather than spurious
events when there is no work to do. The empty events contain no
information that can't be obtained simply by tracing function calls
to svc_xprt_dequeue.

A small additional benefit is simplification of the svc_xprt_event
trace class, which no longer has to handle the case when the @xprt
parameter is NULL.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-04-03 15:08:10 -04:00
Chuck Lever
989f881ebf svc: Simplify ->xpo_secure_port
Clean up: Instead of returning a value that is used to set or clear
a bit, just make ->xpo_secure_port mangle that bit, and return void.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-04-03 15:08:09 -04:00
Chuck Lever
63a1b15693 sunrpc: Remove unneeded pointer dereference
Clean up: Noticed during code inspection that there is already a
local automatic variable "xprt" so dereferencing rqst->rq_xprt
again is unnecessary.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-04-03 15:08:09 -04:00
Szymon Janc
082f2300cf Bluetooth: Fix connection if directed advertising and privacy is used
Local random address needs to be updated before creating connection if
RPA from LE Direct Advertising Report was resolved in host. Otherwise
remote device might ignore connection request due to address mismatch.

This was affecting following qualification test cases:
GAP/CONN/SCEP/BV-03-C, GAP/CONN/GCEP/BV-05-C, GAP/CONN/DCEP/BV-05-C

Before patch:
< HCI Command: LE Set Random Address (0x08|0x0005) plen 6          #11350 [hci0] 84680.231216
        Address: 56:BC:E8:24:11:68 (Resolvable)
          Identity type: Random (0x01)
          Identity: F2:F1:06:3D:9C:42 (Static)
> HCI Event: Command Complete (0x0e) plen 4                        #11351 [hci0] 84680.246022
      LE Set Random Address (0x08|0x0005) ncmd 1
        Status: Success (0x00)
< HCI Command: LE Set Scan Parameters (0x08|0x000b) plen 7         #11352 [hci0] 84680.246417
        Type: Passive (0x00)
        Interval: 60.000 msec (0x0060)
        Window: 30.000 msec (0x0030)
        Own address type: Random (0x01)
        Filter policy: Accept all advertisement, inc. directed unresolved RPA (0x02)
> HCI Event: Command Complete (0x0e) plen 4                        #11353 [hci0] 84680.248854
      LE Set Scan Parameters (0x08|0x000b) ncmd 1
        Status: Success (0x00)
< HCI Command: LE Set Scan Enable (0x08|0x000c) plen 2             #11354 [hci0] 84680.249466
        Scanning: Enabled (0x01)
        Filter duplicates: Enabled (0x01)
> HCI Event: Command Complete (0x0e) plen 4                        #11355 [hci0] 84680.253222
      LE Set Scan Enable (0x08|0x000c) ncmd 1
        Status: Success (0x00)
> HCI Event: LE Meta Event (0x3e) plen 18                          #11356 [hci0] 84680.458387
      LE Direct Advertising Report (0x0b)
        Num reports: 1
        Event type: Connectable directed - ADV_DIRECT_IND (0x01)
        Address type: Random (0x01)
        Address: 53:38:DA:46:8C:45 (Resolvable)
          Identity type: Public (0x00)
          Identity: 11:22:33:44:55:66 (OUI 11-22-33)
        Direct address type: Random (0x01)
        Direct address: 7C:D6:76:8C:DF:82 (Resolvable)
          Identity type: Random (0x01)
          Identity: F2:F1:06:3D:9C:42 (Static)
        RSSI: -74 dBm (0xb6)
< HCI Command: LE Set Scan Enable (0x08|0x000c) plen 2             #11357 [hci0] 84680.458737
        Scanning: Disabled (0x00)
        Filter duplicates: Disabled (0x00)
> HCI Event: Command Complete (0x0e) plen 4                        #11358 [hci0] 84680.469982
      LE Set Scan Enable (0x08|0x000c) ncmd 1
        Status: Success (0x00)
< HCI Command: LE Create Connection (0x08|0x000d) plen 25          #11359 [hci0] 84680.470444
        Scan interval: 60.000 msec (0x0060)
        Scan window: 60.000 msec (0x0060)
        Filter policy: White list is not used (0x00)
        Peer address type: Random (0x01)
        Peer address: 53:38:DA:46:8C:45 (Resolvable)
          Identity type: Public (0x00)
          Identity: 11:22:33:44:55:66 (OUI 11-22-33)
        Own address type: Random (0x01)
        Min connection interval: 30.00 msec (0x0018)
        Max connection interval: 50.00 msec (0x0028)
        Connection latency: 0 (0x0000)
        Supervision timeout: 420 msec (0x002a)
        Min connection length: 0.000 msec (0x0000)
        Max connection length: 0.000 msec (0x0000)
> HCI Event: Command Status (0x0f) plen 4                          #11360 [hci0] 84680.474971
      LE Create Connection (0x08|0x000d) ncmd 1
        Status: Success (0x00)
< HCI Command: LE Create Connection Cancel (0x08|0x000e) plen 0    #11361 [hci0] 84682.545385
> HCI Event: Command Complete (0x0e) plen 4                        #11362 [hci0] 84682.551014
      LE Create Connection Cancel (0x08|0x000e) ncmd 1
        Status: Success (0x00)
> HCI Event: LE Meta Event (0x3e) plen 19                          #11363 [hci0] 84682.551074
      LE Connection Complete (0x01)
        Status: Unknown Connection Identifier (0x02)
        Handle: 0
        Role: Master (0x00)
        Peer address type: Public (0x00)
        Peer address: 00:00:00:00:00:00 (OUI 00-00-00)
        Connection interval: 0.00 msec (0x0000)
        Connection latency: 0 (0x0000)
        Supervision timeout: 0 msec (0x0000)
        Master clock accuracy: 0x00

After patch:
< HCI Command: LE Set Scan Parameters (0x08|0x000b) plen 7    #210 [hci0] 667.152459
        Type: Passive (0x00)
        Interval: 60.000 msec (0x0060)
        Window: 30.000 msec (0x0030)
        Own address type: Random (0x01)
        Filter policy: Accept all advertisement, inc. directed unresolved RPA (0x02)
> HCI Event: Command Complete (0x0e) plen 4                   #211 [hci0] 667.153613
      LE Set Scan Parameters (0x08|0x000b) ncmd 1
        Status: Success (0x00)
< HCI Command: LE Set Scan Enable (0x08|0x000c) plen 2        #212 [hci0] 667.153704
        Scanning: Enabled (0x01)
        Filter duplicates: Enabled (0x01)
> HCI Event: Command Complete (0x0e) plen 4                   #213 [hci0] 667.154584
      LE Set Scan Enable (0x08|0x000c) ncmd 1
        Status: Success (0x00)
> HCI Event: LE Meta Event (0x3e) plen 18                     #214 [hci0] 667.182619
      LE Direct Advertising Report (0x0b)
        Num reports: 1
        Event type: Connectable directed - ADV_DIRECT_IND (0x01)
        Address type: Random (0x01)
        Address: 50:52:D9:A6:48:A0 (Resolvable)
          Identity type: Public (0x00)
          Identity: 11:22:33:44:55:66 (OUI 11-22-33)
        Direct address type: Random (0x01)
        Direct address: 7C:C1:57:A5:B7:A8 (Resolvable)
          Identity type: Random (0x01)
          Identity: F4:28:73:5D:38:B0 (Static)
        RSSI: -70 dBm (0xba)
< HCI Command: LE Set Scan Enable (0x08|0x000c) plen 2       #215 [hci0] 667.182704
        Scanning: Disabled (0x00)
        Filter duplicates: Disabled (0x00)
> HCI Event: Command Complete (0x0e) plen 4                  #216 [hci0] 667.183599
      LE Set Scan Enable (0x08|0x000c) ncmd 1
        Status: Success (0x00)
< HCI Command: LE Set Random Address (0x08|0x0005) plen 6    #217 [hci0] 667.183645
        Address: 7C:C1:57:A5:B7:A8 (Resolvable)
          Identity type: Random (0x01)
          Identity: F4:28:73:5D:38:B0 (Static)
> HCI Event: Command Complete (0x0e) plen 4                  #218 [hci0] 667.184590
      LE Set Random Address (0x08|0x0005) ncmd 1
        Status: Success (0x00)
< HCI Command: LE Create Connection (0x08|0x000d) plen 25    #219 [hci0] 667.184613
        Scan interval: 60.000 msec (0x0060)
        Scan window: 60.000 msec (0x0060)
        Filter policy: White list is not used (0x00)
        Peer address type: Random (0x01)
        Peer address: 50:52:D9:A6:48:A0 (Resolvable)
          Identity type: Public (0x00)
          Identity: 11:22:33:44:55:66 (OUI 11-22-33)
        Own address type: Random (0x01)
        Min connection interval: 30.00 msec (0x0018)
        Max connection interval: 50.00 msec (0x0028)
        Connection latency: 0 (0x0000)
        Supervision timeout: 420 msec (0x002a)
        Min connection length: 0.000 msec (0x0000)
        Max connection length: 0.000 msec (0x0000)
> HCI Event: Command Status (0x0f) plen 4                    #220 [hci0] 667.186558
      LE Create Connection (0x08|0x000d) ncmd 1
        Status: Success (0x00)
> HCI Event: LE Meta Event (0x3e) plen 19                    #221 [hci0] 667.485824
      LE Connection Complete (0x01)
        Status: Success (0x00)
        Handle: 0
        Role: Master (0x00)
        Peer address type: Random (0x01)
        Peer address: 50:52:D9:A6:48:A0 (Resolvable)
          Identity type: Public (0x00)
          Identity: 11:22:33:44:55:66 (OUI 11-22-33)
        Connection interval: 50.00 msec (0x0028)
        Connection latency: 0 (0x0000)
        Supervision timeout: 420 msec (0x002a)
        Master clock accuracy: 0x07
@ MGMT Event: Device Connected (0x000b) plen 13          {0x0002} [hci0] 667.485996
        LE Address: 11:22:33:44:55:66 (OUI 11-22-33)
        Flags: 0x00000000
        Data length: 0

Signed-off-by: Szymon Janc <szymon.janc@codecoup.pl>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Cc: stable@vger.kernel.org
2018-04-03 16:12:56 +02:00
Linus Torvalds
642e7fd233 Merge branch 'syscalls-next' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux
Pull removal of in-kernel calls to syscalls from Dominik Brodowski:
 "System calls are interaction points between userspace and the kernel.
  Therefore, system call functions such as sys_xyzzy() or
  compat_sys_xyzzy() should only be called from userspace via the
  syscall table, but not from elsewhere in the kernel.

  At least on 64-bit x86, it will likely be a hard requirement from
  v4.17 onwards to not call system call functions in the kernel: It is
  better to use use a different calling convention for system calls
  there, where struct pt_regs is decoded on-the-fly in a syscall wrapper
  which then hands processing over to the actual syscall function. This
  means that only those parameters which are actually needed for a
  specific syscall are passed on during syscall entry, instead of
  filling in six CPU registers with random user space content all the
  time (which may cause serious trouble down the call chain). Those
  x86-specific patches will be pushed through the x86 tree in the near
  future.

  Moreover, rules on how data may be accessed may differ between kernel
  data and user data. This is another reason why calling sys_xyzzy() is
  generally a bad idea, and -- at most -- acceptable in arch-specific
  code.

  This patchset removes all in-kernel calls to syscall functions in the
  kernel with the exception of arch/. On top of this, it cleans up the
  three places where many syscalls are referenced or prototyped, namely
  kernel/sys_ni.c, include/linux/syscalls.h and include/linux/compat.h"

* 'syscalls-next' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux: (109 commits)
  bpf: whitelist all syscalls for error injection
  kernel/sys_ni: remove {sys_,sys_compat} from cond_syscall definitions
  kernel/sys_ni: sort cond_syscall() entries
  syscalls/x86: auto-create compat_sys_*() prototypes
  syscalls: sort syscall prototypes in include/linux/compat.h
  net: remove compat_sys_*() prototypes from net/compat.h
  syscalls: sort syscall prototypes in include/linux/syscalls.h
  kexec: move sys_kexec_load() prototype to syscalls.h
  x86/sigreturn: use SYSCALL_DEFINE0
  x86: fix sys_sigreturn() return type to be long, not unsigned long
  x86/ioport: add ksys_ioperm() helper; remove in-kernel calls to sys_ioperm()
  mm: add ksys_readahead() helper; remove in-kernel calls to sys_readahead()
  mm: add ksys_mmap_pgoff() helper; remove in-kernel calls to sys_mmap_pgoff()
  mm: add ksys_fadvise64_64() helper; remove in-kernel call to sys_fadvise64_64()
  fs: add ksys_fallocate() wrapper; remove in-kernel calls to sys_fallocate()
  fs: add ksys_p{read,write}64() helpers; remove in-kernel calls to syscalls
  fs: add ksys_truncate() wrapper; remove in-kernel calls to sys_truncate()
  fs: add ksys_sync_file_range helper(); remove in-kernel calls to syscall
  kernel: add ksys_setsid() helper; remove in-kernel call to sys_setsid()
  kernel: add ksys_unshare() helper; remove in-kernel calls to sys_unshare()
  ...
2018-04-02 21:22:12 -07:00
Dominik Brodowski
6df354653e net: socket: add __compat_sys_...msg() helpers; remove in-kernel calls to compat syscalls
Using the net-internal helpers __compat_sys_...msg() allows us to avoid
the internal calls to the compat_sys_...msg() syscalls.
compat_sys_recvmmsg() is handled in a different patch.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:20 +02:00
Dominik Brodowski
157b334aa8 net: socket: add __compat_sys_recvmmsg() helper; remove in-kernel call to compat syscall
Using the net-internal helper __compat_sys_recvmmsg() allows us to avoid
the internal calls to the compat_sys_recvmmsg() syscall.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:19 +02:00
Dominik Brodowski
8770cf4a58 net: socket: add __compat_sys_getsockopt() helper; remove in-kernel call to compat syscall
Using the net-internal helper __compat_sys_getsockopt() allows us to avoid
the internal calls to the compat_sys_getsockopt() syscall.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:19 +02:00
Dominik Brodowski
73ee3eafd5 net: socket: add __compat_sys_setsockopt() helper; remove in-kernel call to compat syscall
Using the net-internal helper __compat_sys_setsockopt() allows us to avoid
the internal calls to the compat_sys_setsockopt() syscall.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:18 +02:00
Dominik Brodowski
fd4e82f5b8 net: socket: add __compat_sys_recvfrom() helper; remove in-kernel call to compat syscall
Using the net-internal helper __compat_sys_recvfrom() allows us to avoid
the internal calls to the compat_sys_recvfrom() syscall.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:17 +02:00
Dominik Brodowski
d27e9afc64 net: socket: replace call to sys_recv() with __sys_recvfrom()
sys_recv() merely expands the parameters to __sys_recvfrom() by NULL and
NULL. Open-code this in the two places which used sys_recv() as a wrapper
to __sys_recvfrom().

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:16 +02:00
Dominik Brodowski
f3bf896b1d net: socket: replace calls to sys_send() with __sys_sendto()
sys_send() merely expands the parameters to __sys_sendto() by NULL and 0.
Open-code this in the two places which used sys_send() as a wrapper to
__sys_sendto().

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:16 +02:00
Dominik Brodowski
e1834a329d net: socket: move check for forbid_cmsg_compat to __sys_...msg()
The non-compat codepaths for sys_...msg() verify that MSG_CMSG_COMPAT
is not set. By moving this check to the __sys_...msg() functions
(and making it dependent on a static flag passed to this function), we
can call the __sys...msg() functions instead of the syscall functions
in all cases. __sys_recvmmsg() does not need this trickery, as the
check is handled within the do_sys_recvmmsg() function internal to
net/socket.c.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:15 +02:00
Dominik Brodowski
1255e26906 net: socket: add do_sys_recvmmsg() helper; remove in-kernel call to syscall
Using the net-internal helper do_sys_recvmmsg() allows us to avoid the
internal calls to the sys_getsockopt() syscall.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:14 +02:00
Dominik Brodowski
13a2d70e2b net: socket: add __sys_getsockopt() helper; remove in-kernel call to syscall
Using the net-internal helper __sys_getsockopt() allows us to avoid the
internal calls to the sys_getsockopt() syscall.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:13 +02:00
Dominik Brodowski
cc36dca0df net: socket: add __sys_setsockopt() helper; remove in-kernel call to syscall
Using the net-internal helper __sys_setsockopt() allows us to avoid the
internal calls to the sys_setsockopt() syscall.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:13 +02:00
Dominik Brodowski
005a1aeac4 net: socket: add __sys_shutdown() helper; remove in-kernel call to syscall
Using the net-internal helper __sys_shutdown() allows us to avoid the
internal calls to the sys_shutdown() syscall.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:12 +02:00
Dominik Brodowski
6debc8d834 net: socket: add __sys_socketpair() helper; remove in-kernel call to syscall
Using the net-internal helper __sys_socketpair() allows us to avoid the
internal calls to the sys_socketpair() syscall.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:11 +02:00
Dominik Brodowski
b21c8f838a net: socket: add __sys_getpeername() helper; remove in-kernel call to syscall
Using the net-internal helper __sys_getpeername() allows us to avoid the
internal calls to the sys_getpeername() syscall.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:10 +02:00
Dominik Brodowski
8882a107b3 net: socket: add __sys_getsockname() helper; remove in-kernel call to syscall
Using the net-internal helper __sys_getsockname() allows us to avoid the
internal calls to the sys_getsockname() syscall.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:09 +02:00
Dominik Brodowski
25e290eed9 net: socket: add __sys_listen() helper; remove in-kernel call to syscall
Using the net-internal helper __sys_listen() allows us to avoid the
internal calls to the sys_listen() syscall.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:09 +02:00
Dominik Brodowski
1387c2c2f9 net: socket: add __sys_connect() helper; remove in-kernel call to syscall
Using the net-internal helper __sys_connect() allows us to avoid the
internal calls to the sys_connect() syscall.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:08 +02:00
Dominik Brodowski
a87d35d87a net: socket: add __sys_bind() helper; remove in-kernel call to syscall
Using the net-internal helper __sys_bind() allows us to avoid the
internal calls to the sys_bind() syscall.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:07 +02:00
Dominik Brodowski
9d6a15c3f2 net: socket: add __sys_socket() helper; remove in-kernel call to syscall
Using the net-internal helper __sys_socket() allows us to avoid the
internal calls to the sys_socket() syscall.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:06 +02:00
Dominik Brodowski
4541e80560 net: socket: add __sys_accept4() helper; remove in-kernel call to syscall
Using the net-internal helper __sys_accept4() allows us to avoid the
internal calls to the sys_accept4() syscall.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:06 +02:00
Dominik Brodowski
211b634b7f net: socket: add __sys_sendto() helper; remove in-kernel call to syscall
Using the net-internal helper __sys_sendto() allows us to avoid the
internal calls to the sys_sendto() syscall.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:05 +02:00
Dominik Brodowski
7a09e1eb9c net: socket: add __sys_recvfrom() helper; remove in-kernel call to syscall
Using the net-internal helper __sys_recvfrom() allows us to avoid the
internal calls to the sys_recvfrom() syscall.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:04 +02:00
Eric Dumazet
6e00f7dd5e ipv6: frags: fix /proc/sys/net/ipv6/ip6frag_low_thresh
I forgot to change ip6frag_low_thresh proc_handler
from proc_dointvec_minmax to proc_doulongvec_minmax

Fixes: 3e67f106f6 ("inet: frags: break the 2GB limit for frags storage")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-02 10:10:50 -04:00
Luis Henriques
fb18a57568 ceph: quota: add initial infrastructure to support cephfs quotas
This patch adds the infrastructure required to support cephfs quotas as it
is currently implemented in the ceph fuse client.  Cephfs quotas can be
set on any directory, and can restrict the number of bytes or the number
of files stored beneath that point in the directory hierarchy.

Quotas are set using the extended attributes 'ceph.quota.max_files' and
'ceph.quota.max_bytes', and can be removed by setting these attributes to
'0'.

Link: http://tracker.ceph.com/issues/22372
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-04-02 11:17:51 +02:00
Chengguang Xu
57a35dfb52 libceph, ceph: add __init attribution to init funcitons
Add __init attribution to the functions which are called only once
during initiating/registering operations and deleting unnecessary
symbol exports.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-04-02 10:12:48 +02:00
Chengguang Xu
f2f87877b8 libceph: adding missing message types to ceph_msg_type_name()
Some of message types are missing in ceph_msg_type_name(),
so just adding them for better understanding of output information.

Signed-off-by: Chengguang Xu <cgxu519@icloud.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-04-02 10:12:46 +02:00
Chengguang Xu
7377324e5b libceph: fix misjudgement of maximum monitor number
num_mon should allow up to CEPH_MAX_MON in ceph_monmap_decode().

Signed-off-by: Chengguang Xu <cgxu519@icloud.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-04-02 10:12:45 +02:00
Chengguang Xu
11e1478df9 libceph, ceph: change permission for readonly debugfs entries
Remove write permission for debugfs entries which only have readonly
function.

Signed-off-by: Chengguang Xu <cgxu519@icloud.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-04-02 10:12:45 +02:00
Chengguang Xu
4c069a5821 ceph: add newline to end of debug message format
Some of dout format do not include newline in the end,
fix for the files which are in fs/ceph and net/ceph directories,
and changing printk to dout for printing debug info in super.c

Signed-off-by: Chengguang Xu <cgxu519@icloud.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-04-02 10:12:44 +02:00
Ilya Dryomov
08c1ac508b libceph, ceph: move ceph_calc_file_object_mapping() to striper.c
ceph_calc_file_object_mapping() has nothing to do with osdmaps.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-04-02 10:12:43 +02:00
Ilya Dryomov
ed0811d2d2 libceph: striping framework implementation
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-04-02 10:12:42 +02:00
Ilya Dryomov
45a267dbb4 libceph: handle zero-length data items
rbd needs this for null copyups -- if copyup data is all zeroes, we
want to save some I/O and network bandwidth.  See rbd_obj_issue_copyup()
in the next commit.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2018-04-02 10:12:40 +02:00
Ilya Dryomov
b9e281c2b3 libceph: introduce BVECS data type
In preparation for rbd "fancy" striping, introduce ceph_bvec_iter for
working with bio_vec array data buffers.  The wrappers are trivial, but
make it look similar to ceph_bio_iter.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-04-02 10:12:39 +02:00
Ilya Dryomov
5359a17d27 libceph, rbd: new bio handling code (aka don't clone bios)
The reason we clone bios is to be able to give each object request
(and consequently each ceph_osd_data/ceph_msg_data item) its own
pointer to a (list of) bio(s).  The messenger then initializes its
cursor with cloned bio's ->bi_iter, so it knows where to start reading
from/writing to.  That's all the cloned bios are used for: to determine
each object request's starting position in the provided data buffer.

Introduce ceph_bio_iter to do exactly that -- store position within bio
list (i.e. pointer to bio) + position within that bio (i.e. bvec_iter).

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-04-02 10:12:38 +02:00
Ilya Dryomov
dccbf08005 libceph, ceph: change ceph_calc_file_object_mapping() signature
- make it void
- xlen (object extent length) out parameter should be u32 because only
  a single stripe unit is mapped at a time

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2018-04-02 10:12:38 +02:00
Ilya Dryomov
db2196a589 libceph: eliminate overflows in ceph_calc_file_object_mapping()
bl, stripeno and objsetno should be u64 -- otherwise large enough files
get corrupted.  How large depends on file layout:

- 4M-objects layout (default): any file over 16P
- 64K-objects layout (smallest possible object size): any file over 512T

Only CephFS is affected, rbd doesn't use ceph_calc_file_object_mapping()
yet.  Fortunately, CephFS has a max_file_size configurable, the default
for which is way below both of the above numbers.

Reimplement the logic from scratch with no layout validation -- it's
done on the MDS side.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2018-04-02 10:12:38 +02:00
Xin Long
6174a30df1 route: check sysctl_fib_multipath_use_neigh earlier than hash
Prior to this patch, when one packet is hashed into path [1]
(hash <= nh_upper_bound) and it's neigh is dead, it will try
path [2]. However, if path [2]'s neigh is alive but it's
hash > nh_upper_bound, it will not return this alive path.
This packet will never be sent even if path [2] is alive.

 3.3.3.1/24:
  nexthop via 1.1.1.254 dev eth1 weight 1 <--[1] (dead neigh)
  nexthop via 2.2.2.254 dev eth2 weight 1 <--[2]

With sysctl_fib_multipath_use_neigh set is supposed to find an
available path respecting to the l3/l4 hash. But if there is
no available route with this hash, it should at least return
an alive route even with other hash.

This patch is to fix it by processing fib_multipath_use_neigh
earlier than the hash check, so that it will at least return
an alive route if there is when fib_multipath_use_neigh is
enabled. It's also compatible with before when there are alive
routes with the l3/l4 hash.

Fixes: a6db4494d2 ("net: ipv4: Consider failed nexthops in multipath routes")
Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-01 20:57:39 -04:00
Li RongQing
da01ec4ee5 net: sched: do not emit messages while holding spinlock
move messages emitting out of sch_tree_lock to avoid holding
this lock too long.

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-01 20:53:50 -04:00
Hangbin Liu
ec1d8ccb07 vlan: also check phy_driver ts_info for vlan's real device
Just like function ethtool_get_ts_info(), we should also consider the
phy_driver ts_info call back. For example, driver dp83640.

Fixes: 37dd9255b2 ("vlan: Pass ethtool get_ts_info queries to real device.")
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-01 20:53:50 -04:00
David S. Miller
c0b458a946 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Minor conflicts in drivers/net/ethernet/mellanox/mlx5/core/en_rep.c,
we had some overlapping changes:

1) In 'net' MLX5E_PARAMS_LOG_{SQ,RQ}_SIZE -->
   MLX5E_REP_PARAMS_LOG_{SQ,RQ}_SIZE

2) In 'net-next' params->log_rq_size is renamed to be
   params->log_rq_mtu_frames.

3) In 'net-next' params->hard_mtu is added.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-01 19:49:34 -04:00
David S. Miller
859a59352e Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2018-04-01

Here's (most likely) the last bluetooth-next pull request for the 4.17
kernel:

 - Remove unused btuart_cs driver (replaced by serial_cs + hci_uart)
 - New USB ID for Edimax EW-7611ULB controller
 - Cleanups & fixes to hci_bcm driver
 - Clenups to btmrvl driver

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-01 17:46:01 -04:00
Gustavo A. R. Silva
9ea471320e Bluetooth: Mark expected switch fall-throughs
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2018-04-01 21:43:03 +03:00
Eric Dumazet
1f4c6eb240 ipv6: factorize sk_wmem_alloc updates done by __ip6_append_data()
While testing my inet defrag changes, I found that the senders
could spend ~20% of cpu cycles in skb_set_owner_w() updating
sk->sk_wmem_alloc for every fragment they cook, competing
with TX completion of prior skbs possibly happening on another cpus.

The solution to this problem is to use alloc_skb() instead
of sock_wmalloc() and manually perform a single sk_wmem_alloc change.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-01 14:08:21 -04:00
Eric Dumazet
694aba690d ipv4: factorize sk_wmem_alloc updates done by __ip_append_data()
While testing my inet defrag changes, I found that the senders
could spend ~20% of cpu cycles in skb_set_owner_w() updating
sk->sk_wmem_alloc for every fragment they cook.

The solution to this problem is to use alloc_skb() instead
of sock_wmalloc() and manually perform a single sk_wmem_alloc change.

Similar change for IPv6 is provided in following patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-01 14:08:21 -04:00
Alexey Kodanev
ec1258903a ip6_gre: remove redundant 'tunnel' setting in ip6erspan_tap_init()
'tunnel' was already set at the start of ip6erspan_tap_init().

Fixes: 5a963eb61b ("ip6_gre: Add ERSPAN native tunnel support")
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-01 14:02:32 -04:00
Atul Gupta
cc35c88ae4 crypto : chtls - CPL handler definition
Exchange messages with hardware to program the TLS session
CPL handlers for messages received from chip.

Signed-off-by: Atul Gupta <atul.gupta@chelsio.com>
Signed-off-by: Michael Werner <werner@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 23:37:32 -04:00
Atul Gupta
e0be6bea25 ethtool: enable Inline TLS in HW
Ethtool option enables TLS record offload on HW, user
configures the feature for netdev capable of Inline TLS.
This allows user to define custom sk_prot for Inline TLS sock

Signed-off-by: Atul Gupta <atul.gupta@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 23:37:32 -04:00
Atul Gupta
dd0bed1665 tls: support for Inline tls record
Facility to register Inline TLS drivers to net/tls. Setup
TLS_HW_RECORD prot to listen on offload device.

Cases handled
- Inline TLS device exists, setup prot for TLS_HW_RECORD
- Atleast one Inline TLS exists, sets TLS_HW_RECORD.
- If non-inline device establish connection, move to TLS_SW_TX

Signed-off-by: Atul Gupta <atul.gupta@chelsio.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 23:37:32 -04:00
David S. Miller
d4069fe6fc Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2018-03-31

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Add raw BPF tracepoint API in order to have a BPF program type that
   can access kernel internal arguments of the tracepoints in their
   raw form similar to kprobes based BPF programs. This infrastructure
   also adds a new BPF_RAW_TRACEPOINT_OPEN command to BPF syscall which
   returns an anon-inode backed fd for the tracepoint object that allows
   for automatic detach of the BPF program resp. unregistering of the
   tracepoint probe on fd release, from Alexei.

2) Add new BPF cgroup hooks at bind() and connect() entry in order to
   allow BPF programs to reject, inspect or modify user space passed
   struct sockaddr, and as well a hook at post bind time once the port
   has been allocated. They are used in FB's container management engine
   for implementing policy, replacing fragile LD_PRELOAD wrapper
   intercepting bind() and connect() calls that only works in limited
   scenarios like glibc based apps but not for other runtimes in
   containerized applications, from Andrey.

3) BPF_F_INGRESS flag support has been added to sockmap programs for
   their redirect helper call bringing it in line with cls_bpf based
   programs. Support is added for both variants of sockmap programs,
   meaning for tx ULP hooks as well as recv skb hooks, from John.

4) Various improvements on BPF side for the nfp driver, besides others
   this work adds BPF map update and delete helper call support from
   the datapath, JITing of 32 and 64 bit XADD instructions as well as
   offload support of bpf_get_prandom_u32() call. Initial implementation
   of nfp packet cache has been tackled that optimizes memory access
   (see merge commit for further details), from Jakub and Jiong.

5) Removal of struct bpf_verifier_env argument from the print_bpf_insn()
   API has been done in order to prepare to use print_bpf_insn() soon
   out of perf tool directly. This makes the print_bpf_insn() API more
   generic and pushes the env into private data. bpftool is adjusted
   as well with the print_bpf_insn() argument removal, from Jiri.

6) Couple of cleanups and prep work for the upcoming BTF (BPF Type
   Format). The latter will reuse the current BPF verifier log as
   well, thus bpf_verifier_log() is further generalized, from Martin.

7) For bpf_getsockopt() and bpf_setsockopt() helpers, IPv4 IP_TOS read
   and write support has been added in similar fashion to existing
   IPv6 IPV6_TCLASS socket option we already have, from Nikita.

8) Fixes in recent sockmap scatterlist API usage, which did not use
   sg_init_table() for initialization thus triggering a BUG_ON() in
   scatterlist API when CONFIG_DEBUG_SG was enabled. This adds and
   uses a small helper sg_init_marker() to properly handle the affected
   cases, from Prashant.

9) Let the BPF core follow IDR code convention and therefore use the
   idr_preload() and idr_preload_end() helpers, which would also help
   idr_alloc_cyclic() under GFP_ATOMIC to better succeed under memory
   pressure, from Shaohua.

10) Last but not least, a spelling fix in an error message for the
    BPF cookie UID helper under BPF sample code, from Colin.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 23:33:04 -04:00
Eric Dumazet
f2d1c724fc inet: frags: get rid of nf_ct_frag6_skb_cb/NFCT_FRAG6_CB
nf_ct_frag6_queue() uses skb->cb[] to store the fragment offset,
meaning that we could use two cache lines per skb when finding
the insertion point, if for some reason inet6_skb_parm size
is increased in the future.

By using skb->ip_defrag_offset instead of skb->cb[] we pack all the fields
in a single cache line, matching what we did for IPv4.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 23:25:40 -04:00
Eric Dumazet
219badfaad ipv6: frags: get rid of ip6frag_skb_cb/FRAG6_CB
ip6_frag_queue uses skb->cb[] to store the fragment offset, meaning that
we could use two cache lines per skb when finding the insertion point,
if for some reason inet6_skb_parm size is increased in the future.

By using skb->ip_defrag_offset instead of skb->cb[], we pack all
the fields in a single cache line, matching what we did for IPv4.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 23:25:40 -04:00
Eric Dumazet
bf66337140 inet: frags: get rid of ipfrag_skb_cb/FRAG_CB
ip_defrag uses skb->cb[] to store the fragment offset, and unfortunately
this integer is currently in a different cache line than skb->next,
meaning that we use two cache lines per skb when finding the insertion point.

By aliasing skb->ip_defrag_offset and skb->dev, we pack all the fields
in a single cache line and save precious memory bandwidth.

Note that after the fast path added by Changli Gao in commit
d6bebca92c ("fragment: add fast path for in-order fragments")
this change wont help the fast path, since we still need
to access prev->len (2nd cache line), but will show great
benefits when slow path is entered, since we perform
a linear scan of a potentially long list.

Also, note that this potential long list is an attack vector,
we might consider also using an rb-tree there eventually.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 23:25:40 -04:00
Eric Dumazet
05c0b86b96 ipv6: frags: rewrite ip6_expire_frag_queue()
Make it similar to IPv4 ip_expire(), and release the lock
before calling icmp functions.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 23:25:39 -04:00
Eric Dumazet
1eec5d5670 inet: frags: do not clone skb in ip_expire()
An skb_clone() was added in commit ec4fbd6475 ("inet: frag: release
spinlock before calling icmp_send()")

While fixing the bug at that time, it also added a very high cost
for DDOS frags, as the ICMP rate limit is applied after this
expensive operation (skb_clone() + consume_skb(), implying memory
allocations, copy, and freeing)

We can use skb_get(head) here, all we want is to make sure skb wont
be freed by another cpu.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 23:25:39 -04:00
Eric Dumazet
3e67f106f6 inet: frags: break the 2GB limit for frags storage
Some users are willing to provision huge amounts of memory to be able
to perform reassembly reasonnably well under pressure.

Current memory tracking is using one atomic_t and integers.

Switch to atomic_long_t so that 64bit arches can use more than 2GB,
without any cost for 32bit arches.

Note that this patch avoids an overflow error, if high_thresh was set
to ~2GB, since this test in inet_frag_alloc() was never true :

if (... || frag_mem_limit(nf) > nf->high_thresh)

Tested:

$ echo 16000000000 >/proc/sys/net/ipv4/ipfrag_high_thresh

<frag DDOS>

$ grep FRAG /proc/net/sockstat
FRAG: inuse 14705885 memory 16000002880

$ nstat -n ; sleep 1 ; nstat | grep Reas
IpReasmReqds                    3317150            0.0
IpReasmFails                    3317112            0.0

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 23:25:39 -04:00
Eric Dumazet
2d44ed22e6 inet: frags: remove inet_frag_maybe_warn_overflow()
This function is obsolete, after rhashtable addition to inet defrag.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 23:25:39 -04:00
Eric Dumazet
399d1404be inet: frags: get rif of inet_frag_evicting()
This refactors ip_expire() since one indentation level is removed.

Note: in the future, we should try hard to avoid the skb_clone()
since this is a serious performance cost.
Under DDOS, the ICMP message wont be sent because of rate limits.

Fact that ip6_expire_frag_queue() does not use skb_clone() is
disturbing too. Presumably IPv6 should have the same
issue than the one we fixed in commit ec4fbd6475
("inet: frag: release spinlock before calling icmp_send()")

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 23:25:39 -04:00
Eric Dumazet
6befe4a78b inet: frags: remove some helpers
Remove sum_frag_mem_limit(), ip_frag_mem() & ip6_frag_mem()

Also since we use rhashtable we can bring back the number of fragments
in "grep FRAG /proc/net/sockstat /proc/net/sockstat6" that was
removed in commit 434d305405 ("inet: frag: don't account number
of fragment queues")

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 23:25:39 -04:00
Eric Dumazet
648700f76b inet: frags: use rhashtables for reassembly units
Some applications still rely on IP fragmentation, and to be fair linux
reassembly unit is not working under any serious load.

It uses static hash tables of 1024 buckets, and up to 128 items per bucket (!!!)

A work queue is supposed to garbage collect items when host is under memory
pressure, and doing a hash rebuild, changing seed used in hash computations.

This work queue blocks softirqs for up to 25 ms when doing a hash rebuild,
occurring every 5 seconds if host is under fire.

Then there is the problem of sharing this hash table for all netns.

It is time to switch to rhashtables, and allocate one of them per netns
to speedup netns dismantle, since this is a critical metric these days.

Lookup is now using RCU. A followup patch will even remove
the refcount hold/release left from prior implementation and save
a couple of atomic operations.

Before this patch, 16 cpus (16 RX queue NIC) could not handle more
than 1 Mpps frags DDOS.

After the patch, I reach 9 Mpps without any tuning, and can use up to 2GB
of storage for the fragments (exact number depends on frags being evicted
after timeout)

$ grep FRAG /proc/net/sockstat
FRAG: inuse 1966916 memory 2140004608

A followup patch will change the limits for 64bit arches.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Florian Westphal <fw@strlen.de>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Alexander Aring <alex.aring@gmail.com>
Cc: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 23:25:39 -04:00
Eric Dumazet
483a6e4fa0 inet: frags: refactor ipfrag_init()
We need to call inet_frags_init() before register_pernet_subsys(),
as a prereq for following patch ("inet: frags: use rhashtables for reassembly units")

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 23:25:38 -04:00
Eric Dumazet
807f1844df inet: frags: refactor lowpan_net_frag_init()
We want to call lowpan_net_frag_init() earlier.
Similar to commit "inet: frags: refactor ipv6_frag_init()"

This is a prereq to "inet: frags: use rhashtables for reassembly units"

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 23:25:38 -04:00
Eric Dumazet
5b975bab23 inet: frags: refactor ipv6_frag_init()
We want to call inet_frags_init() earlier.

This is a prereq to "inet: frags: use rhashtables for reassembly units"

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 23:25:38 -04:00
Eric Dumazet
093ba72914 inet: frags: add a pointer to struct netns_frags
In order to simplify the API, add a pointer to struct inet_frags.
This will allow us to make things less complex.

These functions no longer have a struct inet_frags parameter :

inet_frag_destroy(struct inet_frag_queue *q  /*, struct inet_frags *f */)
inet_frag_put(struct inet_frag_queue *q /*, struct inet_frags *f */)
inet_frag_kill(struct inet_frag_queue *q /*, struct inet_frags *f */)
inet_frags_exit_net(struct netns_frags *nf /*, struct inet_frags *f */)
ip6_expire_frag_queue(struct net *net, struct frag_queue *fq)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 23:25:38 -04:00
Eric Dumazet
787bea7748 inet: frags: change inet_frags_init_net() return value
We will soon initialize one rhashtable per struct netns_frags
in inet_frags_init_net().

This patch changes the return value to eventually propagate an
error.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 23:25:38 -04:00
Wei Yongjun
eeb0a2a526 vlan: vlan_hw_filter_capable() can be static
Fixes the following sparse warning:

net/8021q/vlan_core.c:168:6: warning:
 symbol 'vlan_hw_filter_capable' was not declared. Should it be static?

Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 23:20:48 -04:00
David S. Miller
e2e80c027f RxRPC development
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAWr6a+fu3V2unywtrAQJDgQ/+Pdyv8EW7VtdPRNdePGJLh4WHVrCnSL64
 hKzHV3gEt8PK7sJi53lkbvTYJ3hpB/SC9Li0TdOvph4ngH8pUer7HiXW9g2zrcy+
 KsbhmWslMOH6uy+8f2yWoI0HmSz7XvzQATCKOJz5Lm7X0JfvSoKkhgn6VLxqG91C
 p3IAYr5735/mqZIqs+FCIP2jLIPtCH/Plv18F9w847dSS1qsH6twAsUVjzn9CtJv
 MHQ+Wo57tpoOwynnZO+CDLhZGdXJ5YIZ8wXNqu/EUaeAHXgqkpSd4IUZeguDpoJR
 YrkvMf8cQlcBDohHBBmIVz9OOIZ8A7WbygNf7UPS9DGYjNVL9vKxF89xwAKYA5Ky
 LYmUug8WYp2FIHxuWGomF15SrzbNNDfD/v7ZvjXmNxvJXXV+AbzNoXZifSAuyo2V
 2bgolw22PbZAMgRvBr5OjtxGD65lpCD00ZJWBgFrWn8l3hdK/40NJW5s3nQCShmm
 NCUAFV+nHvazfUNfcBAbVRPXCN8Pc0Cuspr6HJ3Wi4q5lnpJQY+hS+KFNizL6iyu
 1fNRcqBF8KnVM5bE5Y27o96W/RhNdFudYB+sAH5IEGRqqG5YjkwvPgILvXhTH2kI
 MDxHMFjRgvWijzLLPpBd9YAsPLddvE6WRlngqOrSp6ICP4o8Vmeb2mxgo4tbQBXu
 3dIL1fEfhME=
 =Lyzx
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-next-20180330' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Fixes and more traces

Here are some patches that add some more tracepoints to AF_RXRPC and fix
some issues therein:

 (1) Fix the use of VERSION packets to keep firewall routes open.

 (2) Fix the incorrect current time usage in a tracepoint.

 (3) Fix Tx ring annotation corruption.

 (4) Fix accidental conversion of call-level abort into connection-level
     abort.

 (5) Fix calculation of resend time.

 (6) Remove a couple of unused variables.

 (7) Fix a bunch of checker warnings and an error.  Note that not all
     warnings can be quashed as checker doesn't seem to correctly handle
     seqlocks.

 (8) Fix a potential race between call destruction and socket/net
     destruction.

 (9) Add a tracepoint to track rxrpc_local refcounting.

(10) Fix an apparent leak of rxrpc_local objects.

(11) Add a tracepoint to track rxrpc_peer refcounting.

(12) Fix a leak of rxrpc_peer objects.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 22:29:12 -04:00
Kirill Tkhai
554873e517 net: Do not take net_rwsem in __rtnl_link_unregister()
This function calls call_netdevice_notifier(), which also
may take net_rwsem. So, we can't use net_rwsem here.

This patch makes callers of this functions take pernet_ops_rwsem,
like register_netdevice_notifier() does. This will protect
the modifications of net_namespace_list, and allows notifiers
to take it (they won't have to care about context).

Since __rtnl_link_unregister() is used on module load
and unload (which are not frequent operations), this looks
for me better, than make all call_netdevice_notifier()
always executing in "protected net_namespace_list" context.

Also, this fixes the problem we had a deal in 328fbe747a
"Close race between {un, }register_netdevice_notifier and ...",
and guarantees __rtnl_link_unregister() does not skip
exitting net.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 22:24:58 -04:00
Kirill Tkhai
fc1dd36992 net: Remove net_rwsem from {, un}register_netdevice_notifier()
These functions take net_rwsem, while wireless_nlevent_flush()
also takes it. But down_read() can't be taken recursive,
because of rw_semaphore design, which prevents it to be occupied
by only readers forever.

Since we take pernet_ops_rwsem in {,un}register_netdevice_notifier(),
net list can't change, so these down_read()/up_read() can be removed.

Fixes: f0b07bb151 "net: Introduce net_rwsem to protect net_namespace_list"
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 22:24:58 -04:00
Jon Maloy
7494cfa6d3 tipc: avoid possible string overflow
gcc points out that the combined length of the fixed-length inputs to
l->name is larger than the destination buffer size:

net/tipc/link.c: In function 'tipc_link_create':
net/tipc/link.c:465:26: error: '%s' directive writing up to 32 bytes
into a region of size between 26 and 58 [-Werror=format-overflow=]
sprintf(l->name, "%s:%s-%s:unknown", self_str, if_name, peer_str);

net/tipc/link.c:465:2: note: 'sprintf' output 11 or more bytes
(assuming 75) into a destination of size 60
sprintf(l->name, "%s:%s-%s:unknown", self_str, if_name, peer_str);

A detailed analysis reveals that the theoretical maximum length of
a link name is:
max self_str + 1 + max if_name + 1 + max peer_str + 1 + max if_name =
16 + 1 + 15 + 1 + 16 + 1 + 15 = 65
Since we also need space for a trailing zero we now set MAX_LINK_NAME
to 68.

Just to be on the safe side we also replace the sprintf() call with
snprintf().

Fixes: 25b0b9c4e8 ("tipc: handle collisions of 32-bit node address
hash values")
Reported-by: Arnd Bergmann <arnd@arndb.de>

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 22:19:52 -04:00
Jon Maloy
37922ea4a3 tipc: permit overlapping service ranges in name table
With the new RB tree structure for service ranges it becomes possible to
solve an old problem; - we can now allow overlapping service ranges in
the table.

When inserting a new service range to the tree, we use 'lower' as primary
key, and when necessary 'upper' as secondary key.

Since there may now be multiple service ranges matching an indicated
'lower' value, we must also add the 'upper' value to the functions
used for removing publications, so that the correct, corresponding
range item can be found.

These changes guarantee that a well-formed publication/withdrawal item
from a peer node never will be rejected, and make it possible to
eliminate the problematic backlog functionality we currently have for
handling such cases.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 22:19:52 -04:00
Jon Maloy
f20889f72b tipc: refactor name table translate function
The function tipc_nametbl_translate() function is ugly and hard to
follow. This can be improved somewhat by introducing a stack variable
for holding the publication list to be used and re-ordering the if-
clauses for selection of algorithm.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 22:19:52 -04:00
Jon Maloy
218527fe27 tipc: replace name table service range array with rb tree
The current design of the binding table has an unnecessary memory
consuming and complex data structure. It aggregates the service range
items into an array, which is expanded by a factor two every time it
becomes too small to hold a new item. Furthermore, the arrays never
shrink when the number of ranges diminishes.

We now replace this array with an RB tree that is holding the range
items as tree nodes, each range directly holding a list of bindings.

This, along with a few name changes, improves both readability and
volume of the code, as well as reducing memory consumption and hopefully
improving cache hit rate.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 22:19:52 -04:00
Nikolay Aleksandrov
804b854d37 net: bridge: disable bridge MTU auto tuning if it was set manually
As Roopa noted today the biggest source of problems when configuring
bridge and ports is that the bridge MTU keeps changing automatically on
port events (add/del/changemtu). That leads to inconsistent behaviour
and network config software needs to chase the MTU and fix it on each
such event. Let's improve on that situation and allow for the user to
set any MTU within ETH_MIN/MAX limits, but once manually configured it
is the user's responsibility to keep it correct afterwards.

In case the MTU isn't manually set - the behaviour reverts to the
previous and the bridge follows the minimum MTU.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 22:19:00 -04:00
Nikolay Aleksandrov
f40aa23339 net: bridge: set min MTU on port events and allow user to set max
Recently the bridge was changed to automatically set maximum MTU on port
events (add/del/changemtu) when vlan filtering is enabled, but that
actually changes behaviour in a way which breaks some setups and can lead
to packet drops. In order to still allow that maximum to be set while being
compatible, we add the ability for the user to tune the bridge MTU up to
the maximum when vlan filtering is enabled, but that has to be done
explicitly and all port events (add/del/changemtu) lead to resetting that
MTU to the minimum as before.

Suggested-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 22:19:00 -04:00
Linus Torvalds
b5dbc28762 Kbuild fixes for v4.16 (3rd)
- fix missed rebuild of TRIM_UNUSED_KSYMS
 
 - fix rpm-pkg for GNU tar >= 1.29
 
 - include scripts/dtc/include-prefixes/* to kernel header deb-pkg
 
 - add -no-integrated-as option ealier to fix building with Clang
 
 - fix netfilter Makefile for parallel building
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJavwJpAAoJED2LAQed4NsGQuIQAK/UmPVczOxT7RefB4BrAsZG
 Zlai7HnfpzWk5EZE6fbTHTmbFu6HZ1TuYhOW5UlJcxd3P+nJfL5WwDo0H52LVfLT
 UkSubLCtZBl+DqtbuOg4Xrmh8k3WneGqYT7H9D19LRXTeeoh82g81+mWYL3F9UOA
 OWGzKf9+3CQhP7OjeVlfdQ8qv2UR+snyIK0jNRImTuhtys8iy2Q4EP/nQYtF7oAA
 KcYY62rS3qVKfTrdk5NY7kxvpp6/1m6141UPR75Xve7h+Emx/u0RthiMUW08e2bv
 PX5IlyI8XFz54wD2tojawMEo235cYPJAKQHZAry5tiLXvOF5vEZvoPGc8oUZnMGe
 bMNONRfXrKWi10/pcTqEfl6gEAE+bvOrqIKj/DECT4hF1av2uEeou/SzuEX+wbqK
 GxU4L5mnUwDsJNLPiUeVjyl4GD48X16lBdCs9laamRzYat5lKzJFBmgNf0dyHdI+
 l/myEtk17nSeohPWRgUeTBcP8O+E27rER7U/+KC0c4spwKrEfLFIzzNauLLJdugN
 o1VNYacseg3cLQnjSpmC26jxZw29jMFaLM5mBuiI7/F9mUlK6zaG6gyoDzV3A5lN
 jgPw48apNj4SLnUMrOi+1RYWXWkguF09f8GecjJKXvR5wGqzY7E3ZDi/zgXBf72q
 5r5dDuIExh0KXcO9Risp
 =2WPN
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-fixes-v4.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild fixes from Masahiro Yamada:

 - fix missed rebuild of TRIM_UNUSED_KSYMS

 - fix rpm-pkg for GNU tar >= 1.29

 - include scripts/dtc/include-prefixes/* to kernel header deb-pkg

 - add -no-integrated-as option ealier to fix building with Clang

 - fix netfilter Makefile for parallel building

* tag 'kbuild-fixes-v4.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  netfilter: nf_nat_snmp_basic: add correct dependency to Makefile
  kbuild: rpm-pkg: Support GNU tar >= 1.29
  builddeb: Fix header package regarding dtc source links
  kbuild: set no-integrated-as before incl. arch Makefile
  kbuild: make scripts/adjust_autoksyms.sh robust against timestamp races
2018-03-30 18:53:57 -10:00
Andrey Ignatov
aac3fc320d bpf: Post-hooks for sys_bind
"Post-hooks" are hooks that are called right before returning from
sys_bind. At this time IP and port are already allocated and no further
changes to `struct sock` can happen before returning from sys_bind but
BPF program has a chance to inspect the socket and change sys_bind
result.

Specifically it can e.g. inspect what port was allocated and if it
doesn't satisfy some policy, BPF program can force sys_bind to fail and
return EPERM to user.

Another example of usage is recording the IP:port pair to some map to
use it in later calls to sys_connect. E.g. if some TCP server inside
cgroup was bound to some IP:port_n, it can be recorded to a map. And
later when some TCP client inside same cgroup is trying to connect to
127.0.0.1:port_n, BPF hook for sys_connect can override the destination
and connect application to IP:port_n instead of 127.0.0.1:port_n. That
helps forcing all applications inside a cgroup to use desired IP and not
break those applications if they e.g. use localhost to communicate
between each other.

== Implementation details ==

Post-hooks are implemented as two new attach types
`BPF_CGROUP_INET4_POST_BIND` and `BPF_CGROUP_INET6_POST_BIND` for
existing prog type `BPF_PROG_TYPE_CGROUP_SOCK`.

Separate attach types for IPv4 and IPv6 are introduced to avoid access
to IPv6 field in `struct sock` from `inet_bind()` and to IPv4 field from
`inet6_bind()` since those fields might not make sense in such cases.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-31 02:16:26 +02:00
Andrey Ignatov
d74bad4e74 bpf: Hooks for sys_connect
== The problem ==

See description of the problem in the initial patch of this patch set.

== The solution ==

The patch provides much more reliable in-kernel solution for the 2nd
part of the problem: making outgoing connecttion from desired IP.

It adds new attach types `BPF_CGROUP_INET4_CONNECT` and
`BPF_CGROUP_INET6_CONNECT` for program type
`BPF_PROG_TYPE_CGROUP_SOCK_ADDR` that can be used to override both
source and destination of a connection at connect(2) time.

Local end of connection can be bound to desired IP using newly
introduced BPF-helper `bpf_bind()`. It allows to bind to only IP though,
and doesn't support binding to port, i.e. leverages
`IP_BIND_ADDRESS_NO_PORT` socket option. There are two reasons for this:
* looking for a free port is expensive and can affect performance
  significantly;
* there is no use-case for port.

As for remote end (`struct sockaddr *` passed by user), both parts of it
can be overridden, remote IP and remote port. It's useful if an
application inside cgroup wants to connect to another application inside
same cgroup or to itself, but knows nothing about IP assigned to the
cgroup.

Support is added for IPv4 and IPv6, for TCP and UDP.

IPv4 and IPv6 have separate attach types for same reason as sys_bind
hooks, i.e. to prevent reading from / writing to e.g. user_ip6 fields
when user passes sockaddr_in since it'd be out-of-bound.

== Implementation notes ==

The patch introduces new field in `struct proto`: `pre_connect` that is
a pointer to a function with same signature as `connect` but is called
before it. The reason is in some cases BPF hooks should be called way
before control is passed to `sk->sk_prot->connect`. Specifically
`inet_dgram_connect` autobinds socket before calling
`sk->sk_prot->connect` and there is no way to call `bpf_bind()` from
hooks from e.g. `ip4_datagram_connect` or `ip6_datagram_connect` since
it'd cause double-bind. On the other hand `proto.pre_connect` provides a
flexible way to add BPF hooks for connect only for necessary `proto` and
call them at desired time before `connect`. Since `bpf_bind()` is
allowed to bind only to IP and autobind in `inet_dgram_connect` binds
only port there is no chance of double-bind.

bpf_bind() sets `force_bind_address_no_port` to bind to only IP despite
of value of `bind_address_no_port` socket field.

bpf_bind() sets `with_lock` to `false` when calling to __inet_bind()
and __inet6_bind() since all call-sites, where bpf_bind() is called,
already hold socket lock.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-31 02:15:54 +02:00
Andrey Ignatov
3679d585bb net: Introduce __inet_bind() and __inet6_bind
Refactor `bind()` code to make it ready to be called from BPF helper
function `bpf_bind()` (will be added soon). Implementation of
`inet_bind()` and `inet6_bind()` is separated into `__inet_bind()` and
`__inet6_bind()` correspondingly. These function can be used from both
`sk_prot->bind` and `bpf_bind()` contexts.

New functions have two additional arguments.

`force_bind_address_no_port` forces binding to IP only w/o checking
`inet_sock.bind_address_no_port` field. It'll allow to bind local end of
a connection to desired IP in `bpf_bind()` w/o changing
`bind_address_no_port` field of a socket. It's useful since `bpf_bind()`
can return an error and we'd need to restore original value of
`bind_address_no_port` in that case if we changed this before calling to
the helper.

`with_lock` specifies whether to lock socket when working with `struct
sk` or not. The argument is set to `true` for `sk_prot->bind`, i.e. old
behavior is preserved. But it will be set to `false` for `bpf_bind()`
use-case. The reason is all call-sites, where `bpf_bind()` will be
called, already hold that socket lock.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-31 02:15:43 +02:00
Andrey Ignatov
4fbac77d2d bpf: Hooks for sys_bind
== The problem ==

There is a use-case when all processes inside a cgroup should use one
single IP address on a host that has multiple IP configured.  Those
processes should use the IP for both ingress and egress, for TCP and UDP
traffic. So TCP/UDP servers should be bound to that IP to accept
incoming connections on it, and TCP/UDP clients should make outgoing
connections from that IP. It should not require changing application
code since it's often not possible.

Currently it's solved by intercepting glibc wrappers around syscalls
such as `bind(2)` and `connect(2)`. It's done by a shared library that
is preloaded for every process in a cgroup so that whenever TCP/UDP
server calls `bind(2)`, the library replaces IP in sockaddr before
passing arguments to syscall. When application calls `connect(2)` the
library transparently binds the local end of connection to that IP
(`bind(2)` with `IP_BIND_ADDRESS_NO_PORT` to avoid performance penalty).

Shared library approach is fragile though, e.g.:
* some applications clear env vars (incl. `LD_PRELOAD`);
* `/etc/ld.so.preload` doesn't help since some applications are linked
  with option `-z nodefaultlib`;
* other applications don't use glibc and there is nothing to intercept.

== The solution ==

The patch provides much more reliable in-kernel solution for the 1st
part of the problem: binding TCP/UDP servers on desired IP. It does not
depend on application environment and implementation details (whether
glibc is used or not).

It adds new eBPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` and
attach types `BPF_CGROUP_INET4_BIND` and `BPF_CGROUP_INET6_BIND`
(similar to already existing `BPF_CGROUP_INET_SOCK_CREATE`).

The new program type is intended to be used with sockets (`struct sock`)
in a cgroup and provided by user `struct sockaddr`. Pointers to both of
them are parts of the context passed to programs of newly added types.

The new attach types provides hooks in `bind(2)` system call for both
IPv4 and IPv6 so that one can write a program to override IP addresses
and ports user program tries to bind to and apply such a program for
whole cgroup.

== Implementation notes ==

[1]
Separate attach types for `AF_INET` and `AF_INET6` are added
intentionally to prevent reading/writing to offsets that don't make
sense for corresponding socket family. E.g. if user passes `sockaddr_in`
it doesn't make sense to read from / write to `user_ip6[]` context
fields.

[2]
The write access to `struct bpf_sock_addr_kern` is implemented using
special field as an additional "register".

There are just two registers in `sock_addr_convert_ctx_access`: `src`
with value to write and `dst` with pointer to context that can't be
changed not to break later instructions. But the fields, allowed to
write to, are not available directly and to access them address of
corresponding pointer has to be loaded first. To get additional register
the 1st not used by `src` and `dst` one is taken, its content is saved
to `bpf_sock_addr_kern.tmp_reg`, then the register is used to load
address of pointer field, and finally the register's content is restored
from the temporary field after writing `src` value.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-31 02:15:18 +02:00
Andrey Ignatov
5e43f899b0 bpf: Check attach type at prog load time
== The problem ==

There are use-cases when a program of some type can be attached to
multiple attach points and those attach points must have different
permissions to access context or to call helpers.

E.g. context structure may have fields for both IPv4 and IPv6 but it
doesn't make sense to read from / write to IPv6 field when attach point
is somewhere in IPv4 stack.

Same applies to BPF-helpers: it may make sense to call some helper from
some attach point, but not from other for same prog type.

== The solution ==

Introduce `expected_attach_type` field in in `struct bpf_attr` for
`BPF_PROG_LOAD` command. If scenario described in "The problem" section
is the case for some prog type, the field will be checked twice:

1) At load time prog type is checked to see if attach type for it must
   be known to validate program permissions correctly. Prog will be
   rejected with EINVAL if it's the case and `expected_attach_type` is
   not specified or has invalid value.

2) At attach time `attach_type` is compared with `expected_attach_type`,
   if prog type requires to have one, and, if they differ, attach will
   be rejected with EINVAL.

The `expected_attach_type` is now available as part of `struct bpf_prog`
in both `bpf_verifier_ops->is_valid_access()` and
`bpf_verifier_ops->get_func_proto()` () and can be used to check context
accesses and calls to helpers correspondingly.

Initially the idea was discussed by Alexei Starovoitov <ast@fb.com> and
Daniel Borkmann <daniel@iogearbox.net> here:
https://marc.info/?l=linux-netdev&m=152107378717201&w=2

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-31 02:14:44 +02:00
David Howells
17226f1240 rxrpc: Fix leak of rxrpc_peer objects
When a new client call is requested, an rxrpc_conn_parameters struct object
is passed in with a bunch of parameters set, such as the local endpoint to
use.  A pointer to the target peer record is also placed in there by
rxrpc_get_client_conn() - and this is removed if and only if a new
connection object is allocated.  Thus it leaks if a new connection object
isn't allocated.

Fix this by putting any peer object attached to the rxrpc_conn_parameters
object in the function that allocated it.

Fixes: 19ffa01c9c ("rxrpc: Use structs to hold connection params and protocol info")
Signed-off-by: David Howells <dhowells@redhat.com>
2018-03-30 21:05:44 +01:00
David Howells
1159d4b496 rxrpc: Add a tracepoint to track rxrpc_peer refcounting
Add a tracepoint to track reference counting on the rxrpc_peer struct.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-03-30 21:05:38 +01:00
David Howells
31f5f9a169 rxrpc: Fix apparent leak of rxrpc_local objects
rxrpc_local objects cannot be disposed of until all the connections that
point to them have been RCU'd as a connection object holds refcount on the
local endpoint it is communicating through.  Currently, this can cause an
assertion failure to occur when a network namespace is destroyed as there's
no check that the RCU destructors for the connections have been run before
we start trying to destroy local endpoints.

The kernel reports:

	rxrpc: AF_RXRPC: Leaked local 0000000036a41bc1 {5}
	------------[ cut here ]------------
	kernel BUG at ../net/rxrpc/local_object.c:439!

Fix this by keeping a count of the live connections and waiting for it to
go to zero at the end of rxrpc_destroy_all_connections().

Fixes: dee46364ce ("rxrpc: Add RCU destruction for connections and calls")
Signed-off-by: David Howells <dhowells@redhat.com>
2018-03-30 21:05:33 +01:00
David Howells
09d2bf595d rxrpc: Add a tracepoint to track rxrpc_local refcounting
Add a tracepoint to track reference counting on the rxrpc_local struct.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-03-30 21:05:28 +01:00
David Howells
d3be4d2443 rxrpc: Fix potential call vs socket/net destruction race
rxrpc_call structs don't pin sockets or network namespaces, but may attempt
to access both after their refcount reaches 0 so that they can detach
themselves from the network namespace.  However, there's no guarantee that
the socket still exists at this point (so sock_net(&call->socket->sk) may
be invalid) and the namespace may have gone away if the call isn't pinning
a peer.

Fix this by (a) carrying a net pointer in the rxrpc_call struct and (b)
waiting for all calls to be destroyed when the network namespace goes away.

This was detected by checker:

net/rxrpc/call_object.c:634:57: warning: incorrect type in argument 1 (different address spaces)
net/rxrpc/call_object.c:634:57:    expected struct sock const *sk
net/rxrpc/call_object.c:634:57:    got struct sock [noderef] <asn:4>*<noident>

Fixes: 2baec2c3f8 ("rxrpc: Support network namespacing")
Signed-off-by: David Howells <dhowells@redhat.com>
2018-03-30 21:05:23 +01:00
David Howells
88f2a8257c rxrpc: Fix checker warnings and errors
Fix various issues detected by checker.

Errors:

 (*) rxrpc_discard_prealloc() should be using rcu_assign_pointer to set
     call->socket.

Warnings:

 (*) rxrpc_service_connection_reaper() should be passing NULL rather than 0 to
     trace_rxrpc_conn() as the where argument.

 (*) rxrpc_disconnect_client_call() should get its net pointer via the
     call->conn rather than call->sock to avoid a warning about accessing
     an RCU pointer without protection.

 (*) Proc seq start/stop functions need annotation as they pass locks
     between the functions.

False positives:

 (*) Checker doesn't correctly handle of seq-retry lock context balance in
     rxrpc_find_service_conn_rcu().

 (*) Checker thinks execution may proceed past the BUG() in
     rxrpc_publish_service_conn().

 (*) Variable length array warnings from SKCIPHER_REQUEST_ON_STACK() in
     rxkad.c.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-03-30 21:05:17 +01:00
Sebastian Andrzej Siewior
edb63e2b27 rxrpc: remove unused static variables
The rxrpc_security_methods and rxrpc_security_sem user has been removed
in 648af7fca1 ("rxrpc: Absorb the rxkad security module"). This was
noticed by kbuild test robot for the -RT tree but is also true for !RT.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David Howells <dhowells@redhat.com>
2018-03-30 21:04:44 +01:00
Marc Dionne
59299aa102 rxrpc: Fix resend event time calculation
Commit a158bdd3 ("rxrpc: Fix call timeouts") reworked the time calculation
for the next resend event.  For this calculation, "oldest" will be before
"now", so ktime_sub(oldest, now) will yield a negative value.  When passed
to nsecs_to_jiffies which expects an unsigned value, the end result will be
a very large value, and a resend event scheduled far into the future.  This
could cause calls to stall if some packets were lost.

Fix by ordering the arguments to ktime_sub correctly.

Fixes: a158bdd324 ("rxrpc: Fix call timeouts")
Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2018-03-30 21:04:44 +01:00
David Howells
57b0c9d49b rxrpc: Don't treat call aborts as conn aborts
If a call-level abort is received for the previous call to complete on a
connection channel, then that abort is queued for the connection processor
to handle.  Unfortunately, the connection processor then assumes without
checking that the abort is connection-level (ie. callNumber is 0) and
distributes it over all active calls on that connection, thereby
incorrectly aborting them.

Fix this by discarding aborts aimed at a completed call.

Further, discard all packets aimed at a call that's complete if there's
currently an active call on a channel, since the DATA packets associated
with the new call automatically terminate the old call.

Fixes: 18bfeba50d ("rxrpc: Perform terminal call ACK/ABORT retransmission from conn processor")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2018-03-30 21:04:44 +01:00
David Howells
03877bf6a3 rxrpc: Fix Tx ring annotation after initial Tx failure
rxrpc calls have a ring of packets that are awaiting ACK or retransmission
and a parallel ring of annotations that tracks the state of those packets.
If the initial transmission of a packet on the underlying UDP socket fails
then the packet annotation is marked for resend - but the setting of this
mark accidentally erases the last-packet mark also stored in the same
annotation slot.  If this happens, a call won't switch out of the Tx phase
when all the packets have been transmitted.

Fix this by retaining the last-packet mark and only altering the packet
state.

Fixes: 248f219cb8 ("rxrpc: Rewrite the data and ack handling code")
Signed-off-by: David Howells <dhowells@redhat.com>
2018-03-30 21:04:43 +01:00
David Howells
f82eb88b0f rxrpc: Fix a bit of time confusion
The rxrpc_reduce_call_timer() function should be passed the 'current time'
in jiffies, not the current ktime time.  It's confusing in rxrpc_resend
because that has to deal with both.  Pass the correct current time in.

Note that this only affects the trace produced and not the functioning of
the code.

Fixes: a158bdd324 ("rxrpc: Fix call timeouts")
Signed-off-by: David Howells <dhowells@redhat.com>
2018-03-30 21:04:43 +01:00
David Howells
ace45bec6d rxrpc: Fix firewall route keepalive
Fix the firewall route keepalive part of AF_RXRPC which is currently
function incorrectly by replying to VERSION REPLY packets from the server
with VERSION REQUEST packets.

Instead, send VERSION REPLY packets to the peers of service connections to
act as keep-alives 20s after the latest packet was transmitted to that
peer.

Also, just discard VERSION REPLY packets rather than replying to them.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-03-30 21:04:43 +01:00
David Ahern
b6cdbc8523 net/ipv6: Fix route leaking between VRFs
Donald reported that IPv6 route leaking between VRFs is not working.
The root cause is the strict argument in the call to rt6_lookup when
validating the nexthop spec.

ip6_route_check_nh validates the gateway and device (if given) of a
route spec. It in turn could call rt6_lookup (e.g., lookup in a given
table did not succeed so it falls back to a full lookup) and if so
sets the strict argument to 1. That means if the egress device is given,
the route lookup needs to return a result with the same device. This
strict requirement does not work with VRFs (IPv4 or IPv6) because the
oif in the flow struct is overridden with the index of the VRF device
to trigger a match on the l3mdev rule and force the lookup to its table.

The right long term solution is to add an l3mdev index to the flow
struct such that the oif is not overridden. That solution will not
backport well, so this patch aims for a simpler solution to relax the
strict argument if the route spec device is an l3mdev slave. As done
in other places, use the FLOWI_FLAG_SKIP_NH_OIF to know that the
RT6_LOOKUP_F_IFACE flag needs to be removed.

Fixes: ca254490c8 ("net: Add VRF support to IPv6 stack")
Reported-by: Donald Sharp <sharpd@cumulusnetworks.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-30 14:23:59 -04:00
David Lebrun
5807b22c91 ipv6: sr: fix seg6 encap performances with TSO enabled
Enabling TSO can lead to abysmal performances when using seg6 in
encap mode, such as with the ixgbe driver. This patch adds a call to
iptunnel_handle_offloads() to remove the encapsulation bit if needed.

Before:
root@comp4-seg6bpf:~# iperf3 -c fc00::55
Connecting to host fc00::55, port 5201
[  4] local fc45::4 port 36592 connected to fc00::55 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec   196 KBytes  1.60 Mbits/sec   47   6.66 KBytes
[  4]   1.00-2.00   sec   304 KBytes  2.49 Mbits/sec  100   5.33 KBytes
[  4]   2.00-3.00   sec   284 KBytes  2.32 Mbits/sec   92   5.33 KBytes

After:
root@comp4-seg6bpf:~# iperf3 -c fc00::55
Connecting to host fc00::55, port 5201
[  4] local fc45::4 port 43062 connected to fc00::55 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  1.03 GBytes  8.89 Gbits/sec    0    743 KBytes
[  4]   1.00-2.00   sec  1.03 GBytes  8.87 Gbits/sec    0    743 KBytes
[  4]   2.00-3.00   sec  1.03 GBytes  8.87 Gbits/sec    0    743 KBytes

Reported-by: Tom Herbert <tom@quantonium.net>
Fixes: 6c8702c60b ("ipv6: sr: add support for SRH encapsulation and injection with lwtunnels")
Signed-off-by: David Lebrun <dlebrun@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-30 14:14:33 -04:00
Toshiaki Makita
ae4745730c net: Fix untag for vlan packets without ethernet header
In some situation vlan packets do not have ethernet headers. One example
is packets from tun devices. Users can specify vlan protocol in tun_pi
field instead of IP protocol, and skb_vlan_untag() attempts to untag such
packets.

skb_vlan_untag() (more precisely, skb_reorder_vlan_header() called by it)
however did not expect packets without ethernet headers, so in such a case
size argument for memmove() underflowed and triggered crash.

====
BUG: unable to handle kernel paging request at ffff8801cccb8000
IP: __memmove+0x24/0x1a0 arch/x86/lib/memmove_64.S:43
PGD 9cee067 P4D 9cee067 PUD 1d9401063 PMD 1cccb7063 PTE 2810100028101
Oops: 000b [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 1 PID: 17663 Comm: syz-executor2 Not tainted 4.16.0-rc7+ #368
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:__memmove+0x24/0x1a0 arch/x86/lib/memmove_64.S:43
RSP: 0018:ffff8801cc046e28 EFLAGS: 00010287
RAX: ffff8801ccc244c4 RBX: fffffffffffffffe RCX: fffffffffff6c4c2
RDX: fffffffffffffffe RSI: ffff8801cccb7ffc RDI: ffff8801cccb8000
RBP: ffff8801cc046e48 R08: ffff8801ccc244be R09: ffffed0039984899
R10: 0000000000000001 R11: ffffed0039984898 R12: ffff8801ccc244c4
R13: ffff8801ccc244c0 R14: ffff8801d96b7c06 R15: ffff8801d96b7b40
FS:  00007febd562d700(0000) GS:ffff8801db300000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff8801cccb8000 CR3: 00000001ccb2f006 CR4: 00000000001606e0
DR0: 0000000020000000 DR1: 0000000020000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
Call Trace:
 memmove include/linux/string.h:360 [inline]
 skb_reorder_vlan_header net/core/skbuff.c:5031 [inline]
 skb_vlan_untag+0x470/0xc40 net/core/skbuff.c:5061
 __netif_receive_skb_core+0x119c/0x3460 net/core/dev.c:4460
 __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4627
 netif_receive_skb_internal+0x10b/0x670 net/core/dev.c:4701
 netif_receive_skb+0xae/0x390 net/core/dev.c:4725
 tun_rx_batched.isra.50+0x5ee/0x870 drivers/net/tun.c:1555
 tun_get_user+0x299e/0x3c20 drivers/net/tun.c:1962
 tun_chr_write_iter+0xb9/0x160 drivers/net/tun.c:1990
 call_write_iter include/linux/fs.h:1782 [inline]
 new_sync_write fs/read_write.c:469 [inline]
 __vfs_write+0x684/0x970 fs/read_write.c:482
 vfs_write+0x189/0x510 fs/read_write.c:544
 SYSC_write fs/read_write.c:589 [inline]
 SyS_write+0xef/0x220 fs/read_write.c:581
 do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x454879
RSP: 002b:00007febd562cc68 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007febd562d6d4 RCX: 0000000000454879
RDX: 0000000000000157 RSI: 0000000020000180 RDI: 0000000000000014
RBP: 000000000072bea0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
R13: 00000000000006b0 R14: 00000000006fc120 R15: 0000000000000000
Code: 90 90 90 90 90 90 90 48 89 f8 48 83 fa 20 0f 82 03 01 00 00 48 39 fe 7d 0f 49 89 f0 49 01 d0 49 39 f8 0f 8f 9f 00 00 00 48 89 d1 <f3> a4 c3 48 81 fa a8 02 00 00 72 05 40 38 fe 74 3b 48 83 ea 20
RIP: __memmove+0x24/0x1a0 arch/x86/lib/memmove_64.S:43 RSP: ffff8801cc046e28
CR2: ffff8801cccb8000
====

We don't need to copy headers for packets which do not have preceding
headers of vlan headers, so skip memmove() in that case.

Fixes: 4bbb3e0e82 ("net: Fix vlan untag for bridge and vlan_dev with reorder_hdr off")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-30 12:36:27 -04:00
Lorenzo Bianconi
428604fb11 ipv6: do not set routes if disable_ipv6 has been enabled
Do not allow setting ipv6 routes from userspace if disable_ipv6 has been
enabled. The issue can be triggered using the following reproducer:

- sysctl net.ipv6.conf.all.disable_ipv6=1
- ip -6 route add a🅱️c:d::/64 dev em1
- ip -6 route show
  a🅱️c:d::/64 dev em1 metric 1024 pref medium

Fix it checking disable_ipv6 value in ip6_route_info_create routine

Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-30 12:20:52 -04:00
David S. Miller
d162190bde Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter/IPVS updates for your net-next
tree. This batch comes with more input sanitization for xtables to
address bug reports from fuzzers, preparation works to the flowtable
infrastructure and assorted updates. In no particular order, they are:

1) Make sure userspace provides a valid standard target verdict, from
   Florian Westphal.

2) Sanitize error target size, also from Florian.

3) Validate that last rule in basechain matches underflow/policy since
   userspace assumes this when decoding the ruleset blob that comes
   from the kernel, from Florian.

4) Consolidate hook entry checks through xt_check_table_hooks(),
   patch from Florian.

5) Cap ruleset allocations at 512 mbytes, 134217728 rules and reject
   very large compat offset arrays, so we have a reasonable upper limit
   and fuzzers don't exercise the oom-killer. Patches from Florian.

6) Several WARN_ON checks on xtables mutex helper, from Florian.

7) xt_rateest now has a hashtable per net, from Cong Wang.

8) Consolidate counter allocation in xt_counters_alloc(), from Florian.

9) Earlier xt_table_unlock() call in {ip,ip6,arp,eb}tables, patch
   from Xin Long.

10) Set FLOW_OFFLOAD_DIR_* to IP_CT_DIR_* definitions, patch from
    Felix Fietkau.

11) Consolidate code through flow_offload_fill_dir(), also from Felix.

12) Inline ip6_dst_mtu_forward() just like ip_dst_mtu_maybe_forward()
    to remove a dependency with flowtable and ipv6.ko, from Felix.

13) Cache mtu size in flow_offload_tuple object, this is safe for
    forwarding as f87c10a8aa describes, from Felix.

14) Rename nf_flow_table.c to nf_flow_table_core.o, to simplify too
    modular infrastructure, from Felix.

15) Add rt0, rt2 and rt4 IPv6 routing extension support, patch from
    Ahmed Abdelsalam.

16) Remove unused parameter in nf_conncount_count(), from Yi-Hung Wei.

17) Support for counting only to nf_conncount infrastructure, patch
    from Yi-Hung Wei.

18) Add strict NFT_CT_{SRC_IP,DST_IP,SRC_IP6,DST_IP6} key datatypes
    to nft_ct.

19) Use boolean as return value from ipt_ah and from IPVS too, patch
    from Gustavo A. R. Silva.

20) Remove useless parameters in nfnl_acct_overquota() and
    nf_conntrack_broadcast_help(), from Taehee Yoo.

21) Use ipv6_addr_is_multicast() from xt_cluster, also from Taehee Yoo.

22) Statify nf_tables_obj_lookup_byhandle, patch from Fengguang Wu.

23) Fix typo in xt_limit, from Geert Uytterhoeven.

24) Do no use VLAs in Netfilter code, again from Gustavo.

25) Use ADD_COUNTER from ebtables, from Taehee Yoo.

26) Bitshift support for CONNMARK and MARK targets, from Jack Ma.

27) Use pr_*() and add pr_fmt(), from Arushi Singhal.

28) Add synproxy support to ctnetlink.

29) ICMP type and IGMP matching support for ebtables, patches from
    Matthias Schiffer.

30) Support for the revision infrastructure to ebtables, from
    Bernie Harris.

31) String match support for ebtables, also from Bernie.

32) Documentation for the new flowtable infrastructure.

33) Use generic comparison functions in ebt_stp, from Joe Perches.

34) Demodularize filter chains in nftables.

35) Register conntrack hooks in case nftables NAT chain is added.

36) Merge assignments with return in a couple of spots in the
    Netfilter codebase, also from Arushi.

37) Document that xtables percpu counters are stored in the same
    memory area, from Ben Hutchings.

38) Revert mark_source_chains() sanity checks that break existing
    rulesets, from Florian Westphal.

39) Use is_zero_ether_addr() in the ipset codebase, from Joe Perches.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-30 11:41:18 -04:00
Kirill Tkhai
328fbe747a net: Close race between {un, }register_netdevice_notifier() and setup_net()/cleanup_net()
{un,}register_netdevice_notifier() iterate over all net namespaces
hashed to net_namespace_list. But pernet_operations register and
unregister netdevices in unhashed net namespace, and they are not
seen for netdevice notifiers. This results in asymmetry:

1)Race with register_netdevice_notifier()
  pernet_operations::init(net)	...
   register_netdevice()		...
    call_netdevice_notifiers()  ...
      ... nb is not called ...
  ...				register_netdevice_notifier(nb) -> net skipped
  ...				...
  list_add_tail(&net->list, ..) ...

  Then, userspace stops using net, and it's destructed:

  pernet_operations::exit(net)
   unregister_netdevice()
    call_netdevice_notifiers()
      ... nb is called ...

This always happens with net::loopback_dev, but it may be not the only device.

2)Race with unregister_netdevice_notifier()
  pernet_operations::init(net)
   register_netdevice()
    call_netdevice_notifiers()
      ... nb is called ...

  Then, userspace stops using net, and it's destructed:

  list_del_rcu(&net->list)	...
  pernet_operations::exit(net)  unregister_netdevice_notifier(nb) -> net skipped
   dev_change_net_namespace()	...
    call_netdevice_notifiers()
      ... nb is not called ...
   unregister_netdevice()
    call_netdevice_notifiers()
      ... nb is not called ...

This race is more danger, since dev_change_net_namespace() moves real
network devices, which use not trivial netdevice notifiers, and if this
will happen, the system will be left in unpredictable state.

The patch closes the race. During the testing I found two places,
where register_netdevice_notifier() is called from pernet init/exit
methods (which led to deadlock) and fixed them (see previous patches).

The review moved me to one more unusual registration place:
raw_init() (can driver). It may be a reason of problems,
if someone creates in-kernel CAN_RAW sockets, since they
will be destroyed in exit method and raw_release()
will call unregister_netdevice_notifier(). But grep over
kernel tree does not show, someone creates such sockets
from kernel space.

Theoretically, there can be more places like this, and which are
hidden from review, but we found them on the first bumping there
(since there is no a race, it will be 100% reproducible).

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-30 10:59:35 -04:00
Kirill Tkhai
9e2f6c5d78 netfilter: Rework xt_TEE netdevice notifier
Register netdevice notifier for every iptable entry
is not good, since this breaks modularity, and
the hidden synchronization is based on rtnl_lock().

This patch reworks the synchronization via new lock,
while the rest of logic remains as it was before.
This is required for the next patch.

Tested via:

while :; do
	unshare -n iptables -t mangle -A OUTPUT -j TEE --gateway 1.1.1.2 --oif lo;
done

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-30 10:59:23 -04:00
Kirill Tkhai
e9a441b6e7 xfrm: Register xfrm_dev_notifier in appropriate place
Currently, driver registers it from pernet_operations::init method,
and this breaks modularity, because initialization of net namespace
and netdevice notifiers are orthogonal actions. We don't have
per-namespace netdevice notifiers; all of them are global for all
devices in all namespaces.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-30 10:59:23 -04:00
Russell King
e679c9c1db sfp/phylink: move module EEPROM ethtool access into netdev core ethtool
Provide a pointer to the SFP bus in struct net_device, so that the
ethtool module EEPROM methods can access the SFP directly, rather
than needing every user to provide a hook for it.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-30 10:11:06 -04:00
Gal Pressman
9daae9bd47 net: Call add/kill vid ndo on vlan filter feature toggling
NETIF_F_HW_VLAN_[CS]TAG_FILTER features require more than just a bit
flip in dev->features in order to keep the driver in a consistent state.
These features notify the driver of each added/removed vlan, but toggling
of vlan-filter does not notify the driver accordingly for each of the
existing vlans.

This patch implements a similar solution to NETIF_F_RX_UDP_TUNNEL_PORT
behavior (which notifies the driver about UDP ports in the same manner
that vids are reported).

Each toggling of the features propagates to the 8021q module, which
iterates over the vlans and call add/kill ndo accordingly.

Signed-off-by: Gal Pressman <galp@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-30 09:58:59 -04:00
Joe Perches
26c97c5d8d netfilter: ipset: Use is_zero_ether_addr instead of static and memcmp
To make the test a bit clearer and to reduce object size a little.

Miscellanea:

o remove now unnecessary static const array

$ size ip_set_hash_mac.o*
   text	   data	    bss	    dec	    hex	filename
  22822	   4619	     64	  27505	   6b71	ip_set_hash_mac.o.allyesconfig.new
  22932	   4683	     64	  27679	   6c1f	ip_set_hash_mac.o.allyesconfig.old
  10443	   1040	      0	  11483	   2cdb	ip_set_hash_mac.o.defconfig.new
  10507	   1040	      0	  11547	   2d1b	ip_set_hash_mac.o.defconfig.old

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-03-30 12:20:44 +02:00
Florian Westphal
e3b5e1ec75 Revert "netfilter: x_tables: ensure last rule in base chain matches underflow/policy"
This reverts commit 0d7df906a0.

Valdis Kletnieks reported that xtables is broken in linux-next since
0d7df906a0  ("netfilter: x_tables: ensure last rule in base chain
matches underflow/policy"), as kernel rejects the (well-formed) ruleset:

[   64.402790] ip6_tables: last base chain position 1136 doesn't match underflow 1344 (hook 1)

mark_source_chains is not the correct place for such a check, as it
terminates evaluation of a chain once it sees an unconditional verdict
(following rules are known to be unreachable). It seems preferrable to
fix libiptc instead, so remove this check again.

Fixes: 0d7df906a0 ("netfilter: x_tables: ensure last rule in base chain matches underflow/policy")
Reported-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-03-30 12:20:32 +02:00
Ben Hutchings
9ba5c404bf netfilter: x_tables: Add note about how to free percpu counters
Due to the way percpu counters are allocated and freed in blocks,
it is not safe to free counters individually.  Currently all callers
do the right thing, but let's note this restriction.

Fixes: ae0ac0ed6f ("netfilter: x_tables: pack percpu counter allocations")
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-03-30 11:44:27 +02:00
Arushi Singhal
c47d36b385 netfilter: Merge assignment with return
Merge assignment with return statement to directly return the value.

Signed-off-by: Arushi Singhal <arushisinghal19971997@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-03-30 11:44:27 +02:00
Pablo Neira Ayuso
a3073c17dd netfilter: nf_tables: use nft_set_lookup_global from nf_tables_newsetelem()
Replace opencoded implementation of nft_set_lookup_global() by call to
this function.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-03-30 11:29:20 +02:00
Pablo Neira Ayuso
10659cbab7 netfilter: nf_tables: rename to nft_set_lookup_global()
To prepare shorter introduction of shorter function prefix.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-03-30 11:29:20 +02:00
Pablo Neira Ayuso
43a605f2f7 netfilter: nf_tables: enable conntrack if NAT chain is registered
Register conntrack hooks if the user adds NAT chains. Users get confused
with the existing behaviour since they will see no packets hitting this
chain until they add the first rule that refers to conntrack.

This patch adds new ->init() and ->free() indirections to chain types
that can be used by NAT chains to invoke the conntrack dependency.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-03-30 11:29:19 +02:00
Pablo Neira Ayuso
02c7b25e5f netfilter: nf_tables: build-in filter chain type
One module per supported filter chain family type takes too much memory
for very little code - too much modularization - place all chain filter
definitions in one single file.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-03-30 11:29:19 +02:00
Pablo Neira Ayuso
cc07eeb0e5 netfilter: nf_tables: nft_register_chain_type() returns void
Use WARN_ON() instead since it should not happen that neither family
goes over NFPROTO_NUMPROTO nor there is already a chain of this type
already registered.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-03-30 11:29:18 +02:00
Pablo Neira Ayuso
32537e9184 netfilter: nf_tables: rename struct nf_chain_type
Use nft_ prefix. By when I added chain types, I forgot to use the
nftables prefix. Rename enum nft_chain_type to enum nft_chain_types too,
otherwise there is an overlap.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-03-30 11:29:17 +02:00
Joe Perches
9124a20d87 netfilter: ebt_stp: Use generic functions for comparisons
Instead of unnecessary const declarations, use the generic functions to
save a little object space.

$ size net/bridge/netfilter/ebt_stp.o*
   text	   data	    bss	    dec	    hex	filename
   1250	    144	      0	   1394	    572	net/bridge/netfilter/ebt_stp.o.new
   1344	    144	      0	   1488	    5d0	net/bridge/netfilter/ebt_stp.o.old

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-03-30 11:24:29 +02:00
Bernie Harris
1be3ac9844 netfilter: ebtables: Add string filter
This patch is part of a proposal to add a string filter to
ebtables, which would be similar to the string filter in
iptables. Like iptables, the ebtables filter uses the xt_string
module.

Signed-off-by: Bernie Harris <bernie.harris@alliedtelesis.co.nz>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-03-30 11:04:12 +02:00
Bernie Harris
39c202d228 netfilter: ebtables: Add support for specifying match revision
Currently ebtables assumes that the revision number of all match
modules is 0, which is an issue when trying to use existing
xtables matches with ebtables. The solution is to modify ebtables
to allow extensions to specify a revision number, similar to
iptables. This gets passed down to the kernel, which is then able
to find the match module correctly.

To main binary backwards compatibility, the size of the ebt_entry
structures is not changed, only the size of the name field is
decreased by 1 byte to make room for the revision field.

Signed-off-by: Bernie Harris <bernie.harris@alliedtelesis.co.nz>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-03-30 11:03:39 +02:00
John Fastabend
fa246693a1 bpf: sockmap, BPF_F_INGRESS flag for BPF_SK_SKB_STREAM_VERDICT:
Add support for the BPF_F_INGRESS flag in skb redirect helper. To
do this convert skb into a scatterlist and push into ingress queue.
This is the same logic that is used in the sk_msg redirect helper
so it should feel familiar.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-30 00:09:43 +02:00
John Fastabend
8934ce2fd0 bpf: sockmap redirect ingress support
Add support for the BPF_F_INGRESS flag in sk_msg redirect helper.
To do this add a scatterlist ring for receiving socks to check
before calling into regular recvmsg call path. Additionally, because
the poll wakeup logic only checked the skb recv queue we need to
add a hook in TCP stack (similar to write side) so that we have
a way to wake up polling socks when a scatterlist is redirected
to that sock.

After this all that is needed is for the redirect helper to
push the scatterlist into the psock receive queue.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-30 00:09:43 +02:00
David S. Miller
e15f20ea33 We have a fair number of patches, but many of them are from the
first bullet here:
  * EAPoL-over-nl80211 from Denis - this will let us fix
    some long-standing issues with bridging, races with
    encryption and more
  * DFS offload support from the qtnfmac folks
  * regulatory database changes for the new ETSI adaptivity
    requirements
  * various other fixes and small enhancements
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAlq85QQACgkQB8qZga/f
 l8Sdbg//bc8C/4F1TUdJqZWGK1j9Lwd6nZpP0iFqH5Ees3MMnti5XdV2dCL31ivz
 c8DOuRAU8ZG/RLPgSTHVZHwh+f7S5/TxSKg8WOBvrYk7a0C1uvVVhe5XZQEmqE7g
 eqM0+UQ5DyzUYnu0kSUrFKPV7BqDa2YzVDdK8e8iozqZmAnvGN9k8H7EDEeUxtxk
 LEl+bEcmhDSfIssU2Iaksl+9qoZP6BkoVGAOmDzIL654WV4XVKorxRX6vndqSQGu
 0cCz2Occ+/0hfvszONBRR4M/gtI/Yyn3u+D1Q0YD3X40Q9gJE11fcodmMT61l5C7
 rGcu94RIGilvRvjZScK3giiU2L7DD+VETUa+YGnjd8gLpmrYd6cURxlm4yuHbw8C
 UScLCApAUuY+skmPLeuyHW9mnzHaC336vzVjk8OhdNRhX23/rB8nk1yIywgqETVW
 g/iub8/Xp6TRfdyh76I18wlfqCp1It2JAeICgKH5NPlwUA6U0xFR0/ddSR8FuAcK
 ZLY8mgsc2kIH6r4x5sjeH+Yb6tGi/Z3HMZM2hna+t4vSpn6Q5+GPsA6yuHuBUhJb
 8QswMiLDSux8I4guKgQyROiHaCzE3zOigJ4o1z9XITKsgluZVxnKr+ETKdr88WFp
 II8U0qH/kejXIfxUjbv5Wla70J9wi/hjxR6vOfSkEtYNvIApdfc=
 =hI0F
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-davem-2018-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
We have a fair number of patches, but many of them are from the
first bullet here:
 * EAPoL-over-nl80211 from Denis - this will let us fix
   some long-standing issues with bridging, races with
   encryption and more
 * DFS offload support from the qtnfmac folks
 * regulatory database changes for the new ETSI adaptivity
   requirements
 * various other fixes and small enhancements
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-29 16:23:26 -04:00
Arnd Bergmann
7ae665f132 sctp: fix unused lable warning
The proc file cleanup left a label possibly unused:

net/sctp/protocol.c: In function 'sctp_defaults_init':
net/sctp/protocol.c:1304:1: error: label 'err_init_proc' defined but not used [-Werror=unused-label]

This adds an #ifdef around it to match the respective 'goto'.

Fixes: d47d08c8ca ("sctp: use proc_remove_subtree()")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-29 14:38:27 -04:00
Eric Dumazet
18dcbe12fe ipv6: export ip6 fragments sysctl to unprivileged users
IPv4 was changed in commit 52a773d645 ("net: Export ip fragment
sysctl to unprivileged users")

The only sysctl that is not per-netns is not used :
ip6frag_secret_interval

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-29 14:14:03 -04:00
David Ahern
2233000cba net/ipv6: Move call_fib6_entry_notifiers up for route adds
Move call to call_fib6_entry_notifiers for new IPv6 routes to right
before the insertion into the FIB. At this point notifier handlers can
decide the fate of the new route with a clean path to delete the
potential new entry if the notifier returns non-0.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-29 14:10:31 -04:00
David Ahern
c1d7ee67ac net/ipv4: Allow notifier to fail route replace
Add checking to call to call_fib_entry_notifiers for IPv4 route replace.
Allows a notifier handler to fail the replace.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-29 14:10:30 -04:00
David Ahern
6635f311ea net/ipv4: Move call_fib_entry_notifiers up for new routes
Move call to call_fib_entry_notifiers for new IPv4 routes to right
before the call to fib_insert_alias. At this point the only remaining
failure path is memory allocations in fib_insert_node. Handle that
very unlikely failure with a call to call_fib_entry_notifiers to
tell drivers about it.

At this point notifier handlers can decide the fate of the new route
with a clean path to delete the potential new entry if the notifier
returns non-0.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-29 14:10:30 -04:00
David Ahern
9776d32537 net: Move call_fib_rule_notifiers up in fib_nl_newrule
Move call_fib_rule_notifiers up in fib_nl_newrule to the point right
before the rule is inserted into the list. At this point there are no
more failure paths within the core rule code, so if the notifier
does not fail then the rule will be inserted into the list.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-29 14:10:30 -04:00
David Ahern
c30d9356e9 net: Fix fib notifer to return errno
Notifier handlers use notifier_from_errno to convert any potential error
to an encoded format. As a consequence the other side, call_fib_notifier{s}
in this case, needs to use notifier_to_errno to return the error from
the handler back to its caller.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-29 14:10:30 -04:00
Kirill Tkhai
152f253152 net: Remove rtnl_lock() in nf_ct_iterate_destroy()
rtnl_lock() doesn't protect net::ct::count,
and it's not needed for__nf_ct_unconfirmed_destroy()
and for nf_queue_nf_hook_drop().

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-29 13:47:54 -04:00
Kirill Tkhai
ec9c780925 ovs: Remove rtnl_lock() from ovs_exit_net()
Here we iterate for_each_net() and removes
vport from alive net to the exiting net.

ovs_net::dps are protected by ovs_mutex(),
and the others, who change it (ovs_dp_cmd_new(),
__dp_destroy()) also take it.
The same with datapath::ports list.

So, we remove rtnl_lock() here.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-29 13:47:54 -04:00
Kirill Tkhai
10256debb9 net: Don't take rtnl_lock() in wireless_nlevent_flush()
This function iterates over net_namespace_list and flushes
the queue for every of them. What does this rtnl_lock()
protects?! Since we may add skbs to net::wext_nlevents
without rtnl_lock(), it does not protects us about queuers.

It guarantees, two threads can't flush the queue in parallel,
that can change the order, but since skb can be queued
in any order, it doesn't matter, how many threads do this
in parallel. In case of several threads, this will be even
faster.

So, we can remove rtnl_lock() here, as it was used for
iteration over net_namespace_list only.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-29 13:47:53 -04:00
Kirill Tkhai
f0b07bb151 net: Introduce net_rwsem to protect net_namespace_list
rtnl_lock() is used everywhere, and contention is very high.
When someone wants to iterate over alive net namespaces,
he/she has no a possibility to do that without exclusive lock.
But the exclusive rtnl_lock() in such places is overkill,
and it just increases the contention. Yes, there is already
for_each_net_rcu() in kernel, but it requires rcu_read_lock(),
and this can't be sleepable. Also, sometimes it may be need
really prevent net_namespace_list growth, so for_each_net_rcu()
is not fit there.

This patch introduces new rw_semaphore, which will be used
instead of rtnl_mutex to protect net_namespace_list. It is
sleepable and allows not-exclusive iterations over net
namespaces list. It allows to stop using rtnl_lock()
in several places (what is made in next patches) and makes
less the time, we keep rtnl_mutex. Here we just add new lock,
while the explanation of we can remove rtnl_lock() there are
in next patches.

Fine grained locks generally are better, then one big lock,
so let's do that with net_namespace_list, while the situation
allows that.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-29 13:47:53 -04:00
David S. Miller
36fc2d727e RxRPC development
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAWrrCyfu3V2unywtrAQJRgg//VbXndX7THWFy6NhfZL2lWH88zvhnF3om
 vVSxYqqrxAmnpUxtzQWa67By+LL7A6nnUJPhNm660GDfZQWDso4Vltji72unEAJT
 qvNqB2mcKJzdswBh+RB5QNJYmMhMs1flssUxxyBH1G0X1eWddfDal0QOz1nv0AqY
 E7WM++fbvo4RBi3jSZVJnOitHQLfYdQAcPv2LwaImUwWiKyaGxU54yqRk0r8Wcmh
 OZMipZFiHMFYHETPsu7ZI1lZHPPn+7zEyxRiIvvJlo8QOL/Sk7WsO/kKGhnsFGRQ
 yWsIePh6mU8lYmvwv3ZgqPf1uZu2kbInEQ0Mmzu/5T5VWEq2P1I9zsT7IT6khhYa
 BKxrzaVRd5D3GCl4HV7ANt2eDxtAIBjJkRNsXgs3OQEg93bokjP39PySU4RGtF7c
 Cb0kWBEk2HfdcKtQwb4ksdQ2g7Sby6p4wZ6YEH/hQybPXkApMqM2DCv1N9f0ELYk
 KbRxuo8ann5x7EXmXMUERZZYMiaGSf2FbcTwCFC9/WnUq3+qDYIJU8Wifv6nP7ti
 XcPw/brH8XkgOzE0/lSyKlD1e0XS+lf+HVVGYC0ws3bbR5UTSpjapD6CwzwxgUTZ
 NkKb/Erqtdt5XQoAg4QAW5snHsMQ1IDzM73BqEz4LVyNss4o7spCj9TWOu5y+5dy
 xFGAL2BI1+E=
 =BwJ2
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-next-20180327' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Tracing updates

Here are some patches that update tracing in AF_RXRPC and AFS:

 (1) Add a tracepoint for tracking resend events.

 (2) Use debug_ids in traces rather than pointers (as pointers are now hashed)
     and allow use of the same debug_id in AFS calls as in the corresponding
     AF_RXRPC calls.  This makes filtering the trace output much easier.

 (3) Add a tracepoint for tracking call completion.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-29 12:02:08 -04:00
David S. Miller
5568cdc368 ip_tunnel: Resolve ipsec merge conflict properly.
We want to use dev_set_mtu() regardless of how we calculate
the mtu value.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-29 11:42:14 -04:00
David S. Miller
56455e0998 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2018-03-29

1) Remove a redundant pointer initialization esp_input_set_header().
   From Colin Ian King.

2) Mark the xfrm kmem_caches as __ro_after_init.
   From Alexey Dobriyan.

3) Do the checksum for an ipsec offlad packet in software
   if the device does not advertise NETIF_F_HW_ESP_TX_CSUM.
   From Shannon Nelson.

4) Use booleans for true and false instead of integers
   in xfrm_policy_cache_flush().
   From Gustavo A. R. Silva

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-29 11:22:31 -04:00
David S. Miller
020295d95e Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
pull request (net): ipsec 2018-03-29

1) Fix a rcu_read_lock/rcu_read_unlock imbalance
   in the error path of xfrm_local_error().
   From Taehee Yoo.

2) Some VTI MTU fixes. From Stefano Brivio.

3) Fix a too early overwritten skb control buffer
   on xfrm transport mode.

Please note that this pull request has a merge conflict
in net/ipv4/ip_tunnel.c.

The conflict is between

commit f6cc9c054e ("ip_tunnel: Emit events for post-register MTU changes")

from the net tree and

commit 24fc79798b ("ip_tunnel: Clamp MTU to bounds on new link")

from the ipsec tree.

It can be solved as it is currently done in linux-next.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-29 10:12:47 -04:00
Emmanuel Grumbach
c470bdc1aa mac80211: don't WARN on bad WMM parameters from buggy APs
Apparently, some APs are buggy enough to send a zeroed
WMM IE. Don't WARN on this since this is not caused by a bug
on the client's system.

This aligns the condition of the WARNING in drv_conf_tx
with the validity check in ieee80211_sta_wmm_params.
We will now pick the default values whenever we get
a zeroed WMM IE.

This has been reported here:
https://bugzilla.kernel.org/show_bug.cgi?id=199161

Fixes: f409079bb6 ("mac80211: sanity check CW_min/CW_max towards driver")
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-03-29 15:02:38 +02:00
Denis Kenzior
018f6fbf54 mac80211: Send control port frames over nl80211
If userspace requested control port frames to go over 80211, then do so.
The control packets are intercepted just prior to delivery of the packet
to the underlying network device.

Pre-authentication type frames (protocol: 0x88c7) are also forwarded
over nl80211.

Signed-off-by: Denis Kenzior <denkenz@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-03-29 14:08:30 +02:00
Denis Kenzior
9118064914 mac80211: Add support for tx_control_port
Signed-off-by: Denis Kenzior <denkenz@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-03-29 14:01:27 +02:00
Denis Kenzior
1224f5831a nl80211: Add control_port_over_nl80211 to mesh_setup
Signed-off-by: Denis Kenzior <denkenz@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-03-29 14:01:27 +02:00
Denis Kenzior
c3bfe1f6fc nl80211: Add control_port_over_nl80211 for ibss
Signed-off-by: Denis Kenzior <denkenz@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-03-29 14:00:27 +02:00
Denis Kenzior
64bf3d4bc2 nl80211: Add CONTROL_PORT_OVER_NL80211 attribute
Signed-off-by: Denis Kenzior <denkenz@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-03-29 13:45:04 +02:00
Denis Kenzior
2576a9ace4 nl80211: Implement TX of control port frames
This commit implements the TX side of NL80211_CMD_CONTROL_PORT_FRAME.
Userspace provides the raw EAPoL frame using NL80211_ATTR_FRAME.
Userspace should also provide the destination address and the protocol
type to use when sending the frame.  This is used to implement TX of
Pre-authentication frames.  If CONTROL_PORT_ETHERTYPE_NO_ENCRYPT is
specified, then the driver will be asked not to encrypt the outgoing
frame.

A new EXT_FEATURE flag is introduced so that nl80211 code can check
whether a given wiphy has capability to pass EAPoL frames over nl80211.

Signed-off-by: Denis Kenzior <denkenz@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-03-29 13:44:19 +02:00
Denis Kenzior
6a671a50f8 nl80211: Add CMD_CONTROL_PORT_FRAME API
This commit also adds cfg80211_rx_control_port function.  This is used
to generate a CMD_CONTROL_PORT_FRAME event out to userspace.  The
conn_owner_nlportid is used as the unicast destination.  This means that
userspace must specify NL80211_ATTR_SOCKET_OWNER flag if control port
over nl80211 routing is requested in NL80211_CMD_CONNECT,
NL80211_CMD_ASSOCIATE, NL80211_CMD_START_AP or IBSS/mesh join.

Signed-off-by: Denis Kenzior <denkenz@gmail.com>
[johannes: fix return value of cfg80211_rx_control_port()]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-03-29 13:44:04 +02:00
Johannes Berg
4d191c7536 mac80211: remove shadowing duplicated variable
We already have 'ifmgd' here, and it's already assigned
to the same value, so remove the duplicate.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-03-29 11:35:23 +02:00
Manikanta Pubbisetty
db3bdcb9c3 mac80211: allow AP_VLAN operation on crypto controlled devices
In the current implementation, mac80211 advertises the support of
AP_VLANs based on the driver's support for AP mode; it also
blocks encrypted AP_VLAN operation on devices advertising
SW_CRYPTO_CONTROL.

The implementation seems weird in it's current form and could be
often confusing, this is because there can be drivers advertising
both SW_CRYPTO_CONTROL and AP mode support (ex: ath10k) in which case
AP_VLAN will still be supported but only in open BSS and not in
secured BSS.

When SW_CRYPTO_CONTROL is enabled, it makes more sense if the decision
to support AP_VLANs is left to the driver. Mac80211 can then allow
AP_VLAN operations depending on the driver support.

Signed-off-by: Manikanta Pubbisetty <mpubbise@codeaurora.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-03-29 11:35:22 +02:00
Haim Dreyfuss
19d3577e35 cfg80211: Add API to allow querying regdb for wmm_rule
In general regulatory self managed devices maintain their own
regulatory profiles thus it doesn't have to query the regulatory database
on country change.

ETSI has recently introduced a new channel access mechanism for 5GHz
that all wlan devices need to comply with.
These values are stored in the regulatory database.
There are self managed devices which can't maintain these
values on their own. Add API to allow self managed regulatory devices
to query the regulatory database for high band wmm rule.

Signed-off-by: Haim Dreyfuss <haim.dreyfuss@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
[johannes: fix documentation]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-03-29 11:35:17 +02:00
Haim Dreyfuss
e552af0581 mac80211: limit wmm params to comply with ETSI requirements
ETSI has recently added new requirements that restrict the WMM
parameter values for 5GHz frequencies.  We need to take care of the
following scenarios in order to comply with these new requirements:

1. When using mac80211 default values;
2. When the userspace tries to configure its own values;
3. When associating to an AP which advertises WWM IE.

When associating to an AP, the client uses the values in the
advertised WMM IE.  But the AP may not comply with the new ETSI
requirements, so the client needs to check the current regulatory
rules and use those limits accordingly.

Signed-off-by: Haim Dreyfuss <haim.dreyfuss@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-03-29 11:11:50 +02:00
Johannes Berg
5bf16a11ba cfg80211: don't require RTNL held for regdomain reads
The whole code is set up to allow RCU reads of this data, but
then uses rtnl_dereference() which requires the RTNL. Convert
it to rcu_dereference_rtnl() which makes it require only RCU
or the RTNL, to allow RCU-protected reading of the data.

Reviewed-by: Coelho, Luciano <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-03-29 11:11:50 +02:00
Haim Dreyfuss
230ebaa189 cfg80211: read wmm rules from regulatory database
ETSI EN 301 893 v2.1.1 (2017-05) standard defines a new channel access
mechanism that all devices (WLAN and LAA) need to comply with.
The regulatory database can now be loaded into the kernel and also
has the option to load optional data.
In order to be able to comply with ETSI standard, we add wmm_rule into
regulatory rule and add the option to read its value from the regulatory
database.

Signed-off-by: Haim Dreyfuss <haim.dreyfuss@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
[johannes: fix memory leak in error path]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-03-29 11:11:40 +02:00
Denis Kenzior
466a306142 nl80211: Add SOCKET_OWNER support to START_AP
Signed-off-by: Denis Kenzior <denkenz@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-03-29 10:47:28 +02:00
Denis Kenzior
188c1b3c04 nl80211: Add SOCKET_OWNER support to JOIN_MESH
Signed-off-by: Denis Kenzior <denkenz@gmail.com>
[johannes: fix race with wdev lock/unlock by just acquiring once]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-03-29 10:38:24 +02:00
Denis Kenzior
f8d16d3edb nl80211: Add SOCKET_OWNER support to JOIN_IBSS
Signed-off-by: Denis Kenzior <denkenz@gmail.com>
[johannes: fix race with wdev lock/unlock by just acquiring once]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-03-29 10:36:22 +02:00
Denis Kenzior
37b1c00468 cfg80211: Support all iftypes in autodisconnect_wk
Currently autodisconnect_wk assumes that only interface types of
P2P_CLIENT and STATION use conn_owner_nlportid.  Change this so all
interface types are supported.

Signed-off-by: Denis Kenzior <denkenz@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-03-29 10:22:19 +02:00
Dmitry Lebed
2c390e44e4 cfg80211: enable use of non-cleared DFS channels for DFS offload
Currently channel switch/start_ap to DFS channel cannot be done to
non-CAC-cleared channel even if DFS offload if enabled.
Make non-cleared DFS channels available if DFS offload is enabled.
CAC will be started by HW after channel change, start_ap call, etc.

Signed-off-by: Dmitry Lebed <dlebed@quantenna.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-03-29 10:21:35 +02:00
Dmitry Lebed
850964519a cfg80211: fix CAC_STARTED event handling
Exclude CAC_STARTED event from !wdev->cac_started check,
since cac_started will be set later in the same function.

Signed-off-by: Dmitry Lebed <dlebed@quantenna.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-03-29 10:21:16 +02:00
tamizhr@codeaurora.org
97f5f4254a mac80211: Use proper chan_width enum in sta opmode event
Bandwidth change value reported via nl80211 contains mac80211
specific enum value(ieee80211_sta_rx_bw) and which is not
understand by userspace application. Map the mac80211 specific
value to nl80211_chan_width enum value to avoid using wrong value
in the userspace application. And used station's ht/vht capability
to map IEEE80211_STA_RX_BW_20 and IEEE80211_STA_RX_BW_160 with
proper nl80211 value.

Signed-off-by: Tamizh chelvam <tamizhr@codeaurora.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-03-29 10:19:59 +02:00
tamizhr@codeaurora.org
57566b2003 mac80211: Use proper smps_mode enum in sta opmode event
SMPS_MODE change value notified via nl80211 contains mac80211
specific value(ieee80211_smps_mode) and user space application
will not know those values. This patch add support to map
the mac80211 enum value to nl80211_smps_mode which will be
understood by the userspace application.

Signed-off-by: Tamizh chelvam <tamizhr@codeaurora.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-03-29 10:19:55 +02:00
Masahiro Yamada
28913ee819 netfilter: nf_nat_snmp_basic: add correct dependency to Makefile
nf_nat_snmp_basic_main.c includes a generated header, but the
necessary dependency is missing in Makefile. This could cause
build error in parallel building.

Remove a weird line, and add a correct one.

Fixes: cc2d58634e ("netfilter: nf_nat_snmp_basic: use asn1 decoder library")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2018-03-29 09:42:32 +09:00
Alexei Starovoitov
14624a9387 net/mac802154: disambiguate mac80215 vs mac802154 trace events
two trace events defined with the same name and both unused.
They conflict in allyesconfig build. Rename one of them.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-28 22:55:18 +02:00
Alexei Starovoitov
c105547501 treewide: remove large struct-pass-by-value from tracepoint arguments
- fix trace_hfi1_ctxt_info() to pass large struct by reference instead of by value
- convert 'type array[]' tracepoint arguments into 'type *array',
  since compiler will warn that sizeof('type array[]') == sizeof('type *array')
  and later should be used instead

The CAST_TO_U64 macro in the later patch will enforce that tracepoint
arguments can only be integers, pointers, or less than 8 byte structures.
Larger structures should be passed by reference.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-28 22:55:18 +02:00
Nikita V. Shirokov
6f5c39fa5c bpf: Add sock_ops R/W access to ipv4 tos
Sample usage for tos ...

  bpf_getsockopt(skops, SOL_IP, IP_TOS, &v, sizeof(v))

... where skops is a pointer to the ctx (struct bpf_sock_ops).

Signed-off-by: Nikita V. Shirokov <tehnerd@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-28 21:05:35 +02:00
David Howells
1bae5d2295 rxrpc: Trace call completion
Add a tracepoint to track rxrpc calls moving into the completed state and
to log the completion type and the recorded error value and abort code.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-03-27 23:08:20 +01:00
David Howells
a25e21f0bc rxrpc, afs: Use debug_ids rather than pointers in traces
In rxrpc and afs, use the debug_ids that are monotonically allocated to
various objects as they're allocated rather than pointers as kernel
pointers are now hashed making them less useful.  Further, the debug ids
aren't reused anywhere nearly as quickly.

In addition, allow kernel services that use rxrpc, such as afs, to take
numbers from the rxrpc counter, assign them to their own call struct and
pass them in to rxrpc for both client and service calls so that the trace
lines for each will have the same ID tag.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-03-27 23:03:00 +01:00
David Howells
827efed6a6 rxrpc: Trace resend
Add a tracepoint to trace packet resend events and to dump the Tx
annotation buffer for added illumination.

Signed-off-by: David Howells <dhowells@rdhat.com>
2018-03-27 23:02:47 +01:00
Kirill Tkhai
8518e9bb98 net: Add more comments
This adds comments to different places to improve
readability.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-27 13:18:09 -04:00
Kirill Tkhai
4420bf21fb net: Rename net_sem to pernet_ops_rwsem
net_sem is some undefined area name, so it will be better
to make the area more defined.

Rename it to pernet_ops_rwsem for better readability and
better intelligibility.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-27 13:18:09 -04:00
Kirill Tkhai
2f635ceeb2 net: Drop pernet_operations::async
Synchronous pernet_operations are not allowed anymore.
All are asynchronous. So, drop the structure member.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-27 13:18:09 -04:00
Kirill Tkhai
094374e5e1 net: Reflect all pernet_operations are converted
All pernet_operations are reviewed and converted, hooray!
Reflect this in core code: setup_net() and cleanup_net()
will take down_read() always.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-27 13:18:07 -04:00
Ursula Braun
ab6f6dd18a net/smc: use announced length in sock_recvmsg()
Not every CLC proposal message needs the maximum buffer length.
Due to the MSG_WAITALL flag, it is important to use the peeked
real length when receiving the message.

Fixes: d63d271ce2 ("smc: switch to sock_recvmsg()")
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-27 11:59:49 -04:00
Cong Wang
b85ab56c3f llc: properly handle dev_queue_xmit() return value
llc_conn_send_pdu() pushes the skb into write queue and
calls llc_conn_send_pdus() to flush them out. However, the
status of dev_queue_xmit() is not returned to caller,
in this case, llc_conn_state_process().

llc_conn_state_process() needs hold the skb no matter
success or failure, because it still uses it after that,
therefore we should hold skb before dev_queue_xmit() when
that skb is the one being processed by llc_conn_state_process().

For other callers, they can just pass NULL and ignore
the return value as they are.

Reported-by: Noam Rathaus <noamr@beyondsecurity.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-27 11:56:00 -04:00
David S. Miller
5f75a1863e mlx5-updates-2018-03-22 (Misc updates)
This series includes misc updates for mlx5 core and netdev dirver,
 
 Highlights:
 
 From Inbar, three patches to add support for PFC stall prevention
 statistics and enable/disable through new ethtool tunable, as requested
 from previous submission.
 
 From Moshe, four patches, added more drop counters:
 	- drop counter for netdev steering miss
 	- drop counter for when VF logical link is down
         - drop counter for when netdev logical link is down.
 
 From Or, three patches to support vlan push/pop offload via tc HW action,
 for newer HW (Connectx-5 and onward) via HW steering flow actions rather
 than the emulated path for the older HW brands.
 
 And five more misc small trivial patches.
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJauVxkAAoJEEg/ir3gV/o+Oi4H/1Mzv+XEuHLwVJHpzMNVLVeR
 EK1GDW3724zjY24Iy+9FRppL1mV0hY3aw1qhD+rUmUKQx+Q3OBrH1WN2QJkjBkM9
 i60ygFLTHvcHyPFmHtPzCckO7ODdWsXSlEwsl8kCSITQTn1Mjf6v7Mogd0CLCbwN
 iu8ocpHTD+/whE5gq9aDDrGtuzgJbPiRqavvsof9pc2g1JlnvtHs9xoiN+vzoqIa
 9lGS+2UM2+wFw9rMtbdaqtfVPsupMeEdGZ98chn0gpWshOvOda5NuRE59M+5un5E
 C3nJ4Js//PLkvQ9Nu7+goXtbfPLCxvcuurid5TM2Su9hmD+9Us223TcAvdpOMoY=
 =VdCO
 -----END PGP SIGNATURE-----

Merge tag 'mlx5-updates-2018-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2018-03-22 (Misc updates)

This series includes misc updates for mlx5 core and netdev dirver,

Highlights:

From Inbar, three patches to add support for PFC stall prevention
statistics and enable/disable through new ethtool tunable, as requested
from previous submission.

From Moshe, four patches, added more drop counters:
	- drop counter for netdev steering miss
	- drop counter for when VF logical link is down
        - drop counter for when netdev logical link is down.

From Or, three patches to support vlan push/pop offload via tc HW action,
for newer HW (Connectx-5 and onward) via HW steering flow actions rather
than the emulated path for the older HW brands.

And five more misc small trivial patches.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-27 11:05:23 -04:00
Dave Watson
cd00edc179 strparser: Fix sign of err codes
strp_parser_err is called with a negative code everywhere, which then
calls abort_parser with a negative code.  strp_msg_timeout calls
abort_parser directly with a positive code.  Negate ETIMEDOUT
to match signed-ness of other calls.

The default abort_parser callback, strp_abort_strp, sets
sk->sk_err to err.  Also negate the error here so sk_err always
holds a positive value, as the rest of the net code expects.  Currently
a negative sk_err can result in endless loops, or user code that
thinks it actually sent/received err bytes.

Found while testing net/tls_sw recv path.

Fixes: 43a0c6751a ("strparser: Stream parser for messages")
Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-27 11:00:18 -04:00
Craig Dillabaugh
734549eb55 net sched actions: fix dumping which requires several messages to user space
Fixes a bug in the tcf_dump_walker function that can cause some actions
to not be reported when dumping a large number of actions. This issue
became more aggrevated when cookies feature was added. In particular
this issue is manifest when large cookie values are assigned to the
actions and when enough actions are created that the resulting table
must be dumped in multiple batches.

The number of actions returned in each batch is limited by the total
number of actions and the memory buffer size.  With small cookies
the numeric limit is reached before the buffer size limit, which avoids
the code path triggering this bug. When large cookies are used buffer
fills before the numeric limit, and the erroneous code path is hit.

For example after creating 32 csum actions with the cookie
aaaabbbbccccdddd

$ tc actions ls action csum
total acts 26

    action order 0: csum (tcp) action continue
    index 1 ref 1 bind 0
    cookie aaaabbbbccccdddd

    .....

    action order 25: csum (tcp) action continue
    index 26 ref 1 bind 0
    cookie aaaabbbbccccdddd
total acts 6

    action order 0: csum (tcp) action continue
    index 28 ref 1 bind 0
    cookie aaaabbbbccccdddd

    ......

    action order 5: csum (tcp) action continue
    index 32 ref 1 bind 0
    cookie aaaabbbbccccdddd

Note that the action with index 27 is omitted from the report.

Fixes: 4b3550ef53 ("[NET_SCHED]: Use nla_nest_start/nla_nest_end")"
Signed-off-by: Craig Dillabaugh <cdillaba@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-27 10:59:12 -04:00
Joe Perches
e32ac25018 ipv6: addrconf: Use normal debugging style
Remove local ADBG macro and use netdev_dbg/pr_debug

Miscellanea:

o Remove unnecessary debug message after allocation failure as there
  already is a dump_stack() on the failure paths
o Leave the allocation failure message on snmp6_alloc_dev as there
  is one code path that does not do a dump_stack()

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-27 10:54:40 -04:00
Eric Dumazet
1dfe82ebd7 net: fix possible out-of-bound read in skb_network_protocol()
skb mac header is not necessarily set at the time skb_network_protocol()
is called. Use skb->data instead.

BUG: KASAN: slab-out-of-bounds in skb_network_protocol+0x46b/0x4b0 net/core/dev.c:2739
Read of size 2 at addr ffff8801b3097a0b by task syz-executor5/14242

CPU: 1 PID: 14242 Comm: syz-executor5 Not tainted 4.16.0-rc6+ #280
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x194/0x24d lib/dump_stack.c:53
 print_address_description+0x73/0x250 mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report+0x23c/0x360 mm/kasan/report.c:412
 __asan_report_load_n_noabort+0xf/0x20 mm/kasan/report.c:443
 skb_network_protocol+0x46b/0x4b0 net/core/dev.c:2739
 harmonize_features net/core/dev.c:2924 [inline]
 netif_skb_features+0x509/0x9b0 net/core/dev.c:3011
 validate_xmit_skb+0x81/0xb00 net/core/dev.c:3084
 validate_xmit_skb_list+0xbf/0x120 net/core/dev.c:3142
 packet_direct_xmit+0x117/0x790 net/packet/af_packet.c:256
 packet_snd net/packet/af_packet.c:2944 [inline]
 packet_sendmsg+0x3aed/0x60b0 net/packet/af_packet.c:2969
 sock_sendmsg_nosec net/socket.c:629 [inline]
 sock_sendmsg+0xca/0x110 net/socket.c:639
 ___sys_sendmsg+0x767/0x8b0 net/socket.c:2047
 __sys_sendmsg+0xe5/0x210 net/socket.c:2081

Fixes: 19acc32725 ("gso: Handle Trans-Ether-Bridging protocol in skb_network_protocol()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Pravin B Shelar <pshelar@ovn.org>
Reported-by: Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-27 10:53:27 -04:00
Wei Yongjun
e1a22d13eb tipc: tipc_node_create() can be static
Fixes the following sparse warning:

net/tipc/node.c:336:18: warning:
 symbol 'tipc_node_create' was not declared. Should it be static?

Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-27 10:50:22 -04:00
Wei Yongjun
c76f2481b6 tipc: fix error handling in tipc_udp_enable()
Release alloced resource before return from the error handling
case in tipc_udp_enable(), otherwise will cause memory leak.

Fixes: 52dfae5c85 ("tipc: obtain node identity from interface by default")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-27 10:50:02 -04:00
David S. Miller
d7785b59e7 Here are some batman-adv bugfixes:
- fix multicast-via-unicast transmissions for AP isolation and gateway
    extension, by Linus Luessing (2 patches)
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAlq466kWHHN3QHNpbW9u
 d3VuZGVybGljaC5kZQAKCRChK+OYQpKeobv5EACwz70r21xIUxwq6FWXysTfxxXt
 H3qcXt2BzCaGYKmYt8xqz9z368v7wpgCsPfZsvW2eBYhcpCkGQm6PR2XNy7WMoBz
 6aFO4SDDi5L/S+5O0iXg79niRiTe9lCbOl/r6uv/2FtY22rIfhocFwkBL8cjIC3E
 hOWuFUbk3mAu8i+WMyhfDEUL33oy9CMvlJqaxqJuTHf3HxjOeGVjusOvSbQu+OGG
 K+TCAbIjazQ02R9p6lXQoIAjfg+kfsYIdT84MTgjk91qCtp1ztRxr3MNAwTgPvA4
 Vi4uZGLbLCMvaAF1wqVBSuOnFFwbUCc8IBIoUvvZSMQTYm3agcg96pFLMYqNzrhY
 aHco1neJ+a+pGUmByijmYbsrJI2dYqK0V0OZlLG9WKwkrE3Y023LGGdXUbdl4RhS
 LXqMLGVJ9eqV+m7MCuv/q/PTVOL89loAun0DuIU4IhJuDM2+5yEG5jBiI6ImIym+
 KX3F9F9nyefr5aCYsd14izX6WHxTXkJQaVpjVNBP56P6eZMLMx81eozBD9eotFyv
 A1EgQolLFEKWmtKU2KUK+qrGFNXaLlc9z8ZGkbizi/CTEjN0tr1UjW6B6toOOoZ+
 fhF3HwlpgqVOdDodT+eDtxY8YboZBCpAnAOiE1LP3leVY4yab23Q02kskOPTPg1G
 CfpZwDjXIyZburUvfA==
 =j2w0
 -----END PGP SIGNATURE-----

Merge tag 'batadv-net-for-davem-20180326' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
Here are some batman-adv bugfixes:

 - fix multicast-via-unicast transmissions for AP isolation and gateway
   extension, by Linus Luessing (2 patches)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-27 10:40:19 -04:00
Colin Ian King
8daf1a2d7e net/ncsi: check for null return from call to nla_nest_start
The call to nla_nest_start calls nla_put which can lead to a NULL
return so it's possible for attr to become NULL and we can potentially
get a NULL pointer dereference on attr.  Fix this by checking for
a NULL return.

Detected by CoverityScan, CID#1466125 ("Dereference null return")

Fixes: 955dc68cb9 ("net/ncsi: Add generic netlink family")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-27 10:38:26 -04:00
Xin Long
5306653850 sctp: remove unnecessary asoc in sctp_has_association
After Commit dae399d7fd ("sctp: hold transport instead of assoc
when lookup assoc in rx path"), it put transport instead of asoc
in sctp_has_association. Variable 'asoc' is not used any more.

So this patch is to remove it, while at it,  it also changes the
return type of sctp_has_association to bool, and does the same
for it's caller sctp_endpoint_is_peeled_off.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-27 10:22:11 -04:00
Geert Uytterhoeven
b903036aad net: Spelling s/stucture/structure/
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2018-03-27 09:51:23 +02:00
Inbar Karmy
e1577c1c88 ethtool: Add support for configuring PFC stall prevention in ethtool
In the event where the device unexpectedly becomes unresponsive
for a long period of time, flow control mechanism may propagate
pause frames which will cause congestion spreading to the entire
network.
To prevent this scenario, when the device is stalled for a period
longer than a pre-configured timeout, flow control mechanisms are
automatically disabled.

This patch adds support for the ETHTOOL_PFC_STALL_PREVENTION
as a tunable.
This API provides support for configuring flow control storm prevention
timeout (msec).

Signed-off-by: Inbar Karmy <inbark@mellanox.com>
Cc: Michal Kubecek <mkubecek@suse.cz>
Cc: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-03-26 13:46:46 -07:00
Yuval Mintz
8c13af2a21 ip6mr: Add refcounting to mfc
Since ipmr and ip6mr are using the same mr_mfc struct at their core, we
can now refactor the ipmr_cache_{hold,put} logic and apply refcounting
to both ipmr and ip6mr.

Signed-off-by: Yuval Mintz <yuvalm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-26 13:14:43 -04:00
Yuval Mintz
d3c07e5b99 ip6mr: Add API for default_rule fib
Add the ability to discern whether a given FIB rule notification relates
to the default rule inserted when registering ip6mr or a different one.

Would later be used by drivers wishing to offload ipv6 multicast routes
but unable to offload rules other than the default one.

Signed-off-by: Yuval Mintz <yuvalm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-26 13:14:43 -04:00
Yuval Mintz
088aa3eec2 ip6mr: Support fib notifications
In similar fashion to ipmr, support fib notifications for ip6mr mfc and
vif related events. This would later allow drivers to react to said
notifications and offload the IPv6 mroutes.

Signed-off-by: Yuval Mintz <yuvalm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-26 13:14:43 -04:00
Yuval Mintz
cdc9f9443b ipmr: Make ipmr_dump() common
Since all the primitive elements used for the notification done by ipmr
are now common [mr_table, mr_mfc, vif_device] we can refactor the logic
for dumping them to a common file.

Signed-off-by: Yuval Mintz <yuvalm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-26 13:14:43 -04:00
Yuval Mintz
54c4cad97b ipmr: Make MFC fib notifiers common
Like vif notifications, move the notifier struct for MFC as well as its
helpers into a common file; Currently they're only used by ipmr.

Signed-off-by: Yuval Mintz <yuvalm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-26 13:14:42 -04:00
Yuval Mintz
bc67a0daf8 ipmr: Make vif fib notifiers common
The fib-notifiers are tightly coupled with the vif_device which is
already common. Move the notifier struct definition and helpers to the
common file; Currently they're only used by ipmr.

Signed-off-by: Yuval Mintz <yuvalm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-26 13:14:42 -04:00
Kirill Tkhai
5e804a6077 net: Convert sunrpc_net_ops
These pernet_operations look similar to rpcsec_gss_net_ops,
they just create and destroy another caches. So, they also
can be async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Anna Schumaker <Anna.Schumaker@netapp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-26 13:03:26 -04:00
Kirill Tkhai
855aeba340 net: Convert rpcsec_gss_net_ops
These pernet_operations initialize and destroy sunrpc_net_id
refered per-net items. Only used global list is cache_list,
and accesses already serialized.

sunrpc_destroy_cache_detail() check for list_empty() without
cache_list_lock, but when it's called from unregister_pernet_subsys(),
there can't be callers in parallel, so we won't miss list_empty()
in this case.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Anna Schumaker <Anna.Schumaker@netapp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-26 13:03:26 -04:00
John Fastabend
eb82a99447 net: sched, fix OOO packets with pfifo_fast
After the qdisc lock was dropped in pfifo_fast we allow multiple
enqueue threads and dequeue threads to run in parallel. On the
enqueue side the skb bit ooo_okay is used to ensure all related
skbs are enqueued in-order. On the dequeue side though there is
no similar logic. What we observe is with fewer queues than CPUs
it is possible to re-order packets when two instances of
__qdisc_run() are running in parallel. Each thread will dequeue
a skb and then whichever thread calls the ndo op first will
be sent on the wire. This doesn't typically happen because
qdisc_run() is usually triggered by the same core that did the
enqueue. However, drivers will trigger __netif_schedule()
when queues are transitioning from stopped to awake using the
netif_tx_wake_* APIs. When this happens netif_schedule() calls
qdisc_run() on the same CPU that did the netif_tx_wake_* which
is usually done in the interrupt completion context. This CPU
is selected with the irq affinity which is unrelated to the
enqueue operations.

To resolve this we add a RUNNING bit to the qdisc to ensure
only a single dequeue per qdisc is running. Enqueue and dequeue
operations can still run in parallel and also on multi queue
NICs we can still have a dequeue in-flight per qdisc, which
is typically per CPU.

Fixes: c5ad119fb6 ("net: sched: pfifo_fast use skb_array")
Reported-by: Jakob Unterwurzacher <jakob.unterwurzacher@theobroma-systems.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-26 12:36:23 -04:00
Joe Perches
d6444062f8 net: Use octal not symbolic permissions
Prefer the direct use of octal for permissions.

Done with checkpatch -f --types=SYMBOLIC_PERMS --fix-inplace
and some typing.

Miscellanea:

o Whitespace neatening around these conversions.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-26 12:07:48 -04:00
Kirill Tkhai
070f2d7e26 net: Drop NETDEV_UNREGISTER_FINAL
Last user is gone after bdf5bd7f21 "rds: tcp: remove
register_netdevice_notifier infrastructure.", so we can
remove this netdevice command. This allows to delete
rtnl_lock() in netdev_run_todo(), which is hot path for
net namespace unregistration.

dev_change_net_namespace() and netdev_wait_allrefs()
have rcu_barrier() before NETDEV_UNREGISTER_FINAL call,
and the source commits say they were introduced to
delemit the call with NETDEV_UNREGISTER, but this patch
leaves them on the places, since they require additional
analysis, whether we need in them for something else.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-26 11:34:00 -04:00
Kirill Tkhai
ede2762d93 net: Make NETDEV_XXX commands enum { }
This patch is preparation to drop NETDEV_UNREGISTER_FINAL.
Since the cmd is used in usnic_ib_netdev_event_to_string()
to get cmd name, after plain removing NETDEV_UNREGISTER_FINAL
from everywhere, we'd have holes in event2str[] in this
function.

Instead of that, let's make NETDEV_XXX commands names
available for everyone, and to define netdev_cmd_to_name()
in the way we won't have to shaffle names after their
numbers are changed.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-26 11:33:26 -04:00
Sagi Grimberg
a470143fc8 net/utils: Introduce inet_addr_is_any
Can be useful to check INET_ANY address for both ipv4/ipv6 addresses.

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26 08:53:43 -06:00
kbuild test robot
da18ab32d7 tipc: tipc_disc_addr_trial_msg() can be static
Fixes: 25b0b9c4e8 ("tipc: handle collisions of 32-bit node address hash values")
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Acked-by: Jon Maloy jon.maloy@ericsson.com
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-25 21:21:43 -04:00
Paolo Abeni
10b8a3de60 ipv6: the entire IPv6 header chain must fit the first fragment
While building ipv6 datagram we currently allow arbitrary large
extheaders, even beyond pmtu size. The syzbot has found a way
to exploit the above to trigger the following splat:

kernel BUG at ./include/linux/skbuff.h:2073!
invalid opcode: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
    (ftrace buffer empty)
Modules linked in:
CPU: 1 PID: 4230 Comm: syzkaller672661 Not tainted 4.16.0-rc2+ #326
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
RIP: 0010:__skb_pull include/linux/skbuff.h:2073 [inline]
RIP: 0010:__ip6_make_skb+0x1ac8/0x2190 net/ipv6/ip6_output.c:1636
RSP: 0018:ffff8801bc18f0f0 EFLAGS: 00010293
RAX: ffff8801b17400c0 RBX: 0000000000000738 RCX: ffffffff84f01828
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8801b415ac18
RBP: ffff8801bc18f360 R08: ffff8801b4576844 R09: 0000000000000000
R10: ffff8801bc18f380 R11: ffffed00367aee4e R12: 00000000000000d6
R13: ffff8801b415a740 R14: dffffc0000000000 R15: ffff8801b45767c0
FS:  0000000001535880(0000) GS:ffff8801db300000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000002000b000 CR3: 00000001b4123001 CR4: 00000000001606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  ip6_finish_skb include/net/ipv6.h:969 [inline]
  udp_v6_push_pending_frames+0x269/0x3b0 net/ipv6/udp.c:1073
  udpv6_sendmsg+0x2a96/0x3400 net/ipv6/udp.c:1343
  inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:764
  sock_sendmsg_nosec net/socket.c:630 [inline]
  sock_sendmsg+0xca/0x110 net/socket.c:640
  ___sys_sendmsg+0x320/0x8b0 net/socket.c:2046
  __sys_sendmmsg+0x1ee/0x620 net/socket.c:2136
  SYSC_sendmmsg net/socket.c:2167 [inline]
  SyS_sendmmsg+0x35/0x60 net/socket.c:2162
  do_syscall_64+0x280/0x940 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x4404c9
RSP: 002b:00007ffdce35f948 EFLAGS: 00000217 ORIG_RAX: 0000000000000133
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004404c9
RDX: 0000000000000003 RSI: 0000000020001f00 RDI: 0000000000000003
RBP: 00000000006cb018 R08: 00000000004002c8 R09: 00000000004002c8
R10: 0000000020000080 R11: 0000000000000217 R12: 0000000000401df0
R13: 0000000000401e80 R14: 0000000000000000 R15: 0000000000000000
Code: ff e8 1d 5e b9 fc e9 15 e9 ff ff e8 13 5e b9 fc e9 44 e8 ff ff e8 29
5e b9 fc e9 c0 e6 ff ff e8 3f f3 80 fc 0f 0b e8 38 f3 80 fc <0f> 0b 49 8d
87 80 00 00 00 4d 8d 87 84 00 00 00 48 89 85 20 fe
RIP: __skb_pull include/linux/skbuff.h:2073 [inline] RSP: ffff8801bc18f0f0
RIP: __ip6_make_skb+0x1ac8/0x2190 net/ipv6/ip6_output.c:1636 RSP:
ffff8801bc18f0f0

As stated by RFC 7112 section 5:

   When a host fragments an IPv6 datagram, it MUST include the entire
   IPv6 Header Chain in the First Fragment.

So this patch addresses the issue dropping datagrams with excessive
extheader length. It also updates the error path to report to the
calling socket nonnegative pmtu values.

The issue apparently predates git history.

v1 -> v2: cleanup error path, as per Eric's suggestion

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: syzbot+91e6f9932ff122fa4410@syzkaller.appspotmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-25 21:17:20 -04:00
Alexander Potapenko
7880287981 netlink: make sure nladdr has correct size in netlink_connect()
KMSAN reports use of uninitialized memory in the case when |alen| is
smaller than sizeof(struct sockaddr_nl), and therefore |nladdr| isn't
fully copied from the userspace.

Signed-off-by: Alexander Potapenko <glider@google.com>
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-25 21:14:51 -04:00
Hans Wippel
bc58a1baf2 net/ipv4: disable SMC TCP option with SYN Cookies
Currently, the SMC experimental TCP option in a SYN packet is lost on
the server side when SYN Cookies are active. However, the corresponding
SYNACK sent back to the client contains the SMC option. This causes an
inconsistent view of the SMC capabilities on the client and server.

This patch disables the SMC option in the SYNACK when SYN Cookies are
active to avoid this issue.

Fixes: 60e2a77807 ("tcp: TCP experimental option for SMC")
Signed-off-by: Hans Wippel <hwippel@linux.vnet.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-25 20:53:54 -04:00
Yonghong Song
13acc94eff net: permit skb_segment on head_frag frag_list skb
One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at
function skb_segment(), line 3667. The bpf program attaches to
clsact ingress, calls bpf_skb_change_proto to change protocol
from ipv4 to ipv6 or from ipv6 to ipv4, and then calls bpf_redirect
to send the changed packet out.

3472 struct sk_buff *skb_segment(struct sk_buff *head_skb,
3473                             netdev_features_t features)
3474 {
3475         struct sk_buff *segs = NULL;
3476         struct sk_buff *tail = NULL;
...
3665                 while (pos < offset + len) {
3666                         if (i >= nfrags) {
3667                                 BUG_ON(skb_headlen(list_skb));
3668
3669                                 i = 0;
3670                                 nfrags = skb_shinfo(list_skb)->nr_frags;
3671                                 frag = skb_shinfo(list_skb)->frags;
3672                                 frag_skb = list_skb;
...

call stack:
...
 #1 [ffff883ffef03558] __crash_kexec at ffffffff8110c525
 #2 [ffff883ffef03620] crash_kexec at ffffffff8110d5cc
 #3 [ffff883ffef03640] oops_end at ffffffff8101d7e7
 #4 [ffff883ffef03668] die at ffffffff8101deb2
 #5 [ffff883ffef03698] do_trap at ffffffff8101a700
 #6 [ffff883ffef036e8] do_error_trap at ffffffff8101abfe
 #7 [ffff883ffef037a0] do_invalid_op at ffffffff8101acd0
 #8 [ffff883ffef037b0] invalid_op at ffffffff81a00bab
    [exception RIP: skb_segment+3044]
    RIP: ffffffff817e4dd4  RSP: ffff883ffef03860  RFLAGS: 00010216
    RAX: 0000000000002bf6  RBX: ffff883feb7aaa00  RCX: 0000000000000011
    RDX: ffff883fb87910c0  RSI: 0000000000000011  RDI: ffff883feb7ab500
    RBP: ffff883ffef03928   R8: 0000000000002ce2   R9: 00000000000027da
    R10: 000001ea00000000  R11: 0000000000002d82  R12: ffff883f90a1ee80
    R13: ffff883fb8791120  R14: ffff883feb7abc00  R15: 0000000000002ce2
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff883ffef03930] tcp_gso_segment at ffffffff818713e7

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-25 16:46:04 -04:00
David S. Miller
b9ee96b45f Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for your net tree,
they are:

1) Don't pick fixed hash implementation for NFT_SET_EVAL sets, otherwise
   userspace hits EOPNOTSUPP with valid rules using the meter statement,
   from Florian Westphal.

2) If you send a batch that flushes the existing ruleset (that contains
   a NAT chain) and the new ruleset definition comes with a new NAT
   chain, don't bogusly hit EBUSY. Also from Florian.

3) Missing netlink policy attribute validation, from Florian.

4) Detach conntrack template from skbuff if IP_NODEFRAG is set on,
   from Paolo Abeni.

5) Cache device names in flowtable object, otherwise we may end up
   walking over devices going aways given no rtnl_lock is held.

6) Fix incorrect net_device ingress with ingress hooks.

7) Fix crash when trying to read more data than available in UDP
   packets from the nf_socket infrastructure, from Subash.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-24 17:10:01 -04:00
Subash Abhinov Kasiviswanathan
32c1733f0d netfilter: nf_socket: Fix out of bounds access in nf_sk_lookup_slow_v{4,6}
skb_header_pointer will copy data into a buffer if data is non linear,
otherwise it will return a pointer in the linear section of the data.
nf_sk_lookup_slow_v{4,6} always copies data of size udphdr but later
accesses memory within the size of tcphdr (th->doff) in case of TCP
packets. This causes a crash when running with KASAN with the following
call stack -

BUG: KASAN: stack-out-of-bounds in xt_socket_lookup_slow_v4+0x524/0x718
net/netfilter/xt_socket.c:178
Read of size 2 at addr ffffffe3d417a87c by task syz-executor/28971
CPU: 2 PID: 28971 Comm: syz-executor Tainted: G    B   W  O    4.9.65+ #1
Call trace:
[<ffffff9467e8d390>] dump_backtrace+0x0/0x428 arch/arm64/kernel/traps.c:76
[<ffffff9467e8d7e0>] show_stack+0x28/0x38 arch/arm64/kernel/traps.c:226
[<ffffff946842d9b8>] __dump_stack lib/dump_stack.c:15 [inline]
[<ffffff946842d9b8>] dump_stack+0xd4/0x124 lib/dump_stack.c:51
[<ffffff946811d4b0>] print_address_description+0x68/0x258 mm/kasan/report.c:248
[<ffffff946811d8c8>] kasan_report_error mm/kasan/report.c:347 [inline]
[<ffffff946811d8c8>] kasan_report.part.2+0x228/0x2f0 mm/kasan/report.c:371
[<ffffff946811df44>] kasan_report+0x5c/0x70 mm/kasan/report.c:372
[<ffffff946811bebc>] check_memory_region_inline mm/kasan/kasan.c:308 [inline]
[<ffffff946811bebc>] __asan_load2+0x84/0x98 mm/kasan/kasan.c:739
[<ffffff94694d6f04>] __tcp_hdrlen include/linux/tcp.h:35 [inline]
[<ffffff94694d6f04>] xt_socket_lookup_slow_v4+0x524/0x718 net/netfilter/xt_socket.c:178

Fix this by copying data into appropriate size headers based on protocol.

Fixes: a583636a83 ("inet: refactor inet[6]_lookup functions to take skb")
Signed-off-by: Tejaswi Tanikella <tejaswit@codeaurora.org>
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-03-24 21:17:14 +01:00
Linus Lüssing
a752c0a452 batman-adv: fix packet loss for broadcasted DHCP packets to a server
DHCP connectivity issues can currently occur if the following conditions
are met:

1) A DHCP packet from a client to a server
2) This packet has a multicast destination
3) This destination has a matching entry in the translation table
   (FF:FF:FF:FF:FF:FF for IPv4, 33:33:00:01:00:02/33:33:00:01:00:03
    for IPv6)
4) The orig-node determined by TT for the multicast destination
   does not match the orig-node determined by best-gateway-selection

In this case the DHCP packet will be dropped.

The "gateway-out-of-range" check is supposed to only be applied to
unicasted DHCP packets to a specific DHCP server.

In that case dropping the the unicasted frame forces the client to
retry via a broadcasted one, but now directed to the new best
gateway.

A DHCP packet with broadcast/multicast destination is already ensured to
always be delivered to the best gateway. Dropping a multicasted
DHCP packet here will only prevent completing DHCP as there is no
other fallback.

So far, it seems the unicast check was implicitly performed by
expecting the batadv_transtable_search() to return NULL for multicast
destinations. However, a multicast address could have always ended up in
the translation table and in fact is now common.

To fix this potential loss of a DHCP client-to-server packet to a
multicast address this patch adds an explicit multicast destination
check to reliably bail out of the gateway-out-of-range check for such
destinations.

The issue and fix were tested in the following three node setup:

- Line topology, A-B-C
- A: gateway client, DHCP client
- B: gateway server, hop-penalty increased: 30->60, DHCP server
- C: gateway server, code modifications to announce FF:FF:FF:FF:FF:FF

Without this patch, A would never transmit its DHCP Discover packet
due to an always "out-of-range" condition. With this patch,
a full DHCP handshake between A and B was possible again.

Fixes: be7af5cf9c ("batman-adv: refactoring gateway handling code")
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2018-03-24 10:25:49 +01:00
Linus Lüssing
f8fb3419ea batman-adv: fix multicast-via-unicast transmission with AP isolation
For multicast frames AP isolation is only supposed to be checked on
the receiving nodes and never on the originating one.

Furthermore, the isolation or wifi flag bits should only be intepreted
as such for unicast and never multicast TT entries.

By injecting flags to the multicast TT entry claimed by a single
target node it was verified in tests that this multicast address
becomes unreachable, leading to packet loss.

Omitting the "src" parameter to the batadv_transtable_search() call
successfully skipped the AP isolation check and made the target
reachable again.

Fixes: 1d8ab8d3c1 ("batman-adv: Modified forwarding behaviour for multicast packets")
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2018-03-24 10:25:07 +01:00
Davide Caratti
94cb549240 net/sched: act_vlan: declare push_vid with host byte order
use u16 in place of __be16 to suppress the following sparse warnings:

 net/sched/act_vlan.c:150:26: warning: incorrect type in assignment (different base types)
 net/sched/act_vlan.c:150:26: expected restricted __be16 [usertype] push_vid
 net/sched/act_vlan.c:150:26: got unsigned short
 net/sched/act_vlan.c:151:21: warning: restricted __be16 degrades to integer
 net/sched/act_vlan.c:208:26: warning: incorrect type in assignment (different base types)
 net/sched/act_vlan.c:208:26: expected unsigned short [unsigned] [usertype] tcfv_push_vid
 net/sched/act_vlan.c:208:26: got restricted __be16 [usertype] push_vid

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-23 21:54:07 -04:00
Davide Caratti
affaa0c724 net/sched: remove tcf_idr_cleanup()
tcf_idr_cleanup() is no more used, so remove it.

Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-23 21:52:19 -04:00
Eric Dumazet
1bfa26ff8c ipv6: fix possible deadlock in rt6_age_examine_exception()
syzbot reported a LOCKDEP splat [1] in rt6_age_examine_exception()

rt6_age_examine_exception() is called while rt6_exception_lock is held.
This lock is the lower one in the lock hierarchy, thus we can not
call dst_neigh_lookup() function, as it can fallback to neigh_create()

We should instead do a pure RCU lookup. As a bonus we avoid
a pair of atomic operations on neigh refcount.

[1]

WARNING: possible circular locking dependency detected
4.16.0-rc4+ #277 Not tainted

syz-executor7/4015 is trying to acquire lock:
 (&ndev->lock){++--}, at: [<00000000416dce19>] __ipv6_dev_mc_dec+0x45/0x350 net/ipv6/mcast.c:928

but task is already holding lock:
 (&tbl->lock){++-.}, at: [<00000000b5cb1d65>] neigh_ifdown+0x3d/0x250 net/core/neighbour.c:292

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #3 (&tbl->lock){++-.}:
       __raw_write_lock_bh include/linux/rwlock_api_smp.h:203 [inline]
       _raw_write_lock_bh+0x31/0x40 kernel/locking/spinlock.c:312
       __neigh_create+0x87e/0x1d90 net/core/neighbour.c:528
       neigh_create include/net/neighbour.h:315 [inline]
       ip6_neigh_lookup+0x9a7/0xba0 net/ipv6/route.c:228
       dst_neigh_lookup include/net/dst.h:405 [inline]
       rt6_age_examine_exception net/ipv6/route.c:1609 [inline]
       rt6_age_exceptions+0x381/0x660 net/ipv6/route.c:1645
       fib6_age+0xfb/0x140 net/ipv6/ip6_fib.c:2033
       fib6_clean_node+0x389/0x580 net/ipv6/ip6_fib.c:1919
       fib6_walk_continue+0x46c/0x8a0 net/ipv6/ip6_fib.c:1845
       fib6_walk+0x91/0xf0 net/ipv6/ip6_fib.c:1893
       fib6_clean_tree+0x1e6/0x340 net/ipv6/ip6_fib.c:1970
       __fib6_clean_all+0x1f4/0x3a0 net/ipv6/ip6_fib.c:1986
       fib6_clean_all net/ipv6/ip6_fib.c:1997 [inline]
       fib6_run_gc+0x16b/0x3c0 net/ipv6/ip6_fib.c:2053
       ndisc_netdev_event+0x3c2/0x4a0 net/ipv6/ndisc.c:1781
       notifier_call_chain+0x136/0x2c0 kernel/notifier.c:93
       __raw_notifier_call_chain kernel/notifier.c:394 [inline]
       raw_notifier_call_chain+0x2d/0x40 kernel/notifier.c:401
       call_netdevice_notifiers_info+0x32/0x70 net/core/dev.c:1707
       call_netdevice_notifiers net/core/dev.c:1725 [inline]
       __dev_notify_flags+0x262/0x430 net/core/dev.c:6960
       dev_change_flags+0xf5/0x140 net/core/dev.c:6994
       devinet_ioctl+0x126a/0x1ac0 net/ipv4/devinet.c:1080
       inet_ioctl+0x184/0x310 net/ipv4/af_inet.c:919
       sock_do_ioctl+0xef/0x390 net/socket.c:957
       sock_ioctl+0x36b/0x610 net/socket.c:1081
       vfs_ioctl fs/ioctl.c:46 [inline]
       do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:686
       SYSC_ioctl fs/ioctl.c:701 [inline]
       SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692
       do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x42/0xb7

-> #2 (rt6_exception_lock){+.-.}:
       __raw_spin_lock_bh include/linux/spinlock_api_smp.h:135 [inline]
       _raw_spin_lock_bh+0x31/0x40 kernel/locking/spinlock.c:168
       spin_lock_bh include/linux/spinlock.h:315 [inline]
       rt6_flush_exceptions+0x21/0x210 net/ipv6/route.c:1367
       fib6_del_route net/ipv6/ip6_fib.c:1677 [inline]
       fib6_del+0x624/0x12c0 net/ipv6/ip6_fib.c:1761
       __ip6_del_rt+0xc7/0x120 net/ipv6/route.c:2980
       ip6_del_rt+0x132/0x1a0 net/ipv6/route.c:2993
       __ipv6_dev_ac_dec+0x3b1/0x600 net/ipv6/anycast.c:332
       ipv6_dev_ac_dec net/ipv6/anycast.c:345 [inline]
       ipv6_sock_ac_close+0x2b4/0x3e0 net/ipv6/anycast.c:200
       inet6_release+0x48/0x70 net/ipv6/af_inet6.c:433
       sock_release+0x8d/0x1e0 net/socket.c:594
       sock_close+0x16/0x20 net/socket.c:1149
       __fput+0x327/0x7e0 fs/file_table.c:209
       ____fput+0x15/0x20 fs/file_table.c:243
       task_work_run+0x199/0x270 kernel/task_work.c:113
       exit_task_work include/linux/task_work.h:22 [inline]
       do_exit+0x9bb/0x1ad0 kernel/exit.c:865
       do_group_exit+0x149/0x400 kernel/exit.c:968
       get_signal+0x73a/0x16d0 kernel/signal.c:2469
       do_signal+0x90/0x1e90 arch/x86/kernel/signal.c:809
       exit_to_usermode_loop+0x258/0x2f0 arch/x86/entry/common.c:162
       prepare_exit_to_usermode arch/x86/entry/common.c:196 [inline]
       syscall_return_slowpath arch/x86/entry/common.c:265 [inline]
       do_syscall_64+0x6ec/0x940 arch/x86/entry/common.c:292
       entry_SYSCALL_64_after_hwframe+0x42/0xb7

-> #1 (&(&tb->tb6_lock)->rlock){+.-.}:
       __raw_spin_lock_bh include/linux/spinlock_api_smp.h:135 [inline]
       _raw_spin_lock_bh+0x31/0x40 kernel/locking/spinlock.c:168
       spin_lock_bh include/linux/spinlock.h:315 [inline]
       __ip6_ins_rt+0x56/0x90 net/ipv6/route.c:1007
       ip6_route_add+0x141/0x190 net/ipv6/route.c:2955
       addrconf_prefix_route+0x44f/0x620 net/ipv6/addrconf.c:2359
       fixup_permanent_addr net/ipv6/addrconf.c:3368 [inline]
       addrconf_permanent_addr net/ipv6/addrconf.c:3391 [inline]
       addrconf_notify+0x1ad2/0x2310 net/ipv6/addrconf.c:3460
       notifier_call_chain+0x136/0x2c0 kernel/notifier.c:93
       __raw_notifier_call_chain kernel/notifier.c:394 [inline]
       raw_notifier_call_chain+0x2d/0x40 kernel/notifier.c:401
       call_netdevice_notifiers_info+0x32/0x70 net/core/dev.c:1707
       call_netdevice_notifiers net/core/dev.c:1725 [inline]
       __dev_notify_flags+0x15d/0x430 net/core/dev.c:6958
       dev_change_flags+0xf5/0x140 net/core/dev.c:6994
       do_setlink+0xa22/0x3bb0 net/core/rtnetlink.c:2357
       rtnl_newlink+0xf37/0x1a50 net/core/rtnetlink.c:2965
       rtnetlink_rcv_msg+0x57f/0xb10 net/core/rtnetlink.c:4641
       netlink_rcv_skb+0x14b/0x380 net/netlink/af_netlink.c:2444
       rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:4659
       netlink_unicast_kernel net/netlink/af_netlink.c:1308 [inline]
       netlink_unicast+0x4c4/0x6b0 net/netlink/af_netlink.c:1334
       netlink_sendmsg+0xa4a/0xe60 net/netlink/af_netlink.c:1897
       sock_sendmsg_nosec net/socket.c:629 [inline]
       sock_sendmsg+0xca/0x110 net/socket.c:639
       ___sys_sendmsg+0x767/0x8b0 net/socket.c:2047
       __sys_sendmsg+0xe5/0x210 net/socket.c:2081
       SYSC_sendmsg net/socket.c:2092 [inline]
       SyS_sendmsg+0x2d/0x50 net/socket.c:2088
       do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x42/0xb7

-> #0 (&ndev->lock){++--}:
       lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3920
       __raw_write_lock_bh include/linux/rwlock_api_smp.h:203 [inline]
       _raw_write_lock_bh+0x31/0x40 kernel/locking/spinlock.c:312
       __ipv6_dev_mc_dec+0x45/0x350 net/ipv6/mcast.c:928
       ipv6_dev_mc_dec+0x110/0x1f0 net/ipv6/mcast.c:961
       pndisc_destructor+0x21a/0x340 net/ipv6/ndisc.c:392
       pneigh_ifdown net/core/neighbour.c:695 [inline]
       neigh_ifdown+0x149/0x250 net/core/neighbour.c:294
       rt6_disable_ip+0x537/0x700 net/ipv6/route.c:3874
       addrconf_ifdown+0x14b/0x14f0 net/ipv6/addrconf.c:3633
       addrconf_notify+0x5f8/0x2310 net/ipv6/addrconf.c:3557
       notifier_call_chain+0x136/0x2c0 kernel/notifier.c:93
       __raw_notifier_call_chain kernel/notifier.c:394 [inline]
       raw_notifier_call_chain+0x2d/0x40 kernel/notifier.c:401
       call_netdevice_notifiers_info+0x32/0x70 net/core/dev.c:1707
       call_netdevice_notifiers net/core/dev.c:1725 [inline]
       __dev_notify_flags+0x262/0x430 net/core/dev.c:6960
       dev_change_flags+0xf5/0x140 net/core/dev.c:6994
       devinet_ioctl+0x126a/0x1ac0 net/ipv4/devinet.c:1080
       inet_ioctl+0x184/0x310 net/ipv4/af_inet.c:919
       packet_ioctl+0x1ff/0x310 net/packet/af_packet.c:4066
       sock_do_ioctl+0xef/0x390 net/socket.c:957
       sock_ioctl+0x36b/0x610 net/socket.c:1081
       vfs_ioctl fs/ioctl.c:46 [inline]
       do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:686
       SYSC_ioctl fs/ioctl.c:701 [inline]
       SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692
       do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x42/0xb7

other info that might help us debug this:

Chain exists of:
  &ndev->lock --> rt6_exception_lock --> &tbl->lock

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&tbl->lock);
                               lock(rt6_exception_lock);
                               lock(&tbl->lock);
  lock(&ndev->lock);

 *** DEADLOCK ***

2 locks held by syz-executor7/4015:
 #0:  (rtnl_mutex){+.+.}, at: [<00000000a2f16daa>] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74
 #1:  (&tbl->lock){++-.}, at: [<00000000b5cb1d65>] neigh_ifdown+0x3d/0x250 net/core/neighbour.c:292

stack backtrace:
CPU: 0 PID: 4015 Comm: syz-executor7 Not tainted 4.16.0-rc4+ #277
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x194/0x24d lib/dump_stack.c:53
 print_circular_bug.isra.38+0x2cd/0x2dc kernel/locking/lockdep.c:1223
 check_prev_add kernel/locking/lockdep.c:1863 [inline]
 check_prevs_add kernel/locking/lockdep.c:1976 [inline]
 validate_chain kernel/locking/lockdep.c:2417 [inline]
 __lock_acquire+0x30a8/0x3e00 kernel/locking/lockdep.c:3431
 lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3920
 __raw_write_lock_bh include/linux/rwlock_api_smp.h:203 [inline]
 _raw_write_lock_bh+0x31/0x40 kernel/locking/spinlock.c:312
 __ipv6_dev_mc_dec+0x45/0x350 net/ipv6/mcast.c:928
 ipv6_dev_mc_dec+0x110/0x1f0 net/ipv6/mcast.c:961
 pndisc_destructor+0x21a/0x340 net/ipv6/ndisc.c:392
 pneigh_ifdown net/core/neighbour.c:695 [inline]
 neigh_ifdown+0x149/0x250 net/core/neighbour.c:294
 rt6_disable_ip+0x537/0x700 net/ipv6/route.c:3874
 addrconf_ifdown+0x14b/0x14f0 net/ipv6/addrconf.c:3633
 addrconf_notify+0x5f8/0x2310 net/ipv6/addrconf.c:3557
 notifier_call_chain+0x136/0x2c0 kernel/notifier.c:93
 __raw_notifier_call_chain kernel/notifier.c:394 [inline]
 raw_notifier_call_chain+0x2d/0x40 kernel/notifier.c:401
 call_netdevice_notifiers_info+0x32/0x70 net/core/dev.c:1707
 call_netdevice_notifiers net/core/dev.c:1725 [inline]
 __dev_notify_flags+0x262/0x430 net/core/dev.c:6960
 dev_change_flags+0xf5/0x140 net/core/dev.c:6994
 devinet_ioctl+0x126a/0x1ac0 net/ipv4/devinet.c:1080
 inet_ioctl+0x184/0x310 net/ipv4/af_inet.c:919
 packet_ioctl+0x1ff/0x310 net/packet/af_packet.c:4066
 sock_do_ioctl+0xef/0x390 net/socket.c:957
 sock_ioctl+0x36b/0x610 net/socket.c:1081
 vfs_ioctl fs/ioctl.c:46 [inline]
 do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:686
 SYSC_ioctl fs/ioctl.c:701 [inline]
 SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692
 do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x42/0xb7

Fixes: c757faa8bf ("ipv6: prepare fib6_age() for exception table")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Acked-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-23 13:40:34 -04:00
Jon Maloy
52dfae5c85 tipc: obtain node identity from interface by default
Selecting and explicitly configuring a TIPC node identity may be
unwanted in some cases.

In this commit we introduce a default setting if the identity has not
been set at the moment the first bearer is enabled. We do this by
using a raw copy of a unique identifier from the used interface: MAC
address in the case of an L2 bearer, IPv4/IPv6 address in the case
of a UDP bearer.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-23 13:12:18 -04:00
Jon Maloy
25b0b9c4e8 tipc: handle collisions of 32-bit node address hash values
When a 32-bit node address is generated from a 128-bit identifier,
there is a risk of collisions which must be discovered and handled.

We do this as follows:
- We don't apply the generated address immediately to the node, but do
  instead initiate a 1 sec trial period to allow other cluster members
  to discover and handle such collisions.

- During the trial period the node periodically sends out a new type
  of message, DSC_TRIAL_MSG, using broadcast or emulated broadcast,
  to all the other nodes in the cluster.

- When a node is receiving such a message, it must check that the
  presented 32-bit identifier either is unused, or was used by the very
  same peer in a previous session. In both cases it accepts the request
  by not responding to it.

- If it finds that the same node has been up before using a different
  address, it responds with a DSC_TRIAL_FAIL_MSG containing that
  address.

- If it finds that the address has already been taken by some other
  node, it generates a new, unused address and returns it to the
  requester.

- During the trial period the requesting node must always be prepared
  to accept a failure message, i.e., a message where a peer suggests a
  different (or equal)  address to the one tried. In those cases it
  must apply the suggested value as trial address and restart the trial
  period.

This algorithm ensures that in the vast majority of cases a node will
have the same address before and after a reboot. If a legacy user
configures the address explicitly, there will be no trial period and
messages, so this protocol addition is completely backwards compatible.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-23 13:12:18 -04:00
Jon Maloy
d50ccc2d39 tipc: add 128-bit node identifier
We add a 128-bit node identity, as an alternative to the currently used
32-bit node address.

For the sake of compatibility and to minimize message header changes
we retain the existing 32-bit address field. When not set explicitly by
the user, this field will be filled with a hash value generated from the
much longer node identity, and be used as a shorthand value for the
latter.

We permit either the address or the identity to be set by configuration,
but not both, so when the address value is set by a legacy user the
corresponding 128-bit node identity is generated based on the that value.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-23 13:12:18 -04:00
Jon Maloy
23fd3eace0 tipc: remove direct accesses to own_addr field in struct tipc_net
As a preparation to changing the addressing structure of TIPC we replace
all direct accesses to the tipc_net::own_addr field with the function
dedicated for this, tipc_own_addr().

There are no changes to program logics in this commit.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-23 13:12:18 -04:00
Jon Maloy
b89afb116c tipc: allow closest-first lookup algorithm when legacy address is configured
The removal of an internal structure of the node address has an unwanted
side effect.
- Currently, if a user is sending an anycast message with destination
  domain 0, the tipc_namebl_translate() function will use the 'closest-
  first' algorithm to first look for a node local destination, and only
  when no such is found, will it resort to the cluster global 'round-
  robin' lookup algorithm.
- Current users can get around this, and enforce unconditional use of
  global round-robin by indicating a destination as Z.0.0 or Z.C.0.
- This option disappears when we make the node address flat, since the
  lookup algorithm has no way of recognizing this case. So, as long as
  there are node local destinations, the algorithm will always select
  one of those, and there is nothing the sender can do to change this.

We solve this by eliminating the 'closest-first' option, which was never
a good idea anyway, for non-legacy users, but only for those. To
distinguish between legacy users and non-legacy users we introduce a new
flag 'legacy_addr_format' in struct tipc_core, to be set when the user
configures a legacy-style Z.C.N node address. Hence, when a legacy user
indicates a zero lookup domain 'closest-first' is selected, and in all
other cases we use 'round-robin'.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-23 13:12:18 -04:00
Jon Maloy
2026364149 tipc: remove restrictions on node address values
Nominally, TIPC organizes network nodes into a three-level network
hierarchy consisting of the levels 'zone', 'cluster' and 'node'. This
hierarchy is reflected in the node address format, - it is sub-divided
into an 8-bit zone id, and 12 bit cluster id, and a 12-bit node id.

However, the 'zone' and 'cluster' levels have in reality never been
fully implemented,and never will be. The result of this has been
that the first 20 bits the node identity structure have been wasted,
and the usable node identity range within a cluster has been limited
to 12 bits. This is starting to become a problem.

In the following commits, we will need to be able to connect between
nodes which are using the whole 32-bit value space of the node address.
We therefore remove the restrictions on which values can be assigned
to node identity, -it is from now on only a 32-bit integer with no
assumed internal structure.

Isolation between clusters is now achieved only by setting different
values for the 'network id' field used during neighbor discovery, in
practice leading to the latter becoming the new cluster identity.

The rules for accepting discovery requests/responses from neighboring
nodes now become:

- If the user is using legacy address format on both peers, reception
  of discovery messages is subject to the legacy lookup domain check
  in addition to the cluster id check.

- Otherwise, the discovery request/response is always accepted, provided
  both peers have the same network id.

This secures backwards compatibility for users who have been using zone
or cluster identities as cluster separators, instead of the intended
'network id'.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-23 13:12:18 -04:00
Jon Maloy
b39e465e56 tipc: some cleanups in the file discover.c
To facilitate the coming changes in the neighbor discovery functionality
we make some renaming and refactoring of that code. The functional changes
in this commit are trivial, e.g., that we move the message sending call in
tipc_disc_timeout() outside the spinlock protected region.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-23 13:12:17 -04:00
Jon Maloy
cb30a63384 tipc: refactor function tipc_enable_bearer()
As a preparation for the next commits we try to reduce the footprint of
the function tipc_enable_bearer(), while hopefully making is simpler to
follow.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-23 13:12:17 -04:00
Kirill Tkhai
b2864fbdc5 net: Convert rxrpc_net_ops
These pernet_operations modifies rxrpc_net_id-pointed
per-net entities. There is external link to AF_RXRPC
in fs/afs/Kconfig, but it seems there is no other
pernet_operations interested in that per-net entities.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-23 13:00:47 -04:00
Kirill Tkhai
fc18999ed2 net: Convert udp_sysctl_ops
These pernet_operations just initialize udp4 defaults.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-23 13:00:46 -04:00
Petr Machata
f6cc9c054e ip_tunnel: Emit events for post-register MTU changes
For tunnels created with IFLA_MTU, MTU of the netdevice is set by
rtnl_create_link() (called from rtnl_newlink()) before the device is
registered. However without IFLA_MTU that's not done.

rtnl_newlink() proceeds by calling struct rtnl_link_ops.newlink, which
via ip_tunnel_newlink() calls register_netdevice(), and that emits
NETDEV_REGISTER. Thus any listeners that inspect the netdevice get the
MTU of 0.

After ip_tunnel_newlink() corrects the MTU after registering the
netdevice, but since there's no event, the listeners don't get to know
about the MTU until something else happens--such as a NETDEV_UP event.
That's not ideal.

So instead of setting the MTU directly, go through dev_set_mtu(), which
takes care of distributing the necessary NETDEV_PRECHANGEMTU and
NETDEV_CHANGEMTU events.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-23 12:54:34 -04:00
Nikolay Aleksandrov
82792a070b net: bridge: fix direct access to bridge vlan_enabled and use helper
We need to use br_vlan_enabled() helper otherwise we'll break builds
without bridge vlans:
net/bridge//br_if.c: In function ‘br_mtu’:
net/bridge//br_if.c:458:8: error: ‘const struct net_bridge’ has no
member named ‘vlan_enabled’
  if (br->vlan_enabled)
        ^
net/bridge//br_if.c:462:1: warning: control reaches end of non-void
function [-Wreturn-type]
 }
 ^
scripts/Makefile.build:324: recipe for target 'net/bridge//br_if.o'
failed

Fixes: 419d14af9e ("bridge: Allow max MTU when multiple VLANs present")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-23 12:41:14 -04:00
Dave Watson
c46234ebb4 tls: RX path for ktls
Add rx path for tls software implementation.

recvmsg, splice_read, and poll implemented.

An additional sockopt TLS_RX is added, with the same interface as
TLS_TX.  Either TLX_RX or TLX_TX may be provided separately, or
together (with two different setsockopt calls with appropriate keys).

Control messages are passed via CMSG in a similar way to transmit.
If no cmsg buffer is passed, then only application data records
will be passed to userspace, and EIO is returned for other types of
alerts.

EBADMSG is passed for decryption errors, and EMSGSIZE is passed for
framing too big, and EBADMSG for framing too small (matching openssl
semantics). EINVAL is returned for TLS versions that do not match the
original setsockopt call.  All are unrecoverable.

strparser is used to parse TLS framing.   Decryption is done directly
in to userspace buffers if they are large enough to support it, otherwise
sk_cow_data is called (similar to ipsec), and buffers are decrypted in
place and copied.  splice_read always decrypts in place, since no
buffers are provided to decrypt in to.

sk_poll is overridden, and only returns POLLIN if a full TLS message is
received.  Otherwise we wait for strparser to finish reading a full frame.
Actual decryption is only done during recvmsg or splice_read calls.

Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-23 12:25:54 -04:00
Dave Watson
583715853a tls: Refactor variable names
Several config variables are prefixed with tx, drop the prefix
since these will be used for both tx and rx.

Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-23 12:25:54 -04:00
Dave Watson
f4a8e43f1f tls: Pass error code explicitly to tls_err_abort
Pass EBADMSG explicitly to tls_err_abort.  Receive path will
pass additional codes - EMSGSIZE if framing is larger than max
TLS record size, EINVAL if TLS version mismatch.

Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-23 12:25:54 -04:00
Dave Watson
dbe425599b tls: Move cipher info to a separate struct
Separate tx crypto parameters to a separate cipher_context struct.
The same parameters will be used for rx using the same struct.

tls_advance_record_sn is modified to only take the cipher info.

Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-23 12:25:53 -04:00
Dave Watson
69ca9293e8 tls: Generalize zerocopy_from_iter
Refactor zerocopy_from_iter to take arguments for pages and size,
such that it can be used for both tx and rx. RX will also support
zerocopy direct to output iter, as long as the full message can
be copied at once (a large enough userspace buffer was provided).

Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-23 12:25:53 -04:00
Chas Williams
419d14af9e bridge: Allow max MTU when multiple VLANs present
If the bridge is allowing multiple VLANs, some VLANs may have
different MTUs.  Instead of choosing the minimum MTU for the
bridge interface, choose the maximum MTU of the bridge members.
With this the user only needs to set a larger MTU on the member
ports that are participating in the large MTU VLANS.

Signed-off-by: Chas Williams <3chas3@gmail.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-23 12:17:30 -04:00
David S. Miller
03fe2debbb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Fun set of conflict resolutions here...

For the mac80211 stuff, these were fortunately just parallel
adds.  Trivially resolved.

In drivers/net/phy/phy.c we had a bug fix in 'net' that moved the
function phy_disable_interrupts() earlier in the file, whilst in
'net-next' the phy_error() call from this function was removed.

In net/ipv4/xfrm4_policy.c, David Ahern's changes to remove the
'rt_table_id' member of rtable collided with a bug fix in 'net' that
added a new struct member "rt_mtu_locked" which needs to be copied
over here.

The mlxsw driver conflict consisted of net-next separating
the span code and definitions into separate files, whilst
a 'net' bug fix made some changes to that moved code.

The mlx5 infiniband conflict resolution was quite non-trivial,
the RDMA tree's merge commit was used as a guide here, and
here are their notes:

====================

    Due to bug fixes found by the syzkaller bot and taken into the for-rc
    branch after development for the 4.17 merge window had already started
    being taken into the for-next branch, there were fairly non-trivial
    merge issues that would need to be resolved between the for-rc branch
    and the for-next branch.  This merge resolves those conflicts and
    provides a unified base upon which ongoing development for 4.17 can
    be based.

    Conflicts:
            drivers/infiniband/hw/mlx5/main.c - Commit 42cea83f95
            (IB/mlx5: Fix cleanup order on unload) added to for-rc and
            commit b5ca15ad7e (IB/mlx5: Add proper representors support)
            add as part of the devel cycle both needed to modify the
            init/de-init functions used by mlx5.  To support the new
            representors, the new functions added by the cleanup patch
            needed to be made non-static, and the init/de-init list
            added by the representors patch needed to be modified to
            match the init/de-init list changes made by the cleanup
            patch.
    Updates:
            drivers/infiniband/hw/mlx5/mlx5_ib.h - Update function
            prototypes added by representors patch to reflect new function
            names as changed by cleanup patch
            drivers/infiniband/hw/mlx5/ib_rep.c - Update init/de-init
            stage list to match new order from cleanup patch
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-23 11:31:58 -04:00
Pradeep Kumar Chitrapu
dcbe73ca55 mac80211: notify driver for change in multicast rates
With drivers implementing rate control in driver or firmware
rate_control_send_low() may not get called, and thus the
driver needs to know about changes in the multicast rate.

Add and use a new BSS change flag for this.

Signed-off-by: Pradeep Kumar Chitrapu <pradeepc@codeaurora.org>
[rewrite commit message]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-03-23 13:23:17 +01:00
Steffen Klassert
9a3fb9fb84 xfrm: Fix transport mode skb control buffer usage.
A recent commit introduced a new struct xfrm_trans_cb
that is used with the sk_buff control buffer. Unfortunately
it placed the structure in front of the control buffer and
overlooked that the IPv4/IPv6 control buffer is still needed
for some layer 4 protocols. As a result the IPv4/IPv6 control
buffer is overwritten with this structure. Fix this by setting
a apropriate header in front of the structure.

Fixes acf568ee85 ("xfrm: Reinject transport-mode packets ...")
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-03-23 07:56:04 +01:00
Kirill Tkhai
d9ff304973 net: Replace ip_ra_lock with per-net mutex
Since ra_chain is per-net, we may use per-net mutexes
to protect them in ip_ra_control(). This improves
scalability.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22 15:12:56 -04:00
Kirill Tkhai
5796ef75ec net: Make ip_ra_chain per struct net
This is optimization, which makes ip_call_ra_chain()
iterate less sockets to find the sockets it's looking for.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22 15:12:56 -04:00
Kirill Tkhai
128aaa98ad net: Revert "ipv4: fix a deadlock in ip_ra_control"
This reverts commit 1215e51eda.
Since raw_close() is used on every RAW socket destruction,
the changes made by 1215e51eda scale sadly. This clearly
seen on endless unshare(CLONE_NEWNET) test, and cleanup_net()
kwork spends a lot of time waiting for rtnl_lock() introduced
by this commit.

Previous patch moved IP_ROUTER_ALERT out of rtnl_lock(),
so we revert this patch.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22 15:12:56 -04:00
Kirill Tkhai
0526947f9d net: Move IP_ROUTER_ALERT out of lock_sock(sk)
ip_ra_control() does not need sk_lock. Who are the another
users of ip_ra_chain? ip_mroute_setsockopt() doesn't take
sk_lock, while parallel IP_ROUTER_ALERT syscalls are
synchronized by ip_ra_lock. So, we may move this command
out of sk_lock.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22 15:12:55 -04:00
Kirill Tkhai
76d3e153d0 net: Revert "ipv4: get rid of ip_ra_lock"
This reverts commit ba3f571d5d. The commit was made
after 1215e51eda "ipv4: fix a deadlock in ip_ra_control",
and killed ip_ra_lock, which became useless after rtnl_lock()
made used to destroy every raw ipv4 socket. This scales
very bad, and next patch in series reverts 1215e51eda.
ip_ra_lock will be used again.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22 15:12:55 -04:00
Colin Ian King
1574639411 gre: fix TUNNEL_SEQ bit check on sequence numbering
The current logic of flags | TUNNEL_SEQ is always non-zero and hence
sequence numbers are always incremented no matter the setting of the
TUNNEL_SEQ bit.  Fix this by using & instead of |.

Detected by CoverityScan, CID#1466039 ("Operands don't affect result")

Fixes: 77a5196a80 ("gre: add sequence number for collect md mode.")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22 14:52:43 -04:00
GhantaKrishnamurthy MohanKrishna
872619d8cf tipc: step sk->sk_drops when rcv buffer is full
Currently when tipc is unable to queue a received message on a
socket, the message is rejected back to the sender with error
TIPC_ERR_OVERLOAD. However, the application on this socket
has no knowledge about these discards.

In this commit, we try to step the sk_drops counter when tipc
is unable to queue a received message. Export sk_drops
using tipc socket diagnostics.

Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22 14:43:37 -04:00
GhantaKrishnamurthy MohanKrishna
c30b70deb5 tipc: implement socket diagnostics for AF_TIPC
This commit adds socket diagnostics capability for AF_TIPC in netlink
family NETLINK_SOCK_DIAG in a new kernel module (diag.ko).

The following are key design considerations:
- config TIPC_DIAG has default y, like INET_DIAG.
- only requests with flag NLM_F_DUMP is supported (dump all).
- tipc_sock_diag_req message is introduced to send filter parameters.
- the response attributes are of TLV, some nested.

To avoid exposing data structures between diag and tipc modules and
avoid code duplication, the following additions are required:
- export tipc_nl_sk_walk function to reuse socket iterator.
- export tipc_sk_fill_sock_diag to fill the tipc diag attributes.
- create a sock_diag response message in __tipc_add_sock_diag defined
  in diag.c and use the above exported tipc_sk_fill_sock_diag
  to fill response.

Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22 14:43:35 -04:00
GhantaKrishnamurthy MohanKrishna
dfde331e75 tipc: modify socket iterator for sock_diag
The current socket iterator function tipc_nl_sk_dump, handles socket
locks and calls __tipc_nl_add_sk for each socket.
To reuse this logic in sock_diag implementation, we do minor
modifications to make these functions generic as described below.

In this commit, we add a two new functions __tipc_nl_sk_walk,
__tipc_nl_add_sk_info and modify tipc_nl_sk_dump, __tipc_nl_add_sk
accordingly.

In __tipc_nl_sk_walk we:
1. acquire and release socket locks
2. for each socket, execute the specified callback function

In __tipc_nl_add_sk we:
- Move the netlink attribute insertion to __tipc_nl_add_sk_info.

tipc_nl_sk_dump calls tipc_nl_sk_walk with __tipc_nl_add_sk as argument.

sock_diag will use these generic functions in a later commit.

There is no functional change in this commit.
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22 14:43:29 -04:00
David S. Miller
e0645d9b96 Two more fixes (in three patches):
* ath9k_htc doesn't like QoS NDP frames, use regular ones
  * hwsim: set up wmediumd for radios created later
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAlqySjUACgkQB8qZga/f
 l8RDig//bV/Fnwn7deyR7LJOi/g6HVyueCUi+cturTo9RQEQHnBtRVRu3c2Lnd+o
 74f+2ofEWEFOYqTvZq5jUWjANOZ/ZGgA+fUw6tOfSmjkEPw9EkST8mQl7lKH0dSR
 DYrRLKCHwPof9MQQgXHLq44e/26yFCxYP+ADSn9Q6yqo84e/cxP76nqSwYGLforx
 By1zzkMXPKtfvXTZ41UZTfXRMml2LIxBbCoWTfScIZWvusQjCl653f3lBfR3fynj
 qwzJJfcjt1TnOQSgCLzk03xi+ci5o157//GmvjT2hrRjRL3i22e+cmFtnXGNCjtg
 avmc7HftV0sAY8gecqhCLOiWnCuxxjDz1KxZW3dccuJxh1/jfsGtby3H/6stkU4s
 EU7AU/QhXL7HPHop+fyI+DapFmry0/h1tdKqoSGffONF/qMJmbkXFUvG0kAMqoWh
 nb/LTlGaVqk8Bz7hNbzP6zkPgEep3i6dBC9iRdfK3gclcEALsh2Z0mx6+ae3FEX3
 HHbfB9RTDYUT0Tk33cVFrjMsOpwdhcGsu1ncMJc/iuTOGI/AO6TVhgPEjEj/4xth
 FO7SzpIpRUe47yKqM0Ykvoz5EwE2IWUt0/eXj+1X2hPNGpcYEZ4bvHQGk8D2h7tA
 kwd0YqEXjiUw53cSnS4+A45ecOhErH/ZND/ULb65KMl/JDf5gOQ=
 =fs06
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-davem-2018-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
Two more fixes (in three patches):
 * ath9k_htc doesn't like QoS NDP frames, use regular ones
 * hwsim: set up wmediumd for radios created later
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22 13:19:10 -04:00
David Ahern
145307460b devlink: Remove top_hierarchy arg to devlink_resource_register
top_hierarchy arg can be determined by comparing parent_resource_id to
DEVLINK_RESOURCE_ID_PARENT_TOP so it does not need to be a separate
argument.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22 13:08:41 -04:00
David Ahern
68e2ffdeb5 net/ipv6: Handle onlink flag with multipath routes
For multipath routes the ONLINK flag can be specified per nexthop in
rtnh_flags or globally in rtm_flags. Update ip6_route_multipath_add
to consider the ONLINK setting coming from rtnh_flags. Each loop over
nexthops the config for the sibling route is initialized to the global
config and then per nexthop settings overlayed. The flag is 'or'ed into
fib6_config to handle the ONLINK flag coming from either rtm_flags or
rtnh_flags.

Fixes: fc1e64e109 ("net/ipv6: Add support for onlink flag")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22 12:40:04 -04:00
David Lebrun
8936ef7604 ipv6: sr: fix NULL pointer dereference when setting encap source address
When using seg6 in encap mode, we call ipv6_dev_get_saddr() to set the
source address of the outer IPv6 header, in case none was specified.
Using skb->dev can lead to BUG() when it is in an inconsistent state.
This patch uses the net_device attached to the skb's dst instead.

[940807.667429] BUG: unable to handle kernel NULL pointer dereference at 000000000000047c
[940807.762427] IP: ipv6_dev_get_saddr+0x8b/0x1d0
[940807.815725] PGD 0 P4D 0
[940807.847173] Oops: 0000 [#1] SMP PTI
[940807.890073] Modules linked in:
[940807.927765] CPU: 6 PID: 0 Comm: swapper/6 Tainted: G        W        4.16.0-rc1-seg6bpf+ #2
[940808.028988] Hardware name: HP ProLiant DL120 G6/ProLiant DL120 G6, BIOS O26    09/06/2010
[940808.128128] RIP: 0010:ipv6_dev_get_saddr+0x8b/0x1d0
[940808.187667] RSP: 0018:ffff88043fd836b0 EFLAGS: 00010206
[940808.251366] RAX: 0000000000000005 RBX: ffff88042cb1c860 RCX: 00000000000000fe
[940808.338025] RDX: 00000000000002c0 RSI: ffff88042cb1c860 RDI: 0000000000004500
[940808.424683] RBP: ffff88043fd83740 R08: 0000000000000000 R09: ffffffffffffffff
[940808.511342] R10: 0000000000000040 R11: 0000000000000000 R12: ffff88042cb1c850
[940808.598012] R13: ffffffff8208e380 R14: ffff88042ac8da00 R15: 0000000000000002
[940808.684675] FS:  0000000000000000(0000) GS:ffff88043fd80000(0000) knlGS:0000000000000000
[940808.783036] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[940808.852975] CR2: 000000000000047c CR3: 00000004255fe000 CR4: 00000000000006e0
[940808.939634] Call Trace:
[940808.970041]  <IRQ>
[940808.995250]  ? ip6t_do_table+0x265/0x640
[940809.043341]  seg6_do_srh_encap+0x28f/0x300
[940809.093516]  ? seg6_do_srh+0x1a0/0x210
[940809.139528]  seg6_do_srh+0x1a0/0x210
[940809.183462]  seg6_output+0x28/0x1e0
[940809.226358]  lwtunnel_output+0x3f/0x70
[940809.272370]  ip6_xmit+0x2b8/0x530
[940809.313185]  ? ac6_proc_exit+0x20/0x20
[940809.359197]  inet6_csk_xmit+0x7d/0xc0
[940809.404173]  tcp_transmit_skb+0x548/0x9a0
[940809.453304]  __tcp_retransmit_skb+0x1a8/0x7a0
[940809.506603]  ? ip6_default_advmss+0x40/0x40
[940809.557824]  ? tcp_current_mss+0x24/0x90
[940809.605925]  tcp_retransmit_skb+0xd/0x80
[940809.654016]  tcp_xmit_retransmit_queue.part.17+0xf9/0x210
[940809.719797]  tcp_ack+0xa47/0x1110
[940809.760612]  tcp_rcv_established+0x13c/0x570
[940809.812865]  tcp_v6_do_rcv+0x151/0x3d0
[940809.858879]  tcp_v6_rcv+0xa5c/0xb10
[940809.901770]  ? seg6_output+0xdd/0x1e0
[940809.946745]  ip6_input_finish+0xbb/0x460
[940809.994837]  ip6_input+0x74/0x80
[940810.034612]  ? ip6_rcv_finish+0xb0/0xb0
[940810.081663]  ipv6_rcv+0x31c/0x4c0
...

Fixes: 6c8702c60b ("ipv6: sr: add support for SRH encapsulation and injection with lwtunnels")
Reported-by: Tom Herbert <tom@quantonium.net>
Signed-off-by: David Lebrun <dlebrun@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22 12:22:45 -04:00
David Lebrun
191f86ca8e ipv6: sr: fix scheduling in RCU when creating seg6 lwtunnel state
The seg6_build_state() function is called with RCU read lock held,
so we cannot use GFP_KERNEL. This patch uses GFP_ATOMIC instead.

[   92.770271] =============================
[   92.770628] WARNING: suspicious RCU usage
[   92.770921] 4.16.0-rc4+ #12 Not tainted
[   92.771277] -----------------------------
[   92.771585] ./include/linux/rcupdate.h:302 Illegal context switch in RCU read-side critical section!
[   92.772279]
[   92.772279] other info that might help us debug this:
[   92.772279]
[   92.773067]
[   92.773067] rcu_scheduler_active = 2, debug_locks = 1
[   92.773514] 2 locks held by ip/2413:
[   92.773765]  #0:  (rtnl_mutex){+.+.}, at: [<00000000e5461720>] rtnetlink_rcv_msg+0x441/0x4d0
[   92.774377]  #1:  (rcu_read_lock){....}, at: [<00000000df4f161e>] lwtunnel_build_state+0x59/0x210
[   92.775065]
[   92.775065] stack backtrace:
[   92.775371] CPU: 0 PID: 2413 Comm: ip Not tainted 4.16.0-rc4+ #12
[   92.775791] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc27 04/01/2014
[   92.776608] Call Trace:
[   92.776852]  dump_stack+0x7d/0xbc
[   92.777130]  __schedule+0x133/0xf00
[   92.777393]  ? unwind_get_return_address_ptr+0x50/0x50
[   92.777783]  ? __sched_text_start+0x8/0x8
[   92.778073]  ? rcu_is_watching+0x19/0x30
[   92.778383]  ? kernel_text_address+0x49/0x60
[   92.778800]  ? __kernel_text_address+0x9/0x30
[   92.779241]  ? unwind_get_return_address+0x29/0x40
[   92.779727]  ? pcpu_alloc+0x102/0x8f0
[   92.780101]  _cond_resched+0x23/0x50
[   92.780459]  __mutex_lock+0xbd/0xad0
[   92.780818]  ? pcpu_alloc+0x102/0x8f0
[   92.781194]  ? seg6_build_state+0x11d/0x240
[   92.781611]  ? save_stack+0x9b/0xb0
[   92.781965]  ? __ww_mutex_wakeup_for_backoff+0xf0/0xf0
[   92.782480]  ? seg6_build_state+0x11d/0x240
[   92.782925]  ? lwtunnel_build_state+0x1bd/0x210
[   92.783393]  ? ip6_route_info_create+0x687/0x1640
[   92.783846]  ? ip6_route_add+0x74/0x110
[   92.784236]  ? inet6_rtm_newroute+0x8a/0xd0

Fixes: 6c8702c60b ("ipv6: sr: add support for SRH encapsulation and injection with lwtunnels")
Signed-off-by: David Lebrun <dlebrun@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22 12:21:11 -04:00
David S. Miller
755f6633d6 This feature/cleanup patchset includes the following patches:
- avoid redundant multicast TT entries, by Linus Luessing
 
  - add netlink support for distributed arp table cache and multicast flags,
    by Linus Luessing (2 patches)
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAlqv59kWHHN3QHNpbW9u
 d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoSBaD/9r/oJC+Q/3Eu6DTAAiS7Jx2IpQ
 kOwU7l4hkGK8mBZ098CkmHTBY+zurqYwcokCHhKJO5mqJEpvlM27PuQxzqWSBMMO
 FWFax2YlKPpJp+/f/rSD9HS73RTY7npv6l5/eFg6+0WSQET04PjLB1KxPrO5u1+Z
 JujrAxp0GEyMoVQgMy9uloedkpizhADyYSZzDDXnHhq1NiAPU87cjrTLv/xdtdp7
 TvbNfobhZUmKZ951yaRlDmE+mH8IoTQoY7HD/JANnduYeFJAlIPnHQEQa8+5tLwO
 qWUeLa4Acv5MhO2KjbKQpu5r2dFbs0x+jmsja8xBmgNWO5meKh/aE8TKGJeDVQEW
 TTynEivf82suiquCIZ573fBnliJkipffg32ZHgtNGrh54hh+YU7Sts0t9Lsou4ar
 aOU6lup3MHFysf3s9hK6y9TzSqwFJ8Mak0UFsa03r0Ub8am6bKHTqMFaCgN0aK9P
 vBL4atSvJVguwPlzxLMi44K4NxqEVfa41dZ7nQ99P3BFzWwSvph3i4lu+cxGxwI7
 4kgWU5Cz8T51dH7g8j3vUish36DzwQtUsKLWZVpV+DV4BaHJ/rLyqeug3ffUrWRk
 p0bFU7wBAv8rKeFPd30m2tdJ/nMo+rDbN6Tm9n43tK4NWKOGBndhCoNhjfrzhM8U
 un6Iy7taISgeElZ5fQ==
 =HVxO
 -----END PGP SIGNATURE-----

Merge tag 'batadv-next-for-davem-20180319' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
This feature/cleanup patchset includes the following patches:

 - avoid redundant multicast TT entries, by Linus Luessing

 - add netlink support for distributed arp table cache and multicast flags,
   by Linus Luessing (2 patches)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22 11:28:54 -04:00
David S. Miller
ee54a9f9ae Here are some batman-adv bugfixes:
- fix possible IPv6 packet loss when multicast extension is used, by Linus Luessing
 
  - fix SKB handling issues for TTVN and DAT, by Matthias Schiffer (two patches)
 
  - fix include for eventpoll, by Sven Eckelmann
 
  - fix skb checksum for ttvn reroutes, by Sven Eckelmann
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAlqv5tcWHHN3QHNpbW9u
 d3VuZGVybGljaC5kZQAKCRChK+OYQpKeob7PD/wPnjVmFl6uQlRHfOfUBzGvW9hU
 ASakGylzfZTXzP2VviMJ+7JehXQgediM4caTr/8jWhJFeVxu5jmeYzv8nzOlwZJ/
 PjNlX+ChuIAWv1wyxHI3YWwj7Y1Ox9Lp8Gs7m5UbmfyiZEtb+ybHjvPzZ/FMfhWo
 hOgbndSpFKfwFuXFkXyitektBcKTiFUyjk9U22v91XCQZZzMb8KqougVBE+bdg0w
 N8LfgKXVC448sUNKTEczhji7bFuJwR+Sogx1TwIIsyfT2Tp4JXY3A7M1LDwAL/Rc
 nTu9L58vk4qUouRjIQJ7rhhoxdPSrD5p3oinzVnLvJBxgOS3pxegY3asX8sRMXsj
 bgolIGCZ9vDfA21WXqa3RyjUiBEx9UId6W++h+22kVqVHt5tqIFK2nVQ6IInOXwB
 kzd3UDrRBLgRdBpwlu/ii9rn68MOEdLNpL3Seo9DkViJESfLn1ODnFA1Y2rFoK6S
 cVeYl1DVj4SQSXjNilqYhFZoTEo/kxN5KhgHRXzujTv6MCHVa2SnCZaXS5EXp8n6
 u4gpPc363Isj17A+UVlmHmmzA6d8Jqe8veVvET0xSvQPHmE7eepMkJU9hIUNXfcv
 fENmQnU10pS1keJqS6xuR6k2RxMoofYRUPpKEigCcH7z/0GytWQc6QVGmo19LD9b
 GNcWAhKI453pc1IuUA==
 =hsM8
 -----END PGP SIGNATURE-----

Merge tag 'batadv-net-for-davem-20180319' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
Here are some batman-adv bugfixes:

 - fix possible IPv6 packet loss when multicast extension is used, by Linus Luessing

 - fix SKB handling issues for TTVN and DAT, by Matthias Schiffer (two patches)

 - fix include for eventpoll, by Sven Eckelmann

 - fix skb checksum for ttvn reroutes, by Sven Eckelmann
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22 11:26:13 -04:00
Sowmini Varadhan
bdf5bd7f21 rds: tcp: remove register_netdevice_notifier infrastructure.
The netns deletion path does not need to wait for all net_devices
to be unregistered before dismantling rds_tcp state for the netns
(we are able to dismantle this state on module unload even when
all net_devices are active so there is no dependency here).

This patch removes code related to netdevice notifiers and
refactors all the code needed to dismantle rds_tcp state
into a ->exit callback for the pernet_operations used with
register_pernet_device().

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22 11:21:45 -04:00
Kirill Tkhai
aa65f63654 net: Convert nf_ct_net_ops
These pernet_operations register and unregister sysctl.
Also, there is inet_frags_exit_net() called in exit method,
which has to be safe after a560002437 "net: Fix hlist
corruptions in inet_evict_bucket()".

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22 11:11:30 -04:00
Kirill Tkhai
08012631d6 net: Convert lowpan_frags_ops
These pernet_operations register and unregister sysctl.
Also, there is inet_frags_exit_net() called in exit method,
which has to be safe after a560002437 "net: Fix hlist
corruptions in inet_evict_bucket()".

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22 11:11:29 -04:00
Kirill Tkhai
1ae7762760 net: Convert can_pernet_ops
These pernet_operations create and destroy /proc entries
and cancel per-net timer.

Also, there are unneed iterations over empty list of net
devices, since all net devices must be already moved
to init_net or unregistered by default_device_ops. This
already was mentioned here:

https://marc.info/?l=linux-can&m=150169589119335&w=2

So, it looks safe to make them async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22 11:11:29 -04:00
Pablo Neira Ayuso
90d2723c6d netfilter: nf_tables: do not hold reference on netdevice from preparation phase
The netfilter netdevice event handler hold the nfnl_lock mutex, this
avoids races with a device going away while such device is being
attached to hooks from the netlink control plane. Therefore, either
control plane bails out with ENOENT or netdevice event path waits until
the hook that is attached to net_device is registered.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-03-22 13:17:52 +01:00
Pablo Neira Ayuso
d92191aa84 netfilter: nf_tables: cache device name in flowtable object
Devices going away have to grab the nfnl_lock from the netdev event path
to avoid races with control plane updates.

However, netlink dumps in netfilter do not hold nfnl_lock mutex. Cache
the device name into the objects to avoid an use-after-free situation
for a device that is going away.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-03-22 12:57:07 +01:00
Paolo Abeni
aebfa52a92 netfilter: drop template ct when conntrack is skipped.
The ipv4 nf_ct code currently skips the nf_conntrak_in() call
for fragmented packets. As a results later matches/target can end
up manipulating template ct entry instead of 'real' ones.

Exploiting the above, syzbot found a way to trigger the following
splat:

WARNING: CPU: 1 PID: 4242 at net/netfilter/xt_cluster.c:55
xt_cluster_mt+0x6c1/0x840 net/netfilter/xt_cluster.c:127
Kernel panic - not syncing: panic_on_warn set ...

CPU: 1 PID: 4242 Comm: syzkaller027971 Not tainted 4.16.0-rc2+ #243
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:17 [inline]
  dump_stack+0x194/0x24d lib/dump_stack.c:53
  panic+0x1e4/0x41c kernel/panic.c:183
  __warn+0x1dc/0x200 kernel/panic.c:547
  report_bug+0x211/0x2d0 lib/bug.c:184
  fixup_bug.part.11+0x37/0x80 arch/x86/kernel/traps.c:178
  fixup_bug arch/x86/kernel/traps.c:247 [inline]
  do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296
  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
  invalid_op+0x58/0x80 arch/x86/entry/entry_64.S:957
RIP: 0010:xt_cluster_hash net/netfilter/xt_cluster.c:55 [inline]
RIP: 0010:xt_cluster_mt+0x6c1/0x840 net/netfilter/xt_cluster.c:127
RSP: 0018:ffff8801d2f6f2d0 EFLAGS: 00010293
RAX: ffff8801af700540 RBX: 0000000000000000 RCX: ffffffff84a2d1e1
RDX: 0000000000000000 RSI: ffff8801d2f6f478 RDI: ffff8801cafd336a
RBP: ffff8801d2f6f2e8 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801b03b3d18
R13: ffff8801cafd3300 R14: dffffc0000000000 R15: ffff8801d2f6f478
  ipt_do_table+0xa91/0x19b0 net/ipv4/netfilter/ip_tables.c:296
  iptable_filter_hook+0x65/0x80 net/ipv4/netfilter/iptable_filter.c:41
  nf_hook_entry_hookfn include/linux/netfilter.h:120 [inline]
  nf_hook_slow+0xba/0x1a0 net/netfilter/core.c:483
  nf_hook include/linux/netfilter.h:243 [inline]
  NF_HOOK include/linux/netfilter.h:286 [inline]
  raw_send_hdrinc.isra.17+0xf39/0x1880 net/ipv4/raw.c:432
  raw_sendmsg+0x14cd/0x26b0 net/ipv4/raw.c:669
  inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:763
  sock_sendmsg_nosec net/socket.c:629 [inline]
  sock_sendmsg+0xca/0x110 net/socket.c:639
  SYSC_sendto+0x361/0x5c0 net/socket.c:1748
  SyS_sendto+0x40/0x50 net/socket.c:1716
  do_syscall_64+0x280/0x940 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x441b49
RSP: 002b:00007ffff5ca8b18 EFLAGS: 00000216 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 0000000000441b49
RDX: 0000000000000030 RSI: 0000000020ff7000 RDI: 0000000000000003
RBP: 00000000006cc018 R08: 000000002066354c R09: 0000000000000010
R10: 0000000000000000 R11: 0000000000000216 R12: 0000000000403470
R13: 0000000000403500 R14: 0000000000000000 R15: 0000000000000000
Dumping ftrace buffer:
    (ftrace buffer empty)
Kernel Offset: disabled
Rebooting in 86400 seconds..

Instead of adding checks for template ct on every target/match
manipulating skb->_nfct, simply drop the template ct when skipping
nf_conntrack_in().

Fixes: 7b4fdf77a4 ("netfilter: don't track fragmented packets")
Reported-and-tested-by: syzbot+0346441ae0545cfcea3a@syzkaller.appspotmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-03-22 12:56:10 +01:00
Davide Caratti
f29cdfbe33 net/sched: fix idr leak in the error path of tcf_skbmod_init()
tcf_skbmod_init() can fail after the idr has been successfully reserved.
When this happens, every subsequent attempt to configure skbmod rules
using the same idr value will systematically fail with -ENOSPC, unless
the first attempt was done using the 'replace' keyword:

 # tc action add action skbmod swap mac index 100
 RTNETLINK answers: Cannot allocate memory
 We have an error talking to the kernel
 # tc action add action skbmod swap mac index 100
 RTNETLINK answers: No space left on device
 We have an error talking to the kernel
 # tc action add action skbmod swap mac index 100
 RTNETLINK answers: No space left on device
 We have an error talking to the kernel
 ...

Fix this in tcf_skbmod_init(), ensuring that tcf_idr_release() is called
on the error path when the idr has been reserved, but not yet inserted.
Also, don't test 'ovr' in the error path, to avoid a 'replace' failure
implicitly become a 'delete' that leaks refcount in act_skbmod module:

 # rmmod act_skbmod; modprobe act_skbmod
 # tc action add action skbmod swap mac index 100
 # tc action add action skbmod swap mac continue index 100
 RTNETLINK answers: File exists
 We have an error talking to the kernel
 # tc action replace action skbmod swap mac continue index 100
 RTNETLINK answers: Cannot allocate memory
 We have an error talking to the kernel
 # tc action list action skbmod
 #
 # rmmod  act_skbmod
 rmmod: ERROR: Module act_skbmod is in use

Fixes: 65a206c01e ("net/sched: Change act_api and act_xxx modules to use IDR")
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-21 18:12:37 -04:00
Davide Caratti
d7f2001573 net/sched: fix idr leak in the error path of tcf_vlan_init()
tcf_vlan_init() can fail after the idr has been successfully reserved.
When this happens, every subsequent attempt to configure vlan rules using
the same idr value will systematically fail with -ENOSPC, unless the first
attempt was done using the 'replace' keyword.

 # tc action add action vlan pop index 100
 RTNETLINK answers: Cannot allocate memory
 We have an error talking to the kernel
 # tc action add action vlan pop index 100
 RTNETLINK answers: No space left on device
 We have an error talking to the kernel
 # tc action add action vlan pop index 100
 RTNETLINK answers: No space left on device
 We have an error talking to the kernel
 ...

Fix this in tcf_vlan_init(), ensuring that tcf_idr_release() is called on
the error path when the idr has been reserved, but not yet inserted. Also,
don't test 'ovr' in the error path, to avoid a 'replace' failure implicitly
become a 'delete' that leaks refcount in act_vlan module:

 # rmmod act_vlan; modprobe act_vlan
 # tc action add action vlan push id 5 index 100
 # tc action replace action vlan push id 7 index 100
 RTNETLINK answers: Cannot allocate memory
 We have an error talking to the kernel
 # tc action list action vlan
 #
 # rmmod act_vlan
 rmmod: ERROR: Module act_vlan is in use

Fixes: 4c5b9d9642 ("act_vlan: VLAN action rewrite to use RCU lock/unlock and update")
Fixes: 65a206c01e ("net/sched: Change act_api and act_xxx modules to use IDR")
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-21 18:12:27 -04:00
Davide Caratti
1e46ef1762 net/sched: fix idr leak in the error path of __tcf_ipt_init()
__tcf_ipt_init() can fail after the idr has been successfully reserved.
When this happens, subsequent attempts to configure xt/ipt rules using
the same idr value systematically fail with -ENOSPC:

 # tc action add action xt -j LOG --log-prefix test1 index 100
 tablename: mangle hook: NF_IP_POST_ROUTING
         target:  LOG level warning prefix "test1" index 100
 RTNETLINK answers: Cannot allocate memory
 We have an error talking to the kernel
 Command "(null)" is unknown, try "tc actions help".
 # tc action add action xt -j LOG --log-prefix test1 index 100
 tablename: mangle hook: NF_IP_POST_ROUTING
         target:  LOG level warning prefix "test1" index 100
 RTNETLINK answers: No space left on device
 We have an error talking to the kernel
 Command "(null)" is unknown, try "tc actions help".
 # tc action add action xt -j LOG --log-prefix test1 index 100
 tablename: mangle hook: NF_IP_POST_ROUTING
         target:  LOG level warning prefix "test1" index 100
 RTNETLINK answers: No space left on device
 We have an error talking to the kernel
 ...

Fix this in the error path of __tcf_ipt_init(), calling tcf_idr_release()
in place of tcf_idr_cleanup(). Since tcf_ipt_release() can now be called
when tcfi_t is NULL, we also need to protect calls to ipt_destroy_target()
to avoid NULL pointer dereference.

Fixes: 65a206c01e ("net/sched: Change act_api and act_xxx modules to use IDR")
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-21 18:12:16 -04:00
Davide Caratti
94fa3f929e net/sched: fix idr leak in the error path of tcp_pedit_init()
tcf_pedit_init() can fail to allocate 'keys' after the idr has been
successfully reserved. When this happens, subsequent attempts to configure
a pedit rule using the same idr value systematically fail with -ENOSPC:

 # tc action add action pedit munge ip ttl set 63 index 100
 RTNETLINK answers: Cannot allocate memory
 We have an error talking to the kernel
 # tc action add action pedit munge ip ttl set 63 index 100
 RTNETLINK answers: No space left on device
 We have an error talking to the kernel
 # tc action add action pedit munge ip ttl set 63 index 100
 RTNETLINK answers: No space left on device
 We have an error talking to the kernel
 ...

Fix this in the error path of tcf_act_pedit_init(), calling
tcf_idr_release() in place of tcf_idr_cleanup().

Fixes: 65a206c01e ("net/sched: Change act_api and act_xxx modules to use IDR")
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-21 18:12:08 -04:00
Davide Caratti
5bf7f8185f net/sched: fix idr leak in the error path of tcf_act_police_init()
tcf_act_police_init() can fail after the idr has been successfully
reserved (e.g., qdisc_get_rtab() may return NULL). When this happens,
subsequent attempts to configure a police rule using the same idr value
systematiclly fail with -ENOSPC:

 # tc action add action police rate 1000 burst 1000 drop index 100
 RTNETLINK answers: Cannot allocate memory
 We have an error talking to the kernel
 # tc action add action police rate 1000 burst 1000 drop index 100
 RTNETLINK answers: No space left on device
 We have an error talking to the kernel
 # tc action add action police rate 1000 burst 1000 drop index 100
 RTNETLINK answers: No space left on device
 ...

Fix this in the error path of tcf_act_police_init(), calling
tcf_idr_release() in place of tcf_idr_cleanup().

Fixes: 65a206c01e ("net/sched: Change act_api and act_xxx modules to use IDR")
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-21 18:12:00 -04:00
Davide Caratti
60e10b3adc net/sched: fix idr leak in the error path of tcf_simp_init()
if the kernel fails to duplicate 'sdata', creation of a new action fails
with -ENOMEM. However, subsequent attempts to install the same action
using the same value of 'index' systematically fail with -ENOSPC, and
that value of 'index' will no more be usable by act_simple, until rmmod /
insmod of act_simple.ko is done:

 # tc actions add action simple sdata hello index 100
 # tc actions list action simple

        action order 0: Simple <hello>
         index 100 ref 1 bind 0
 # tc actions flush action simple
 # tc actions add action simple sdata hello index 100
 RTNETLINK answers: Cannot allocate memory
 We have an error talking to the kernel
 # tc actions flush action simple
 # tc actions add action simple sdata hello index 100
 RTNETLINK answers: No space left on device
 We have an error talking to the kernel
 # tc actions add action simple sdata hello index 100
 RTNETLINK answers: No space left on device
 We have an error talking to the kernel
 ...

Fix this in the error path of tcf_simp_init(), calling tcf_idr_release()
in place of tcf_idr_cleanup().

Fixes: 65a206c01e ("net/sched: Change act_api and act_xxx modules to use IDR")
Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-21 18:11:53 -04:00
Davide Caratti
bbc09e7842 net/sched: fix idr leak on the error path of tcf_bpf_init()
when the following command sequence is entered

 # tc action add action bpf bytecode '4,40 0 0 12,31 0 1 2048,6 0 0 262144,6 0 0 0' index 100
 RTNETLINK answers: Invalid argument
 We have an error talking to the kernel
 # tc action add action bpf bytecode '4,40 0 0 12,21 0 1 2048,6 0 0 262144,6 0 0 0' index 100
 RTNETLINK answers: No space left on device
 We have an error talking to the kernel

act_bpf correctly refuses to install the first TC rule, because 31 is not
a valid instruction. However, it refuses to install the second TC rule,
even if the BPF code is correct. Furthermore, it's no more possible to
install any other rule having the same value of 'index' until act_bpf
module is unloaded/inserted again. After the idr has been reserved, call
tcf_idr_release() instead of tcf_idr_cleanup(), to fix this issue.

Fixes: 65a206c01e ("net/sched: Change act_api and act_xxx modules to use IDR")
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-21 18:11:46 -04:00