Commit Graph

49907 Commits

Author SHA1 Message Date
Linus Torvalds
efd52b5d36 Merge tag 'nfs-for-4.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
 "Highlights include:

  Stable bugfixes:

   - Fix breakages in the nfsstat utility due to the inclusion of the
     NFSv4 LOOKUPP operation

   - Fix a NULL pointer dereference in nfs_idmap_prepare_pipe_upcall()
     due to nfs_idmap_legacy_upcall() being called without an 'aux'
     parameter

   - Fix a refcount leak in the standard O_DIRECT error path

   - Fix a refcount leak in the pNFS O_DIRECT fallback to MDS path

   - Fix CPU latency issues with nfs_commit_release_pages()

   - Fix the LAYOUTUNAVAILABLE error case in the file layout type

   - NFS: Fix a race between mmap() and O_DIRECT

  Features:

   - Support the statx() mask and query flags to enable optimisations
     when the user is requesting only attributes that are already up to
     date in the inode cache, or is specifying the AT_STATX_DONT_SYNC
     flag

   - Add a module alias for the SCSI pNFS layout type

  Bugfixes:

   - Automounting when resolving a NFSv4 referral should preserve the
     RDMA transport protocol settings

   - Various other RDMA bugfixes from Chuck

   - pNFS block layout fixes

   - Always set NFS_LOCK_LOST when a lock is lost"

* tag 'nfs-for-4.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (69 commits)
  NFS: Fix a race between mmap() and O_DIRECT
  NFS: Remove a redundant call to unmap_mapping_range()
  pnfs/blocklayout: Ensure disk address in block device map
  pnfs/blocklayout: pnfs_block_dev_map uses bytes, not sectors
  lockd: Fix server refcounting
  SUNRPC: Fix null rpc_clnt dereference in rpc_task_queued tracepoint
  SUNRPC: Micro-optimize __rpc_execute
  SUNRPC: task_run_action should display tk_callback
  sunrpc: Format RPC events consistently for display
  SUNRPC: Trace xprt_timer events
  xprtrdma: Correct some documenting comments
  xprtrdma: Fix "bytes registered" accounting
  xprtrdma: Instrument allocation/release of rpcrdma_req/rep objects
  xprtrdma: Add trace points to instrument QP and CQ access upcalls
  xprtrdma: Add trace points in the client-side backchannel code paths
  xprtrdma: Add trace points for connect events
  xprtrdma: Add trace points to instrument MR allocation and recovery
  xprtrdma: Add trace points to instrument memory invalidation
  xprtrdma: Add trace points in reply decoder path
  xprtrdma: Add trace points to instrument memory registration
  ..
2018-01-30 19:03:48 -08:00
Linus Torvalds
1ed2d76e02 Merge branch 'work.sock_recvmsg' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull kern_recvmsg reduction from Al Viro:
 "kernel_recvmsg() is a set_fs()-using wrapper for sock_recvmsg(). In
  all but one case that is not needed - use of ITER_KVEC for ->msg_iter
  takes care of the data and does not care about set_fs(). The only
  exception is svc_udp_recvfrom() where we want cmsg to be store into
  kernel object; everything else can just use sock_recvmsg() and be done
  with that.

  A followup converting svc_udp_recvfrom() away from set_fs() (and
  killing kernel_recvmsg() off) is *NOT* in here - I'd like to hear what
  netdev folks think of the approach proposed in that followup)"

* 'work.sock_recvmsg' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  tipc: switch to sock_recvmsg()
  smc: switch to sock_recvmsg()
  ipvs: switch to sock_recvmsg()
  mISDN: switch to sock_recvmsg()
  drbd: switch to sock_recvmsg()
  lustre lnet_sock_read(): switch to sock_recvmsg()
  cfs2: switch to sock_recvmsg()
  ncpfs: switch to sock_recvmsg()
  dlm: switch to sock_recvmsg()
  svc_recvfrom(): switch to sock_recvmsg()
2018-01-30 18:59:03 -08:00
Linus Torvalds
168fe32a07 Merge branch 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull poll annotations from Al Viro:
 "This introduces a __bitwise type for POLL### bitmap, and propagates
  the annotations through the tree. Most of that stuff is as simple as
  'make ->poll() instances return __poll_t and do the same to local
  variables used to hold the future return value'.

  Some of the obvious brainos found in process are fixed (e.g. POLLIN
  misspelled as POLL_IN). At that point the amount of sparse warnings is
  low and most of them are for genuine bugs - e.g. ->poll() instance
  deciding to return -EINVAL instead of a bitmap. I hadn't touched those
  in this series - it's large enough as it is.

  Another problem it has caught was eventpoll() ABI mess; select.c and
  eventpoll.c assumed that corresponding POLL### and EPOLL### were
  equal. That's true for some, but not all of them - EPOLL### are
  arch-independent, but POLL### are not.

  The last commit in this series separates userland POLL### values from
  the (now arch-independent) kernel-side ones, converting between them
  in the few places where they are copied to/from userland. AFAICS, this
  is the least disruptive fix preserving poll(2) ABI and making epoll()
  work on all architectures.

  As it is, it's simply broken on sparc - try to give it EPOLLWRNORM and
  it will trigger only on what would've triggered EPOLLWRBAND on other
  architectures. EPOLLWRBAND and EPOLLRDHUP, OTOH, are never triggered
  at all on sparc. With this patch they should work consistently on all
  architectures"

* 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
  make kernel-side POLL... arch-independent
  eventpoll: no need to mask the result of epi_item_poll() again
  eventpoll: constify struct epoll_event pointers
  debugging printk in sg_poll() uses %x to print POLL... bitmap
  annotate poll(2) guts
  9p: untangle ->poll() mess
  ->si_band gets POLL... bitmap stored into a user-visible long field
  ring_buffer_poll_wait() return value used as return value of ->poll()
  the rest of drivers/*: annotate ->poll() instances
  media: annotate ->poll() instances
  fs: annotate ->poll() instances
  ipc, kernel, mm: annotate ->poll() instances
  net: annotate ->poll() instances
  apparmor: annotate ->poll() instances
  tomoyo: annotate ->poll() instances
  sound: annotate ->poll() instances
  acpi: annotate ->poll() instances
  crypto: annotate ->poll() instances
  block: annotate ->poll() instances
  x86: annotate ->poll() instances
  ...
2018-01-30 17:58:07 -08:00
Dan Williams
259d8c1e98 nl80211: Sanitize array index in parse_txq_params
Wireless drivers rely on parse_txq_params to validate that txq_params->ac
is less than NL80211_NUM_ACS by the time the low-level driver's ->conf_tx()
handler is called. Use a new helper, array_index_nospec(), to sanitize
txq_params->ac with respect to speculation. I.e. ensure that any
speculation into ->conf_tx() handlers is done with a value of
txq_params->ac that is within the bounds of [0, NL80211_NUM_ACS).

Reported-by: Christian Lamparter <chunkeey@gmail.com>
Reported-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Cc: linux-arch@vger.kernel.org
Cc: kernel-hardening@lists.openwall.com
Cc: gregkh@linuxfoundation.org
Cc: linux-wireless@vger.kernel.org
Cc: torvalds@linux-foundation.org
Cc: "David S. Miller" <davem@davemloft.net>
Cc: alan@linux.intel.com
Link: https://lkml.kernel.org/r/151727419584.33451.7700736761686184303.stgit@dwillia2-desk3.amr.corp.intel.com
2018-01-30 21:54:32 +01:00
Linus Torvalds
d772794637 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The main RCU changes in this cycle were:

   - Updates to use cond_resched() instead of cond_resched_rcu_qs()
     where feasible (currently everywhere except in kernel/rcu and in
     kernel/torture.c). Also a couple of fixes to avoid sending IPIs to
     offline CPUs.

   - Updates to simplify RCU's dyntick-idle handling.

   - Updates to remove almost all uses of smp_read_barrier_depends() and
     read_barrier_depends().

   - Torture-test updates.

   - Miscellaneous fixes"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
  torture: Save a line in stutter_wait(): while -> for
  torture: Eliminate torture_runnable and perf_runnable
  torture: Make stutter less vulnerable to compilers and races
  locking/locktorture: Fix num reader/writer corner cases
  locking/locktorture: Fix rwsem reader_delay
  torture: Place all torture-test modules in one MAINTAINERS group
  rcutorture/kvm-build.sh: Skip build directory check
  rcutorture: Simplify functions.sh include path
  rcutorture: Simplify logging
  rcutorture/kvm-recheck-*: Improve result directory readability check
  rcutorture/kvm.sh: Support execution from any directory
  rcutorture/kvm.sh: Use consistent help text for --qemu-args
  rcutorture/kvm.sh: Remove unused variable, `alldone`
  rcutorture: Remove unused script, config2frag.sh
  rcutorture/configinit: Fix build directory error message
  rcutorture: Preempt RCU-preempt readers more vigorously
  torture: Reduce #ifdefs for preempt_schedule()
  rcu: Remove have_rcu_nocb_mask from tree_plugin.h
  rcu: Add comment giving debug strategy for double call_rcu()
  tracing, rcu: Hide trace event rcu_nocb_wake when not used
  ...
2018-01-30 10:15:30 -08:00
Jason Gunthorpe
e7996a9a77 Merge tag v4.15 of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
To resolve conflicts in:
 drivers/infiniband/hw/mlx5/main.c
 drivers/infiniband/hw/mlx5/qp.c

From patches merged into the -rc cycle. The conflict resolution matches
what linux-next has been carrying.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-30 09:30:00 -07:00
James Hogan
91e6dd8284 ipmr: Fix ptrdiff_t print formatting
ipmr_vif_seq_show() prints the difference between two pointers with the
format string %2zd (z for size_t), however the correct format string is
%2td instead (t for ptrdiff_t).

The same bug in ip6mr_vif_seq_show() was already fixed long ago by
commit d430a227d2 ("bogus format in ip6mr").

Signed-off-by: James Hogan <jhogan@kernel.org>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-30 09:20:25 -05:00
Li RongQing
9b42d55a66 tcp: release sk_frag.page in tcp_disconnect
socket can be disconnected and gets transformed back to a listening
socket, if sk_frag.page is not released, which will be cloned into
a new socket by sk_clone_lock, but the reference count of this page
is increased, lead to a use after free or double free issue

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29 17:56:23 -05:00
Tonghao Zhang
30e948a378 ipv4: Get the address of interface correctly.
When using ioctl to get address of interface, we can't
get it anymore. For example, the command is show as below.

	# ifconfig eth0

In the patch ("03aef17bb79b3"), the devinet_ioctl does not
return a suitable value, even though we can find it in
the kernel. Then fix it now.

Fixes: 03aef17bb7 ("devinet_ioctl(): take copyin/copyout to caller")
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29 14:32:31 -05:00
Eric Dumazet
40ca54e3a6 net_sched: gen_estimator: fix lockdep splat
syzbot reported a lockdep splat in gen_new_estimator() /
est_fetch_counters() when attempting to lock est->stats_lock.

Since est_fetch_counters() is called from BH context from timer
interrupt, we need to block BH as well when calling it from process
context.

Most qdiscs use per cpu counters and are immune to the problem,
but net/sched/act_api.c and net/netfilter/xt_RATEEST.c are using
a spinlock to protect their data. They both call gen_new_estimator()
while object is created and not yet alive, so this bug could
not trigger a deadlock, only a lockdep splat.

Fixes: 1c0d32fde5 ("net_sched: gen_estimator: complete rewrite of rate estimators")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29 14:29:10 -05:00
Eric Dumazet
e64e469b9a ipv6: addrconf: break critical section in addrconf_verify_rtnl()
Heiner reported a lockdep splat [1]

This is caused by attempting GFP_KERNEL allocation while RCU lock is
held and BH blocked.

We believe that addrconf_verify_rtnl() could run for a long period,
so instead of using GFP_ATOMIC here as Ido suggested, we should break
the critical section and restart it after the allocation.

[1]
[86220.125562] =============================
[86220.125586] WARNING: suspicious RCU usage
[86220.125612] 4.15.0-rc7-next-20180110+ #7 Not tainted
[86220.125641] -----------------------------
[86220.125666] kernel/sched/core.c:6026 Illegal context switch in RCU-bh read-side critical section!
[86220.125711]
               other info that might help us debug this:

[86220.125755]
               rcu_scheduler_active = 2, debug_locks = 1
[86220.125792] 4 locks held by kworker/0:2/1003:
[86220.125817]  #0:  ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at: [<00000000da8e9b73>] process_one_work+0x1de/0x680
[86220.125895]  #1:  ((addr_chk_work).work){+.+.}, at: [<00000000da8e9b73>] process_one_work+0x1de/0x680
[86220.125959]  #2:  (rtnl_mutex){+.+.}, at: [<00000000b06d9510>] rtnl_lock+0x12/0x20
[86220.126017]  #3:  (rcu_read_lock_bh){....}, at: [<00000000aef52299>] addrconf_verify_rtnl+0x1e/0x510 [ipv6]
[86220.126111]
               stack backtrace:
[86220.126142] CPU: 0 PID: 1003 Comm: kworker/0:2 Not tainted 4.15.0-rc7-next-20180110+ #7
[86220.126185] Hardware name: ZOTAC ZBOX-CI321NANO/ZBOX-CI321NANO, BIOS B246P105 06/01/2015
[86220.126250] Workqueue: ipv6_addrconf addrconf_verify_work [ipv6]
[86220.126288] Call Trace:
[86220.126312]  dump_stack+0x70/0x9e
[86220.126337]  lockdep_rcu_suspicious+0xce/0xf0
[86220.126365]  ___might_sleep+0x1d3/0x240
[86220.126390]  __might_sleep+0x45/0x80
[86220.126416]  kmem_cache_alloc_trace+0x53/0x250
[86220.126458]  ? ipv6_add_addr+0xfe/0x6e0 [ipv6]
[86220.126498]  ipv6_add_addr+0xfe/0x6e0 [ipv6]
[86220.126538]  ipv6_create_tempaddr+0x24d/0x430 [ipv6]
[86220.126580]  ? ipv6_create_tempaddr+0x24d/0x430 [ipv6]
[86220.126623]  addrconf_verify_rtnl+0x339/0x510 [ipv6]
[86220.126664]  ? addrconf_verify_rtnl+0x339/0x510 [ipv6]
[86220.126708]  addrconf_verify_work+0xe/0x20 [ipv6]
[86220.126738]  process_one_work+0x258/0x680
[86220.126765]  worker_thread+0x35/0x3f0
[86220.126790]  kthread+0x124/0x140
[86220.126813]  ? process_one_work+0x680/0x680
[86220.126839]  ? kthread_create_worker_on_cpu+0x40/0x40
[86220.126869]  ? umh_complete+0x40/0x40
[86220.126893]  ? call_usermodehelper_exec_async+0x12a/0x160
[86220.126926]  ret_from_fork+0x4b/0x60
[86220.126999] BUG: sleeping function called from invalid context at mm/slab.h:420
[86220.127041] in_atomic(): 1, irqs_disabled(): 0, pid: 1003, name: kworker/0:2
[86220.127082] 4 locks held by kworker/0:2/1003:
[86220.127107]  #0:  ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at: [<00000000da8e9b73>] process_one_work+0x1de/0x680
[86220.127179]  #1:  ((addr_chk_work).work){+.+.}, at: [<00000000da8e9b73>] process_one_work+0x1de/0x680
[86220.127242]  #2:  (rtnl_mutex){+.+.}, at: [<00000000b06d9510>] rtnl_lock+0x12/0x20
[86220.127300]  #3:  (rcu_read_lock_bh){....}, at: [<00000000aef52299>] addrconf_verify_rtnl+0x1e/0x510 [ipv6]
[86220.127414] CPU: 0 PID: 1003 Comm: kworker/0:2 Not tainted 4.15.0-rc7-next-20180110+ #7
[86220.127463] Hardware name: ZOTAC ZBOX-CI321NANO/ZBOX-CI321NANO, BIOS B246P105 06/01/2015
[86220.127528] Workqueue: ipv6_addrconf addrconf_verify_work [ipv6]
[86220.127568] Call Trace:
[86220.127591]  dump_stack+0x70/0x9e
[86220.127616]  ___might_sleep+0x14d/0x240
[86220.127644]  __might_sleep+0x45/0x80
[86220.127672]  kmem_cache_alloc_trace+0x53/0x250
[86220.127717]  ? ipv6_add_addr+0xfe/0x6e0 [ipv6]
[86220.127762]  ipv6_add_addr+0xfe/0x6e0 [ipv6]
[86220.127807]  ipv6_create_tempaddr+0x24d/0x430 [ipv6]
[86220.127854]  ? ipv6_create_tempaddr+0x24d/0x430 [ipv6]
[86220.127903]  addrconf_verify_rtnl+0x339/0x510 [ipv6]
[86220.127950]  ? addrconf_verify_rtnl+0x339/0x510 [ipv6]
[86220.127998]  addrconf_verify_work+0xe/0x20 [ipv6]
[86220.128032]  process_one_work+0x258/0x680
[86220.128063]  worker_thread+0x35/0x3f0
[86220.128091]  kthread+0x124/0x140
[86220.128117]  ? process_one_work+0x680/0x680
[86220.128146]  ? kthread_create_worker_on_cpu+0x40/0x40
[86220.128180]  ? umh_complete+0x40/0x40
[86220.128207]  ? call_usermodehelper_exec_async+0x12a/0x160
[86220.128243]  ret_from_fork+0x4b/0x60

Fixes: f3d9832e56 ("ipv6: addrconf: cleanup locking in ipv6_add_addr")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29 14:23:38 -05:00
Wei Wang
31afeb425f ipv6: change route cache aging logic
In current route cache aging logic, if a route has both RTF_EXPIRE and
RTF_GATEWAY set, the route will only be removed if the neighbor cache
has no NTF_ROUTER flag. Otherwise, even if the route has expired, it
won't get deleted.
Fix this logic to always check if the route has expired first and then
do the gateway neighbor cache check if previous check decide to not
remove the exception entry.

Fixes: 1859bac04f ("ipv6: remove from fib tree aged out RTF_CACHE dst")
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29 14:22:05 -05:00
David Ahern
c76fe2d98c net: ipv6: send unsolicited NA after DAD
Unsolicited IPv6 neighbor advertisements should be sent after DAD
completes. Update ndisc_send_unsol_na to skip tentative, non-optimistic
addresses and have those sent by addrconf_dad_completed after DAD.

Fixes: 4a6e3c5def ("net: ipv6: send unsolicited NA on admin up")
Reported-by: Vivek Venkatraman <vivek@cumulusnetworks.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29 14:18:38 -05:00
Cong Wang
7007ba630e net_sched: implement ->change_tx_queue_len() for pfifo_fast
pfifo_fast used to drop based on qdisc_dev(qdisc)->tx_queue_len,
so we have to resize skb array when we change tx_queue_len.

Other qdiscs which read tx_queue_len are fine because they
all save it to sch->limit or somewhere else in qdisc during init.
They don't have to implement this, it is nicer if they do so
that users don't have to re-configure qdisc after changing
tx_queue_len.

Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29 12:42:15 -05:00
Cong Wang
48bfd55e7e net_sched: plug in qdisc ops change_tx_queue_len
Introduce a new qdisc ops ->change_tx_queue_len() so that
each qdisc could decide how to implement this if it wants.
Previously we simply read dev->tx_queue_len, after pfifo_fast
switches to skb array, we need this API to resize the skb array
when we change dev->tx_queue_len.

To avoid handling race conditions with TX BH, we need to
deactivate all TX queues before change the value and bring them
back after we are done, this also makes implementation easier.

Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29 12:42:15 -05:00
Cong Wang
6a643ddb56 net: introduce helper dev_change_tx_queue_len()
This patch promotes the local change_tx_queue_len() to a core
helper function, dev_change_tx_queue_len(), so that rtnetlink
and net-sysfs could share the code. This also prepares for the
following patch.

Note, the -EFAULT in the original code doesn't make sense,
we should propagate the errno from notifiers.

Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29 12:42:15 -05:00
Chengguang Xu
affff07739 libceph: check kstrndup() return value
Should check result of kstrndup() in case of memory allocation failure.

Signed-off-by: Chengguang Xu <cgxu519@icloud.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-01-29 18:36:12 +01:00
Nicolas Dichtel
38e01b3056 dev: advertise the new ifindex when the netns iface changes
The goal is to let the user follow an interface that moves to another
netns.

CC: Jiri Benc <jbenc@redhat.com>
CC: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29 12:23:52 -05:00
Nicolas Dichtel
c36ac8e230 dev: always advertise the new nsid when the netns iface changes
The user should be able to follow any interface that moves to another
netns.  There is no reason to hide physical interfaces.

CC: Jiri Benc <jbenc@redhat.com>
CC: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29 12:23:51 -05:00
Martin KaFai Lau
7ece54a60e ipv6: Fix SO_REUSEPORT UDP socket with implicit sk_ipv6only
If a sk_v6_rcv_saddr is !IPV6_ADDR_ANY and !IPV6_ADDR_MAPPED, it
implicitly implies it is an ipv6only socket.  However, in inet6_bind(),
this addr_type checking and setting sk->sk_ipv6only to 1 are only done
after sk->sk_prot->get_port(sk, snum) has been completed successfully.

This inconsistency between sk_v6_rcv_saddr and sk_ipv6only confuses
the 'get_port()'.

In particular, when binding SO_REUSEPORT UDP sockets,
udp_reuseport_add_sock(sk,...) is called.  udp_reuseport_add_sock()
checks "ipv6_only_sock(sk2) == ipv6_only_sock(sk)" before adding sk to
sk2->sk_reuseport_cb.  In this case, ipv6_only_sock(sk2) could be
1 while ipv6_only_sock(sk) is still 0 here.  The end result is,
reuseport_alloc(sk) is called instead of adding sk to the existing
sk2->sk_reuseport_cb.

It can be reproduced by binding two SO_REUSEPORT UDP sockets on an
IPv6 address (!ANY and !MAPPED).  Only one of the socket will
receive packet.

The fix is to set the implicit sk_ipv6only before calling get_port().
The original sk_ipv6only has to be saved such that it can be restored
in case get_port() failed.  The situation is similar to the
inet_reset_saddr(sk) after get_port() has failed.

Thanks to Calvin Owens <calvinowens@fb.com> who created an easy
reproduction which leads to a fix.

Fixes: e32ea7e747 ("soreuseport: fast reuseport UDP socket selection")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29 11:37:40 -05:00
Christian Brauner
b61ad68a9f rtnetlink: enable IFLA_IF_NETNSID for RTM_DELLINK
- Backwards Compatibility:
  If userspace wants to determine whether RTM_DELLINK supports the
  IFLA_IF_NETNSID property they should first send an RTM_GETLINK request
  with IFLA_IF_NETNSID on lo. If either EACCESS is returned or the reply
  does not include IFLA_IF_NETNSID userspace should assume that
  IFLA_IF_NETNSID is not supported on this kernel.
  If the reply does contain an IFLA_IF_NETNSID property userspace
  can send an RTM_DELLINK with a IFLA_IF_NETNSID property. If they receive
  EOPNOTSUPP then the kernel does not support the IFLA_IF_NETNSID property
  with RTM_DELLINK. Userpace should then fallback to other means.

- Security:
  Callers must have CAP_NET_ADMIN in the owning user namespace of the
  target network namespace.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29 11:31:06 -05:00
Christian Brauner
c310bfcb6e rtnetlink: enable IFLA_IF_NETNSID for RTM_SETLINK
- Backwards Compatibility:
  If userspace wants to determine whether RTM_SETLINK supports the
  IFLA_IF_NETNSID property they should first send an RTM_GETLINK request
  with IFLA_IF_NETNSID on lo. If either EACCESS is returned or the reply
  does not include IFLA_IF_NETNSID userspace should assume that
  IFLA_IF_NETNSID is not supported on this kernel.
  If the reply does contain an IFLA_IF_NETNSID property userspace
  can send an RTM_SETLINK with a IFLA_IF_NETNSID property. If they receive
  EOPNOTSUPP then the kernel does not support the IFLA_IF_NETNSID property
  with RTM_SETLINK. Userpace should then fallback to other means.

  To retain backwards compatibility the kernel will first check whether a
  IFLA_NET_NS_PID or IFLA_NET_NS_FD property has been passed. If either
  one is found it will be used to identify the target network namespace.
  This implies that users who do not care whether their running kernel
  supports IFLA_IF_NETNSID with RTM_SETLINK can pass both
  IFLA_NET_NS_{FD,PID} and IFLA_IF_NETNSID referring to the same network
  namespace.

- Security:
  Callers must have CAP_NET_ADMIN in the owning user namespace of the
  target network namespace.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29 11:31:06 -05:00
Christian Brauner
7c4f63ba82 rtnetlink: enable IFLA_IF_NETNSID in do_setlink()
RTM_{NEW,SET}LINK already allow operations on other network namespaces
by identifying the target network namespace through IFLA_NET_NS_{FD,PID}
properties. This is done by looking for the corresponding properties in
do_setlink(). Extend do_setlink() to also look for the IFLA_IF_NETNSID
property. This introduces no functional changes since all callers of
do_setlink() currently block IFLA_IF_NETNSID by reporting an error before
they reach do_setlink().

This introduces the helpers:

static struct net *rtnl_link_get_net_by_nlattr(struct net *src_net, struct
                                               nlattr *tb[])

static struct net *rtnl_link_get_net_capable(const struct sk_buff *skb,
                                             struct net *src_net,
					     struct nlattr *tb[], int cap)

to simplify permission checks and target network namespace retrieval for
RTM_* requests that already support IFLA_NET_NS_{FD,PID} but get extended
to IFLA_IF_NETNSID. To perserve backwards compatibility the helpers look
for IFLA_NET_NS_{FD,PID} properties first before checking for
IFLA_IF_NETNSID.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29 11:31:06 -05:00
David S. Miller
3e3ab9ccca Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29 10:15:51 -05:00
David S. Miller
457740a903 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2018-01-26

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) A number of extensions to tcp-bpf, from Lawrence.
    - direct R or R/W access to many tcp_sock fields via bpf_sock_ops
    - passing up to 3 arguments to bpf_sock_ops functions
    - tcp_sock field bpf_sock_ops_cb_flags for controlling callbacks
    - optionally calling bpf_sock_ops program when RTO fires
    - optionally calling bpf_sock_ops program when packet is retransmitted
    - optionally calling bpf_sock_ops program when TCP state changes
    - access to tclass and sk_txhash
    - new selftest

2) div/mod exception handling, from Daniel.
    One of the ugly leftovers from the early eBPF days is that div/mod
    operations based on registers have a hard-coded src_reg == 0 test
    in the interpreter as well as in JIT code generators that would
    return from the BPF program with exit code 0. This was basically
    adopted from cBPF interpreter for historical reasons.
    There are multiple reasons why this is very suboptimal and prone
    to bugs. To name one: the return code mapping for such abnormal
    program exit of 0 does not always match with a suitable program
    type's exit code mapping. For example, '0' in tc means action 'ok'
    where the packet gets passed further up the stack, which is just
    undesirable for such cases (e.g. when implementing policy) and
    also does not match with other program types.
    After considering _four_ different ways to address the problem,
    we adapt the same behavior as on some major archs like ARMv8:
    X div 0 results in 0, and X mod 0 results in X. aarch64 and
    aarch32 ISA do not generate any traps or otherwise aborts
    of program execution for unsigned divides.
    Given the options, it seems the most suitable from
    all of them, also since major archs have similar schemes in
    place. Given this is all in the realm of undefined behavior,
    we still have the option to adapt if deemed necessary.

3) sockmap sample refactoring, from John.

4) lpm map get_next_key fixes, from Yonghong.

5) test cleanups, from Alexei and Prashant.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-28 21:22:46 -05:00
Daniel Borkmann
f6b1b3bf0d bpf: fix subprog verifier bypass by div/mod by 0 exception
One of the ugly leftovers from the early eBPF days is that div/mod
operations based on registers have a hard-coded src_reg == 0 test
in the interpreter as well as in JIT code generators that would
return from the BPF program with exit code 0. This was basically
adopted from cBPF interpreter for historical reasons.

There are multiple reasons why this is very suboptimal and prone
to bugs. To name one: the return code mapping for such abnormal
program exit of 0 does not always match with a suitable program
type's exit code mapping. For example, '0' in tc means action 'ok'
where the packet gets passed further up the stack, which is just
undesirable for such cases (e.g. when implementing policy) and
also does not match with other program types.

While trying to work out an exception handling scheme, I also
noticed that programs crafted like the following will currently
pass the verifier:

  0: (bf) r6 = r1
  1: (85) call pc+8
  caller:
   R6=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
  callee:
   frame1: R1=ctx(id=0,off=0,imm=0) R10=fp0,call_1
  10: (b4) (u32) r2 = (u32) 0
  11: (b4) (u32) r3 = (u32) 1
  12: (3c) (u32) r3 /= (u32) r2
  13: (61) r0 = *(u32 *)(r1 +76)
  14: (95) exit
  returning from callee:
   frame1: R0_w=pkt(id=0,off=0,r=0,imm=0)
           R1=ctx(id=0,off=0,imm=0) R2_w=inv0
           R3_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff))
           R10=fp0,call_1
  to caller at 2:
   R0_w=pkt(id=0,off=0,r=0,imm=0) R6=ctx(id=0,off=0,imm=0)
   R10=fp0,call_-1

  from 14 to 2: R0=pkt(id=0,off=0,r=0,imm=0)
                R6=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
  2: (bf) r1 = r6
  3: (61) r1 = *(u32 *)(r1 +80)
  4: (bf) r2 = r0
  5: (07) r2 += 8
  6: (2d) if r2 > r1 goto pc+1
   R0=pkt(id=0,off=0,r=8,imm=0) R1=pkt_end(id=0,off=0,imm=0)
   R2=pkt(id=0,off=8,r=8,imm=0) R6=ctx(id=0,off=0,imm=0)
   R10=fp0,call_-1
  7: (71) r0 = *(u8 *)(r0 +0)
  8: (b7) r0 = 1
  9: (95) exit

  from 6 to 8: safe
  processed 16 insns (limit 131072), stack depth 0+0

Basically what happens is that in the subprog we make use of a
div/mod by 0 exception and in the 'normal' subprog's exit path
we just return skb->data back to the main prog. This has the
implication that the verifier thinks we always get a pkt pointer
in R0 while we still have the implicit 'return 0' from the div
as an alternative unconditional return path earlier. Thus, R0
then contains 0, meaning back in the parent prog we get the
address range of [0x0, skb->data_end] as read and writeable.
Similar can be crafted with other pointer register types.

Since i) BPF_ABS/IND is not allowed in programs that contain
BPF to BPF calls (and generally it's also disadvised to use in
native eBPF context), ii) unknown opcodes don't return zero
anymore, iii) we don't return an exception code in dead branches,
the only last missing case affected and to fix is the div/mod
handling.

What we would really need is some infrastructure to propagate
exceptions all the way to the original prog unwinding the
current stack and returning that code to the caller of the
BPF program. In user space such exception handling for similar
runtimes is typically implemented with setjmp(3) and longjmp(3)
as one possibility which is not available in the kernel,
though (kgdb used to implement it in kernel long time ago). I
implemented a PoC exception handling mechanism into the BPF
interpreter with porting setjmp()/longjmp() into x86_64 and
adding a new internal BPF_ABRT opcode that can use a program
specific exception code for all exception cases we have (e.g.
div/mod by 0, unknown opcodes, etc). While this seems to work
in the constrained BPF environment (meaning, here, we don't
need to deal with state e.g. from memory allocations that we
would need to undo before going into exception state), it still
has various drawbacks: i) we would need to implement the
setjmp()/longjmp() for every arch supported in the kernel and
for x86_64, arm64, sparc64 JITs currently supporting calls,
ii) it has unconditional additional cost on main program
entry to store CPU register state in initial setjmp() call,
and we would need some way to pass the jmp_buf down into
___bpf_prog_run() for main prog and all subprogs, but also
storing on stack is not really nice (other option would be
per-cpu storage for this, but it also has the drawback that
we need to disable preemption for every BPF program types).
All in all this approach would add a lot of complexity.

Another poor-man's solution would be to have some sort of
additional shared register or scratch buffer to hold state
for exceptions, and test that after every call return to
chain returns and pass R0 all the way down to BPF prog caller.
This is also problematic in various ways: i) an additional
register doesn't map well into JITs, and some other scratch
space could only be on per-cpu storage, which, again has the
side-effect that this only works when we disable preemption,
or somewhere in the input context which is not available
everywhere either, and ii) this adds significant runtime
overhead by putting conditionals after each and every call,
as well as implementation complexity.

Yet another option is to teach verifier that div/mod can
return an integer, which however is also complex to implement
as verifier would need to walk such fake 'mov r0,<code>; exit;'
sequeuence and there would still be no guarantee for having
propagation of this further down to the BPF caller as proper
exception code. For parent prog, it is also is not distinguishable
from a normal return of a constant scalar value.

The approach taken here is a completely different one with
little complexity and no additional overhead involved in
that we make use of the fact that a div/mod by 0 is undefined
behavior. Instead of bailing out, we adapt the same behavior
as on some major archs like ARMv8 [0] into eBPF as well:
X div 0 results in 0, and X mod 0 results in X. aarch64 and
aarch32 ISA do not generate any traps or otherwise aborts
of program execution for unsigned divides. I verified this
also with a test program compiled by gcc and clang, and the
behavior matches with the spec. Going forward we adapt the
eBPF verifier to emit such rewrites once div/mod by register
was seen. cBPF is not touched and will keep existing 'return 0'
semantics. Given the options, it seems the most suitable from
all of them, also since major archs have similar schemes in
place. Given this is all in the realm of undefined behavior,
we still have the option to adapt if deemed necessary and
this way we would also have the option of more flexibility
from LLVM code generation side (which is then fully visible
to verifier). Thus, this patch i) fixes the panic seen in
above program and ii) doesn't bypass the verifier observations.

  [0] ARM Architecture Reference Manual, ARMv8 [ARM DDI 0487B.b]
      http://infocenter.arm.com/help/topic/com.arm.doc.ddi0487b.b/DDI0487B_b_armv8_arm.pdf
      1) aarch64 instruction set: section C3.4.7 and C6.2.279 (UDIV)
         "A division by zero results in a zero being written to
          the destination register, without any indication that
          the division by zero occurred."
      2) aarch32 instruction set: section F1.4.8 and F5.1.263 (UDIV)
         "For the SDIV and UDIV instructions, division by zero
          always returns a zero result."

Fixes: f4d7e40a5b ("bpf: introduce function calls (verification)")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-26 16:42:05 -08:00
Daniel Borkmann
1d621674d9 bpf: xor of a/x in cbpf can be done in 32 bit alu
Very minor optimization; saves 1 byte per program in x86_64
JIT in cBPF prologue.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-26 16:42:05 -08:00
Stefan Hajnoczi
ba3169fc75 VSOCK: set POLLOUT | POLLWRNORM for TCP_CLOSING
select(2) with wfds but no rfds must return when the socket is shut down
by the peer.  This way userspace notices socket activity and gets -EPIPE
from the next write(2).

Currently select(2) does not return for virtio-vsock when a SEND+RCV
shutdown packet is received.  This is because vsock_poll() only sets
POLLOUT | POLLWRNORM for TCP_CLOSE, not the TCP_CLOSING state that the
socket is in when the shutdown is received.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-26 11:16:27 -05:00
Alexey Kodanev
dd5684ecae dccp: don't restart ccid2_hc_tx_rto_expire() if sk in closed state
ccid2_hc_tx_rto_expire() timer callback always restarts the timer
again and can run indefinitely (unless it is stopped outside), and after
commit 120e9dabaf ("dccp: defer ccid_hc_tx_delete() at dismantle time"),
which moved ccid_hc_tx_delete() (also includes sk_stop_timer()) from
dccp_destroy_sock() to sk_destruct(), this started to happen quite often.
The timer prevents releasing the socket, as a result, sk_destruct() won't
be called.

Found with LTP/dccp_ipsec tests running on the bonding device,
which later couldn't be unloaded after the tests were completed:

  unregister_netdevice: waiting for bond0 to become free. Usage count = 148

Fixes: 2a91aa3967 ("[DCCP] CCID2: Initial CCID2 (TCP-Like) implementation")
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-26 11:15:00 -05:00
David S. Miller
e2d6e64bc3 Merge tag 'linux-can-next-for-4.16-20180126' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next
Marc Kleine-Budde says:

====================
pull-request: can-next 2018-01-26

this is a pull request for net-next/master consisting of 3 patches.

The first two patches target the CAN documentation. The first is by me
and fixes pointer to location of fsl,mpc5200-mscan node in the mpc5200
documentation. The second patch is by Robert Schwebel and it converts
the plain ASCII documentation to restructured text.

The third patch is by Fabrizio Castro add the r8a774[35] support to the
rcar_can dt-bindings documentation.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-26 10:49:12 -05:00
Gustavo A. R. Silva
a8fbf8e7ec net/smc: return booleans instead of integers
Return statements in functions returning bool should use
true/false instead of 1/0.

This issue was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-26 10:41:56 -05:00
Ursula Braun
127f497058 net/smc: release clcsock from tcp_listen_worker
Closing a listen socket may hit the warning
WARN_ON(sock_owned_by_user(sk)) of tcp_close(), if the wake up of
the smc_tcp_listen_worker has not yet finished.
This patch introduces smc_close_wait_listen_clcsock() making sure
the listening internal clcsock has been closed in smc_tcp_listen_work(),
before the listening external SMC socket finishes closing.

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-26 10:41:56 -05:00
Ursula Braun
51f1de79ad net/smc: replace sock_put worker by socket refcounting
Proper socket refcounting makes the sock_put worker obsolete.

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-26 10:41:56 -05:00
Ursula Braun
8dce2786a2 net/smc: smc_poll improvements
Increase the socket refcount during poll wait.
Take the socket lock before checking socket state.
For a listening socket return a mask independent of state SMC_ACTIVE and
cover errors or closed state as well.
Get rid of the accept_q loop in smc_accept_poll().

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-26 10:41:56 -05:00
Ursula Braun
da05bf2981 net/smc: handle device, port, and QP error events
RoCE device changes cause an IB event, processed in the global event
handler for the ROCE device. Problems for a certain Queue Pair cause a QP
event, processed in the QP event handler for this QP.
Among those events are port errors and other fatal device errors. All
link groups using such a port or device must be terminated in those cases.

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-26 10:41:56 -05:00
David S. Miller
a81e4affe1 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2018-01-26

One last patch for this development cycle:

1) Add ESN support for IPSec HW offload.
   From Yossef Efraim.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-26 10:22:53 -05:00
David Ahern
fc1e64e109 net/ipv6: Add support for onlink flag
Similar to IPv4 allow routes to be added with the RTNH_F_ONLINK flag.
The onlink option requires a gateway and a nexthop device. Any unicast
gateway is allowed (including IPv4 mapped addresses and unresolved
ones) as long as the gateway is not a local address and if it resolves
it must match the given device.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-26 10:16:43 -05:00
David Ahern
f4797b33db net/ipv6: Add flags and table id to ip6_nh_lookup_table
onlink verification needs to do a lookup in potentially different
table than the table in fib6_config and without the RT6_LOOKUP_F_IFACE
flag. Change ip6_nh_lookup_table to take table id and flags as input
arguments. Both verifications want to ignore link state, so add that
flag can stay in the lookup helper.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-26 10:16:42 -05:00
David Ahern
1edce99fa8 net/ipv6: Move gateway validation into helper
Move existing code to validate nexthop into a helper. Follow on patch
adds support for nexthops marked with onlink, and this helper keeps
the complexity of ip6_route_info_create in check.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-26 10:16:42 -05:00
Robert Schwebel
7d59773945 can: migrate documentation to restructured text
The kernel documentation is now restructured text. Convert the SocketCAN
documentation and include it in the toplevel kernel documentation.

This patch doesn't do any content change.

All references to can.txt in the code are converted to can.rst.

Signed-off-by: Robert Schwebel <r.schwebel@pengutronix.de>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2018-01-26 10:46:44 +01:00
David Ahern
9515a2e082 net/ipv4: Allow send to local broadcast from a socket bound to a VRF
Message sends to the local broadcast address (255.255.255.255) require
uc_index or sk_bound_dev_if to be set to an egress device. However,
responses or only received if the socket is bound to the device. This
is overly constraining for processes running in an L3 domain. This
patch allows a socket bound to the VRF device to send to the local
broadcast address by using IP_UNICAST_IF to set the egress interface
with packet receipt handled by the VRF binding.

Similar to IP_MULTICAST_IF, relax the constraint on setting
IP_UNICAST_IF if a socket is bound to an L3 master device. In this
case allow uc_index to be set to an enslaved if sk_bound_dev_if is
an L3 master device and is the master device for the ifindex.

In udp and raw sendmsg, allow uc_index to override the oif if
uc_index master device is oif (ie., the oif is an L3 master and the
index is an L3 slave).

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-25 21:51:31 -05:00
William Tu
fc1372f89f openvswitch: add erspan version I and II support
The patch adds support for openvswitch to configure erspan
v1 and v2.  The OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS attr is added
to uapi as a binary blob to support all ERSPAN v1 and v2's
fields.  Note that Previous commit "openvswitch: Add erspan tunnel
support." was reverted since it does not design properly.

Signed-off-by: William Tu <u9012063@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-25 21:39:43 -05:00
William Tu
c69de58ba8 net: erspan: use bitfield instead of mask and offset
Originally the erspan fields are defined as a group into a __be16 field,
and use mask and offset to access each field.  This is more costly due to
calling ntohs/htons.  The patch changes it to use bitfields.

Signed-off-by: William Tu <u9012063@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-25 21:39:43 -05:00
Lawrence Brakmo
d44874910a bpf: Add BPF_SOCK_OPS_STATE_CB
Adds support for calling sock_ops BPF program when there is a TCP state
change. Two arguments are used; one for the old state and another for
the new state.

There is a new enum in include/uapi/linux/bpf.h that exports the TCP
states that prepends BPF_ to the current TCP state names. If it is ever
necessary to change the internal TCP state values (other than adding
more to the end), then it will become necessary to convert from the
internal TCP state value to the BPF value before calling the BPF
sock_ops function. There are a set of compile checks added in tcp.c
to detect if the internal and BPF values differ so we can make the
necessary fixes.

New op: BPF_SOCK_OPS_STATE_CB.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-25 16:41:15 -08:00
Lawrence Brakmo
a31ad29e6a bpf: Add BPF_SOCK_OPS_RETRANS_CB
Adds support for calling sock_ops BPF program when there is a
retransmission. Three arguments are used; one for the sequence number,
another for the number of segments retransmitted, and the last one for
the return value of tcp_transmit_skb (0 => success).
Does not include syn-ack retransmissions.

New op: BPF_SOCK_OPS_RETRANS_CB.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-25 16:41:14 -08:00
Lawrence Brakmo
6f9bd3d731 bpf: Add sock_ops R/W access to tclass
Adds direct write access to sk_txhash and access to tclass for ipv6
flows through getsockopt and setsockopt. Sample usage for tclass:

  bpf_getsockopt(skops, SOL_IPV6, IPV6_TCLASS, &v, sizeof(v))

where skops is a pointer to the ctx (struct bpf_sock_ops).

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-25 16:41:14 -08:00
Lawrence Brakmo
44f0e43037 bpf: Add support for reading sk_state and more
Add support for reading many more tcp_sock fields

  state,	same as sk->sk_state
  rtt_min	same as sk->rtt_min.s[0].v (current rtt_min)
  snd_ssthresh
  rcv_nxt
  snd_nxt
  snd_una
  mss_cache
  ecn_flags
  rate_delivered
  rate_interval_us
  packets_out
  retrans_out
  total_retrans
  segs_in
  data_segs_in
  segs_out
  data_segs_out
  lost_out
  sacked_out
  sk_txhash
  bytes_received (__u64)
  bytes_acked    (__u64)

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-25 16:41:14 -08:00
Lawrence Brakmo
f89013f66d bpf: Add sock_ops RTO callback
Adds an optional call to sock_ops BPF program based on whether the
BPF_SOCK_OPS_RTO_CB_FLAG is set in bpf_sock_ops_flags.
The BPF program is passed 2 arguments: icsk_retransmits and whether the
RTO has expired.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-25 16:41:14 -08:00
Lawrence Brakmo
b13d880721 bpf: Adds field bpf_sock_ops_cb_flags to tcp_sock
Adds field bpf_sock_ops_cb_flags to tcp_sock and bpf_sock_ops. Its primary
use is to determine if there should be calls to sock_ops bpf program at
various points in the TCP code. The field is initialized to zero,
disabling the calls. A sock_ops BPF program can set it, per connection and
as necessary, when the connection is established.

It also adds support for reading and writting the field within a
sock_ops BPF program. Reading is done by accessing the field directly.
However, writing is done through the helper function
bpf_sock_ops_cb_flags_set, in order to return an error if a BPF program
is trying to set a callback that is not supported in the current kernel
(i.e. running an older kernel). The helper function returns 0 if it was
able to set all of the bits set in the argument, a positive number
containing the bits that could not be set, or -EINVAL if the socket is
not a full TCP socket.

Examples of where one could call the bpf program:

1) When RTO fires
2) When a packet is retransmitted
3) When the connection terminates
4) When a packet is sent
5) When a packet is received

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-25 16:41:14 -08:00
Lawrence Brakmo
de525be2ca bpf: Support passing args to sock_ops bpf function
Adds support for passing up to 4 arguments to sock_ops bpf functions. It
reusues the reply union, so the bpf_sock_ops structures are not
increased in size.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-25 16:41:14 -08:00