Commit Graph

57946 Commits

Author SHA1 Message Date
Linus Torvalds
4a2d057e4f Merge branch 'PAGE_CACHE_SIZE-removal'
Merge PAGE_CACHE_SIZE removal patches from Kirill Shutemov:
 "PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
  ago with promise that one day it will be possible to implement page
  cache with bigger chunks than PAGE_SIZE.

  This promise never materialized.  And unlikely will.

  Let's stop pretending that pages in page cache are special.  They are
  not.

  The first patch with most changes has been done with coccinelle.  The
  second is manual fixups on top.

  The third patch removes macros definition"

[ I was planning to apply this just before rc2, but then I spaced out,
  so here it is right _after_ rc2 instead.

  As Kirill suggested as a possibility, I could have decided to only
  merge the first two patches, and leave the old interfaces for
  compatibility, but I'd rather get it all done and any out-of-tree
  modules and patches can trivially do the converstion while still also
  working with older kernels, so there is little reason to try to
  maintain the redundant legacy model.    - Linus ]

* PAGE_CACHE_SIZE-removal:
  mm: drop PAGE_CACHE_* and page_cache_{get,release} definition
  mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage
  mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
2016-04-04 10:50:24 -07:00
Kirill A. Shutemov
ea1754a084 mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage
Mostly direct substitution with occasional adjustment or removing
outdated comments.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04 10:41:08 -07:00
Kirill A. Shutemov
09cbfeaf1a mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized.  And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE.  And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special.  They are
not.

The changes are pretty straight-forward:

 - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

 - page_cache_get() -> get_page();

 - page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below.  For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach.  I'll
fix them manually in a separate patch.  Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04 10:41:08 -07:00
Herbert Xu
ef609c238a sunrpc: Fix skcipher/shash conversion
The skcpiher/shash conversion introduced a number of bugs in the
sunrpc code:

1) Missing calls to skcipher_request_set_tfm lead to crashes.
2) The allocation size of shash_desc is too small which leads to
memory corruption.

Fixes: 3b5cf20cf4 ("sunrpc: Use skcipher and ahash/shash")
Reported-by: J. Bruce Fields <bfields@fieldses.org>
Tested-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2016-04-04 21:45:51 +08:00
Haishuang Yan
7822ce73e6 netlink: use nla_get_in_addr and nla_put_in_addr for ipv4 address
Since nla_get_in_addr and nla_put_in_addr were implemented,
so use them appropriately.

Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-02 20:15:58 -04:00
Yuchung Cheng
2349262397 tcp: remove cwnd moderation after recovery
For non-SACK connections, cwnd is lowered to inflight plus 3 packets
when the recovery ends. This is an optional feature in the NewReno
RFC 2582 to reduce the potential burst when cwnd is "re-opened"
after recovery and inflight is low.

This feature is questionably effective because of PRR: when
the recovery ends (i.e., snd_una == high_seq) NewReno holds the
CA_Recovery state for another round trip to prevent false fast
retransmits. But if the inflight is low, PRR will overwrite the
moderated cwnd in tcp_cwnd_reduction() later regardlessly. So if a
receiver responds bogus ACKs (i.e., acking future data) to speed up
transfer after recovery, it can only induce a burst up to a window
worth of data packets by acking up to SND.NXT. A restart from (short)
idle or receiving streched ACKs can both cause such bursts as well.

On the other hand, if the recovery ends because the sender
detects the losses were spurious (e.g., reordering). This feature
unconditionally lowers a reverted cwnd even though nothing
was lost.

By principle loss recovery module should not update cwnd. Further
pacing is much more effective to reduce burst. Hence this patch
removes the cwnd moderation feature.

v2 changes: revised commit message on bogus ACKs and burst, and
            missing signature

Signed-off-by: Matt Mathis <mattmathis@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-02 20:11:43 -04:00
Linus Torvalds
05cf8077e5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Missing device reference in IPSEC input path results in crashes
    during device unregistration.  From Subash Abhinov Kasiviswanathan.

 2) Per-queue ISR register writes not being done properly in macb
    driver, from Cyrille Pitchen.

 3) Stats accounting bugs in bcmgenet, from Patri Gynther.

 4) Lightweight tunnel's TTL and TOS were swapped in netlink dumps, from
    Quentin Armitage.

 5) SXGBE driver has off-by-one in probe error paths, from Rasmus
    Villemoes.

 6) Fix race in save/swap/delete options in netfilter ipset, from
    Vishwanath Pai.

 7) Ageing time of bridge not set properly when not operating over a
    switchdev device.  Fix from Haishuang Yan.

 8) Fix GRO regression wrt nested FOU/GUE based tunnels, from Alexander
    Duyck.

 9) IPV6 UDP code bumps wrong stats, from Eric Dumazet.

10) FEC driver should only access registers that actually exist on the
    given chipset, fix from Fabio Estevam.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (73 commits)
  net: mvneta: fix changing MTU when using per-cpu processing
  stmmac: fix MDIO settings
  Revert "stmmac: Fix 'eth0: No PHY found' regression"
  stmmac: fix TX normal DESC
  net: mvneta: use cache_line_size() to get cacheline size
  net: mvpp2: use cache_line_size() to get cacheline size
  net: mvpp2: fix maybe-uninitialized warning
  tun, bpf: fix suspicious RCU usage in tun_{attach, detach}_filter
  net: usb: cdc_ncm: adding Telit LE910 V2 mobile broadband card
  rtnl: fix msg size calculation in if_nlmsg_size()
  fec: Do not access unexisting register in Coldfire
  net: mvneta: replace MVNETA_CPU_D_CACHE_LINE_SIZE with L1_CACHE_BYTES
  net: mvpp2: replace MVPP2_CPU_D_CACHE_LINE_SIZE with L1_CACHE_BYTES
  net: dsa: mv88e6xxx: Clear the PDOWN bit on setup
  net: dsa: mv88e6xxx: Introduce _mv88e6xxx_phy_page_{read, write}
  bpf: make padding in bpf_tunnel_key explicit
  ipv6: udp: fix UDP_MIB_IGNOREDMULTI updates
  bnxt_en: Fix ethtool -a reporting.
  bnxt_en: Fix typo in bnxt_hwrm_set_pause_common().
  bnxt_en: Implement proper firmware message padding.
  ...
2016-04-01 20:03:33 -05:00
Daniel Borkmann
5a5abb1fa3 tun, bpf: fix suspicious RCU usage in tun_{attach, detach}_filter
Sasha Levin reported a suspicious rcu_dereference_protected() warning
found while fuzzing with trinity that is similar to this one:

  [   52.765684] net/core/filter.c:2262 suspicious rcu_dereference_protected() usage!
  [   52.765688] other info that might help us debug this:
  [   52.765695] rcu_scheduler_active = 1, debug_locks = 1
  [   52.765701] 1 lock held by a.out/1525:
  [   52.765704]  #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff816a64b7>] rtnl_lock+0x17/0x20
  [   52.765721] stack backtrace:
  [   52.765728] CPU: 1 PID: 1525 Comm: a.out Not tainted 4.5.0+ #264
  [...]
  [   52.765768] Call Trace:
  [   52.765775]  [<ffffffff813e488d>] dump_stack+0x85/0xc8
  [   52.765784]  [<ffffffff810f2fa5>] lockdep_rcu_suspicious+0xd5/0x110
  [   52.765792]  [<ffffffff816afdc2>] sk_detach_filter+0x82/0x90
  [   52.765801]  [<ffffffffa0883425>] tun_detach_filter+0x35/0x90 [tun]
  [   52.765810]  [<ffffffffa0884ed4>] __tun_chr_ioctl+0x354/0x1130 [tun]
  [   52.765818]  [<ffffffff8136fed0>] ? selinux_file_ioctl+0x130/0x210
  [   52.765827]  [<ffffffffa0885ce3>] tun_chr_ioctl+0x13/0x20 [tun]
  [   52.765834]  [<ffffffff81260ea6>] do_vfs_ioctl+0x96/0x690
  [   52.765843]  [<ffffffff81364af3>] ? security_file_ioctl+0x43/0x60
  [   52.765850]  [<ffffffff81261519>] SyS_ioctl+0x79/0x90
  [   52.765858]  [<ffffffff81003ba2>] do_syscall_64+0x62/0x140
  [   52.765866]  [<ffffffff817d563f>] entry_SYSCALL64_slow_path+0x25/0x25

Same can be triggered with PROVE_RCU (+ PROVE_RCU_REPEATEDLY) enabled
from tun_attach_filter() when user space calls ioctl(tun_fd, TUN{ATTACH,
DETACH}FILTER, ...) for adding/removing a BPF filter on tap devices.

Since the fix in f91ff5b9ff ("net: sk_{detach|attach}_filter() rcu
fixes") sk_attach_filter()/sk_detach_filter() now dereferences the
filter with rcu_dereference_protected(), checking whether socket lock
is held in control path.

Since its introduction in 9940516259 ("tun: socket filter support"),
tap filters are managed under RTNL lock from __tun_chr_ioctl(). Thus the
sock_owned_by_user(sk) doesn't apply in this specific case and therefore
triggers the false positive.

Extend the BPF API with __sk_attach_filter()/__sk_detach_filter() pair
that is used by tap filters and pass in lockdep_rtnl_is_held() for the
rcu_dereference_protected() checks instead.

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-01 14:33:46 -04:00
Nicolas Dichtel
c57c7a95da rtnl: fix msg size calculation in if_nlmsg_size()
Size of the attribute IFLA_PHYS_PORT_NAME was missing.

Fixes: db24a9044e ("net: add support for phys_port_name")
CC: David Ahern <dsahern@gmail.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-31 16:49:54 -04:00
Steffen Klassert
d6af1a31cc vti: Add pmtu handling to vti_xmit.
We currently rely on the PMTU discovery of xfrm.
However if a packet is locally sent, the PMTU mechanism
of xfrm tries to do local socket notification what
might not work for applications like ping that don't
check for this. So add pmtu handling to vti_xmit to
report MTU changes immediately.

Reported-by: Mark McKinstry <Mark.McKinstry@alliedtelesis.co.nz>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2016-03-31 08:59:56 +02:00
Daniel Borkmann
c0e760c9c6 bpf: make padding in bpf_tunnel_key explicit
Make the 2 byte padding in struct bpf_tunnel_key between tunnel_ttl
and tunnel_label members explicit. No issue has been observed, and
gcc/llvm does padding for the old struct already, where tunnel_label
was not yet present, so the current code works, but since it's part
of uapi, make sure we don't introduce holes in structs.

Therefore, add tunnel_ext that we can use generically in future
(f.e. to flag OAM messages for backends, etc). Also add the offset
to the compat tests to be sure should some compilers not padd the
tail of the old version of bpf_tunnel_key.

Fixes: 4018ab1875 ("bpf: support flow label for bpf_skb_{set, get}_tunnel_key")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-30 19:01:33 -04:00
Eric Dumazet
2d4212261f ipv6: udp: fix UDP_MIB_IGNOREDMULTI updates
IPv6 counters updates use a different macro than IPv4.

Fixes: 36cbb2452c ("udp: Increment UDP_MIB_IGNOREDMULTI for arriving unmatched multicasts")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Rick Jones <rick.jones2@hp.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-30 19:01:33 -04:00
Alexander Duyck
c3483384ee gro: Allow tunnel stacking in the case of FOU/GUE
This patch should fix the issues seen with a recent fix to prevent
tunnel-in-tunnel frames from being generated with GRO.  The fix itself is
correct for now as long as we do not add any devices that support
NETIF_F_GSO_GRE_CSUM.  When such a device is added it could have the
potential to mess things up due to the fact that the outer transport header
points to the outer UDP header and not the GRE header as would be expected.

Fixes: fac8e0f579 ("tunnels: Don't apply GRO to multiple layers of encapsulation.")
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-30 16:02:33 -04:00
Marcelo Ricardo Leitner
28fd34985b sctp: really allow using GFP_KERNEL on sctp_packet_transmit
Somehow my patch for commit cea8768f33 ("sctp: allow
sctp_transmit_packet and others to use gfp") missed two important
chunks, which are now added.

Fixes: cea8768f33 ("sctp: allow sctp_transmit_packet and others to use gfp")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-By: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-30 15:41:22 -04:00
Haishuang Yan
5e263f7126 bridge: Allow set bridge ageing time when switchdev disabled
When NET_SWITCHDEV=n, switchdev_port_attr_set will return -EOPNOTSUPP,
we should ignore this error code and continue to set the ageing time.

Fixes: c62987bbd8 ("bridge: push bridge setting ageing_time down to switchdev")
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Acked-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-30 15:38:13 -04:00
Liping Zhang
8fef24ca90 netfilter: ip6t_SYNPROXY: remove magic number for hop_limit
Replace '64' with the per-net ipv6_devconf_all's hop_limit when
building the ipv6 header.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-29 13:31:17 +02:00
Stephane Bryant
8d45ff22f1 netfilter: bridge: nf queue verdict to use NFQA_VLAN and NFQA_L2HDR
This makes nf queues use NFQA_VLAN and NFQA_L2HDR in verdict to modify the
original skb

Signed-off-by: Stephane Bryant <stephane.ml.bryant@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-29 13:29:30 +02:00
Stephane Bryant
15824ab29f netfilter: bridge: pass L2 header and VLAN as netlink attributes in queues to userspace
- This creates 2 netlink attribute NFQA_VLAN and NFQA_L2HDR.
- These are filled up for the PF_BRIDGE family on the way to userspace.
- NFQA_VLAN is a nested attribute, with the NFQA_VLAN_PROTO and the
  NFQA_VLAN_TCI carrying the corresponding vlan_proto and vlan_tci
  fields from the skb using big endian ordering (and using the CFI
  bit as the VLAN_TAG_PRESENT flag in vlan_tci as in the skb)

Signed-off-by: Stephane Bryant <stephane.ml.bryant@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-29 13:26:38 +02:00
Stephane Bryant
ac28634456 netfilter: bridge: add nf_afinfo to enable queuing to userspace
This just adds and registers a nf_afinfo for the ethernet
bridge, which enables queuing to userspace for the AF_BRIDGE
family. No checksum computation is done.

Signed-off-by: Stephane Bryant <stephane.ml.bryant@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-29 13:24:37 +02:00
David S. Miller
0c84ea17ff Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for you net tree,
they are:

1) There was a race condition between parallel save/swap and delete,
   which resulted a kernel crash due to the increase ref for save, swap,
   wrong ref decrease operations. Reported and fixed by Vishwanath Pai.

2) OVS should call into CT NAT for packets of new expected connections only
   when the conntrack state is persisted with the 'commit' option to the
   OVS CT action. From Jarno Rajahalme.

3) Resolve kconfig dependencies with new OVS NAT support. From Arnd Bergmann.

4) Early validation of entry->target_offset to make sure it doesn't take us
   out from the blob, from Florian Westphal.

5) Again early validation of entry->next_offset to make sure it doesn't take
   out from the blob, also from Florian.

6) Check that entry->target_offset is always of of sizeof(struct xt_entry)
   for unconditional entries, when checking both from check_underflow()
   and when checking for loops in mark_source_chains(), again from
   Florian.

7) Fix inconsistent behaviour in nfnetlink_queue when
   NFQA_CFG_F_FAIL_OPEN is set and netlink_unicast() fails due to buffer
   overrun, we have to reinject the packet as the user expects.

8) Enforce nul-terminated table names from getsockopt GET_ENTRIES
   requests.

9) Don't assume skb->sk is set from nft_bridge_reject and synproxy,
   this fixes a recent update of the code to namespaceify
   ip_default_ttl, patch from Liping Zhang.

This batch comes with four patches to validate x_tables blobs coming
from userspace. CONFIG_USERNS exposes the x_tables interface to
unpriviledged users and to be honest this interface never received the
attention for this move away from the CAP_NET_ADMIN domain. Florian is
working on another round with more patches with more sanity checks, so
expect a bit more Netfilter fixes in this development cycle than usual.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-28 15:38:59 -04:00
Al Viro
2da62906b1 [net] drop 'size' argument of sock_recvmsg()
all callers have it equal to msg_data_left(msg).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-03-28 13:57:51 -04:00
Liping Zhang
29421198c3 netfilter: ipv4: fix NULL dereference
Commit fa50d974d1 ("ipv4: Namespaceify ip_default_ttl sysctl knob")
use sock_net(skb->sk) to get the net namespace, but we can't assume
that sk_buff->sk is always exist, so when it is NULL, oops will happen.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Reviewed-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28 17:59:29 +02:00
Pablo Neira Ayuso
b301f25387 netfilter: x_tables: enforce nul-terminated table name from getsockopt GET_ENTRIES
Make sure the table names via getsockopt GET_ENTRIES is nul-terminated
in ebtables and all the x_tables variants and their respective compat
code. Uncovered by KASAN.

Reported-by: Baozeng Ding <sploving1@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28 17:59:24 +02:00
Pablo Neira Ayuso
931401137f netfilter: nfnetlink_queue: honor NFQA_CFG_F_FAIL_OPEN when netlink unicast fails
When netlink unicast fails to deliver the message to userspace, we
should also check if the NFQA_CFG_F_FAIL_OPEN flag is set so we reinject
the packet back to the stack.

I think the user expects no packet drops when this flag is set due to
queueing to userspace errors, no matter if related to the internal queue
or when sending the netlink message to userspace.

The userspace application will still get the ENOBUFS error via recvmsg()
so the user still knows that, with the current configuration that is in
place, the userspace application is not consuming the messages at the
pace that the kernel needs.

Reported-by: "Yigal Reiss (yreiss)" <yreiss@cisco.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Tested-by: "Yigal Reiss (yreiss)" <yreiss@cisco.com>
2016-03-28 17:59:20 +02:00
Florian Westphal
54d83fc74a netfilter: x_tables: fix unconditional helper
Ben Hawkes says:

 In the mark_source_chains function (net/ipv4/netfilter/ip_tables.c) it
 is possible for a user-supplied ipt_entry structure to have a large
 next_offset field. This field is not bounds checked prior to writing a
 counter value at the supplied offset.

Problem is that mark_source_chains should not have been called --
the rule doesn't have a next entry, so its supposed to return
an absolute verdict of either ACCEPT or DROP.

However, the function conditional() doesn't work as the name implies.
It only checks that the rule is using wildcard address matching.

However, an unconditional rule must also not be using any matches
(no -m args).

The underflow validator only checked the addresses, therefore
passing the 'unconditional absolute verdict' test, while
mark_source_chains also tested for presence of matches, and thus
proceeeded to the next (not-existent) rule.

Unify this so that all the callers have same idea of 'unconditional rule'.

Reported-by: Ben Hawkes <hawkes@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28 17:59:15 +02:00
Florian Westphal
6e94e0cfb0 netfilter: x_tables: make sure e->next_offset covers remaining blob size
Otherwise this function may read data beyond the ruleset blob.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28 17:59:08 +02:00
Florian Westphal
bdf533de69 netfilter: x_tables: validate e->target_offset early
We should check that e->target_offset is sane before
mark_source_chains gets called since it will fetch the target entry
for loop detection.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28 17:59:04 +02:00
Arnd Bergmann
99b7248e2a openvswitch: call only into reachable nf-nat code
The openvswitch code has gained support for calling into the
nf-nat-ipv4/ipv6 modules, however those can be loadable modules
in a configuration in which openvswitch is built-in, leading
to link errors:

net/built-in.o: In function `__ovs_ct_lookup':
:(.text+0x2cc2c8): undefined reference to `nf_nat_icmp_reply_translation'
:(.text+0x2cc66c): undefined reference to `nf_nat_icmpv6_reply_translation'

The dependency on (!NF_NAT || NF_NAT) prevents similar issues,
but NF_NAT is set to 'y' if any of the symbols selecting
it are built-in, but the link error happens when any of them
are modular.

A second issue is that even if CONFIG_NF_NAT_IPV6 is built-in,
CONFIG_NF_NAT_IPV4 might be completely disabled. This is unlikely
to be useful in practice, but the driver currently only handles
IPv6 being optional.

This patch improves the Kconfig dependency so that openvswitch
cannot be built-in if either of the two other symbols are set
to 'm', and it replaces the incorrect #ifdef in ovs_ct_nat_execute()
with two "if (IS_ENABLED())" checks that should catch all corner
cases also make the code more readable.

The same #ifdef exists ovs_ct_nat_to_attr(), where it does not
cause a link error, but for consistency I'm changing it the same
way.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 05752523e5 ("openvswitch: Interface with NAT.")
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28 17:58:59 +02:00
Jarno Rajahalme
5745b0be05 openvswitch: Fix checking for new expected connections.
OVS should call into CT NAT for packets of new expected connections only
when the conntrack state is persisted with the 'commit' option to the
OVS CT action.  The test for this condition is doubly wrong, as the CT
status field is ANDed with the bit number (IPS_EXPECTED_BIT) rather
than the mask (IPS_EXPECTED), and due to the wrong assumption that the
expected bit would apply only for the first (i.e., 'new') packet of a
connection, while in fact the expected bit remains on for the lifetime of
an expected connection.  The 'ctinfo' value IP_CT_RELATED derived from
the ct status can be used instead, as it is only ever applicable to
the 'new' packets of the expected connection.

Fixes: 05752523e5 ('openvswitch: Interface with NAT.')
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28 17:58:51 +02:00
Vishwanath Pai
596cf3fe58 netfilter: ipset: fix race condition in ipset save, swap and delete
This fix adds a new reference counter (ref_netlink) for the struct ip_set.
The other reference counter (ref) can be swapped out by ip_set_swap and we
need a separate counter to keep track of references for netlink events
like dump. Using the same ref counter for dump causes a race condition
which can be demonstrated by the following script:

ipset create hash_ip1 hash:ip family inet hashsize 1024 maxelem 500000 \
counters
ipset create hash_ip2 hash:ip family inet hashsize 300000 maxelem 500000 \
counters
ipset create hash_ip3 hash:ip family inet hashsize 1024 maxelem 500000 \
counters

ipset save &

ipset swap hash_ip3 hash_ip2
ipset destroy hash_ip3 /* will crash the machine */

Swap will exchange the values of ref so destroy will see ref = 0 instead of
ref = 1. With this fix in place swap will not succeed because ipset save
still has ref_netlink on the set (ip_set_swap doesn't swap ref_netlink).

Both delete and swap will error out if ref_netlink != 0 on the set.

Note: The changes to *_head functions is because previously we would
increment ref whenever we called these functions, we don't do that
anymore.

Reviewed-by: Joshua Hunt <johunt@akamai.com>
Signed-off-by: Vishwanath Pai <vpai@akamai.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28 17:57:45 +02:00
Haishuang Yan
ac71b46efd openvswitch: Use proper buffer size in nla_memcpy
For the input parameter count, it's better to use the size
of destination buffer size, as nla_memcpy would take into
account the length of the source netlink attribute when
a data is copied from an attribute.

Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-28 11:37:14 -04:00
Weongyo Jeong
ccd63c20fe netfilter: nf_conntrack: Uses pr_fmt() for logging.
Uses pr_fmt() macro for debugging messages of nf_conntrack module.

Signed-off-by: Weongyo Jeong <weongyo.linux@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28 12:56:07 +02:00
Al Viro
d360775217 constify security_path_{mkdir,mknod,symlink}
... as well as unix_mknod() and may_o_create()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-03-28 00:47:27 -04:00
Quentin Armitage
995096a0a4 Fix returned tc and hoplimit values for route with IPv6 encapsulation
For a route with IPv6 encapsulation, the traffic class and hop limit
values are interchanged when returned to userspace by the kernel.
For example, see below.

># ip route add 192.168.0.1 dev eth0.2 encap ip6 dst 0x50 tc 0x50 hoplimit 100 table 1000
># ip route show table 1000
192.168.0.1  encap ip6 id 0 src :: dst fe83::1 hoplimit 80 tc 100 dev eth0.2  scope link

Signed-off-by: Quentin Armitage <quentin@armitage.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-27 22:35:02 -04:00
Linus Torvalds
d5a38f6e46 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph updates from Sage Weil:
 "There is quite a bit here, including some overdue refactoring and
  cleanup on the mon_client and osd_client code from Ilya, scattered
  writeback support for CephFS and a pile of bug fixes from Zheng, and a
  few random cleanups and fixes from others"

[ I already decided not to pull this because of it having been rebased
  recently, but ended up changing my mind after all.  Next time I'll
  really hold people to it.  Oh well.   - Linus ]

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (34 commits)
  libceph: use KMEM_CACHE macro
  ceph: use kmem_cache_zalloc
  rbd: use KMEM_CACHE macro
  ceph: use lookup request to revalidate dentry
  ceph: kill ceph_get_dentry_parent_inode()
  ceph: fix security xattr deadlock
  ceph: don't request vxattrs from MDS
  ceph: fix mounting same fs multiple times
  ceph: remove unnecessary NULL check
  ceph: avoid updating directory inode's i_size accidentally
  ceph: fix race during filling readdir cache
  libceph: use sizeof_footer() more
  ceph: kill ceph_empty_snapc
  ceph: fix a wrong comparison
  ceph: replace CURRENT_TIME by current_fs_time()
  ceph: scattered page writeback
  libceph: add helper that duplicates last extent operation
  libceph: enable large, variable-sized OSD requests
  libceph: osdc->req_mempool should be backed by a slab pool
  libceph: make r_request msg_size calculation clearer
  ...
2016-03-26 15:53:16 -07:00
Geliang Tang
5ee61e95b6 libceph: use KMEM_CACHE macro
Use KMEM_CACHE() instead of kmem_cache_create() to simplify the code.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:57 +01:00
Ilya Dryomov
89f081730c libceph: use sizeof_footer() more
Don't open-code sizeof_footer() in read_partial_message() and
ceph_msg_revoke().  Also, after switching to sizeof_footer(), it's now
possible to use con_out_kvec_add() in prepare_write_message_footer().

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2016-03-25 18:51:53 +01:00
Yan, Zheng
2c63f49a72 libceph: add helper that duplicates last extent operation
This helper duplicates last extent operation in OSD request, then
adjusts the new extent operation's offset and length. The helper
is for scatterd page writeback, which adds nonconsecutive dirty
pages to single OSD request.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:43 +01:00
Ilya Dryomov
3f1af42ad0 libceph: enable large, variable-sized OSD requests
Turn r_ops into a flexible array member to enable large, consisting of
up to 16 ops, OSD requests.  The use case is scattered writeback in
cephfs and, as far as the kernel client is concerned, 16 is just a made
up number.

r_ops had size 3 for copyup+hint+write, but copyup is really a special
case - it can only happen once.  ceph_osd_request_cache is therefore
stuffed with num_ops=2 requests, anything bigger than that is allocated
with kmalloc().  req_mempool is backed by ceph_osd_request_cache, which
means either num_ops=1 or num_ops=2 for use_mempool=true - all existing
users (ceph_writepages_start(), ceph_osdc_writepages()) are fine with
that.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:43 +01:00
Ilya Dryomov
9e767adbd3 libceph: osdc->req_mempool should be backed by a slab pool
ceph_osd_request_cache was introduced a long time ago.  Also, osd_req
is about to get a flexible array member, which ceph_osd_request_cache
is going to be aware of.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:43 +01:00
Ilya Dryomov
ae458f5a17 libceph: make r_request msg_size calculation clearer
Although msg_size is calculated correctly, the terms are grouped in
a misleading way - snaps appears to not have room for a u32 length.
Move calculation closer to its use and regroup terms.

No functional change.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:42 +01:00
Yan, Zheng
7665d85b73 libceph: move r_reply_op_{len,result} into struct ceph_osd_req_op
This avoids defining large array of r_reply_op_{len,result} in
in struct ceph_osd_request.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:42 +01:00
Ilya Dryomov
de2aa102ea libceph: rename ceph_osd_req_op::payload_len to indata_len
Follow userspace nomenclature on this - the next commit adds
outdata_len.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:41 +01:00
Ilya Dryomov
b5d91704f5 libceph: behave in mon_fault() if cur_mon < 0
This can happen if __close_session() in ceph_monc_stop() races with
a connection reset.  We need to ignore such faults, otherwise it's
likely we would take !hunting, call __schedule_delayed() and end up
with delayed_work() executing on invalid memory, among other things.

The (two!) con->private tests are useless, as nothing ever clears
con->private.  Nuke them.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:40 +01:00
Ilya Dryomov
bee3a37c47 libceph: reschedule tick in mon_fault()
Doing __schedule_delayed() in the hunting branch is pointless, as the
tick will have already been scheduled by then.

What we need to do instead is *reschedule* it in the !hunting branch,
after reopen_session() changes hunt_mult, which affects the delay.
This helps with spacing out connection attempts and avoiding things
like two back-to-back attempts followed by a longer period of waiting
around.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:40 +01:00
Ilya Dryomov
1752b50ca2 libceph: introduce and switch to reopen_session()
hunting is now set in __open_session() and cleared in finish_hunting(),
instead of all around.  The "session lost" message is printed not only
on connection resets, but also on keepalive timeouts.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:39 +01:00
Ilya Dryomov
168b9090c7 libceph: monc hunt rate is 3s with backoff up to 30s
Unless we are in the process of setting up a client (i.e. connecting to
the monitor cluster for the first time), apply a backoff: every time we
want to reopen a session, increase our timeout by a multiple (currently
2); when we complete the connection, reduce that multipler by 50%.

Mirrors ceph.git commit 794c86fd289bd62a35ed14368fa096c46736e9a2.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:39 +01:00
Ilya Dryomov
58d81b1294 libceph: monc ping rate is 10s
Split ping interval and ping timeout: ping interval is 10s; keepalive
timeout is 30s.

Make monc_ping_timeout a constant while at it - it's not actually
exported as a mount option (and the rest of tick-related settings won't
be either), so it's got no place in ceph_options.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:39 +01:00
Ilya Dryomov
0e04dc26cc libceph: pick a different monitor when reconnecting
Don't try to reconnect to the same monitor when we fail to establish
a session within a timeout or it's lost.

For that, pick_new_mon() needs to see the old value of cur_mon, so
don't clear it in __close_session() - all calls to __close_session()
but one are followed by __open_session() anyway.  __open_session() is
only called when a new session needs to be established, so the "already
open?" branch, which is now in the way, is simply dropped.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:38 +01:00
Ilya Dryomov
82dcabad75 libceph: revamp subs code, switch to SUBSCRIBE2 protocol
It is currently hard-coded in the mon_client that mdsmap and monmap
subs are continuous, while osdmap sub is always "onetime".  To better
handle full clusters/pools in the osd_client, we need to be able to
issue continuous osdmap subs.  Revamp subs code to allow us to specify
for each sub whether it should be continuous or not.

Although not strictly required for the above, switch to SUBSCRIBE2
protocol while at it, eliminating the ambiguity between a request for
"every map since X" and a request for "just the latest" when we don't
have a map yet (i.e. have epoch 0).  SUBSCRIBE2 feature bit is now
required - it's been supported since pre-argonaut (2010).

Move "got mdsmap" call to the end of ceph_mdsc_handle_map() - calling
in before we validate the epoch and successfully install the new map
can mess up mon_client sub state.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:38 +01:00