Commit Graph

6509 Commits

Author SHA1 Message Date
Quentin Monnet
c456dec4d2 bpf: add documentation for eBPF helpers (12-22)
Add documentation for eBPF helper functions to bpf.h user header file.
This documentation can be parsed with the Python script provided in
another commit of the patch series, in order to provide a RST document
that can later be converted into a man page.

The objective is to make the documentation easily understandable and
accessible to all eBPF developers, including beginners.

This patch contains descriptions for the following helper functions, all
written by Alexei:

- bpf_get_current_pid_tgid()
- bpf_get_current_uid_gid()
- bpf_get_current_comm()
- bpf_skb_vlan_push()
- bpf_skb_vlan_pop()
- bpf_skb_get_tunnel_key()
- bpf_skb_set_tunnel_key()
- bpf_redirect()
- bpf_perf_event_output()
- bpf_get_stackid()
- bpf_get_current_task()

v4:
- bpf_redirect(): Fix typo: "XDP_ABORT" changed to "XDP_ABORTED". Add
  note on bpf_redirect_map() providing better performance. Replace "Save
  for" with "Except for".
- bpf_skb_vlan_push(): Clarify comment about invalidated verifier
  checks.
- bpf_skb_vlan_pop(): Clarify comment about invalidated verifier
  checks.
- bpf_skb_get_tunnel_key(): Add notes on tunnel_id, "collect metadata"
  mode, and example tunneling protocols with which it can be used.
- bpf_skb_set_tunnel_key(): Add a reference to the description of
  bpf_skb_get_tunnel_key().
- bpf_perf_event_output(): Specify that, and for what purpose, the
  helper can be used with programs attached to TC and XDP.

v3:
- bpf_skb_get_tunnel_key(): Change and improve description and example.
- bpf_redirect(): Improve description of BPF_F_INGRESS flag.
- bpf_perf_event_output(): Fix first sentence of description. Delete
  wrong statement on context being evaluated as a struct pt_reg. Remove
  the long yet incomplete example.
- bpf_get_stackid(): Add a note about PERF_MAX_STACK_DEPTH being
  configurable.

Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-04-27 00:21:58 +02:00
Quentin Monnet
ad4a522349 bpf: add documentation for eBPF helpers (01-11)
Add documentation for eBPF helper functions to bpf.h user header file.
This documentation can be parsed with the Python script provided in
another commit of the patch series, in order to provide a RST document
that can later be converted into a man page.

The objective is to make the documentation easily understandable and
accessible to all eBPF developers, including beginners.

This patch contains descriptions for the following helper functions, all
written by Alexei:

- bpf_map_lookup_elem()
- bpf_map_update_elem()
- bpf_map_delete_elem()
- bpf_probe_read()
- bpf_ktime_get_ns()
- bpf_trace_printk()
- bpf_skb_store_bytes()
- bpf_l3_csum_replace()
- bpf_l4_csum_replace()
- bpf_tail_call()
- bpf_clone_redirect()

v4:
- bpf_map_lookup_elem(): Add "const" qualifier for key.
- bpf_map_update_elem(): Add "const" qualifier for key and value.
- bpf_map_lookup_elem(): Add "const" qualifier for key.
- bpf_skb_store_bytes(): Clarify comment about invalidated verifier
  checks.
- bpf_l3_csum_replace(): Mention L3 instead of just IP, and add a note
  about bpf_csum_diff().
- bpf_l4_csum_replace(): Mention L4 instead of just TCP/UDP, and add a
  note about bpf_csum_diff().
- bpf_tail_call(): Bring minor edits to description.
- bpf_clone_redirect(): Add a note about the relation with
  bpf_redirect(). Also clarify comment about invalidated verifier
  checks.

v3:
- bpf_map_lookup_elem(): Fix description of restrictions for flags
  related to the existence of the entry.
- bpf_trace_printk(): State that trace_pipe can be configured. Fix
  return value in case an unknown format specifier is met. Add a note on
  kernel log notice when the helper is used. Edit example.
- bpf_tail_call(): Improve comment on stack inheritance.
- bpf_clone_redirect(): Improve description of BPF_F_INGRESS flag.

Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-04-27 00:21:58 +02:00
Quentin Monnet
56a092c895 bpf: add script and prepare bpf.h for new helpers documentation
Remove previous "overview" of eBPF helpers from user bpf.h header.
Replace it by a comment explaining how to process the new documentation
(to come in following patches) with a Python script to produce RST, then
man page documentation.

Also add the aforementioned Python script under scripts/. It is used to
process include/uapi/linux/bpf.h and to extract helper descriptions, to
turn it into a RST document that can further be processed with rst2man
to produce a man page. The script takes one "--filename <path/to/file>"
option. If the script is launched from scripts/ in the kernel root
directory, it should be able to find the location of the header to
parse, and "--filename <path/to/file>" is then optional. If it cannot
find the file, then the option becomes mandatory. RST-formatted
documentation is printed to standard output.

Typical workflow for producing the final man page would be:

    $ ./scripts/bpf_helpers_doc.py \
            --filename include/uapi/linux/bpf.h > /tmp/bpf-helpers.rst
    $ rst2man /tmp/bpf-helpers.rst > /tmp/bpf-helpers.7
    $ man /tmp/bpf-helpers.7

Note that the tool kernel-doc cannot be used to document eBPF helpers,
whose signatures are not available directly in the header files
(pre-processor directives are used to produce them at the beginning of
the compilation process).

v4:
- Also remove overviews for newly added bpf_xdp_adjust_tail() and
  bpf_skb_get_xfrm_state().
- Remove vague statement about what helpers are restricted to GPL
  programs in "LICENSE" section for man page footer.
- Replace license boilerplate with SPDX tag for Python script.

v3:
- Change license for man page.
- Remove "for safety reasons" from man page header text.
- Change "packets metadata" to "packets" in man page header text.
- Move and fix comment on helpers introducing no overhead.
- Remove "NOTES" section from man page footer.
- Add "LICENSE" section to man page footer.
- Edit description of file include/uapi/linux/bpf.h in man page footer.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-04-27 00:21:58 +02:00
Jiri Olsa
b85fab0e67 bpf: Add gpl_compatible flag to struct bpf_prog_info
Adding gpl_compatible flag to struct bpf_prog_info
so it can be dumped via bpf_prog_get_info_by_fd and
displayed via bpftool progs dump.

Alexei noticed 4-byte hole in struct bpf_prog_info,
so we put the u32 flags field in there, and we can
keep adding bit fields in there without breaking
user space.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-04-26 22:36:11 +02:00
Willem de Bruijn
bec1f6f697 udp: generate gso with UDP_SEGMENT
Support generic segmentation offload for udp datagrams. Callers can
concatenate and send at once the payload of multiple datagrams with
the same destination.

To set segment size, the caller sets socket option UDP_SEGMENT to the
length of each discrete payload. This value must be smaller than or
equal to the relevant MTU.

A follow-up patch adds cmsg UDP_SEGMENT to specify segment size on a
per send call basis.

Total byte length may then exceed MTU. If not an exact multiple of
segment size, the last segment will be shorter.

The implementation adds a gso_size field to the udp socket, ip(v6)
cmsg cookie and inet_cork structure to be able to set the value at
setsockopt or cmsg time and to work with both lockless and corked
paths.

Initial benchmark numbers show UDP GSO about as expensive as TCP GSO.

    tcp tso
     3197 MB/s 54232 msg/s 54232 calls/s
         6,457,754,262      cycles

    tcp gso
     1765 MB/s 29939 msg/s 29939 calls/s
        11,203,021,806      cycles

    tcp without tso/gso *
      739 MB/s 12548 msg/s 12548 calls/s
        11,205,483,630      cycles

    udp
      876 MB/s 14873 msg/s 624666 calls/s
        11,205,777,429      cycles

    udp gso
     2139 MB/s 36282 msg/s 36282 calls/s
        11,204,374,561      cycles

   [*] after reverting commit 0a6b2a1dc2
       ("tcp: switch to GSO being always on")

Measured total system cycles ('-a') for one core while pinning both
the network receive path and benchmark process to that core:

  perf stat -a -C 12 -e cycles \
    ./udpgso_bench_tx -C 12 -4 -D "$DST" -l 4

Note the reduction in calls/s with GSO. Bytes per syscall drops
increases from 1470 to 61818.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-26 15:08:04 -04:00
Thomas Gleixner
a3ed0e4393 Revert: Unify CLOCK_MONOTONIC and CLOCK_BOOTTIME
Revert commits

92af4dcb4e ("tracing: Unify the "boot" and "mono" tracing clocks")
127bfa5f43 ("hrtimer: Unify MONOTONIC and BOOTTIME clock behavior")
7250a4047a ("posix-timers: Unify MONOTONIC and BOOTTIME clock behavior")
d6c7270e91 ("timekeeping: Remove boot time specific code")
f2d6fdbfd2 ("Input: Evdev - unify MONOTONIC and BOOTTIME clock behavior")
d6ed449afd ("timekeeping: Make the MONOTONIC clock behave like the BOOTTIME clock")
72199320d4 ("timekeeping: Add the new CLOCK_MONOTONIC_ACTIVE clock")

As stated in the pull request for the unification of CLOCK_MONOTONIC and
CLOCK_BOOTTIME, it was clear that we might have to revert the change.

As reported by several folks systemd and other applications rely on the
documented behaviour of CLOCK_MONOTONIC on Linux and break with the above
changes. After resume daemons time out and other timeout related issues are
observed. Rafael compiled this list:

* systemd kills daemons on resume, after >WatchdogSec seconds
  of suspending (Genki Sky).  [Verified that that's because systemd uses
  CLOCK_MONOTONIC and expects it to not include the suspend time.]

* systemd-journald misbehaves after resume:
  systemd-journald[7266]: File /var/log/journal/016627c3c4784cd4812d4b7e96a34226/system.journal
corrupted or uncleanly shut down, renaming and replacing.
  (Mike Galbraith).

* NetworkManager reports "networking disabled" and networking is broken
  after resume 50% of the time (Pavel).  [May be because of systemd.]

* MATE desktop dims the display and starts the screensaver right after
  system resume (Pavel).

* Full system hang during resume (me).  [May be due to systemd or NM or both.]

That happens on debian and open suse systems.

It's sad, that these problems were neither catched in -next nor by those
folks who expressed interest in this change.

Reported-by: Rafael J. Wysocki <rjw@rjwysocki.net>
Reported-by: Genki Sky <sky@genki.is>,
Reported-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kevin Easton <kevin@guarana.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Salyzyn <salyzyn@android.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
2018-04-26 14:53:32 +02:00
Eric W. Biederman
db78e6a0a6 signal: Add TRAP_UNK si_code for undiagnosted trap exceptions
Both powerpc and alpha have cases where they wronly set si_code to 0
in combination with SIGTRAP and don't mean SI_USER.

About half the time this is because the architecture can not report
accurately what kind of trap exception triggered the trap exception.
The other half the time it looks like no one has bothered to
figure out an appropriate si_code.

For the cases where the architecture does not have enough information
or is too lazy to figure out exactly what kind of trap exception
it is define TRAP_UNK.

Cc: linux-api@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: linux-alpha@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-04-25 10:40:56 -05:00
David S. Miller
c749fa181b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-04-24 23:59:11 -04:00
Eyal Birger
12bed760a7 bpf: add helper for getting xfrm states
This commit introduces a helper which allows fetching xfrm state
parameters by eBPF programs attached to TC.

Prototype:
bpf_skb_get_xfrm_state(skb, index, xfrm_state, size, flags)

skb: pointer to skb
index: the index in the skb xfrm_state secpath array
xfrm_state: pointer to 'struct bpf_xfrm_state'
size: size of 'struct bpf_xfrm_state'
flags: reserved for future extensions

The helper returns 0 on success. Non zero if no xfrm state at the index
is found - or non exists at all.

struct bpf_xfrm_state currently includes the SPI, peer IPv4/IPv6
address and the reqid; it can be further extended by adding elements to
its end - indicating the populated fields by the 'size' argument -
keeping backwards compatibility.

Typical usage:

struct bpf_xfrm_state x = {};
bpf_skb_get_xfrm_state(skb, 0, &x, sizeof(x), 0);
...

Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-04-24 22:26:58 +02:00
Michael S. Tsirkin
b400003250 virtio_balloon: add array of stat names
Jason Wang points out that it's very hard for users to build an array of
stat names. The naive thing is to use VIRTIO_BALLOON_S_NR but that
breaks if we add more stats - as done e.g. recently by commit 6c64fe7f2
("virtio_balloon: export hugetlb page allocation counts").

Let's add an array of reasonably readable names.

Fixes: 6c64fe7f2 ("virtio_balloon: export hugetlb page allocation counts")
Cc: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Jonathan Helman <jonathan.helman@oracle.com>
2018-04-24 21:44:01 +03:00
Taehee Yoo
a1d768f1a0 netfilter: ebtables: add ebt_get_target and ebt_get_target_c
ebt_get_target similar to {ip/ip6/arp}t_get_target.
and ebt_get_target_c similar to {ip/ip6/arp}t_get_target_c.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24 10:29:18 +02:00
Thierry Du Tre
2eb0f624b7 netfilter: add NAT support for shifted portmap ranges
This is a patch proposal to support shifted ranges in portmaps.  (i.e. tcp/udp
incoming port 5000-5100 on WAN redirected to LAN 192.168.1.5:2000-2100)

Currently DNAT only works for single port or identical port ranges.  (i.e.
ports 5000-5100 on WAN interface redirected to a LAN host while original
destination port is not altered) When different port ranges are configured,
either 'random' mode should be used, or else all incoming connections are
mapped onto the first port in the redirect range. (in described example
WAN:5000-5100 will all be mapped to 192.168.1.5:2000)

This patch introduces a new mode indicated by flag NF_NAT_RANGE_PROTO_OFFSET
which uses a base port value to calculate an offset with the destination port
present in the incoming stream. That offset is then applied as index within the
redirect port range (index modulo rangewidth to handle range overflow).

In described example the base port would be 5000. An incoming stream with
destination port 5004 would result in an offset value 4 which means that the
NAT'ed stream will be using destination port 2004.

Other possibilities include deterministic mapping of larger or multiple ranges
to a smaller range : WAN:5000-5999 -> LAN:5000-5099 (maps WAN port 5*xx to port
51xx)

This patch does not change any current behavior. It just adds new NAT proto
range functionality which must be selected via the specific flag when intended
to use.

A patch for iptables (libipt_DNAT.c + libip6t_DNAT.c) will also be proposed
which makes this functionality immediately available.

Signed-off-by: Thierry Du Tre <thierry@dtsystems.be>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24 10:29:12 +02:00
Jason Gunthorpe
d50e14abe2 uapi: Fix SPDX tags for files referring to the 'OpenIB.org' license
Based on discussion with Kate Stewart this license is not a
BSD-2-Clause, but is now formally identified as Linux-OpenIB
by SPDX.

The key difference between the licenses is in the 'warranty'
paragraph.

if_infiniband.h refers to the 'OpenIB.org' license, but
does not include the text, instead it links to an obsolete
web site that contains a license that matches the BSD-2-Clause
SPX. There is no 'three clause' version of the OpenIB.org
license.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-04-23 11:10:33 -04:00
Martin KaFai Lau
fbcf93ebca bpf: btf: Clean up btf.h in uapi
This patch cleans up btf.h in uapi:
1) Rename "name" to "name_off" to better reflect it is an offset to the
   string section instead of a char array.
2) Remove unused value BTF_FLAGS_COMPR and BTF_MAGIC_SWAP

Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-04-23 11:32:01 +02:00
Linus Torvalds
38f0b33e6d Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
 "A larger set of updates for perf.

  Kernel:

   - Handle the SBOX uncore monitoring correctly on Broadwell CPUs which
     do not have SBOX.

   - Store context switch out type in PERF_RECORD_SWITCH[_CPU_WIDE]. The
     percentage of preempting and non-preempting context switches help
     understanding the nature of workloads (CPU or IO bound) that are
     running on a machine. This adds the kernel facility and userspace
     changes needed to show this information in 'perf script' and 'perf
     report -D' (Alexey Budankov)

   - Remove a WARN_ON() in the trace/kprobes code which is pointless
     because the return error code is already telling the caller what's
     wrong.

   - Revert a fugly workaround for clang BPF targets.

   - Fix sample_max_stack maximum check and do not proceed when an error
     has been detect, return them to avoid misidentifying errors (Jiri
     Olsa)

   - Add SPDX idenitifiers and get rid of GPL boilderplate.

  Tools:

   - Synchronize kernel ABI headers, v4.17-rc1 (Ingo Molnar)

   - Support MAP_FIXED_NOREPLACE, noticed when updating the
     tools/include/ copies (Arnaldo Carvalho de Melo)

   - Add '\n' at the end of parse-options error messages (Ravi Bangoria)

   - Add s390 support for detailed/verbose PMU event description (Thomas
     Richter)

   - perf annotate fixes and improvements:

      * Allow showing offsets in more than just jump targets, use the
        new 'O' hotkey in the TUI, config ~/.perfconfig
        annotate.offset_level for it and for --stdio2 (Arnaldo Carvalho
        de Melo)

      * Use the resolved variable names from objdump disassembled lines
        to make them more compact, just like was already done for some
        instructions, like "mov", this eventually will be done more
        generally, but lets now add some more to the existing mechanism
        (Arnaldo Carvalho de Melo)

   - perf record fixes:

      * Change warning for missing topology sysfs entry to debug, as not
        all architectures have those files, s390 being one of those
        (Thomas Richter)

      * Remove old error messages about things that unlikely to be the
        root cause in modern systems (Andi Kleen)

   - perf sched fixes:

      * Fix -g/--call-graph documentation (Takuya Yamamoto)

   - perf stat:

      * Enable 1ms interval for printing event counters values in
        (Alexey Budankov)

   - perf test fixes:

      * Run dwarf unwind on arm32 (Kim Phillips)

      * Remove unused ptrace.h include from LLVM test, sidesteping older
        clang's lack of support for some asm constructs (Arnaldo
        Carvalho de Melo)

      * Fixup BPF test using epoll_pwait syscall function probe, to cope
        with the syscall routines renames performed in this development
        cycle (Arnaldo Carvalho de Melo)

   - perf version fixes:

      * Do not print info about HAVE_LIBAUDIT_SUPPORT in 'perf version
        --build-options' when HAVE_SYSCALL_TABLE_SUPPORT is true, as
        libaudit won't be used in that case, print info about
        syscall_table support instead (Jin Yao)

   - Build system fixes:

      * Use HAVE_..._SUPPORT used consistently (Jin Yao)

      * Restore READ_ONCE() C++ compatibility in tools/include (Mark
        Rutland)

      * Give hints about package names needed to build jvmti (Arnaldo
        Carvalho de Melo)"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
  perf/x86/intel/uncore: Fix SBOX support for Broadwell CPUs
  perf/x86/intel/uncore: Revert "Remove SBOX support for Broadwell server"
  coresight: Move to SPDX identifier
  perf test BPF: Fixup BPF test using epoll_pwait syscall function probe
  perf tests mmap: Show which tracepoint is failing
  perf tools: Add '\n' at the end of parse-options error messages
  perf record: Remove suggestion to enable APIC
  perf record: Remove misleading error suggestion
  perf hists browser: Clarify top/report browser help
  perf mem: Allow all record/report options
  perf trace: Support MAP_FIXED_NOREPLACE
  perf: Remove superfluous allocation error check
  perf: Fix sample_max_stack maximum check
  perf: Return proper values for user stack errors
  perf list: Add s390 support for detailed/verbose PMU event description
  perf script: Extend misc field decoding with switch out event type
  perf report: Extend raw dump (-D) out with switch out event type
  perf/core: Store context switch out type in PERF_RECORD_SWITCH[_CPU_WIDE]
  tools/headers: Synchronize kernel ABI headers, v4.17-rc1
  trace_kprobe: Remove warning message "Could not insert probe at..."
  ...
2018-04-22 10:17:01 -07:00
Mathias Nyman
013eedb8c5 USB: Add support to store lane count used by USB 3.2
USB 3.2 specification adds Dual-lane support, doubling the maximum
SuperSpeedPlus data rate from 10Gbps to 20Gbps.

Dual-lane takes into use a second set of rx and tx wires/pins in the
Type-C cable and connector.

Add "rx_lanes" and "tx_lanes" variables to struct usb_device to store
the numer of lanes in use. Number of lanes can be read using the extended
port status hub request that was introduced in USB 3.1.

Extended port status rx and tx lane count are zero based, maximum
lanes supported by non inter-chip (SSIC) USB 3.2 is 2 (dual lane) with
rx and tx lane count symmetric. SSIC devices support asymmetric lanes
up to 4 lanes per direction.

If extended port status is not available then default to one lane.

Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-22 16:11:19 +02:00
Linus Torvalds
285848b0f4 Merge tag 'random_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random
Pull /dev/random fixes from Ted Ts'o:
 "Fix some bugs in the /dev/random driver which causes getrandom(2) to
  unblock earlier than designed.

  Thanks to Jann Horn from Google's Project Zero for pointing this out
  to me"

* tag 'random_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random:
  random: add new ioctl RNDRESEEDCRNG
  random: crng_reseed() should lock the crng instance that it is modifying
  random: set up the NUMA crng instances after the CRNG is fully initialized
  random: use a different mixing algorithm for add_device_randomness()
  random: fix crng_ready() test
2018-04-21 21:20:48 -07:00
David S. Miller
e0ada51db9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts were simple overlapping changes in microchip
driver.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-21 16:32:48 -04:00
David S. Miller
1b80f86ed6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2018-04-21

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Initial work on BPF Type Format (BTF) is added, which is a meta
   data format which describes the data types of BPF programs / maps.
   BTF has its roots from CTF (Compact C-Type format) with a number
   of changes to it. First use case is to provide a generic pretty
   print capability for BPF maps inspection, later work will also
   add BTF to bpftool. pahole support to convert dwarf to BTF will
   be upstreamed as well (https://github.com/iamkafai/pahole/tree/btf),
   from Martin.

2) Add a new xdp_bpf_adjust_tail() BPF helper for XDP that allows
   for changing the data_end pointer. Only shrinking is currently
   supported which helps for crafting ICMP control messages. Minor
   changes in drivers have been added where needed so they recalc
   the packet's length also when data_end was adjusted, from Nikita.

3) Improve bpftool to make it easier to feed hex bytes via cmdline
   for map operations, from Quentin.

4) Add support for various missing BPF prog types and attach types
   that have been added to kernel recently but neither to bpftool
   nor libbpf yet. Doc and bash completion updates have been added
   as well for bpftool, from Andrey.

5) Proper fix for avoiding to leak info stored in frame data on page
   reuse for the two bpf_xdp_adjust_{head,meta} helpers by disallowing
   to move the pointers into struct xdp_frame area, from Jesper.

6) Follow-up compile fix from BTF in order to include stdbool.h in
   libbpf, from Björn.

7) Few fixes in BPF sample code, that is, a typo on the netdevice
   in a comment and fixup proper dump of XDP action code in the
   tracepoint exception, from Wang and Jesper.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-21 15:56:15 -04:00
Randy Dunlap
572ccdab50 scsi: target: target_core_user.[ch]: convert comments into DOC:
Make documentation on target-supported userspace-I/O design be
usable by kernel-doc by using "DOC:". This is used in the driver-api
Documentation chapter.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
To: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
Cc: linux-scsi@vger.kernel.org
Cc: target-devel@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: "James E.J. Bottomley" <jejb@linux.vnet.ibm.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2018-04-20 19:14:39 -04:00
GhantaKrishnamurthy MohanKrishna
901271e040 tipc: implement configuration of UDP media MTU
In previous commit, we changed the default emulated MTU for UDP bearers
to 14k.

This commit adds the functionality to set/change the default value
by configuring new MTU for UDP media. UDP bearer(s) have to be disabled
and enabled back for the new MTU to take effect.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-20 11:04:05 -04:00
GhantaKrishnamurthy MohanKrishna
a4dfa72d0a tipc: set default MTU for UDP media
Currently, all bearers are configured with MTU value same as the
underlying L2 device. However, in case of bearers with media type
UDP, higher throughput is possible with a fixed and higher emulated
MTU value than adapting to the underlying L2 MTU.

In this commit, we introduce a parameter mtu in struct tipc_media
and a default value is set for UDP. A default value of 14k
was determined by experimentation and found to have a higher throughput
than 16k. MTU for UDP bearers are assigned the above set value of
media MTU.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-20 11:04:05 -04:00
Arnd Bergmann
f991f01571 y2038: asm-generic: Extend sysvipc data structures
Most architectures now use the asm-generic copy of the sysvipc data
structures (msqid64_ds, semid64_ds, shmid64_ds), which use 32-bit
__kernel_time_t on 32-bit architectures but have padding behind them to
allow extending the type to 64-bit.

Unfortunately, that fails on all big-endian architectures, which have the
padding on the wrong side. As so many of them get it wrong, we decided to
not bother even trying to fix it up when we introduced the asm-generic
copy. Instead we always use the padding word now to provide the upper
32 bits of the seconds value, regardless of the endianess.

A libc implementation on a typical big-endian system can deal with
this by providing its own copy of the structure definition to user
space, and swapping the two 32-bit words before returning from the
semctl/shmctl/msgctl system calls.

Note that msqid64_ds and shmid64_ds were broken on x32 since commit
f4b4aae182 ("x86/headers/uapi: Fix __BITS_PER_LONG value for x32
builds"). I have sent a separate fix for that, but as we no longer
have to worry about x32 here, I no longer worry about x32 here and
use 'unsigned long' instead of __kernel_ulong_t.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-04-20 15:57:50 +02:00
Sean Young
95d1544eb6 media: rc: add ioctl to get the current timeout
Since the kernel now modifies the timeout, make it possible to retrieve
the current value.

Signed-off-by: Sean Young <sean@mess.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2018-04-20 09:15:18 -04:00
Martin KaFai Lau
a26ca7c982 bpf: btf: Add pretty print support to the basic arraymap
This patch adds pretty print support to the basic arraymap.
Support for other bpf maps can be added later.

This patch adds new attrs to the BPF_MAP_CREATE command to allow
specifying the btf_fd, btf_key_id and btf_value_id.  The
BPF_MAP_CREATE can then associate the btf to the map if
the creating map supports BTF.

A BTF supported map needs to implement two new map ops,
map_seq_show_elem() and map_check_btf().  This patch has
implemented these new map ops for the basic arraymap.

It also adds file_operations, bpffs_map_fops, to the pinned
map such that the pinned map can be opened and read.
After that, the user has an intuitive way to do
"cat bpffs/pathto/a-pinned-map" instead of getting
an error.

bpffs_map_fops should not be extended further to support
other operations.  Other operations (e.g. write/key-lookup...)
should be realized by the userspace tools (e.g. bpftool) through
the BPF_OBJ_GET_INFO_BY_FD, map's lookup/update interface...etc.
Follow up patches will allow the userspace to obtain
the BTF from a map-fd.

Here is a sample output when reading a pinned arraymap
with the following map's value:

struct map_value {
	int count_a;
	int count_b;
};

cat /sys/fs/bpf/pinned_array_map:

0: {1,2}
1: {3,4}
2: {5,6}
...

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-04-19 21:46:25 +02:00
Martin KaFai Lau
f56a653c1f bpf: btf: Add BPF_BTF_LOAD command
This patch adds a BPF_BTF_LOAD command which
1) loads and verifies the BTF (implemented in earlier patches)
2) returns a BTF fd to userspace.  In the next patch, the
   BTF fd can be specified during BPF_MAP_CREATE.

It currently limits to CAP_SYS_ADMIN.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-04-19 21:46:25 +02:00
Martin KaFai Lau
69b693f0ae bpf: btf: Introduce BPF Type Format (BTF)
This patch introduces BPF type Format (BTF).

BTF (BPF Type Format) is the meta data format which describes
the data types of BPF program/map.  Hence, it basically focus
on the C programming language which the modern BPF is primary
using.  The first use case is to provide a generic pretty print
capability for a BPF map.

BTF has its root from CTF (Compact C-Type format).  To simplify
the handling of BTF data, BTF removes the differences between
small and big type/struct-member.  Hence, BTF consistently uses u32
instead of supporting both "one u16" and "two u32 (+padding)" in
describing type and struct-member.

It also raises the number of types (and functions) limit
from 0x7fff to 0x7fffffff.

Due to the above changes,  the format is not compatible to CTF.
Hence, BTF starts with a new BTF_MAGIC and version number.

This patch does the first verification pass to the BTF.  The first
pass checks:
1. meta-data size (e.g. It does not go beyond the total btf's size)
2. name_offset is valid
3. Each BTF_KIND (e.g. int, enum, struct....) does its
   own check of its meta-data.

Some other checks, like checking a struct's member is referring
to a valid type, can only be done in the second pass.  The second
verification pass will be implemented in the next patch.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-04-19 21:46:24 +02:00
Yuchung Cheng
feb5f2ec64 tcp: export packets delivery info
Export data delivered and delivered with CE marks to
1) SNMP TCPDelivered and TCPDeliveredCE
2) getsockopt(TCP_INFO)
3) Timestamping API SOF_TIMESTAMPING_OPT_STATS

Note that for SCM_TSTAMP_ACK, the delivery info in
SOF_TIMESTAMPING_OPT_STATS is reported before the info
was fully updated on the ACK.

These stats help application monitor TCP delivery and ECN status
on per host, per connection, even per message level.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 13:05:16 -04:00
Johannes Berg
a7cfebcb75 cfg80211: limit wiphy names to 128 bytes
There's currently no limit on wiphy names, other than netlink
message size and memory limitations, but that causes issues when,
for example, the wiphy name is used in a uevent, e.g. in rfkill
where we use the same name for the rfkill instance, and then the
buffer there is "only" 2k for the environment variables.

This was reported by syzkaller, which used a 4k name.

Limit the name to something reasonable, I randomly picked 128.

Reported-by: syzbot+230d9e642a85d3fec29c@syzkaller.appspotmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-04-19 15:46:34 +02:00
Deepa Dinamani
acf8870a62 time: Add new y2038 safe __kernel_timespec
The new struct __kernel_timespec is similar to current
internal kernel struct timespec64 on 64 bit architecture.
The compat structure however is similar to below on little
endian systems (padding and tv_nsec are switched for big
endian systems):

typedef s32            compat_long_t;
typedef s64            compat_kernel_time64_t;

struct compat_kernel_timespec {
       compat_kernel_time64_t  tv_sec;
       compat_long_t           tv_nsec;
       compat_long_t           padding;
};

This allows for both the native and compat representations to
be the same and syscalls using this type as part of their ABI
can have a single entry point to both.

Note that the compat define is not included anywhere in the
kernel explicitly to avoid confusion.

These types will be used by the new syscalls that will be
introduced in the consequent patches.
Most of the new syscalls are just an update to the existing
native ones with this new type. Hence, put this new type under
an ifdef so that the architectures can define CONFIG_64BIT_TIME
when they are ready to handle this switch.

Cc: linux-arch@vger.kernel.org
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-04-19 13:31:29 +02:00
Nikita V. Shirokov
b32cc5b9a3 bpf: adding bpf_xdp_adjust_tail helper
Adding new bpf helper which would allow us to manipulate
xdp's data_end pointer, and allow us to reduce packet's size
indended use case: to generate ICMP messages from XDP context,
where such message would contain truncated original packet.

Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-04-18 23:34:16 +02:00
Hangbin Liu
72f6d71e49 vxlan: add ttl inherit support
Like tos inherit, ttl inherit should also means inherit the inner protocol's
ttl values, which actually not implemented in vxlan yet.

But we could not treat ttl == 0 as "use the inner TTL", because that would be
used also when the "ttl" option is not specified and that would be a behavior
change, and breaking real use cases.

So add a different attribute IFLA_VXLAN_TTL_INHERIT when "ttl inherit" is
specified with ip cmd.

Reported-by: Jianlin Shi <jishi@redhat.com>
Suggested-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 13:53:13 -04:00
Alexey Budankov
101592b490 perf/core: Store context switch out type in PERF_RECORD_SWITCH[_CPU_WIDE]
Store preempting context switch out event into Perf trace as a part of
PERF_RECORD_SWITCH[_CPU_WIDE] record.

Percentage of preempting and non-preempting context switches help
understanding the nature of workloads (CPU or IO bound) that are running
on a machine;

The event is treated as preemption one when task->state value of the
thread being switched out is TASK_RUNNING. Event type encoding is
implemented using PERF_RECORD_MISC_SWITCH_OUT_PREEMPT bit;

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/9ff84e83-a0ca-dd82-a6d0-cb951689be74@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-04-17 09:47:39 -03:00
Heiner Kallweit
a5724fc383 PCI: Add two more values for PCIe Max_Read_Request_Size
This patch adds missing values for the max read request size.
E.g. network driver r8169 uses a value of 4K.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-16 18:55:04 -04:00
Kirill Marinushkin
e590522a06 ASoC: topology: Add definitions for mclk_direction values
Current comment makes not clear the direction of mclk. Previously, similar
description caused a misunderstanding for bclk_master and fsync_master.

This commit solves the potential confusion the same way it is solved for
bclk_master and fsync_master.

Signed-off-by: Kirill Marinushkin <k.marinushkin@gmail.com>
Acked-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Mark Brown <broonie@kernel.org>
Cc: Pan Xiuli <xiuli.pan@linux.intel.com>
Cc: Liam Girdwood <liam.r.girdwood@linux.intel.com>
Cc: linux-kernel@vger.kernel.org
Cc: alsa-devel@alsa-project.org
Signed-off-by: Mark Brown <broonie@kernel.org>
2018-04-16 17:52:31 +01:00
Kirill Marinushkin
933e1c4a66 ASoC: topology: Add missing clock gating parameter when parsing hw_configs
Clock gating parameter is a part of `dai_fmt`. It is supported by
`alsa-lib` when creating a topology binary file, but ignored by kernel
when loading this topology file.

After applying this commit, the clock gating parameter is not ignored any
more. This solution is backwards compatible. The existing behaviour is
not broken, because by default the parameter value is 0 and is ignored.

snd_soc_tplg_hw_config.clock_gated = 0 => no effect
snd_soc_tplg_hw_config.clock_gated = 1 => SND_SOC_DAIFMT_GATED
snd_soc_tplg_hw_config.clock_gated = 2 => SND_SOC_DAIFMT_CONT

For example, the following config, based on
alsa-lib/src/conf/topology/broadwell/broadwell.conf, is now supported:

~~~~
SectionHWConfig."CodecHWConfig" {
        id "1"
        format "I2S"            # physical audio format.
        pm_gate_clocks "true"   # clock can be gated
}

SectionLink."Codec" {

        # used for binding to the physical link
        id "0"

        hw_configs [
                "CodecHWConfig"
        ]

        default_hw_conf_id "1"
}
~~~~

Signed-off-by: Kirill Marinushkin <k.marinushkin@gmail.com>
Reviewed-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Mark Brown <broonie@kernel.org>
Cc: Pan Xiuli <xiuli.pan@linux.intel.com>
Cc: Liam Girdwood <liam.r.girdwood@linux.intel.com>
Cc: linux-kernel@vger.kernel.org
Cc: alsa-devel@alsa-project.org
Signed-off-by: Mark Brown <broonie@kernel.org>
2018-04-16 17:52:26 +01:00
Kirill Marinushkin
a941e2fab3 ASoC: topology: Fix bclk and fsync inversion in set_link_hw_format()
The values of bclk and fsync are inverted WRT the codec. But the existing
solution already works for Broadwell, see the alsa-lib config:

`alsa-lib/src/conf/topology/broadwell/broadwell.conf`

This commit provides the backwards-compatible solution to fix this misuse.

Signed-off-by: Kirill Marinushkin <k.marinushkin@gmail.com>
Reviewed-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Tested-by: Pan Xiuli <xiuli.pan@linux.intel.com>
Tested-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Mark Brown <broonie@kernel.org>
Cc: Liam Girdwood <liam.r.girdwood@linux.intel.com>
Cc: linux-kernel@vger.kernel.org
Cc: alsa-devel@alsa-project.org
Signed-off-by: Mark Brown <broonie@kernel.org>
2018-04-16 17:52:16 +01:00
Greg Kroah-Hartman
edf5c17d86 staging: irda: remove remaining remants of irda code removal
There were some documentation locations that irda was mentioned, as well
as an old MAINTAINERS entry and the networking sysctl entries.  Clean
these all out as this stuff really is finally gone.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-16 11:26:49 +02:00
Theodore Ts'o
d848e5f8e1 random: add new ioctl RNDRESEEDCRNG
Add a new ioctl which forces the the crng to be reseeded.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2018-04-14 11:59:31 -04:00
Linus Torvalds
681857ef0d Merge branch 'parisc-4.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc updates from Helge Deller:

 - fix panic when halting system via "shutdown -h now"

 - drop own coding in favour of generic CONFIG_COMPAT_BINFMT_ELF
   implementation

 - add FPE_CONDTRAP constant: last outstanding parisc-specific cleanup
   for Eric Biedermans siginfo patches

 - move some functions to .init and some to .text.hot linker sections

* 'parisc-4.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  parisc: Prevent panic at system halt
  parisc: Switch to generic COMPAT_BINFMT_ELF
  parisc: Move cache flush functions into .text.hot section
  parisc/signal: Add FPE_CONDTRAP for conditional trap handling
2018-04-12 17:07:04 -07:00
Linus Torvalds
e241e3f2bf Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio update from Michael Tsirkin:
 "This adds reporting hugepage stats to virtio-balloon"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
  virtio_balloon: export hugetlb page allocation counts
2018-04-11 18:58:27 -07:00
Chunming Zhou
552825b28d drm/amdgpu: add new bo flag that indicates BOs don't need fallback (v2)
user cases:
1. KFD wraps amdgpu_bo_create, they have no fallback case which is different
with amdgpu_gem_object_create.
since upstream branch has no amdgpu_amdkfd_gpuvm.c, which need KFD
guys add this flag to __alloc_memory_of_gpu:
+       flags |= AMDGPU_GEM_CREATE_NO_FALLBACK;
2. UMD can specify this flag for their allocation as well if they like.

v2: squash in merge conflict fix (Chunming)

Signed-off-by: Chunming Zhou <david1.zhou@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: felix.kuehling@amd.com
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-04-11 13:08:01 -05:00
Masahiro Yamada
21e7bc600e linux/const.h: refactor _BITUL and _BITULL a bit
Minor cleanups available by _UL and _ULL.

Link: http://lkml.kernel.org/r/1519301715-31798-5-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Russell King <rmk+kernel@armlinux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:38 -07:00
Masahiro Yamada
2dd8a62c64 linux/const.h: move UL() macro to include/linux/const.h
ARM, ARM64 and UniCore32 duplicate the definition of UL():

  #define UL(x) _AC(x, UL)

This is not actually arch-specific, so it will be useful to move it to a
common header.  Currently, we only have the uapi variant for
linux/const.h, so I am creating include/linux/const.h.

I also added _UL(), _ULL() and ULL() because _AC() is mostly used in
the form either _AC(..., UL) or _AC(..., ULL).  I expect they will be
replaced in follow-up cleanups.  The underscore-prefixed ones should
be used for exported headers.

Link: http://lkml.kernel.org/r/1519301715-31798-4-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Guan Xuetao <gxt@mprc.pku.edu.cn>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Russell King <rmk+kernel@armlinux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:38 -07:00
Masahiro Yamada
2a6cc8a6c0 linux/const.h: prefix include guard of uapi/linux/const.h with _UAPI
Patch series "linux/const.h: cleanups of macros such as UL(), _BITUL(),
BIT() etc", v3.

ARM, ARM64, UniCore32 define UL() as a shorthand of _AC(..., UL).  More
architectures may introduce it in the future.

UL() is arch-agnostic, and useful. So let's move it to
include/linux/const.h

Currently, <asm/memory.h> must be included to use UL().  It pulls in more
bloats just for defining some bit macros.

I posted V2 one year ago.

The previous posts are:
https://patchwork.kernel.org/patch/9498273/
https://patchwork.kernel.org/patch/9498275/
https://patchwork.kernel.org/patch/9498269/
https://patchwork.kernel.org/patch/9498271/

At that time, what blocked this series was a comment from
David Howells:
  You need to be very careful doing this.  Some userspace stuff
  depends on the guard macro names on the kernel header files.

(https://patchwork.kernel.org/patch/9498275/)

Looking at the code closer, I noticed this is not a problem.

See the following line.
https://github.com/torvalds/linux/blob/v4.16-rc2/scripts/headers_install.sh#L40

scripts/headers_install.sh rips off _UAPI prefix from guard macro names.

I ran "make headers_install" and confirmed the result is what I expect.

So, we can prefix the include guard of include/uapi/linux/const.h,
and add a new include/linux/const.h.

This patch (of 4):

I am going to add include/linux/const.h for the kernel space.

Add _UAPI to the include guard of include/uapi/linux/const.h to
prepare for that.

Please notice the guard name of the exported one will be kept as-is.
So, this commit has no impact to the userspace even if some userspace
stuff depends on the guard macro names.

scripts/headers_install.sh processes exported headers by SED, and
rips off "_UAPI" from guard macro names.

  #ifndef _UAPI_LINUX_CONST_H
  #define _UAPI_LINUX_CONST_H

will be turned into

  #ifndef _LINUX_CONST_H
  #define _LINUX_CONST_H

Link: http://lkml.kernel.org/r/1519301715-31798-2-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:38 -07:00
Michal Hocko
4ed2863951 fs, elf: drop MAP_FIXED usage from elf_map
Both load_elf_interp and load_elf_binary rely on elf_map to map segments
on a controlled address and they use MAP_FIXED to enforce that.  This is
however dangerous thing prone to silent data corruption which can be
even exploitable.

Let's take CVE-2017-1000253 as an example.  At the time (before commit
eab09532d4: "binfmt_elf: use ELF_ET_DYN_BASE only for PIE")
ELF_ET_DYN_BASE was at TASK_SIZE / 3 * 2 which is not that far away from
the stack top on 32b (legacy) memory layout (only 1GB away).  Therefore
we could end up mapping over the existing stack with some luck.

The issue has been fixed since then (a87938b2e2: "fs/binfmt_elf.c: fix
bug in loading of PIE binaries"), ELF_ET_DYN_BASE moved moved much
further from the stack (eab09532d4 and later by c715b72c1b: "mm:
revert x86_64 and arm64 ELF_ET_DYN_BASE base changes") and excessive
stack consumption early during execve fully stopped by da029c11e6
("exec: Limit arg stack to at most 75% of _STK_LIM").  So we should be
safe and any attack should be impractical.  On the other hand this is
just too subtle assumption so it can break quite easily and hard to
spot.

I believe that the MAP_FIXED usage in load_elf_binary (et. al) is still
fundamentally dangerous.  Moreover it shouldn't be even needed.  We are
at the early process stage and so there shouldn't be unrelated mappings
(except for stack and loader) existing so mmap for a given address should
succeed even without MAP_FIXED.  Something is terribly wrong if this is
not the case and we should rather fail than silently corrupt the
underlying mapping.

Address this issue by changing MAP_FIXED to the newly added
MAP_FIXED_NOREPLACE.  This will mean that mmap will fail if there is an
existing mapping clashing with the requested one without clobbering it.

[mhocko@suse.com: fix build]
[akpm@linux-foundation.org: coding-style fixes]
[avagin@openvz.org: don't use the same value for MAP_FIXED_NOREPLACE and MAP_SYNC]
  Link: http://lkml.kernel.org/r/20171218184916.24445-1-avagin@openvz.org
Link: http://lkml.kernel.org/r/20171213092550.2774-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Joel Stanley <joel@jms.id.au>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:38 -07:00
Michal Hocko
a4ff8e8620 mm: introduce MAP_FIXED_NOREPLACE
Patch series "mm: introduce MAP_FIXED_NOREPLACE", v2.

This has started as a follow up discussion [3][4] resulting in the
runtime failure caused by hardening patch [5] which removes MAP_FIXED
from the elf loader because MAP_FIXED is inherently dangerous as it
might silently clobber an existing underlying mapping (e.g.  stack).
The reason for the failure is that some architectures enforce an
alignment for the given address hint without MAP_FIXED used (e.g.  for
shared or file backed mappings).

One way around this would be excluding those archs which do alignment
tricks from the hardening [6].  The patch is really trivial but it has
been objected, rightfully so, that this screams for a more generic
solution.  We basically want a non-destructive MAP_FIXED.

The first patch introduced MAP_FIXED_NOREPLACE which enforces the given
address but unlike MAP_FIXED it fails with EEXIST if the given range
conflicts with an existing one.  The flag is introduced as a completely
new one rather than a MAP_FIXED extension because of the backward
compatibility.  We really want a never-clobber semantic even on older
kernels which do not recognize the flag.  Unfortunately mmap sucks
wrt flags evaluation because we do not EINVAL on unknown flags.  On
those kernels we would simply use the traditional hint based semantic so
the caller can still get a different address (which sucks) but at least
not silently corrupt an existing mapping.  I do not see a good way
around that.  Except we won't export expose the new semantic to the
userspace at all.

It seems there are users who would like to have something like that.
Jemalloc has been mentioned by Michael Ellerman [7]

Florian Weimer has mentioned the following:
: glibc ld.so currently maps DSOs without hints.  This means that the kernel
: will map right next to each other, and the offsets between them a completely
: predictable.  We would like to change that and supply a random address in a
: window of the address space.  If there is a conflict, we do not want the
: kernel to pick a non-random address. Instead, we would try again with a
: random address.

John Hubbard has mentioned CUDA example
: a) Searches /proc/<pid>/maps for a "suitable" region of available
: VA space.  "Suitable" generally means it has to have a base address
: within a certain limited range (a particular device model might
: have odd limitations, for example), it has to be large enough, and
: alignment has to be large enough (again, various devices may have
: constraints that lead us to do this).
:
: This is of course subject to races with other threads in the process.
:
: Let's say it finds a region starting at va.
:
: b) Next it does:
:     p = mmap(va, ...)
:
: *without* setting MAP_FIXED, of course (so va is just a hint), to
: attempt to safely reserve that region. If p != va, then in most cases,
: this is a failure (almost certainly due to another thread getting a
: mapping from that region before we did), and so this layer now has to
: call munmap(), before returning a "failure: retry" to upper layers.
:
:     IMPROVEMENT: --> if instead, we could call this:
:
:             p = mmap(va, ... MAP_FIXED_NOREPLACE ...)
:
:         , then we could skip the munmap() call upon failure. This
:         is a small thing, but it is useful here. (Thanks to Piotr
:         Jaroszynski and Mark Hairgrove for helping me get that detail
:         exactly right, btw.)
:
: c) After that, CUDA suballocates from p, via:
:
:      q = mmap(sub_region_start, ... MAP_FIXED ...)
:
: Interestingly enough, "freeing" is also done via MAP_FIXED, and
: setting PROT_NONE to the subregion. Anyway, I just included (c) for
: general interest.

Atomic address range probing in the multithreaded programs in general
sounds like an interesting thing to me.

The second patch simply replaces MAP_FIXED use in elf loader by
MAP_FIXED_NOREPLACE.  I believe other places which rely on MAP_FIXED
should follow.  Actually real MAP_FIXED usages should be docummented
properly and they should be more of an exception.

[1] http://lkml.kernel.org/r/20171116101900.13621-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/20171129144219.22867-1-mhocko@kernel.org
[3] http://lkml.kernel.org/r/20171107162217.382cd754@canb.auug.org.au
[4] http://lkml.kernel.org/r/1510048229.12079.7.camel@abdul.in.ibm.com
[5] http://lkml.kernel.org/r/20171023082608.6167-1-mhocko@kernel.org
[6] http://lkml.kernel.org/r/20171113094203.aofz2e7kueitk55y@dhcp22.suse.cz
[7] http://lkml.kernel.org/r/87efp1w7vy.fsf@concordia.ellerman.id.au

This patch (of 2):

MAP_FIXED is used quite often to enforce mapping at the particular range.
The main problem of this flag is, however, that it is inherently dangerous
because it unmaps existing mappings covered by the requested range.  This
can cause silent memory corruptions.  Some of them even with serious
security implications.  While the current semantic might be really
desiderable in many cases there are others which would want to enforce the
given range but rather see a failure than a silent memory corruption on a
clashing range.  Please note that there is no guarantee that a given range
is obeyed by the mmap even when it is free - e.g.  arch specific code is
allowed to apply an alignment.

Introduce a new MAP_FIXED_NOREPLACE flag for mmap to achieve this
behavior.  It has the same semantic as MAP_FIXED wrt.  the given address
request with a single exception that it fails with EEXIST if the requested
address is already covered by an existing mapping.  We still do rely on
get_unmaped_area to handle all the arch specific MAP_FIXED treatment and
check for a conflicting vma after it returns.

The flag is introduced as a completely new one rather than a MAP_FIXED
extension because of the backward compatibility.  We really want a
never-clobber semantic even on older kernels which do not recognize the
flag.  Unfortunately mmap sucks wrt.  flags evaluation because we do not
EINVAL on unknown flags.  On those kernels we would simply use the
traditional hint based semantic so the caller can still get a different
address (which sucks) but at least not silently corrupt an existing
mapping.  I do not see a good way around that.

[mpe@ellerman.id.au: fix whitespace]
[fail on clashing range with EEXIST as per Florian Weimer]
[set MAP_FIXED before round_hint_to_min as per Khalid Aziz]
Link: http://lkml.kernel.org/r/20171213092550.2774-2-mhocko@kernel.org
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Russell King - ARM Linux <linux@armlinux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Joel Stanley <joel@jms.id.au>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Jason Evans <jasone@google.com>
Cc: David Goldblatt <davidtgoldblatt@gmail.com>
Cc: Edward Tomasz Napierała <trasz@FreeBSD.org>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:38 -07:00
Davidlohr Bueso
23c8cec8cf ipc/msg: introduce msgctl(MSG_STAT_ANY)
There is a permission discrepancy when consulting msq ipc object
metadata between /proc/sysvipc/msg (0444) and the MSG_STAT shmctl
command.  The later does permission checks for the object vs S_IRUGO.
As such there can be cases where EACCESS is returned via syscall but the
info is displayed anyways in the procfs files.

While this might have security implications via info leaking (albeit no
writing to the msq metadata), this behavior goes way back and showing
all the objects regardless of the permissions was most likely an
overlook - so we are stuck with it.  Furthermore, modifying either the
syscall or the procfs file can cause userspace programs to break (ie
ipcs).  Some applications require getting the procfs info (without root
privileges) and can be rather slow in comparison with a syscall -- up to
500x in some reported cases for shm.

This patch introduces a new MSG_STAT_ANY command such that the msq ipc
object permissions are ignored, and only audited instead.  In addition,
I've left the lsm security hook checks in place, as if some policy can
block the call, then the user has no other choice than just parsing the
procfs file.

Link: http://lkml.kernel.org/r/20180215162458.10059-4-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reported-by: Robert Kettler <robert.kettler@outlook.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:37 -07:00
Davidlohr Bueso
a280d6dc77 ipc/sem: introduce semctl(SEM_STAT_ANY)
There is a permission discrepancy when consulting shm ipc object
metadata between /proc/sysvipc/sem (0444) and the SEM_STAT semctl
command.  The later does permission checks for the object vs S_IRUGO.
As such there can be cases where EACCESS is returned via syscall but the
info is displayed anyways in the procfs files.

While this might have security implications via info leaking (albeit no
writing to the sma metadata), this behavior goes way back and showing
all the objects regardless of the permissions was most likely an
overlook - so we are stuck with it.  Furthermore, modifying either the
syscall or the procfs file can cause userspace programs to break (ie
ipcs).  Some applications require getting the procfs info (without root
privileges) and can be rather slow in comparison with a syscall -- up to
500x in some reported cases for shm.

This patch introduces a new SEM_STAT_ANY command such that the sem ipc
object permissions are ignored, and only audited instead.  In addition,
I've left the lsm security hook checks in place, as if some policy can
block the call, then the user has no other choice than just parsing the
procfs file.

Link: http://lkml.kernel.org/r/20180215162458.10059-3-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reported-by: Robert Kettler <robert.kettler@outlook.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:37 -07:00
Davidlohr Bueso
c21a6970ae ipc/shm: introduce shmctl(SHM_STAT_ANY)
Patch series "sysvipc: introduce STAT_ANY commands", v2.

The following patches adds the discussed (see [1]) new command for shm
as well as for sems and msq as they are subject to the same
discrepancies for ipc object permission checks between the syscall and
via procfs.  These new commands are justified in that (1) we are stuck
with this semantics as changing syscall and procfs can break userland;
and (2) some users can benefit from performance (for large amounts of
shm segments, for example) from not having to parse the procfs
interface.

Once merged, I will submit the necesary manpage updates.  But I'm thinking
something like:

: diff --git a/man2/shmctl.2 b/man2/shmctl.2
: index 7bb503999941..bb00bbe21a57 100644
: --- a/man2/shmctl.2
: +++ b/man2/shmctl.2
: @@ -41,6 +41,7 @@
:  .\" 2005-04-25, mtk -- noted aberrant Linux behavior w.r.t. new
:  .\"	attaches to a segment that has already been marked for deletion.
:  .\" 2005-08-02, mtk: Added IPC_INFO, SHM_INFO, SHM_STAT descriptions.
: +.\" 2018-02-13, dbueso: Added SHM_STAT_ANY description.
:  .\"
:  .TH SHMCTL 2 2017-09-15 "Linux" "Linux Programmer's Manual"
:  .SH NAME
: @@ -242,6 +243,18 @@ However, the
:  argument is not a segment identifier, but instead an index into
:  the kernel's internal array that maintains information about
:  all shared memory segments on the system.
: +.TP
: +.BR SHM_STAT_ANY " (Linux-specific)"
: +Return a
: +.I shmid_ds
: +structure as for
: +.BR SHM_STAT .
: +However, the
: +.I shm_perm.mode
: +is not checked for read access for
: +.IR shmid ,
: +resembing the behaviour of
: +/proc/sysvipc/shm.
:  .PP
:  The caller can prevent or allow swapping of a shared
:  memory segment with the following \fIcmd\fP values:
: @@ -287,7 +300,7 @@ operation returns the index of the highest used entry in the
:  kernel's internal array recording information about all
:  shared memory segments.
:  (This information can be used with repeated
: -.B SHM_STAT
: +.B SHM_STAT/SHM_STAT_ANY
:  operations to obtain information about all shared memory segments
:  on the system.)
:  A successful
: @@ -328,7 +341,7 @@ isn't accessible.
:  \fIshmid\fP is not a valid identifier, or \fIcmd\fP
:  is not a valid command.
:  Or: for a
: -.B SHM_STAT
: +.B SHM_STAT/SHM_STAT_ANY
:  operation, the index value specified in
:  .I shmid
:  referred to an array slot that is currently unused.

This patch (of 3):

There is a permission discrepancy when consulting shm ipc object metadata
between /proc/sysvipc/shm (0444) and the SHM_STAT shmctl command.  The
later does permission checks for the object vs S_IRUGO.  As such there can
be cases where EACCESS is returned via syscall but the info is displayed
anyways in the procfs files.

While this might have security implications via info leaking (albeit no
writing to the shm metadata), this behavior goes way back and showing all
the objects regardless of the permissions was most likely an overlook - so
we are stuck with it.  Furthermore, modifying either the syscall or the
procfs file can cause userspace programs to break (ie ipcs).  Some
applications require getting the procfs info (without root privileges) and
can be rather slow in comparison with a syscall -- up to 500x in some
reported cases.

This patch introduces a new SHM_STAT_ANY command such that the shm ipc
object permissions are ignored, and only audited instead.  In addition,
I've left the lsm security hook checks in place, as if some policy can
block the call, then the user has no other choice than just parsing the
procfs file.

[1] https://lkml.org/lkml/2017/12/19/220

Link: http://lkml.kernel.org/r/20180215162458.10059-2-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Robert Kettler <robert.kettler@outlook.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:37 -07:00