Adding support for helper function bpf_getsockops to socket_ops BPF
programs. This patch only supports TCP_CONGESTION.
Signed-off-by: Vlad Vysotsky <vlad@cs.ucla.edu>
Acked-by: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
A congestion control algorithm can make a call to the BPF socket_ops
program to request the base RTT. The base RTT can be congestion control
dependent and is meant to represent a congestion threshold such that
RTTs above it indicate congestion. This is especially useful for flows
within a DC where the base RTT is easy to obtain.
Being provided a base RTT solves a basic problem in RTT based congestion
avoidance algorithms (such as Vegas, NV and BBR). Although it is easy
to get the base RTT when the network is not congested, it is very
diffcult to do when it is very congested. Newer connections get an
inflated value of the base RTT leading to unfariness (newer flows with a
larger base RTT get more bandwidth). As a result, RTT based congestion
avoidance algorithms tend to update their base RTTs to improve fairness.
In very congested networks this can lead to base RTT inflation, reducing
the ability of these RTT based congestion control algorithms to prevent
congestion.
Note that in my experiments with TCP-NV, the base RTT provided can be
much larger than the actual hardware RTT. For example, experimenting
with hosts within a rack where the hardware RTT is 16-20us, I've used
base RTTs up to 150us. The effect of using a larger base RTT is that the
congestion avoidance algorithm will allow more queueing. When there are
only a few flows the main effect is larger measured RTTs and RPC
latencies due to the increased queueing. When there are a lot of flows,
a larger base RTT can lead to more congestion and more packet drops.
For this case, where the hardware RTT is 20us, a base RTT of 80us
produces good results.
This patch only introduces BPF_SOCK_OPS_BASE_RTT, a later patch in this
set adds support for using it in TCP-NV. Further study and testing is
needed before support can be added to other delay based congestion
avoidance algorithms.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce the map read/write flags to the eBPF syscalls that returns the
map fd. The flags is used to set up the file mode when construct a new
file descriptor for bpf maps. To not break the backward capability, the
f_flags is set to O_RDWR if the flag passed by syscall is 0. Otherwise
it should be O_RDONLY or O_WRONLY. When the userspace want to modify or
read the map content, it will check the file mode to see if it is
allowed to make the change.
Signed-off-by: Chenbo Feng <fengc@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
New socket option TCP_FASTOPEN_KEY to allow different keys per
listener. The listener by default uses the global key until the
socket option is set. The key is a 16 bytes long binary data. This
option has no effect on regular non-listener TCP sockets.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This introduces a "register private expedited" membarrier command which
allows eventual removal of important memory barrier constraints on the
scheduler fast-paths. It changes how the "private expedited" membarrier
command (new to 4.14) is used from user-space.
This new command allows processes to register their intent to use the
private expedited command. This affects how the expedited private
command introduced in 4.14-rc is meant to be used, and should be merged
before 4.14 final.
Processes are now required to register before using
MEMBARRIER_CMD_PRIVATE_EXPEDITED, otherwise that command returns EPERM.
This fixes a problem that arose when designing requested extensions to
sys_membarrier() to allow JITs to efficiently flush old code from
instruction caches. Several potential algorithms are much less painful
if the user register intent to use this functionality early on, for
example, before the process spawns the second thread. Registering at
this time removes the need to interrupt each and every thread in that
process at the first expedited sys_membarrier() system call.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Last batch of drm/i915 features for v4.15:
- transparent huge pages support (Matthew)
- uapi: I915_PARAM_HAS_SCHEDULER into a capability bitmask (Chris)
- execlists: preemption (Chris)
- scheduler: user defined priorities (Chris)
- execlists optimization (Michał)
- plenty of display fixes (Imre)
- has_ipc fix (Rodrigo)
- platform features definition refactoring (Rodrigo)
- legacy cursor update fix (Maarten)
- fix vblank waits for cursor updates (Maarten)
- reprogram dmc firmware on resume, dmc state fix (Imre)
- remove use_mmio_flip module parameter (Maarten)
- wa fixes (Oscar)
- huc/guc firmware refacoring (Sagar, Michal)
- push encoder specific code to encoder hooks (Jani)
- DP MST fixes (Dhinakaran)
- eDP power sequencing fixes (Manasi)
- selftest updates (Chris, Matthew)
- mmu notifier cpu hotplug deadlock fix (Daniel)
- more VBT parser refactoring (Jani)
- max pipe refactoring (Mika Kahola)
- rc6/rps refactoring and separation (Sagar)
- userptr lockdep fix (Chris)
- tracepoint fixes and defunct tracepoint removal (Chris)
- use rcu instead of abusing stop_machine (Daniel)
- plenty of other fixes all around (Everyone)
* tag 'drm-intel-next-2017-10-12' of git://anongit.freedesktop.org/drm/drm-intel: (145 commits)
drm/i915: Update DRIVER_DATE to 20171012
drm/i915: Simplify intel_sanitize_enable_ppgtt
drm/i915/userptr: Drop struct_mutex before cleanup
drm/i915/dp: limit sink rates based on rate
drm/i915/dp: centralize max source rate conditions more
drm/i915: Allow PCH platforms fall back to BIOS LVDS mode
drm/i915: Reuse normal state readout for LVDS/DVO fixed mode
drm/i915: Use rcu instead of stop_machine in set_wedged
drm/i915: Introduce separate status variable for RC6 and LLC ring frequency setup
drm/i915: Create generic functions to control RC6, RPS
drm/i915: Create generic function to setup LLC ring frequency table
drm/i915: Rename intel_enable_rc6 to intel_rc6_enabled
drm/i915: Name structure in dev_priv that contains RPS/RC6 state as "gt_pm"
drm/i915: Move rps.hw_lock to dev_priv and s/hw_lock/pcu_lock
drm/i915: Name i915_runtime_pm structure in dev_priv as "runtime_pm"
drm/i915: Separate RPS and RC6 handling for CHV
drm/i915: Separate RPS and RC6 handling for VLV
drm/i915: Separate RPS and RC6 handling for BDW
drm/i915: Remove superfluous IS_BDW checks and non-BDW changes from gen8_enable_rps
drm/i915: Separate RPS and RC6 handling for gen6+
...
Last set of features for 4.15. Highlights:
- Add a bo flag to allow buffers to opt out of implicit sync
- Add ctx priority setting interface
- Lots more powerplay cleanups
- Start to plumb through vram lost infrastructure for gpu reset
- ttm support for huge pages
- misc cleanups and bug fixes
* 'drm-next-4.15' of git://people.freedesktop.org/~agd5f/linux: (73 commits)
drm/amd/powerplay: Place the constant on the right side of the test
drm/amd/powerplay: Remove useless variable
drm/amd/powerplay: Don't cast kzalloc() return value
drm/amdgpu: allow GTT overcommit during bind
drm/amdgpu: linear validate first then bind to GART
drm/amd/pp: Fix overflow when setup decf/pix/disp dpm table.
drm/amd/pp: thermal control not enabled on vega10.
drm/amdgpu: busywait KIQ register accessing (v4)
drm/amdgpu: report more amdgpu_fence_info
drm/amdgpu:don't check soft_reset for sriov
drm/amdgpu:fix duplicated setting job's vram_lost
drm/amdgpu:reduce wb to 512 slot
drm/amdgpu: fix regresstion on SR-IOV gpu reset failed
drm/amd/powerplay: Tidy up cz_dpm_powerup_vce()
drm/amd/powerplay: Tidy up cz_dpm_powerdown_vce()
drm/amd/powerplay: Tidy up cz_dpm_update_vce_dpm()
drm/amd/powerplay: Tidy up cz_dpm_update_uvd_dpm()
drm/amd/powerplay: Tidy up cz_dpm_powerup_uvd()
drm/amd/powerplay: Tidy up cz_dpm_powerdown_uvd()
drm/amd/powerplay: Tidy up cz_start_dpm()
...
In the AER case, the mask isn't strictly necessary because there are no
higher-order bits above the Interrupt Message Number, but using a #define
will make it possible to grep for it.
Suggested-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Dongdong Liu <liudongdong3@huawei.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Don't leak implementation details about how each priority behaves to
usermode. This allows greater flexibility in the future.
Squash into c2636dc53a
Signed-off-by: Andres Rodriguez <andresx7@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This ioctl will allow us to purge inactive userspace buffers when the
system is running out of contiguous memory.
For now, the purge logic is rather dumb in that it does not try to
release only the amount of BO needed to meet the last CMA alloc request
but instead purges all objects placed in the purgeable pool as soon as
we experience a CMA allocation failure.
Note that the in-kernel BO cache is always purged before the purgeable
cache because those objects are known to be unused while objects marked
as purgeable by a userspace application/library might have to be
restored when they are marked back as unpurgeable, which can be
expensive.
Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
Signed-off-by: Eric Anholt <eric@anholt.net>
Reviewed-by: Eric Anholt <eric@anholt.net>
Link: https://patchwork.freedesktop.org/patch/msgid/20171019125748.3152-1-boris.brezillon@free-electrons.com
The ARM SPE architecture permits an implementation to ignore a sample
if the sample is due to be taken whilst another sample is already being
produced. In this case, it is desirable to report the collision to
userspace, as they may want to lower the sample period.
This patch adds a PERF_AUX_FLAG_COLLISION flag, so that such events can
be relayed to userspace.
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
The 'cpumap' is primarily used as a backend map for XDP BPF helper
call bpf_redirect_map() and XDP_REDIRECT action, like 'devmap'.
This patch implement the main part of the map. It is not connected to
the XDP redirect system yet, and no SKB allocation are done yet.
The main concern in this patch is to ensure the datapath can run
without any locking. This adds complexity to the setup and tear-down
procedure, which assumptions are extra carefully documented in the
code comments.
V2:
- make sure array isn't larger than NR_CPUS
- make sure CPUs added is a valid possible CPU
V3: fix nitpicks from Jakub Kicinski <kubakici@wp.pl>
V5:
- Restrict map allocation to root / CAP_SYS_ADMIN
- WARN_ON_ONCE if queue is not empty on tear-down
- Return -EPERM on memlock limit instead of -ENOMEM
- Error code in __cpu_map_entry_alloc() also handle ptr_ring_cleanup()
- Moved cpu_map_enqueue() to next patch
V6: all notice by Daniel Borkmann
- Fix err return code in cpu_map_alloc() introduced in V5
- Move cpu_possible() check after max_entries boundary check
- Forbid usage initially in check_map_func_compatibility()
V7:
- Fix alloc error path spotted by Daniel Borkmann
- Did stress test adding+removing CPUs from the map concurrently
- Fixed refcnt issue on cpu_map_entry, kthread started too soon
- Make sure packets are flushed during tear-down, involved use of
rcu_barrier() and kthread_run only exit after queue is empty
- Fix alloc error path in __cpu_map_entry_alloc() for ptr_ring
V8:
- Nitpicking comments and gramma by Edward Cree
- Fix missing semi-colon introduced in V7 due to rebasing
- Move struct bpf_cpu_map_entry members cpu+map_id to tracepoint patch
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
* commit '3728e6a255b5': (904 commits)
Linux 4.14-rc5
x86/microcode: Do the family check first
locking/lockdep: Disable cross-release features for now
x86/mm: Flush more aggressively in lazy TLB mode
mm, swap: use page-cluster as max window of VMA based swap readahead
mm: page_vma_mapped: ensure pmd is loaded with READ_ONCE outside of lock
kmemleak: clear stale pointers from task stacks
fs/binfmt_misc.c: node could be NULL when evicting inode
fs/mpage.c: fix mpage_writepage() for pages with buffers
linux/kernel.h: add/correct kernel-doc notation
tty: fall back to N_NULL if switching to N_TTY fails during hangup
Revert "vmalloc: back off when the current task is killed"
mm/cma.c: take __GFP_NOWARN into account in cma_alloc()
scripts/kallsyms.c: ignore symbol type 'n'
userfaultfd: selftest: exercise -EEXIST only in background transfer
mm: only display online cpus of the numa node
mm: remove unnecessary WARN_ONCE in page_vma_mapped_walk().
mm/mempolicy: fix NUMA_INTERLEAVE_HIT counter
include/linux/of.h: provide of_n_{addr,size}_cells wrappers for !CONFIG_OF
mm/madvise.c: add description for MADV_WIPEONFORK and MADV_KEEPONFORK
...
This patch makes a slight tweak to mqprio in order to bring the
classid values used back in line with what is used for mq. The general idea
is to reserve values :ffe0 - :ffef to identify hardware traffic classes
normally reported via dev->num_tc. By doing this we can maintain a
consistent behavior with mq for classid where :1 - :ffdf will represent a
physical qdisc mapped onto a Tx queue represented by classid - 1, and the
traffic classes will be mapped onto a known subset of classid values
reserved for our virtual qdiscs.
Note I reserved the range from :fff0 - :ffff since this way we might be
able to reuse these classid values with clsact and ingress which would mean
that for mq, mqprio, ingress, and clsact we should be able to maintain a
similar classid layout.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Two new capabilities are introduced here:
- The ability to store some blocks uncompressed.
- The ability to locate blocks anywhere.
Those capabilities can be used independently, but the combination
opens the possibility for execute-in-place (XIP) of program text segments
that must remain uncompressed, and in the MMU case, must have a specific
alignment. It is even possible to still have the writable data segments
from the same file compressed as they have to be copied into RAM anyway.
This is achieved by giving special meanings to some unused block pointer
bits while remaining compatible with legacy cramfs images.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Tested-by: Chris Brandt <chris.brandt@renesas.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Most notable addition this time is the support for the GPU performance
counters by Christian. This has been in the making for some time and it
has matured a lot. Since this is adding UAPI, the corresponding WIP
userspace can be found at [1] mesa/libdrm repos. I expect that
Christian sends out the final userspace patches for this once you have
pulled the kernel bits.
Philipp optimized the probe path, so etnaviv gets out of the way for
systems that want to boot real quick.
I've done mostly cleanups, disentangling etnaviv from the IOMMU API,
with some MMUv1 optimizations on the way.
* 'etnaviv/next' of https://git.pengutronix.de/git/lst/linux: (36 commits)
drm/etnaviv: remove unnecessary clock stabilization delay
drm/etnaviv: reduce reset delay
drm/etnaviv: remove unused function etnaviv_gem_new
drm/etnaviv: remove stale comment
drm/etnaviv: submit supports performance monitor requests
drm/etnaviv: enable debug registers on demand
drm/etnaviv: need to disable clock gating when doing profiling
drm/etnaviv: add MC perf domain
drm/etnaviv: add TX perf domain
drm/etnaviv: add RA perf domain
drm/etnaviv: add SE perf domain
drm/etnaviv: add PA perf domain
drm/etnaviv: add SH perf domain
drm/etnaviv: add PE perf domain
drm/etnaviv: add HI perf domain
drm/etnaviv: use 'sync points' for performance monitor requests
drm/etnaviv: clear alloced event
drm/etnaviv: add 'sync point' support
drm/etnaviv: add performance monitor request processing
drm/etnaviv: copy pmrs from userspace
...
The offload types currently supported in mqprio are 0 (no offload) and
1 (offload only TCs) by setting these values for the 'hw' option. If
offloads are supported by setting the 'hw' option to 1, the default
offload mode is 'dcb' where only the TC values are offloaded to the
device. This patch introduces a new hardware offload mode called
'channel' with 'hw' set to 1 in mqprio which makes full use of the
mqprio options, the TCs, the queue configurations and the QoS parameters
for the TCs. This is achieved through a new netlink attribute for the
'mode' option which takes values such as 'dcb' (default) and 'channel'.
The 'channel' mode also supports QoS attributes for traffic class such as
minimum and maximum values for bandwidth rate limits.
This patch enables configuring additional HW shaper attributes associated
with a traffic class. Currently the shaper for bandwidth rate limiting is
supported which takes options such as minimum and maximum bandwidth rates
and are offloaded to the hardware in the 'channel' mode. The min and max
limits for bandwidth rates are provided by the user along with the TCs
and the queue configurations when creating the mqprio qdisc. The interface
can be extended to support new HW shapers in future through the 'shaper'
attribute.
Introduces a new data structure 'tc_mqprio_qopt_offload' for offloading
mqprio queue options and use this to be shared between the kernel and
device driver. This contains a copy of the existing data structure
for mqprio queue options. This new data structure can be extended when
adding new attributes for traffic class such as mode, shaper, shaper
parameters (bandwidth rate limits). The existing data structure for mqprio
queue options will be shared between the kernel and userspace.
Example:
queues 4@0 4@4 hw 1 mode channel shaper bw_rlimit\
min_rate 1Gbit 2Gbit max_rate 4Gbit 5Gbit
To dump the bandwidth rates:
qdisc mqprio 804a: root tc 2 map 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0
queues:(0:3) (4:7)
mode:channel
shaper:bw_rlimit min_rate:1Gbit 2Gbit max_rate:4Gbit 5Gbit
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Like with any other service, group members' availability can be
subscribed for by connecting to be topology server. However, because
the events arrive via a different socket than the member socket, there
is a real risk that membership events my arrive out of synch with the
actual JOIN/LEAVE action. I.e., it is possible to receive the first
messages from a new member before the corresponding JOIN event arrives,
just as it is possible to receive the last messages from a leaving
member after the LEAVE event has already been received.
Since each member socket is internally also subscribing for membership
events, we now fix this problem by passing those events on to the user
via the member socket. We leverage the already present member synch-
ronization protocol to guarantee correct message/event order. An event
is delivered to the user as an empty message where the two source
addresses identify the new/lost member. Furthermore, we set the MSG_OOB
bit in the message flags to mark it as an event. If the event is an
indication about a member loss we also set the MSG_EOR bit, so it can
be distinguished from a member addition event.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This file contains unnecessary whitespaces as newlines, remove them,
found by looking at what struct tc_mirred looks like.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2nd batch of v4.15 features:
- lib/scatterlist updates, use for userptr allocations (Tvrtko)
- Fixed point wrapper cleanup (Mahesh)
- Gen9+ transition watermarks, watermark optimization and fixes (Mahesh)
- Display IPC (Isochronous Priority Control) support (Mahesh)
- GEM workaround fixes (Oscar)
- GVT: PCI config sanitize series (Changbin)
- GVT: Workload submission error handling series (Fred)
- PSR fixes and refactoring (Rodrigo)
- HWSP based optimizations (Chris)
- Private PAT management (Zhi)
- IRQ handling fixes and refactoring (Ville)
- Module parameter refactoring and variable name clash fix (Michal)
- Execlist refactoring, incomplete request unwinding on reset (Chris)
- GuC scheduling improvements (Michal)
- OA updates (Lionel)
- Coffeelake out of alpha support (Rodrigo)
- seqno fixes (Chris)
- Execlist refactoring (Mika)
- DP and DP MST cleanups (Dhinakaran)
- Cannonlake slice/sublice config (Ben)
- Numerous fixes all around (Everyone)
* tag 'drm-intel-next-2017-09-29' of git://anongit.freedesktop.org/drm/drm-intel: (168 commits)
drm/i915: Update DRIVER_DATE to 20170929
drm/i915: Use memset64() to prefill the GTT page
drm/i915: Also discard second CRC on gen8+ platforms.
drm/i915/psr: Set frames before SU entry for psr2
drm/dp: Add defines for latency in sink
drm/i915: Allow optimized platform checks
drm/i915: Avoid using dev_priv->info.gen directly.
i915: Use %pS printk format for direct addresses
drm/i915/execlists: Notify context-out for lost requests
drm/i915/cnl: Add support slice/subslice/eu configs
drm/i915: Compact device info access by a small re-ordering
drm/i915: Add IS_PLATFORM macro
drm/i915/selftests: Try to recover from a wedged GPU during reset tests
drm/i915/huc: Reorganize HuC authentication
drm/i915: Fix default values of some modparams
drm/i915: Extend I915_PARAMS_FOR_EACH with default member value
drm/i915: Make I915_PARAMS_FOR_EACH macro more flexible
drm/i915: Enable scanline read based on frame timestamps
drm/i915/execlists: Microoptimise execlists_cancel_port_request()
drm/i915: Don't rmw PIPESTAT enable bits
...
The QMUX protocol specification defines structure of the special control
packet messages being sent between handlers of the control port.
Add these to the uapi header, as this structure and the associated types
are shared between the kernel and all userspace handlers of control
messages.
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The constants are used by both the name server and clients, so clarify
their value and move them to the uapi header.
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg says:
====================
Work continues in various areas:
* port authorized event for 4-way-HS offload (Avi)
* enable MFP optional for such devices (Emmanuel)
* Kees's timer setup patch for mac80211 mesh
(the part that isn't trivially scripted)
* improve VLAN vs. TXQ handling (myself)
* load regulatory database as firmware file (myself)
* with various other small improvements and cleanups
I merged net-next once in the meantime to allow Kees's
timer setup patch to go in.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
There are several problems with regards to the return of
FE_SET_PROPERTY. The original idea were to return per-property
return codes via tvp->result field, and to return an updated
set of values.
However, that never worked. What's actually implemented is:
- the FE_SET_PROPERTY implementation doesn't call .get_frontend
callback in order to get the actual parameters after return;
- the tvp->result field is only filled if there's no error.
So, it is always filled with zero;
- FE_SET_PROPERTY doesn't call memdup_user() nor any other
copy_to_user() function. So, any changes to the properties
will be lost;
- FE_SET_PROPERTY is declared as a write-only ioctl (IOW).
While we could fix the above, it could cause regressions.
So, let's just assume what the code really does, updating
the documentation accordingly and removing the logic that
would update the discarded tvp->result.
Reviewed-by: Shuah Khan <shuahkh@osg.samsung.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
If the regulatory database is loaded, and then updated, it may
be necessary to reload it. Add an nl80211 command to do this.
Note that this just reloads the database, it doesn't re-apply
the rules from it immediately.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This adds a ct_clear action for clearing conntrack state. ct_clear is
currently implemented in OVS userspace, but is not backed by an action
in the kernel datapath. This is useful for flows that may modify a
packet tuple after a ct lookup has already occurred.
Signed-off-by: Eric Garver <e@erig.me>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The fanotify interface allows user space daemons to make access
control decisions. Under common criteria requirements, we need to
optionally record decisions based on policy. This patch adds a bit mask,
FAN_AUDIT, that a user space daemon can 'or' into the response decision
which will tell the kernel that it made a decision and record it.
It would be used something like this in user space code:
response.response = FAN_DENY | FAN_AUDIT;
write(fd, &response, sizeof(struct fanotify_response));
When the syscall ends, the audit system will record the decision as a
AUDIT_FANOTIFY auxiliary record to denote that the reason this event
occurred is the result of an access control decision from fanotify
rather than DAC or MAC policy.
A sample event looks like this:
type=PATH msg=audit(1504310584.332:290): item=0 name="./evil-ls"
inode=1319561 dev=fc:03 mode=0100755 ouid=1000 ogid=1000 rdev=00:00
obj=unconfined_u:object_r:user_home_t:s0 nametype=NORMAL
type=CWD msg=audit(1504310584.332:290): cwd="/home/sgrubb"
type=SYSCALL msg=audit(1504310584.332:290): arch=c000003e syscall=2
success=no exit=-1 a0=32cb3fca90 a1=0 a2=43 a3=8 items=1 ppid=901
pid=959 auid=1000 uid=1000 gid=1000 euid=1000 suid=1000
fsuid=1000 egid=1000 sgid=1000 fsgid=1000 tty=pts1 ses=3 comm="bash"
exe="/usr/bin/bash" subj=unconfined_u:unconfined_r:unconfined_t:
s0-s0:c0.c1023 key=(null)
type=FANOTIFY msg=audit(1504310584.332:290): resp=2
Prior to using the audit flag, the developer needs to call
fanotify_init or'ing in FAN_ENABLE_AUDIT to ensure that the kernel
supports auditing. The calling process must also have the CAP_AUDIT_WRITE
capability.
Signed-off-by: sgrubb <sgrubb@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Sadly we can not read any registers via command stream so we need
to extend the drm_etnaviv_gem_submit struct with performance monitor
requests. Those requests gets process before or after the actual
submitted command stream.
The Vivante kernel driver has a special ioctl to read all perfmon
registers at once and return it.
Changes from v1 -> v2:
- use a 16 bit value for signals
- fix padding issues
Signed-off-by: Christian Gmeiner <christian.gmeiner@gmail.com>
Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
Make it possible that userspace can query all performance domains and
its signals. This information is needed to sample those signals via
submit ioctl.
At the moment no performance domain is available.
Changes from v1 -> v2:
- use a 16 bit value for signals
- fix padding issues
- add id member to domain and signal struct
Changes v4 -> v5
- provide for each pipe an own set of pm domains
Signed-off-by: Christian Gmeiner <christian.gmeiner@gmail.com>
Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
The AMDGPU_SCHED_OP_PROCESS_PRIORITY_OVERRIDE ioctls are used to set
the priority of a different process in the current system.
When a request is dropped, the process's contexts will be
restored to the priority specified at context creation time.
A request can be dropped by setting the override priority to
AMDGPU_CTX_PRIORITY_UNSET.
An fd is used to identify the remote process. This is simpler than
passing a pid number, which is vulnerable to re-use, etc.
This functionality is limited to DRM_MASTER since abuse of this
interface can have a negative impact on the system's performance.
v2: removed unused output structure
v3: change refcounted interface for a regular set operation
Signed-off-by: Andres Rodriguez <andresx7@gmail.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Add a new context creation parameter to express a global context priority.
The priority ranking in descending order is as follows:
* AMDGPU_CTX_PRIORITY_HIGH_HW
* AMDGPU_CTX_PRIORITY_HIGH_SW
* AMDGPU_CTX_PRIORITY_NORMAL
* AMDGPU_CTX_PRIORITY_LOW_SW
* AMDGPU_CTX_PRIORITY_LOW_HW
The driver will attempt to schedule work to the hardware according to
the priorities. No latency or throughput guarantees are provided by
this patch.
This interface intends to service the EGL_IMG_context_priority
extension, and vulkan equivalents.
Setting a priority above NORMAL requires CAP_SYS_NICE or DRM_MASTER.
v2: Instead of using flags, repurpose __pad
v3: Swap enum values of _NORMAL _HIGH for backwards compatibility
v4: Validate usermode priority and store it
v5: Move priority validation into amdgpu_ctx_ioctl(), headline reword
v6: add UAPI note regarding priorities requiring CAP_SYS_ADMIN
v7: remove ctx->priority
v8: added AMDGPU_CTX_PRIORITY_LOW, s/CAP_SYS_ADMIN/CAP_SYS_NICE
v9: change the priority parameter to __s32
v10: split priorities into _SW and _HW
v11: Allow DRM_MASTER without CAP_SYS_NICE
Reviewed-by: Emil Velikov <emil.l.velikov@gmail.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Andres Rodriguez <andresx7@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Introduce a flag to signal that access to a BO will be synchronized
through an external mechanism.
Currently all buffers shared between contexts are subject to implicit
synchronization. However, this is only required for protocols that
currently don't support an explicit synchronization mechanism (DRI2/3).
This patch introduces the AMDGPU_GEM_CREATE_EXPLICIT_SYNC, so that
users can specify when it is safe to disable implicit sync.
v2: only disable explicit sync in amdgpu_cs_ioctl
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Andres Rodriguez <andresx7@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS fixes for net
The following patchset contains Netfilter/IPVS fixes for your net tree,
they are:
1) Fix packet drops due to incorrect ECN handling in IPVS, from Vadim
Fedorenko.
2) Fix splat with mark restoration in xt_socket with non-full-sock,
patch from Subash Abhinov Kasiviswanathan.
3) ipset bogusly bails out when adding IPv4 range containing more than
2^31 addresses, from Jozsef Kadlecsik.
4) Incorrect pernet unregistration order in ipset, from Florian Westphal.
5) Races between dump and swap in ipset results in BUG_ON splats, from
Ross Lagerwall.
6) Fix chain renames in nf_tables, from JingPiao Chen.
7) Fix race in pernet codepath with ebtables table registration, from
Artem Savkov.
8) Memory leak in error path in set name allocation in nf_tables, patch
from Arvind Yadav.
9) Don't dump chain counters if they are not available, this fixes a
crash when listing the ruleset.
10) Fix out of bound memory read in strlcpy() in x_tables compat code,
from Eric Dumazet.
11) Make sure we only process TCP packets in SYNPROXY hooks, patch from
Lin Zhang.
12) Cannot load rules incrementally anymore after xt_bpf with pinned
objects, added in revision 1. From Shmulik Ladkani.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 2c16d60332 ("netfilter: xt_bpf: support ebpf") introduced
support for attaching an eBPF object by an fd, with the
'bpf_mt_check_v1' ABI expecting the '.fd' to be specified upon each
IPT_SO_SET_REPLACE call.
However this breaks subsequent iptables calls:
# iptables -A INPUT -m bpf --object-pinned /sys/fs/bpf/xxx -j ACCEPT
# iptables -A INPUT -s 5.6.7.8 -j ACCEPT
iptables: Invalid argument. Run `dmesg' for more information.
That's because iptables works by loading existing rules using
IPT_SO_GET_ENTRIES to userspace, then issuing IPT_SO_SET_REPLACE with
the replacement set.
However, the loaded 'xt_bpf_info_v1' has an arbitrary '.fd' number
(from the initial "iptables -m bpf" invocation) - so when 2nd invocation
occurs, userspace passes a bogus fd number, which leads to
'bpf_mt_check_v1' to fail.
One suggested solution [1] was to hack iptables userspace, to perform a
"entries fixup" immediatley after IPT_SO_GET_ENTRIES, by opening a new,
process-local fd per every 'xt_bpf_info_v1' entry seen.
However, in [2] both Pablo Neira Ayuso and Willem de Bruijn suggested to
depricate the xt_bpf_info_v1 ABI dealing with pinned ebpf objects.
This fix changes the XT_BPF_MODE_FD_PINNED behavior to ignore the given
'.fd' and instead perform an in-kernel lookup for the bpf object given
the provided '.path'.
It also defines an alias for the XT_BPF_MODE_FD_PINNED mode, named
XT_BPF_MODE_PATH_PINNED, to better reflect the fact that the user is
expected to provide the path of the pinned object.
Existing XT_BPF_MODE_FD_ELF behavior (non-pinned fd mode) is preserved.
References: [1] https://marc.info/?l=netfilter-devel&m=150564724607440&w=2
[2] https://marc.info/?l=netfilter-devel&m=150575727129880&w=2
Reported-by: Rafael Buchbinder <rafi@rbk.ms>
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds a new bridge port flag BR_NEIGH_SUPPRESS to
suppress arp and nd flood on bridge ports. It implements
rfc7432, section 10.
https://tools.ietf.org/html/rfc7432#section-10
for ethernet VPN deployments. It is similar to the existing
BR_PROXYARP* flags but has a few semantic differences to conform
to EVPN standard. Unlike the existing flags, this new flag suppresses
flood of all neigh discovery packets (arp and nd) to tunnel ports.
Supports both vlan filtering and non-vlan filtering bridges.
In case of EVPN, it is mainly used to avoid flooding
of arp and nd packets to tunnel ports like vxlan.
This patch adds netlink and sysfs support to set this bridge port
flag.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
More new stuff for 4.15. Highlights:
- Add clock query interface for raven
- Add new FENCE_TO_HANDLE ioctl
- UVD video encode ring support on polaris
- transparent huge page DMA support
- deadlock fixes
- compute pipe lru tweaks
- powerplay cleanups and regression fixes
- fix duplicate symbol issue with radeon and amdgpu
- misc bug fixes
* 'drm-next-4.15' of git://people.freedesktop.org/~agd5f/linux: (72 commits)
drm/radeon/dp: make radeon_dp_get_dp_link_config static
drm/radeon: move ci_send_msg_to_smc to where it's used
drm/amd/sched: fix deadlock caused by unsignaled fences of deleted jobs
drm/amd/sched: NULL out the s_fence field after run_job
drm/amd/sched: move adding finish callback to amd_sched_job_begin
drm/amd/sched: fix an outdated comment
drm/amd/sched: rename amd_sched_entity_pop_job
drm/amdgpu: minor coding style fix
drm/ttm: add transparent huge page support for DMA allocations v2
drm/ttm: add support for different pool sizes
drm/ttm: remove unsued options from ttm_mem_global_alloc_page
drm/amdgpu: add uvd enc irq
drm/amdgpu: add uvd enc ib test
drm/amdgpu: add uvd enc ring test
drm/amdgpu: add uvd enc vm functions (v2)
drm/amdgpu: add uvd enc into run queue
drm/amdgpu: add uvd enc rings
drm/amdgpu: add new uvd enc ring methods
drm/amdgpu: add uvd enc command in header
drm/amdgpu: add uvd enc registers in header
...
Instead of u8, use char for prog and map name. It can avoid the
userspace tool getting compiler's signess warning. The
bpf_prog_aux, bpf_map, bpf_attr, bpf_prog_info and
bpf_map_info are changed.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds helper bpf_perf_prog_read_cvalue for perf event based bpf
programs, to read event counter and enabled/running time.
The enabled/running time is accumulated since the perf event open.
The typical use case for perf event based bpf program is to attach itself
to a single event. In such cases, if it is desirable to get scaling factor
between two bpf invocations, users can can save the time values in a map,
and use the value from the map and the current value to calculate
the scaling factor.
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hardware pmu counters are limited resources. When there are more
pmu based perf events opened than available counters, kernel will
multiplex these events so each event gets certain percentage
(but not 100%) of the pmu time. In case that multiplexing happens,
the number of samples or counter value will not reflect the
case compared to no multiplexing. This makes comparison between
different runs difficult.
Typically, the number of samples or counter value should be
normalized before comparing to other experiments. The typical
normalization is done like:
normalized_num_samples = num_samples * time_enabled / time_running
normalized_counter_value = counter_value * time_enabled / time_running
where time_enabled is the time enabled for event and time_running is
the time running for event since last normalization.
This patch adds helper bpf_perf_event_read_value for kprobed based perf
event array map, to read perf counter and enabled/running time.
The enabled/running time is accumulated since the perf event open.
To achieve scaling factor between two bpf invocations, users
can can use cpu_id as the key (which is typical for perf array usage model)
to remember the previous value and do the calculation inside the
bpf program.
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit introduces the MPLSoGRE support (RFC 4023), using ip tunnel
API by simply adding ipgre_tunnel_encap_(add|del)_mpls_ops() and the new
tunnel type TUNNEL_ENCAP_MPLS.
Signed-off-by: Amine Kherbouche <amine.kherbouche@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>